For the first time in history! "AI vs. Human" job performance evaluation, and the results don't look good for humans

```

OpenAI's newly released GDPval-v0 evaluation tool has, for the first time, quantified AI's ability to perform economically valuable work tasks, and the results show that AI is rapidly approaching, and in some cases matching, the level of human professionals. Barclays stated that the most advanced AI models have already reached human-expert level in many occupational tasks, and that the pace of improvement is accelerating.

According to a previous article by Jianshi, OpenAI has just released a brand-new evaluation tool called GDPval-v0. It covers about 1,300 specific work tasks across 44 professions in nine major business sectors that account for a large share of US GDP, with real work deliverables ranging from legal documents to engineering blueprints to nursing plans.

The results show that the most advanced AI models are now comparable to human professionals in many occupational tasks, and the rate of progress is accelerating. On October 5, according to Hard AI news, Barclays stated in its latest research report that Anthropic's Claude Opus 4.1 achieved a "win or tie" rate of 47.6% in head-to-head comparisons with human experts, ranking first.

Barclays analysts believe that the "win rate" of AI models has risen roughly fourfold over the past 15 months, along an approximately linear trend, and that AI is expected to surpass humans in most work-related tasks within the next 12-24 months. The analysis suggests that this result provides key data for evaluating AI's return on investment.

Innovative Advances in Evaluation Standards: Simulating the Complexity of Real Work

According to Barclays' research report, the core innovation of the GDPval benchmark lies in its realism and complexity.

The evaluation was designed by seasoned professionals with an average of more than 14 years of industry experience, and covers 1,230 professional tasks in industries such as technology services, finance and insurance, healthcare, information, and manufacturing.

Unlike traditional benchmarks, GDPval's tasks are not simple text Q&A. They are complex scenarios that come with reference documents and context, and they require the AI to produce a diverse range of deliverables, including documents, slides, charts, and spreadsheets. Barclays points out that this design more closely mirrors the complexity of real working environments.

The evaluation uses blind testing: industry experts rank AI-generated and human-generated work outputs without knowing which is which, taking into account dimensions such as difficulty, representativeness, completion time, and overall quality.

AI Performance Approaching Human Expert Level

Barclays' analysis shows that the most advanced AI models are already approaching or reaching human expert level in multiple fields. Claude Opus 4.1 leads with a "win or tie" rate of 47.6%, followed by GPT-5-high at 38.8% and o3-high at 34.1%.

By industry, AI outperformed human experts in retail trade (56% win rate), wholesale trade (53%), and the government sector (52%), but performed relatively weakly in the information technology sector (39%).

At the occupational level, AI performed best in tasks for counter and rental clerks (80%), transportation, receiving and inventory clerks (76%), and software developers (70%), but poorly in tasks for industrial engineers (17%) and film and video editors (17%).

Each model showed different strengths: Claude Opus 4.1 excelled in aesthetics (format and layout), while GPT-5 was the most precise in following instructions and performing calculations.

Astonishing Speed of Capability Improvement

The Barclays report particularly emphasized the speed of AI capability improvement.

The report noted that the performance of OpenAI models on GDPval improved more than threefold in 15 months, and that this linear growth trend suggests AI is likely to surpass human experts across the board in the near term.

Analysis of GPT-5's errors shows that the model still makes catastrophic errors in a small share of cases (2.7%), and that 47.7% of its mistakes were classified as "acceptable but not good." In 22.9% of cases, the model even outperformed humans.

Barclays analysts believe that the raw intelligence of AI models, especially GPT-5, already exceeds that of human experts. With further post-training (fine-tuning, reinforcement learning), they expect that the era of AI comprehensively surpassing industry experts is not far away.

 

This article is from the WeChat official account "Hard AI".

```