Humans, take note! OpenAI has run a comprehensive evaluation of AI's potential to replace jobs across industries.

```

A recent assessment from OpenAI shows that AI is rapidly approaching the level of human professionals at economically valuable work tasks.

Reportedly, OpenAI released a brand-new evaluation tool on Thursday called GDPval-v0. This tool is designed to assess the performance of AI models in completing “real work deliverables,” such as legal documents, engineering blueprints, and nursing plans.

The study covers the nine sectors that contribute most to U.S. GDP, involving about 1,300 specific work tasks across 44 professions. The results show that today's top AI models perform many professional tasks at a level comparable to human professionals, and the pace of improvement is accelerating.

Following the release of GDPval-v0, former OpenAI Policy Director and Anthropic co-founder Jack Clark published a comprehensive review of GDPval's research process and results in his latest blog post, “Eval the world economy; singularity economics; and Swiss sovereign AI.”

GDPval may become a new standard for measuring AI’s economic value

According to the article, the GDPval benchmark covers 1,320 professional tasks spanning industries like tech services, finance and insurance, healthcare, information, manufacturing, and more. Each task was carefully designed and reviewed by senior professionals with over 14 years of industry experience on average.

Clark points out that this list covers nearly all the key knowledge-intensive positions in the modern economy, indicating that AI firms are systematically testing their systems’ adaptability across economic “niches.”

The article also notes another distinguishing feature of the benchmark: it involves multiple response formats and aims to reflect the complexity inherent in real-world work.

To simulate the complexity of real-world work, GDPval’s tasks are not simple text Q&As, but include reference materials and context. The required AI deliverables are also diverse, including documents, slide decks, charts, and spreadsheets.

The evaluation results directly quantify the boundaries of AI’s capabilities. Data show that Claude Opus 4.1 ranked first among models in comparison with human experts, achieving a 47.6% “win or draw” rate. Next were GPT-5-high (38.8%) and o3 high (34.1%).

These numbers indicate that the best models are approaching the quality of experienced human professionals on complex knowledge work, winning or tying in close to half of the blind comparisons.

Clark believes that the advent of GDPval provides a key benchmark for evaluating the broad economic impact of AI, similar in significance to what SWE-Bench is for the programming field.

Public information shows that SWE-Bench was launched in October 2023 to evaluate the programming capabilities of AI models. The benchmark uses over 2,000 real programming problems extracted from the public GitHub repositories of 12 different Python projects as its evaluation basis.

Below are excerpts from Clark’s blog post, translated with the assistance of AI tools:

Eval the world economy; singularity economics; and Swiss sovereign AI

By: Jack Clark

OpenAI has built an evaluation that could be to the broad economy what SWE-Bench is to code: …GDPval is a very strong benchmark with exceptional significance…

OpenAI has built and released GDPval, a well-made benchmark that tests how well AI systems perform the kinds of tasks people do in the real-world economy. As an evaluation, GDPval may turn out to be to broad real-world economic impact what SWE-Bench is to coding impact, which is a big deal!

What it is: GDPval “measures models’ performance on tasks drawn directly from the real world, involving knowledge work by experienced professionals from different industries, providing a clearer picture of model performance on economically valuable tasks.”

This benchmark spans 44 professions across 9 industries, covering 1,320 professional tasks, “each carefully created and reviewed by experienced professionals with over 14 years of experience on average.” The dataset “includes 30 fully reviewed tasks for each profession (the full set), as well as 5 tasks for each profession in our open-source gold set.”

Another outstanding feature of this benchmark is that it involves a diversity of response formats and seeks to reflect the complexity of the real world. They write: “GDPval’s tasks are not simple text prompts. They come with reference documents and context, and the expected deliverables cover documents, slide decks, charts, spreadsheets, and multimedia. This realism makes GDPval a more realistic test of how models can support professionals.”

“To evaluate model performance on GDPval tasks, we rely on expert ‘graders’—a group of experienced professionals from the same occupations represented in the dataset. These graders blindly compare the model-generated deliverables with those produced by the task’s author (without knowing which is AI and which is human), and provide critiques and rankings. They then rank the human and AI deliverables, and categorize each AI deliverable as ‘better,’ ‘just as good,’ or ‘not as good’ as the other,” the authors write.

Results: “We found that today’s best frontier models are already approaching industry-expert work quality,” the authors write. Claude Opus 4.1 ranks first, with a total win-or-draw rate of 47.6% compared to human work, followed by GPT-5-high at 38.8% and o3 high at 34.1%.

Faster, cheaper: More importantly, “We found that frontier models complete GDPval tasks about 100 times faster and about 100 times cheaper than industry experts.”

What types of work are included in GDPval?

• Real Estate and Rental: Concierges; property, real estate, and community association managers; real estate sales agents; realtors; counter and rental clerks.

• Government Sector: Recreation workers; compliance officers; first-line supervisors of police and detectives; administrative services managers; child, family, and school social workers.

• Manufacturing: Mechanical engineers; industrial engineers; buyers and purchasing agents; shipping, receiving, and inventory clerks; first-line supervisors of production and operating workers.

• Professional, Scientific, and Technical Services: Software developers; lawyers; accountants and auditors; computer and information systems managers; project management specialists.

• Healthcare and Social Assistance: Registered nurses; nurse practitioners; medical and health services managers; first-line supervisors of office and administrative support workers; medical secretaries and administrative assistants.

• Finance and Insurance: Customer service representatives; financial and investment analysts; financial managers; personal financial advisors; securities, commodities, and financial services sales agents.

• Retail Trade: Pharmacists; first-line supervisors of retail sales workers; general and operations managers; private detectives and investigators.

• Wholesale Trade: Sales managers; order clerks; first-line supervisors of non-retail sales workers; wholesale and manufacturing sales representatives (excluding technical and scientific products); wholesale and manufacturing sales representatives (technical and scientific products).

• Information: Audio and video technicians; producers and directors; news analysts, reporters, and journalists; film and video editors; editors.

Why this matters—AI companies are building systems to enter every part of the economy: At this point I want the reader to imagine me standing in central Washington D.C., holding a giant sign that says: AI companies are building benchmarks designed to test their systems on every kind of job in the economy—and they’re already very good at it!

This is not normal!

We are testing systems through ecologically valid benchmarks across an incredibly broad range of behaviors. These benchmarks ultimately tell us how well these systems can adapt to around 44 different “eco-economic niches” in the world, and we’re finding that they are already very close to matching human performance, and this is just with today’s models. Soon, they’ll surpass many humans at these tasks. Then what? Nothing happens? No! The economy will undergo profoundly strange changes!

```
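
The win-or-draw rates cited in the article can be read as simple proportions over the graders' verdicts. The following is a minimal Python sketch, assuming each blind comparison yields exactly one of the three labels quoted above ("better," "just as good," "not as good"). The function name `win_or_draw_rate` and the toy verdict list are hypothetical, invented for illustration, and are not taken from OpenAI's tooling or data.

```python
from collections import Counter

# Verdict labels as quoted in the grading description.
BETTER, AS_GOOD, NOT_AS_GOOD = "better", "just as good", "not as good"


def win_or_draw_rate(verdicts):
    """Return the fraction of verdicts where the AI deliverable won or tied."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    unknown = set(counts) - {BETTER, AS_GOOD, NOT_AS_GOOD}
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    # A "win" is BETTER, a "draw" is AS_GOOD; both count toward the rate.
    return (counts[BETTER] + counts[AS_GOOD]) / len(verdicts)


# Toy data, invented for illustration only (not GDPval results).
toy_verdicts = [BETTER, AS_GOOD, NOT_AS_GOOD, NOT_AS_GOOD]
print(f"{win_or_draw_rate(toy_verdicts):.1%}")  # prints 50.0%
```

Under this reading, a 47.6% rate would mean that graders judged the model's deliverable at least as good as the human expert's in a little under half of the comparisons.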