A Comprehensive Taxonomy of AI

This document provides a detailed breakdown of key terms and technologies in the broader field of artificial intelligence (AI).

Research into Artificial Intelligence (AI) started as far back as the 1950s, and since then it has branched out into a wide range of different fields and techniques. However, for most of the general public it only really registered as something particularly significant in 2022, with the first publicly available Large Language Models (LLMs) and Generative Pre-trained Transformers (GPTs), such as ChatGPT.

Diagram of the taxonomy of AI

Over the next three years to 2025, the use and impact of GPTs has expanded massively, to the extent that, of all the many different fields of AI, GPTs now occupy perhaps 98% of public attention. So much so that people directly equate the terms AI and GPT: to most people's thinking, AI is ChatGPT, Gemini, Copilot, etc., whereas technically GPTs are just one part of one branch, several branches down the tree, as the diagram shows.

The purpose of this document is to briefly explain the wider field of AI, of which GPTs are just one part (albeit now a massively important part). It is also important to note that the gravitational effect of the massive focus on GPTs has drawn other AI technology branches into our understanding and development of GPTs, and so the branches described below are no longer as well defined and separate as they once might have been.

AI in General

This category includes broad, interdisciplinary fields and foundational concepts that form the basis of AI but are not confined to a single subfield like traditional machine learning or neural networks.

Robotics and Control Systems:

  • Simultaneous Localization and Mapping (SLAM): Algorithms that allow a robot to build a map of its environment while simultaneously keeping track of its own location within it.
  • Sensor Fusion: Techniques to combine data from multiple sensors (e.g., cameras, LiDAR, GPS) to get a more accurate and reliable understanding of the environment.
  • Path Planning: Algorithms (e.g., A*, Dijkstra) that find the most efficient route for a robot to travel.
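As a concrete illustration of the path-planning bullet above, here is a minimal sketch of Dijkstra's algorithm on a toy occupancy grid. The grid, unit step costs, and 4-way movement are illustrative assumptions, not a robotics-grade planner:

```python
import heapq

def dijkstra(grid, start, goal):
    """Shortest path length on a 2-D grid; 1 = open cell, 0 = wall."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    frontier = [(0, start)]
    while frontier:
        d, (r, c) = heapq.heappop(frontier)
        if (r, c) == goal:
            return d
        if d > dist.get((r, c), float("inf")):
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc]:
                nd = d + 1
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    heapq.heappush(frontier, (nd, (nr, nc)))
    return None  # goal unreachable

grid = [
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
]
print(dijkstra(grid, (0, 0), (2, 0)))  # → 6: the route around the wall
```

A* is the same loop with the priority augmented by a heuristic estimate of the remaining distance to the goal.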

Other Notable Technologies:

  • Expert Systems: Early AI systems that used human-defined rules and knowledge bases to solve complex problems in specific domains, like medical diagnosis.
  • Evolutionary Algorithms: Optimization techniques inspired by natural evolution, such as Genetic Algorithms, which use concepts like mutation and selection to find optimal solutions to problems.
  • Fuzzy Logic: A form of logic that deals with degrees of truth rather than a binary “true or false.” It’s used in control systems and decision-making to handle uncertainty.
  • And Machine Learning >>>
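To make the evolutionary-algorithms bullet concrete, here is a minimal genetic-algorithm sketch: selection keeps the fitter half of a population, crossover averages two parents, and mutation adds Gaussian jitter. The objective function, population size, and operators are all illustrative assumptions:

```python
import random

random.seed(0)

def fitness(x):
    # Toy objective: maximize -(x - 3)^2, which peaks at x = 3.
    return -(x - 3) ** 2

def evolve(pop_size=30, generations=60):
    pop = [random.uniform(-10, 10) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population.
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        # Crossover + mutation: children average two parents, then jitter.
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            children.append((a + b) / 2 + random.gauss(0, 0.3))
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(round(best, 2))  # converges near the optimum at 3
```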

 

Machine Learning

This broad category covers foundational and classical approaches to AI. These are algorithms that learn from data to make predictions or decisions without being explicitly programmed for every scenario.

Supervised Learning:

  • Linear/Logistic Regression: Used for predicting a numerical value (regression) or classifying data into two categories (e.g., spam vs. not spam).
  • Support Vector Machines (SVMs): Finds the optimal hyperplane to separate data points into different classes.
  • Decision Trees: Creates a tree-like model of decisions and their possible consequences to solve classification or regression problems.
  • Random Forests: An ensemble method that builds multiple decision trees and combines their outputs to improve accuracy.
  • Naïve Bayes: A classification algorithm based on Bayes’ theorem, often used for text classification and sentiment analysis.
  • K-Nearest Neighbors (K-NN): Classifies a data point based on the majority class of its “k” nearest neighbors.
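The K-NN bullet above can be sketched in a few lines: classify a point by majority vote among its k nearest training points. The tiny 2-D dataset and "spam"/"ham" labels are illustrative assumptions:

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Two well-separated 2-D clusters with made-up labels.
train = [
    ((1.0, 1.0), "ham"), ((1.2, 0.8), "ham"), ((0.9, 1.1), "ham"),
    ((5.0, 5.0), "spam"), ((5.2, 4.9), "spam"), ((4.8, 5.1), "spam"),
]
print(knn_predict(train, (1.1, 1.0)))  # → ham
print(knn_predict(train, (5.1, 5.0)))  # → spam
```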

Unsupervised Learning:

  • K-Means Clustering: An algorithm that groups unlabeled data into a specified number of clusters based on similarity.
  • Principal Component Analysis (PCA): A dimensionality reduction technique used to simplify complex datasets.
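The K-Means bullet above amounts to alternating two steps: assign each point to its nearest centroid, then move each centroid to the mean of its cluster. A minimal sketch, with an illustrative toy dataset of two separated clusters:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: (p[0] - centroids[i][0]) ** 2
                                    + (p[1] - centroids[i][1]) ** 2)
            clusters[idx].append(p)
        # Update step: move each centroid to its cluster's mean.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans(points, 2)))  # one centroid near each cluster's centre
```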

Reinforcement Learning:

  • Q-Learning: A classic RL algorithm that learns a table of “Q-values” to determine the best action to take in any given state.
  • Policy Gradient Methods: A family of algorithms (e.g., A2C, PPO) that directly optimize a policy, which is the agent’s strategy for taking actions.
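The Q-Learning bullet can be illustrated on a toy one-dimensional corridor: the agent starts at state 0, steps left or right, and receives reward 1 on reaching the goal state. The environment, hyperparameters, and episode count are illustrative assumptions:

```python
import random

random.seed(1)

N_STATES, GOAL = 5, 4          # a 1-D corridor; reach state 4 for reward 1
ACTIONS = (-1, +1)             # step left or step right
alpha, gamma, epsilon = 0.5, 0.9, 0.1

# Q-table: one row of action-values per state.
Q = [[0.0, 0.0] for _ in range(N_STATES)]

for _ in range(500):           # training episodes
    s = 0
    while s != GOAL:
        if random.random() < epsilon:
            a = random.randrange(2)                        # explore
        else:
            a = max(range(2), key=lambda i: Q[s][i])       # exploit
        s2 = min(max(s + ACTIONS[a], 0), N_STATES - 1)
        r = 1.0 if s2 == GOAL else 0.0
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a').
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

greedy = [max(range(2), key=lambda i: Q[s][i]) for s in range(GOAL)]
print(greedy)  # → [1, 1, 1, 1]: always step right toward the goal
```

A Deep Q-Network (described under Neural Networks below) replaces the table `Q` with a neural network, so the same update rule can scale to states too numerous to enumerate.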

Classical NLP Techniques:

  • Word2Vec/GloVe: Models that create numerical representations (embeddings) of words, capturing their meaning and relationships.
  • Rule-Based Systems: Older NLP systems that rely on manually created rules and grammar to understand and process language.
  • Bag-of-Words/TF-IDF: Statistical models that represent text as a collection of words, often used for simple text classification and information retrieval.
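The Bag-of-Words/TF-IDF bullet can be sketched directly: each document becomes a word-count vector, and TF-IDF weights a term by how frequent it is in one document relative to how many documents contain it. The three tiny documents are illustrative assumptions, and this is the plain log-IDF variant (libraries often apply smoothing):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and the dogs",
]

# Bag-of-words: each document becomes a word-count vector.
bags = [Counter(doc.split()) for doc in docs]

def tf_idf(term, bag, bags):
    tf = bag[term] / sum(bag.values())              # term frequency
    df = sum(1 for b in bags if term in b)          # documents containing term
    idf = math.log(len(bags) / df)                  # inverse document frequency
    return tf * idf

# "the" appears in every document, so its IDF (and TF-IDF) is zero;
# "cat" appears in only one, so it scores higher in the first document.
print(round(tf_idf("the", bags[0], bags), 3))  # → 0.0
print(round(tf_idf("cat", bags[0], bags), 3))  # → 0.183
```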

And Neural Networks >>>

 

Neural Networks

This section focuses on deep learning architectures that use interconnected nodes (neurons) to process data. These models are particularly powerful for pattern recognition in complex datasets, such as images, audio, and text.

  • Recurrent Neural Networks (RNNs) and LSTMs: Architectures designed to process sequential data like text. They maintain an internal “memory” and were the dominant deep learning approach for language tasks before transformers.
  • Convolutional Neural Networks (CNNs): A deep learning architecture that has been the dominant force in computer vision. It is adept at processing pixel data through convolutional layers.
  • U-Net: A CNN architecture commonly used in medical imaging for precise segmentation of objects.
  • You Only Look Once (YOLO): A very fast and popular object detection model that uses a single CNN to identify objects.
  • Region-based Convolutional Neural Networks (R-CNN): A family of models that use a two-step process to first identify regions of interest and then classify objects within them using a CNN.
  • Deep Q-Network (DQN): An extension of Q-Learning that uses a neural network to handle more complex environments and states.
  • And Large Language Models >>>
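The convolutional layers mentioned above slide a small kernel across an image and record how strongly each patch matches it. A minimal sketch of a single valid-mode 2-D convolution (strictly, cross-correlation, as implemented in most deep learning libraries) with an illustrative edge-detecting kernel:

```python
def conv2d(image, kernel):
    """Valid-mode 2-D convolution (cross-correlation, as in most DL libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# A vertical-edge detector applied to an image with a dark/bright boundary.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [
    [-1, 1],
    [-1, 1],
]
print(conv2d(image, kernel))  # → [[0, 2, 0], [0, 2, 0]]: peak at the edge
```

A real CNN stacks many such learned kernels with nonlinearities and pooling; here the kernel is fixed by hand to show the mechanism.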

 

Large Language Models

This category represents the modern state-of-the-art in Natural Language Processing (NLP). These models are primarily built on the Transformer architecture and are trained on vast amounts of text data, allowing them to understand and process human language at a highly sophisticated level.

  • BERT (Bidirectional Encoder Representations from Transformers): A model developed by Google that focuses on understanding the context of words by looking at the words that come before and after them simultaneously. This makes it highly effective for tasks like question-answering and sentiment analysis.
  • RoBERTa (Robustly Optimized BERT Pre-training Approach): An optimization of BERT developed by Facebook. It uses a modified training approach, including a larger dataset and longer training time, to improve performance on various NLP tasks without changing the model’s architecture.
  • T5 (Text-to-Text Transfer Transformer): A model by Google that reframes all NLP problems into a text-to-text format. This means that tasks like classification, summarization, and translation are all treated as inputting text and outputting text.
  • ULMFiT (Universal Language Model Fine-tuning): An approach that enables a pre-trained language model to be fine-tuned for a specific task using a smaller dataset, making transfer learning more accessible.
  • And Generative Pre-trained Transformers >>>

 

Generative Pre-trained Transformers

This category describes a powerful class of large language models known for their exceptional ability to generate human-like text. They are built on the Transformer architecture and are characterized by a two-stage process: a massive pre-training phase followed by a fine-tuning phase for specific tasks. Their success is largely attributed to their decoder-only structure, which is optimized for sequential text generation.

  • Pre-training and Fine-tuning: These models are first trained on an enormous dataset of text to learn the general rules of language, grammar, and a vast amount of world knowledge. This is the pre-training phase. Afterward, they can be further trained on smaller, task-specific datasets in a fine-tuning phase to improve their performance on tasks like question answering, summarization, or translation.
  • Decoder-only Architecture: Unlike models like BERT which use a bi-directional encoder to understand context, GPT-style models use a unidirectional decoder. This architecture processes text sequentially, predicting the next word in a sequence based on all the previous words, making it ideal for creative and conversational generation.
  • In-context Learning: A key feature of these models is their ability to perform tasks without explicit fine-tuning. By providing a few examples of a task in the prompt itself, the model can learn and follow the desired pattern. This capability, also known as few-shot learning, allows for immense flexibility and is a hallmark of modern LLMs.
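The autoregressive loop behind decoder-only generation can be shown with a toy bigram "language model" standing in for the transformer: count which token follows which, then repeatedly predict the next token from the context so far. The corpus, greedy decoding, and bigram context window are illustrative assumptions; real GPTs condition on the entire preceding sequence with learned attention weights:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the cat ate . the dog sat on the log .".split()

# "Train": count which token follows which (a toy stand-in for a decoder).
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def generate(prompt, max_tokens=6):
    tokens = prompt.split()
    for _ in range(max_tokens):
        token = tokens[-1]
        if token not in bigrams:
            break
        # Autoregressive step: append the most likely next token, then repeat.
        tokens.append(bigrams[token].most_common(1)[0][0])
        if tokens[-1] == ".":
            break
    return " ".join(tokens)

print(generate("the cat"))  # continues the prompt one token at a time
```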

AI progress on real work

Measuring the performance of our models on real-world tasks

Source: OpenAI – 3rd September 2025

We’re introducing GDPval, a new evaluation that measures model performance on economically valuable, real-world tasks across 44 occupations.

Our mission is to ensure that artificial general intelligence benefits all of humanity. As part of our mission, we want to transparently communicate progress on how AI models can help people in the real world. That’s why we’re introducing GDPval: a new evaluation designed to help us track how well our models and others perform on economically valuable, real-world tasks. We call this evaluation GDPval because we started with the concept of Gross Domestic Product (GDP) as a key economic indicator and drew tasks from the key occupations in the industries that contribute most to GDP.

People often speculate about AI’s broader impact on society, but the clearest way to understand its potential is by looking at what models are already capable of doing. History shows that major technologies—from the internet to smartphones—took more than a decade to go from invention to widespread adoption. Evaluations like GDPval help ground conversations about future AI improvements in evidence rather than guesswork, and can help us track model improvement over time.

Previous AI evaluations like challenging academic tests and competitive coding challenges have been essential in pushing the boundaries of model reasoning capabilities, but they often fall short of the kind of tasks that many people handle in their everyday work.

To bridge this gap, we’ve been developing evaluations that measure increasingly realistic and economically relevant capabilities. This progression has moved from classic academic benchmarks like MMLU (exam-style questions across dozens of subjects), to more applied evaluations like SWE-Bench (software engineering bug-fixing tasks), MLE-Bench (machine learning engineering tasks such as model training and analysis), and Paper-Bench (scientific reasoning and critique on research papers), and more recently to market-based evaluations like SWE-Lancer (freelance software engineering projects based on real payouts).

GDPval is the next step in that progression. It measures model performance on tasks drawn directly from the real-world knowledge work of experienced professionals across a wide range of occupations and sectors, providing a clearer picture on how models perform on economically valuable tasks. Evaluating models on realistic occupational tasks helps us understand not just how well they perform in the lab, but how they might support people in the work they do every day.

What GDPval measures

GDPval, the first version of this evaluation, spans 44 occupations selected from the top 9 industries contributing to U.S. GDP. The GDPval full set includes 1,320 specialized tasks (220 in the gold open-sourced set), each meticulously crafted and vetted by experienced professionals with over 14 years of experience on average from these fields. Every task is based on real work products, such as a legal brief, an engineering blueprint, a customer support conversation, or a nursing care plan.

GDPval is distinctive in both the realism and the diversity of the tasks it evaluates. Unlike other evaluations tied to economic value, which concentrate on specific domains (e.g., SWE-Lancer), GDPval covers many tasks and occupations. And unlike benchmarks that synthetically create tasks in the style of an academic exam or test (e.g., Humanity’s Last Exam or MMLU), GDPval focuses on deliverables that are either an actual piece of work product that exists today or a similarly constructed piece of work product.

Unlike traditional benchmarks, GDPval tasks are not simple text prompts. They come with reference files and context, and the expected deliverables span documents, slides, diagrams, spreadsheets, and multimedia. This realism makes GDPval a stronger test of how models might support professionals.

GDPval is an early step that doesn’t reflect the full nuance of many economic tasks. While it spans 44 occupations and hundreds of knowledge work tasks, it is limited to one-shot evaluations, so it doesn’t capture cases where a model would need to build context or improve through multiple drafts. Future versions will extend to more interactive workflows and context-rich tasks to better reflect the complexity of real-world knowledge work (see more in our Limitations section below).

How we chose occupations

GDPval covers tasks across 9 industries and 44 occupations, and future versions will continue to expand coverage. The initial 9 industries were chosen based on those contributing over 5% to U.S. GDP, as determined by data from the Federal Reserve Bank of St. Louis. Then, we selected the 5 occupations within each industry that contribute most to total wages and compensation and are predominantly knowledge work occupations, using wage and employment data from the May 2024 US Bureau of Labor Statistics (BLS) occupational employment report. To determine if the occupations were predominantly knowledge work, we used task data from O*NET, a database of U.S. occupational information sponsored by the U.S. Department of Labor. We classified whether each task for each occupation in O*NET was knowledge work or physical work/manual labor (requiring actions to be taken in the physical world). An occupation qualified overall as “predominantly knowledge work” if at least 60% of its component tasks were classified as not involving physical work or manual labor. We chose this 60% threshold as a starting point for the first version of GDPval, focusing on occupations where AI could have the highest impact on real-world productivity.

This process yielded 44 occupations for inclusion.
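The 60% filter described above is simple to state in code. A minimal sketch; the occupations and per-task knowledge-work flags here are invented placeholders, whereas the real classification used O*NET task data:

```python
# True = knowledge work, False = physical work/manual labor (invented examples).
occupations = {
    "Lawyers":              [True, True, True, True, False],
    "Mechanical engineers": [True, True, True, False, False],
    "Warehouse packers":    [False, False, True, False, False],
}

THRESHOLD = 0.60  # the "predominantly knowledge work" cut-off described above

selected = [name for name, tasks in occupations.items()
            if sum(tasks) / len(tasks) >= THRESHOLD]
print(selected)  # → ['Lawyers', 'Mechanical engineers']
```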

Real estate and rental and leasing

  • Concierges
  • Property, real estate, and community association managers
  • Real estate sales agents
  • Real estate brokers
  • Counter and rental clerks

Government

  • Recreation workers
  • Compliance officers
  • First-line supervisors of police and detectives
  • Administrative services managers
  • Child, family, and school social workers

Manufacturing

  • Mechanical engineers
  • Industrial engineers
  • Buyers and purchasing agents
  • Shipping, receiving, and inventory clerks
  • First-line supervisors of production and operating workers

Professional, scientific, and technical services

  • Software developers
  • Lawyers
  • Accountants and auditors
  • Computer and information systems managers
  • Project management specialists

Health care and social assistance

  • Registered nurses
  • Nurse practitioners
  • Medical and health services managers
  • First-line supervisors of office and administrative support workers
  • Medical secretaries and administrative assistants

Finance and insurance

  • Customer service representatives
  • Financial and investment analysts
  • Financial managers
  • Personal financial advisors
  • Securities, commodities and financial services sales agents

Retail trade

  • Pharmacists
  • First-line supervisors of retail sales workers
  • General and operations managers
  • Private detectives and investigators

Wholesale trade

  • Sales managers
  • Order clerks
  • First-line supervisors of non-retail sales workers
  • Sales representatives, wholesale and manufacturing, except technical and scientific products
  • Sales representatives, wholesale and manufacturing, technical and scientific products

Information

  • Audio and video technicians
  • Producers and directors
  • News analysts, reporters, and journalists
  • Film and video editors
  • Editors

GDPval spans 44 knowledge work occupations across 9 sectors, from software developers and lawyers to registered nurses and mechanical engineers. These occupations were selected for their economic significance and represent the types of day-to-day work where AI can meaningfully assist professionals.

How we built the dataset

For each occupation, we worked with experienced professionals to create representative tasks that reflect their day-to-day work. These professionals averaged 14 years of experience, with strong records of advancement. We deliberately recruited a breadth of experts—such as lawyers from different practice areas and firms of different sizes—to maximize representativeness.

Each task went through a multi-step review process to ensure it was representative of real work, feasible for another professional to complete, and clear for evaluation. On average, each task received 5 rounds of expert review, including checks from other task writers, additional occupational reviewers, and model-based validation.

The resulting dataset includes 30 fully reviewed tasks per occupation (full-set) with 5 tasks per occupation in our open-sourced gold set, providing a robust foundation for evaluating model performance on real-world knowledge work.
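The dataset sizes quoted earlier follow directly from these per-occupation counts:

```python
occupations = 44
tasks_per_occupation_full = 30   # fully reviewed tasks per occupation
tasks_per_occupation_gold = 5    # open-sourced gold tasks per occupation

print(occupations * tasks_per_occupation_full)  # → 1320 tasks in the full set
print(occupations * tasks_per_occupation_gold)  # → 220 tasks in the gold set
```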

Examples of GDPval tasks

Prompt + task context

This is June 2025 and you are a Manufacturing Engineer, in an automobile assembly line. The product is a cable spooling truck for underground mining operations, and you are reviewing the final testing step. In the final testing step, a big spool of cable needs to be reeled in and reeled out 2 times, to ensure the cable spooling works as per requirement. The current operation requires 2 persons to work on this test. The first person needs to bring and position the spool near the test unit, the second person will connect the open end of the cable spool to the test unit and start the reel in step. While the cable is being unreeled from the spool, and onto the truck, the first person will need to rotate the spool in order to facilitate the unreeling. When the cable is fully reeled onto the truck, the next step is to perform the operation in reverse order, so the cable gets reeled out of the truck and back onto its own reel. This test is done another time to ensure functionality. This task is complicated, has associated risks, requires high labor and makes the work area cluttered. Your manager has requested you to develop a jig/fixture to simplify reel in and reel out of the cable reel spool, so the test can be done by one person. Attached to this request is an information document which provides basic details about the cable reel drum size, information to design the cable reel spooling jig and to structure the deliverable. The deliverable for this task will be a preliminary concept design only. Separate tasks will be done to calculate design foundations such as stress, strength, cost benefit analysis, etc. Design a jig using 3d modelling software and create a presentation using Microsoft PowerPoint. As part of the deliverable, upload only a pdf document summarizing the design, using snapshots of the 3d design created. The 3d design file is not required for submission.

Cable reel project requirements.pdf

Experienced human deliverable

Exploded view of a design for a cable reel
Each task in GDPval is designed by an experienced professional and reflects real knowledge work from their occupation. The prompt is a realistic work assignment created by a domain expert, and the gold deliverable is the expert’s own solution.

How we grade model performance

To evaluate model performance on GDPval tasks, we rely on expert “graders”: a group of experienced professionals from the same occupations represented in the dataset. These graders blindly compare model-generated deliverables with those produced by task writers (not knowing which is AI and which is human generated), offer critiques, and rank the deliverables, classifying each AI deliverable as “better than”, “as good as”, or “worse than” the human one.

Task writers also created detailed scoring rubrics for their occupations, which add consistency and transparency to the grading process. We also built an “automated grader”, an AI system trained to estimate how human experts would judge a given deliverable. In other words, instead of running a full expert review every time, the automated grader can quickly predict which output people would likely prefer. We’re releasing this tool at evals.openai.com as an experimental research service, but it’s not yet as reliable as expert graders, so we don’t use it to replace them.

Early results

We found that today’s best frontier models are already approaching the quality of work produced by industry experts. To test this, we ran blind evaluations where industry experts compared deliverables from several leading models—GPT‑4o, o4-mini, OpenAI o3, GPT‑5, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4—against human-produced work. Across 220 tasks in the GDPval gold set, we recorded when model outputs were rated as better than (“wins”) or on par with (“ties”) the deliverables from industry experts, as shown in the bar chart below. Claude Opus 4.1 was the best performing model in the set, excelling in particular on aesthetics (e.g., document formatting, slide layout), and GPT‑5 excelled in particular on accuracy (e.g., finding domain-specific knowledge). We also see clear progress over time on these tasks. Performance has more than doubled from GPT‑4o (released spring 2024) to GPT‑5 (released summer 2025), following a clear linear trend.

In addition, we found that frontier models can complete GDPval tasks roughly 100x faster and 100x cheaper than industry experts. However, these figures reflect pure model inference time and API billing rates, and therefore do not capture the human oversight, iteration, and integration steps required to use our models in real workplace settings. Still, especially on the subset of tasks where models are particularly strong, we expect that giving a task to a model before trying it with a human would save time and money.

Expert graders compared deliverables from leading models to human experts. Today’s frontier models are already approaching the quality of work produced by industry experts. Claude Opus 4.1 produced outputs rated as good as or better than humans in just under half the tasks.

 

 

From GPT‑4o to GPT‑5, performance on GDPval tasks more than doubled in a year.

Finally, we incrementally trained an internal, experimental version of GPT‑5 to assess if we could improve performance on GDPval. We found this process improved performance, creating a pathway for further potential improvement. Other controlled experiments back this up: increasing model size, encouraging more reasoning steps, and giving richer task context each led to measurable gains.

You can read the full results in our paper. We’re also releasing a gold subset of GDPval tasks and a public grading service so other researchers can build on this work.

The future of work and AI

As AI becomes more capable, it will likely cause changes in the job market. Early GDPval results show that models can already take on some repetitive, well-specified tasks faster and at lower cost than experts. However, most jobs are more than just a collection of tasks that can be written down. GDPval highlights where AI can handle routine tasks so people can spend more time on the creative, judgment-heavy parts of work. When AI complements workers in this way it can translate into significant economic growth. Our goal is to keep everyone on the “up elevator” of AI by democratizing access to these tools, supporting workers through change, and building systems that reward broad contribution.

Limitations and what’s next

GDPval is an early step. While it covers 44 occupations and hundreds of tasks, we are continuing to refine our approach to expand the scope of our testing and make the results more meaningful. The current version of the evaluation is also one-shot, so it doesn’t capture cases where a model would need to build context or improve through multiple drafts—for example, revising a legal brief after client feedback or iterating on a data analysis after spotting an anomaly. Additionally, in the real world, tasks aren’t always clearly defined with a prompt and reference files; for example, a lawyer might have to navigate ambiguity and talk to their client before deciding that creating a legal brief is the right approach to help them. We plan to expand GDPval to include more occupations, industries, and task types, with increased interactivity, and more tasks involving navigating ambiguity, with the long-term goal of better measuring progress on diverse knowledge work.

Get involved

  • If you’re an industry expert interested in contributing to GDPval, please show your interest here.
  • If you’re a customer working with OpenAI and you’d like to contribute to a future round of GDPval, please express interest here.

Community participation is essential—we’re excited to build GDPval together with researchers, practitioners, and organizations who share our goal of making AGI more useful for people at work.