GPT-5.5: OpenAI's Smartest Model Redefines Agentic Work
OpenAI launches GPT-5.5 on April 23, 2026, in standard and Pro variants. It leads in agentic coding, computer use, and scientific research with unprecedented token efficiency.
AI DayaHimour Team
April 23, 2026
OpenAI launched GPT-5.5 on April 23, 2026, describing it as the most intelligent and intuitive model the company has released to date. The release marks a genuine shift in the concept of agentic work: the model can understand complex tasks, plan their execution, and carry them out across multiple tools without requiring constant human intervention. GPT-5.5 moves well beyond answering questions to become an active partner in computer-based work, from writing code to analyzing data and managing documents.
The release comes in two variants: the standard GPT-5.5 and GPT-5.5 Pro, aimed at more demanding workloads. Both operate within ChatGPT and Codex, with API access planned for a later date. What distinguishes this release is not merely higher intelligence scores, but the ability to maintain response latency matching GPT-5.4 despite a substantial increase in capabilities.
Enhanced Technical Architecture for Agentic Work
GPT-5.5 arrives with a context window of up to one million tokens, enabling it to ingest large codebases or lengthy financial documents within a single session. In Codex, the context window stands at 400,000 tokens — sufficient to cover most large-scale code repositories. Despite this increase in capacity, the model records per-token latency that matches its predecessor GPT-5.4 under real-world service conditions.
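As a rough sanity check on whether a repository fits inside the 400,000-token Codex window, a common heuristic is about four characters per token for English text and code. The sketch below uses that assumption; the ratio, the file-extension filter, and the function names are illustrative choices, not OpenAI figures.

```python
import os

# Heuristic: ~4 characters per token for English text and code.
# This ratio is an assumption; real tokenizer counts vary by content.
CHARS_PER_TOKEN = 4
CODEX_CONTEXT_TOKENS = 400_000  # Codex context window cited for GPT-5.5

def estimate_repo_tokens(root: str, extensions=(".py", ".js", ".ts", ".md")) -> int:
    """Rough token estimate for loading a codebase into context.

    File size in bytes approximates character count for ASCII sources.
    """
    total_chars = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                try:
                    total_chars += os.path.getsize(os.path.join(dirpath, name))
                except OSError:
                    pass  # skip unreadable files
    return total_chars // CHARS_PER_TOKEN

def fits_in_context(root: str) -> bool:
    """True if the estimated repo size fits in the Codex window."""
    return estimate_repo_tokens(root) <= CODEX_CONTEXT_TOKENS
```

A repository that estimates well under the limit can be ingested whole; anything larger still needs retrieval or chunking.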
Token efficiency represents another notable advancement. OpenAI data indicates that GPT-5.5 uses up to 40% fewer tokens to complete the same Codex tasks compared to GPT-5.4. This reduction in token consumption translates to lower operational costs for users, even as the per-token price in the API is higher.
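The net cost effect follows from simple arithmetic. The sketch below treats the per-token price premium as a free parameter, since the article cites the 40% token reduction but not an exact premium:

```python
def relative_task_cost(token_reduction: float, price_premium: float) -> float:
    """Relative cost of the same task on the new model vs. the old one.

    token_reduction: fraction fewer tokens used (0.40 per OpenAI's Codex figure).
    price_premium: fractional per-token price increase (illustrative parameter,
    not a published number).
    """
    return (1 - token_reduction) * (1 + price_premium)

# With 40% fewer tokens, any premium below ~66.7% still lowers total cost:
# (1 - 0.40) * (1 + 2/3) = 1.0 is the break-even point.
```

In other words, the per-token price could rise by up to two-thirds before a 40% token reduction stopped paying for itself.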
The model was designed and trained on NVIDIA GB200 and GB300 NVL72 systems, in collaboration with NVIDIA. This deep hardware-software integration reflects OpenAI’s efforts to build world-class infrastructure for agentic AI. The collaboration reached a striking milestone when GPT-5.5 and Codex helped the OpenAI team improve inference infrastructure directly: Codex analyzed production traffic patterns and wrote custom load-balancing algorithms that boosted token generation speed by more than 20%.
Benchmark Performance
GPT-5.5 posted record numbers across most agentic coding, knowledge work, and computer use benchmarks. The figures below are drawn from official OpenAI data and comparative tables released by the company.
Benchmark Results — April 2026
The results show a clear advantage on Terminal-Bench 2.0, which tests complex command-line workflows, where the gap stands at 7.6 percentage points over GPT-5.4 and 13.3 points over Claude Opus 4.7. On SWE-Bench Pro, which measures real-world GitHub issue resolution, Claude Opus 4.7 still holds the lead at 64.3% against GPT-5.5’s 58.6%.
In computer use, GPT-5.5 achieves 78.7% on OSWorld-Verified, edging out Claude Opus 4.7 at 78.0%. On FrontierMath Tier 4 for advanced mathematics, the model reaches 35.4% — a score that surpasses competitors by a substantial margin. The Pro variant pushes this figure to 39.6%.
Comparison with Competitors
The strongest models in the market are currently competing directly in the agentic intelligence category, with GPT-5.5 emerging as a direct rival to Claude Opus 4.7 from Anthropic and Gemini 3.1 Pro from Google. The new model leads across most agentic coding and advanced mathematics benchmarks, though competition remains fierce in certain areas.
| Benchmark | GPT-5.5 | Claude Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 69.4% | 68.5% |
| SWE-Bench Pro | 58.6% | 64.3% | 54.2% |
| GDPval | 84.9% | 80.3% | 67.3% |
| OSWorld-Verified | 78.7% | 78.0% | — |
| FrontierMath Tier 4 | 35.4% | 22.9% | 16.7% |
| CyberGym | 81.8% | 73.1% | — |
On BrowseComp, which tests multi-step web research, Gemini 3.1 Pro leads at 85.9% against the standard GPT-5.5’s 84.4%, though the Pro variant reverses the outcome at 90.1%. On MCP Atlas, administered by Scale AI, Claude Opus 4.7 scores 79.1%, ahead of GPT-5.5 at 75.3%.
On cost, Artificial Analysis data indicates that GPT-5.5 delivers frontier-level coding intelligence at roughly half the cost of competing models. This combination of high performance and token efficiency gives it a competitive edge in the API marketplace.
Optimal Use Cases
Agentic Coding and Software Engineering
GPT-5.5 stands out in long-horizon programming tasks that require planning and execution over hours. In OpenAI’s internal Expert-SWE benchmark, where tasks average the equivalent of 20 hours of human work, the model surpasses its predecessor with 73.1% against GPT-5.4’s 68.5%, while using fewer tokens to reach the result.
Engineers who tested the model confirmed this shift. Dan Shipper, founder and CEO of Every, described it as “the first coding model with serious conceptual clarity.” In a practical test, the model successfully restructured a comment system within a collaborative markdown editor, returning 12 nearly complete code modifications with minimal human direction.
Knowledge Work and Scientific Research
GPT-5.5 extends beyond coding into early-stage scientific research. On the new GeneBench benchmark, which focuses on multi-stage scientific data analysis across genetics and quantum biology, the model demonstrates measurable improvement over its predecessor. It also achieved strong performance on BixBench for bioinformatics.
The most notable scientific result came when an internal version of GPT-5.5, paired with a dedicated system, helped discover a new proof concerning Ramsey numbers in combinatorics — a branch of mathematics studying how discrete objects interconnect. The proof was subsequently verified using Lean, making it a concrete example of the model contributing original mathematical arguments rather than merely generating code or explanations.
Computer Use and Administrative Tasks
In Codex, GPT-5.5 can see what is on screen, click, type, and navigate interfaces with precision. OpenAI teams have already put these capabilities to practical use: the finance team relied on Codex to review 24,771 K-1 tax forms totaling 71,637 pages, accelerating the task by two weeks compared to the previous year. The communications team used the model to analyze six months of speaking-request data and build an evaluation and risk framework.
Availability and Pricing
GPT-5.5 is currently available to Plus, Pro, Business, and Enterprise subscribers in ChatGPT and Codex. The Pro variant is limited to Pro, Business, and Enterprise tiers. In Codex, the model comes with a 400,000-token context window and an optional Fast Mode that generates tokens 1.5x faster at 2.5x the cost.
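The Fast Mode trade-off can be sketched directly from the two multipliers quoted above (1.5x generation speed at 2.5x cost). The base cost and throughput inputs below are illustrative placeholders, not published figures:

```python
def fast_mode_tradeoff(tokens: int, base_cost_per_mtok: float,
                       base_tok_per_sec: float) -> dict:
    """Compare standard vs. Fast Mode for one generation.

    Applies the Codex multipliers cited in the announcement:
    Fast Mode is 1.5x faster and 2.5x more expensive per token.
    Returns (cost_usd, wall_clock_seconds) for each mode.
    """
    std_cost = tokens / 1e6 * base_cost_per_mtok
    std_time = tokens / base_tok_per_sec
    return {
        "standard": (std_cost, std_time),
        "fast": (std_cost * 2.5, std_time / 1.5),
    }
```

The takeaway: Fast Mode cuts wall-clock time by a third while multiplying spend by 2.5, so it pays off mainly for latency-sensitive interactive sessions rather than batch jobs.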
For developers, API access will be available shortly at $5 per million input tokens and $30 per million output tokens. The Pro variant will be priced at $30 per million for input and $180 per million for output. Batch and Flex pricing are available at half the standard rate, while Priority tier requires 2.5x the standard price.
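At the announced rates, per-request cost is easy to estimate. The sketch below uses the prices and tier multipliers stated above; the model identifier strings (`gpt-5.5`, `gpt-5.5-pro`) are hypothetical, since API model names have not been published:

```python
# Per-million-token rates from the announcement (USD).
RATES = {
    "gpt-5.5":     {"input": 5.0,  "output": 30.0},
    "gpt-5.5-pro": {"input": 30.0, "output": 180.0},
}
# Tier multipliers stated in the article: Batch/Flex at half price, Priority at 2.5x.
TIER_MULTIPLIER = {"standard": 1.0, "batch": 0.5, "flex": 0.5, "priority": 2.5}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 tier: str = "standard") -> float:
    """Estimated USD cost of one API call at the announced rates."""
    rate = RATES[model]
    base = (input_tokens / 1e6) * rate["input"] + (output_tokens / 1e6) * rate["output"]
    return base * TIER_MULTIPLIER[tier]

# Example: 50k input + 10k output on standard GPT-5.5
# 0.05 * 5 + 0.01 * 30 = 0.25 + 0.30 = $0.55
```

The same call on the Pro variant runs 0.05 * 30 + 0.01 * 180 = $3.30, six times the standard price, which is why OpenAI positions Pro for demanding workloads rather than routine traffic.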
Safety Measures and Preparedness
OpenAI classified GPT-5.5’s capabilities in cybersecurity and biology as “high” within its preparedness framework. While the model did not reach the “critical” threshold in cybersecurity, its capabilities exceeded those of GPT-5.4. In response, the company deployed the strictest safeguards to date, including tighter filters on high-risk activity and protections against repeated harmful use.
OpenAI simultaneously expanded trusted access for defensive security through the Trusted Access for Cyber program, allowing organizations responsible for protecting critical infrastructure to apply for access to less restricted models after meeting stringent security requirements.
Conclusion
GPT-5.5 represents a strategic shift for OpenAI from conversational models toward genuine agentic work systems. The model does not simply deliver smarter answers — it takes responsibility for executing tasks across multiple tools while maintaining efficiency and speed. The shift is visible in the numbers: leadership on Terminal-Bench 2.0, FrontierMath, and GDPval, alongside token efficiency that reduces costs despite a higher unit price.
Competition, however, remains intense. Claude Opus 4.7 retains the top position on SWE-Bench Pro, and Gemini 3.1 Pro leads in certain web research tasks. What gives GPT-5.5 its real advantage is the combination of high intelligence, operational efficiency, and deep integration with NVIDIA infrastructure. With API access approaching, the extent to which this model can translate benchmark figures into real impact across coding and scientific workflows will become clear.