OpenAI has launched a new family of models called GPT-4.1, including GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, which the company says excel at coding and instruction following. Released on April 14, the new models are available exclusively through OpenAI's application programming interface (API) and, according to OpenAI, outperform GPT-4o across the board.
The models support a context window of up to 1 million tokens (approximately 750,000 words) and come with a refreshed knowledge cutoff of June 2024. On SWE-bench Verified, a measure of real-world software engineering skills, GPT-4.1 completes 54.6% of tasks, compared to 33.2% for GPT-4o. This reflects improvements in the model's ability to explore code repositories, finish tasks, and produce code that both runs and passes tests.
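Because the models are API-only, the sketch below shows how a developer might call GPT-4.1 with OpenAI's Python SDK. The model identifiers and the prompt are illustrative assumptions rather than details confirmed in this article, and the call requires an `OPENAI_API_KEY` in the environment.

```python
# Minimal sketch of calling GPT-4.1 through the OpenAI API.
# The model names below are assumed identifiers for the new family.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4.1",  # smaller variants: "gpt-4.1-mini", "gpt-4.1-nano"
    messages=[
        {"role": "system", "content": "You are a careful coding assistant."},
        {"role": "user", "content": "Write a Python function that reverses a linked list."},
    ],
)

print(response.choices[0].message.content)
```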
"We've optimized GPT-4.1 for real-world use based on direct feedback to improve in areas that developers care most about: frontend coding, making fewer extraneous edits, following formats reliably, adhering to response structure and ordering, consistent tool usage, and more," an OpenAI spokesperson explained. "These improvements enable developers to build agents that are considerably better at real-world software engineering tasks."
The smaller variants offer different performance-cost tradeoffs. GPT-4.1 mini and nano are more efficient and faster at the cost of some accuracy, with OpenAI saying GPT-4.1 nano is its speediest and cheapest model ever. Pricing varies significantly across the lineup: GPT-4.1 costs $2 per million input tokens and $8 per million output tokens; GPT-4.1 mini costs $0.40 per million input tokens and $1.60 per million output tokens; and GPT-4.1 nano costs just $0.10 per million input tokens and $0.40 per million output tokens.
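To make those rates concrete, here is a small back-of-the-envelope sketch that estimates the cost of a single request from the prices above; the model identifiers and the 100,000-token example request are hypothetical illustrations, not figures from OpenAI.

```python
# Rough per-request cost estimate from the published per-token prices.
# The request sizes below are made-up examples, not measured workloads.
PRICES_PER_MILLION = {            # (input $, output $) per 1M tokens
    "gpt-4.1":      (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "gpt-4.1-nano": (0.10, 0.40),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated dollar cost of one request."""
    in_price, out_price = PRICES_PER_MILLION[model]
    return (input_tokens / 1_000_000) * in_price + (output_tokens / 1_000_000) * out_price

# Example: a 100,000-token prompt (roughly a mid-sized codebase excerpt)
# with a 5,000-token reply.
for model in PRICES_PER_MILLION:
    print(f"{model}: ${estimate_cost(model, 100_000, 5_000):.4f}")
```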
In evaluations beyond coding, OpenAI tested GPT-4.1 on Video-MME, which measures a model's ability to understand content in videos. GPT-4.1 reached 72% accuracy on the "long, no subtitles" video category, a result OpenAI says is the best in that category.
This release aligns with OpenAI's broader ambitions in the coding space. OpenAI CFO Sarah Friar recently discussed the company's vision of creating an "agentic software engineer" that can program entire apps end-to-end. "It can literally build an app for you — and not only build it, but also do its own quality assurance, bug testing, and documentation writing," Friar stated.
The AI coding model space is becoming increasingly competitive. Google's Gemini 2.5 Pro currently tops the SWE-bench Verified benchmark at 63.8%, while Anthropic's Claude 3.7 Sonnet scores 62.3% in standard mode and up to 70.3% in extended thinking mode. Despite such impressive benchmark results, OpenAI acknowledges that even the best models today struggle with tasks that wouldn't trip up experts. Many studies have shown that code-generating models often fail to fix security vulnerabilities and bugs, and can even introduce new ones. GPT-4.1 also becomes less reliable as the number of input tokens it has to process grows.