`cleverhack.com`

Name: AI Coding Landscape 2025/2026
Creator: Joy Larkin
Published: 2025-07-01T00:00:00+00:00
License: https://creativecommons.org/licenses/by-nc/4.0

AI Coding Landscape GitHub repo ⇒ AI Coding Models ↴

`AI Coding Landscape`

September 2025 (Updated July 2026)

Note: Since everything is moving so fast, I wanted to create a knowledge framework about AI coding models and the associated agent, IDE, and software tooling ecosystem used for AI-assisted coding and/or vibe coding. This page continues to evolve as a market view of what is being mentioned and is obviously an ongoing work in progress.

`Listing AI coding agents, CLIs, IDEs, app builders, open source versions, devtools, and leaderboards`

`AI Coding Agents/CLI Tools`

Claude Code - Anthropic's coding agent for the terminal, desktop, and Web. Runs Opus 4.8, Sonnet 4.6 and Fable 5; supports subagents, agent teams, skills, hooks, and a plugin marketplace (Code with Claude 2026)

OpenAI Codex - OpenAI's coding agent for app, CLI, IDE, and Web, powered by GPT-5.3-Codex

GitHub Copilot - Pair-programming assistant

Gemini Code Assist - Google AI coding assistant

Jules - Google Asynchronous Coding Agent

Cognition - Devin - An autonomous AI software engineer that can write, run and test code

Amazon Q Developer - AWS code-gen & refactor

Cursor AI - Agent baked into Cursor IDE

Goose - Model + agent API

Amp - Sourcegraph coding agent (CLI / VS Code)

Reflection AI - Asimov - Enterprise code research agent

Conductor - Run a bunch of Claude Codes in parallel

Scout - Calls itself the most curious coding and research agent

Blackbox AI - New Autonomous AI Coding Agent

Forge Code - An AI software engineering agent that runs in your terminal

Factory - Delegate software development tasks to agents called Droids

Replit Agent - Set up and create apps from scratch, works with any framework; Agent 4 (March 2026) adds parallel task forking

JetBrains Junie - Your smart coding agent

Slate - A purpose built agent designed to work with you for long and hard coding tasks

GitHub Copilot CLI - Brings the power of the GitHub Copilot coding agent directly to your terminal

Codebuff - Works in your terminal to help you write and deeply understand your code

CTO.new - Completely free AI code agent

Kimi-CLI - A new CLI agent that can help you with your software development tasks and terminal operations

Antigravity CLI - Google's new terminal agent that replaced Gemini CLI at I/O 2026; Gemini 3-backed, async multi-agent workflows, SDK for custom agents

Cursor CLI - Cursor's terminal agent launched January 2026, includes Cloud Handoff between local and cloud

Grok Build - xAI's terminal coding agent and CLI, launched May 2026 in early beta for SuperGrok and X Premium Plus subscribers; powered by grok-build-0.1, with plan mode, up to 8 parallel subagents, 2M-token context, MCP support, and compatibility with AGENTS.md, plugins, hooks, and skills

`Open Source AI Coding Agents/CLI Tools`

Aider - Terminal pair-programming

Continue - IDE extensions + CLI

Cline - Autonomous IDE agent

Roo Code - Cline fork, VS Code extension

Kilo Code - AI coding agent for VS Code and JetBrains

Gemini CLI - An open-source AI agent for Google Gemini (being retired June 18, 2026 in favor of Antigravity CLI)

OpenAI Codex CLI - Open‑source command‑line agent for OpenAI

OpenHands - Multi-tool coding agent

Qwen Code - A command-line AI workflow tool for Qwen3-Coder

Ruler - Central AI agent rule registry

OpenCode - OSS terminal assistant; 150K+ GitHub stars and 6.5M+ MAU as of April 2026, with GitHub Copilot auth partnership (Jan 2026)

Vibe Kanban - Orchestrate multiple agents

Charm - A charming terminal agent, your new coding bestie

Goose - An open source, extensible AI agent that goes beyond code suggestions

DeepCode - Transforms research papers and natural language into production-ready code

Mistral Vibe CLI - Mistral Vibe is a command-line coding assistant powered by Mistral's models

CodeWhale - Open-source terminal coding agent (formerly DeepSeek TUI); edits files, runs shell commands, and manages git with approval gates or full auto mode; DeepSeek-first with multi-provider routing, MCP support, subagents, LSP diagnostics, and a headless HTTP/SSE API for CI and editor integration

Crush - Charm's TUI-first Go-based agent, multi-model with MCP + LSP support, source-available under FSL-1.1-MIT

CLIProxyAPI - Wraps Gemini CLI, Antigravity, Codex, Claude Code, and Grok Build as an OpenAI/Gemini/Claude/Codex-compatible API; lets you use OAuth subscriptions through any SDK

`Desktop IDEs`

Visual Studio Code

IntelliJ IDEA / PyCharm / WebStorm

Xcode

Eclipse, NetBeans

Atom - Atom community fork

Blackbox IDE

`Cloud & AI‑Powered IDEs`

Google Antigravity - Agentic development platform; Antigravity 2.0 (I/O 2026) co-developed with Gemini 3.5 Flash, includes new SDK and Google Workspace integration

Cursor - AI-first VS Code fork; Cursor 3 (April 2026) replaces Composer sidebar with full-screen Agents Window for parallel multi-agent execution across local/cloud/SSH/worktrees, adds Design Mode and Composer 2 model

Windsurf - Agentic IDE acquired by Cognition (Dec 2025, ~$250M); Windsurf 2.0 (April 2026) adds Agent Command Center, Spaces, and Devin Cloud integration; Wave 13 brought free SWE-1.5

Zed - High-performance Rust editor with AI chat

Amp - VS Code Extension

Trae - ByteDance AI IDE

Augment Code - Developer AI platform that helps you understand code, debug issues, and ship faster

Warp - An agentic development environment

Kiro - Helps you do your best work by bringing structure to AI coding with spec-driven development

Nimbalyst - Open-source visual workspace for AI coding agents (Claude Code, Codex; OpenCode + Copilot in alpha); MIT-licensed desktop + iOS apps with kanban session management, inline AI diff review, parallel sessions with worktree isolation

ZCode - Free desktop coding harness from Z.ai, optimized for GLM-5.2; macOS, Windows, and Linux; supports Goals for long-running multi-step tasks, bot control from WeChat, Feishu, or Telegram, and 20+ coding tools

`AI App Builders`

Bolt - Browser-based AI app builder

Lovable - Chat-to-app builder; hit $300M ARR by January 2026

Replit - Cloud IDE w/ Ghostwriter; $400M Series D at $9B valuation (March 2026)

v0.dev - Vercel text-to-UI generator

Nectry - Responsible vibe coding for the enterprise

Reflex - From prompt to production, build and deploy Python apps

Superblocks - Build secure internal apps with AI

vybe - Build internal apps 10X faster

Emergent - YC-backed, build ambitious apps with agentic vibe-coding

orchids v2 - YC-backed, the world's first AI Full Stack Engineer

Same - YC-backed, build fullstack web apps by prompting

Aura - Generate beautiful designs in seconds and export to HTML or Figma

21st.dev - Build products that reflect the team's own taste

Base44 - Lets you build fully-functional apps in minutes with just your words; SOC 2 Type II and ISO 27001 certified (Feb 2026)

VibeFlow - YC-backed, transform your AI-generated frontend mockups into fully functional applications

Blink.new - The world's first vibe coding platform that builds agentic AI apps

a0 - YC-backed, ship mobile apps to the App Store and Google Play with AI

Anything - Create powerful apps & websites by chatting with AI

Rocket - Think It. Type It. Launch It.

Google AI Studio - Build your ideas with Gemini; I/O 2026 added Workspace integrations and new vibe coding capabilities

Variant - Gives your ideas room to grow...to branch, remix, and become what they're meant to be

sleek.design - Design mobile apps in minutes

`Mobile AI App Builders`

Rork - Builds complete, cross-platform mobile apps using AI and React Native

Vibecode - Create native apps in seconds with AI

bitrig - Build apps for your phone, on your phone

Spielwork - The TikTok for vibecoded mini games!

Gizmo - A new way to make playful, personal software—right from your phone

Hivemind - The fastest & easiest way to chat & code with any AI in one app

Bloom - YC-backed, go from idea to native mobile app on your phone without writing a single line of code

Vibe Code Go - YC-backed, code from your phone, a mobile app for software engineers

`Open Source AI App Builders`

Hugging Face DeepSite - Access the most simple and powerful AI Vibe Code Editor to create your next project

Dyad - A local, open-source AI app builder

Open Lovable - Clone and recreate any website as a modern React app in seconds

bolt.diy - Bolt.new OSS version, AI-powered full-stack web dev for NodeJS based apps, choose the LLM you use for each prompt

app.build - An open-source AI agent that builds full-stack apps

ToolJet - An open-source low-code framework to build and deploy internal tools

Adorable - Another open source Lovable version

Vercel - OSS Vibe Coding Platform

Cloudflare VibeSDK - Run an entire vibe coding platform end-to-end, with just one click

Pythagora - VS Code-native AI dev platform with 14 specialized agents for planning, coding, debugging, and one-click deployment of React + Node apps

`Other Useful AI DevTools`

Ollama - Chat & build with open models

LM Studio - Run gpt-oss, Qwen, Gemma, DeepSeek on your computer

Open WebUI - Self-hosted AI platform designed to operate entirely offline

SillyTavern - A locally installed UI for text, image, and voice LLMs

Unsloth - An open-source framework for LLM fine-tuning and reinforcement learning

n8n - Flexible AI workflow automation for technical teams

Firecrawl - Turn websites into LLM-ready data

Agents.md - A simple, open format for guiding coding agents, used by over 20k open-source projects

Vercel AI Gateway - A gateway to access hundreds of models with zero markup on tokens (including BYOK)

OpenRouter - A unified API providing access to hundreds of AI models through a single endpoint

Fabric - An open-source modular system for solving specific problems using crowdsourced AI prompts that can be used anywhere

Vibetunnel - VibeTunnel proxies your terminals right into the browser, so you can vibe-code anywhere

Anannas - Single API to access any LLM - Seamlessly connect to multiple models through a single gateway with failproof routing, cost control, and instant usage insights

CodeRabbit - AI code reviews - cut code review time & bugs in half

Giga AI - Giga's context engineering improves quality and understanding — so your AI works right the first time, and you build faster

Gas Town - Multi-agent orchestrator for Claude Code. Track work with convoys; sling to agents

Claude Code Plugin Marketplace - Anthropic's official directory; plugins bundle skills, MCP servers, hooks, and slash commands. SKILL.md format adopted by OpenAI Codex CLI as an open standard

Anthropic Agent Skills - Public Anthropic skills repo for PDF, Excel, PowerPoint, Word, MCP generation, and more

wshobson/agents - Multi-harness plugin marketplace with 82 plugins, 191 agents, 155 skills, 102 commands; one source-of-truth, runs natively on Claude Code, Codex CLI, Cursor, OpenCode, and Gemini CLI

Superpowers - Agentic skills framework + software dev methodology by Jesse Vincent; enforces brainstorming → planning → TDD → subagent-driven implementation → review; works across Claude Code, Codex CLI, Codex App, Factory Droid, Gemini CLI, OpenCode, Cursor, and GitHub Copilot CLI

claudemarketplaces.com - Community-curated directory of Claude Code skills, plugins, and MCP servers

SkillsMP - Community marketplace for agent skills across Claude Code, Codex CLI, and ChatGPT

claude-devtools - Free open-source desktop app reading ~/.claude/ session logs; per-turn token attribution across 7 context categories, subagent execution trees, syntax-highlighted diffs, multi-session compare

Chrome DevTools MCP - Google's official MCP server giving AI agents access to Chrome DevTools for page inspection, performance insights, and debugging

Browserbase Agents - Managed browser agent platform from Browserbase (June 2026); create an agent from a natural language prompt and run it with a single API call, no script per site; runs on infrastructure behind 35M+ browser sessions a month; agents tune over time to run faster, debug their own failed runs, and refactor as tasks grow

Datadog MCP Server - GA March 2026; live logs, metrics, and traces into Claude Code, Cursor, Codex, and Copilot

Honeycomb MCP - Observability platform with expanded MCP integrations across AI coding tools (March 2026)

AgentOps - YC W24 open-source agent engineering platform; session replay, cost tracking, failure detection across CrewAI, LangGraph, OpenAI Agents SDK

Sim - YC X25 ($7M Series A) open-source (Apache 2.0) visual drag-and-drop agent workflow builder; 27K+ GitHub stars; AI-native alternative to n8n and Langflow

`Coding Benchmarks & Leaderboards`

`Software Engineering`

SWE-Bench Pro (Private Dataset) - A new benchmark designed to provide a rigorous and realistic evaluation of AI agents for software engineering [OpenAI states SWE-Bench Pro is saturated as of July 2026.]

SWE-Bench Pro (Public Dataset) - Designed to provide a rigorous and realistic evaluation of AI agents for software engineering; SEAL leaderboard with standardized scaffolding, 1,865 multi-language tasks including private proprietary codebases (contamination-resistant); developed to address several challenges: data contamination, limited task diversity, oversimplified problems, and unreliable and irreproducible testing

[Deprecated] SWE-bench Verified - SWE-bench evaluates LLM performance on real world software issues collected from GitHub (the "Verified" subset is a specific version of the dataset designed to be more reliable)

SWE-bench - SWE-bench evaluates LLM performance on real world software issues collected from GitHub

SWE-bench Multilingual - 300 curated SWE-bench style tasks from 42 repositories representing 9 programming languages

SWE-bench-Live - Monthly-refreshed benchmark with 50 new verified issues added per month; now includes Windows split and Multi-Lang (C/C++/C#/Go/JS/TS/Rust)

SWE-rebench - A Continuously Evolving and Decontaminated Benchmark for Software Engineering LLMs

SWE-DEV - Evaluating and Training Autonomous Feature-Driven Software Development

Multi-SWE-bench - A Multilingual Benchmark for Issue Resolving

Live-SWE-agent - Can Software Engineering Agents Self-Evolve on the Fly?

DeepSWE (Datacurve) - Open-source long-horizon software engineering benchmark from Datacurve (May 2026); 113 tasks across 91 repositories in TypeScript, Go, Python, JavaScript, and Rust; contamination-free with tasks written from scratch; solutions require 5.5× more code than SWE-bench Pro prompts

MirrorCode - Epoch AI × METR, a new long-horizon SWE benchmark measuring AI performance on weeks-long coding tasks; released April 2026 with early results showing frontier models can already complete some multi-week software engineering work

SWE-Marathon - Ultra-long-horizon software engineering benchmark from Abundant AI (June 2026); 20 multi-hour tasks including full-stack product clones, library rewrites, ML engineering, and optimization; 1,300 logged trials; no configuration above 19% resolution rate; Claude Opus 4.8 tops the leaderboard at 26%

FrontierCode - Benchmark from Cognition (June 2026) measuring code mergeability, not just correctness; 150 tasks across 36 open-source repositories, each built by the repo's own maintainer at 40+ hours per task; evaluates behavioral correctness, test quality, scope discipline, style, and codebase adherence; 81% lower false positive rate than SWE-Bench Pro; Claude Opus 4.8 leads at 13.4% on the hardest Diamond subset

PR Arena - Software engineering agents head to head

Modu Merge Rate Leaderboard - Real-world success rates: Ranking top coding agents by their pull request merge performance on Modu

Aider - Aider polyglot coding leaderboard

`General Coding & Model Rankings`

Artificial Analysis Intelligence Index - Composite leaderboard independently measuring AI models across agents, coding, scientific reasoning, and general knowledge; updated daily with live API performance data including speed and latency

Code Arena - Community-voted coding leaderboard with 200k+ votes ranking models on agentic coding tasks; covers web dev, React, HTML, game development, data visualization, and image-to-code generation

LiveCodeBench Pro - A benchmark composed of problems from Codeforces, ICPC, and IOI that are continuously updated to reduce the likelihood of data contamination

LiveCodeBench - Holistic and Contamination Free Evaluation of Large Language Models for Code

BigCodeArena - A human-in-the-loop platform for evaluating code through execution

OpenBench Coding - An open-source framework for standardized, reproducible benchmarking of large language models (LLMs)

OpenRouter - Model, Market Share, Use Case Categories, and App Rankings

Repo Bench - Measuring large context reasoning, file editing precision, and instruction adherence

Context-Bench - A benchmark for agentic context engineering

`Agentic & Real-World Task Completion`

GDPval-AA - Artificial Analysis' evaluation of OpenAI's GDPval dataset; tests AI agents on real-world knowledge work tasks across 44 occupations and 9 industries using shell and web access; models produce actual deliverables like documents, slides, and spreadsheets — ELO-ranked via blind pairwise comparisons

AutomationBench - Open benchmark from Zapier (April 2026) measuring whether AI agents can complete real business workflows end-to-end; 47 simulated SaaS tools across Sales, Marketing, Operations, Support, Finance, and HR; scored on proof of outcome — did the work get done correctly, or didn't it

APEX-Agents - The AI Productivity Index for Agents (APEX-Agents) measures whether frontier AI agents can execute long-horizon, cross-application tasks across three jobs in professional services

τ-bench / τ2-bench - Benchmarking AI agents in collaborative real-world scenarios

Vending-Bench 2 - Measuring AI model performance on running a business over long time horizons

OSWorld - Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Terminal-Bench@2.0 - A benchmark measuring the capabilities of AI agents in a terminal environment

Terminal-Bench - A benchmark measuring the capabilities of AI agents in a terminal environment

CORE-Bench Hard - The agent is given the codebase of a published scientific paper and must install all libraries and dependencies, run the code, and read through the output and figures to answer questions about the paper

Blueprint-Bench 2 - Agentic spatial reasoning benchmark from Andon Labs (May 2026); agents process ~20 interior photos of 50 apartments and generate accurate 2D floor plans; tests room connectivity inference, scale reasoning, and layout reconstruction; scores normalized so random baseline = 0 and perfect = 1

MCP Atlas - Evaluates how well language models handle real-world tool use through the Model Context Protocol (MCP)
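Two scoring conventions mentioned in this section, Elo ranking from blind pairwise comparisons (GDPval-AA) and scores normalized so that a random baseline is 0 and a perfect result is 1 (Blueprint-Bench 2), can be sketched as below. This is a minimal sketch under stated assumptions: it uses the textbook Elo update with a K-factor of 32 and a plain linear rescale. Neither benchmark's actual scoring code is reproduced here, and all function names and parameters are hypothetical.

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return (new_a, new_b) after one pairwise comparison.

    k=32 is an assumed K-factor, not a value taken from any leaderboard.
    """
    expected_a = elo_expected(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


def normalize_score(raw: float, random_baseline: float, perfect: float) -> float:
    """Linearly rescale so random_baseline maps to 0 and perfect maps to 1.

    Assumes a linear mapping; values below the baseline come out negative.
    """
    if perfect == random_baseline:
        raise ValueError("perfect and random_baseline must differ")
    return (raw - random_baseline) / (perfect - random_baseline)


if __name__ == "__main__":
    # Two equally rated models; the first one wins a blind comparison.
    print(elo_update(1500.0, 1500.0, a_won=True))  # (1516.0, 1484.0)
    # A raw score halfway between an assumed baseline and an assumed maximum.
    print(normalize_score(0.55, random_baseline=0.2, perfect=0.9))
```

Pairwise ratings only carry meaning relative to other entries on the same leaderboard, while a baseline-normalized score can be read on its own: anything near 0 is no better than guessing.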

`Reasoning`

ARC-AGI-2 - Stress testing the efficiency and capability of state-of-the-art AI reasoning systems

ARC-AGI-3 - First fully interactive benchmark in the ARC-AGI series (March 2026); hundreds of turn-based game environments with no instructions, rules, or stated goals; agents must explore, infer objectives, and adapt across difficulty levels; humans score 100%, frontier AI scored below 1% at launch

`Cybersecurity`

ExploitBench - Cybersecurity capability benchmark from Carnegie Mellon and Bugcrowd (May 2026); decomposes exploitation into 16 measurable flags across a five-tier ladder from crash to arbitrary code execution; scores Cap% (fraction of capability ladder achieved); open source on GitHub

`Developer Surveys`

Sonar State of Code Developer Survey 2026 - n=1,149; Copilot 75%, ChatGPT 74%, Claude 48%, Gemini 31%, Cursor 21%

JetBrains AI Pulse Survey + Developer Ecosystem Survey 2026 - 10,000+ professional developers (January 2026); tracks Claude Code, Cursor, JetBrains AI Assistant, Junie, GitHub Copilot, Codex, and Antigravity for awareness, adoption, and satisfaction

The state of AI coding in 2025: Adoption, proficiency, and transformation - The Modern Software Developer, December 2025

AI in Practice Survey 2025 - Theory Ventures, December 2025

Stack Overflow 2025 Developer Survey - 84% of respondents use or plan to use AI tools; trust is dropping (33% trust, 46% distrust AI output)

`Coding Model Timeline (foundation / open‑weight / frontier)`

Noteworthy releases; some entries may be updated model versions or model families.

`July 2026`

Inkling - Thinking Machines Lab, July 15, 2026; first model release from the lab, open-weights, mixture-of-experts transformer with 975B total / 41B active parameters, up to 1M-token context, pretrained on 45 trillion tokens across text, images, audio, and video; supports controllable thinking effort and native multimodal reasoning rather than optimizing for a single benchmark; scores 77.6% on SWE-Bench Verified, 54.3% on SWE-Bench Pro Public, and 63.8% on Terminal-Bench 2.1; reaches Nemotron 3 Ultra's Terminal-Bench score at roughly one-third the tokens; ranks among open-weight models on Design Arena's blinded Agentic Web Dev leaderboard; also releasing a preview of Inkling-Small, a 276B/12B-active sibling that matches or exceeds it on several benchmarks; full weights, including an NVFP4 checkpoint, on Hugging Face; available for fine-tuning on the lab's Tinker platform at launch, with API access via Together, Fireworks, Modal, Databricks, and Baseten

GPT-5.6 (Sol, Terra, Luna) - OpenAI, July 9, 2026; three-tier model family with the number marking the generation and each name marking a durable capability tier that can advance on its own schedule; Sol is the flagship for complex coding, cybersecurity, and long-running agentic work, priced at $5/$30 per million tokens; Terra is the balanced everyday tier, competitive with GPT-5.5 at about half the cost, priced at $2.50/$15; Luna is the fastest and cheapest tier, priced at $1/$6; Sol with max reasoning sets a new state of the art on the Artificial Analysis Coding Agent Index at 80, also sets new highs on Terminal-Bench 2.1 and DeepSWE; introduces max reasoning effort for Sol and an ultra mode that runs subagents concurrently; all three tiers are classified at OpenAI's "High" risk level for cyber and biological/chemical capability

Muse Spark 1.1 - Meta Superintelligence Labs, July 9, 2026; multimodal reasoning model built for agentic tool use, computer use, and coding, with a 1M-token context window and support for orchestrating multi-agent systems as either a primary agent or subagent; handles large, complex codebases including bug diagnosis, feature implementation, and large-scale migrations; scores 61.5% on SWE-Bench Pro, 80.0% on Terminal-Bench 2.1, and 53.3% on DeepSWE 1.1 for coding; leads its benchmark comparison on several agentic evals, including 88.1% on MCP Atlas, 54.7% on JobBench, and 62.1% on Humanity's Last Exam with tools, ahead of Gemini 3.1 Pro, Claude Opus 4.8, and GPT-5.5 on those three; released alongside the public preview of Meta's new Meta Model API, its first paid developer-facing model offering, priced at $1.25/$4.25 per million tokens; also live in Meta AI's Thinking mode; early API partners include Replit, Cline, and Box

Grok 4.5 - Cursor and SpaceXAI, July 8, 2026; mixture-of-experts model trained jointly by the two companies, Cursor's first built for more than software engineering, spanning data science, finance, legal work, and other knowledge domains; trained on trillions of tokens of Cursor usage data plus reinforcement learning on difficult problems built by a distributed agent system; new safeguards added for cybersecurity capabilities, with detection that blocks bad actors outright rather than silently downgrading intelligence; available today in Cursor across desktop, web, iOS, CLI, and SDK; base model priced at $2/$6 per million tokens, fast variant at $4/$18; distinct model class from Composer 2.5, which Cursor continues to offer separately

SWE-1.7 - Cognition, July 8, 2026; proprietary software-engineering model trained on a Kimi K2.7 base with a refined reinforcement-learning recipe, optimized for long-horizon asynchronous coding tasks; scores 42.3% on Cognition's own FrontierCode benchmark and about 81.5% on Terminal-Bench 2.1, which the company positions as within a few points of Claude Opus 4.8 and GPT-5.5; runs at 1,000 tokens per second via Cerebras at $1.97 per task; available today inside Devin (Web, Desktop, and CLI); not open source, and benchmark and cost claims come from Cognition's own launch materials rather than independent testing

Hy3 - Tencent, July 6, 2026; official release following the April 23 preview; 295B total / 21B active parameter MoE model with an additional 3.8B-parameter MTP layer, built on a hybrid fast-and-slow-thinking architecture; strengthens agent capabilities over predecessor Hy2; 57.9% SWE-bench Pro, 68.3% SWE-bench Multilingual, 45.6% NL2repo, 71.7% Terminal Bench 2.1; Tencent says the model is competitive with GLM-5.1 and GLM-5.2; positioned as a cost-efficient model for software development, office work, financial modeling, front-end design, and game production, with two to five times better parameter efficiency than flagship models; 256K context; Apache 2.0 license, available day one on Hugging Face and ModelScope, with progressive rollout to OpenRouter, Cline, Kilo, and other developer platforms; API live on Tencent Cloud TokenHub

Claude Fable 5 - Anthropic, redeployed July 1, 2026 after a brief export-control suspension; Mythos-class model made safe for general use via safety classifiers that fall back to Opus 4.8 on cybersecurity, biology/chemistry, and distillation queries; state-of-the-art on software engineering, vision, knowledge work, and long-horizon agentic tasks; scored highest on Cognition's FrontierCode evaluation among frontier models; priced at $10/$50 per million tokens

Claude Mythos 5 - Anthropic, redeployed July 1, 2026 after a brief export-control suspension; same underlying model as Fable 5 with all safeguards lifted (cybersecurity, biology/chemistry, distillation); not generally available, restricted to vetted Project Glasswing cyberdefense and critical infrastructure partners; a separate trusted access program for biomedical researchers is planned; priced at $10/$50 per million tokens
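Several entries in this section quote prices as a pair such as $5/$30 per million tokens. A rough way to compare them is to turn each pair into a per-request cost, as sketched below. The sketch assumes the pair means input price / output price, which the entries do not state explicitly. It also ignores cache discounts, reasoning-token billing, and tiered pricing. The `Price` class, `PRICES` table, and `request_cost` function are hypothetical names; the figures are copied from the entries above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per one million tokens (assumed to be input / output)."""
    input_per_million: float
    output_per_million: float


# Figures as quoted in the July 2026 entries.
PRICES = {
    "GPT-5.6 Sol": Price(5.00, 30.00),
    "GPT-5.6 Terra": Price(2.50, 15.00),
    "GPT-5.6 Luna": Price(1.00, 6.00),
    "Muse Spark 1.1": Price(1.25, 4.25),
    "Grok 4.5": Price(2.00, 6.00),
    "Claude Fable 5": Price(10.00, 50.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the listed per-million prices."""
    price = PRICES[model]
    total = (input_tokens * price.input_per_million
             + output_tokens * price.output_per_million)
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical agentic turn: 200k tokens of context in, 20k tokens out.
    for name in PRICES:
        print(f"{name:>16}: ${request_cost(name, 200_000, 20_000):.2f}")
```

For agentic coding workloads, input tokens usually dominate, so the input price often matters more than the headline output price when comparing tiers.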

`June 2026`

Claude Sonnet 5 - Anthropic, June 30, 2026; most agentic Sonnet yet, with performance close to Opus 4.8 at a lower price; default model for Free and Pro plans, also available in Claude Code and the Claude Platform; cyber safeguards enabled by default but substantially weaker cyber capabilities than Opus 4.8 and Mythos 5; introductory pricing $2/$10 per million tokens through August 31, 2026, then $3/$15

LongCat-2.0 - Meituan, June 30, 2026; 1.6T-parameter MoE coding and agentic model with ~48B activated parameters per token, a 1M-token context window via LongCat Sparse Attention, and 35T+ pretraining tokens; trained entirely on a 50,000-card domestic AI ASIC cluster with no Nvidia hardware, a first at this scale for non-Nvidia training; previously ran anonymously on OpenRouter as "Owl Alpha," where it led developer usage charts for two months; 59.5 SWE-Bench Pro, 70.8 Terminal-Bench 2.1, 77.3 SWE-Bench Multilingual; MIT licensed; API priced at $0.75/$2.95 per million tokens with no charge for cache hits

Qwen-AgentWorld - Qwen (Alibaba), June 23, 2026; not a coding model but a world model for simulating coding and agent environments; predicts the next environment state given an agent's action and history across seven domains including Terminal, SWE, Web, and OS; built on Qwen3.5-35B-A3B-Base, 35B total / 3B active, 262K context, Apache 2.0; a larger 397B-A17B variant tops its own AgentWorldBench; useful for training and evaluating coding agents rather than writing code directly

Sakana Fugu - Sakana AI, June 22, 2026; not a standalone model but a multi-agent orchestration system delivered as one OpenAI-compatible API, shipping in two tiers (Fugu and Fugu Ultra); dynamically assembles and coordinates a pool of existing frontier models per task, drawing on two ICLR 2026 papers (TRINITY and Conductor) on learned model orchestration; reports SWE-Bench Pro 73.7 and TerminalBench 2.1 82.1 for Fugu Ultra, though benchmarks compare the orchestrated ensemble against single frontier models and the underlying model pool is proprietary and undisclosed; positioned explicitly as frontier capability without export-control exposure; Fugu Ultra priced at $5/$30 per million tokens; not available in the EU/EEA

GLM-5.2 - Z.ai (Zhipu), launched June 13, 2026 with MIT open weights released June 16; flagship long-horizon and coding model in the GLM-5 family; 744B-parameter MoE with ~40B active per token and a solid 1M-token context window; introduces IndexShare, which reuses one indexer across every four sparse attention layers to cut per-token FLOPs by 2.9x at 1M context, plus an improved MTP layer that lifts speculative-decoding acceptance length by up to 20%; multiple thinking-effort levels (High and Max); 62.1 SWE-bench Pro, 81.0 Terminal-Bench 2.1, 63.7 ProgramBench; no regional limits

Kimi K2.7 Code - Moonshot AI, June 12, 2026; coding-focused agentic model built on Kimi K2.6; 1T total / 32B active MoE with 256K context, native INT4 quantization, and multimodal input (image and video) via a 400M-parameter MoonViT vision encoder; improves long-horizon coding task completion while cutting thinking-token usage roughly 30% versus K2.6; open weights under Modified MIT License

North Mini Code 1.0 - Cohere Labs, June 9, 2026; 30B total / 3B active open-weight MoE model optimized for code generation, agentic software engineering, and terminal tasks; 256K context window with 64K max output; decoder-only sparse MoE using interleaved sliding-window and global attention; post-trained with cascaded SFT followed by reinforcement learning with verifiable rewards (RLVR); released under Apache 2.0

Nemotron 3 Ultra - NVIDIA, June 4, 2026; 550B total parameters, 55B active; open-weight hybrid Mamba-Transformer MoE built for long-running agentic coding and orchestration workflows; scores 48 on the Artificial Analysis Intelligence Index, making it the highest-scoring open-weight model from a US lab at release; 65–70.4% on SWE-bench Verified; up to 5x higher inference throughput than comparable open models; weights, training recipes, and quantized variants all released publicly

MiniMax M3 - MiniMax, June 1, 2026; first open-weight model to combine frontier-level coding, a 1M-token context window, and native multimodal input (text, image, video) in a single architecture; uses new MiniMax Sparse Attention (MSA) design; 59.0% SWE-Bench Pro; API live at launch, open weights committed to Hugging Face and GitHub within ten days

`May 2026`

Claude Opus 4.8 - Anthropic, May 28, 2026; improvements across coding, agentic tasks, and long-horizon professional work; 88.6% SWE-bench Verified, 69.2% SWE-bench Pro; introduces dynamic workflows in Claude Code enabling hundreds of parallel subagents in a single session; pricing held flat from Opus 4.7 at $5/$25 per million tokens

Qwen 3.7 Max - Alibaba Cloud, announced May 20, 2026 at the Alibaba Cloud Summit; API-only flagship with 1M-token context, native extended-thinking mode, and benchmark wins on SWE-Pro (60.6) and Terminal-Bench 2.0 (69.7); no open weights at launch

Gemini 3.5 Flash - Google DeepMind, announced at Google I/O 2026 (May 19); Google's strongest agentic and coding model at launch, scoring 76.2% on Terminal-Bench 2.1; available at roughly half the cost of comparable frontier models; co-developed with Antigravity and powers Antigravity 2.0

Qwen3 Coder Next - Alibaba Cloud, listed May 18, 2026 on LLM Gateway; open-weight 80B MoE with 3B active parameters designed for coding agents and local deployment; originally released February 4, 2026

Cursor Composer 2.5 - Cursor, May 18, 2026; agentic coding model fine-tuned on Moonshot AI's open-weight Kimi K2.5 checkpoint; 79.8% SWE-Bench Multilingual, 69.3% Terminal-Bench 2.0; priced at $0.50 per million input tokens, roughly one-tenth the cost of Claude Opus 4.7 at launch

Gemini 3.1 Flash Lite - Google DeepMind, hit AI gateways May 8, 2026; default-tier successor

Zyphra ZAYA1-8B - Zyphra, May 6, 2026; 8B MoE with ~760M active parameters trained entirely on AMD Instinct MI300X hardware; released under Apache 2.0; targets coding, mathematics, and reasoning at unusually high intelligence density per active parameter

GPT-5.5 Instant - OpenAI, became ChatGPT default May 5, 2026; replaced GPT-5.4 Instant as the default-tier model hundreds of millions of users interact with daily
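
The per-token prices quoted in the May 2026 entries can be turned into rough session costs. The sketch below uses only the figures listed above ($5/$25 per million input/output tokens for Claude Opus 4.7 and 4.8, $0.50 per million input tokens for Cursor Composer 2.5). The token counts and the helper name `token_cost` are hypothetical, chosen for illustration, and real bills may differ because of caching, batching, or tiered pricing.

```python
# Prices in USD per million tokens, taken from the entries above.
OPUS_INPUT_PRICE = 5.00      # Claude Opus 4.7 / 4.8, input
OPUS_OUTPUT_PRICE = 25.00    # Claude Opus 4.7 / 4.8, output
COMPOSER_INPUT_PRICE = 0.50  # Cursor Composer 2.5, input


def token_cost(tokens: int, price_per_million: float) -> float:
    """Return the USD cost of `tokens` at a per-million-token price."""
    return tokens * price_per_million / 1_000_000


# Hypothetical agent session: these token counts are made up for illustration.
input_tokens = 200_000
output_tokens = 50_000

opus_total = (
    token_cost(input_tokens, OPUS_INPUT_PRICE)
    + token_cost(output_tokens, OPUS_OUTPUT_PRICE)
)

# Ratio behind the "roughly one-tenth" input-price comparison.
input_price_ratio = COMPOSER_INPUT_PRICE / OPUS_INPUT_PRICE

print(f"Opus session cost: ${opus_total:.2f}")
print(f"Composer 2.5 input price vs. Opus 4.7: {input_price_ratio:.2f}x")
```

Only input prices are compared for Composer 2.5 because the entry above does not list an output price.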

`April 2026`

DeepSeek V4 Pro - DeepSeek, released April 24; 1.6T total / 49B active parameters, new architecture and largest DeepSeek model to date; hybrid thinking/non-thinking; 1M token context window; #2 open-weight reasoning model on Artificial Analysis Intelligence Index; leads open-weight models on GDPval-AA agentic work tasks; MIT license

DeepSeek V4 Flash - DeepSeek, released April 24; 284B total / 13B active parameters; DeepSeek's first two-tier lineup with Flash positioned for fast, low-cost inference; one of the cheapest small frontier models available; MIT license

GPT-5.5 - OpenAI, released April 23; smartest and most intuitive model yet, excels at agentic coding, computer use, and knowledge work; more token-efficient than GPT-5.4 for most Codex tasks while matching per-token latency; 82.7% on Terminal-Bench 2.0; available to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex; GPT-5.5 Pro variant available for Pro/Business/Enterprise

Kimi K2.6 - Moonshot AI, released April 20; open-weight 1T-parameter MoE with 32B active parameters; first open-weight model to beat GPT-5.4 (xhigh) on SWE-Bench Pro; 80.2% on SWE-Bench Verified, 58.6% on SWE-Bench Pro; supports 4,000+ tool calls and 12+ hour continuous execution; strong cross-language generalization across Rust, Go, Python, and front-end; 256K context window

Claude Opus 4.7 - Anthropic, most capable generally available model; scores 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro; excels at long-horizon agentic tasks, complex multi-step coding, and professional knowledge work; adds high-resolution vision (2,576px) for improved accuracy on charts, documents, and UIs; new adaptive thinking and configurable effort levels including xhigh mode for demanding coding tasks; 1M token context window; first Opus model with automated cybersecurity safeguards and a Cyber Verification Program for security professionals

MiniMax M2.7 - MiniMax, released April 13; self-evolving capabilities via production feedback loops; 56.22% on SWE-Bench Pro; most cost-efficient option in its performance tier at ~$0.30/M input tokens; 200K context window

Meta Muse Spark - Meta, released April 8 from Meta Superintelligence Labs; Meta's first proprietary (non-open-source) model; scores 52 on the Artificial Analysis Intelligence Index (vs. Llama 4 Maverick's 18); leads all models on CharXiv Reasoning at 86.4%; available free on meta.ai in Instant and Thinking modes

GLM-5.1 - Z.ai (Zhipu AI), released April 7; 754B MoE; 58.4% on SWE-Bench Pro; Code Arena Elo of 1,530, third globally on agentic web development leaderboard; strong front-end component generation; MIT license

Gemma 4 - Google DeepMind, four open-weight variants (E2B through 31B Dense) released April 2 under Apache 2.0 for the first time — removing prior commercial restrictions; 31B Dense scores 80% on LiveCodeBench v6 and 89.2% on AIME 2026; natively multimodal with function calling and agentic workflow support; runs locally from phones to workstations; Codeforces ELO jumped from 110 in Gemma 3 to 2,150
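
Several April 2026 entries quote both total and active parameter counts for MoE models. A minimal sketch of how to compare them is to compute the share of parameters that is active, using the figures listed above (1.6T/49B for DeepSeek V4 Pro, 284B/13B for DeepSeek V4 Flash, 1T/32B for Kimi K2.6). The function name `active_fraction` is a hypothetical helper, and the ratio is only a rough proxy for inference cost, not a benchmark.

```python
# (total, active) parameter counts in billions, as quoted in the entries above.
MODELS = {
    "DeepSeek V4 Pro": (1600, 49),
    "DeepSeek V4 Flash": (284, 13),
    "Kimi K2.6": (1000, 32),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Return the share of parameters that is active, as a value in [0, 1]."""
    if total_b <= 0 or not 0 <= active_b <= total_b:
        raise ValueError("expected 0 <= active <= total and total > 0")
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {active_b}B of {total_b}B parameters active ({share:.1%})")
```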

`March 2026`

GPT-5.4 - OpenAI, most capable and efficient frontier model for professional work; first mainline model to incorporate GPT-5.3-Codex's coding capabilities into a general-purpose model; adds native computer use (75% on OSWorld, surpassing the 72.4% human expert baseline); 1M token context window; scores 83% on GDPval knowledge work benchmark and ~80% on SWE-bench Verified; rolled out March 5 across ChatGPT, the API, and Codex

GPT-5.4 mini & nano - OpenAI, smaller efficient variants released March 17; mini approaches GPT-5.4-level coding performance at ~6x lower cost and is available as a subagent in Codex; nano targets classification, data extraction, and lightweight coding subagents; both optimized for fast iteration in coding workflows

NVIDIA Nemotron 3 Super - NVIDIA, 120B parameter open-weight hybrid Mamba-Transformer MoE (12B active per token) released at GTC March 11; sets new open-weight record of 60.47% on SWE-bench Verified; 1M token context window; 5x higher throughput than prior generation; optimized for multi-agent systems with entire-codebase-in-context capability; fully open — weights, datasets, and training recipes released under NVIDIA Nemotron Open Model License

Mistral Small 4 - Mistral AI, 119B parameter MoE (6B active per token) released March 16; unifies four previously separate products — Magistral (reasoning), Pixtral (multimodal vision), Devstral (agentic coding), and Mistral Small (instruct) — into a single model; configurable reasoning effort per request; 256K context window; 40% faster and 3x higher throughput vs. predecessor; Apache 2.0 license

GLM-5.1 - Z.ai (Zhipu AI), coding-optimized iteration of GLM-5 (744B MoE, 40B active) released March 27; achieves 94.6% of Claude Opus 4.6's coding benchmark performance; MIT license; GLM Coding Plan starts at $3/month — a significant cost-to-performance disruption for coding-heavy workloads

`February 2026`

Gemini 3.1 Pro - Google DeepMind, upgraded core intelligence for the Gemini 3 series with improved reasoning — scores 77.1% on ARC-AGI-2 (more than double Gemini 3 Pro) and 80.6% on SWE-bench Verified; 1M token context window with 65K output; leads most general benchmarks and ranked #1 on Artificial Analysis Intelligence Index at launch; available via Gemini API, Vertex AI, AI Studio, and Google Antigravity

Claude Sonnet 4.6 - Anthropic, leads GDPval-AA Elo benchmark for real expert-level work with 1,633 points; preferred over previous Sonnet 70% of the time in Claude Code testing; 1M token context window (beta); default model on Claude.ai free and pro plans and powering GitHub Copilot's coding agent

GPT-5.3-Codex-Spark - A research preview of OpenAI's first model designed for real-time, ultra-fast coding. Powered by Cerebras Wafer Scale Engine 3, it delivers more than 1,000 tokens per second with near-instant responsiveness, optimized for interactive work like making targeted logic edits or refining interfaces. While smaller than the full GPT-5.3-Codex, it demonstrates strong agentic performance on SWE-Bench Pro and Terminal-Bench 2.0 (58.4% accuracy) in a fraction of the time. Features a 128k context window and a lightweight working style that prioritizes minimal, high-speed edits to keep developers in a tight interactive loop

Qwen3.5 - Alibaba Cloud, 397B parameter native vision-language model with only 17B active per forward pass via hybrid linear attention (Gated Delta Networks) and sparse MoE architecture; strong across reasoning, coding, agent capabilities, and multimodal understanding; 76.4% on SWE-bench Verified, 52.5% on Terminal-Bench 2; 1M context window; multilingual support expanded from 119 to 201 languages; 8.6x decoding throughput improvement

Zhipu AI GLM-5 - A flagship Mixture-of-Experts (MoE) model with 745B total parameters (44B active) designed for "Agentic Engineering." It achieves state-of-the-art performance for open-source models, narrowing the gap with Claude Opus 4.5 in complex system refactoring and deep debugging. It features a 200k token context window and is released under a permissive MIT license. It was notably trained independently of US hardware, using Huawei Ascend infrastructure and the MindSpore framework

MiniMax M2.5 - MiniMax, a model optimized specifically for end-to-end developer workflows, including multi-file edits and test-validated repairs. It leads industry leaderboards with an 80.2% score on SWE-Bench and operates 37% faster than comparable frontier models. Supports a 200k context window and a specialized "thinking mode" for complex logic. Designed for high-efficiency agent loops, it offers a significantly lower cost-to-performance ratio for long-running autonomous sessions

Claude Opus 4.6 - Anthropic's smartest model with improved coding skills including better planning, sustained agentic tasks, operation in larger codebases, and enhanced code review and debugging to catch its own mistakes. First Opus-class model with 1M token context window (beta). Applies capabilities to everyday work tasks including financial analyses, research, and document/spreadsheet/presentation creation. Achieves state-of-the-art performance on Terminal-Bench 2.0 (agentic coding), Humanity's Last Exam (multidisciplinary reasoning), GDPval-AA (knowledge work tasks), and BrowseComp (information retrieval). Maintains industry-leading safety profile with low rates of misaligned behavior

GPT-5.3-Codex - OpenAI's most capable agentic coding model, combining the coding performance of GPT-5.2-Codex with GPT-5.2's reasoning capabilities in a single model that's 25% faster. Handles long-running tasks involving research, tool use, and complex execution. You can steer and interact with it mid-task without losing context. First OpenAI model to help create itself

`January 2026`

SERA-32B - Ai2, the first model in Ai2's Open Coding Agents series; a state-of-the-art open-source coding agent that achieves 49.5% on SWE-bench Verified, matching frontier open models like Devstral-Small-2 (24B) and larger models like GLM-4.5-Air (110B); trained using Soft Verified Generation (SVG), a simple and efficient method that is 26x cheaper than reinforcement learning and 57x cheaper than previous synthetic data methods at equivalent performance; total cost for data generation and training is approximately $2,000 (40 GPU-days)

Kimi K2.5 - Moonshot AI, open-source visual agentic model; reports global SOTA on agentic benchmarks, including HLE full set (50.2%) and BrowseComp (74.9%), and open-source SOTA on vision and coding, including MMMU Pro (78.5%), VideoMMMU (86.6%), and SWE-bench Verified (76.8%); "Code with Taste" turns chats, images, and videos into aesthetic websites with expressive motion; Agent Swarm (beta) runs self-directed agents in parallel at scale, with up to 100 sub-agents and 1,500 tool calls, 4.5× faster than a single-agent setup

GLM-4.7-Flash - Z.ai, a local coding and agentic assistant setting a new standard for the 30B class, balancing high performance with efficiency, making it the perfect lightweight deployment option; also recommended for creative writing, translation, long-context tasks, and roleplay
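
The SERA-32B entry above quotes a total cost of approximately $2,000 for 40 GPU-days. As a rough sanity check, the implied compute rate can be worked out as below. This is a sketch under a loud assumption: it treats the whole quoted budget as GPU rental, which the entry does not state, so the result is an upper bound on the hourly rate rather than a reported figure.

```python
# Figures quoted in the SERA-32B entry above.
TOTAL_COST_USD = 2_000  # approximate cost of data generation plus training
GPU_DAYS = 40

HOURS_PER_DAY = 24

# Assumption: the full budget is spent on GPU time (not stated in the source).
cost_per_gpu_day = TOTAL_COST_USD / GPU_DAYS
cost_per_gpu_hour = cost_per_gpu_day / HOURS_PER_DAY

print(f"Implied cost per GPU-day:  ${cost_per_gpu_day:.2f}")
print(f"Implied cost per GPU-hour: ${cost_per_gpu_hour:.2f}")
```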

`December 2025`

M2.1 - MiniMax, a new open-source AI model with 10 billion activated parameters (230 billion total) democratizing high-performance agentic capabilities, scoring 74.0 on SWE-bench Verified and 91.5 on VIBE-Web benchmarks. It excels in multi-language programming (Rust, Java, Go, C++, TypeScript, etc.), UI development, and complex real-world office workflows while offering full transparency and accessibility through both HuggingFace weights and API access

GLM-4.7 - Z.ai, updated model optimized for AI coding assistance; shows major improvements over GLM-4.6 across coding tasks (including a 5.8% gain on SWE-bench and 12.9% on multilingual coding), UI/webpage generation, tool usage, and complex reasoning, with better performance in chat, creative writing, and role-play scenarios

GPT-5.2-Codex - OpenAI, the most advanced agentic coding model yet for complex, real-world software engineering. An optimized version of GPT‑5.2 for agentic coding in Codex, including improvements on long-horizon work through context compaction, stronger performance on large code changes like refactors and migrations, improved performance in Windows environments, and significantly stronger cybersecurity capabilities

Gemini 3 Flash - Google, delivers high-speed, pro-grade reasoning and outperforms even the Pro model in coding benchmarks, making it an ideal tool for low-latency agentic workflows and complex multimodal tasks like video analysis and real-time data extraction

GPT‑5.2 Thinking - OpenAI, sets a new state of the art of 55.6% on SWE-Bench Pro, a rigorous evaluation of real-world software engineering. This model can more reliably debug production code, implement feature requests, refactor large codebases, and ship fixes end-to-end with less manual intervention

Devstral 2 - Mistral AI, next-generation coding model family available in two sizes: Devstral 2 (123B) and Devstral Small 2 (24B). Devstral sets the open state-of-the-art for code agents. Devstral 2 ships under a modified MIT license, while Devstral Small 2 uses Apache 2.0. Both are open-source and permissively licensed to accelerate distributed intelligence

rnj-1-instruct - Essential AI, trained from scratch and optimized for code and STEM with capabilities on par with SOTA open-weight models, performs well across a range of programming languages and boasts strong agentic capabilities (e.g., inside agentic frameworks like mini-SWE-agent), while also excelling at tool-calling

`November 2025`

Claude Opus 4.5 - Anthropic, intelligent, efficient, and the best model in the world for coding, agents, and computer use, also meaningfully better at everyday tasks like deep research and working with slides and spreadsheets

GPT-5.1-Codex-Max - OpenAI, an update to OpenAI's foundational reasoning model, trained on agentic tasks across software engineering, math, research, and more; faster, more intelligent, and more token-efficient

Gemini 3 - Google, Google's most intelligent model, which can help bring any idea to life; outperforms previous versions across every major AI benchmark and surpasses 2.5 Pro at coding, mastering both agentic workflows and complex zero-shot tasks

Doubao-Seed-Code - ByteDance Volcengine, claims breakthroughs in performance, price, and migration cost; deeply integrated with the TRAE development environment

GPT-5-Codex-Mini - OpenAI, allows roughly 4x more usage than GPT-5-Codex, at a slight capability tradeoff due to the more compact model

Mercury Coder - Inception Labs, dLLM optimized to accelerate coding workflows, streaming, tool use, and structured output with 128K context window

`October 2025`

Composer - Cursor, 4x faster than similarly intelligent models and built for low-latency agentic coding

SWE-1.5 - Windsurf Cognition, a fast-agent frontier-size model with hundreds of billions of parameters that achieves near-SOTA coding performance, 6x faster than Haiku 4.5 and 13x faster than Sonnet 4.5

CoDA-1.7B - Salesforce AI Research, diffusion-based language model designed for powerful code generation and bidirectional context understanding

KAT-Dev-72B-Exp - Kwaipilot, an open-source 72B-parameter model for software engineering tasks, achieves 74.6% accuracy on SWE-Bench Verified when evaluated strictly with the SWE-agent scaffold

`September 2025`

Code World Model (CWM) - AI at Meta, an LLM for code generation and reasoning, trained to better represent and reason about how code and commands affect the state of a program or system

DeepSeek-V3.2-Exp - DeepSeek, experimental sparse-attention upgrade that halves inference cost while retaining strong code-generation and long-context reasoning

GLM-4.6 - Z.ai, features a longer context window, superior coding performance, advanced reasoning, more capable agents, and refined writing versus GLM-4.5

Claude Sonnet 4.5 - Anthropic, the strongest model for building complex agents and the best model at using computers; shows substantial gains on tests of reasoning and math

Qwen3-Max-Instruct - Alibaba Cloud, the official release further elevates its capabilities — particularly in coding and agent performance

GPT‑5-Codex - OpenAI, a version of GPT‑5 further optimized for agentic coding in Codex and trained with a focus on real-world software engineering work

Kimi K2-Instruct-0905 - Moonshot AI, updated SOTA model with improved agentic and frontend capabilities and increased context length

`August 2025`

GPT-5 - OpenAI, flagship model

GPT-5-mini - OpenAI, fast/cost efficient

GPT-5-nano - OpenAI, faster/cost efficient

Claude Opus 4.1 - Anthropic, a drop-in replacement for Opus 4

Mistral Medium 3.1 - Mistral AI, aka Mistral-Medium-2508; enterprise-grade model that excels at coding tasks

Grok Code Fast 1 - xAI, a speedy and economical reasoning model that excels at agentic coding, efficient code generation, and execution

`July 2025`

Qwen3-Coder - Alibaba Cloud, agentic code model

Qwen3-Coder-Flash - Alibaba Cloud, streamlined non-thinking agentic code model

Kimi K2 - Moonshot AI, 1T-param MoE

GLM-4.5 - Z.ai, an open-source LLM designed for intelligent agents

Codestral 25.08 - Mistral AI, code model for high-precision fill-in-the-middle (FIM) completion

Devstral Medium 2507 - Mistral × All Hands AI, high-quality and cost-effective model

Devstral Small 1.1 2507 - Mistral × All Hands AI, agentic model

Grok 4 - xAI, trained with reinforcement learning for native tool use, including code interpreters, making it highly capable for coding and advanced reasoning tasks

`June 2025`

Gemini 2.5 Pro - Google DeepMind, flagship model

Gemini 2.5 Flash - Google DeepMind, fast/cost efficient with thinking capabilities

`May 2025`

Claude Opus 4 - Anthropic, pushes the frontier in coding, agentic search, and creative writing

Claude Sonnet 4 - Anthropic, improves on Claude Sonnet 3.7 across a variety of areas, especially coding

DeepSeek-R1-0528 - DeepSeek, OSS reasoning model

`April 2025`

o3 - OpenAI, preview reasoning model

o4-mini - OpenAI, compact model

GPT-4.1 - OpenAI, flagship model with 1M token context window

Llama 4 Maverick - Meta, code-tuned model

Llama 4 Scout - Meta, open-weight model

Mellum - JetBrains, 4B-param OSS model

`March 2025`

DeepSeek-V3-0324 - DeepSeek, improved V3 version

`February 2025`

Gemini 2.0 Flash - Google DeepMind, multimodal for high-volume high-frequency tasks

Claude 3.7 Sonnet - Anthropic, first hybrid reasoning model and state-of-the-art for coding

Grok 3 - xAI, coding capable model

`About The Author`

Joy Larkin is a technologist in Silicon Valley. She likes robots and is excited for Superintelligence. LinkedIn: /in/joylarkin ◦◦◦ Twitter: @joy