

February 18, 2026
4 minutes
Anthropic released Claude Sonnet-4.6, and we ran it through SRE-skills-bench the same day. It tests models on the tasks SREs actually do: understanding infrastructure code, reasoning about cloud configurations, and mapping code diffs to real-world pull requests.
We also test frontier models in the context of the AI SRE we are building, running them through our agentic workflows: investigating incidents, correlating signals, and reasoning through causal chains. The Claude 4.6 model family introduces adaptive thinking, which changes the equation in ways raw accuracy scores don't capture.
Here is what we found.
The biggest story here is the Sonnet tier. Sonnet-4.6 gained over 4 points on Sonnet-4.5 at the same price. Meanwhile Opus barely moved between generations. The 4.6 release clearly improved Sonnet more than Opus.
| Model | SRE-skills-bench Score | Output Cost (per M tokens) |
|---|---|---|
| opus-4.6 | 94.7% | $25.00 |
| opus-4.5 | 94.6% | $25.00 |
| sonnet-4.6 | 90.4% | $15.00 |
| sonnet-4.5 | 85.9% | $15.00 |
For context, on SWE-bench Verified (the industry-standard software engineering benchmark), Sonnet-4.6 sits just 1 point behind Opus-4.6. On SRE-skills-bench, that gap widens to over 4 points. The gap on SRE knowledge tasks, which span cloud infrastructure, Kubernetes, networking, and security, is clearly harder to close than the gap on general coding.
Sonnet-4.6 edges out Opus on general SRE knowledge and matches it on AWS networking. On Kubernetes and compute, the gap is small enough that the 40% cost savings easily justify Sonnet. But on S3 security and IAM, where questions involve fine-grained policy evaluation and permission boundaries, Opus pulls away significantly.
| Task | sonnet-4.6 | opus-4.6 | Gap |
|---|---|---|---|
| GMCQ (General SRE) | 88.0% | 87.0% | +1.0 |
| Azure Compute | 92.6% | 95.6% | −3.0 |
| Azure Storage | 92.2% | 96.1% | −3.9 |
| Kubernetes | 94.5% | 97.3% | −2.8 |
| AWS Compute | 94.3% | 96.6% | −2.3 |
| AWS Network | 97.1% | 97.1% | 0.0 |
| AWS IAM | 85.2% | 92.2% | −7.0 |
| AWS S3 | 75.7% | 91.9% | −16.2 |
This is a useful signal for teams building AI SRE tools: you don't necessarily need a single model for everything. An AI SRE that routes IAM and S3 policy questions to Opus while handling Kubernetes, compute, and general infrastructure work with Sonnet could get near-Opus accuracy at a significantly lower average cost. Model routing by domain isn't just a cost optimization. It's an accuracy optimization too.
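As a concrete illustration, here is a minimal sketch of that kind of routing, assuming the Anthropic Python SDK. The model IDs and the keyword heuristic are illustrative assumptions, not our production router.

```python
import anthropic

client = anthropic.Anthropic()

# Domains where Opus pulls away on SRE-skills-bench get the bigger model;
# everything else goes to Sonnet for the lower output cost.
# This keyword list is an illustrative assumption, not our production heuristic.
OPUS_KEYWORDS = {"iam", "s3", "bucket policy", "permission boundary"}

def pick_model(question: str) -> str:
    text = question.lower()
    if any(keyword in text for keyword in OPUS_KEYWORDS):
        return "claude-opus-4-6"    # assumed model ID
    return "claude-sonnet-4-6"      # assumed model ID

def ask(question: str) -> str:
    response = client.messages.create(
        model=pick_model(question),
        max_tokens=1024,
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

if __name__ == "__main__":
    print(ask("Why can this role still read the bucket despite an explicit deny in its IAM policy?"))
```

In practice the router can be as simple as a keyword table or as involved as a cheap classifier call; the point is that the routing decision, not the single model choice, sets your accuracy-per-dollar curve.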
When an incident fires, our AI SRE works the problem end to end: it pulls metrics and logs from observability tools, traces the fault across service boundaries, and narrows down to a root cause before recommending a remediation.
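To make the shape of that flow concrete, here is a toy, self-contained sketch of the loop: collect telemetry, trace across service boundaries, then narrow to a root cause and a remediation. The in-memory telemetry, dependency map, and ranking heuristic are invented stand-ins for our real observability integrations.

```python
from dataclasses import dataclass, field

# Invented in-memory telemetry and dependency map standing in for real
# observability queries (metrics, logs) and a service graph.
TELEMETRY = {
    "checkout": {"error_rate": 0.22, "logs": ["upstream timeout calling payments"]},
    "payments": {"error_rate": 0.61, "logs": ["connection pool exhausted"]},
}
DEPENDENCIES = {"checkout": ["payments"], "payments": []}

@dataclass
class Finding:
    service: str
    error_rate: float
    evidence: list[str] = field(default_factory=list)

def collect(service: str) -> Finding:
    data = TELEMETRY.get(service, {"error_rate": 0.0, "logs": []})
    return Finding(service, data["error_rate"], list(data["logs"]))

def investigate(alerting_service: str) -> dict:
    # Gather evidence for the alerting service, then walk its dependencies.
    findings = [collect(alerting_service)]
    findings += [collect(dep) for dep in DEPENDENCIES.get(alerting_service, [])]
    # Treat the service with the strongest fault signal as the likely root cause.
    root = max(findings, key=lambda f: f.error_rate)
    return {
        "root_cause": f"{root.service}: {'; '.join(root.evidence) or 'no clear fault signal'}",
        "remediation": f"start remediation runbook for {root.service}",
    }

print(investigate("checkout"))
```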
We maintain a separate internal evaluation suite built on real production incidents from our platform, spanning misconfigurations, resource contention, OS-level faults, metastable failures, and concurrent multi-service cascades. This eval is not currently open-source.
On this agentic benchmark, the picture changes. Sonnet-4.6 performed similarly to Opus-4.6 on root-cause accuracy, and in a few cases even beat it. Both models comfortably outperformed Opus-4.5 on our hardest investigations, and Sonnet-4.6 did it at about 40% less per token.
The reason: adaptive thinking allocates reasoning budget dynamically, with minimal overhead during data collection and full depth when building a diagnosis. That variable reasoning depth is exactly what static benchmarks can't measure.
The Claude 4.6 generation introduces adaptive thinking with four effort levels (low, medium, high, max), where the model chooses how deeply to reason. We've found that during investigations, the model spends minimal reasoning budget on data collection but shifts into deeper analysis when building causal hypotheses across services, which is exactly how an experienced SRE works.
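As a rough illustration of phase-dependent reasoning depth, the sketch below varies the extended-thinking budget per investigation phase through the Anthropic Messages API. The 4.6 effort levels may be exposed through a different parameter; the budgets and model ID here are assumptions.

```python
import anthropic

client = anthropic.Anthropic()

# Assumed per-phase reasoning budgets (tokens); tune for your own workloads.
PHASE_BUDGETS = {
    "collect": 1024,    # light reasoning while pulling metrics and logs
    "diagnose": 16000,  # deep reasoning while building causal hypotheses
}

def run_phase(phase: str, prompt: str) -> str:
    budget = PHASE_BUDGETS[phase]
    response = client.messages.create(
        model="claude-sonnet-4-6",  # assumed model ID
        max_tokens=budget + 4096,   # max_tokens must exceed the thinking budget
        thinking={"type": "enabled", "budget_tokens": budget},
        messages=[{"role": "user", "content": prompt}],
    )
    # Keep only the text blocks; thinking blocks carry the model's reasoning.
    return "".join(block.text for block in response.content if block.type == "text")

summary = run_phase("collect", "Summarize the errors in these log lines: ...")
```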
Beyond adaptive thinking, a few other 4.6 capabilities matter for AI SRE work. The 1M token context window means we don’t need to truncate log output as much before feeding it to the model. Context compaction automatically summarizes older context as conversations grow, so our AI SRE can run longer investigation trajectories without manual context management. And Anthropic reports improved prompt injection resistance compared to Sonnet-4.5, which matters when your agent operates on untrusted data like log lines, error messages, and webhook payloads.
We run every frontier model through SRE-skills-bench the day it launches. You can run it yourself, and check sreskillsbench.com for the full leaderboard. Follow us on LinkedIn and X for updated results.