Anthropic released Claude 3.7 Sonnet today, to the great excitement of AI-assisted coders everywhere. Sonnet has consistently been the most capable coding model out there, and the 3.7 vintage is no exception.
Claude 3.7 scores 60% on the Aider Polyglot Benchmark with no thinking tokens, and 65% with 32k thinking tokens (a new state of the art). Anecdotally, I wrote a little code with Claude 3.7 Sonnet today and it did very well.
The rumors of a scaling ceiling have been greatly exaggerated. In just four months, Claude delivered another major improvement in coding performance over its 10-22-24 release. Perhaps we should let things settle for a year or two before we start calling the end of any kind of technological trend.
Notably, these results were achieved without a wholesale switch to inference-time scaling. In fact, the gains from using reasoning tokens seem surprisingly limited for coding tasks. With a score of 65% vs. 60% at twice the cost on the above benchmark, I suspect I'll omit the thinking tokens more often than not.
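For context, the thinking-token budget is just a knob on the request. Here is a minimal sketch of what the two benchmark configurations look like as Anthropic Messages API request bodies; the parameter names (`thinking`, `budget_tokens`) follow Anthropic's documented API, but the exact token values and the prompt are illustrative assumptions:

```python
# Sketch of a Messages API request body with extended thinking enabled.
# Token budgets are illustrative, not a recommendation.
thinking_body = {
    "model": "claude-3-7-sonnet-20250219",
    "max_tokens": 64000,  # must exceed the thinking budget
    "thinking": {
        "type": "enabled",
        "budget_tokens": 32000,  # the 32k budget from the benchmark run
    },
    "messages": [
        {"role": "user", "content": "Refactor this function to be iterative."}
    ],
}

# Omitting the "thinking" key reproduces the no-thinking (60%) setup.
no_thinking_body = {k: v for k, v in thinking_body.items() if k != "thinking"}
```

Since thinking tokens are billed as output, the 32k budget is what roughly doubles the cost per task relative to the no-thinking run.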
One interesting data point from the benchmark: the new model seems less compliant. While 3.5 Sonnet returned correctly formatted diffs an astonishing 99.6% of the time, 3.7 Sonnet is down to 94%. I wonder what other implications weaker instruction following will have across use cases.
Claude 3.7 Sonnet is available today via Anthropic, Google Vertex, Amazon Bedrock, and of course OpenRouter. You can use it with Aider via aider --model anthropic/claude-3-7-sonnet-20250219. Happy coding!