Top Frontier AI Models Top Out At C+ … Barely Better Than Old Models

My latest at Forbes:

The latest, most expensive AI models from OpenAI and Anthropic boast incredible intelligence, but a recent study by Pearl, an AI systems company, throws a bucket of cold water on those gaudy claims. The study tested 25 leading models, including GPT-5.5 and Claude Opus 4.7, on 510 novel questions across five diverse domains (business, health, law, pets, tech), and the answers were judged not by traditional benchmarks but by real licensed professionals. The highest score? A mere 72.7% for GPT-5.5, with Claude Opus close behind at 71.9%. That’s roughly a C+.

As Pearl CEO Andy Kurtzig aptly put it, “Benchmarks measure whether a model can pass a test. We are asking whether a professional would trust the answer, and right now the answer is no. Almost right is still wrong.” Interestingly, the study found that adding more inference-time compute yielded only a marginal 1-2.6% improvement, and occasionally even led to worse answers. While models performed reasonably well in business (80.9%), performance plummeted to around 20% in critical fields like law and health. For executives at companies like Cisco and Meta who are considering shedding human workers, the message is stark: “AI makes real mistakes in every domain, and serious errors in high-impact areas.” Perhaps, just perhaps, we cannot let go of all the humans just yet.

Read the full article on Forbes →
