Which language model should your organization actually use? After spending over €500 on API costs and testing 16 leading models on thousands of real Dutch exam questions, we discovered something remarkable: some of the best-performing models cost 165 times less than the most expensive ones, for the same accuracy.
Why Dutch exam benchmarks matter
We used official Dutch high school exam questions from six subjects, including Dutch literature and mathematics. These subjects mirror real-world applications of language understanding and reasoning.
Why exams? Because they test genuine comprehension, not pattern matching, and because widely used standardized test datasets may already be contaminated (models may have seen the questions during training), which makes their results unreliable.
The results: performance rankings
We tested 16 models across three provider categories: frontier labs (OpenAI, Anthropic, Google, xAI), open source (DeepSeek, Mistral, Llama), and mid-range options.
Top 5 performers
- 1. GPT-5 (OpenAI): 79.5% · $29.30
- 2. GPT-5 Mini (OpenAI): 79.5% · $5.48
- 3. DeepSeek-R1 (Open Source): 78.7% · $10.93
- 4. Grok-3 (xAI): 76.8% · $43.17
- 5. Gemini 2.5 Pro (Google): 76.7% · $31.86
GPT-5 Mini stands out as the clear winner, achieving the same top-tier 79.5% accuracy as the full GPT-5 model at a fraction of the cost: $5.48 per run versus $29.30.
The €500 Anthropic bill
Running Anthropic's Claude models through our benchmark was remarkably expensive. Claude Opus 4.1 cost $270.74 for just 73.1% accuracy, roughly 50 times more expensive than GPT-5 Mini for worse results.
The budget champions? GPT-5 Nano at $1.64, GPT-OSS-120B at $2.06, and GPT-5 Mini at $5.48. When you can get top-tier accuracy for five dollars, paying hundreds feels like a strategic error.
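The headline ratios above are simple arithmetic on the reported figures. The sketch below recomputes them and ranks the models where both accuracy and per-run cost are given, by dollars spent per accuracy point (GPT-5 Nano is omitted because its accuracy is not reported here):

```python
# Cost-effectiveness using the accuracy (%) and per-run cost (USD)
# figures reported in this article.
results = {
    "GPT-5":           (79.5, 29.30),
    "GPT-5 Mini":      (79.5, 5.48),
    "DeepSeek-R1":     (78.7, 10.93),
    "Grok-3":          (76.8, 43.17),
    "Gemini 2.5 Pro":  (76.7, 31.86),
    "Claude Opus 4.1": (73.1, 270.74),
}

# Dollars spent per accuracy point: lower is better value.
by_value = sorted(results.items(), key=lambda kv: kv[1][1] / kv[1][0])
for name, (acc, cost) in by_value:
    print(f"{name:16s} {acc:5.1f}%  ${cost:7.2f}  ${cost / acc:.3f}/point")

# The headline ratios: Opus 4.1 vs. GPT-5 Mini, and vs. the $1.64 GPT-5 Nano run.
print(round(270.74 / 5.48))   # ~49x more expensive for lower accuracy
print(round(270.74 / 1.64))   # ~165x, the gap quoted in the introduction
```

GPT-5 Mini comes out first and Claude Opus 4.1 last, which is the article's point in one sort key.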
Open source is closing the gap
One of the top five performing models, DeepSeek-R1, is fully open source, and the budget-friendly GPT-OSS-120B is an open-weight release as well. This challenges the assumption that you need expensive proprietary APIs for high-quality results.
The implications extend beyond cost savings. European organizations concerned about data sovereignty, regulatory compliance, or supply chain resilience have viable alternatives that don't require compromising on quality. The performance gap between proprietary and open-source models continues narrowing.
Practical recommendations
Choose models based on your domain
A model that excels at English coding tasks might struggle with Dutch literature comprehension. Always test on your specific use case.
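Testing on your own use case does not require heavy infrastructure. Below is a minimal harness sketch: `ask_model` is a placeholder for whatever provider client you actually use, and the questions and exact-match grading are illustrative, not the Dutch exam data from this article.

```python
from typing import Callable

def evaluate(ask_model: Callable[[str], str],
             questions: list[tuple[str, str]]) -> float:
    """Return a model's accuracy over (question, expected_answer) pairs."""
    correct = 0
    for question, expected in questions:
        answer = ask_model(question)
        # Exact-match grading; real exam grading needs a rubric or a judge model.
        if answer.strip().lower() == expected.strip().lower():
            correct += 1
    return correct / len(questions)

# Toy usage with a hard-coded stand-in "model":
sample = [("2 + 2 = ?", "4"), ("Capital of the Netherlands?", "Amsterdam")]
dummy = lambda q: "4" if "2 + 2" in q else "Amsterdam"
print(evaluate(dummy, sample))  # 1.0
```

Swap `dummy` for a real client call per provider and you can reproduce a comparison like ours on your own domain.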
Consider open source for privacy-critical applications
If the model weights can run on your own machine, you're less vulnerable to silent updates, outages, or data exposure.
Don't commit to one provider
When identical performance costs $1.64 or $270.74 depending on your provider, vendor lock-in is an expensive mistake.
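One practical way to keep that option open is to route all completions through a thin interface, so switching vendors is a one-line change. The class and method names below are illustrative, not any specific SDK:

```python
from typing import Protocol

class ChatProvider(Protocol):
    def complete(self, prompt: str) -> str: ...

class LocalStub:
    """Stand-in provider; replace with a real OpenAI/Anthropic/local client."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

def answer(provider: ChatProvider, prompt: str) -> str:
    # Application code depends only on the interface, never on a vendor SDK.
    return provider.complete(prompt)

print(answer(LocalStub(), "hello"))  # echo: hello
```

With this shape, moving from a $270.74-per-run provider to a $1.64 one is a constructor swap, not a rewrite.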
Conclusion
The LLM landscape in 2025 is more competitive, more affordable, and more open than ever. Open source models now match proprietary ones at a fraction of the cost.
Originally published on LinkedIn.