Frequently Asked Questions

What new method are Spotify engineers suggesting for testing AI?

Spotify engineers are proposing a 'funnel' approach: automated AI evaluations screen language model outputs first, and only the candidates that pass go on to A/B tests with real users.

How will this 'funnel' method help Spotify?

This method aims to make testing AI more efficient. By using AI evaluations to filter out weaker candidates early, Spotify can avoid spending time and resources on A/B tests that are unlikely to succeed.

What is the role of AI evaluations in this new method?

AI evaluations will act like automated judges. They will check AI-generated content for quality, relevance, and correctness before it goes to live user testing.

How does this connect to Spotify's other AI work?

This idea fits with Spotify's ongoing use of AI in development, such as its 'Honk' system for background coding agents. Spotify's engineering blog has reported that over 1,500 pull requests from this work have been merged.

What happens after the AI evaluations?

After the AI evaluations filter the best options, those models will be tested with real users through A/B tests. This helps confirm the AI's findings and catch any issues missed by automated checks.

Spotify Engineers Suggest New Way to Test AI

A Shift from Competing Methods to Complementary Processes

Recent commentary from Spotify's engineering circles argues for rethinking how large language model (LLM) evaluations and traditional A/B testing fit together. Rather than treating the two as separate or competing methodologies, the proposal combines them into a single "funnel" in which LLM evaluations act as a preliminary filter ahead of user-facing experiments.

Better Experiments with LLM Evals — A funnel, not a fork | Spotify Engineering - 1

The core proposition is that LLM evaluations, acting as automated judges, should assess candidate LLM outputs for relevance, coherence, and quality, and discard unpromising candidates before they consume resources in live A/B tests. This pre-screening is intended to make experimentation more efficient, so that only well-vetted options reach validation with real users.
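
The pre-screening step can be pictured with a minimal Python sketch. Nothing here comes from Spotify's implementation: the `Candidate` type, the `toy_judge` heuristic (a stand-in for a real LLM judge call), the equal weighting of the three criteria, and the 0.7 threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Rubric dimensions taken from the criteria named in the article.
DIMENSIONS = ("relevance", "coherence", "quality")


@dataclass
class Candidate:
    name: str
    outputs: List[str]  # sample outputs from this model or prompt variant


# A judge maps one output to a score per dimension, each in [0, 1].
Judge = Callable[[str], Dict[str, float]]


def toy_judge(output: str) -> Dict[str, float]:
    """Hypothetical stand-in for an LLM judge; uses crude text heuristics only."""
    words = output.split()
    return {
        "relevance": 1.0 if "playlist" in output.lower() else 0.0,
        "coherence": 1.0 if output.strip().endswith(".") else 0.5,
        "quality": min(len(words) / 8, 1.0),
    }


def score_candidate(candidate: Candidate, judge: Judge) -> float:
    """Average the judge's scores over all dimensions and all sample outputs."""
    per_output = [mean(judge(o)[d] for d in DIMENSIONS) for o in candidate.outputs]
    return mean(per_output)


def funnel(candidates: List[Candidate], judge: Judge, threshold: float = 0.7) -> List[str]:
    """Return the names of candidates whose mean judge score clears the threshold."""
    return [c.name for c in candidates if score_candidate(c, judge) >= threshold]


candidates = [
    Candidate("variant_a", ["Here is a relaxing playlist for your evening commute home."]),
    Candidate("variant_b", ["ok"]),
]
survivors = funnel(candidates, toy_judge)  # only these would proceed to an A/B test
```

In this toy run only `variant_a` survives the filter. In practice the judge would be a model call with a written rubric, and the threshold would be tuned rather than fixed.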

Calibrating Judges and Closing the Loop

A key part of the proposed "funnel" model is a continuous feedback loop. Running LLM evaluations on the data generated by A/B tests lets the automated judges be calibrated progressively, so that both the evaluation system and the experiments it feeds improve over time.

The process would work as follows:

LLM Evaluations First: These automated systems, which assess aspects like relevance, coherence, and tone, are applied to a pool of potential LLM-generated content. Their purpose is to weed out weaker options.
A/B Testing for Validation: The remaining, more promising candidates are then subjected to A/B tests. These experiments serve to confirm whether real users react as predicted by the evaluations and, crucially, to catch any regressions in secondary metrics that automated evaluations might miss.
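
The validation step can also be sketched in Python. The article does not say which statistical test Spotify uses; the two-proportion z-test below, the 0.05 significance level, and the `ab_verdict` helper are assumptions chosen only to show how a secondary ("guardrail") metric can veto a variant that wins on its primary metric.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple

# (control_successes, control_n, treatment_successes, treatment_n)
Counts = Tuple[int, int, int, int]


def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) under H0 that both arms share one rate."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se  # positive z means the treatment arm has the higher rate
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def ab_verdict(primary: Counts, guardrail: Counts, alpha: float = 0.05) -> str:
    """Check the guardrail first, then the primary metric."""
    z_guard, p_guard = two_proportion_z(*guardrail)
    if z_guard < 0 and p_guard < alpha:
        return "reject: guardrail regression"
    z_primary, p_primary = two_proportion_z(*primary)
    if z_primary > 0 and p_primary < alpha:
        return "ship"
    return "inconclusive"


# Illustrative numbers only: the primary metric improves, but the guardrail drops.
verdict = ab_verdict(primary=(1000, 10000, 1100, 10000),
                     guardrail=(9000, 10000, 8800, 10000))
```

Checking the guardrail before the primary metric reflects the point made above: a variant that the judges liked, and that users respond to, can still be rejected if it regresses a metric the automated evaluation never measured.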

This integration creates a cycle where insights from live user behavior feed back into refining the automated evaluation criteria, leading to smarter decisions in subsequent development cycles.
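
One simple way to picture this calibration is to check how often the judge's pass/fail decision matched the eventual A/B outcome, then adjust the pass threshold accordingly. The sketch below is an illustration under stated assumptions, not the method described by Spotify: the `History` record format, the agreement measure, and the grid search over thresholds are all hypothetical.

```python
from typing import List, Tuple

# Each record pairs a judge score in [0, 1] with whether the variant won its A/B test.
History = List[Tuple[float, bool]]


def agreement(history: History, threshold: float) -> float:
    """Fraction of past experiments where 'score >= threshold' matched the A/B outcome."""
    hits = sum((score >= threshold) == won for score, won in history)
    return hits / len(history)


def calibrate_threshold(history: History, grid: List[float]) -> float:
    """Pick the grid threshold with the highest agreement; ties go to the lower threshold."""
    return max(grid, key=lambda t: (agreement(history, t), -t))


# Made-up history for illustration: judge score, and whether the variant won.
history = [(0.9, True), (0.8, True), (0.75, False), (0.6, False), (0.4, False)]
best = calibrate_threshold(history, grid=[0.5, 0.7, 0.8])
```

With this made-up history, raising the threshold from 0.7 to 0.8 removes the one disagreement (the variant scored 0.75 but lost its test), so 0.8 is selected. A real calibration loop would more likely revise the judge's rubric or prompt as well, which a single threshold cannot capture.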

Background and Related Developments at Spotify

This discussion of LLM evaluation methodology comes as Spotify integrates AI more deeply into its development processes. The company has been vocal about its work with AI agents, particularly through an internal system named Honk.

Previous reports from Spotify's official technology blog have detailed the company's experiences with these "Background Coding Agents." These include explorations into context engineering for such agents, ensuring predictable results through strong feedback loops, and using AI to supercharge dataset migrations. One article highlighted that over 1,500 pull requests have been merged, demonstrating significant adoption of AI-generated code. Another piece mentioned the use of Claude Code within the Honk system to enhance coding efficiency and speed up product deployment.

The concept of "LLM-as-a-Judge" is not new, and various sources have discussed its potential and limitations. How to reduce bias in automated judges, and how closely their scores track human judgment, remain open areas of research and debate in the broader AI community. The Spotify proposal appears to be a practical strategy for applying such evaluation techniques within product development.
