Posted April 09

The future of experimentation

As execution gets cheap, clarity gets expensive.

The Red Queen told Alice it takes all the running you can do just to stay in the same place. AI delivered exactly that dynamic to digital organizations. Teams generate ideas faster, produce content faster, and run experiments faster than ever. So do their competitors.

At the same time, AI is compressing the top of the funnel. Fewer interactions are making it all the way to your owned surfaces, which means many teams now face two shifts at once: the cost of launching experiments is collapsing, and the traffic available to learn from is getting scarcer.

That changes the game. Running more experiments faster is no longer a competitive advantage on its own. Many teams can now do that. The advantage comes from learning faster under tighter constraints.

The execution-clarity paradox

This is the central tension.

AI has made it dramatically cheaper to act, but not cheaper to know what action is worth taking.

Execution costs are collapsing. The cost of clarity, by contrast, is rising: knowing what to test, which metric matters, and what the result actually means. The teams that mistake cheaper execution for easier learning will optimize faster toward the wrong outcomes.

The model of experimentation was built for scarcity

For most of the last decade, experimentation programs were organized around scarcity. Statistical expertise sat with small data science teams. Engineering time was scarce, so tests competed with roadmap work. Analyst bandwidth was limited, so interpretation often happened days or even weeks after a test ended.

Those constraints are fast disappearing. Implementation used to be the gating function for experimentation. Increasingly, it is not.

We are already seeing AI systems surface hypotheses, generate production-quality variations, accelerate test setup, summarize results, produce executive-ready readouts, and recommend next-best actions. For a large share of common experiments in content, messaging, and layout, the marginal cost of execution is approaching zero.

The data backs this up.

According to a Gartner forecast (March 2026), by 2030 the cost of performing inference on a trillion-parameter LLM will fall by over 90% compared to 2025. That trajectory is collapsing execution costs across every AI-assisted workflow, experimentation included.

Meanwhile, McKinsey's 2025 global survey found that 88% of organizations now use AI in at least one business function, yet the majority remain in piloting stages, not yet realizing enterprise-wide impact. Adoption is scaling faster than the ability to learn from it, which is exactly the gap experimentation programs need to close.

But removing scarcity does not remove complexity. It relocates it. The bottleneck has not disappeared. It has moved upstream.

Cheap execution increases volume faster than it increases understanding. The real questions are now harder:

What are we trying to learn?
Which outcome matters?
Which metric indicates true signal, not just noise?

Three things become critical human responsibilities.

  1. Defining the metric framework: The metric framework is the set of leading indicators that should predict the business outcome you care about. If the optimization loop targets day-7 activation but day-7 activation doesn’t predict 12-month retention, the organization has built an efficient machine pointed at the wrong target.
  2. Setting guardrails: A loop optimized only for conversion will eventually find the most aggressive path to conversion, whether or not it creates long-term value. Someone still has to define the constraints.
  3. Knowing when to override the system: Some strategic moves underperform in the short term. The system will try to revert to the local optimum. Knowing when to hold course is still human judgment.

Most experimentation teams are organized around running experiments. When AI owns that, the job moves upstream. It is deciding what is worth learning, what business outcome is worth moving, and which metrics actually tell you if you succeeded.

Are you still testing pages or testing decision policies?

The old mental model of experimentation was static comparison: A versus B. Two variations, one winner, ship it. That model is not dead. But it is no longer the center of gravity.

Increasingly, the object being tested is not a page or a variation. It is a decision policy: what to show, when to intervene, how to route, which offer or model or prompt to invoke, across web, app, and email.

A variation is a fixed experience. A policy is a set of rules, probabilities, or learned behaviors that determines which experience gets delivered under which conditions. It has to be evaluated across contexts, channels, user segments, and time, not just declared a winner and shipped.
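
To make the distinction concrete, here is a minimal sketch in Python. Every name, field, and rule is illustrative, not taken from any particular product:

```python
from dataclasses import dataclass

@dataclass
class Context:
    """Conditions under which an experience is delivered (illustrative fields)."""
    channel: str   # "web", "app", or "email"
    segment: str   # e.g. "new" or "returning"
    hour: int      # local hour of day, 0-23

def fixed_variation(_: Context) -> str:
    """A variation: the same experience for every user, in every context."""
    return "hero_v2"

def policy(ctx: Context) -> str:
    """A policy: rules that decide which experience is delivered per context."""
    if ctx.channel == "email":
        return "plain_text_offer"
    if ctx.segment == "new" and ctx.hour < 12:
        return "onboarding_banner"
    return "hero_v2"
```

A variation can be judged with a single comparison; a policy has to be judged across every context it branches on, which is why evaluation becomes the harder problem.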

The question is no longer which version performs better. It is which decision logic consistently produces better outcomes.

The teams that recognize this early will stop treating experimentation as a page-level optimization function and start treating it as a system for evaluating decision quality.

How do evals and experiments connect into one learning loop?

As decision systems become more dynamic, not every candidate should go straight into live traffic. This is where evals become essential.

Evals are the screening layer used to assess quality, consistency, and safety before a candidate reaches live users. In practice, that can mean curated golden datasets, unit tests for expected behavior, or model-based judging against defined criteria. Live experiments remain the proof layer to show whether a change actually moved behavior or business outcomes in real conditions. Neither alone is sufficient.
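
As a hedged sketch of that screening layer, here is a toy exact-match eval against a curated golden dataset. The dataset, the candidate router, and the promotion threshold are all illustrative assumptions:

```python
def exact_match_score(candidate, golden_set):
    """Fraction of golden examples the candidate handles as expected.
    `candidate` is any callable mapping an input to an output."""
    hits = sum(1 for inp, expected in golden_set if candidate(inp) == expected)
    return hits / len(golden_set)

# Illustrative golden dataset: (input, expected output) pairs.
GOLDEN = [
    ("refund request", "route_to_support"),
    ("pricing question", "route_to_sales"),
    ("bug report", "route_to_support"),
]

def candidate_router(text):
    # Hypothetical candidate policy under evaluation.
    return "route_to_sales" if "pricing" in text else "route_to_support"

PROMOTE_THRESHOLD = 0.9  # assumed quality bar before live traffic
score = exact_match_score(candidate_router, GOLDEN)
print(score >= PROMOTE_THRESHOLD)  # True: all three golden cases pass
```

Passing the screen earns the candidate live traffic; it does not prove the candidate moves any business metric. That proof is still the experiment's job.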

Andrej Karpathy's open-source autoresearch system ran 700 experiments in 48 hours with no human intervention because it had a reliable offline eval metric. The agent found 20 genuine improvements that months of manual work had missed. Shopify's CEO replicated the pattern overnight for a 19% gain. The lesson: when you have a trustworthy eval, experiment cost collapses to near zero. Without one, volume is just noise.

Evals without experiments produce quality assessments, not causal evidence. Experiments without evals waste live traffic on candidates that should never have been promoted.

The architecture that works is straightforward:

  • Define the policy
  • Test it offline
  • Send stronger candidates into traffic
  • Measure causal impact
  • Feed failures back into the eval system

Evals filter. Experiments prove.
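
The steps above can be sketched as a single loop. Every callable here is a hypothetical stand-in, not a real API, and the eval bar and significance threshold are assumptions:

```python
def learning_loop(candidates, offline_eval, run_experiment, eval_failures,
                  eval_bar=0.8, significance=0.05):
    """Evals filter; experiments prove. All callables are assumed stand-ins."""
    shipped = []
    for policy in candidates:                 # 1. define the policy
        score = offline_eval(policy)          # 2. test it offline
        if score < eval_bar:
            eval_failures.append(policy)      # 5. feed failures back into evals
            continue
        result = run_experiment(policy)       # 3-4. live traffic, causal impact
        if result["p_value"] < significance and result["lift"] > 0:
            shipped.append(policy)
        else:
            eval_failures.append(policy)      # 5. live misses improve evals too
    return shipped
```

The design choice worth noting: failures flow back into the eval system from both stages, so the offline screen gets sharper as the loop runs.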

This is already visible in ad tech. Google’s Performance Max and Meta’s Advantage+ generate and evaluate candidate policies continuously; the loop runs on its own rather than waiting for a human to declare a winner.

Over the last 1–2 years, the most forward-looking product and engineering leaders I speak with have started treating evals as the way to ensure the quality of what they ship. But A/B testing remains the gold standard for proving those experiences actually improve outcomes, especially when they are LLM-based and non-deterministic.

The strongest teams will stop treating evals and experiments as separate practices run by different people with different goals. The tooling to support this end-to-end is still maturing across the industry, but the architectural pattern is clear: they connect into one learning loop.

The COE does not disappear. Its job changes.

If the learning loop increasingly runs itself by surfacing hypotheses, generating variations, and interpreting results, the question becomes who sets the boundaries it operates within.

As experimentation becomes easier to launch, more distributed across teams, and less dependent on engineering, a central team cannot remain the reviewer of every test, the interpreter of every result, and the human backstop for every bad decision.

That does not mean governance becomes less important. It means governance will increasingly sit in the system itself: setup guardrails, design checks, metric warnings, standard evaluation loops, and clearer escalation paths for the cases that actually require human judgment.

The point is not to remove rigor from experimentation. It is to stop requiring a small number of people to manually carry all of it. The COE still matters because humans still have to shape the process, handle exceptions, and drive adoption across the organization.

The COE becomes less of a throughput bottleneck and more of a standards owner, change agent, and escalation path. It defines constraints, sets quality bars, decides where human oversight remains non-negotiable, and helps the organization adopt new ways of working without losing rigor.

The old COE protected the discipline by centralizing judgment. The next one will protect it by designing the system and organizational habits that democratize judgment safely.

Statistical rigor becomes the compounding advantage

Bad statistical inference scales just as easily as good inference. You are no longer risking one bad test. You are risking a system that gets better and better at optimizing the wrong thing.

Most experimentation programs were built when traffic was abundant enough to tolerate weak methodology. Teams could afford noisy tests, blunt metrics, and a fair amount of waste. Many cannot anymore. If execution cost is falling while the traffic available to learn from is getting scarcer, the organizations that learn efficiently gain a real edge.

That is why statistical rigor stops being methodology hygiene and becomes a structural advantage. Plugging data into an LLM and asking for a conclusion is not statistical rigor. Methods that improve signal efficiency and raise the standard of proof will matter more: variance reduction, sequential approaches, stronger proxy design, false-positive control, and tighter causal discipline.

More experiments are not the goal. More reliable learning is.

The winners in the next era will not simply be the teams that can launch more tests. They will be the ones that can learn faster under tighter constraints without lowering the standard of proof.

Wrapping up...

When execution gets cheap, clarity gets expensive. The next era of experimentation will not be won by the organizations that run the most tests. It will be won by the ones that know what they are trying to change, define the business outcome that matters, and can prove they changed it. As experimentation gets cheaper and more distributed, the real advantage shifts to judgment, governance, and rigor.

Check out the Optimizely Experience Optimization platform


Sources

  1. Gartner. Gartner Predicts That by 2030, Performing Inference on an LLM Will Cost Over 90% Less Than in 2025. Press release, March 25, 2026.

  2. McKinsey & Company. The State of AI in 2025: Agents, Innovation, and Transformation. November 2025.

  3. Karpathy, A. autoresearch. GitHub, March 2026. Coverage: Fortune, March 17, 2026.