Optimizing Optimizely: How we use server-side experimentation for product development

Written by:

Zack Liscio

To put it simply, Optimizely Full Stack decouples the deployment of code from the delivery of features the code represents. That allows you to quickly release new features to audience segments, test the performance of the changes, analyze their impact, and either release them more widely or roll them back, all from an intuitive drag-and-drop dashboard. No need to rewrite the code every time. What’s more, Full Stack allows experimentation wherever code runs, on a website or on a FitBit.  

Let’s take a closer look at what that all means and how it’s done. 

Use cases, from the simple to the complex 

Full Stack originally came about because a lot of the experiments our customers were running were exceeding the capabilities the web product was designed for. Web experimentation can be as simple as a piece of JavaScript making a change on a given webpage. There are many experiments that don’t really fit that paradigm, and customers increasingly want to experiment with changes in channels other than the website.

For example, a customer may want to change a user experience across a number of online locations. Let’s say they want to release a new checkout experience: it’s not just that I want to change what people see on one page; I want them to have the experience across a series of different pages. I need to roll out changes consistently across all the places my users interact with the feature, which might include, for example, multiple screens of a web app, an iOS app, and an Android app.

Those are the types of use cases where our customers are really looking for more than a visual editor to modify a page; they’re looking for what we call a “decision service.” In a sense, that decision service is fundamentally simple. I want to know what version of the feature to offer to specific, rule-defined audience segments:

  • Testing on the entire audience 
  • Testing the feature being on or off for different groups of customers 
  • Testing several versions of the feature 

The easiest option is a decision about turning the feature on or off – turn it on for high-value users (or perhaps low-value users if you’re uncertain about its impact) and off for everyone else. If I want to get more complex with that decision, I can decide to turn on version A of the feature for one segment, version B for another, and for users who don’t match any of those segments, fall back on a default experience. And if something goes wrong, I can just turn the feature off.
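
To make that decision logic concrete, here is a minimal sketch in Python. It is not the Optimizely SDK; the names Rule, Feature, and decide, along with the user attributes and thresholds, are hypothetical and exist only to illustrate the idea of ordered segment rules with a default fallback and an off switch.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

User = Dict[str, Any]  # hypothetical user attributes, e.g. {"platform": "ios"}


@dataclass
class Rule:
    """One audience rule: users for whom `matches` is true get `variation`."""
    name: str
    matches: Callable[[User], bool]
    variation: str


@dataclass
class Feature:
    key: str
    enabled: bool = True  # master switch: set to False to roll the feature back
    rules: List[Rule] = field(default_factory=list)
    default_variation: str = "default"


def decide(feature: Feature, user: User) -> Optional[str]:
    """Return the variation for this user, or None if the feature is off."""
    if not feature.enabled:
        return None
    for rule in feature.rules:  # rules are checked in order; first match wins
        if rule.matches(user):
            return rule.variation
    return feature.default_variation


# Assumed example: version A for high-value users, version B for mobile users,
# and the default experience for everyone else.
checkout = Feature(
    key="new_checkout",
    rules=[
        Rule("high_value", lambda u: u.get("lifetime_spend", 0) >= 1000, "version_a"),
        Rule("mobile", lambda u: u.get("platform") in {"ios", "android"}, "version_b"),
    ],
)
```

Calling decide(checkout, user) returns "version_a", "version_b", or "default" depending on which rule the user matches first. Setting checkout.enabled = False makes every call return None, which stands in for turning the feature off.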

You can see how a decision that starts as a basic on-off switch gets to a place where there might be multiple variations, and suddenly you’re running a complex experiment.  

Here’s an example. Let’s say we have a search page that shows 50 results by default. Maybe we want to run an experiment to see if people respond differently if it shows 10 results instead of 50. Maybe we want to show that version to just 10% of the audience, or to a rule-governed segment of users. We can roll out the test (with no new coding), and easily measure downstream impact. Did overall revenue change with the test audience? Did it change time on site? Or any other metric of interest? 
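
As a rough illustration of how a 10% split like this can be kept stable per user, the sketch below hashes a user ID into one of 100 buckets. This is an assumption about one common approach, not a description of how Full Stack buckets traffic; the function names, the experiment key, and the choice of SHA-256 are all hypothetical.

```python
import hashlib

DEFAULT_PAGE_SIZE = 50   # current behavior from the example above
TEST_PAGE_SIZE = 10      # variation under test
TRAFFIC_PERCENT = 10     # share of users who see the variation


def bucket(user_id: str, experiment_key: str) -> int:
    """Map a user to a stable bucket in the range 0-99.

    Hashing the experiment key together with the user ID means the same
    user always lands in the same bucket for a given experiment.
    """
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def results_per_page(user_id: str) -> int:
    """Return 10 results for users in the test bucket range, otherwise 50."""
    if bucket(user_id, "search_page_size") < TRAFFIC_PERCENT:
        return TEST_PAGE_SIZE
    return DEFAULT_PAGE_SIZE
```

Because the assignment depends only on the user ID and the experiment key, a returning user keeps seeing the same page size, which keeps downstream metrics such as revenue or time on site attributable to one variation.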

The significance of “feature flags” 

We built a platform that makes feature flags more efficient and powerful. A feature flag can be thought of as just a kind of wrapper that can be placed around any piece of code that represents a feature change. In other words, it’s a place in your code where a decision is going to be made, such as show a new homepage to these users, put these users into the new checkout experience, show these users a different CTA button, and so on.
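
In code, that wrapper can be as small as a single conditional. The sketch below is a simplified stand-in, not Optimizely's API: FLAGS is a hypothetical in-memory dictionary playing the role of dashboard-managed configuration, and is_enabled and render_homepage are made-up names.

```python
# Hypothetical flag state. In a real setup this would be fetched from a
# remotely managed configuration rather than hard-coded.
FLAGS = {"new_homepage": False}


def is_enabled(flag_key: str) -> bool:
    """Return the flag's state; unknown flags are treated as off."""
    return FLAGS.get(flag_key, False)


def render_homepage() -> str:
    # The flag wraps the decision point: both code paths are already
    # deployed, and the flag state picks which one runs.
    if is_enabled("new_homepage"):
        return "new homepage"
    return "current homepage"
```

Flipping FLAGS["new_homepage"] to True changes what render_homepage() returns without touching the function itself, which is the separation between deploying code and delivering a feature.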

Those decisions are made according to rules – and in the dashboard we can just keep stacking rules, such as those that define an audience, to whatever level of complexity is appropriate.

Notice that we are not making changes to our code in order to roll out these experiments, because the variations representing different features already exist in the code. We can control the rollout and experimentation process from the Full Stack dashboard, without taking more developer time.

At Optimizely, we have an internal mandate to release pretty much every new feature behind a feature flag so it’s ready to be put into an experiment. Some features are released progressively or selectively; others we just turn on, but we have the ability – from the dashboard – to roll back the feature if we don’t like it. For any important changes, we do typically run an experiment. 

A real-life use case 

At Optimizely, our philosophy is “ABT” – always be testing. We use Optimizely to run experiments – on Optimizely. Some experiments are large and complex, but it’s often the continual, incremental improvements that make a big difference to our customers. 

One of these experiments was an attempt to identify an opportunity to improve the speed of our user interface. We noticed that a certain part of our application loaded slowly, which could be frustrating for our users. Our backend team decided to rewrite some of the supporting logic (how permissions were retrieved by the application) with the hypothesis that this would lead to a faster overall experience.  

The team implemented their proposed change behind a feature flag, of course, and then started testing the new implementation on just a small portion of traffic. After running the experiment, we were able to conclude that the new implementation was indeed safe to use and, best of all, increased the speed of the application for end users by 45%. We were then able to turn that feature flag on for everyone, bringing that performance gain to our entire user base.

This cycle of problem identification, hypothesis generation, proposed solution testing, and winning variation rollout was facilitated end to end by our unified feature flagging and experimentation suite.