A/B Test Reporting and Visualization in SQL

Ryan Iyengar · Dec 1, 2017

For a lot of e-commerce, subscription, or internet companies, A/B testing is a way of life. There are many tools and software services out there for setting up and reporting on tests, but if you happen to be in the privileged position of owning your own A/B test data in a relational database, then it’s worth setting up your own custom reporting in SQL.

I’ll be using Periscope Data for this exercise, along with some of their SQL shorthand to speed up my query writing. Specifically, I’ll reference sample data loaded into SQL views, which Periscope denotes with brackets as [table_name].

Example A/B Test Data

For this exercise, I’ve generated two tables of sample data, in a format you might expect to find in a relational database. Check out my blog post on how to generate sample data in Excel and Google Sheets if you’re interested in more specifics. You can also check out the Google Sheet for A/B Test Sample Data here.

ab_test_entrants

This table contains the date a given user_id was entered into a test, the name of the test they entered, and finally the version of the test they’re a member of. Tables like this should usually have a uniqueness constraint across user_id / test_name, so a business user knows that one user didn’t see multiple versions of a test. This particular sample table refers only to a single test, “new_funnel_test_2017-12-01”, but could theoretically contain more.
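
As a concrete reference, here’s a minimal sketch of what this table’s DDL might look like. The column names match the queries below, but the types and Postgres-style syntax are my assumptions about the underlying data:

create table ab_test_entrants (
  user_id varchar not null -- type is an assumption
  , ab_test_entrant_date date not null
  , ab_test_name varchar not null
  , ab_test_version varchar not null
  , unique (user_id, ab_test_name) -- one version per user, per test
);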

ab_test_conversions

This table contains only the user_ids that reached a conversion step, and the date when they converted. Tables like this can get hairy to validate when conversions can happen multiple times per user (e.g. e-commerce sales), or when the same user runs through multiple tests (e.g. distinct tests in subsequent months). This is a simple example that dodges a lot of those complications by maintaining only one test.
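
And a matching sketch for this table, again with assumed types. The uniqueness constraint encodes the simplification above: at most one conversion per user.

create table ab_test_conversions (
  user_id varchar not null -- type is an assumption
  , conversion_date date not null
  , unique (user_id) -- this simple example assumes one conversion per user
);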

Simple A/B Test Validations

Before I start digging into calculating conversion rates, I like to validate that my test data is correct. That means I don’t have orphaned users, my conversions happen after the test started, and of course that I’m collecting multiple converting users across my expected test versions. Here are some quick “unit testing” style queries that answer those questions.

Check for orphaned users

With a FULL OUTER JOIN between the entrants and converting users, how many users fall into the three possible buckets: entered but not converted, entered and converted, and converted but not entered?

select
  count(
    case
      when ab_test_entrants.user_id is not null
        and ab_test_conversions.user_id is null
      then 1
      else null
    end
  ) as users_entered_but_not_converted
  , count(
    case
      when ab_test_entrants.user_id is not null
        and ab_test_conversions.user_id is not null
      then 1
      else null
    end
  ) as users_entered_and_converted
  , count(
    case
      when ab_test_entrants.user_id is null
        and ab_test_conversions.user_id is not null
      then 1
      else null
    end
  ) as users_converted_and_not_entered
from
  [ab_test_entrants]
  full outer join [ab_test_conversions] using (user_id)

So far so good: 40 converted out of 200 total, with no orphaned converted users polluting my pool for analysis.

Check for conversions after test entries

With a LEFT JOIN starting from the A/B test entrants, what is the distribution of days it takes to get from entry to conversion? Hopefully I see no negative numbers, which would mean conversions logged before test entry, and a spread of values that tells me I’m looking at the right time frame for conversion.

select
  datediff('day', date(ab_test_entrant_date), date(conversion_date)) as days_from_entry_to_conversion
  , count(*) as num_entrants
from
  [ab_test_entrants]
  left join [ab_test_conversions] using (user_id)
group by
  1

Also so far so good! Looks like this sample data set took at most 7 days to get from entry to conversion.

Check for data populating in each bucket

Finally, do I have the expected number of versions, collecting conversions in each bucket?

select
  ab_test_version
  , count(ab_test_entrants.user_id) as ab_test_entrants
  , count(ab_test_conversions.user_id) as ab_test_conversions
from
  [ab_test_entrants]
  left join [ab_test_conversions] using (user_id)
group by
  1

Yep! Two versions, both starting to collect real data to analyze.

A/B Test Conversion Rate Calculations

The simplest way to compare versions in an A/B test like this is the conversion rate: the number of unique users that converted in a bucket, divided by the number of users entered into that same bucket. From the above graph, that’s the orange bar divided by the blue bar.

select
  ab_test_version
  , count(ab_test_conversions.user_id)::float /
    count(ab_test_entrants.user_id) as conversion_rate
from
  [ab_test_entrants]
  left join [ab_test_conversions] using (user_id)
group by
  1

Our Test version appears to be winning: 22% of users are converting, vs. 18% in the Control! Now, how do we know if that’s a statistically significant result?

Calculating A/B Test Confidence Intervals in SQL

I’m going to crib heavily from Periscope Data’s great blog post on the topic, and estimate a confidence interval around the initial results using just SQL. In simple terms, the more data you have, the tighter the confidence interval, and the more you should believe that the 4 percentage point difference above is a real thing, not a statistical accident.

with
  ab_test_conversions as (
    select
      ab_test_version
      , count(ab_test_entrants.user_id) as ab_test_entrants
      , count(ab_test_conversions.user_id) as ab_test_conversions
    from
      [ab_test_entrants]
      left join [ab_test_conversions] using (user_id)
    group by
      1
  )
  , ab_test_conversion_rates as (
    select
      ab_test_version
      , ab_test_entrants
      , ab_test_conversions
      , (ab_test_conversions + 1.92) / (ab_test_entrants + 3.84)::float as conversion_rate
    from
      ab_test_conversions
  )
  , conversion_rate_standard_error as (
    select
      ab_test_version
      , ab_test_entrants
      , ab_test_conversions
      , conversion_rate
      , sqrt(conversion_rate * (1 - conversion_rate) / ab_test_entrants) as standard_error
    from
      ab_test_conversion_rates
  )
select
  ab_test_version
  , ab_test_entrants
  , ab_test_conversions
  , conversion_rate - standard_error * 1.96 as conversion_rate_low
  , conversion_rate
  , conversion_rate + standard_error * 1.96 as conversion_rate_high
from
  conversion_rate_standard_error

Stepping through the above, I’m doing one operation per CTE:

  • Counting the same conversions and entrants by version, as in the previous conversion rate chart
  • Adjusting the numerator and denominator slightly before dividing, per the Adjusted Wald method: add z²/2 ≈ 1.92 to the conversions and z² ≈ 3.84 to the entrants, where z = 1.96
  • Calculating the standard error of that adjusted conversion rate
  • Padding the conversion rate on either side by 1.96 standard errors, to estimate the interval that should contain the true rate 95% of the time (worked through numerically just after this list)
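
To make the arithmetic concrete, here’s the same math for the Test bucket, assuming an even 100/100 entrant split (consistent with the 22% / 18% rates and 40 total conversions above):

select
  (22 + 1.92) / (100 + 3.84) as adjusted_conversion_rate -- ≈ 0.2304
  , sqrt(0.2304 * (1 - 0.2304) / 100) as standard_error -- ≈ 0.0421
  , 0.2304 - 1.96 * 0.0421 as conversion_rate_low -- ≈ 0.148
  , 0.2304 + 1.96 * 0.0421 as conversion_rate_high -- ≈ 0.313

With only around 100 entrants per version, the interval is wide, which is why the Test version’s lead isn’t yet conclusive.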

That will generate three conversion rate estimates per test version (low, middle, and high), which can be a lot to visualize. My preferred method in Periscope Data is the combo scatter / bar chart.

So now we can see that even though our Test version is winning at the moment, there’s a decent chance it could be due to variance. If the conversion_rate_low of the Test version continues to increase over time, and eventually rises above the conversion_rate_high of the Control version, we can say: “with 95% confidence, the Test variant is a winner”.
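
If you’d like the query to make that call for you, one possible extension (my own sketch, not part of the original query) is to wrap the final SELECT above in one more CTE and compare the two versions directly. This assumes the versions are stored as 'control' and 'test'; adjust the labels to match your data.

  , conversion_rate_intervals as (
    -- the final SELECT from the query above, reduced to the interval bounds
    select
      ab_test_version
      , conversion_rate - standard_error * 1.96 as conversion_rate_low
      , conversion_rate + standard_error * 1.96 as conversion_rate_high
    from
      conversion_rate_standard_error
  )
select
  test.conversion_rate_low > control.conversion_rate_high as test_wins_at_95_pct_confidence
from
  conversion_rate_intervals as control
  cross join conversion_rate_intervals as test
where
  control.ab_test_version = 'control'
  and test.ab_test_version = 'test'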

Cumulative A/B Test Conversion Rates Over Time

Another way I like to visualize the variance in these kinds of test results is through a cumulative conversion rate over time.

with
  ab_test_conversions_over_time as (
    select
      date(ab_test_entrant_date) as ab_test_entrant_date
      , ab_test_version
      , count(ab_test_entrants.user_id) as ab_test_entrants
      , count(ab_test_conversions.user_id) as ab_test_conversions
    from
      [ab_test_entrants]
      left join [ab_test_conversions] using (user_id)
    group by
      1
      , 2
  )
  , ab_test_conversions_cumulative as (
    select
      ab_test_entrant_date
      , ab_test_version
      , sum(ab_test_entrants) over(partition by ab_test_version order by ab_test_entrant_date rows unbounded preceding) as cumulative_entrants
      , sum(ab_test_conversions) over(partition by ab_test_version order by ab_test_entrant_date rows unbounded preceding) as cumulative_conversions
    from
      ab_test_conversions_over_time
  )
select
  ab_test_entrant_date
  , ab_test_version
  , cumulative_conversions::float / cumulative_entrants as cumulative_conversion_rate
from
  ab_test_conversions_cumulative

So starting from the same data set, but this time also grouping by date of entry, we can see how each day’s results affect the final outcome. Because tests like these tend to vary, with different winners on different days, I prefer to track the cumulative outcome rather than the daily conversion rate. It’s clearer which variant is trending towards winning, and easier to see the noise and variance as well.

The intermediary CTE ab_test_conversions_cumulative looks like this:

So you can already tell that the Test variant got a slower start out of the gate, but quickly caught up and overtook the Control variant. If you take the final step in the query and divide cumulative_conversions by cumulative_entrants, you get this nice-looking graph:

The highest-variance tests tend to cross over in cumulative results quite often. If instead you launch a test with a clear winner, it might take off early and never cross back over.

A subtle nuance to look out for here: because we group by entry date but conversions can happen much later, earlier dates on this graph can end up re-stated as more conversions appear. I don’t mind looking at charts that vary in the past, but for those less comfortable with that, there are ways to modify the query. My favorite is applying a “time limit” to the numerator, so that instead of tabulating all conversions, you only count conversions that happened within 7 days of entry. That forces you to wait at least 7 days before judging results to be final, but that delay can sometimes be a helpful bit of friction. Imagine if we had called this particular example test after only 2 days: we would have thought it was a failure!
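
Here’s a sketch of that time-limited numerator, modifying the first CTE of the cumulative query and reusing the datediff() expression from the validation step earlier (the 7-day window matches this sample data set, but is otherwise a judgment call):

select
  date(ab_test_entrant_date) as ab_test_entrant_date
  , ab_test_version
  , count(ab_test_entrants.user_id) as ab_test_entrants
  , count(
    case
      when datediff('day', date(ab_test_entrant_date), date(conversion_date)) <= 7
      then ab_test_conversions.user_id
    end
  ) as ab_test_conversions
from
  [ab_test_entrants]
  left join [ab_test_conversions] using (user_id)
group by
  1
  , 2

Swap that in for ab_test_conversions_over_time, and once a given entry date is more than 7 days old, its numbers are final and the history stops moving.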

Parameterized SQL to make reporting on new A/B tests a breeze

Possibly my favorite feature of Periscope Data is their Parameterized SQL Snippets. You can take arbitrarily complex SQL that you need to run multiple times with only slight changes, and turn those slight changes into variables to be plugged in. As an example, we can take the same code from above, add a WHERE clause in the first CTE, and populate it with [ab_test_name], a dynamic parameter defined in the name of our Parameterized SQL Snippet, ab_test_confidence_intervals(ab_test_name).

Parameterized SQL Snippet:
ab_test_confidence_intervals(ab_test_name)
with
  ab_test_conversions as (
    select
      ab_test_version
      , count(ab_test_entrants.user_id) as ab_test_entrants
      , count(ab_test_conversions.user_id) as ab_test_conversions
    from
      [ab_test_entrants]
      left join [ab_test_conversions] using (user_id)
    where
      ab_test_name = [ab_test_name] -- dynamic SQL Snippet parameter
    group by
      1
  )
  , ab_test_conversion_rates as (
    select
      ab_test_version
      , ab_test_entrants
      , ab_test_conversions
      , (ab_test_conversions + 1.92) / (ab_test_entrants + 3.84)::float as conversion_rate
    from
      ab_test_conversions
  )
  , conversion_rate_standard_error as (
    select
      ab_test_version
      , ab_test_entrants
      , ab_test_conversions
      , conversion_rate
      , sqrt(conversion_rate * (1 - conversion_rate) / ab_test_entrants) as standard_error
    from
      ab_test_conversion_rates
  )
select
  ab_test_version
  , ab_test_entrants
  , ab_test_conversions
  , conversion_rate - standard_error * 1.96 as conversion_rate_low
  , conversion_rate
  , conversion_rate + standard_error * 1.96 as conversion_rate_high
from
  conversion_rate_standard_error

After saving that SQL Snippet, I can create a simple chart in Periscope Data with the single line of code below:

[ab_test_confidence_intervals('new_funnel_test_2017-12-01')]

Need a new test’s results visualized? No problem! Drop a new ab_test_name into the snippet, and you’ll get those results instead.
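
For example, charting a hypothetical future test (the name here is made up) would be just one more line:

[ab_test_confidence_intervals('checkout_redesign_2018-01-15')]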

Final thoughts

Not everyone has the pleasure of working directly with A/B test data in a relational database (or possibly the headaches of maintaining that data!). But for those that do, it provides ample ground to experiment with visualizations and transformations like the above examples. I’ve found SQL to be an incredibly powerful and customizable tool for exploring this kind of data, and I hope this helps you get started with it too!
