Projecting Customer Lifetime Value in SQL Using Exponential Decay

Ryan Iyengar · May 16, 2016

Customer lifetime value is a really useful metric across a business, but unfortunately (or fortunately!) for a new business, most of its customers are still early in their lifetimes. So getting a sense of the future lifetime revenue stream from a given customer group is powerful. Predicting it with great accuracy isn’t necessarily the goal; estimating it quickly and repeatably is. That lets you keep an eye on this projection as you make changes in your new company, and estimate their long-term impacts using a consistent model. SQL is a great tool for this kind of consistent projection, especially if you’ve already aggregated customer revenue data in a database.

I’ll be using Periscope Data to do all of this analysis, and I’ll use some of their SQL shorthand to speed up my query writing. It’s marked by statements wrapped in brackets, like [date:month] instead of date_trunc('month', (date)::timestamp)::date.

Example Customer Data

Referring back to my previous post on Uber, I’ve already derived a usable data set. I looked around for other sample data, but it turns out not many people have generated plausible SaaS transaction-level or customer-level data. My Uber post used a very simplistic LTV calculation; here I’ll try to define one in SQL and see if my results match.

The data I’ve generated is already aggregated up to monthly cohorts, which I’ve loaded into Periscope as a .CSV, [expected_uber_cohorts].

with
uber_cohorts as (
  select
    current_month
    , cohort_month
    , datediff('month', cohort_month, current_month) + 1 as months_since
    , revenue_share
    , drivers
  from
    [expected_uber_cohorts]
  where
    current_month <= '2014-11-01'
)
Estimated Uber’s revenue share, split by the month the driver joined

Zooming in on one example cohort, we can see the revenue follows a roughly exponential decay pattern.

Same as above, but for April 2014

Given that, I can calculate a single decay constant that describes this curve, and then use it to project future values of this curve.

Future Months

SQL is great at transforming data that’s already there, but not great at generating new data. PostgreSQL has a couple of generate_series() functions, but on Redshift, the flavor I’m using, those don’t work well. Thankfully, I’ve got a table full of future dates ready to go, called all_time, so I’ll use it to generate a bunch of future months to project against.
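If you don’t already have a calendar table like all_time, one way to build a simple date spine on Redshift is to fan row numbers out of any sufficiently large existing table. This is just a minimal sketch, not part of the original setup: some_big_table is a placeholder, and it assumes all_time only needs a single date column.

create table all_time as
select
  dateadd(day, row_number() over () - 1, '2009-01-01'::date)::date as date
from
  (select 1 from some_big_table limit 3650) nums  -- roughly ten years of daily dates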

First, I’ll need all the distinct cohort months represented in my data set, and the latest of them.

, uber_cohort_months as (
  select distinct
    cohort_month
  from
    uber_cohorts
)
, latest_cohort_month as (
  select
    max(cohort_month) as latest_cohort_month
  from
    uber_cohort_months
)

Next, I’ll take all of the distinct months after the latest cohort month, and within 4 years of it, a reasonable LTV time frame. I’m cross-joining in a table with a single value so I can filter against it in my where clause. This outcome is doable a couple of different ways; a subquery would also work here (sketched just after this snippet), but I prefer this style for readability.

, future_months as (
  select distinct
    [date:month] as current_month
  from
    all_time
  join latest_cohort_month on
    1 = 1
  where
    [date:month] > latest_cohort_month
    and [date:month] < [latest_cohort_month + (365 * 4):month]
)
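For comparison, the subquery alternative I mentioned would look something like this. It’s just a sketch: it uses Redshift’s dateadd() for the upper bound instead of the Periscope shorthand, and skips the month truncation.

, future_months as (
  select distinct
    [date:month] as current_month
  from
    all_time
  where
    [date:month] > (select latest_cohort_month from latest_cohort_month)
    and [date:month] < (select dateadd(day, 365 * 4, latest_cohort_month) from latest_cohort_month)
)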

Now that I’ve got existing cohorts and future months, I can permute them against one another with another cross join.

, cohorts_with_future_months as (
  select
    uber_cohort_months.cohort_month
    , future_months.current_month
    , datediff('month', uber_cohort_months.cohort_month, future_months.current_month) + 1 as months_since
    , row_number() over(partition by cohort_month order by current_month) as months_in_future
  from
    uber_cohort_months
  join future_months on
    1 = 1
)
Result of above permutation of future months

I take it for granted that every cohort_month has a current_month data point for every month up to the last cohort collected, in this case 11/2014. Permuting all past cohorts against future months gives me a table of future months for every cohort, alongside each month’s distance from that cohort’s latest data point.
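If you’re less sure your data is that complete, here’s a quick sanity check (my addition, separate from the main query) that flags any cohort with missing months:

select
  cohort_month
  , count(*) as observed_rows
  , datediff('month', cohort_month, max(current_month)) + 1 as expected_rows
from
  uber_cohorts
group by
  1
having
  count(*) <> datediff('month', cohort_month, max(current_month)) + 1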

Decay Constant Calculations

Now for the real meat of this query. I’ll use a lag() window function to calculate the exponential decay constant that would have produced the delta between two consecutive months’ revenue.

, decay_constants as (
  select
    cohort_month
    , current_month
    , months_since
    , row_number() over(partition by cohort_month order by current_month desc) as months_ago
    , revenue_share
    , lag(revenue_share, 1) over(partition by cohort_month order by current_month) as lag_revenue_share
    , ln(revenue_share / lag(revenue_share, 1) over(partition by cohort_month order by current_month)) as decay_constant
  from
    uber_cohorts
)

Quick refresher on formulas from the exponential decay wiki:

Exponential decay constant: λ = −ln(N(t) / N₀) / t

Result of exponential decay: N(t) = N₀ · e^(−λt)

We already have 3 variables, and want to solve for a 4th:

  • N(t): Revenue this month
  • N₀: Revenue the previous month
  • t: 1, since we’re only calculating a delta of 1 month

To solve for lambda, we’ll divide this month’s revenue by last month’s, take the natural log of the result, and then divide by the number of month periods that have passed. Since t is 1, we can skip the last division.

Results of decay_constants CTE above

So for the 3rd row, we calculate ln(3,964,152 / 4,546,282), resulting in -0.1370. I’m leaving the negative sign in the result, since I’ll turn right around and include it when I re-derive future exponential decays.
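You can verify that arithmetic directly:

select ln(3964152.0 / 4546282.0) as decay_constant  -- ≈ -0.1370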

Projection Types

There are a couple of options for how to use these kinds of constants, depending on the type of data we have in front of us.

  • Average the few most recent constants per cohort, and decay the last known revenue forward from that point using that slope
  • Treat each cohort as homogeneous, and use a unique constant per month of cohort life, averaging all previously observed cohort decays at that month index. For example, if past cohorts decayed around -0.137 from month 2 to month 3, an unknown month 3 will also decay -0.137 from its known month 2
  • Same as the previous, but instead of decaying off of the previous month, decay from the initial revenue amount, or another earlier month

For now, we’ll go with the first option, but it’s fairly simple to adapt this methodology to any of the above.
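For instance, a minimal sketch of the second option might pool the observed decays by month index (decay_by_month_index is my name for it, not part of the main query), which you would then join against future months on their month index:

, decay_by_month_index as (
  select
    months_since
    , avg(decay_constant) as avg_decay_constant
  from
    decay_constants
  group by
    1
)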

, recent_avg_decay as (
  select
    cohort_month
    , avg(decay_constant) as recent_avg_decay
  from
    decay_constants
  where
    months_ago <= 3
    and months_since >= 4
  group by
    1
)

Given we’ve already calculated individual month-level decays, it’s trivial to average the most recent 3 months’ worth. Also, most cohorts show steep revenue decay in their first few months, so I’m limiting my data collection to after those rocky first 3 months. This means I won’t have reliable projections for cohorts begun within the past 3 months, but given the variable steepness of those early months, I think exponential decay isn’t a great modeling tactic for them anyway.

One more piece: we’ll need the latest revenue from a given cohort to start decaying off of.

, latest_cohort_revenue as (
  select
    cohort_month
    , revenue_share
  from
    uber_cohorts
  join latest_cohort_month on
    latest_cohort_month.latest_cohort_month = uber_cohorts.current_month
)

We’ve already calculated the latest cohort month, so we can join it into the historical cohort data as a filter on current_month. That picks out the latest data point for each individual cohort.

Projections

Now the fun part! We get to stack all the building blocks we’ve built up to this point into a revenue projection.

, revenue_projection as (
  select
    cohort_month
    , current_month
    , months_since
    , months_in_future
    , revenue_share as latest_revenue_share
    , recent_avg_decay
    , revenue_share * exp(recent_avg_decay * months_in_future) as projected_revenue_share
  from
    cohorts_with_future_months
  left join recent_avg_decay using (cohort_month)
  left join latest_cohort_revenue using (cohort_month)
)

We’ll start from our permutation table, where every cohort’s future months are already laid out and indexed against a month number. Then we join in each cohort’s two necessary data points: the decay constant averaged across recent months, and the latest revenue data point to decay off of. Finally, we pull back our old friend, the exponential decay formula:

N(t) = N₀ · e^(−λt)

This time we’re solving for N(t), so we multiply the last known revenue by e raised to the power of our recent constant times the number of month periods in the future.
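To make the arithmetic concrete, here’s a hypothetical example (numbers for illustration only) of a cohort with 3,964,152 in latest revenue decaying at a constant of -0.137:

select
  3964152 * exp(-0.137 * 1) as one_month_out   -- ≈ 3.46M
  , 3964152 * exp(-0.137 * 12) as one_year_out -- ≈ 0.77M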

Calculating 4/2014’s cohort’s projected revenue, post 11/2014

So now we have projected revenue for future cohort months, and actual revenue for cohort months. Let’s mash them together!

, actuals as (
  select
    cohort_month
    , current_month
    , 'Actual' as revenue_type
    , months_since
    , revenue_share
  from
    uber_cohorts
)
, projections as (
  select
    cohort_month
    , current_month
    , 'Projection' as revenue_type
    , months_since
    , projected_revenue_share as revenue_share
  from
    revenue_projection
)
, projected_cohorts as (
  select
    *
  from
    actuals
  union all
  select
    *
  from
    projections
)

And that’s it! We now have a master table called projected_cohorts that unions our actual data together with our projected data. Here’s a sample query from that table:

select
  current_month
  , revenue_type
  , revenue_share
from
  projected_cohorts
where
  cohort_month = '2014-04-01'
Result of the decay curve projection + actuals

Not bad! It’s tough to say whether or not future revenue will truly approach 0 in 2018. If Uber had 4-year-old cohorts worth analyzing, those might tell us, but as with most growth companies, the present is too different from the past to make strong predictions about the future.

Because of our method of projection, every cohort’s timeline now extends at least 48 months past its start, and lots of older cohorts got extra months tacked on beyond that. So if we want 4-year LTVs, we’ll have to chop all those extra projections off. We also didn’t project anything for cohorts that started after August 2014, due to their noisier early data.

select
  months_since
  , cohort_month
  , revenue_type
  , revenue_share
from
  projected_cohorts
where
  cohort_month <= '2014-08-01'
  and months_since <= 48
All cohorts 48 month projected revenue retention, indexed by month number

Lifetime Value

The sum of those lines is the total lifetime revenue of that cohort. Lifetime value is a per-user measure, though, so we’ll need each cohort’s starting driver count to divide by to find LTV.

, initial_drivers as (
  select
    cohort_month
    , drivers as initial_drivers
  from
    uber_cohorts
  where
    months_since = 1
)
, total_cohort_revenue as (
  select
    cohort_month
    , sum(revenue_share) as total_revenue
  from
    projected_cohorts
  where
    cohort_month <= '2014-08-01'
    and months_since <= 48
  group by
    1
)
select
  cohort_month
  , total_revenue
  , initial_drivers
  , total_revenue::float / initial_drivers as ltv
from
  total_cohort_revenue
join initial_drivers using (cohort_month)
Result of above snippet, projected lifetime revenue per initial driver acquired

Meta Analysis

In my previous analysis, I used a simple churn estimate of 13%, resulting in an estimated LTV of $2.8k. In this analysis, I accounted for the variability of churn by the month in which it happened, which allowed churn to slow down over time. As a result, most of the future was projected at roughly 9% churn instead, yielding an estimated LTV of $3.6k.
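The two framings connect directly: a monthly decay constant k implies monthly retention of e^k, so monthly churn is 1 - e^k. A quick check of both directions:

select
  1 - exp(-0.137) as churn_at_k_minus_0137  -- ≈ 0.128, i.e. roughly 13% churn
  , ln(1 - 0.09) as k_for_9_pct_churn       -- ≈ -0.0943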

The second is much more accurate than the first, but there’s still some room to go to make this a fully fledged model. Potential additions that we could tack onto this basic framework:

  • Segment drivers into types, UberX vs. Uber Black
  • Segment drivers into activity buckets, e.g. <8 hrs/week vs. >30 hrs/week
  • Account for the seasonal effect of ridership on driver revenue (holidays are far more active than normal days)
  • Move beyond gross revenue share, and bring in Surge/Pool/Promotions to account for margin, turning this into a profit LTV calculation

Takeaways

If your customer data is already in an SQL database, and seems to follow an exponential decay pattern, you can apply this framework to quickly project future data points without porting your data into another piece of software. Overview of the main points:

  • Generate future months / cohort segments permutations
  • Choose a projection method (indexed from start, indexed from recent)
  • Calculate decay constants
  • Join in those decay constants against future months permutations
  • Use decay constants to calculate future revenue using month indices

Good luck!
