Projecting Customer Lifetime Value in SQL Using Exponential Decay

Ryan Iyengar · May 16, 2016

Customer lifetime value is a really useful metric across a business, but unfortunately (or fortunately!) for a new business, most of its customers are still early in their lifetimes. So getting a sense of the future lifetime revenue stream from a given customer group is powerful. Predicting it with great accuracy isn’t necessarily the goal; estimating it quickly and repeatably is. That lets you keep an eye on this projection as you make changes in your new company, and estimate their long-term impacts using a consistent model. SQL is a great tool for this kind of consistent projection, especially if you’ve already aggregated customer revenue data in a database.

I’ll be using Periscope Data to do all of this analysis, and I’ll use some of their SQL shorthand to speed up my query writing. It’s marked by statements wrapped in brackets, like [date:month] instead of date_trunc('month', (date)::timestamp)::date.

Example Customer Data

Referring back to my previous post on Uber, I’ve already derived a usable data set. I looked around for other sample data, but it turns out not many people have generated plausible SaaS transaction-level or customer-level data. My Uber post used a very simplistic LTV calculation; here I’ll try to define one in SQL and see if my results match.

The data I’ve generated is already aggregated up to monthly cohorts, which I’ve loaded into Periscope as a .CSV, [expected_uber_cohorts].

with
uber_cohorts as (
  select
    current_month
    , cohort_month
    , datediff('month', cohort_month, current_month) + 1 as months_since
    , revenue_share
    , drivers
  from
    [expected_uber_cohorts]
  where
    current_month <= '2014-11-01'
)
Estimated Uber’s revenue share, split by the month the driver joined

Zooming in on one example cohort, we can see the revenue follows a roughly exponential decay pattern.

Same as above, but for April 2014

Given that, I can calculate a single decay constant that describes this curve, and then use it to project future values of this curve.

Future Months

SQL is great at transforming data that’s already there, but not great at generating new data. PostgreSQL has a couple of generate_series() functions, but on Redshift, the flavor I’m using, those don’t work well. Thankfully, I’ve got a table full of future dates ready to go, called all_time, so I’ll use it to generate a bunch of future months to project against.
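If you don’t already have a calendar table like all_time, one way to build a simple date spine on Redshift is to fan row numbers out of any sufficiently large existing table. This is just a minimal sketch, not part of the original setup: some_big_table is a placeholder, and it assumes all_time only needs a single date column.

create table all_time as
select
  dateadd(day, row_number() over () - 1, '2009-01-01'::date)::date as date
from
  (select 1 from some_big_table limit 3650) nums  -- roughly ten years of daily dates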

First, I’ll need all the distinct cohort months represented in my data set, and the latest of them.

, uber_cohort_months as (
  select distinct
    cohort_month
  from
    uber_cohorts
)
, latest_cohort_month as (
  select
    max(cohort_month) as latest_cohort_month
  from
    uber_cohort_months
)

Next, I’ll take all of the distinct months after the latest cohort month, and within 4 years of it, a reasonable LTV time frame. I’m cross-joining in a table with a single value so I can filter against it in my where clause. This outcome is doable a couple of different ways; a subquery would also work here (sketched just after this snippet), but I prefer this style for readability.

, future_months as (
  select distinct
    [date:month] as current_month
  from
    all_time
  join latest_cohort_month on
    1 = 1
  where
    [date:month] > latest_cohort_month
    and [date:month] < [latest_cohort_month + (365 * 4):month]
)
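For comparison, the subquery alternative I mentioned would look something like this. It’s just a sketch: it uses Redshift’s dateadd() for the upper bound instead of the Periscope shorthand, and skips the month truncation.

, future_months as (
  select distinct
    [date:month] as current_month
  from
    all_time
  where
    [date:month] > (select latest_cohort_month from latest_cohort_month)
    and [date:month] < (select dateadd(day, 365 * 4, latest_cohort_month) from latest_cohort_month)
)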

Now that I’ve got existing cohorts and future months, I can permute them against one another with another cross join.

, cohorts_with_future_months as (
  select
    uber_cohort_months.cohort_month
    , future_months.current_month
    , datediff('month', uber_cohort_months.cohort_month, future_months.current_month) + 1 as months_since
    , row_number() over(partition by cohort_month order by current_month) as months_in_future
  from
    uber_cohort_months
  join future_months on
    1 = 1
)
Result of above permutation of future months

I take it for granted that every cohort_month has a current_month data point for every month up to the last cohort collected, in this case 11/2014. Permuting all past cohorts against future months gives me a table of future months for every cohort, alongside each month’s distance from that cohort’s latest data point.
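If you’re less sure your data is that complete, here’s a quick sanity check (my addition, separate from the main query) that flags any cohort with missing months:

select
  cohort_month
  , count(*) as observed_rows
  , datediff('month', cohort_month, max(current_month)) + 1 as expected_rows
from
  uber_cohorts
group by
  1
having
  count(*) <> datediff('month', cohort_month, max(current_month)) + 1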

Decay Constant Calculations

Now for the real meat of this query. I’ll use a lag() window function to calculate the exponential decay constant that would have produced the delta between two consecutive months’ revenue.

, decay_constants as (
  select
    cohort_month
    , current_month
    , months_since
    , row_number() over(partition by cohort_month order by current_month desc) as months_ago
    , revenue_share
    , lag(revenue_share, 1) over(partition by cohort_month order by current_month) as lag_revenue_share
    , ln(revenue_share / lag(revenue_share, 1) over(partition by cohort_month order by current_month)) as decay_constant
  from
    uber_cohorts
)

Quick refresher on formulas from the exponential decay wiki:

Exponential decay constant: λ = −ln(N(t) / N₀) / t

Result of exponential decay: N(t) = N₀ · e^(−λt)

We already have 3 variables, and want to solve for a 4th:

  • N(t): Revenue this month
  • N₀: Revenue the previous month
  • t: 1, since we’re only calculating a delta of 1 month

To solve for lambda, we’ll divide this month’s revenue by last month’s, take the natural log of the result, and then divide by the number of month periods that have passed. Since t is 1, we can skip the last division.

Results of decay_constants CTE above

So for the 3rd row, we calculate ln(3,964,152 / 4,546,282), resulting in -0.1370. I’m leaving the negative sign in the result, since I’ll turn right around and include it when I re-derive future exponential decays.
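You can verify that arithmetic directly:

select ln(3964152.0 / 4546282.0) as decay_constant  -- ≈ -0.1370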

Projection Types

There are a couple of options for how to use these kinds of constants, depending on the type of data we have in front of us.

  • Average the few most recent constants per cohort, and decay the last known revenue forward from that point using that slope
  • Treat each cohort as homogeneous, and use a unique constant per month of cohort life, averaging all previously observed cohort decays at that month index. For example, if past cohorts decayed around -0.137 from month 2 to month 3, an unknown month 3 will also decay -0.137 from its known month 2
  • Same as the previous, but instead of decaying off of the previous month, decay from the initial revenue amount, or another earlier month

For now, we’ll go with the first option, but it’s fairly simple to adapt this methodology to any of the above.
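For instance, a minimal sketch of the second option might pool the observed decays by month index (decay_by_month_index is my name for it, not part of the main query), which you would then join against future months on their month index:

, decay_by_month_index as (
  select
    months_since
    , avg(decay_constant) as avg_decay_constant
  from
    decay_constants
  group by
    1
)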

, recent_avg_decay as (
  select
    cohort_month
    , avg(decay_constant) as recent_avg_decay
  from
    decay_constants
  where
    months_ago <= 3
    and months_since >= 4
  group by
    1
)

Given we’ve already calculated individual month-level decays, it’s trivial to average the most recent 3 months’ worth. Also, most cohorts show steep revenue decay in their first few months, so I’m limiting my data collection to after those rocky first 3 months. This means I won’t have reliable projections for cohorts begun within the past 3 months, but given the variable steepness of those early months, I think exponential decay isn’t a great modeling tactic for them anyway.

One more piece: we’ll need the latest revenue from a given cohort to start decaying off of.

, latest_cohort_revenue as (
  select
    cohort_month
    , revenue_share
  from
    uber_cohorts
  join latest_cohort_month on
    latest_cohort_month.latest_cohort_month = uber_cohorts.current_month
)

We’ve already calculated the latest cohort month, so we can join it into the historical cohort data as a filter on current_month. That picks out the latest data point for each individual cohort.

Projections

Now the fun part! We get to stack all the building blocks we’ve built up to this point into a revenue projection.

, revenue_projection as (
  select
    cohort_month
    , current_month
    , months_since
    , months_in_future
    , revenue_share as latest_revenue_share
    , recent_avg_decay
    , revenue_share * exp(recent_avg_decay * months_in_future) as projected_revenue_share
  from
    cohorts_with_future_months
  left join recent_avg_decay using (cohort_month)
  left join latest_cohort_revenue using (cohort_month)
)

We’ll start from our permutation table, where every cohort’s future months are already laid out and indexed against a month number. Then we join in each cohort’s two necessary data points: the decay constant averaged across recent months, and the latest revenue data point to decay off of. Finally, we pull back our old friend, the exponential decay formula:

N(t) = N₀ · e^(−λt)

This time we’re solving for N(t), so we multiply the last known revenue by e raised to the power of our recent constant times the number of month periods in the future.
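To make the arithmetic concrete, here’s a hypothetical example (numbers for illustration only) of a cohort with 3,964,152 in latest revenue decaying at a constant of -0.137:

select
  3964152 * exp(-0.137 * 1) as one_month_out   -- ≈ 3.46M
  , 3964152 * exp(-0.137 * 12) as one_year_out -- ≈ 0.77M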

Calculating 4/2014’s cohort’s projected revenue, post 11/2014

So now we have projected revenue for future cohort months, and actual revenue for cohort months. Let’s mash them together!

, actuals as (
  select
    cohort_month
    , current_month
    , 'Actual' as revenue_type
    , months_since
    , revenue_share
  from
    uber_cohorts
)
, projections as (
  select
    cohort_month
    , current_month
    , 'Projection' as revenue_type
    , months_since
    , projected_revenue_share as revenue_share
  from
    revenue_projection
)
, projected_cohorts as (
  select
    *
  from
    actuals
  union all
  select
    *
  from
    projections
)

And that’s it! We now have a master table called projected_cohorts that unions our actual data together with our projected data. Here’s a sample query from that table:

select
  current_month
  , revenue_type
  , revenue_share
from
  projected_cohorts
where
  cohort_month = '2014-04-01'
Result of the decay curve projection + actuals

Not bad! It’s tough to say whether or not future revenue will truly approach 0 in 2018. If Uber had 4-year-old cohorts worth analyzing, those might tell us, but as with most growth companies, the present is too different from the past to make strong predictions about the future.

Because of our method of projection, every cohort’s timeline now extends at least 48 months past its start, and lots of older cohorts got extra months tacked on beyond that. So if we want 4-year LTVs, we’ll have to chop all those extra projections off. We also didn’t project anything for cohorts that started after August 2014, due to their noisier early data.

select
  months_since
  , cohort_month
  , revenue_type
  , revenue_share
from
  projected_cohorts
where
  cohort_month <= '2014-08-01'
  and months_since <= 48
All cohorts 48 month projected revenue retention, indexed by month number

Lifetime Value

The sum of those lines is the total lifetime revenue of that cohort. Lifetime value is a per-user measure, though, so we’ll need each cohort’s starting driver count to divide by to find LTV.

, initial_drivers as (
  select
    cohort_month
    , drivers as initial_drivers
  from
    uber_cohorts
  where
    months_since = 1
)
, total_cohort_revenue as (
  select
    cohort_month
    , sum(revenue_share) as total_revenue
  from
    projected_cohorts
  where
    cohort_month <= '2014-08-01'
    and months_since <= 48
  group by
    1
)
select
  cohort_month
  , total_revenue
  , initial_drivers
  , total_revenue::float / initial_drivers as ltv
from
  total_cohort_revenue
join initial_drivers using (cohort_month)
Result of above snippet, projected lifetime revenue per initial driver acquired

Meta Analysis

In my previous analysis, I used a simple churn estimate of 13%, resulting in an estimated LTV of $2.8k. In this analysis, I accounted for the variability of churn by the month in which it happened, which allowed churn to slow down over time. As a result, most of the future was projected at roughly 9% churn instead, yielding an estimated LTV of $3.6k.
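The two framings connect directly: a monthly decay constant k implies monthly retention of e^k, so monthly churn is 1 - e^k. A quick check of both directions:

select
  1 - exp(-0.137) as churn_at_k_minus_0137  -- ≈ 0.128, i.e. roughly 13% churn
  , ln(1 - 0.09) as k_for_9_pct_churn       -- ≈ -0.0943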

The second is much more accurate than the first, but there’s still some room to go to make this a fully fledged model. Potential additions that we could tack onto this basic framework:

  • Segment drivers into types, UberX vs. Uber Black
  • Segment drivers into activity buckets, e.g. <8 hrs/week vs. >30 hrs/week
  • Account for the seasonal effect of ridership on driver revenue (holidays are far more active than normal days)
  • Move beyond gross revenue share, and bring in Surge/Pool/Promotions to account for margin, turning this into a profit LTV calculation

Takeaways

If your customer data is already in an SQL database, and seems to follow an exponential decay pattern, you can apply this framework to quickly project future data points without porting your data into another piece of software. Overview of the main points:

  • Generate future months / cohort segments permutations
  • Choose a projection method (indexed from start, indexed from recent)
  • Calculate decay constants
  • Join in those decay constants against future months permutations
  • Use decay constants to calculate future revenue using month indices

Good luck!
