Published on

February 16, 2022

CLV is an important metric to track for two reasons. First, the totality of a company's CLV over its entire customer base gives a rough idea of its market value. Thus, a company with a high total CLV will appear attractive to investors. Second, CLV analysis can guide the formulation of customer acquisition and retention strategies. For example, special attention could be given to high-value customers to ensure that they stay loyal to the company.

Many CLV models have been developed with different levels of sophistication and accuracy, ranging from rough heuristics to the use of complex probabilistic frameworks. In this blog post series, we delve into one of them: **the Beta Geometric Negative Binomial Distribution (BG-NBD) model**. This model, developed by ** Fader, Hardie, and Lee in 2005**, has been one of the most influential models in the domain, thanks to its interpretability and accuracy.

This series consists of three blog posts that build on top of one another. Our game plan is as follows. In Part 1, (this blog post), we'll achieve an ELI-5 understanding of the model and its assumptions. Afterwards, we'll see how these assumptions can be modeled by probability distributions.

In Part 2, we'll look into the Python library ** lifetimes** that allow us to conveniently fit a BG-NBD model to a dataset in a sci-kit-learn-like fashion and almost immediately receive the maximum likelihood estimate of the model's parameters. We'll also explore the various downstream analyses that

In Part 3, we'll look at an alternative way to implement the BG-NBD model, this time from a Bayesian perspective. We’ll see how the Bayesian hierarchical BG-NBD model allows us to inject our prior intuition of customer behavior into the model. To this end, we will be using the Python library ** PyMC3**.

Before digging deeper into the mathematics of BG-NBD, we need to understand what it can and cannot do. There are two major limitations to keep in mind:

- The model is only applicable to non-contractual, continuous purchases.
- The model only tackles one component of CLV calculation, which is the prediction of the number of purchases.

Let's learn about these limitations in greater detail.

Depending on the relationship between the sellers and the buyers, a business can either be a contractual business or a non-contractual business.

- A contractual business, as its name suggests, is one where the buyer-seller relationship is governed by contracts. When either party no longer wants to continue this relationship, the contract is terminated. Thanks to the contract, there is no ambiguity as to whether or not someone is, at a given point, a customer of the business.
- In a non-contractual business, on the other hand, purchases are made on a per-need basis without any contract.

We can further distinguish between continuous and discrete settings:

- In a continuous setting, purchases can occur at any given moment. The majority of purchase situations (e.g. grocery purchases) fall under this category.
- In a discrete setting, purchases usually occur periodically with some degree of regularity. An example of this is weekly magazine purchases.

**The BG-NBD model tackles the non-contractual, continuous situation, which is the most common yet most challenging of the four to analyze.** Under such a setting, customer attrition is not explicitly observable and can happen anytime. This makes it harder to differentiate between customers who have indefinitely churned and those who will return in the future. As we will see later, the BG-NBD model is capable of assigning probabilities to each of these two options.

A customer's CLV for a given period can be calculated by multiplying two numbers:

- Their predicted number of transactions within this period.
- The average value of each purchase.

Usually, these two components are tackled and modeled separately. The BG-NBD model addresses the first – predicting the number of transactions, which in many regards is the more difficult of the two.

The second component, the expected value of the purchases, can be found either by using simple heuristics, such as taking the average of all past purchases, or by a more sophisticated probabilistic model, such as the ** Gamma-Gamma model** (which was also created by the authors of BG-NBD).

We've so far framed our discussion around CLV calculation, which is what BG-NBD was initially intended for, However, BG-NBD is more versatile than that. It can in fact be used to model any phenomenon that involves different "users" making repeated "transactions" and predict how many future transactions will be made by those users if they are still "active".

For example:

- Predicting the future usage frequency of a mobile app by analyzing users' usage history.
- Predicting if your website users will return to your website.
- Predicting if your distant relative who used to call you periodically is still alive, literally, by analyzing their call pattern.
- Predicting if your Tinder dates have become disinterested in you by looking at their texting frequency.

Before going into the mathematics of the model, let's try to understand how the model works on a conceptual level.

Let's imagine the following scenario. The date is 31 December 2021 and you're the manager of a cake shop. You’ve carefully kept track of all transactions that happened this year and you want to predict how many transactions you can expect your customers to make in 2022.

You also happen to be a skilled data scientist and you plan to achieve this prediction by fitting a model to your data. This model should be able to describe your customers' purchasing behavior in an interpretable way.

There are some assumptions you could consider when developing the model.

You've noted that some people buy cakes every day and some every weekend. Some only buy them on special occasions that take place every six months. Your model will need a way to assign a different purchasing rate to each customer.

In the fiercely competitive cake business, loyalty isn't guaranteed. At any point, your customer can leave your business for another one. Let’s refer to this departure as the "deactivation" of a previously active customer. To conveniently model a deactivation, we can assume that it can only happen after a successful purchase. That is to say, after every purchase in your store, a customer will decide whether they'll continue buying at your shop or to move to another one. Deactivation happens when the customer decides to move.

We'll assume that deactivation is both **permanent**and **latent**. **Permanent,** because once a customer decides to churn, they will never return. **Latent,** because they won't explicitly come up to you to let you know that they will no longer be your customer.

With these assumptions in place, let's consider the following scenario, where we have two customers, A and B. Each of them has made some transactions in 2021, and each transaction is indicated by a red dot.

Can we tell which customers have deactivated and which ones will still frequent your store and contribute to your future revenue?

The answer is yes – to a certain extent. For example, looking at A's pattern, we see that he used to shop pretty frequently, but it's been a while since you've seen him. Because his inter-transaction time is so much shorter than the time that's passed since his last transaction, it is quite likely that A has been deactivated.

On the other hand, B is an infrequent shopper and her most recent purchase is not that long ago compared to her average between-purchase period. It is pretty likely that she'll come back.

Now let's develop these assumptions into a more sophisticated model

Traditionally, CLV was calculated using a simple function of the past data. For example, we can estimate the value of future transactions by taking a fixed fraction of the value of past transactions. Such a calculation, unsurprisingly, is simplistic, unreliable, and uninterpretable.

The BG-NBD model, on the other hand, is a probabilistic model. In a probabilistic model, we assume that our observations (i.e. the transactions) are generated by a physical process that we can model using probability distributions. Our task is to estimate the parameters that best explain our existing observations. One commonly used option is to find maximum likelihood estimators of these parameters. We can then use these estimated parameters to perform future predictions.

With that introduction, now let's convert the assumptions qualitatively described into a solid probabilistic framework.

First, let's focus on the repeat purchasing behaviors of active customers. We can assume that as long as a customer is still active, their transactions follow a ** Poisson process** with a constant purchase rate 𝜆. With this assumption, we can model the time-to-next-purchase Δt as an exponential distribution parameterized by 𝜆. The pdf of this distribution is as follows:

Each active customer will have their own exponential distribution that we can use to predict the probability of the time of the next purchase.

The graph shows the pdf of two exponential distributions associated with two customers. The first customer (the blue curve) generally buys a cake every day (their purchasing rate 𝜆 is 1 cake/day). The probability that their next purchase takes place within one day of their current purchase can be found by taking the area under the blue curve between 0 and 1 and calculating to 0.63

The second customer only buys a cake every week (their 𝜆 is 1/7 cake/day). After performing the same integration, we see that it is much less probable (P = 0.13) that their next purchase will happen sometime before tomorrow.

It is useful to think that all these customers with their differing 𝜆's, contribute to a store-wide 𝜆 distribution. Our task now is to model this 𝜆 distribution. In doing so, we'll need to comply with the following requirements:

- The distribution should preferably be one that is well-studied.
- Since 𝜆 can only take values in positive real numbers, the chosen distribution must only have positive values.
- The distribution needs to be flexible enough to model different customer bases with different purchasing behaviors.

The Gamma distribution ticks all those boxes; this is the one used in BD-NBD to model 𝜆. It is parameterized by the shape parameter r and the scale parameter α; different combinations of these two parameters result in the gamma distribution taking distinct shapes. Here is the pdf of the distribution:

It is important to note that this Gamma distribution is not just some theoretical mumbo-jumbo. A specific Gamma distribution is in fact a quantitative description of the collective purchasing behavior of a specific customer base that carries business implications.

For example, the blue line shows a downward sloping, left-leaning Gamma distribution that results from setting both lambda and alpha to equal 1. If this distribution corresponded to my customer base, I wouldn't be too happy – the heavy left skew means that the bulk of my customers have a purchasing rate that is close to zero. That is, they barely purchase any cake!

Another Gamma distribution is shown in orange. This is a healthier distribution in which the 𝜆 peaks around 2, which means that a considerable chunk of my population buys two cakes per day. Not too shabby! Now, a little nerdy note – the combination of Poisson/Gamma distributions, which we've been using to model our customers' purchasing behavior is also known as Negative Binomial Distribution (NBD). Yep, this is where the name of our model comes from.

Now let's deal with the deactivation process. As mentioned earlier, after each purchase, a customer will make a decision on whether or not to deactivate. We can assign a probability p for this deactivation. Consequently, the transaction after which a customer deactivates is distributed according to the shifted geometric distribution. The pmf of this discrete distribution is shown below:

This PMF is very intuitive – if a customer deactivates only after the xth transaction, they must have survived the preceding x - 1 transactions. Each of this survival carries the probability (1 - p), hence the (1-p)x-1 component.

This graph compares two customers with p = 0.01 and p = 0.1

Do note:

- By definition, a customer must perform at least one transaction before deactivating.
- The higher the p, the more likely it is that the deactivation happens earlier. We see that the customer with p = 0.01 (blue) has a much lower probability of deactivating early compared to the customer with p = 0.1 (orange)

Similar to 𝜆, it is useful to consider that a population of customers is associated with a distribution p. This time, however, we can't use Gamma distribution, which has no upper bound. We'll need another distribution that is equally flexible but whose value range from 0 to 1 (because p can only be between 0 to 1).

The Beta distribution fits our needs. Here is the pdf of the beta distribution:

We can see that the distribution is parameterized by two positive shape parameters, a and b. Here are some examples of Beta distribution:

Similar to the Gamma distribution, this Beta distribution also carries business implications. You'd want to see a left-skewed Beta distribution that puts most of its weight near 0, which suggests that most of your customers have low p and aren't likely to deactivate early.

Another quick note – it is this combination of Beta/Geometric distributions that give rise to the "BG" in the BG-NBD model. Now you know!

We've looked at all the distributions with which we quantitatively describe the behavior of our customers. How do we then obtain the best parameters for these distributions?

One way is to get the maximum likelihood estimators (MLE), which are parameter estimators that maximize the likelihood that the model produced the data that was actually observed.

Let's make it more concrete. Suppose that we currently are at time T and we're looking back at the historical transactions of a particular customer that has a purchase rate 𝜆. They made their first transaction at t1 and their last at tx. These points, drawn on a timeline, look like this:

We can derive the individual-level likelihood function of this person by following the below steps:

- The likelihood of the first transaction occurring at t1 is described using the exponential distribution we elaborated earlier:

- The likelihood of the second transaction occurring at t2 is the probability of the customer remaining active after t1 (1-p) multiplied by the standard exponential likelihood component:

- Such a likelihood pattern is repeated for each subsequent transaction, that is, the likelihood of the xth transaction happening at tx is:

Now, let's analyze what happens after the last transaction at tx. We don't observe any transaction between tx and T, and this absence can be due to either one of the following scenarios:

- The customer was deactivated after their last transaction at tx. As we know, the probability of this happening is p.
- They remain active yet didn't make any transaction at this interval. The probability of this happening is:

- The likelihood of observing the transaction pattern that we observed is simply the multiplication of all the likelihood for earlier transactions times the sum of the likelihood of the two scenarios:

- The above-defined likelihood formula is applicable to customers who made some purchases in the observation period. Since we assume that all customers were active to begin with, the likelihood that a customer will not make any purchase between the time [0, T] is the standard exponential function:

- Lastly, combining the above two likelihood formulas, we obtain a generalized formula for all customers notwithstanding the number of transactions they made (or lack thereof):

We can then programmatically try out different values of p and 𝜆 and choose a (p, 𝜆) combination that maximizes this likelihood. This will then be the "best" parameters that describe the purchasing behavior and the deactivation probability of this given individual.

Do note that this individual-level likelihood function only involves three unknown variables that need to be supplied by the data:

- x: the number of repeat transactions. This is also called the (repeat) frequency.
- tx: the age of the customer at his last transaction time. This is also called recency.
- T: the age of the customer at the point of analysis.

Interestingly enough, the exact time of the earlier transactions (t1 …to tx-1) isn't needed.

A dataset whose rows correspond to different customer IDs and whose columns indicate the x, tx, and T of each customer is referred to as being in the "RFM format". The RFM format is the canonical format used frequently in CLV analysis.

As a company with (hopefully) many customers, oftentimes, we're not too interested in looking at individual customers. Rather, we'd like to analyze our customer base as a whole. Specifically, we're interested in obtaining the best Gamma and Beta distributions that describe the performance of our entire business.

Just like how we can use MLE to get the best p and 𝜆 for an individual, we can also use MLE to get the best r, α, a, and b for the population. I won't be deriving the population-level likelihood equation in this blog post; it’s long enough as it is. However, if you've understood the math, you should be in good shape to dive into the derivation that is clearly explained in the ** BG-NBD paper**.

Alright, I know that we've gone through a lot of math, which can be challenging. You might be thinking: Is there a way to skip all these equations and simply go to the implementation of this model?

I hear you! In Part 2 of the series, we'll look at the Python library lifetimes that allows us to use a couple lines of code to get the MLE of r, α, a, and b for a from a given transaction record. This library also contains other useful analytical and plotting functions that will allow us to derive business insights from the BG-NBD model and other related models.

In addition, in Part 3 of the series, we will take a look at an alternative implementation of BG-NBD, which approaches the parameter estimation from the Bayesian perspective.

I hope to see you there!