Customer Segmentation (RFM Analysis)¶

Author: Nishan Karki¶


Table of Contents¶

1. Setup & Configuration¶

  • 1.1 Import Libraries
  • 1.2 Verify Tables

2. Business Context¶

  • 2.1 Objective
  • 2.2 Data Scope Definition

3. Analysis & Metrics¶

  • 3.1 Recency Analysis

    Summary¶
  • 3.2 Frequency Analysis

    Summary¶
  • 3.3 Monetary Analysis

    Summary¶

4. Key Findings & Business Interpretation¶

1. Setup & Configuration:¶

1.1 Import Libraries:¶

In [1]:
import pandas as pd
import sqlite3
from sqlalchemy import create_engine
import os
import numpy as np
import sys
from IPython.display import display, Markdown, Image
from pathlib import Path

import chart_studio.plotly as py
import plotly.express as px
import plotly.graph_objects as go
import cufflinks as cf
import matplotlib.pyplot as plt
%matplotlib inline

import math
import warnings
warnings.filterwarnings("ignore")
import importlib
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
cf.go_offline()

# Add the project root to the system path:
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if project_root not in sys.path:
    sys.path.append(project_root)

# Import custom modules:
from src.rfm_analysis import rfm_analysis_metrics, rfm_analysis_summary
from plots import rfm_analysis_plots
from utils import helper
In [2]:
# Create a connection to the ecommerce database which is required for RFM metric function calls:
conn = sqlite3.connect("../data/ecommerce.db")
In [3]:
# A helper function to reload the package to pick up necessary changes:
def reload_package():
    importlib.reload(rfm_analysis_metrics)
    importlib.reload(rfm_analysis_summary)
    importlib.reload(rfm_analysis_plots)
    importlib.reload(helper)
    print("Packages reloaded!!")

1.2 Verify Tables:¶

In [4]:
# Create a global path for all SQL files related to the RFM analysis:
SQL_BASE_PATH = Path("../sql/03_Customer_Segmentation_(RFM_Analysis)")
In [5]:
# Get the list of the available tables:
tables = helper.extract_table_names(conn=conn)
print(f"The list of the tables:\n{tables}")
The list of the tables:
                           name
0                     customers
1                   geolocation
2                        orders
3                   order_items
4                order_payments
5                 order_reviews
6                      products
7                       sellers
8  product_category_translation

2. Business Context:¶

2.1 Objective:¶

To segment customers based on Recency, Frequency, and Monetary value in order to identify high-value, loyal, at-risk, and churn-prone customer groups, enabling targeted retention strategies, personalized marketing, and revenue optimization.

2.2 Data Scope Definition:¶

For RFM analysis, the orders with order_status=delivered are considered. This ensures that:

  1. Revenue has been fully realized
  2. The transaction represents a completed customer purchase
  3. Customer behavior is measured based on confirmed buying activity rather than intent or failed transactions
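As a minimal illustration of this scope filter (using a toy DataFrame here; the actual filter is applied in SQL against the orders table):

```python
import pandas as pd

# Toy orders table; the real filter runs in SQL against the `orders` table.
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "order_status": ["delivered", "canceled", "delivered", "shipped"],
})

# Keep only completed purchases so revenue is fully realized:
delivered = orders[orders["order_status"] == "delivered"]
print(len(delivered))  # 2
```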

NOTE: A temporary table customer_order_info is created for further analysis:

In [6]:
rfm_analysis_metrics.create_temp_table_customer_order_info(conn=conn, sql_base_path=SQL_BASE_PATH)
Temporary table `customer_order_info` successfully created !!

3. Analysis & Metrics:¶

3.1 Recency Analysis:¶

  • Recency asks: "When did the customer make their last successful (delivered) purchase?"
  • In business terms:
    • Low Recency (recent purchases): active or engaged customers
    • High Recency (long-time ago): inactive, churn-risk customers

Because this dataset is historical, using the current date would not make sense; instead, the latest order_purchase_timestamp in the dataset + 10 DAYS is used as the reference threshold.
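A minimal sketch of how that reference date and per-customer recency can be derived (toy timestamps here; the actual metric is computed in SQL by get_recency_analysis()):

```python
import pandas as pd

# Toy purchase history; real values come from `order_purchase_timestamp`.
purchases = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "order_purchase_timestamp": pd.to_datetime(
        ["2018-01-01", "2018-08-29", "2018-06-01"]
    ),
})

# Reference date = latest purchase anywhere in the dataset + 10 days:
date_threshold = purchases["order_purchase_timestamp"].max() + pd.Timedelta(days=10)

# Latest purchase per customer, then days elapsed since that purchase:
latest = purchases.groupby("customer_id")["order_purchase_timestamp"].max()
days_since_last_purchase = (date_threshold - latest).dt.days
print(date_threshold.date())  # 2018-09-08
```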

3.1.1 Execute the get_recency_analysis() function from the rfm_analysis_metrics module:¶

In [7]:
recency_table_full = rfm_analysis_metrics.get_recency_analysis(conn=conn, sql_base_path=SQL_BASE_PATH)
recency_table_full.head(3)
Out[7]:
customer_id latest_purchase_timestamp date_threshold days_since_last_purchase bins
0 00012a2ce6f8dcda20d059ce98491703 2017-11-14 2018-09-08 297 50
1 000161a058600d5901f007fab4c27140 2017-07-16 2018-09-08 418 80
2 0001fd6190edaaf884bcaf3d49edf079 2017-02-28 2018-09-08 556 110
In [8]:
print(f"The average recency day is: {np.floor(recency_table_full['days_since_last_purchase'].mean())}")
The average recency day is: 248.0
In [9]:
recency_table_full[recency_table_full['bins'].isna()]
Out[9]:
customer_id latest_purchase_timestamp date_threshold days_since_last_purchase bins

3.1.2 Use the bins feature of the recency_table_full table to group the customers and count the unique IDs:¶

In [10]:
customers_per_bins = recency_table_full.groupby(['bins'])['customer_id'].nunique()
In [11]:
customers_per_bins
Out[11]:
bins
0       9196
10      9545
20     11213
30     11553
40     11276
50     11198
60      7076
70      6895
80      5844
90      5591
100     3960
110     2747
120      118
130       90
140      176
Name: customer_id, dtype: int64

3.1.3 Mapping the bin values (50-day recency bins) to day ranges for visualization purposes:¶

In [12]:
# Backup if something goes wrong:
customers_per_bins_labeled = customers_per_bins.copy()

# Mapping dictionary:
bin_mapping = {
    0: '0-50',
    10: '50-100',
    20: '100-150',
    30: '150-200',
    40: '200-250',
    50: '250-300',
    60: '300-350',
    70: '350-400',
    80: '400-450',
    90: '450-500',
    100: '500-550',
    110: '550-600',
    120: '600-650',
    130: '650-700',
    140: '700+'
}

customers_per_bins_labeled.index = customers_per_bins_labeled.index.map(bin_mapping)
customers_per_bins_labeled

customers_per_bins_labeled_df = customers_per_bins_labeled.to_frame().reset_index().rename(columns={'bins': 'Days Interval',
                                                                                                   'customer_id': 'Total Customers'})

This is how a well-formatted result looks for the customer counts per bin, using a 50-day recency interval:

In [13]:
customers_per_bins_labeled_df
Out[13]:
Days Interval Total Customers
0 0-50 9196
1 50-100 9545
2 100-150 11213
3 150-200 11553
4 200-250 11276
5 250-300 11198
6 300-350 7076
7 350-400 6895
8 400-450 5844
9 450-500 5591
10 500-550 3960
11 550-600 2747
12 600-650 118
13 650-700 90
14 700+ 176

3.1.4 Call the plot_customer_distribution_recency_bar_chart() function from rfm_analysis_plots module:¶

In [14]:
fig = rfm_analysis_plots.plot_customer_distribution_recency_bar_chart(df = customers_per_bins_labeled_df)

Summary:¶

Call the get_recency_analysis_summary() function from the rfm_analysis_summary module to summarize the Recency Analysis section:

In [15]:
reload_package()
Packages reloaded!!
In [16]:
rfm_analysis_summary.get_recency_analysis_summary(recency_table_full, customers_per_bins_labeled_df, fig)

Summary: Recency Analysis Per Customer:¶


Objective:

This analysis evaluates customer recency by measuring the number of days since a customer’s most recent delivered purchase. As the dataset is historical and not up to date, a reference date of 2018-09-08 (10 days after the last recorded purchase) was used instead of the current date to avoid distorting recency values. Each customer appears only once, using a window function to capture their latest completed purchase.

Key Insights:

  • On average, 248 days have passed since a customer’s most recent delivered purchase, indicating a long gap since last activity.

  • Out of 96,478 customers, only 18,741 made their last purchase within the past 1–100 days, while the majority (45,240 customers) last purchased 100–300 days ago.

  • A significant portion of customers show very long gaps, with 25,406 customers last purchasing 300–500 days ago, and 7,091 customers more than 500 days ago.

  • Overall, recency is heavily concentrated in the 100–300 day window, suggesting low short-term engagement and a slow repeat-purchase cycle.


The bar chart below visualizes customer counts across 50-day recency bins, highlighting the delayed repurchase behavior.


3.2 Frequency Analysis:¶

  • Frequency asks "How often does a customer make a purchase?"
  • In business terms,
    • High Frequency: loyal customers
    • Low Frequency: one-time customers or customers at churn risk

3.2.1 Call the get_customer_order_frequency_analysis() function from the rfm_analysis_metrics module:¶

To make the results interpretable from a business perspective, customers were segmented as follows:

  • One-Time Buyer: 1 order
  • Returning Customer: 2 orders
  • Loyal Customer: 3 orders
  • Very Loyal Customer: 4 orders
  • VIP Customers: 5 or more orders
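The mapping itself is done in SQL by get_customer_order_frequency_analysis(); an equivalent Python sketch of the thresholds listed above:

```python
def interpret_frequency(total_orders: int) -> str:
    """Map a customer's distinct order count to a frequency segment
    (thresholds mirror the segmentation listed above)."""
    if total_orders >= 5:
        return "VIP Customers"
    if total_orders == 4:
        return "Very Loyal Customer"
    if total_orders == 3:
        return "Loyal Customer"
    if total_orders == 2:
        return "Returning Customer"
    return "One-Time Buyer"

print(interpret_frequency(1))  # One-Time Buyer
print(interpret_frequency(7))  # VIP Customers
```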
In [17]:
customer_order_frequency_interpretation = rfm_analysis_metrics.get_customer_order_frequency_analysis(conn=conn,
                                                                                                    sql_base_path=SQL_BASE_PATH)
In [18]:
# The returned dataframe consists of the information below:
customer_order_frequency_interpretation.head()
Out[18]:
customer_unique_id total_orders Interpretation
0 0000366f3b9a7992bf8c76cfdf3221e2 1 One-Time Buyer
1 0000b849f77a49e4a4ce2b2a4ca5be3f 1 One-Time Buyer
2 0000f46a3911fa3c0805444483337064 1 One-Time Buyer
3 0000f6ccb0745a6a4b88665a16c9f078 1 One-Time Buyer
4 0004aac84e0df4da2b147fca70cf8255 1 One-Time Buyer

3.2.2 Use the Interpretation feature to count customers per segment and their share (%) of the total customer base:¶

In [19]:
customer_order_frequency_interpretation_counts = customer_order_frequency_interpretation['Interpretation'].value_counts()
customer_order_frequency_interpretation_counts = customer_order_frequency_interpretation_counts.to_frame().reset_index()
customer_order_frequency_interpretation_counts.rename(columns={'count':'Total Counts'}, inplace=True)
customer_order_frequency_interpretation_counts['Customer Share (%)'] = np.round((customer_order_frequency_interpretation_counts['Total Counts']
                                                                        / customer_order_frequency_interpretation_counts['Total Counts'].sum()) * 100, 3)
In [20]:
customer_order_frequency_interpretation_counts
Out[20]:
Interpretation Total Counts Customer Share (%)
0 One-Time Buyer 90557 97.000
1 Returning Customer 2573 2.756
2 Loyal Customer 181 0.194
3 Very Loyal 28 0.030
4 VIP Customers 19 0.020

3.2.3 Call the plot_customer_purchase_frequency_bar_chart() function from rfm_analysis_plots module to get the purchase frequency bar chart:¶

In [48]:
reload_package()
Packages reloaded!!
In [49]:
fig = rfm_analysis_plots.plot_customer_purchase_frequency_bar_chart(df=customer_order_frequency_interpretation_counts)

Summary:¶

Call the get_frequency_analysis_summary() function from the rfm_analysis_summary module to summarize the Frequency Analysis section:

In [50]:
rfm_analysis_summary.get_frequency_analysis_summary(customer_order_frequency_interpretation_counts, fig)

Summary: Frequency Analysis¶


Objective:¶

Analyze customers based on their buying-frequency patterns to identify VIP customers and customers at churn risk. The number of distinct orders per customer is used to calculate the frequency metric.

For better interpretability from a business perspective, customers are segmented as follows:

  • One-Time Buyer: 1 order
  • Returning Customer: 2 orders
  • Loyal Customer: 3 orders
  • Very Loyal Customer: 4 orders
  • VIP Customers: 5 or more orders

Key Insights:¶

  • The frequency distribution is heavily right-skewed: One-Time Buyers dominate the customer base, accounting for 97.00% (90,557 customers).
  • Returning (2.756% or 2,573 customers) and Loyal (0.194% or 181 customers) form a much smaller but strategically important segment.
  • Very Loyal (0.030% or 28 customers) and VIP (0.020% or 19) customers represent a very small fraction but are likely to contribute disproportionately higher lifetime value.
  • Overall, the Frequency analysis reveals that the business is highly dependent on first-time buyers, highlighting a significant opportunity for:
    • Retention strategies
    • Repeat purchase incentives
    • Loyalty and CRM programs

To properly visualize this imbalance, a logarithmic scale was used in the frequency distribution chart. This approach prevents one-time buyers from visually overwhelming the chart and allows meaningful comparison between smaller yet valuable customer segments. The bar chart below visualizes customer purchase frequency (%) across the five segments:
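The actual chart comes from plot_customer_purchase_frequency_bar_chart() (Plotly); a minimal matplotlib sketch of the same log-scale idea, using the counts from the table above:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; the notebook renders inline instead
import matplotlib.pyplot as plt

segments = ["One-Time Buyer", "Returning", "Loyal", "Very Loyal", "VIP"]
counts = [90557, 2573, 181, 28, 19]

fig, ax = plt.subplots()
ax.bar(segments, counts)
ax.set_yscale("log")  # keeps the 19-customer VIP bar visible next to 90,557
ax.set_ylabel("Customers (log scale)")
```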


3.3 Monetary Analysis:¶

  • Monetary asks "How much money did each customer spend in total?"
  • This answers:
    • Who are the highest-value customers?
    • Who looks loyal but actually spends very little?
    • Who spends a lot but purchases infrequently?

3.3.1 Call the get_total_spending_per_customer() function from the rfm_analysis_metrics module to get the total spending per customer:¶

In [24]:
total_spendings_per_customer = rfm_analysis_metrics.get_total_spending_per_customer(conn=conn,
                                                                                   sql_base_path=SQL_BASE_PATH)
In [25]:
print(f"The `total_spendings_per_customer` table has a total {total_spendings_per_customer.shape[0]} rows and "
      f"{total_spendings_per_customer.shape[1]} features.")
The `total_spendings_per_customer` table has a total 93358 rows and 2 features.
In [26]:
# Below is the information the table holds:
total_spendings_per_customer.head()
Out[26]:
customer_unique_id total_amount
0 0a0a92112bd4c708ca5fde585afaa872 13664.08
1 da122df9eeddfedc1dc1f5349a1a690c 7571.63
2 763c8b1c9c68a0229c42c9fc6f662b93 7274.88
3 dc4802a71eae9be1dd28f5d788ceb526 6929.31
4 459bef486812aa25204be022145caa62 6922.21

3.3.2 Merging customer_order_frequency_interpretation and total_spendings_per_customer dataframes:¶

Now, the code below is used to merge these two dataframes in order to get the customer's information about:

  1. Total number of orders
  2. The customer's interpretation (i.e., which buying category they belong to)
  3. The total amount the customer spent so far
In [27]:
total_frequency_and_total_amount = pd.merge(customer_order_frequency_interpretation, 
               total_spendings_per_customer, on='customer_unique_id' , how='inner')

The merged dataframe total_frequency_and_total_amount holds the information as displayed below:

In [28]:
total_frequency_and_total_amount.head()
Out[28]:
customer_unique_id total_orders Interpretation total_amount
0 0000366f3b9a7992bf8c76cfdf3221e2 1 One-Time Buyer 141.90
1 0000b849f77a49e4a4ce2b2a4ca5be3f 1 One-Time Buyer 27.19
2 0000f46a3911fa3c0805444483337064 1 One-Time Buyer 86.22
3 0000f6ccb0745a6a4b88665a16c9f078 1 One-Time Buyer 43.62
4 0004aac84e0df4da2b147fca70cf8255 1 One-Time Buyer 196.89

3.3.3 Grouping the customers based on Interpretation feature:¶

In [29]:
segment_monetary_summary = (total_frequency_and_total_amount.groupby('Interpretation')
                            .agg(total_customers=('customer_unique_id', 'nunique'),
                            total_revenue=('total_amount', 'sum'),
                            avg_revenue_per_customer=('total_amount', 'mean'))
                            .reset_index()
                            .sort_values(by='total_revenue', ascending=False))

The dataframe was grouped by the Interpretation feature into the five segments to analyze customer behavior based on buying frequency:

In [30]:
segment_monetary_summary
Out[30]:
Interpretation total_customers total_revenue avg_revenue_per_customer
1 One-Time Buyer 90557 14555586.29 160.733972
2 Returning Customer 2573 748811.57 291.026650
0 Loyal Customer 181 78333.95 432.784254
4 Very Loyal 28 22086.25 788.794643
3 VIP Customers 19 14955.69 787.141579

3.3.4 Merging segment_monetary_summary & customer_order_frequency_interpretation_counts to get the full table for visual comparison:¶

In [31]:
segment_monetary_summary_w_customer_share = pd.merge(segment_monetary_summary,
                                                    customer_order_frequency_interpretation_counts,
                                                     on='Interpretation',
                                                    how='inner').drop(columns=['Total Counts'])

segment_monetary_summary_w_customer_share = (segment_monetary_summary_w_customer_share
                                             .rename(columns={'total_customers':'Total Customers',
                                                             'total_revenue': 'Total Revenue',
                                                             'avg_revenue_per_customer': 'Average Revenue Per Customer',
                                                             'Customer Share (%)':'Revenue Contribution (%)'}))

The resulting merged table segment_monetary_summary_w_customer_share will be used in a visualization comparing the customer segments, where the key idea is VOLUME V/S VALUE; the table is displayed below:

In [32]:
segment_monetary_summary_w_customer_share
Out[32]:
Interpretation Total Customers Total Revenue Average Revenue Per Customer Revenue Contribution (%)
0 One-Time Buyer 90557 14555586.29 160.733972 97.000
1 Returning Customer 2573 748811.57 291.026650 2.756
2 Loyal Customer 181 78333.95 432.784254 0.194
3 Very Loyal 28 22086.25 788.794643 0.030
4 VIP Customers 19 14955.69 787.141579 0.020

3.3.5 Call the plot_volume_vs_value_chart() function from rfm_analysis_plots module to plot the VOLUME V/S VALUE chart:¶

In [51]:
fig = rfm_analysis_plots.plot_volume_vs_value_chart(df=segment_monetary_summary_w_customer_share)
In [54]:
reload_package()
Packages reloaded!!

Summary:¶

Call the get_monetary_analysis_summary() function from the rfm_analysis_summary module to summarize the Monetary Analysis section:

In [55]:
rfm_analysis_summary.get_monetary_analysis_summary(segment_monetary_summary_w_customer_share,
                                                            fig)

Summary: Monetary Analysis¶


Objective:¶

Analyze total revenue, average revenue per customer, and revenue contribution (%) across the five segments created during the frequency analysis. The key idea is to identify the most valuable customer segments using revenue metrics computed only from orders with delivered status, so that only revenue actually realized by the firm is counted.

Key Insights:¶

  • The company's revenue relies heavily on One-Time Buyers: 90,557 customers generating a total revenue of R\$ 14.55 M, at an average of R\$ 160.73 per customer, with a revenue contribution of 97.00 %.
  • Customer value rises as segment size falls: although the revenue contribution (%) of every segment other than One-Time Buyer is below 2.8%, the company earns an average of R\$ 291.02 to R\$ 787.14 per customer in the remaining four segments, versus just R\$ 160.73 for One-Time Buyers.
  • The metrics clearly show that volume is the KEY revenue driver.

The chart below displays the comparison between Volume and Value, where the bars represent volume and the line represents value:

4. Key Findings & Business Interpretation:¶

In [56]:
reload_package()
Packages reloaded!!

Call the get_key_findings_and_interp() function from the rfm_analysis_summary module to list the Key Findings & Business Interpretation:

In [37]:
rfm_analysis_summary.get_key_findings_and_interp()

Key Findings:¶

  • One-Time Buyers generate the highest total revenue overall, primarily due to their overwhelming volume (≈97% of the customer base), despite having the lowest average revenue per customer.
  • Returning, Loyal, and Very Loyal customers, although significantly fewer in number, contribute progressively higher average revenue per customer, indicating stronger individual customer value.
  • VIP Customers represent the smallest segment but show one of the highest average revenues per customer, highlighting their importance from a long-term value perspective rather than volume.
  • The contrast between total revenue (volume-driven) and average revenue per customer (value-driven) reveals that revenue concentration alone can be misleading without considering customer quality.

Business Interpretation:¶

  • Revenue is currently volume-driven, dominated by One-Time Buyers.
  • However, high-frequency segments deliver disproportionately higher value per customer, making them ideal targets for:
    • Retention strategies
    • Personalized marketing
    • Loyalty and reward programs

The final visualization effectively demonstrates this dual perspective by:

  • Using Log₁₀ scaled total revenue to highlight revenue concentration.
  • Overlaying average revenue per customer to reveal customer value across segments.
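A minimal matplotlib sketch of that dual-axis idea (the real figure comes from plot_volume_vs_value_chart()), using the segment values from the table in 3.3.4:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; the notebook renders inline instead
import matplotlib.pyplot as plt

segments = ["One-Time", "Returning", "Loyal", "Very Loyal", "VIP"]
total_revenue = [14555586.29, 748811.57, 78333.95, 22086.25, 14955.69]
avg_revenue = [160.73, 291.03, 432.78, 788.79, 787.14]

fig, ax_volume = plt.subplots()
ax_volume.bar(segments, np.log10(total_revenue))   # volume: log10 total revenue
ax_volume.set_ylabel("log10(Total Revenue)")

ax_value = ax_volume.twinx()                       # second y-axis for value
ax_value.plot(segments, avg_revenue, marker="o", color="tab:orange")
ax_value.set_ylabel("Avg Revenue per Customer (R$)")
```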

This analysis reinforces the importance of balancing customer acquisition with customer retention and value growth to build a sustainable revenue strategy.
