import pandas as pd
import sqlite3
from sqlalchemy import create_engine
import os
import numpy as np
import sys
from IPython.display import display, Markdown, Image
from pathlib import Path
import chart_studio.plotly as py
import plotly.express as px
import plotly.graph_objects as go
import cufflinks as cf
import matplotlib.pyplot as plt
%matplotlib inline
import math
import warnings
warnings.filterwarnings("ignore")
import importlib
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
cf.go_offline()
# Add the project root to the system path:
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if project_root not in sys.path:
    sys.path.append(project_root)
# Import custom modules:
from src.rfm_analysis import rfm_analysis_metrics, rfm_analysis_summary
from plots import rfm_analysis_plots
from utils import helper
# Create a connection to the ecommerce database which is required for RFM metric function calls:
conn = sqlite3.connect("../data/ecommerce.db")
# A helper function to reload the package to pick up necessary changes:
def reload_package():
    importlib.reload(rfm_analysis_metrics)
    importlib.reload(rfm_analysis_summary)
    importlib.reload(rfm_analysis_plots)
    importlib.reload(helper)
    print("Packages reloaded!!")
# Create a global path for all SQL files related to the RFM analysis:
SQL_BASE_PATH = Path("../sql/03_Customer_Segmentation_(RFM_Analysis)")
# Get the list of the available tables:
tables = helper.extract_table_names(conn=conn)
print(f"The list of the tables:\n{tables}")
The list of the tables:
name
0 customers
1 geolocation
2 orders
3 order_items
4 order_payments
5 order_reviews
6 products
7 sellers
8 product_category_translation
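The table listing above can be reproduced with a minimal query against SQLite's schema catalog. The sketch below uses a toy in-memory database, not the project's `ecommerce.db`, and the real `helper.extract_table_names` may differ in details:

```python
import sqlite3

import pandas as pd

# Toy in-memory database standing in for ecommerce.db (assumption: the real
# helper may filter or order differently).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT)")
conn.execute("CREATE TABLE orders (order_id TEXT)")

# sqlite_master is SQLite's built-in catalog of schema objects.
tables = pd.read_sql_query(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name",
    conn)
print(tables["name"].tolist())  # → ['customers', 'orders']
```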
**Objective:** Segment customers based on Recency, Frequency, and Monetary value in order to identify high-value, loyal, at-risk, and churn-prone customer groups, enabling targeted retention strategies, personalized marketing, and revenue optimization.

For the RFM analysis, only orders with `order_status = 'delivered'` are considered. This ensures that the revenue was actually realized by the firm.

NOTE: A temporary table `customer_order_info` is created for further analysis:
rfm_analysis_metrics.create_temp_table_customer_order_info(conn=conn, sql_base_path=SQL_BASE_PATH)
Temporary table `customer_order_info` successfully created !!
Recency measures the number of days since a customer's most recent delivered purchase:

- Low Recency (recent purchases): active or engaged customers
- High Recency (purchased long ago): inactive, churn-risk customers

Since this dataset is historical, using the current date would not make sense; instead, the latest `order_purchase_timestamp` + 10 days is used as the reference date (threshold).
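The reference-date logic described above can be sketched in pandas. This is a toy illustration with made-up rows; the actual computation happens in SQL inside `get_recency_analysis()`:

```python
import pandas as pd

# Toy orders table (assumed data, for illustration only).
orders = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "order_purchase_timestamp": pd.to_datetime(
        ["2018-08-01", "2018-08-29", "2017-11-14"]),
})

# Reference date = latest purchase in the dataset + 10 days.
date_threshold = orders["order_purchase_timestamp"].max() + pd.Timedelta(days=10)

# Recency per customer = days between their latest purchase and the threshold.
latest = orders.groupby("customer_id")["order_purchase_timestamp"].max()
days_since_last_purchase = (date_threshold - latest).dt.days
print(days_since_last_purchase)
```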
Call the `get_recency_analysis()` function from the `rfm_analysis_metrics` module:

recency_table_full = rfm_analysis_metrics.get_recency_analysis(conn=conn, sql_base_path=SQL_BASE_PATH)
recency_table_full.head(3)
| | customer_id | latest_purchase_timestamp | date_threshold | days_since_last_purchase | bins |
|---|---|---|---|---|---|
| 0 | 00012a2ce6f8dcda20d059ce98491703 | 2017-11-14 | 2018-09-08 | 297 | 50 |
| 1 | 000161a058600d5901f007fab4c27140 | 2017-07-16 | 2018-09-08 | 418 | 80 |
| 2 | 0001fd6190edaaf884bcaf3d49edf079 | 2017-02-28 | 2018-09-08 | 556 | 110 |
print(f"The average recency day is: {np.floor(recency_table_full['days_since_last_purchase'].mean())}")
The average recency day is: 248.0
recency_table_full[recency_table_full['bins'].isna()]
| customer_id | latest_purchase_timestamp | date_threshold | days_since_last_purchase | bins |
|---|---|---|---|---|

(Empty result: no customer has a missing `bins` value.)
Use the `recency_table_full` table to group the customers by bin and count the unique IDs:

customers_per_bins = recency_table_full.groupby(['bins'])['customer_id'].nunique()
customers_per_bins
bins
0       9196
10      9545
20     11213
30     11553
40     11276
50     11198
60      7076
70      6895
80      5844
90      5591
100     3960
110     2747
120      118
130       90
140      176
Name: customer_id, dtype: int64
Map the bin values (each bin representing a 50-day recency interval) to day ranges for visualization purposes:

# Keep a backup copy in case something goes wrong:
customers_per_bins_labeled = customers_per_bins.copy()
# Mapping dictionary:
bin_mapping = {
0: '0-50',
10: '50-100',
20: '100-150',
30: '150-200',
40: '200-250',
50: '250-300',
60: '300-350',
70: '350-400',
80: '400-450',
90: '450-500',
100: '500-550',
110: '550-600',
120: '600-650',
130: '650-700',
140: '700+'
}
customers_per_bins_labeled.index = customers_per_bins_labeled.index.map(bin_mapping)
customers_per_bins_labeled
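As an alternative to the bin-then-map two-step above, the day values could be labeled directly with `pd.cut`. This is a minimal sketch on toy values, assuming 50-day edges that match the mapping dictionary:

```python
import pandas as pd

# Toy recency values (days since last purchase), for illustration only.
days = pd.Series([12, 73, 297, 418, 556, 720])

# 50-day interval edges up to 700, with an open-ended final bucket.
edges = list(range(0, 701, 50)) + [float("inf")]
labels = [f"{lo}-{lo + 50}" for lo in range(0, 700, 50)] + ["700+"]

# right=False makes intervals left-closed: [0, 50), [50, 100), ...
binned = pd.cut(days, bins=edges, labels=labels, right=False)
print(binned.tolist())
```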
customers_per_bins_labeled_df = customers_per_bins_labeled.to_frame().reset_index().rename(columns={'bins': 'Days Interval',
'customer_id': 'Total Customers'})
This is how the well-formatted result looks for the customer counts per bin, using a 50-day recency interval:
customers_per_bins_labeled_df
| | Days Interval | Total Customers |
|---|---|---|
| 0 | 0-50 | 9196 |
| 1 | 50-100 | 9545 |
| 2 | 100-150 | 11213 |
| 3 | 150-200 | 11553 |
| 4 | 200-250 | 11276 |
| 5 | 250-300 | 11198 |
| 6 | 300-350 | 7076 |
| 7 | 350-400 | 6895 |
| 8 | 400-450 | 5844 |
| 9 | 450-500 | 5591 |
| 10 | 500-550 | 3960 |
| 11 | 550-600 | 2747 |
| 12 | 600-650 | 118 |
| 13 | 650-700 | 90 |
| 14 | 700+ | 176 |
Call the `plot_customer_distribution_recency_bar_chart()` function from the `rfm_analysis_plots` module:

fig = rfm_analysis_plots.plot_customer_distribution_recency_bar_chart(df=customers_per_bins_labeled_df)
Call the `get_recency_analysis_summary()` function from the `rfm_analysis_summary` module to summarize the Recency Analysis section:

reload_package()
Packages reloaded!!
rfm_analysis_summary.get_recency_analysis_summary(recency_table_full, customers_per_bins_labeled_df, fig)
This analysis evaluates customer recency by measuring the number of days since a customer’s most recent delivered purchase.
As the dataset is historical and not up to date, a reference date of 2018-09-08 (10 days after the last recorded purchase) was
used instead of the current date to avoid distorting recency values. Each customer appears only once, using a window function to capture
their latest completed purchase.
On average, customers take approximately 248 days to place their second purchase, indicating a long repurchase cycle.
Out of 96,478 customers, only 18,741 return within 1–100 days, while the majority (45,240 customers) make a second purchase between 100–300 days.
A significant portion of customers show very long gaps between purchases, with 25,406 customers returning after 300–500 days, and 7,091 customers taking more than 500 days.
Overall, repeat purchasing behavior is heavily concentrated in the 100–300 day window, suggesting low short-term retention and a slow repeat-purchase cycle.
The bar chart below visualizes customer counts across 50-day recency bins, highlighting the delayed repurchase behavior.
Frequency measures how often a customer purchases:

- High Frequency: loyal customers
- Low Frequency: one-time customers or customers at churn risk

Call the `get_customer_order_frequency_analysis()` function from the `rfm_analysis_metrics` module. To make the results interpretable from a business perspective, customers were segmented as follows:
customer_order_frequency_interpretation = rfm_analysis_metrics.get_customer_order_frequency_analysis(conn=conn,
sql_base_path=SQL_BASE_PATH)
# The returned dataframe consists of the information below:
customer_order_frequency_interpretation.head()
| | customer_unique_id | total_orders | Interpretation |
|---|---|---|---|
| 0 | 0000366f3b9a7992bf8c76cfdf3221e2 | 1 | One-Time Buyer |
| 1 | 0000b849f77a49e4a4ce2b2a4ca5be3f | 1 | One-Time Buyer |
| 2 | 0000f46a3911fa3c0805444483337064 | 1 | One-Time Buyer |
| 3 | 0000f6ccb0745a6a4b88665a16c9f078 | 1 | One-Time Buyer |
| 4 | 0004aac84e0df4da2b147fca70cf8255 | 1 | One-Time Buyer |
Use the `Interpretation` feature to count the customers in each segment and their share (%) of the total customer base:

customer_order_frequency_interpretation_counts = customer_order_frequency_interpretation['Interpretation'].value_counts()
customer_order_frequency_interpretation_counts = customer_order_frequency_interpretation_counts.to_frame().reset_index()
customer_order_frequency_interpretation_counts.rename(columns={'count':'Total Counts'}, inplace=True)
customer_order_frequency_interpretation_counts['Customer Share (%)'] = np.round((customer_order_frequency_interpretation_counts['Total Counts']
/ customer_order_frequency_interpretation_counts['Total Counts'].sum()) * 100, 3)
customer_order_frequency_interpretation_counts
| | Interpretation | Total Counts | Customer Share (%) |
|---|---|---|---|
| 0 | One-Time Buyer | 90557 | 97.000 |
| 1 | Returning Customer | 2573 | 2.756 |
| 2 | Loyal Customer | 181 | 0.194 |
| 3 | Very Loyal | 28 | 0.030 |
| 4 | VIP Customers | 19 | 0.020 |
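The share computation above can also be done in one step with `value_counts(normalize=True)`, which returns fractions directly. A toy sketch with an assumed distribution:

```python
import numpy as np
import pandas as pd

# Toy interpretation labels (assumed 97/3 split, not the real data).
interp = pd.Series(["One-Time Buyer"] * 97 + ["Returning Customer"] * 3)

# normalize=True yields fractions, avoiding the manual division by the sum.
share = np.round(interp.value_counts(normalize=True) * 100, 3)
print(share)
```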
Call the `plot_customer_purchase_frequency_bar_chart()` function from the `rfm_analysis_plots` module to get the purchase frequency bar chart:

reload_package()
Packages reloaded!!
fig = rfm_analysis_plots.plot_customer_purchase_frequency_bar_chart(df=customer_order_frequency_interpretation_counts)
Call the `get_frequency_analysis_summary()` function from the `rfm_analysis_summary` module to summarize the Frequency Analysis section:

rfm_analysis_summary.get_frequency_analysis_summary(customer_order_frequency_interpretation_counts, fig)
Analyze customers based on their buying-frequency pattern to identify VIP customers and customers at churn risk. The number of distinct orders per customer is used to calculate the frequency metric.

For better interpretability from a business perspective, customers are segmented as follows:

- One-Time Buyer: 1 order
- Returning Customer: 2 orders
- Loyal Customer: 3 orders
- Very Loyal Customer: 4 orders
- VIP Customer: 5 or more orders

One-Time Buyers dominate the customer base, accounting for 97.0% (90,557 customers). To properly visualize this imbalance, a logarithmic scale was used in the frequency distribution chart. This approach prevents one-time buyers from visually overwhelming the chart and allows meaningful comparison between smaller yet valuable customer segments. The bar chart below visualizes the customer purchase frequency (%) across the five segments:
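The segmentation rule above can be sketched as a simple mapping. The labels follow the notebook's, but the real logic lives in the SQL/metrics module, so this is only an illustration:

```python
import pandas as pd

def interpret_frequency(total_orders: int) -> str:
    """Map a distinct-order count to the notebook's frequency segment."""
    if total_orders == 1:
        return "One-Time Buyer"
    if total_orders == 2:
        return "Returning Customer"
    if total_orders == 3:
        return "Loyal Customer"
    if total_orders == 4:
        return "Very Loyal"
    return "VIP Customers"  # 5 or more orders

orders_per_customer = pd.Series([1, 2, 3, 4, 7])
print(orders_per_customer.map(interpret_frequency).tolist())
```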
Call the `get_total_spending_per_customer()` function from the `rfm_analysis_metrics` module to get the total spending per customer:

total_spendings_per_customer = rfm_analysis_metrics.get_total_spending_per_customer(conn=conn,
sql_base_path=SQL_BASE_PATH)
print(f"The `total_spendings_per_customer` table has a total {total_spendings_per_customer.shape[0]} rows and "
f"{total_spendings_per_customer.shape[1]} features.")
The `total_spendings_per_customer` table has a total 93358 rows and 2 features.
# Below is the information the table holds:
total_spendings_per_customer.head()
| | customer_unique_id | total_amount |
|---|---|---|
| 0 | 0a0a92112bd4c708ca5fde585afaa872 | 13664.08 |
| 1 | da122df9eeddfedc1dc1f5349a1a690c | 7571.63 |
| 2 | 763c8b1c9c68a0229c42c9fc6f662b93 | 7274.88 |
| 3 | dc4802a71eae9be1dd28f5d788ceb526 | 6929.31 |
| 4 | 459bef486812aa25204be022145caa62 | 6922.21 |
Merge the `customer_order_frequency_interpretation` and `total_spendings_per_customer` dataframes:

The code below merges these two dataframes to combine each customer's order frequency with their total spending:
total_frequency_and_total_amount = pd.merge(customer_order_frequency_interpretation,
total_spendings_per_customer, on='customer_unique_id' , how='inner')
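As a sanity check, the same join can be run with `validate='one_to_one'`, which raises a `MergeError` if a `customer_unique_id` is duplicated on either side. A toy sketch with assumed data:

```python
import pandas as pd

# Toy frames standing in for the two dataframes being merged above.
freq = pd.DataFrame({"customer_unique_id": ["c1", "c2"],
                     "total_orders": [1, 2]})
spend = pd.DataFrame({"customer_unique_id": ["c1", "c2"],
                      "total_amount": [141.90, 27.19]})

# validate='one_to_one' guards against accidental duplicate keys.
merged = pd.merge(freq, spend, on="customer_unique_id", how="inner",
                  validate="one_to_one")
print(merged.shape)  # → (2, 3)
```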
The merged dataframe total_frequency_and_total_amount holds the information as displayed below:
total_frequency_and_total_amount.head()
| | customer_unique_id | total_orders | Interpretation | total_amount |
|---|---|---|---|---|
| 0 | 0000366f3b9a7992bf8c76cfdf3221e2 | 1 | One-Time Buyer | 141.90 |
| 1 | 0000b849f77a49e4a4ce2b2a4ca5be3f | 1 | One-Time Buyer | 27.19 |
| 2 | 0000f46a3911fa3c0805444483337064 | 1 | One-Time Buyer | 86.22 |
| 3 | 0000f6ccb0745a6a4b88665a16c9f078 | 1 | One-Time Buyer | 43.62 |
| 4 | 0004aac84e0df4da2b147fca70cf8255 | 1 | One-Time Buyer | 196.89 |
Group the merged dataframe by the `Interpretation` feature:

segment_monetary_summary = (total_frequency_and_total_amount.groupby('Interpretation')
.agg(total_customers=('customer_unique_id', 'nunique'),
total_revenue=('total_amount', 'sum'),
avg_revenue_per_customer=('total_amount', 'mean'))
.reset_index()
.sort_values(by='total_revenue', ascending=False))
The dataframe was grouped by the `Interpretation` feature into the five segments to analyze customer behavior based on buying frequency:
segment_monetary_summary
| | Interpretation | total_customers | total_revenue | avg_revenue_per_customer |
|---|---|---|---|---|
| 1 | One-Time Buyer | 90557 | 14555586.29 | 160.733972 |
| 2 | Returning Customer | 2573 | 748811.57 | 291.026650 |
| 0 | Loyal Customer | 181 | 78333.95 | 432.784254 |
| 4 | Very Loyal | 28 | 22086.25 | 788.794643 |
| 3 | VIP Customers | 19 | 14955.69 | 787.141579 |
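The named-aggregation pattern used above can be illustrated on toy data (assumed values, not the real tables):

```python
import pandas as pd

# Toy merged frame with two segments, for illustration only.
df = pd.DataFrame({"Interpretation": ["A", "A", "B"],
                   "customer_unique_id": ["c1", "c2", "c3"],
                   "total_amount": [10.0, 30.0, 50.0]})

# Named aggregation: each keyword argument becomes an output column.
summary = (df.groupby("Interpretation")
             .agg(total_customers=("customer_unique_id", "nunique"),
                  total_revenue=("total_amount", "sum"),
                  avg_revenue_per_customer=("total_amount", "mean"))
             .reset_index())
print(summary)
```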
Merge `segment_monetary_summary` and `customer_order_frequency_interpretation_counts` to get the full table for visual comparison:

segment_monetary_summary_w_customer_share = pd.merge(segment_monetary_summary,
customer_order_frequency_interpretation_counts,
on='Interpretation',
how='inner').drop(columns=['Total Counts'])
segment_monetary_summary_w_customer_share = (segment_monetary_summary_w_customer_share
.rename(columns={'total_customers':'Total Customers',
'total_revenue': 'Total Revenue',
'avg_revenue_per_customer': 'Average Revenue Per Customer',
'Customer Share (%)':'Revenue Contribution (%)'}))
The resulting merged table segment_monetary_summary_w_customer_share will be used to compare the customer segments visually, where the key idea is to compare VOLUME vs. VALUE. The table is displayed below:
segment_monetary_summary_w_customer_share
| | Interpretation | Total Customers | Total Revenue | Average Revenue Per Customer | Revenue Contribution (%) |
|---|---|---|---|---|---|
| 0 | One-Time Buyer | 90557 | 14555586.29 | 160.733972 | 97.000 |
| 1 | Returning Customer | 2573 | 748811.57 | 291.026650 | 2.756 |
| 2 | Loyal Customer | 181 | 78333.95 | 432.784254 | 0.194 |
| 3 | Very Loyal | 28 | 22086.25 | 788.794643 | 0.030 |
| 4 | VIP Customers | 19 | 14955.69 | 787.141579 | 0.020 |
Call the `plot_volume_vs_value_chart()` function from the `rfm_analysis_plots` module to plot the VOLUME vs. VALUE chart:

fig = rfm_analysis_plots.plot_volume_vs_value_chart(df=segment_monetary_summary_w_customer_share)
reload_package()
Packages reloaded!!
Call the `get_monetary_analysis_summary()` function from the `rfm_analysis_summary` module to summarize the Monetary Analysis section:

rfm_analysis_summary.get_monetary_analysis_summary(segment_monetary_summary_w_customer_share,
                                                   fig)
Analyze the total revenue, the average revenue per customer, and the revenue contribution (%) of the customers, based on the five segments created during the frequency analysis.

The key idea is to find the most valuable customer segment, supported by the calculated revenue metrics. Only orders with the delivered status are considered, so that the revenue was actually realized by the firm.
- One-Time Buyer is the largest segment, with a total of 90,557 customers, a total revenue of R\$ 14.55 M, an average revenue of R\$ 160.73 per customer, and a customer share of 97.00 %.
- The value of a customer rises as segment size shrinks: although every segment except One-Time Buyer accounts for less than 2.8 % of customers, the company earns on average between R\$ 291.02 and R\$ 788.79 per customer in the remaining four segments, while the average revenue per customer for One-Time Buyer is just R\$ 160.73.

The chart below displays the comparison between Volume and Value, where the bar chart represents the volume and the line chart represents the value:
reload_package()
Packages reloaded!!
Call the `get_key_findings_and_interp()` function from the `rfm_analysis_summary` module:

rfm_analysis_summary.get_key_findings_and_interp()
The final visualization effectively demonstrates this dual (volume vs. value) perspective.
This analysis reinforces the importance of balancing customer acquisition with customer retention and value growth to build a sustainable revenue strategy.