import pandas as pd
import os
import numpy as np
from IPython.display import display, Markdown, Image, HTML
import sys
import sqlite3
from pathlib import Path
import chart_studio.plotly as py
import plotly.express as px
import plotly.graph_objects as go
import cufflinks as cf
import matplotlib.pyplot as plt
%matplotlib inline
import importlib
import warnings
warnings.filterwarnings("ignore")
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
cf.go_offline()
# Add the project root to the system path:
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if project_root not in sys.path:
sys.path.append(project_root)
# Import custome modules:
from src.revenue_leakage import revenue_leakage_metrics, revenue_leakage_summary
from plots import revenue_leakage_plots
from utils import helper
# Create a connection to the ecommerce database which is required for the function calls:
conn = sqlite3.connect("../data/ecommerce.db")
# A helper function to reload the package to pick up necessary changes:
def reload_package():
importlib.reload(revenue_leakage_metrics)
importlib.reload(revenue_leakage_summary)
importlib.reload(helper)
print("Packages reloaded!!")
# Create a global path for all SQL files that is related to the revenue leakage analysis:
SQL_BASE_PATH = Path("../sql/02_Revenue_Leakage_Analysis")
helper.extract_table_names(conn=conn)
| name | |
|---|---|
| 0 | customers |
| 1 | geolocation |
| 2 | orders |
| 3 | order_items |
| 4 | order_payments |
| 5 | order_reviews |
| 6 | products |
| 7 | sellers |
| 8 | product_category_translation |
In e-commerce, revenue loss is not caused solely by low demand or poor sales performance. A significant portion of lost revenue originates from orders that were successfully placed but never fulfilled due to cancellations.
Each cancelled order represents:
Importantly, not all cancellations have the same business impact. Cancellations in low-volume but high-value categories can be significantly more damaging than a large number of cancellations in low-priced categories.
Therefore, treating all cancellations equally may lead to misaligned priorities. A data-driven approach is required to identify where revenue leakage is most severe and where intervention should be prioritized.
The objective of this analysis is to identify product categories that contribute most to revenue leakage by evaluating:
This allows the business to distinguish between:
Before conducting the analysis, the following hypothesis are formulated:
Hypothesis 1: High-priced or logistics-heavy categories will show disproportionately high revenue loss, even if cancellation volume is low.
Hypothesis 2: Some categories may have high cancellation rates but low revenue impact, indicating operational inefficiencies rather than financial risk.
The tables we will be using for this business problem:
1. orders: to identify the canceled orders via order_status
2. order_items: to calculate the revenues or possible revenues based on the order status
3. products: to map the items to the categories
4. product_category_translation: for readable category names
duplicates = helper.check_duplicates(conn=conn)
duplicates
| order_id | order_item_id | total |
|---|
There is no any duplicates. A duplicate data would have falsify the revenue calculations, since there are no any duplicates, summing at the row level is safe.
For this metrics, I will be considering both price + freight_value
Here, order_id and order_item_id will be crucial, where, order_id represents an ID for one checkout
where as order_item_id represents the number of items in that particular checkout.
reload_package()
Packages reloaded!!
# 1. Execute the SQL code to extract the revenue metrics:
revenue_baseline_metrics = (revenue_leakage_metrics.get_baseline_metrics(conn=conn,sql_base_path=SQL_BASE_PATH))
# 2. A function that calculates the revenue leakage in percentage:
revenue_leakage_percentage = revenue_leakage_metrics.generate_revenue_leakage(revenue_baseline_metrics=revenue_baseline_metrics)
revenue_metrics_summary() function from revenue_leakage_summary Baseline Revenue Metrics section:revenue_leakage_summary.revenue_metrics_summary(revenue_baseline_metrics, revenue_leakage_percentage)
The headline revenue metrics provide a high-level view of the platform’s overall revenue performance and operational efficiency.
Out of a total market opportunity of R\$15.84M, the business successfully
realized R\$15.42M in revenue from delivered orders, indicating that the vast
majority of potential revenue was converted into actual sales.
Revenue lost due to order cancellations amounts to R\$0.11M, resulting in an
overall revenue leakage of just 0.69%.
The metrics and the amounts have been presented in a tabular format below:
| Metrics | Amount_millions |
|---|---|
| Total Market Opportunity | 15.84 |
| Total Revenue Loss From Cancellation | 0.11 |
| Total Revenue Realized (Delivered Items) | 15.42 |
From a business perspective, this is a strong operational signal:
A revenue leakage below 1% (0.69%) suggests effective order fulfillment and low cancellation impact
The gap between expected and realized revenue is minimal, indicating stable customer transactions and reliable logistics
Revenue loss from cancellations exists but is not a material risk at the aggregate level.
However, while the overall leakage percentage is low, this aggregate view can mask underlying patterns. A small total leakage may still be concentrated within specific order types, customer segments, or specific product categories, which could represent optimization opportunities.
This motivates a deeper, granular analysis, starting with a breakdown of orders by status.
Order Status Breakdown¶
order_status:¶reload_package()
Packages reloaded!!
breakdown_of_orders = revenue_leakage_metrics.get_breakdown_of_orders(conn=conn, sql_base_path=SQL_BASE_PATH)
breakdown_of_orders
| order_status | total_n_orders | |
|---|---|---|
| 0 | delivered | 96478 |
| 1 | shipped | 1107 |
| 2 | canceled | 625 |
| 3 | unavailable | 609 |
| 4 | invoiced | 314 |
| 5 | processing | 301 |
| 6 | created | 5 |
| 7 | approved | 2 |
total_n_orders = breakdown_of_orders[['total_n_orders']].sum()[0]
print(total_n_orders)
99441
reload_package()
Packages reloaded!!
Cancelled Orders without items:
cancelled_orders_without_items = revenue_leakage_metrics.get_break_down_on_cncl_orders(conn=conn,
sql_base_path=SQL_BASE_PATH)['without_items']
cancelled_orders_without_items
| canceled_orders_without_items | |
|---|---|
| 0 | 164 |
Cancelled Orders with items:
cancelled_orders_with_items = revenue_leakage_metrics.get_break_down_on_cncl_orders(conn=conn,
sql_base_path=SQL_BASE_PATH)['with_items']
cancelled_orders_with_items
| canceled_orders_with_items | |
|---|---|
| 0 | 461 |
Concatenating the two results:
cancelled_orders_granularity = pd.concat([cancelled_orders_with_items, cancelled_orders_without_items], axis=1).T
cancelled_orders_granularity = (cancelled_orders_granularity.reset_index()
.rename(columns = {'index': 'Cancelled Orders',
0: 'Total Count'}))
cancelled_orders_granularity
| Cancelled Orders | Total Count | |
|---|---|---|
| 0 | canceled_orders_with_items | 461 |
| 1 | canceled_orders_without_items | 164 |
order_status_breakdown_summary() function from revenue_leakage_summary Breakdown on Cancelled Orders section:revenue_leakage_summary.order_status_breakdown_summary(breakdown_of_orders,
total_n_orders,
cancelled_orders_with_items,
cancelled_orders_without_items)
The company had a total number of 99441 orders. The table below summarizes the break down of these orders.
| order_status | total_n_orders |
|---|---|
| delivered | 96478 |
| shipped | 1107 |
| canceled | 625 |
| unavailable | 609 |
| invoiced | 314 |
| processing | 301 |
| created | 5 |
| approved | 2 |
The company had a total of 625 canceled orders. If we further breakdown these canceled orders, there are two kinds of orders:
with-items: 461without-items: 164Which results to total canceled orders we have i.e.
Total canceled orders (625) = Total canceled orders with items (461) + Total canceled orders without items (164)
The goal is to look at different categories where the orders were canceled and find their contribution in the total revenue leakage:
# Get the total number of canceled orders:
cancelled_orders = breakdown_of_orders[breakdown_of_orders['order_status'] == 'canceled']['total_n_orders'].iloc[0]
# Get the canceled orders with items:
cancelled_orders_with_items_val = cancelled_orders_with_items.iloc[:, 0].values[0]
# Get the canceled orders without items:
cancelled_orders_without_items_val = cancelled_orders_without_items.iloc[:, 0].values[0]
# Get the revenue leakage percentage:
revenue_leakage_percentage = revenue_leakage_metrics.generate_revenue_leakage(revenue_baseline_metrics)
reload_package()
Packages reloaded!!
product level:¶# Cancled orders with their product category and their amount and contribution to the revenue leakage:
revenue_lost_per_prodct_cat = revenue_leakage_metrics.get_cncl_by_prod_cat_amount(conn=conn, sql_base_path=SQL_BASE_PATH)
# Top 10 product categories based on the revenue leakge out of all:
top_10_canceled_prod_category = revenue_lost_per_prodct_cat.head(10)
# Product category with the maximum amount lost due to order cancellation:
prod_cat_w_max_amount = (revenue_lost_per_prodct_cat[revenue_lost_per_prodct_cat['total_amount']
==
revenue_lost_per_prodct_cat['total_amount'].max()])
# Product category with the minimum amount lost due to order cancellation:
prod_cat_w_min_amount = (revenue_lost_per_prodct_cat[revenue_lost_per_prodct_cat['total_amount']
==
revenue_lost_per_prodct_cat['total_amount'].min()])
top_10_canceled_prod_category:¶reload_package()
Packages reloaded!!
fig = (revenue_leakage_plots.plot_top_10_product_categories_by_revenue_loss
(top_10_canceled_prod_category=top_10_canceled_prod_category))
category_level_revenue_leakage_summary() function from revenue_leakage_summary Category-Level Revenue Leakage section:reload_package()
Packages reloaded!!
revenue_leakage_summary.category_level_revenue_leakage_summary(revenue_lost_per_prodct_cat,
prod_cat_w_max_amount, prod_cat_w_min_amount,
cancelled_orders, revenue_leakage_percentage,
cancelled_orders_with_items_val,
cancelled_orders_without_items_val, fig)
We found out that there is a total of: 625 cancelled orders and a revenue leakage of: 0.69 %, if we drill down, there are two categories of order cancellations:
Total canceled orders with items (461):
These are the orders where a customer placed an order, the items were ready to be shipped but customer canceled the orders
at the end moment before the shipment started resulting to the creation of order_id in the order_items table.
Total canceled orders without items (164): These are the orders where a customer
placed an order but due to certain scenarios like payment failure or app crash, the orders never reached the item level
order_id resulting in the creation of order_id in the orders table only.
The table below lists the top 10 product categories whose orders were canceled and are displayed in
descending order w.r.t total_amount, where, the table consists
of product_category, total_amount, and revenue_leak_percentage
| product_category | total_amount | revenue_leak_percentage | |
|---|---|---|---|
| 0 | cool_stuff | 15153.48 | 14.311 |
| 1 | sports_leisure | 9411.85 | 8.889 |
| 2 | computers_accessories | 8330.82 | 7.868 |
| 3 | watches_gifts | 7249.14 | 6.846 |
| 4 | auto | 6731.72 | 6.358 |
| 5 | housewares | 6362.63 | 6.009 |
| 6 | garden_tools | 4694.01 | 4.433 |
| 7 | toys | 4589.26 | 4.334 |
| 8 | musical_instruments | 4116.05 | 3.887 |
| 9 | health_beauty | 3582.29 | 3.383 |
Where, the maxiumum values was:
| product_category | total_amount | revenue_leak_percentage | |
|---|---|---|---|
| 0 | cool_stuff | 15153.48 | 14.311 |
and the the minimum values was:
| product_category | total_amount | revenue_leak_percentage | |
|---|---|---|---|
| 46 | christmas_supplies | 20.93 | 0.02 |
To get the better idea, the chart below plots the top 10 product_categories, based
on their contribution in the total revenue leakage:
# Get the total number of order cancelations by product category:
number_of_cancellations_by_product_category = revenue_leakage_metrics.get_cncl_by_prod_cat_volume(conn=conn,
sql_base_path=SQL_BASE_PATH)
revenue_lost_per_prodct_cat.head(2)
| product_category | total_amount | revenue_leak_percentage | |
|---|---|---|---|
| 0 | cool_stuff | 15153.48 | 14.311 |
| 1 | sports_leisure | 9411.85 | 8.889 |
number_of_cancellations_by_product_category.head(7)
| product_category | total_n_orders | |
|---|---|---|
| 0 | sports_leisure | 51 |
| 1 | housewares | 49 |
| 2 | computers_accessories | 46 |
| 3 | health_beauty | 36 |
| 4 | furniture_decor | 36 |
| 5 | toys | 34 |
| 6 | auto | 30 |
print(f"There are a total of {number_of_cancellations_by_product_category.shape[0]} unique product categories.")
There are a total of 47 unique product categories.
number_of_cancellations_by_product_category[number_of_cancellations_by_product_category['product_category'] == 'cool_stuff']
| product_category | total_n_orders | |
|---|---|---|
| 12 | cool_stuff | 16 |
cncl_count_and_amount_by_prd_cat = pd.merge(revenue_lost_per_prodct_cat,
number_of_cancellations_by_product_category,
on='product_category',
how='inner')
cncl_count_and_amount_by_prd_cat.head(2)
| product_category | total_amount | revenue_leak_percentage | total_n_orders | |
|---|---|---|---|---|
| 0 | cool_stuff | 15153.48 | 14.311 | 16 |
| 1 | sports_leisure | 9411.85 | 8.889 | 51 |
Canceled Volume vs Revenue Loss:¶fig = revenue_leakage_plots.plot_cancelation_volume_vs_revenue_loss(cncl_count_and_amount_by_prd_cat)
reload_package()
Packages reloaded!!
category_lvl_vol_vs_rvn_summary() function from revenue_leakage_summary Category Level Comparison (Volume vs Revenue) section:revenue_leakage_summary.category_lvl_vol_vs_rvn_summary(fig, cncl_count_and_amount_by_prd_cat)
To further understand the drivers of revenue leakage, a category-level comparison was conducted between the number of cancelled orders and the total revenue lost per product category.
The scatter plot below visualizes this relationship:
Hypothesis 1:
High-priced or logistics-heavy categories will show disproportionately high revenue loss, even if cancellation volume is low.
Supported:
Hypothesis 2:
Some categories may have high cancellation rates but low revenue impact, indicating operational inefficiencies rather than financial risk.
Supported:
There was just one outlier or a product category that is marked in the scatter plot above, where the number of cancellations was low but the financial impact it had was high compared to all the products displayed below:
| product_category | total_amount | revenue_leak_percentage | total_n_orders | |
|---|---|---|---|---|
| 0 | cool_stuff | 15153.48 | 14.311 | 16 |
This analysis highlights the importance of evaluating both frequency and financial impact when addressing revenue leakage.
This analysis evaluates potential revenue leakage by comparing the company’s total market opportunity with realized revenue and identifying losses caused by order cancellations. The objective is to quantify the magnitude of revenue leakage, assess whether it represents a business risk, and determine whether revenue loss is driven primarily by cancellation volume or by high-value products with fewer cancellations.
1. Revenue Leakage Is Currently Low:
The company shows a relatively low level of revenue leakage. Out of a total market opportunity of R\$ 15.8M, the company has realized R\$ 15.42M, resulting in a revenue leakage of R\$ 0.11M (0.69%). This indicates efficient order fulfillment and a minimal gap between expected and realized revenue.
2. Order Fulfillment Efficiency Drives Revenue Realization:
Out of 99,441 total orders, the majority have been successfully delivered, indicating strong operational performance:
Delivered: 96,478Shipped: 1,107 Cancelled: 625 Unavailable: 609 Invoiced: 314 Processing: 301 Created / Approved: 7 Among the 625 cancelled orders, cancellations can be further split into:
This breakdown confirms that most cancellations involve products and therefore have direct revenue impact.
3. Revenue Leakage Is Concentrated in a Small Set of Product Categories:
Despite the overall low leakage, revenue loss is not evenly distributed across product categories. Out of 47 product categories, just 6 categories account for approximately 50% of total revenue leakage:
cool_stuff (14.31%)sports_leisure (8.88%)computers_accessories (7.86%)watches_gifts (6.84%)auto (6.35%)housewares (6.00%)This concentration suggests targeted opportunities for monitoring and mitigation.
4. Revenue Leakage Is Not Driven Solely by Cancellation Volume:
Revenue leakage is influenced more by product value than by the sheer number of cancellations. For example, cool_stuff contributes the highest revenue leakage (R$ 15,153.48; 14.31%) despite ranking only 13th in cancellation count (16 cancelled orders). In contrast, sports_leisure has the highest number of cancellations (51) but a lower revenue leakage percentage (8.88%). This indicates that high-value products can generate disproportionate revenue loss even with relatively few cancellations.
cool_stuff may help prevent disproportionate revenue impact from future cancellations.