What is an ECDF? (Empirical Cumulative Distribution Function)
The Empirical Cumulative Distribution Function (ECDF) is a way to visualize the distribution of data points in a dataset. It answers this question:
“For a given value x, what proportion of the data is less than or equal to x?”
How does it work?
- Sort your data from smallest to largest.
- For each value x in the sorted list, compute: ECDF(x) = (number of data points ≤ x) / n, where n is the total number of data points.
- Plot x on the X-axis and ECDF(x) on the Y-axis.
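The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library. The function names `ecdf` and `ecdf_points` and the sample values are made up for this example and do not come from the dataset discussed later.

```python
from bisect import bisect_right


def ecdf(data):
    """Return a function F where F(x) is the share of data points <= x."""
    ordered = sorted(data)  # step 1: sort from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("ecdf needs at least one data point")

    def F(x):
        # bisect_right gives the count of sorted values that are <= x
        return bisect_right(ordered, x) / n

    return F


def ecdf_points(data):
    """Return (x, ECDF(x)) pairs for each distinct value, ready for plotting."""
    F = ecdf(data)
    return [(x, F(x)) for x in sorted(set(data))]


sample = [12, 7, 7, 30, 18]  # made-up values for illustration
F = ecdf(sample)
print(F(7))                  # 0.4, because 2 of the 5 values are <= 7
print(ecdf_points(sample))   # [(7, 0.4), (12, 0.6), (18, 0.8), (30, 1.0)]
```

The pairs returned by `ecdf_points` are the corners of the step curve: x goes on the X-axis and the proportion on the Y-axis.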
What does an ECDF plot show you?
- The X-axis represents the value of the variable (e.g., sale per customer).
- The Y-axis represents the proportion of data that is less than or equal to that value.
- The curve always rises from 0 to 1 (or 0% to 100%).
Why use ECDFs instead of histograms?
| ECDF | Histogram |
| --- | --- |
| Doesn’t need bins (no loss of detail) | Requires bins (may hide fine details) |
| Shows exact cumulative probabilities | Shows only counts/frequencies per bin |
| Good for comparing multiple datasets | Less precise for visual comparisons |
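To make the binning point concrete, here is a small sketch with two made-up datasets that produce identical histogram counts at a bin width of 10 but clearly different ECDF values. The helper names `bin_counts` and `ecdf_at`, the data, and the bin width are all assumptions chosen for illustration.

```python
from bisect import bisect_right
from collections import Counter


def bin_counts(data, width):
    """Histogram counts keyed by the left edge of each fixed-width bin."""
    return Counter(int(x // width) * width for x in data)


def ecdf_at(data, x):
    """Proportion of data points that are <= x."""
    ordered = sorted(data)
    return bisect_right(ordered, x) / len(ordered)


a = [1, 2, 3, 11, 12]    # made-up: values sit at the low end of each bin
b = [8, 9, 9.5, 18, 19]  # made-up: values sit at the high end of each bin

# With bins [0, 10) and [10, 20) both datasets have counts 3 and 2.
print(bin_counts(a, 10) == bin_counts(b, 10))  # True

# The ECDF still separates them: 60% of a is <= 5, but none of b is.
print(ecdf_at(a, 5), ecdf_at(b, 5))            # 0.6 0.0
```

A different bin width would reveal the difference in the histogram too, but that choice has to be made by hand. The ECDF needs no such choice.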
How to interpret an ECDF curve
- Steep curve → Data values are concentrated in a narrow range.
- Flat regions → Gaps in the data (no observations fall in that range).
- Sudden jump → Multiple identical values at that point.
- Curve reaches high proportions quickly → Most data are low in value.
- Slow climb → Data are more spread out.
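The jump and flat-region rules can be checked numerically. The sketch below uses a made-up list with four identical values and an empty stretch between 3 and 9. The helper name `ecdf_at` and the data are assumptions for illustration only.

```python
from bisect import bisect_right


def ecdf_at(data, x):
    """Proportion of data points that are <= x."""
    ordered = sorted(data)
    return bisect_right(ordered, x) / len(ordered)


# Made-up data: the value 2 occurs four times, and nothing lies between 3 and 9.
values = [1, 2, 2, 2, 2, 3, 9, 10]

# Sudden jump: the curve rises by 4/8 at x = 2 because of the repeated value.
jump_at_2 = ecdf_at(values, 2) - ecdf_at(values, 1)
print(jump_at_2)     # 0.5

# Flat region: the curve does not move between x = 3 and x = 8.
flat_3_to_8 = ecdf_at(values, 8) - ecdf_at(values, 3)
print(flat_3_to_8)   # 0.0
```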
Example (based on SalePerCustomer data):

This chart displays three ECDFs (Empirical Cumulative Distribution Functions), each showing the cumulative distribution of Sale per Customer across different value ranges.
Interpretation of Each Plot:
1. Top Plot
- The X-axis ranges from 0 to approximately 45,000.
- The curve is stretched, with most data concentrated on the far left and a long tail on the right.
- Interpretation: There are extreme outliers — a small number of customers with very high average sales. These values occur rarely but inflate the maximum significantly.
2. Middle Plot
- The horizontal axis is limited to about 7,000.
- With the extreme values cut off, the ECDF curve looks steeper, showing that most customers fall within the 0–3,000 range in terms of sale per customer.
- Interpretation: This view better captures the core distribution of the dataset, minimizing the visual effect of extreme values.
3. Bottom Plot
- The range is further narrowed to 0–70.
- The curve is very steep, suggesting that the majority of values are tightly clustered in a small range, particularly below 30.
- Observation: Roughly 75% of sale-per-customer values fall below 30.
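Zooming the X-axis does not change the ECDF itself. Each view simply reads the same curve up to a different cutoff. The sketch below shows this with made-up, heavy-tailed values. They are not the SalePerCustomer data and were chosen only so that the proportions resemble the pattern described above. The helper name `share_at_or_below` is also an assumption.

```python
from bisect import bisect_right
from statistics import mean, median

# Made-up values with a long right tail; NOT the article's dataset.
values = [10] * 40 + [25] * 35 + [500] * 15 + [3000] * 8 + [45000] * 2

ordered = sorted(values)


def share_at_or_below(x):
    """ECDF value at x: fraction of observations that are <= x."""
    return bisect_right(ordered, x) / len(ordered)


# Each x-axis limit acts as a cutoff on the same underlying ECDF.
for limit in (45000, 7000, 70, 30):
    print(f"share <= {limit}: {share_at_or_below(limit):.2f}")

# A handful of extreme values pulls the mean far above the median.
print("mean:", mean(values), "median:", median(values))
```

On this synthetic data, 75% of the values are at or below 30 and 98% are at or below 7,000, while the mean (1227.75) sits far above the median (25). Reading the ECDF at several cutoffs shows both facts at once.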
Overall Conclusion:
- The dataset contains a very small number of extremely high sale-per-customer values, which are outliers.
- However, the vast majority of customers have low and tightly clustered spending patterns.
- The ECDF, shown at multiple scales, reveals both the distribution concentration and the existence of rare, high-spending customers, offering deeper insight than simple statistics like mean or median.
- In a sales analysis context, this suggests that a few high-value customers may account for a disproportionate share of revenue, while most customers are low-spending buyers.
- Author: Entropyobserver
- URL: https://tangly1024.com/article/1e5d698f-3512-804a-9115-d3e4d4cba99c
- Copyright: All articles in this blog, unless otherwise stated, are licensed under the BY-NC-SA agreement. Please indicate the source!