Empirical Cumulative Distribution Function
Apr 30, 2020
Apr 30, 2025

What is an ECDF? (Empirical Cumulative Distribution Function)

The Empirical Cumulative Distribution Function (ECDF) describes how the values in a dataset are distributed, and plotting it is a way to visualize that distribution. It answers this question:
“For a given value x, what proportion of the data is less than or equal to x?”

How does it work?

  1. Sort your data from smallest to largest.
  2. For each value x in the sorted list, compute ECDF(x) = (number of data points ≤ x) / n, where n is the total number of data points.
  3. Plot x on the X-axis and ECDF(x) on the Y-axis.
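
The three steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `ecdf` and `ecdf_at` and the sample values are made up for this example.

```python
from bisect import bisect_right

def ecdf(data):
    """Return (xs, ys): the sorted values and the proportion of data <= each value."""
    xs = sorted(data)                      # step 1: sort
    n = len(xs)
    # step 2: bisect_right counts how many sorted values are <= x,
    # so tied values share the same height
    ys = [bisect_right(xs, x) / n for x in xs]
    return xs, ys                          # step 3: plot xs against ys

def ecdf_at(data, x):
    """Proportion of data less than or equal to an arbitrary x."""
    xs = sorted(data)
    return bisect_right(xs, x) / len(xs)

sample = [12, 5, 30, 5, 18]                # made-up values for illustration
xs, ys = ecdf(sample)
print(xs)                                  # [5, 5, 12, 18, 30]
print(ys)                                  # [0.4, 0.4, 0.6, 0.8, 1.0]
print(ecdf_at(sample, 12))                 # 0.6
```

To draw the curve, `xs` and `ys` can be passed to any plotting library as a step plot.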

What does an ECDF plot show you?

  • The X-axis represents the value of the variable (e.g., sale per customer).
  • The Y-axis represents the proportion of data that is less than or equal to that value.
  • The curve is a step function that never decreases: it rises from 0 to 1 (or 0% to 100%).

Why use ECDFs instead of histograms?

ECDF | Histogram
Doesn’t need bins (no loss of detail) | Requires bins (may hide fine details)
Shows exact cumulative probabilities | Shows only counts/frequencies per bin
Good for comparing multiple datasets | Less precise for visual comparisons
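
To make the binning point concrete, here is a small sketch with two made-up samples and a hypothetical choice of bin edges. With these bins the two samples produce identical histogram counts, while their ECDFs differ. The helper names `bin_counts` and `ecdf_at` are assumptions for this illustration.

```python
from bisect import bisect_right

def bin_counts(data, edges):
    """Count values falling in each half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in data:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

def ecdf_at(data, x):
    """Proportion of data less than or equal to x."""
    xs = sorted(data)
    return bisect_right(xs, x) / len(xs)

a = [1, 2, 3, 11, 12, 13]      # made-up sample A
b = [7, 8, 9, 17, 18, 19]      # made-up sample B
edges = [0, 10, 20]            # hypothetical bin edges

print(bin_counts(a, edges), bin_counts(b, edges))  # [3, 3] [3, 3]
print(ecdf_at(a, 5), ecdf_at(b, 5))                # 0.5 0.0
```

Narrower bins would separate the two samples in a histogram as well; the ECDF simply does not require that choice.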

How to interpret an ECDF curve

  • Steep curve → Data values are concentrated in a narrow range.
  • Flat regions → Gaps in the data (no observations fall in that range).
  • Sudden jump → Multiple identical values at that point.
  • High proportions reached at small x values → Most data are low in value.
  • Slow climb → Data are more spread out.
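
The jump and flat-region patterns above can be checked numerically. The sketch below uses a made-up sample with one heavily repeated value and one wide gap; `ecdf_steps` is a hypothetical helper written for this example.

```python
from collections import Counter

def ecdf_steps(data):
    """Return (value, jump_size, height_after_jump) for each distinct value."""
    n = len(data)
    counts = Counter(data)
    steps, cumulative = [], 0
    for value in sorted(counts):
        cumulative += counts[value]
        steps.append((value, counts[value] / n, cumulative / n))
    return steps

sample = [10, 10, 10, 10, 11, 12, 40, 41]   # made-up values
steps = ecdf_steps(sample)

# A repeated value shows up as one large jump: 10 occurs 4 times out of 8.
print(steps[0])                              # (10, 0.5, 0.5)

# A flat region is a stretch of x with no data: (width, start, end) of each gap.
gaps = [(b[0] - a[0], a[0], b[0]) for a, b in zip(steps, steps[1:])]
print(max(gaps))                             # (28, 12, 40)
```

Between 12 and 40 the ECDF stays at 0.75, which is the flat region a plot would show.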

Example (based on SalePerCustomer data):

[Figure: three ECDF plots of Sale per Customer]

This chart displays three ECDFs, each showing the cumulative distribution of Sale per Customer over a different range of values.

Interpretation of Each Plot:

1. Top Plot

  • The X-axis ranges from 0 to approximately 45,000.
  • The curve is stretched, with most data concentrated on the far left and a long tail on the right.
  • Interpretation: There are extreme outliers: a small number of customers with very high average sales. These values occur rarely but inflate the maximum significantly.

2. Middle Plot

  • The horizontal axis is limited to about 7,000.
  • The ECDF curve becomes steeper, indicating that most customers fall within the 0–3,000 range in terms of sale per customer.
  • Interpretation: This view better captures the core distribution of the dataset, minimizing the visual effect of extreme values.

3. Bottom Plot

  • The range is further narrowed to 0–70.
  • The curve is very steep, suggesting that the majority of values are tightly clustered in a small range, particularly below 30.
  • Observation: Roughly 75% of customers spent less than 30 per visit.
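
A reading such as "roughly 75% below 30" can also be computed directly instead of being read off the plot. The sketch below uses made-up numbers, not the actual SalePerCustomer data, and the helper names are assumptions for this illustration.

```python
from bisect import bisect_left
import math

def proportion_below(sorted_values, threshold):
    """Share of values strictly less than threshold."""
    return bisect_left(sorted_values, threshold) / len(sorted_values)

def value_at_proportion(sorted_values, p):
    """Smallest data value x with ECDF(x) >= p, for 0 < p <= 1."""
    k = math.ceil(p * len(sorted_values))
    return sorted_values[k - 1]

# Hypothetical spend values: mostly small, with a few extreme ones.
spend = sorted([8, 12, 15, 19, 22, 24, 25, 27, 29, 60, 900, 45000])

print(proportion_below(spend, 30))       # 0.75
print(value_at_proportion(spend, 0.75))  # 29
```

Narrowing the x-axis of a plot changes only which part of the curve is visible; the proportions themselves are still computed over the full dataset.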

Overall Conclusion:

  • The dataset contains a very small number of extremely high sale-per-customer values, which are outliers.
  • However, the vast majority of customers have low and tightly clustered spending patterns.
  • The ECDF, shown at multiple scales, reveals both the distribution concentration and the existence of rare, high-spending customers, offering deeper insight than simple statistics like mean or median.
  • In a sales analysis context, this suggests that a few high-value customers drive disproportionate revenue, while most customers are infrequent or low-spending buyers.
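
As a rough illustration of why a single summary statistic can mislead here, the sketch below compares the mean and median of a made-up, heavily skewed sample. The numbers are hypothetical and are not taken from the SalePerCustomer data.

```python
from statistics import mean, median

# Hypothetical spend values: mostly small, with a few extreme ones.
spend = [8, 12, 15, 19, 22, 24, 25, 27, 29, 60, 900, 45000]

avg = mean(spend)                  # pulled far upward by the largest values
mid = median(spend)                # stays with the bulk of the data
share_below_mean = sum(v < avg for v in spend) / len(spend)
top_share = max(spend) / sum(spend)  # share of the total from the single largest value

print(round(avg, 1), mid)          # 3845.1 24.5
print(round(share_below_mean, 3))  # 0.917
print(round(top_share, 3))         # 0.975
```

In this made-up sample, 11 of 12 values lie below the mean, so the mean describes almost none of them, and the median alone gives no hint of the extreme values. The ECDF shows both at once.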