Empirical Cumulative Distribution Function

What is an ECDF? (Empirical Cumulative Distribution Function)

The Empirical Cumulative Distribution Function (ECDF) describes the distribution of the values in a dataset and is usually plotted to visualize it. It answers this question:

“For a given value x, what proportion of the data is less than or equal to x?”

How does it work?

Sort your data from smallest to largest.

For each value x in the sorted list, compute:

ECDF(x) = (number of data points ≤ x) / (total number of data points)

Plot x on the X-axis and ECDF(x) on the Y-axis.
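The first two steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `ecdf` and the sample values are made up for the example and are not the `SalePerCustomer` data.

```python
from bisect import bisect_right

def ecdf(data):
    """Return (xs, ys): the sorted values and ECDF(x) for each of them."""
    xs = sorted(data)  # step 1: sort from smallest to largest
    n = len(xs)
    if n == 0:
        raise ValueError("ECDF is undefined for an empty dataset")
    # step 2: proportion of points <= x; bisect_right counts tied values too
    ys = [bisect_right(xs, x) / n for x in xs]
    return xs, ys

# Made-up sample, only to show the shape of the result
xs, ys = ecdf([3, 1, 4, 1, 5])
for x, y in zip(xs, ys):
    print(x, y)
```

For the plotting step, the `xs` and `ys` pairs can be passed to any step-style line plot, for example matplotlib's `plt.step(xs, ys, where="post")`.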

What does an ECDF plot show you?

The X-axis represents the value of the variable (e.g., sale per customer).

The Y-axis represents the proportion of data that is less than or equal to that value.

The curve never decreases: it rises in steps from 0 to 1 (or 0% to 100%).
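To make the axes concrete, the sketch below reads the height of the curve at a few chosen x positions. The helper `ecdf_at` and the sample values are hypothetical, chosen only for illustration.

```python
from bisect import bisect_right

def ecdf_at(sorted_values, x):
    """Proportion of values <= x. sorted_values must be sorted ascending."""
    return bisect_right(sorted_values, x) / len(sorted_values)

values = sorted([12, 7, 30, 7, 18])  # hypothetical sample
queries = [0, 7, 10, 18, 30, 100]    # positions on the X-axis
heights = [ecdf_at(values, q) for q in queries]  # matching Y-axis heights

for q, h in zip(queries, heights):
    print(f"ECDF({q}) = {h}")
```

Below the smallest value the height is 0, at or above the largest value it is 1, and between those it never goes down.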

Why use ECDFs instead of histograms?

ECDF	Histogram
Doesn’t need bins (no loss of detail)	Requires bins (may hide fine details)
Shows exact cumulative probabilities	Shows only counts/frequencies per bin
Good for comparing multiple datasets	Less precise for visual comparisons
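A small constructed case shows how bins can hide detail that the ECDF keeps. The two datasets, the bin edges, and the helper names below are all invented for this example.

```python
from bisect import bisect_right

def bin_counts(data, edges):
    """Count values falling in each half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in data:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

def ecdf_at(data, x):
    """Proportion of values in data that are <= x."""
    xs = sorted(data)
    return bisect_right(xs, x) / len(xs)

a = [1, 2, 3, 11]   # low values sit near the bottom of the first bin
b = [7, 8, 9, 19]   # low values sit near the top of the first bin
edges = [0, 10, 20]

hist_a = bin_counts(a, edges)
hist_b = bin_counts(b, edges)
ecdf_a5 = ecdf_at(a, 5)
ecdf_b5 = ecdf_at(b, 5)

print(hist_a, hist_b)    # identical bin counts
print(ecdf_a5, ecdf_b5)  # different proportions at x = 5
```

With these bin edges the two histograms are indistinguishable, while the ECDFs already differ at x = 5. Different bin edges would separate the histograms, which is exactly the dependence on binning that the ECDF avoids.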

How to interpret an ECDF curve

Steep curve → Data values are concentrated in a narrow range.

Flat regions → Gaps: no data points fall in that range of values.

Sudden jump → Multiple identical values at that point.

Curve reaches high proportions at small x → Most data are low in value.

Slow climb → Data are more spread out.
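Two of these visual cues, jumps and flat regions, can also be computed directly. The sketch below is one possible way to do it. The function names and the sample are hypothetical.

```python
from collections import Counter

def ecdf_steps(data):
    """Return [(value, jump_height)] for each distinct value, ascending.

    The jump at a value equals (number of points equal to it) / n,
    so repeated values produce taller jumps.
    """
    n = len(data)
    counts = Counter(data)
    return [(v, counts[v] / n) for v in sorted(counts)]

def widest_gap(data):
    """Return (left, right) bounding the widest range with no data points.

    Assumes at least two distinct values. The ECDF is flat between them.
    """
    distinct = sorted(set(data))
    return max(zip(distinct, distinct[1:]), key=lambda p: p[1] - p[0])

sample = [5, 5, 5, 6, 7, 20]  # made-up values
steps = ecdf_steps(sample)
gap = widest_gap(sample)

print(steps[0])  # the jump at 5 covers half the data
print(gap)       # the curve is flat from 7 up to 20
```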

Example (based on `SalePerCustomer` data):

This chart displays three ECDFs of the same Sale per Customer data, each drawn over a different x-axis range.

Interpretation of Each Plot:

1. Top Plot

The X-axis ranges from 0 to approximately 45,000.

The curve is stretched, with most data concentrated on the far left and a long tail on the right.

Interpretation: There are extreme outliers, that is, a small number of customers with very high average sales. These values occur rarely but push the maximum far to the right.

2. Middle Plot

The horizontal axis is limited to about 7,000.

The ECDF curve becomes steeper, indicating that most customers fall within the 0–3,000 range in terms of sale per customer.

Interpretation: This view better captures the core distribution of the dataset, minimizing the visual effect of extreme values.

3. Bottom Plot

The range is further narrowed to 0–70.

The curve is very steep, suggesting that the majority of values are tightly clustered in a small range, particularly below 30.

Observation: Roughly 75% of customers spent less than 30 per visit.
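An observation of this kind can be read off an ECDF in either direction: from a value to a proportion, or from a proportion back to a value. The sketch below shows both on invented numbers. The list `spend` is not the real `SalePerCustomer` data, and `ecdf_quantile` is a hypothetical helper.

```python
import math
from bisect import bisect_left, bisect_right

def ecdf_quantile(data, p):
    """Smallest value x in data with ECDF(x) >= p, for 0 < p <= 1.

    Sketch only: floating-point rounding in p * n can shift the index
    by one for some values of p.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    xs = sorted(data)
    k = math.ceil(p * len(xs))  # need at least k points <= x
    return xs[k - 1]

spend = [5, 8, 12, 15, 22, 28, 45, 60]  # made-up values
xs = sorted(spend)

share_below_30 = bisect_left(xs, 30) / len(xs)      # strictly less than 30
share_at_most_30 = bisect_right(xs, 30) / len(xs)   # ECDF(30), i.e. <= 30
q75 = ecdf_quantile(spend, 0.75)

print(share_below_30, share_at_most_30, q75)
```

The strict and non-strict proportions differ only when the threshold itself occurs in the data, which matters when many customers share the same value.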

Overall Conclusion:

The dataset contains a very small number of extremely high sale-per-customer values, which are outliers.

However, the vast majority of customers have low and tightly clustered spending patterns.

The ECDF, shown at multiple scales, reveals both the distribution concentration and the existence of rare, high-spending customers, offering deeper insight than simple statistics like mean or median.

In a sales analysis context, this suggests that a few high-value customers may account for a disproportionate share of revenue, while most customers are low-spending buyers.
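The remark about mean and median can be illustrated on a toy sample with one extreme value. The numbers are invented for the example and only mimic the shape described above.

```python
from bisect import bisect_right
from statistics import mean, median

# Made-up sample: nine low values and one extreme outlier
spend = [10, 12, 15, 18, 20, 22, 25, 28, 30, 5000]

avg = mean(spend)    # pulled far upward by the single outlier
mid = median(spend)  # stays with the bulk of the data

xs = sorted(spend)
share_at_or_below_mean = bisect_right(xs, avg) / len(xs)  # ECDF(mean)

print(avg, mid, share_at_or_below_mean)
```

Here the mean lies above 90% of the observations, which neither summary number reveals on its own but which is visible at a glance on the ECDF.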