IQR
Mathematics
May 4, 2020
May 4, 2025

What is IQR (Interquartile Range)?

IQR is a statistical tool used to measure the spread of the middle 50% of a dataset. Because it ignores extreme values on both ends, it gives a more robust picture of the data's typical spread.
Formula: IQR = Q3 - Q1
Where:
  • Q1 (25th percentile): 25% of the data falls below this value
  • Q3 (75th percentile): 75% of the data falls below this value
  • The middle 50% lies between Q1 and Q3

Retail Sales Example

Let’s say you're analyzing a retail dataset where you want to understand typical purchasing behavior based on Quantity—the number of items bought per transaction.
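The article's original sample values aren't reproduced here, so the snippet below uses a hypothetical Quantity column: eight small purchases plus one transaction of 100 items, chosen so its quartiles match the Q1 = 2 and Q3 = 6 implied by the outlier boundaries computed later.

```python
import pandas as pd

# Hypothetical transaction quantities: mostly small purchases, plus one bulk order of 100
df = pd.DataFrame({"Quantity": [3, 1, 6, 2, 100, 4, 2, 5, 6]})
print(df["Quantity"].tolist())  # → [3, 1, 6, 2, 100, 4, 2, 5, 6]
```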
At first glance, most transactions are small purchases. A value of 100 looks like an unusual bulk order—possibly:
  • A wholesale purchase
  • A data entry error (e.g., an extra zero)
  • A return mistakenly marked as a sale

Step-by-Step: Detect Outliers Using IQR

Step 1: Sort the data (internally handled by pandas during quantile calculation)
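No explicit sort is needed, since `quantile()` orders the data internally, but sorting a hypothetical sample (the article's actual values aren't shown here) makes the suspicious 100 easy to spot:

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({"Quantity": [3, 1, 6, 2, 100, 4, 2, 5, 6]})

# quantile() sorts internally; an explicit sort just makes the spread visible
print(df["Quantity"].sort_values().tolist())  # → [1, 2, 2, 3, 4, 5, 6, 6, 100]
```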


Step 2: Calculate Q1 and Q3
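The code for this step isn't reproduced in the source, so here is a minimal sketch with pandas, assuming a hypothetical Quantity column whose quartiles match the Q1 = 2 and Q3 = 6 implied by the boundaries in Step 4:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"Quantity": [3, 1, 6, 2, 100, 4, 2, 5, 6]})

Q1 = df["Quantity"].quantile(0.25)  # 25th percentile
Q3 = df["Quantity"].quantile(0.75)  # 75th percentile
print(Q1, Q3)  # → 2.0 6.0
```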

Output: Q1 = 2.0, Q3 = 6.0

Step 3: Compute IQR
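A sketch of this step, carrying forward the hypothetical sample from before (Q1 = 2, Q3 = 6):

```python
import pandas as pd

df = pd.DataFrame({"Quantity": [3, 1, 6, 2, 100, 4, 2, 5, 6]})
Q1 = df["Quantity"].quantile(0.25)
Q3 = df["Quantity"].quantile(0.75)

IQR = Q3 - Q1  # spread of the middle 50%
print(IQR)  # → 4.0
```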


Step 4: Define outlier boundaries
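The boundaries use the conventional 1.5 × IQR rule: anything more than 1.5 IQRs below Q1 or above Q3 is flagged. A sketch with the same hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"Quantity": [3, 1, 6, 2, 100, 4, 2, 5, 6]})
Q1 = df["Quantity"].quantile(0.25)
Q3 = df["Quantity"].quantile(0.75)
IQR = Q3 - Q1  # 4.0

# Conventional 1.5 * IQR fences
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(lower_bound, upper_bound)  # → -4.0 12.0
```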

This tells us:
  • Anything < -4 or > 12 is considered an outlier
  • These thresholds define a “normal” range for Quantity: [-4, 12]

Step 5: Identify Outliers
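Outliers are any rows falling outside the fences. A sketch using the hypothetical sample and the [-4, 12] boundaries:

```python
import pandas as pd

df = pd.DataFrame({"Quantity": [3, 1, 6, 2, 100, 4, 2, 5, 6]})
Q1 = df["Quantity"].quantile(0.25)
Q3 = df["Quantity"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR  # -4.0
upper_bound = Q3 + 1.5 * IQR  # 12.0

# Boolean mask: rows outside the normal range
outliers = df[(df["Quantity"] < lower_bound) | (df["Quantity"] > upper_bound)]
print(outliers["Quantity"].tolist())  # → [100]
```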

Output:
All values except 100 are normal
100 is an outlier

Handling Outliers in Practice

Option 1: Remove the outlier

Use this if you're doing analysis that should ignore extreme cases (e.g., understanding “typical” customer behavior).
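A minimal sketch, again using hypothetical sample data since the source's code isn't reproduced:

```python
import pandas as pd

df = pd.DataFrame({"Quantity": [3, 1, 6, 2, 100, 4, 2, 5, 6]})
Q1, Q3 = df["Quantity"].quantile(0.25), df["Quantity"].quantile(0.75)
IQR = Q3 - Q1
lower_bound, upper_bound = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Keep only rows inside the normal range [-4, 12]
df_clean = df[(df["Quantity"] >= lower_bound) & (df["Quantity"] <= upper_bound)]
print(df_clean["Quantity"].tolist())  # → [3, 1, 6, 2, 4, 2, 5, 6]
```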

Option 2: Cap the outlier (Winsorizing)

Use this if you don’t want to lose data but want to limit the effect of extreme values on your models (e.g., for linear regression or clustering).
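Capping replaces extreme values with the boundary values instead of dropping the rows. A sketch with `Series.clip` on the same hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"Quantity": [3, 1, 6, 2, 100, 4, 2, 5, 6]})
Q1, Q3 = df["Quantity"].quantile(0.25), df["Quantity"].quantile(0.75)
IQR = Q3 - Q1
lower_bound, upper_bound = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Winsorize: pull values outside [-4, 12] back to the boundary
df["Quantity"] = df["Quantity"].clip(lower=lower_bound, upper=upper_bound)
print(df["Quantity"].max())  # → 12.0
```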

Option 3: Mark the outlier for review

Use this if you want to keep the data but treat it differently in your downstream tasks or reporting.
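A sketch that keeps every row but adds a hypothetical `is_outlier` flag column for downstream filtering or reporting:

```python
import pandas as pd

df = pd.DataFrame({"Quantity": [3, 1, 6, 2, 100, 4, 2, 5, 6]})
Q1, Q3 = df["Quantity"].quantile(0.25), df["Quantity"].quantile(0.75)
IQR = Q3 - Q1
lower_bound, upper_bound = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Flag instead of removing: downstream code decides how to treat flagged rows
df["is_outlier"] = (df["Quantity"] < lower_bound) | (df["Quantity"] > upper_bound)
print(df[df["is_outlier"]]["Quantity"].tolist())  # → [100]
```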

Why Use IQR in Sales Data?

Advantages:

  • Robust to extreme values: ignores very large/small outliers that could be typos or rare cases
  • No need for normal distribution: works on skewed data like retail sales (often right-skewed)
  • Easy to implement: just a few lines of code in Python or Excel
  • Protects business logic: avoids decisions based on rare, unrealistic transactions

Real-World Sales Use Cases:

  1. Customer Segmentation: Prevent big one-time purchases from distorting average behavior.
  2. Forecasting Demand: Exclude extreme sales spikes that are due to promotions or errors.
  3. Data Quality Checks: Flag potential manual entry mistakes (e.g., "10000 units of pens").
 