What is IQR (Interquartile Range)?
IQR is a statistical measure of the spread of the middle 50% of a dataset. Because it ignores extreme values on both ends, it gives a more robust picture of the data's variability than the full range or the standard deviation.
Formula: IQR = Q3 - Q1
Where:
- Q1 (25th percentile): 25% of the data falls below this value
- Q3 (75th percentile): 75% of the data falls below this value
- The middle 50% lies between Q1 and Q3
Retail Sales Example
Let’s say you're analyzing a retail dataset where you want to understand typical purchasing behavior based on Quantity, the number of items bought per transaction.
At first glance, most transactions are small purchases. The value 100 looks like an unusual bulk order, possibly:
- A wholesale purchase
- A data entry error (e.g., extra zero)
- A return mistakenly marked as a sale
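The quantities themselves are not reproduced in this copy of the article. As a stand-in, the sketch below builds a hypothetical sample (the nine values are an assumption, chosen so that they contain a single bulk order of 100 and reproduce the [-4, 12] range used in the steps that follow).

```python
import pandas as pd

# Hypothetical sample: eight small purchases and one suspicious bulk order.
# These values are an assumption, not the article's original data.
df = pd.DataFrame({"Quantity": [1, 2, 2, 3, 4, 5, 6, 7, 100]})

# Quick look at the distribution: the max sits far above the median
print(df["Quantity"].describe())
```

With this sample the median is 4 while the maximum is 100, which is the kind of gap that motivates an outlier check.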
Step-by-Step: Detect Outliers Using IQR
Step 1: Sort the data (internally handled by pandas during quantile calculation)
Step 2: Calculate Q1 and Q3
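A minimal sketch of this step, assuming the data lives in a pandas Series named `quantity` (the name and the sample values are hypothetical). `Series.quantile` sorts internally and uses linear interpolation by default.

```python
import pandas as pd

# Hypothetical sample values (assumption, not the article's original data)
quantity = pd.Series([1, 2, 2, 3, 4, 5, 6, 7, 100], name="Quantity")

# 25th and 75th percentiles
q1 = quantity.quantile(0.25)
q3 = quantity.quantile(0.75)

print(f"Q1 = {q1}, Q3 = {q3}")  # prints: Q1 = 2.0, Q3 = 6.0
```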
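Output: Q1 = 2, Q3 = 6 (the values implied by the [-4, 12] range in Step 4, assuming the usual 1.5 × IQR rule)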
Step 3: Compute IQR
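The IQR is simply the difference between the two quartiles. A short sketch, again on a hypothetical sample:

```python
import pandas as pd

# Hypothetical sample values (assumption, not the article's original data)
quantity = pd.Series([1, 2, 2, 3, 4, 5, 6, 7, 100], name="Quantity")

q1 = quantity.quantile(0.25)
q3 = quantity.quantile(0.75)

# Width of the middle 50% of the data
iqr = q3 - q1
print(f"IQR = {iqr}")  # prints: IQR = 4.0
```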
Step 4: Define outlier boundaries
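The boundaries are usually placed 1.5 × IQR below Q1 and above Q3. The multiplier 1.5 is the conventional choice and is assumed here, since it is consistent with the [-4, 12] range quoted in this step. The sample values and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample values (assumption, not the article's original data)
quantity = pd.Series([1, 2, 2, 3, 4, 5, 6, 7, 100], name="Quantity")

q1 = quantity.quantile(0.25)
q3 = quantity.quantile(0.75)
iqr = q3 - q1

k = 1.5  # conventional multiplier; larger values flag fewer points
lower_bound = q1 - k * iqr
upper_bound = q3 + k * iqr

print(f"Lower bound = {lower_bound}, Upper bound = {upper_bound}")
# prints: Lower bound = -4.0, Upper bound = 12.0
```

Since a quantity cannot be negative, the lower bound of -4 never triggers in practice; only the upper bound matters for this column.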
This tells us:
- Anything < -4 or > 12 is considered an outlier
- These thresholds define a “normal” range for Quantity: [-4, 12]
Step 5: Identify Outliers
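One way to do this is a boolean mask that marks every row outside the boundaries. The sketch below assumes a DataFrame `df` with a `Quantity` column and uses hypothetical sample values.

```python
import pandas as pd

# Hypothetical sample values (assumption, not the article's original data)
df = pd.DataFrame({"Quantity": [1, 2, 2, 3, 4, 5, 6, 7, 100]})

q1 = df["Quantity"].quantile(0.25)
q3 = df["Quantity"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# True for rows that fall outside [lower_bound, upper_bound]
is_outlier = (df["Quantity"] < lower_bound) | (df["Quantity"] > upper_bound)

outliers = df[is_outlier]
print(outliers)
```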
Output:
- All values except 100 are normal
- 100 is an outlier
Handling Outliers in Practice
Option 1: Remove the outlier
Use this if you're doing analysis that should ignore extreme cases (e.g., understanding “typical” customer behavior).
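A minimal sketch of removal, assuming the same hypothetical sample and the 1.5 × IQR boundaries. `Series.between` is inclusive on both ends by default, so values exactly on a boundary are kept.

```python
import pandas as pd

# Hypothetical sample values (assumption, not the article's original data)
df = pd.DataFrame({"Quantity": [1, 2, 2, 3, 4, 5, 6, 7, 100]})

q1 = df["Quantity"].quantile(0.25)
q3 = df["Quantity"].quantile(0.75)
iqr = q3 - q1
lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows inside the "normal" range; the original df is left untouched
df_clean = df[df["Quantity"].between(lower_bound, upper_bound)]

print(len(df), "->", len(df_clean))  # prints: 9 -> 8
```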
Option 2: Cap the outlier (Winsorizing)
Use this if you don’t want to lose data but want to limit the effect of extreme values on your models (e.g., for linear regression or clustering).
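Capping can be sketched with `Series.clip`, which replaces anything beyond a boundary with the boundary itself. Note the assumption: this clips at the IQR boundaries, whereas winsorizing in the strict sense clips at chosen percentiles. The sample values and the `Quantity_capped` column name are hypothetical.

```python
import pandas as pd

# Hypothetical sample values (assumption, not the article's original data)
df = pd.DataFrame({"Quantity": [1, 2, 2, 3, 4, 5, 6, 7, 100]})

q1 = df["Quantity"].quantile(0.25)
q3 = df["Quantity"].quantile(0.75)
iqr = q3 - q1
lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the range are pulled back to the nearest boundary
df["Quantity_capped"] = df["Quantity"].clip(lower=lower_bound, upper=upper_bound)

print(df)
```

Here the order of 100 becomes 12 in the capped column, while the other eight rows are unchanged and the original column is preserved for reference.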
Option 3: Mark the outlier for review
Use this if you want to keep the data but treat it differently in your downstream tasks or reporting.
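Flagging can be as simple as adding a boolean column. The sketch below assumes the same hypothetical sample; the `is_outlier` column name is illustrative.

```python
import pandas as pd

# Hypothetical sample values (assumption, not the article's original data)
df = pd.DataFrame({"Quantity": [1, 2, 2, 3, 4, 5, 6, 7, 100]})

q1 = df["Quantity"].quantile(0.25)
q3 = df["Quantity"].quantile(0.75)
iqr = q3 - q1
lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Mark rows outside the normal range instead of changing or dropping them
df["is_outlier"] = ~df["Quantity"].between(lower_bound, upper_bound)

# Downstream code can then filter or report on the flag
print(df[df["is_outlier"]])
```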
Why Use IQR in Sales Data?
Advantages:
| Feature | Why It’s Useful |
| --- | --- |
| Robust to extreme values | Ignores very large/small outliers that could be typos or rare cases |
| No need for normal distribution | Works on skewed data like retail sales (which often are right-skewed) |
| Easy to implement | Just a few lines of code in Python or Excel |
| Protects business logic | Avoids decisions based on rare, unrealistic transactions |
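The robustness claim can be checked with a small experiment: make the extreme value far more extreme and see which statistics move. The two samples and the `iqr_of` helper below are hypothetical and only serve as an illustration.

```python
import pandas as pd

def iqr_of(values):
    """Return Q3 - Q1 for a list of numbers (illustrative helper)."""
    s = pd.Series(values)
    return s.quantile(0.75) - s.quantile(0.25)

# Same hypothetical sample, differing only in how extreme the last value is
base = [1, 2, 2, 3, 4, 5, 6, 7, 100]
extreme = [1, 2, 2, 3, 4, 5, 6, 7, 10000]

iqr_base, iqr_extreme = iqr_of(base), iqr_of(extreme)
std_base, std_extreme = pd.Series(base).std(), pd.Series(extreme).std()

print(f"IQR: {iqr_base} vs {iqr_extreme}")
print(f"Std: {std_base:.1f} vs {std_extreme:.1f}")
```

The IQR is 4.0 in both cases because the quartiles depend only on the middle of the sorted data, while the standard deviation grows by roughly two orders of magnitude.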
Real-World Sales Use Cases:
- Customer Segmentation: Prevent big one-time purchases from distorting average behavior.
- Forecasting Demand: Exclude extreme sales spikes that are due to promotions or errors.
- Data Quality Checks: Flag potential manual entry mistakes (e.g., "10000 units of pens").
- Author: Entropyobserver
- URL: https://tangly1024.com/article/1e9d698f-3512-803b-a313-cb08355fe17a
- Copyright: Unless otherwise stated, all articles in this blog are licensed under BY-NC-SA. Please credit the source.