Lazy loaded image
technology
5️⃣General Data Cleaning Guide (E-commerce/Review Dataset)
Words 81Read Time 1 min
Jul 2, 2021
Apr 6, 2025
type
status
date
slug
summary
tags
category
icon
password

General Data Cleaning Guide (E-commerce/Review Dataset)


✅ 1. Price & Numerical Fields Cleaning

Field Type
Common Issues
Cleaning Steps
Price fields (price, actual_price, discounted_price)
Contain currency symbols (, $, etc.), commas, string format
Remove symbols and commas, convert to float
Percentage fields (discount_percentage)
Include % symbol
Strip %, convert to float
Rating fields (rating)
May contain strings, invalid entries
Convert to float, ensure values between 0–5
Count fields (rating_count, review_count)
Often in string format
Convert to int, handle missing or zero values

✅ 2. Category Fields Cleaning

Field Type
Common Issues
Cleaning Steps
Category (category)
Multi-level categories joined by `
, >or/`
Brand/Model (brand, model)
Inconsistent casing, missing values
Normalize case (e.g. lowercase), fill missing with Unknown

✅ 3. Text & Review Fields Cleaning

Field Type
Common Issues
Cleaning Steps
Review Content (review_content)
Extra whitespace, line breaks, HTML, symbols
Remove \n, \r, strip HTML tags, normalize case
Review Title (review_title)
Very short/irrelevant titles
Set minimum length threshold or filter by keyword relevance
Product Description (about_product, features)
Encoding issues, special characters
Clean non-visible characters, decode text properly

✅ 4. Multi-Value Fields Expansion

Field Type
Common Issues
Cleaning Steps
User Info (user_id, user_name)
Multiple values joined in one cell
Use str.split(',') + explode() to flatten
Review IDs, Titles, Contents
Several reviews stored in a single row
Split and create one row per review
Tags or Keywords
Multiple values in a string
Split and optionally explode into separate rows

✅ 5. Link/Image Fields Cleaning

Field Type
Common Issues
Cleaning Steps
Image Link (img_link)
Invalid URLs or placeholders
Keep only valid URLs starting with http
Product Link (product_link)
Truncated or malformed
Check structure and completeness

✅ 6. General Cleaning Steps

Task
Purpose
Remove Duplicates
Use drop_duplicates() on keys like product_id + review_id
Handle Missing Values
Fill with defaults (fillna()), or remove incomplete rows
Convert Data Types
Use astype() or pd.to_numeric() for consistency
Validate Logical Ranges
E.g., discounted_price should be < actual_price, rating between 0–5
Reset Index
Use reset_index(drop=True) after major operations

📦 What You Get After Cleaning

  • Structured numerical fields ready for analysis
  • Clean, usable review texts for NLP tasks (sentiment analysis, keyword extraction)
  • Exploded reviews for per-user analysis
  • Standardized category and brand info for grouping or segmentation

📁 Example: Before vs. After Cleaning

Raw Data (Before)
Cleaned Data (After)
₹1,099
1099.0
64%
64.0
`Mobiles
Accessories
R2AJM3LFTLZHFO,R6AQJGUP6P86
Two rows with separate review IDs
Great cable\nWorks fast
Great cable Works fast
上一篇
绑定成功测试
下一篇
Data Visualization