type
status
date
slug
summary
tags
category
icon
password
General Data Cleaning Guide (E-commerce/Review Dataset)
✅ 1. Price & Numerical Fields Cleaning
Field Type | Common Issues | Cleaning Steps |
Price fields ( price , actual_price , discounted_price ) | Contain currency symbols ( ₹ , $ , etc.), commas, string format | Remove symbols and commas, convert to float |
Percentage fields ( discount_percentage ) | Include % symbol | Strip % , convert to float |
Rating fields ( rating ) | May contain strings, invalid entries | Convert to float , ensure values between 0–5 |
Count fields ( rating_count , review_count ) | Often in string format | Convert to int , handle missing or zero values |
✅ 2. Category Fields Cleaning
Field Type | Common Issues | Cleaning Steps |
Category ( category ) | Multi-level categories joined by ` | , >or /` |
Brand/Model ( brand , model ) | Inconsistent casing, missing values | Normalize case (e.g. lowercase), fill missing with Unknown |
✅ 3. Text & Review Fields Cleaning
Field Type | Common Issues | Cleaning Steps |
Review Content ( review_content ) | Extra whitespace, line breaks, HTML, symbols | Remove \n , \r , strip HTML tags, normalize case |
Review Title ( review_title ) | Very short/irrelevant titles | Set minimum length threshold or filter by keyword relevance |
Product Description ( about_product , features ) | Encoding issues, special characters | Clean non-visible characters, decode text properly |
✅ 4. Multi-Value Fields Expansion
Field Type | Common Issues | Cleaning Steps |
User Info ( user_id , user_name ) | Multiple values joined in one cell | Use str.split(',') + explode() to flatten |
Review IDs, Titles, Contents | Several reviews stored in a single row | Split and create one row per review |
Tags or Keywords | Multiple values in a string | Split and optionally explode into separate rows |
✅ 5. Link/Image Fields Cleaning
Field Type | Common Issues | Cleaning Steps |
Image Link ( img_link ) | Invalid URLs or placeholders | Keep only valid URLs starting with http |
Product Link ( product_link ) | Truncated or malformed | Check structure and completeness |
✅ 6. General Cleaning Steps
Task | Purpose |
Remove Duplicates | Use drop_duplicates() on keys like product_id + review_id |
Handle Missing Values | Fill with defaults ( fillna() ), or remove incomplete rows |
Convert Data Types | Use astype() or pd.to_numeric() for consistency |
Validate Logical Ranges | E.g., discounted_price should be < actual_price , rating between 0–5 |
Reset Index | Use reset_index(drop=True) after major operations |
📦 What You Get After Cleaning
- Structured numerical fields ready for analysis
- Clean, usable review texts for NLP tasks (sentiment analysis, keyword extraction)
- Exploded reviews for per-user analysis
- Standardized category and brand info for grouping or segmentation
📁 Example: Before vs. After Cleaning
Raw Data (Before) | Cleaned Data (After) |
₹1,099 | 1099.0 |
64% | 64.0 |
`Mobiles | Accessories |
R2AJM3LFTLZHFO,R6AQJGUP6P86 | Two rows with separate review IDs |
Great cable\nWorks fast | Great cable Works fast |
- Author:NotionNext
- URL:http://preview.tangly1024.com/article/1c6d698f-3512-814b-b7ef-c7151902d5fa
- Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!