Sklearn Workflow

1. Data Loading and Preprocessing

  • Data preprocessing includes cleaning the data, feature extraction, handling missing values, encoding categorical variables, etc.
    • For example, text data may need to be converted into feature vectors using techniques like CountVectorizer or TfidfVectorizer.
    • For numerical data, you may need to standardize features (using StandardScaler) or encode categorical variables (using OneHotEncoder).

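A quick sketch of these preprocessing tools (the toy data here is made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Text -> TF-IDF feature vectors
texts = ["cheap pills now", "meeting at noon", "win a free prize"]
X_text = TfidfVectorizer().fit_transform(texts)   # sparse matrix, one row per document

# Numerical features -> zero mean, unit variance
X_num = StandardScaler().fit_transform([[1.0], [2.0], [3.0]])

# Categorical features -> one-hot vectors
X_cat = OneHotEncoder().fit_transform([["red"], ["blue"], ["red"]])
```
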
2. Model Selection

  • Choose the appropriate model from sklearn based on your task (classification, regression, etc.).
    • For classification tasks, you can use models like LogisticRegression, RandomForestClassifier, SGDClassifier, etc.
    • For regression tasks, models like LinearRegression, SVR, etc., can be used.

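Model selection is just instantiating the estimator class that matches the task, for example:

```python
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVR

clf = LogisticRegression(max_iter=1000)          # classification
# clf = RandomForestClassifier(n_estimators=100)
reg = LinearRegression()                         # regression
# reg = SVR(kernel="rbf")
```
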
3. Training the Model

  • Splitting the dataset: The dataset is usually split into training and test sets (e.g., using train_test_split()).
  • Training the model: call the fit() method on the training data to learn the model's parameters.

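For example, on the built-in iris toy dataset (an arbitrary choice, reused in the sketches below):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for testing; fix random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)   # learn parameters from the training set only
```
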
4. Evaluating the Model

  • Use the test set to evaluate how well the trained model performs. Common evaluation metrics include:
    • Classification tasks: accuracy, precision, recall, F1-score, etc.
    • Regression tasks: Mean Squared Error (MSE), R², etc.

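Continuing the sketch above, the common metrics live in sklearn.metrics:

```python
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import mean_squared_error, r2_score

# Classification metrics
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("F1 (macro):", f1_score(y_test, y_pred, average="macro"))

# Regression metrics (for a regression model's y_true / y_pred)
# print("MSE:", mean_squared_error(y_true, y_pred))
# print("R^2:", r2_score(y_true, y_pred))
```
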
5. Hyperparameter Tuning

  • Use GridSearchCV or RandomizedSearchCV to search over the model's hyperparameters and improve its performance.

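A sketch with GridSearchCV, reusing X_train and y_train from above (the grid values are arbitrary examples):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)   # tries every combination with 5-fold CV
print(search.best_params_, search.best_score_)
```
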
6. Cross-Validation (Optional)

  • Use cross-validation to evaluate the model more thoroughly, ensuring that it performs well across different subsets of the data. This helps avoid overfitting or bias from a single train-test split. Methods like cross_val_score are commonly used.

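For example, reusing X and y from the training sketch:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# 5-fold cross-validation; returns one score per fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```
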
7. Model Saving and Loading (Optional)

  • After training, you may save the model for later use with joblib or pickle. This allows you to reload the model and make predictions without retraining.

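A sketch with joblib (the file name is arbitrary):

```python
import joblib

joblib.dump(model, "model.joblib")      # save the trained model to disk
loaded = joblib.load("model.joblib")    # reload it later
print(loaded.predict(X_test[:5]))       # predictions without retraining
```
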
Typical sklearn Workflow Example:

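Putting the steps above together, a minimal end-to-end sketch (the iris data and model choice are arbitrary) might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load data and split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Preprocess: fit the scaler on the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuse the training statistics

# 3. Train
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# 4. Evaluate
y_pred = model.predict(X_test_scaled)
print("accuracy:", accuracy_score(y_test, y_pred))
```
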
Core Concept

| Term | Meaning | Used On |
| --- | --- | --- |
| .fit() | Learn from the data | Training set only |
| .transform() | Apply what was learned | Test set or new data |
| .fit_transform() | Learn and apply in one step | Usually for training |

1. .fit(): Learn From Training Data
This method is used to learn internal parameters from the training set.

Examples:

  • CountVectorizer: learns the vocabulary
  • TfidfVectorizer: learns vocabulary + IDF values
  • StandardScaler: learns the mean and standard deviation
  • MinMaxScaler: learns min and max values

At this point, no actual data transformation happens. The object just memorizes what it needs in order to transform data later.

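For example, fitting a CountVectorizer only learns the vocabulary; no documents are transformed yet:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["the cat sat", "the dog ran"])   # learns the vocabulary only
print(vectorizer.vocabulary_)                    # {'the': 4, 'cat': 0, 'sat': 3, 'dog': 1, 'ran': 2}
```
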
2. .transform(): Apply Learned Rules
This method is used to apply the learned rules (from .fit()) to new data (like validation or test data).
  • It uses the learned vocabulary/statistics.
  • It does not learn anything new—it just converts data using what's already known.

Important: never fit on the test data. Doing so leads to data leakage, because the preprocessor would learn statistics from data that is supposed to be unseen.

3. .fit_transform(): Learn + Apply in One Step
This is a shortcut that first calls .fit() and then .transform() on the same data.
  • Most commonly used on training data
  • Saves time and makes code cleaner

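The standard pattern, then, is fit_transform on the training data and transform alone on the test data:

```python
from sklearn.preprocessing import StandardScaler

X_train = [[1.0], [2.0], [3.0]]   # toy data
X_test = [[4.0], [5.0]]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std, then scale
X_test_scaled = scaler.transform(X_test)         # reuse the training mean/std; no refitting
print(scaler.mean_)                              # statistics learned from X_train only
```
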
What is a Pipeline?

A Pipeline chains together multiple steps (like preprocessing and modeling) into a single object. It ensures:
  • Clean, readable code
  • Proper training/test separation
  • No data leakage
  • Easy cross-validation
1. Fit the pipeline (on training data)
  • For a pipeline of StandardScaler followed by LogisticRegression, .fit() will:
    • call .fit() on the StandardScaler and transform X_train
    • pass the transformed data to LogisticRegression.fit()
2. Predict with the pipeline (on test data)
  • It automatically:
    • transforms X_test using the already-learned scaler
    • passes it to the trained model

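A minimal sketch of those two steps (the iris data is again an arbitrary choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)        # fits the scaler, transforms X_train, fits the classifier
print(pipe.predict(X_test[:5]))   # scales X_test with the learned scaler, then predicts
print(pipe.score(X_test, y_test))
```
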
You can also use a Pipeline together with a parameter search to compare different combinations of vectorizers and classifiers, as sketched below.

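One way to do this (a sketch; the toy corpus and labels are made up) is to make whole pipeline steps searchable parameters in GridSearchCV:

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier

texts = ["free prize now", "lunch at noon?", "win money fast", "see you tomorrow"]
labels = [1, 0, 1, 0]   # toy spam/ham labels

pipe = Pipeline([
    ("vec", CountVectorizer()),   # placeholder; swapped out by the grid below
    ("clf", LogisticRegression()),
])

# Each named pipeline step can itself be searched over
param_grid = {
    "vec": [CountVectorizer(), TfidfVectorizer()],
    "clf": [LogisticRegression(max_iter=1000), SGDClassifier()],
}
search = GridSearchCV(pipe, param_grid, cv=2)   # cv=2 only because the toy set is tiny
search.fit(texts, labels)
print(search.best_params_)
```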