Sklearn Workflow

1. Data Loading and Preprocessing

Data preprocessing is one of the most crucial steps in any machine learning pipeline. The quality of your data directly impacts your model's performance. Preprocessing typically involves tasks like cleaning the data, handling missing values, transforming features, and more.
Text Data Processing:
For text data, we need to convert the raw text into a format that a machine learning model can understand. This is often done through vectorization, such as CountVectorizer or TfidfVectorizer.
• CountVectorizer converts each document into a vector where the values represent the counts of each word in the vocabulary.
• TfidfVectorizer converts text to vectors, considering the importance of each word by giving lower weight to common words that appear in many documents.
Numerical Data Processing:
For numerical data, common steps include scaling and encoding:
• Standardization: Using StandardScaler, we standardize the data to have a mean of 0 and a standard deviation of 1, ensuring all features contribute equally to the model.
• One-Hot Encoding: For categorical variables, OneHotEncoder can be used to convert each category into a binary feature.
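A minimal sketch of these preprocessing tools (the toy documents and arrays below are hypothetical, not from the original post):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# --- Text data ---
docs = ["the cat sat on the mat", "the dog barked"]  # toy documents

count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)        # word-count matrix

tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)         # TF-IDF weighted matrix

# --- Numerical data ---
X_num = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_num)          # mean 0, std 1 per column

# --- Categorical data ---
X_cat = np.array([["red"], ["green"], ["red"]])
encoder = OneHotEncoder()
X_onehot = encoder.fit_transform(X_cat)         # one binary column per category
```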
           
           

2. Model Selection

The selection of a model depends on the type of task you're solving, such as classification or regression.
• Classification Models:
  • Logistic Regression is commonly used for binary classification tasks.
  • Random Forest Classifier is useful when you have non-linear data, as it builds multiple decision trees and aggregates their predictions.
• Regression Models:
  • Linear Regression is commonly used when the output variable is continuous and linearly related to the input features.
  • Support Vector Regression (SVR) can be used for both linear and non-linear regression tasks.
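A minimal sketch of instantiating these estimators (the hyperparameter values shown are illustrative defaults, not recommendations from the original post):

```python
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVR

# Classification
clf_lr = LogisticRegression(max_iter=1000)
clf_rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Regression
reg_lin = LinearRegression()
reg_svr = SVR(kernel="rbf")
```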

3. Training the Model

Once the model is selected, the next step is to train it. This involves splitting the dataset into training and testing sets, usually with train_test_split(), and using the fit() method to train the model on the training data. Both steps are sketched below.
• Splitting the Data: divide the dataset into a training set and a held-out test set.
• Training the Model: this step applies the model to the training data, adjusting its internal parameters (weights) to best fit the data.
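A minimal sketch, assuming a feature matrix X and label vector y are already prepared (hypothetical names, not defined in the original post):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Splitting the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Training the model: fit() learns the model's weights from the training set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```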

4. Evaluating the Model

Once the model is trained, it's important to evaluate how well it performs on unseen data. Evaluation metrics depend on the type of problem:
• Classification Evaluation:
  • Accuracy is the percentage of correct predictions.
  • Precision, Recall, and F1-Score are useful when the class distribution is imbalanced (e.g., rare events).
• Regression Evaluation:
  • Mean Squared Error (MSE) measures the average of the squared errors between predicted and actual values.
  • R² (Coefficient of Determination) explains how well the regression model predicts the target variable.
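A minimal sketch of computing these metrics, assuming the trained model and the X_test / y_test split from the previous step (binary labels assumed for the classification metrics):

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    mean_squared_error, r2_score,
)

y_pred = model.predict(X_test)

# Classification metrics
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))

# Regression metrics (use these with a regressor's predictions instead)
# print("MSE:", mean_squared_error(y_test, y_pred))
# print("R² :", r2_score(y_test, y_pred))
```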

5. Hyperparameter Tuning

Model performance can often be improved by tuning its hyperparameters. GridSearchCV and RandomizedSearchCV are two methods for performing hyperparameter optimization.
• GridSearchCV exhaustively tests all parameter combinations within the specified grid.
• RandomizedSearchCV randomly samples parameter combinations and can be faster than GridSearchCV when the parameter space is large.
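A minimal GridSearchCV sketch, tuning a RandomForestClassifier on the training split (the parameter grid here is an illustrative assumption):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 5, 10],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,                   # 5-fold cross-validation for each combination
    scoring="accuracy",
)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Best CV score:", grid.best_score_)
```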

6. Cross-Validation (Optional)

Cross-validation is a method used to ensure that the model generalizes well to unseen data. It splits the data into multiple subsets, trains the model on some of them, and tests it on the remaining ones.
• Using cross_val_score for cross-validation: see the sketch below.
Cross-validation helps to avoid overfitting and gives a more reliable estimate of the model's performance.
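A minimal sketch of 5-fold cross-validation with cross_val_score, using the full feature matrix X and labels y (hypothetical variable names):

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: trains on 4 folds, evaluates on the 5th, rotating
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print("Fold scores  :", scores)
print("Mean accuracy:", scores.mean())
```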

7. Model Saving and Loading (Optional)

Once the model is trained, it can be saved to disk for later use, avoiding the need to retrain it every time.
• Using joblib to save and load the model: see the sketch below.
This is especially useful in production environments where you want to avoid retraining the model every time the application runs.
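A minimal joblib sketch (the file name model.joblib is an arbitrary choice):

```python
import joblib

# Save the trained model to disk
joblib.dump(model, "model.joblib")

# Later (e.g., in a production service), load it back and predict
loaded_model = joblib.load("model.joblib")
predictions = loaded_model.predict(X_test)
```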

Typical sklearn Workflow Example:
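The original example code is not preserved here; the following is one plausible end-to-end sketch (toy text data and labels, invented for illustration) that ties the steps above together:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Data (toy example)
texts = ["great movie", "terrible film", "loved it",
         "awful acting", "brilliant plot", "boring story"]
labels = [1, 0, 1, 0, 1, 0]

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42
)

# 3. Preprocess: fit the vectorizer on training text only
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# 4. Train
model = LogisticRegression()
model.fit(X_train_vec, y_train)

# 5. Evaluate
y_pred = model.predict(X_test_vec)
print("Accuracy:", accuracy_score(y_test, y_pred))
```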

Core Concept

Term              Meaning                        Used On
.fit()            Learn from the data            Training set only
.transform()      Apply what was learned         Test set or new data
.fit_transform()  Learn and apply in one step    Usually for training
1. .fit(): Learn From Training Data
This method is used to learn internal parameters from the training set.
Examples:
• CountVectorizer: learns the vocabulary
• TfidfVectorizer: learns vocabulary + IDF values
• StandardScaler: learns the mean and standard deviation
• MinMaxScaler: learns min and max values
At this point, no actual data transformation happens. The object just memorizes what it needs to later transform data.

2. .transform(): Apply Learned Rules
This method is used to apply the learned rules (from .fit()) to new data (like validation or test data).
• It uses the learned vocabulary/statistics.
• It does not learn anything new; it just converts data using what's already known.
Important: You should never fit on test data, as doing so leads to data leakage.

3. .fit_transform(): Learn + Apply in One Step
This is a shortcut that first calls .fit() and then .transform() on the same data.
• Most commonly used on training data
• Saves time and makes code cleaner
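A minimal sketch contrasting the three methods with StandardScaler (toy arrays, hypothetical names):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0], [5.0]])

scaler = StandardScaler()

# .fit(): learn the mean and standard deviation from the training data only
scaler.fit(X_train)

# .transform(): apply those learned statistics to any data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # never fit on the test set

# .fit_transform(): fit and transform the training data in one call
X_train_scaled = StandardScaler().fit_transform(X_train)
```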

What is a Pipeline?

A Pipeline chains multiple steps together, making the code cleaner and ensuring that all steps are executed in the correct order. It also ensures that data transformations are performed consistently during both training and testing.
Here's how a pipeline works in scikit-learn (each step is sketched in the example below):
1. Creating a Pipeline
2. Training the Pipeline
3. Making Predictions
A Pipeline ensures that each step (like vectorization and classification) is executed in the correct order, preventing mistakes like fitting on the test set.
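A minimal sketch of the three steps, using a TfidfVectorizer + LogisticRegression pipeline (the step names and the X_train / X_test variables are assumptions; here they hold raw text):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# 1. Creating a Pipeline: each step is a (name, estimator) pair
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 2. Training the Pipeline: fits the vectorizer and the classifier in order
pipe.fit(X_train, y_train)

# 3. Making Predictions: the same transformations are applied automatically
y_pred = pipe.predict(X_test)
```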

Use a Pipeline to compare different combinations of vectorizers and classifiers

To evaluate and compare the performance of different models or feature extraction techniques, you can use a pipeline with different combinations of vectorizers and classifiers (see the sketch below).
This allows you to systematically test and compare different configurations of models, vectorizers, and preprocessing techniques to find the best-performing pipeline.
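A minimal sketch of looping over vectorizer/classifier combinations with cross-validation (the candidate lists are illustrative assumptions; texts and labels stand in for your raw documents and targets):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

vectorizers = [("count", CountVectorizer()), ("tfidf", TfidfVectorizer())]
classifiers = [("logreg", LogisticRegression(max_iter=1000)), ("nb", MultinomialNB())]

# Evaluate every vectorizer/classifier combination with 3-fold cross-validation
for vec_name, vec in vectorizers:
    for clf_name, clf in classifiers:
        pipe = Pipeline([("vec", vec), ("clf", clf)])
        scores = cross_val_score(pipe, texts, labels, cv=3, scoring="accuracy")
        print(f"{vec_name} + {clf_name}: {scores.mean():.3f}")
```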