1. Data Loading and Preprocessing
- Data preprocessing includes cleaning the data, feature extraction, handling missing values, encoding categorical variables, etc.
- For example, text data may need to be converted into feature vectors using techniques like `CountVectorizer` or `TfidfVectorizer`.
- For numerical data, you may need to standardize features (using `StandardScaler`) or encode categorical variables (using `OneHotEncoder`).
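As a minimal sketch of both cases (the toy corpus and numbers here are made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Text -> TF-IDF feature vectors (toy corpus)
corpus = ["the cat sat", "the dog ran", "the cat ran"]
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(corpus)   # sparse matrix, one row per document

# Numeric features -> zero mean, unit variance
X_num = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_num)
```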
2. Model Selection
- Choose an appropriate model from `sklearn` based on your task (classification, regression, etc.).
- For classification tasks, you can use models like `LogisticRegression`, `RandomForestClassifier`, `SGDClassifier`, etc.
- For regression tasks, models like `LinearRegression`, `SVR`, etc., can be used.
3. Training the Model
- Splitting the dataset: the dataset is usually split into training and test sets (e.g., using `train_test_split()`).
- Training the model: the training data is passed to the model's `fit()` method.
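For example (using the built-in iris dataset; the split ratio and model here are my own choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)   # learn parameters from the training set only
```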
4. Evaluating the Model
- Use the test set to evaluate how well the trained model performs. Common evaluation metrics include:
- Classification tasks: accuracy, precision, recall, F1-score, etc.
- Regression tasks: Mean Squared Error (MSE), R², etc.
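Continuing the classification example above, evaluation might look like this (dataset and metrics chosen for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="macro")   # macro-averaged F1 over the 3 classes
```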
5. Hyperparameter Tuning
- We can use techniques like `GridSearchCV` or `RandomizedSearchCV` to tune the model's hyperparameters and improve its performance.
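A small `GridSearchCV` sketch (the parameter grid below is just an example of candidate values):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Candidate regularization strengths to try
param_grid = {"C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)   # fits one model per (parameter, fold) combination

best_C = search.best_params_["C"]
```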
6. Cross-Validation (Optional)
- Use cross-validation to evaluate the model more thoroughly, ensuring that it performs well across different subsets of the data. This helps avoid overfitting or bias due to a specific train-test split. Functions like `cross_val_score` are commonly used.
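For instance, 5-fold cross-validation on the iris dataset (my choice of fold count):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# One accuracy score per fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
```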
7. Model Saving and Loading (Optional)
- After training, you may save the model for later use with `joblib` or `pickle`. This allows you to reload the model and make predictions without retraining.
Typical `sklearn` Workflow Example:
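A minimal end-to-end sketch of the steps above (the dataset, scaler, and model are my own choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Load data and split into train/test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Preprocess: fit the scaler on the training data only
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)   # reuse the training statistics

# 3. Train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_s, y_train)

# 4. Evaluate on the held-out test set
acc = accuracy_score(y_test, model.predict(X_test_s))
```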
Core Concept
| Term | Meaning | Used On |
| --- | --- | --- |
| `.fit()` | Learn from the data | Training set only |
| `.transform()` | Apply what was learned | Test set or new data |
| `.fit_transform()` | Learn and apply in one step | Usually for training |
1. `.fit()`: Learn From Training Data
This method is used to learn internal parameters from the training set.
Examples:
- `CountVectorizer`: learns the vocabulary
- `TfidfVectorizer`: learns the vocabulary + IDF values
- `StandardScaler`: learns the mean and standard deviation
- `MinMaxScaler`: learns the min and max values
At this point, no actual data transformation happens. The object just memorizes what it needs to later transform data.
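A quick demonstration that `.fit()` only memorizes (toy inputs, for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

vec = CountVectorizer()
vec.fit(["the cat sat", "the dog ran"])
vocab = sorted(vec.vocabulary_)   # the learned vocabulary; no data transformed yet

scaler = StandardScaler()
scaler.fit(np.array([[1.0], [3.0]]))
# scaler.mean_ now holds the learned mean (2.0), but nothing has been scaled
```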
2. `.transform()`: Apply Learned Rules
This method applies the rules learned by `.fit()` to new data (such as validation or test data).
- It uses the learned vocabulary/statistics.
- It does not learn anything new; it just converts data using what's already known.
Important: you should never fit on test data, or it leads to data leakage.
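The leakage-safe pattern looks like this (toy numbers for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler().fit(X_train)   # learn mean/std from training data only
X_test_s = scaler.transform(X_test)      # apply the training statistics to test data
# Never call fit (or fit_transform) on X_test: that would leak test-set statistics
# into the preprocessing step.
```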
3. `.fit_transform()`: Learn + Apply in One Step
This is a shortcut that first calls `.fit()` and then `.transform()` on the same data.
- Most commonly used on training data
- Saves time and makes code cleaner
What is a `Pipeline`?
A `Pipeline` chains together multiple steps (like preprocessing and modeling) into a single object. It ensures:
- Clean, readable code
- Proper training/test separation
- No data leakage
- Easy cross-validation
1. Fit the pipeline (on training data)
Calling `.fit()` on the pipeline will:
- call `.fit()` on `StandardScaler` and transform `X_train`
- pass the transformed data to `LogisticRegression.fit()`
2. Predict with the pipeline (on test data)
Calling `.predict()` on the pipeline automatically:
- transforms `X_test` using the already-learned scaler
- passes it to the trained model
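Both steps together, as a sketch (iris dataset and the two steps above are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),                # step 1: preprocessing
    ("clf", LogisticRegression(max_iter=1000)),  # step 2: model
])

pipe.fit(X_train, y_train)     # fits the scaler on X_train, then fits the classifier
y_pred = pipe.predict(X_test)  # scales X_test with learned statistics, then predicts
```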
You can also use a `Pipeline` to compare different combinations of vectorizers and classifiers.
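One way to do this (my own sketch, combining `Pipeline` with `GridSearchCV` on a tiny made-up corpus) is to treat whole pipeline steps as searchable parameters, named after the step:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy labelled corpus, purely illustrative
texts = ["good movie", "great film", "bad movie", "awful film",
         "great plot", "bad plot", "good film", "awful movie"]
labels = [1, 1, 0, 0, 1, 0, 1, 0]

pipe = Pipeline([("vec", CountVectorizer()), ("clf", LogisticRegression())])

# Each step name can be swapped for alternative estimators in the grid
param_grid = {
    "vec": [CountVectorizer(), TfidfVectorizer()],
    "clf": [LogisticRegression(max_iter=1000), MultinomialNB()],
}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)   # tries every vectorizer/classifier combination
```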
- Author: Entropyobserver
- URL: https://tangly1024.com/article/20cd698f-3512-806c-9ebc-ecca206129d0
- Copyright: All articles in this blog, except where specially stated, are licensed under the BY-NC-SA agreement. Please indicate the source!