1. Introduction
Logistic Regression is a widely used supervised learning algorithm for binary classification problems. It is commonly applied in scenarios such as credit scoring, medical diagnosis, and advertisement click-through rate prediction. Despite having "regression" in its name, Logistic Regression is fundamentally a classification algorithm.
Logistic Regression is popular in the industry due to its simplicity, parallelizability, and strong interpretability. The essence of Logistic Regression is assuming that data follows a certain distribution and then using Maximum Likelihood Estimation to estimate parameters.
Maximum Likelihood Estimation (MLE) is a statistical method that finds the parameter values that make the observed data most likely.
Imagine you're a detective investigating whether a die is "loaded" (biased). You roll it 10 times and get the following results:
6, 2, 3, 6, 4, 6, 6, 5, 6, 1
You notice that 6 appears 5 times out of 10, which seems suspicious.
MLE asks: what value of $p$, the probability of rolling a 6, best explains this data? If the die were fair, each face would have a probability of $1/6$. But if it's biased, we need to estimate $p$. Here the MLE is simply the observed frequency, $\hat{p} = 5/10 = 0.5$, far above $1/6$.
In logistic regression, we assume that the class label $y$ follows a Bernoulli distribution:
$$P(y \mid x) = \hat{y}^{\,y} (1 - \hat{y})^{1-y}$$
where:
- $y \in \{0, 1\}$ is the class label,
- $\hat{y} = \sigma(w^T x + b)$ is the predicted probability of class 1.
MLE finds the parameters $w$ that maximize the likelihood of the observed labels. Taking the logarithm gives the log-likelihood function; its negative is the cross-entropy loss, which is minimized using optimization techniques like gradient descent.
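As a tiny illustration of MLE, the sketch below estimates $p$ for the die example by maximizing the log-likelihood over a grid of candidate values; the closed-form answer is just the observed frequency $5/10$:
```python
import numpy as np

rolls = np.array([6, 2, 3, 6, 4, 6, 6, 5, 6, 1])
k = np.sum(rolls == 6)          # number of sixes observed (5)
n = len(rolls)                  # total rolls (10)

# Log-likelihood of observing k sixes in n rolls for a candidate p
def log_likelihood(p):
    return k * np.log(p) + (n - k) * np.log(1 - p)

candidates = np.linspace(0.01, 0.99, 99)
p_hat = candidates[np.argmax(log_likelihood(candidates))]
print(p_hat)  # 0.5, matching the closed-form MLE k/n = 5/10
```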
2. Mathematical Principles
2.1 Basic Concept of Logistic Regression
Logistic Regression is a supervised learning algorithm used for binary classification. The key steps include:
- Computing a weighted sum of input features, as in linear regression:
$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$$
where:
- $x_1, \dots, x_n$ are input features,
- $w_1, \dots, w_n$ are the corresponding weights,
- $b$ is the bias term.
- Applying the Sigmoid function to map the sum into the (0,1) range as a probability:
$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$$
- Setting a threshold (e.g., 0.5) for classification:
- If $\hat{y} \ge 0.5$, predict class 1.
- If $\hat{y} < 0.5$, predict class 0.

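Taken together, these steps amount to a few lines of Python. A minimal sketch (the weights and feature values here are illustrative):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, b, threshold=0.5):
    z = np.dot(w, x) + b          # weighted sum of features
    y_hat = sigmoid(z)            # probability of class 1
    return int(y_hat >= threshold), y_hat

w = np.array([0.5, 0.5])          # illustrative weights
x = np.array([2.0, 1.0])          # illustrative feature vector
print(predict(x, w, 0.0))         # (1, 0.817...) -> class 1
```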
2.2 Sigmoid (S-Shaped) Function
The key to Logistic Regression is the Sigmoid (or Logistic) function, expressed as:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where $z = w^T x + b$, representing the linear combination of input features.
The Sigmoid function outputs values in the (0,1) range, representing the probability of class 1:
$$P(y = 1 \mid x) = \sigma(z)$$

- Monotonic increasing function: Larger input values yield outputs closer to 1, while smaller input values yield outputs closer to 0.
- S-shaped curve: The function is symmetric and smooth, ensuring a gradual transition between 0 and 1.
- Probability interpretation: Outputs lie in the (0,1) range, making them suitable for probability representation.
Derivative of the Sigmoid Function
$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$$
This derivative is useful for gradient-based optimization.
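A quick numerical check of this identity, comparing the analytic derivative against a central finite difference:
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
analytic = sigmoid(z) * (1 - sigmoid(z))
print(abs(numeric - analytic) < 1e-9)  # True
```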
2.3 Decision Boundary
The decision boundary in logistic regression is linear. Given the model:
$$\hat{y} = \sigma(w^T x + b)$$
At $\hat{y} = 0.5$, i.e., $w^T x + b = 0$, we obtain the boundary. In a two-dimensional feature space this is a straight-line equation:
$$w_1 x_1 + w_2 x_2 + b = 0$$
For higher-dimensional datasets, the decision boundary is a hyperplane that separates data into two classes.
2.4 Loss Function
Logistic regression uses the logarithmic loss function (Log Loss), also known as cross-entropy loss:
$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log \left(1 - \hat{y}^{(i)}\right) \right]$$
where:
- $m$ is the number of training samples,
- $y^{(i)}$ is the actual label (0 or 1) of sample $i$,
- $\hat{y}^{(i)}$ is the predicted probability for sample $i$.
Understanding Cross-Entropy Loss
- If $y = 1$, the loss function reduces to $-\log \hat{y}$:
- If $\hat{y} \to 1$, the loss is close to 0 (correct prediction).
- If $\hat{y} \to 0$, the loss becomes very large (incorrect prediction).
- If $y = 0$, the loss function reduces to $-\log (1 - \hat{y})$:
- If $\hat{y} \to 0$, the loss is close to 0 (correct prediction).
- If $\hat{y} \to 1$, the loss becomes very large (incorrect prediction).
Why Use Cross-Entropy Instead of MSE?
- Cross-entropy loss is better suited for classification, as it directly optimizes probability estimates.
- Mean Squared Error (MSE) combined with the sigmoid can cause vanishing gradients and makes the loss non-convex, slowing down or destabilizing training.
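A minimal sketch of the cross-entropy loss in numpy (the clipping guards against $\log(0)$; the labels and probabilities are illustrative):
```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.3])
print(cross_entropy(y_true, y_pred))  # ~0.198
```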
2.5 Parameter Optimization: Gradient Descent
The goal is to find the optimal parameters w and b that minimize the loss function. The most common optimization method is gradient descent.
Computing the Gradients
The gradients of the loss function with respect to the parameters are:
$$\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)$$
Gradient Descent Update Rule
The parameters are updated using the gradient descent rule:
$$w_j := w_j - \alpha \frac{\partial J}{\partial w_j}, \qquad b := b - \alpha \frac{\partial J}{\partial b}$$
where:
- $\alpha$ is the learning rate, controlling the step size of updates.
- The process repeats iteratively until convergence.
Types of Gradient Descent
- Batch Gradient Descent (BGD):
- Uses all training data in each update.
- More stable but computationally expensive for large datasets.
- Stochastic Gradient Descent (SGD):
- Updates parameters using a single random sample per iteration.
- Faster but introduces more noise in updates.
- Mini-Batch Gradient Descent (MBGD):
- Uses a small batch of samples per update.
- Balances speed and stability.
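Putting the pieces together, here is a minimal batch-gradient-descent training loop for logistic regression (the synthetic data and hyperparameters are illustrative):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                   # 200 samples, 2 features
y = (X[:, 0] + X[:, 1] > 0).astype(float)       # linearly separable labels

w = np.zeros(2)
b = 0.0
alpha = 0.1                                     # learning rate

for _ in range(1000):                           # batch gradient descent
    y_hat = sigmoid(X @ w + b)
    error = y_hat - y
    w -= alpha * (X.T @ error / len(y))
    b -= alpha * error.mean()

accuracy = ((sigmoid(X @ w + b) >= 0.5) == y).mean()
print(w, b, accuracy)                           # accuracy close to 1.0
```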
3. Example: Email Spam Classification
We want to classify emails as spam (1) or not spam (0) based on two features:
- Number of times "Free" appears (Feature X1X_1X1)
- Number of times "Win" appears (Feature X2X_2X2)
Training Data
Email | "Free" Count () | "Win" Count (X2X_2X2) | Spam? (YYY) |
1 | 3 | 2 | 1 (Spam) |
2 | 0 | 0 | 0 (Not Spam) |
3 | 2 | 1 | 1 (Spam) |
4 | 1 | 0 | 0 (Not Spam) |
We initialize model parameters:
- Weights: $w_1 = 0.5$, $w_2 = 0.5$
- Bias: $b = 0$
- Learning Rate: $\alpha = 0.1$ (a typical choice)
Step 1: Compute Linear Combination $z = w_1 X_1 + w_2 X_2 + b$
| Email | $z$ |
| --- | --- |
| 1 | $0.5(3) + 0.5(2) + 0 = 2.5$ |
| 2 | $0.5(0) + 0.5(0) + 0 = 0$ |
| 3 | $0.5(2) + 0.5(1) + 0 = 1.5$ |
| 4 | $0.5(1) + 0.5(0) + 0 = 0.5$ |
Step 2: Apply the Sigmoid Function $\hat{y} = \sigma(z)$
| Email | $z$ | $\hat{y} = \sigma(z)$ |
| --- | --- | --- |
| 1 | 2.5 | 0.924 |
| 2 | 0 | 0.500 |
| 3 | 1.5 | 0.818 |
| 4 | 0.5 | 0.622 |
These values represent the probability of an email being spam.
Step 3: Compute Loss (Cross-Entropy)
For each email, the loss is $-\left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right]$:
| Email | $y$ | $\hat{y}$ | Loss |
| --- | --- | --- | --- |
| 1 | 1 | 0.924 | $-\log(0.924) \approx 0.079$ |
| 2 | 0 | 0.500 | $-\log(0.500) \approx 0.693$ |
| 3 | 1 | 0.818 | $-\log(0.818) \approx 0.201$ |
| 4 | 0 | 0.622 | $-\log(0.378) \approx 0.973$ |
Average loss: $J \approx (0.079 + 0.693 + 0.201 + 0.973)/4 \approx 0.487$.
Step 4: Gradient Descent (Parameter Update)
Compute Gradients
Using the prediction errors $(\hat{y}_i - y_i)$, which are $-0.076$, $0.500$, $-0.182$, $0.622$:
$$\frac{\partial J}{\partial w_1} = \frac{1}{4}\sum_i (\hat{y}_i - y_i)\,X_{1,i} \approx \frac{1}{4}\left[(-0.076)(3) + (0.5)(0) + (-0.182)(2) + (0.622)(1)\right] \approx 0.008$$
$$\frac{\partial J}{\partial w_2} = \frac{1}{4}\sum_i (\hat{y}_i - y_i)\,X_{2,i} \approx \frac{1}{4}\left[(-0.076)(2) + (0.5)(0) + (-0.182)(1) + (0.622)(0)\right] \approx -0.084$$
$$\frac{\partial J}{\partial b} = \frac{1}{4}\sum_i (\hat{y}_i - y_i) \approx 0.216$$
Update Parameters
With $\alpha = 0.1$:
$$w_1 := 0.5 - 0.1(0.008) \approx 0.499, \qquad w_2 := 0.5 - 0.1(-0.084) \approx 0.508, \qquad b := 0 - 0.1(0.216) \approx -0.022$$
Step 5: Prediction on a New Email
New email:
- "Free" appears 2 times
- "Win" appears 1 time
Compute, using the updated parameters:
$$z = 0.499(2) + 0.508(1) - 0.022 \approx 1.48, \qquad \hat{y} = \sigma(1.48) \approx 0.815$$
Since $\hat{y} \approx 0.82 \ge 0.5$, we classify it as spam.
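The whole worked example can be reproduced in a few lines (initial parameter values as assumed above):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[3, 2], [0, 0], [2, 1], [1, 0]], dtype=float)  # "Free", "Win" counts
y = np.array([1, 0, 1, 0], dtype=float)                      # spam labels
w, b, alpha = np.array([0.5, 0.5]), 0.0, 0.1                 # assumed initial values

y_hat = sigmoid(X @ w + b)                                   # [0.924, 0.5, 0.818, 0.622]
loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # ~0.487

error = y_hat - y
w -= alpha * (X.T @ error / len(y))                          # one gradient step
b -= alpha * error.mean()

x_new = np.array([2, 1], dtype=float)                        # new email: "Free"=2, "Win"=1
print(sigmoid(x_new @ w + b))                                # ~0.815 -> spam
```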
4. Logistic Regression Coding
Logistic Regression Hyperparameter Tuning Summary
1. Key Hyperparameters to Tune
`C` (Regularization Strength)
- Controls the inverse of the regularization strength (L2 regularization by default).
- Lower values (`C < 1`) increase regularization (reduce overfitting), while higher values (`C > 1`) decrease regularization.
- Typical range: `[0.01, 0.1, 1, 10]`
- Example (see the sketch below):
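A minimal sketch with scikit-learn's `LogisticRegression`:
```python
from sklearn.linear_model import LogisticRegression

# C is the inverse of regularization strength: C=0.1 regularizes
# more heavily than the default C=1.0.
model = LogisticRegression(C=0.1)
```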
`solver` (Optimization Algorithm)
- `liblinear`: Best for small datasets; supports `l1` and `l2`.
- `lbfgs`: Default; good for medium-sized datasets; supports `l2`.
- `saga`: Efficient for large, sparse datasets; supports `l1`, `l2`, and `elasticnet`.
- Example (see the sketch below):
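A minimal sketch choosing a solver:
```python
from sklearn.linear_model import LogisticRegression

# saga handles large, sparse datasets and supports all penalty types.
model = LogisticRegression(solver="saga", max_iter=1000)
```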
`max_iter` (Maximum Iterations)
- Increase the number of iterations to ensure convergence on large datasets.
- Typical range: `[500, 1000, 2000]`
- Example (see the sketch below):
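A minimal sketch raising the iteration limit:
```python
from sklearn.linear_model import LogisticRegression

# Raise max_iter if training stops with a ConvergenceWarning
# at the default of 100 iterations.
model = LogisticRegression(max_iter=1000)
```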
`penalty` (Regularization Type)
- `l2`: Default; works for most cases.
- `l1`: Used for feature selection (only works with `liblinear` and `saga`).
- `elasticnet`: Combines `l1` and `l2`; supported only with `saga`.
- Example (see the sketch below):
2. How to Optimize These Hyperparameters?
(1) Grid Search (`GridSearchCV`) - Exhaustive search over all parameter combinations:
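A minimal sketch (the dataset and parameter values are illustrative):
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "solver": ["liblinear", "lbfgs"],
    "max_iter": [500, 1000],
}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```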
(2) Random Search (`RandomizedSearchCV`) - Efficient for large parameter spaces:
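A minimal sketch sampling `C` on a log scale (again with illustrative data):
```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

param_dist = {
    "C": loguniform(1e-2, 1e1),   # sample C on a log scale
    "solver": ["liblinear", "lbfgs", "saga"],
}
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000), param_dist,
    n_iter=10, cv=5, random_state=42,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```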
3. Summary Table
| Parameter | Effect | Typical Range | Best Use Case |
| --- | --- | --- | --- |
| `C` | Inverse regularization strength | 0.01 ~ 10 | Lower to prevent overfitting, higher for a closer fit |
| `solver` | Optimization algorithm | `lbfgs`, `liblinear`, `saga` | `lbfgs` by default, `saga` for large datasets |
| `max_iter` | Number of iterations | 500 ~ 2000 | Increase if convergence issues arise |
| `penalty` | Regularization type | `l1`, `l2`, `elasticnet` | `l1` for sparse features |
For best results, use `GridSearchCV` or `RandomizedSearchCV` to automatically find the optimal hyperparameters.
5. Top Logistic Regression Interview Questions & Answers
Below is a comprehensive list of interview questions related to Logistic Regression, along with detailed answers.
Q1: What is Logistic Regression?
Answer:
Logistic Regression is a supervised learning algorithm used for binary classification problems. Instead of predicting continuous values like Linear Regression, it predicts the probability of a sample belonging to a particular class using the sigmoid function.
Mathematically, the model is:
$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(w^T x + b)}}$$
where $P(y = 1 \mid x)$ is the probability that the output belongs to class 1.
Q2: Why can't we use Linear Regression for classification?
Answer:
Linear Regression provides continuous values, which are not suitable for classification. If we try to use it for classification:
- Unbounded Output: Linear Regression can output values beyond [0,1], making it unsuitable for probabilities.
- Poor Decision Boundaries: Linear Regression does not naturally map to distinct classes.
- Lack of Probability Interpretation: Logistic Regression outputs probabilities, making threshold-based classification more meaningful.
Q3: What is the Sigmoid function and why is it used in Logistic Regression?
Answer:
The Sigmoid function is defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where $z$ is the linear combination of input features.
- Why is it used?
- It maps any real number to a value between 0 and 1, making it suitable for probability estimation.
- It introduces non-linearity, enabling classification.
- It is differentiable, allowing optimization via Gradient Descent.
Q4: What is the Decision Boundary in Logistic Regression?
Answer:
The decision boundary is the line (or surface in higher dimensions) that separates different classes.
For a binary classification problem, it is defined by the equation:
$$\sigma(w^T x + b) = 0.5 \iff w^T x + b = 0$$
- If $\sigma(w^T x + b) \ge 0.5$, classify as class 1.
- If $\sigma(w^T x + b) < 0.5$, classify as class 0.
The decision boundary is linear unless feature transformations (like polynomial terms) are introduced.
Q5: What is the Loss Function used in Logistic Regression?
Answer:
Logistic Regression uses Log Loss (Cross-Entropy Loss):
$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log \left(1 - \hat{y}^{(i)}\right) \right]$$
where:
- $\hat{y}^{(i)}$ is the predicted probability,
- $y^{(i)}$ is the actual label.
This function penalizes incorrect predictions and is convex, allowing optimization via Gradient Descent.
Q6: How are the parameters of Logistic Regression optimized?
Answer:
Logistic Regression parameters are optimized using Gradient Descent:
- Compute the gradient of the cost function with respect to each parameter.
- Update parameters iteratively using:
$$w_j := w_j - \alpha \frac{\partial J}{\partial w_j}$$
where $\alpha$ is the learning rate.
- Repeat until convergence.
Other optimization methods include:
- Newton’s Method (Newton-Raphson)
- Stochastic Gradient Descent (SGD)
- Batch & Mini-Batch Gradient Descent
Q7: What is Regularization in Logistic Regression? Why is it needed?
Answer:
Regularization prevents overfitting by adding a penalty term to the loss function:
- L1 Regularization (Lasso): adds $\lambda \sum_j |w_j|$ to the loss.
- Helps in feature selection (some coefficients shrink to zero).
- L2 Regularization (Ridge): adds $\lambda \sum_j w_j^2$ to the loss.
- Helps in reducing large coefficients but does not eliminate features.
Q8: How do you evaluate a Logistic Regression model?
Answer:
Key evaluation metrics include:
- Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$
- Precision: $\frac{TP}{TP + FP}$
- Recall (Sensitivity): $\frac{TP}{TP + FN}$
- F1-Score: Harmonic mean of Precision & Recall
- ROC-AUC: Measures probability ranking of positive classes.
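A minimal sketch computing these metrics with scikit-learn (the labels and probabilities are illustrative):
```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                   # hard class predictions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]   # predicted probabilities

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_prob))                # uses probabilities, not labels
```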
Q9: How do you handle imbalanced datasets in Logistic Regression?
Answer:
For imbalanced data:
- Use class weights (e.g., `class_weight='balanced'` in scikit-learn).
- Oversampling (SMOTE) or Undersampling
- Threshold Adjustment based on ROC Curve.
Q10: What are some alternatives to Logistic Regression?
Answer:
- Naïve Bayes: Works well for text classification.
- Decision Trees: Handles nonlinear data.
- Support Vector Machines (SVMs): Works well with high-dimensional data.
- Neural Networks: Useful for complex feature interactions.
Q11: When should you NOT use Logistic Regression?
Answer:
- When the relationship is highly non-linear.
- When features are heavily correlated (Multicollinearity).
- When there are many categorical features with high cardinality.
- When dealing with imbalanced data without proper handling.
Q12: What is Multinomial Logistic Regression?
Answer:
Multinomial Logistic Regression is used for multi-class classification. Instead of using a single Sigmoid function, it applies the Softmax function:
$$P(y = k \mid x) = \frac{e^{w_k^T x}}{\sum_{j=1}^{K} e^{w_j^T x}}$$
where $K$ is the number of classes.
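A minimal numpy sketch of the softmax (the class scores are illustrative):
```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; probabilities are unchanged.
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1])   # w_k^T x for each of K=3 classes
print(softmax(scores))               # sums to 1; largest score -> highest probability
```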
Q13: Can Logistic Regression be used for Time Series Data?
Answer:
No, Logistic Regression assumes independent observations, whereas time series data exhibits dependencies over time. Instead, use Recurrent Neural Networks (RNNs) or Hidden Markov Models (HMMs).