1. Introduction
Logistic Regression is a widely used supervised learning algorithm for binary classification problems. It is commonly applied in scenarios such as credit scoring, medical diagnosis, and advertisement click-through rate prediction. Despite having "regression" in its name, Logistic Regression is fundamentally a classification algorithm.
Logistic Regression is popular in industry due to its simplicity, parallelizability, and strong interpretability. In essence, Logistic Regression assumes that the data follows a certain distribution (a Bernoulli distribution for the labels) and then uses Maximum Likelihood Estimation to estimate the parameters.
2. Mathematical Theory
1. Probability Model (Bernoulli Distribution)
In logistic regression, the output variable $y$ is assumed to follow a Bernoulli distribution. This means that $y$ can take only two values: $y=1$ (success) or $y=0$ (failure).
The probability of observing $y$, given the input features, is modeled as:
$$P(y \mid x) = p^{y}(1-p)^{1-y}$$
This formula expresses the likelihood of observing $y=1$ or $y=0$ given the probability $p$:
- If $y=1$, the likelihood is $p$
- If $y=0$, the likelihood is $1-p$
So this equation simply models the probability of observing either outcome, $y=0$ or $y=1$, in terms of the probability $p$.
2. Logistic Function (Sigmoid Function)
The probability $p$ in the Bernoulli distribution is modeled using the logistic function (also called the sigmoid function), which ensures that $p$ lies between 0 and 1, as probabilities should. The logistic function is defined as:
$$p = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + \dots + w_n x_n)}}$$
Here:
- $w_0$ is the intercept (bias term).
- $w_1, \dots, w_n$ are the weights (parameters) corresponding to the features $x_1, \dots, x_n$.
- The expression $w_0 + w_1 x_1 + \dots + w_n x_n$ is a linear combination of the input features and their corresponding weights.
The logistic function transforms this linear combination into a probability $p$ that ranges from 0 to 1, which can then be interpreted as the probability of class $y=1$.
3. Maximum Likelihood Estimation (MLE)
Now, the task of logistic regression is to find the best parameters that maximize the likelihood of observing the data (the labels $y$) given the features $x$. This is where Maximum Likelihood Estimation (MLE) comes in.
- Likelihood: The likelihood function is the product of the probabilities of observing each data point in the dataset. For each data point, the likelihood is given by $p^{y}(1-p)^{1-y}$, which depends on the parameter $p$.
- Maximizing the Likelihood: MLE seeks to maximize the likelihood function by adjusting the parameters $w_0, w_1, \dots, w_n$. Essentially, MLE finds the parameters that make the observed labels (0 or 1) as probable as possible.
Maximum Likelihood Estimation (MLE) is a statistical method that finds the parameter values that make the observed data most likely.
Imagine you're a detective investigating whether a die is "loaded" (biased). You roll it 10 times and get the following results:
6, 2, 3, 6, 4, 6, 6, 5, 6, 1
Out of these 10 rolls, the number 6 appears 5 times. So MLE helps us estimate the probability of rolling a 6 based on these observed results. Since 6 appears 5 times out of 10 rolls, the MLE estimate of $p$ would be:
$$\hat{p} = \frac{5}{10} = 0.5$$
This means that, based on the data you have collected, the estimated probability of rolling a 6 is 50%.
MLE works by finding the parameters (like $p$) that maximize the likelihood of observing the data. Here, the data is 10 die rolls, and the goal is to find the parameter (the probability of rolling a 6) that makes these 10 results most likely. If you collect more data (say, 100 rolls), MLE will update the estimate based on the new data, ensuring that the estimate aligns with the new evidence.
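As a quick numerical check of this idea (a sketch, not part of the original post), we can evaluate the binomial log-likelihood over a grid of candidate values of $p$ and confirm that it peaks at 0.5:

```python
import numpy as np

rolls = np.array([6, 2, 3, 6, 4, 6, 6, 5, 6, 1])
successes = np.sum(rolls == 6)          # 5 sixes
n = len(rolls)                           # 10 rolls

# Evaluate the log-likelihood of "6 with probability p" on a grid of candidates
p_grid = np.linspace(0.01, 0.99, 99)
log_lik = successes * np.log(p_grid) + (n - successes) * np.log(1 - p_grid)

p_mle = p_grid[np.argmax(log_lik)]
print(p_mle)   # ~0.5, matching the closed-form estimate 5/10
```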
4. Log-Likelihood Function
In practice, we work with the log-likelihood because it is mathematically easier to handle. The log-likelihood function is the natural logarithm of the likelihood:
$$\ell(w) = \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$
Here:
- $N$ is the number of data points.
- $y_i$ is the observed label for the $i$-th data point.
- $p_i$ is the predicted probability for the $i$-th data point (calculated using the logistic function).
The log-likelihood function quantifies how well the parameters explain the observed data.
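A small NumPy helper (illustrative, with made-up labels and probabilities) that computes this quantity directly:

```python
import numpy as np

def log_likelihood(y, p, eps=1e-12):
    """Bernoulli log-likelihood of labels y given predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)         # avoid log(0)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.4])
print(log_likelihood(y, p))              # closer to 0 means a better fit
```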
5. Optimization: Gradient Descent
To find the best parameters, we need to optimize the log-likelihood function. Gradient descent is commonly used for this:
- Objective: We aim to maximize the log-likelihood, which is equivalent to minimizing the negative log-likelihood.
- Gradient: The gradient (or derivative) of the log-likelihood tells us how to adjust the parameters to improve the likelihood.
- Descent: Starting with some initial values for $w_0, w_1, \dots, w_n$, we iteratively adjust the parameters in the direction of the gradient to increase the log-likelihood, eventually converging to the optimal parameters.
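As a rough illustration (the toy data and learning rate here are my own choices, not from the original post), gradient ascent on the log-likelihood repeatedly moves the weights along $X^{\top}(y - p)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: the first column of ones plays the role of the intercept w_0
X = np.array([[1, 2.0], [1, 0.5], [1, -1.0], [1, -2.5]])
y = np.array([1.0, 1.0, 0.0, 0.0])

w = np.zeros(X.shape[1])          # start from w_0 = w_1 = 0
lr = 0.1                           # learning rate

for _ in range(200):
    p = sigmoid(X @ w)             # current predicted probabilities
    w += lr * X.T @ (y - p)        # move *up* the log-likelihood gradient

print(np.round(w, 2))              # parameters that make the labels most probable
```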
3. Mathematical Principles
3.1 Basic Concept of Logistic Regression
Logistic Regression makes a prediction in three steps (a short code sketch follows this list):
- Compute a weighted sum of the input features, as in linear regression:
$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$$
where:
  - $x_1, \dots, x_n$ are the input features,
  - $w_1, \dots, w_n$ are the corresponding weights,
  - $b$ is the bias term.
- Apply the Sigmoid function to map the sum into the $(0,1)$ range as a probability:
$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$$
- Set a threshold (e.g., 0.5) for classification:
  - If $\hat{y} \ge 0.5$, predict class 1.
  - If $\hat{y} < 0.5$, predict class 0.
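A minimal NumPy sketch of these three steps, with illustrative weights and bias rather than values from the article:

```python
import numpy as np

def predict(x, w, b, threshold=0.5):
    z = np.dot(w, x) + b                    # step 1: weighted sum
    y_hat = 1.0 / (1.0 + np.exp(-z))        # step 2: sigmoid -> probability
    return int(y_hat >= threshold), y_hat   # step 3: threshold

w = np.array([0.8, -0.4])    # illustrative weights
b = -0.1                      # illustrative bias
label, prob = predict(np.array([2.0, 1.0]), w, b)
print(label, round(prob, 3))
```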

3.2 Sigmoid (S-Shaped) Function
The key to Logistic Regression is the Sigmoid (or Logistic) function, expressed as:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where $z = w_1 x_1 + \dots + w_n x_n + b$ represents the linear combination of input features.
The Sigmoid function outputs values in the $(0,1)$ range, representing the probability of class 1:

- Monotonically increasing: Larger input values yield outputs closer to 1, while smaller input values yield outputs closer to 0.
- S-shaped curve: The function is symmetric and smooth, ensuring a gradual transition between 0 and 1.
- Probability interpretation: Outputs lie in the (0,1) range, making them suitable for probability representation.
Derivative of the Sigmoid Function
$$\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$$
This derivative is useful for gradient-based optimization.
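As a quick sanity check (not part of the original text), the identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$ can be verified against a central finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7
analytic = sigmoid(z) * (1 - sigmoid(z))
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6   # central difference
print(analytic, numeric)   # the two values agree to several decimal places
```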
3.3 Decision Boundary
The decision boundary in logistic regression is linear. Given the model:
$$\hat{y} = \sigma(w_1 x_1 + w_2 x_2 + b)$$
At $\hat{y} = 0.5$, we obtain:
$$w_1 x_1 + w_2 x_2 + b = 0$$
This represents a straight-line equation in a two-dimensional space:
$$x_2 = -\frac{w_1}{w_2} x_1 - \frac{b}{w_2}$$
For higher-dimensional datasets, the decision boundary is a hyperplane that separates data into two classes.
3.4 Loss Function
Logistic regression uses the logarithmic loss function (Log Loss), also known as cross-entropy loss:
$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$
where:
- $m$ is the number of training samples,
- $y^{(i)}$ is the actual label (0 or 1) of sample $i$,
- $\hat{y}^{(i)}$ is the predicted probability for sample $i$.
Understanding Cross-Entropy Loss
- If $y=1$, the loss function reduces to $-\log \hat{y}$:
  - If $\hat{y} \to 1$, the loss is close to 0 (correct prediction).
  - If $\hat{y} \to 0$, the loss becomes very large (incorrect prediction).
- If $y=0$, the loss function reduces to $-\log(1-\hat{y})$:
  - If $\hat{y} \to 0$, the loss is close to 0 (correct prediction).
  - If $\hat{y} \to 1$, the loss becomes very large (incorrect prediction).
Why Use Cross-Entropy Instead of MSE?
- Cross-entropy loss is better suited for classification, as it directly optimizes probability estimates.
- Mean Squared Error (MSE) combined with the sigmoid yields a non-convex objective and can cause vanishing gradients, slowing down training.
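The following sketch (labels and predictions are made up for illustration) shows how cross-entropy punishes confident wrong predictions much more heavily than MSE:

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

y = np.array([1, 0, 1, 0])
confident_wrong = np.array([0.01, 0.99, 0.02, 0.98])   # badly wrong predictions

# Cross-entropy penalizes confident mistakes far more heavily than MSE
print(cross_entropy(y, confident_wrong))   # large (~4.3)
print(mse(y, confident_wrong))             # bounded (~0.97)
```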
3.5 Parameter Optimization: Gradient Descent
The goal is to find the optimal parameters w and b that minimize the loss function. The most common optimization method is gradient descent.
Computing the Gradients
The gradients of the loss function with respect to the parameters are:
$$\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)$$
Gradient Descent Update Rule
The parameters are updated using the gradient descent rule:
$$w_j := w_j - \alpha \frac{\partial J}{\partial w_j}, \qquad b := b - \alpha \frac{\partial J}{\partial b}$$
where:
- $\alpha$ is the learning rate, controlling the step size of updates.
- The process repeats iteratively until convergence.
Types of Gradient Descent
- Batch Gradient Descent (BGD):
- Uses all training data in each update.
- More stable but computationally expensive for large datasets.
- Stochastic Gradient Descent (SGD):
- Updates parameters using a single random sample per iteration.
- Faster but introduces more noise in updates.
- Mini-Batch Gradient Descent (MBGD):
- Uses a small batch of samples per update.
- Balances speed and stability.
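A compact NumPy sketch of mini-batch gradient descent for logistic regression (the data and hyperparameters are illustrative, not from the article); setting `batch_size` to the dataset size recovers BGD, and setting it to 1 recovers SGD:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=200, batch_size=2, seed=0):
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        idx = rng.permutation(len(y))                  # shuffle each epoch
        for start in range(0, len(y), batch_size):
            batch = idx[start:start + batch_size]
            y_hat = sigmoid(X[batch] @ w + b)
            error = y_hat - y[batch]
            w -= lr * X[batch].T @ error / len(batch)  # dJ/dw on the mini-batch
            b -= lr * error.mean()                     # dJ/db on the mini-batch
    return w, b

# toy data: two word-count features, binary label
X = np.array([[3, 2], [0, 0], [2, 1], [1, 0]], dtype=float)
y = np.array([1, 0, 1, 0], dtype=float)
w, b = train_logreg(X, y)
print(np.round(sigmoid(X @ w + b), 3))   # higher probabilities for the spam-like rows
```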
4. Example: Email Spam Classification
We want to classify emails as spam (1) or not spam (0) based on two features:
- Number of times "Free" appears (feature $x_1$)
- Number of times "Win" appears (feature $x_2$)
Training Data
Email | "Free" Count () | "Win" Count (X2X_2X2) | Spam? (YYY) |
1 | 3 | 2 | 1 (Spam) |
2 | 0 | 0 | 0 (Not Spam) |
3 | 2 | 1 | 1 (Spam) |
4 | 1 | 0 | 0 (Not Spam) |
We initialize model parameters:
- Weights: $w_1 = 0.5$, $w_2 = 0.5$
- Bias: $b = 0$
- Learning rate: $\alpha$ (a small positive step size)
Step 1: Compute Linear Combination z
| Email | $z = 0.5 x_1 + 0.5 x_2 + 0$ |
| --- | --- |
| 1 | $0.5 \cdot 3 + 0.5 \cdot 2 = 2.5$ |
| 2 | $0.5 \cdot 0 + 0.5 \cdot 0 = 0$ |
| 3 | $0.5 \cdot 2 + 0.5 \cdot 1 = 1.5$ |
| 4 | $0.5 \cdot 1 + 0.5 \cdot 0 = 0.5$ |
Step 2: Apply the Sigmoid Function
| Email | $z$ | $\hat{y} = \sigma(z)$ |
| --- | --- | --- |
| 1 | 2.5 | 0.924 |
| 2 | 0 | 0.500 |
| 3 | 1.5 | 0.818 |
| 4 | 0.5 | 0.622 |
These values represent the probability of an email being spam.
Step 3: Compute Loss (Cross-Entropy)
For each email, the loss is $-[y \log \hat{y} + (1-y)\log(1-\hat{y})]$:
- Email 1 ($y=1$, $\hat{y}=0.924$): $-\log(0.924) \approx 0.079$
- Email 2 ($y=0$, $\hat{y}=0.500$): $-\log(0.500) \approx 0.693$
- Email 3 ($y=1$, $\hat{y}=0.818$): $-\log(0.818) \approx 0.201$
- Email 4 ($y=0$, $\hat{y}=0.622$): $-\log(1-0.622) \approx 0.973$
The average loss over the four emails is $J \approx (0.079 + 0.693 + 0.201 + 0.973)/4 \approx 0.49$.
Step 4: Gradient Descent (Parameter Update)
Compute the gradients using $\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_i (\hat{y}^{(i)} - y^{(i)}) x_j^{(i)}$ and $\frac{\partial J}{\partial b} = \frac{1}{m}\sum_i (\hat{y}^{(i)} - y^{(i)})$. With the predictions above, the errors $\hat{y} - y$ are $-0.076$, $0.5$, $-0.182$, and $0.622$, giving:
$$\frac{\partial J}{\partial w_1} \approx 0.008, \qquad \frac{\partial J}{\partial w_2} \approx -0.084, \qquad \frac{\partial J}{\partial b} \approx 0.216$$
Update the parameters:
$$w_1 := w_1 - \alpha \frac{\partial J}{\partial w_1}, \qquad w_2 := w_2 - \alpha \frac{\partial J}{\partial w_2}, \qquad b := b - \alpha \frac{\partial J}{\partial b}$$
Step 5: Prediction on a New Email
New email:
- "Free" appears 2 times
- "Win" appears 1 time
Compute (the parameters change only slightly after one update, so we use the values from above):
$$z = 0.5 \cdot 2 + 0.5 \cdot 1 + 0 = 1.5, \qquad \hat{y} = \sigma(1.5) \approx 0.82$$
Since $\hat{y} \approx 0.82 \ge 0.5$, we classify it as spam.
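For reference, the whole worked example can be reproduced in a few lines of NumPy; the learning rate below is an assumption, since the article does not state its value:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[3, 2], [0, 0], [2, 1], [1, 0]], dtype=float)   # "Free", "Win" counts
y = np.array([1, 0, 1, 0], dtype=float)                        # spam labels
w, b = np.array([0.5, 0.5]), 0.0                                # initial parameters

y_hat = sigmoid(X @ w + b)                                      # Steps 1-2
loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(round(loss, 3))                                           # Step 3, ~0.49

grad_w = X.T @ (y_hat - y) / len(y)                             # Step 4: gradients
grad_b = np.mean(y_hat - y)
alpha = 0.1                                                     # assumed learning rate
w, b = w - alpha * grad_w, b - alpha * grad_b

new_email = np.array([2, 1], dtype=float)                       # Step 5: new email
print(round(sigmoid(new_email @ w + b), 3))                     # ~0.81 -> spam
```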
5. Logistic Regression Coding
Logistic Regression Hyperparameter Tuning Summary
1. Key Hyperparameters to Tune
- `C` (regularization strength)
  - Controls the inverse of the regularization strength (L2 regularization by default).
  - Lower values (`C < 1`) increase regularization (reduce overfitting), while higher values (`C > 1`) decrease regularization.
  - Typical range: `[0.01, 0.1, 1, 10]`
- `solver` (optimization algorithm)
  - `liblinear`: best for small datasets, supports `l1` and `l2`.
  - `lbfgs`: the default, good for medium-sized datasets, supports `l2`.
  - `saga`: efficient for large, sparse datasets, supports `l1`, `l2`, and `elasticnet`.
- `max_iter` (maximum iterations)
  - Increase the number of iterations to ensure convergence on large datasets.
  - Typical range: `[500, 1000, 2000]`
- `penalty` (regularization type)
  - `l2`: the default, works for most cases.
  - `l1`: used for feature selection (only works with `liblinear` and `saga`).
  - `elasticnet`: combines `l1` and `l2`, supported only with `saga`.

Usage examples for each option are given in the sketch below.
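A sketch of how these options are typically passed to scikit-learn's `LogisticRegression` (the specific values are illustrative, not recommendations from the article):

```python
from sklearn.linear_model import LogisticRegression

# Stronger regularization (small C), default lbfgs solver with l2 penalty
clf_small_c = LogisticRegression(C=0.1, max_iter=1000)

# l1 penalty for feature selection; liblinear supports it on small datasets
clf_l1 = LogisticRegression(C=1.0, penalty="l1", solver="liblinear")

# elasticnet is only available with the saga solver and needs l1_ratio
clf_enet = LogisticRegression(penalty="elasticnet", solver="saga",
                              l1_ratio=0.5, C=1.0, max_iter=2000)
```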
2. How to Optimize These Hyperparameters?
(1) Grid Search (`GridSearchCV`): exhaustive search for the best combination of the candidate values.
(2) Random Search (`RandomizedSearchCV`): samples a fixed number of combinations at random; efficient for large parameter spaces.
Both approaches are sketched in the code below.
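A minimal sketch of both search strategies with scikit-learn; the dataset and parameter grid below are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear", "saga"],
    "max_iter": [1000],
}

# (1) Grid search: tries every combination of the candidate values
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_)

# (2) Random search: samples a fixed number of combinations
rand = RandomizedSearchCV(LogisticRegression(), param_grid, n_iter=8,
                          cv=5, scoring="accuracy", random_state=42)
rand.fit(X, y)
print(rand.best_params_)
```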
3. Summary Table

| Parameter | Effect | Typical Range | Best Use Case |
| --- | --- | --- | --- |
| `C` | Regularization strength | 0.01 ~ 10 | Lower to prevent overfitting, higher for a closer fit |
| `solver` | Optimization algorithm | `lbfgs`, `liblinear`, `saga` | `lbfgs` as default, `saga` for large datasets |
| `max_iter` | Number of iterations | 500 ~ 2000 | Increase if convergence issues arise |
| `penalty` | Regularization type | `l1`, `l2`, `elasticnet` | `l1` for sparse features |
For best results, use `GridSearchCV` or `RandomizedSearchCV` to automatically find the optimal hyperparameters.
6. Top Logistic Regression Interview Questions & Answers
Below is a comprehensive list of interview questions related to Logistic Regression, along with detailed answers.
Q1: What is Logistic Regression?
Answer:
Logistic Regression is a supervised learning algorithm used for binary classification problems. Instead of predicting continuous values like Linear Regression, it predicts the probability of a sample belonging to a particular class using the sigmoid function.
Mathematically, the model is:
$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(w^{T}x + b)}}$$
where $P(y = 1 \mid x)$ is the probability that the output belongs to class 1.
Q2: Why can't we use Linear Regression for classification?
Answer:
Linear Regression provides continuous values, which are not suitable for classification. If we try to use it for classification:
- Unbounded Output: Linear Regression can output values beyond [0,1], making it unsuitable for probabilities.
- Poor Decision Boundaries: Linear Regression does not naturally map to distinct classes.
- Lack of Probability Interpretation: Logistic Regression outputs probabilities, making threshold-based classification more meaningful.
Q3: What is the Sigmoid function and why is it used in Logistic Regression?
Answer:
The Sigmoid function is defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where z is the linear combination of input features.
- Why is it used?
- It maps any real number to a value between 0 and 1, making it suitable for probability estimation.
- It introduces non-linearity, enabling classification.
- It is differentiable, allowing optimization via Gradient Descent.
Q4: What is the Decision Boundary in Logistic Regression?
Answer:
The decision boundary is the line (or surface in higher dimensions) that separates different classes.
For a binary classification problem, it is defined by the equation:
$$\sigma(w^{T}x + b) = 0.5, \quad \text{i.e. } w^{T}x + b = 0$$
- If the result is ≥ 0.5, classify as class 1.
- If the result is < 0.5, classify as class 0.
The decision boundary is linear unless feature transformations (like polynomial terms) are introduced.
Q5: What is the Loss Function used in Logistic Regression?
Answer:
Logistic Regression uses Log Loss (Cross-Entropy Loss):
$$J(w) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$
where:
- $\hat{y}^{(i)}$ is the predicted probability,
- $y^{(i)}$ is the actual label.
This function penalizes incorrect predictions and is convex, allowing optimization via Gradient Descent.
Q6: How are the parameters of Logistic Regression optimized?
Answer:
Logistic Regression parameters are optimized using Gradient Descent:
- Compute the gradient of the cost function with respect to each parameter.
- Update the parameters iteratively using:
$$w_j := w_j - \alpha \frac{\partial J}{\partial w_j}$$
where $\alpha$ is the learning rate.
- Repeat until convergence.
Other optimization methods include:
- Newton’s Method (Newton-Raphson)
- Stochastic Gradient Descent (SGD)
- Batch & Mini-Batch Gradient Descent
Q7: What is Regularization in Logistic Regression? Why is it needed?
Answer:
Regularization prevents overfitting by adding a penalty term to the loss function:
- L1 Regularization (Lasso): adds $\lambda \sum_j |w_j|$ to the loss.
  - Helps in feature selection (some coefficients shrink to zero).
- L2 Regularization (Ridge): adds $\lambda \sum_j w_j^2$ to the loss.
  - Helps in reducing large coefficients but does not eliminate features.
Q8: How do you evaluate a Logistic Regression model?
Answer:
Key evaluation metrics include:
- Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$
- Precision: $\frac{TP}{TP + FP}$
- Recall (Sensitivity): $\frac{TP}{TP + FN}$
- F1-Score: harmonic mean of Precision and Recall
- ROC-AUC: measures how well the model ranks positive samples above negative ones across thresholds.
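For reference, these metrics can all be computed with scikit-learn (the label and probability arrays below are placeholders):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                   # thresholded predictions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]   # predicted probabilities

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_prob))                 # uses probabilities, not labels
```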
Q9: How do you handle imbalanced datasets in Logistic Regression?
Answer:
For imbalanced data:
- Use class weights, e.g. `class_weight="balanced"` (see the sketch after this list)
- Oversampling (SMOTE) or Undersampling
- Threshold Adjustment based on ROC Curve.
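A minimal example of the class-weight option in scikit-learn (illustrative; SMOTE itself lives in the separate `imbalanced-learn` package):

```python
from sklearn.linear_model import LogisticRegression

# "balanced" reweights classes inversely to their frequency in y
clf = LogisticRegression(class_weight="balanced", max_iter=1000)

# An explicit mapping also works, e.g. penalize mistakes on class 1 more heavily
clf_manual = LogisticRegression(class_weight={0: 1, 1: 5}, max_iter=1000)
```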
Q10: What are some alternatives to Logistic Regression?
Answer:
- Naïve Bayes: Works well for text classification.
- Decision Trees: Handles nonlinear data.
- Support Vector Machines (SVMs): Works well with high-dimensional data.
- Neural Networks: Useful for complex feature interactions.
Q11: When should you NOT use Logistic Regression?
Answer:
- When the relationship is highly non-linear.
- When features are heavily correlated (Multicollinearity).
- When there are many categorical features with high cardinality.
- When dealing with imbalanced data without proper handling.
Q12: What is Multinomial Logistic Regression?
Answer:
Multinomial Logistic Regression is used for multi-class classification. Instead of using a single Sigmoid function, it applies the Softmax function:
$$P(y = k \mid x) = \frac{e^{w_k^{T} x}}{\sum_{j=1}^{K} e^{w_j^{T} x}}$$
where $K$ is the number of classes.
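A tiny NumPy illustration of the Softmax (the class scores here are arbitrary):

```python
import numpy as np

def softmax(scores):
    scores = scores - np.max(scores)   # shift for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # three class probabilities summing to 1
```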
Q13: Can Logistic Regression be used for Time Series Data?
Answer:
No, Logistic Regression assumes independent observations, whereas time series data exhibits dependencies over time. Instead, use Recurrent Neural Networks (RNNs) or Hidden Markov Models (HMMs).
- Author: Entropyobserver
- URL: https://tangly1024.com/article/1c7d698f-3512-80f8-8fa0-c949bde042fc
- Copyright: Unless otherwise stated, all articles in this blog are released under the BY-NC-SA agreement. Please indicate the source when reposting!