1. Introduction
Logistic Regression is a widely used supervised learning algorithm for binary classification problems. It is commonly applied in scenarios such as credit scoring, medical diagnosis, and advertisement click-through rate prediction. Despite having "regression" in its name, Logistic Regression is fundamentally a classification algorithm.
Logistic Regression is popular in industry due to its simplicity, parallelizability, and strong interpretability. In essence, Logistic Regression assumes that the data follows a certain distribution (a Bernoulli distribution for the labels) and then uses Maximum Likelihood Estimation to estimate the parameters.
2. Mathematical Theory
1. Probability Model (Bernoulli Distribution)
In logistic regression, the output variable $y$ is assumed to follow a Bernoulli distribution. This means that $y$ can take only two values: $y=1$ (success) or $y=0$ (failure).
The probability of observing $y$, given the input features, is modeled as:
$$P(y \mid x) = p^{y}(1-p)^{1-y}$$
This formula expresses the likelihood of observing $y=1$ or $y=0$ given the probability $p$:
- If $y=1$, the likelihood is $p$
- If $y=0$, the likelihood is $1-p$
So this equation simply models the probability of observing either outcome, $y=0$ or $y=1$, in terms of the probability $p$.
2. Logistic Function (Sigmoid Function)
The probability $p$ in the Bernoulli distribution is modeled using the logistic function (also called the sigmoid function), which ensures that $p$ lies between 0 and 1, as probabilities should. The logistic function is defined as:
$$p = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + \dots + w_n x_n)}}$$
Here:
- $w_0$ is the intercept (bias term).
- $w_1, \dots, w_n$ are the weights (parameters) corresponding to the features $x_1, \dots, x_n$.
- The expression $w_0 + w_1 x_1 + \dots + w_n x_n$ is a linear combination of the input features and their corresponding weights.
The logistic function transforms this linear combination into a probability $p$ that ranges from 0 to 1, which can then be interpreted as the probability of class $y=1$.
3. Maximum Likelihood Estimation (MLE)
Now, the task of logistic regression is to find the best parameters that maximize the likelihood of observing the data (the labels $y$) given the features $x$. This is where Maximum Likelihood Estimation (MLE) comes in.
- Likelihood: The likelihood function is the product of the probabilities of observing each data point in the dataset. For each data point, the likelihood is given by $p^{y}(1-p)^{1-y}$, which depends on the parameter $p$.
- Maximizing the Likelihood: MLE seeks to maximize the likelihood function by adjusting the parameters $w_0, w_1, \dots, w_n$. Essentially, MLE finds the parameters that make the observed labels (0 or 1) as probable as possible.
Maximum Likelihood Estimation (MLE) is a statistical method that finds the parameter values that make the observed data most likely.
Imagine you're a detective investigating whether a die is "loaded" (biased). You roll it 10 times and get the following results:
6, 2, 3, 6, 4, 6, 6, 5, 6, 1
Out of these 10 rolls, the number 6 appears 5 times. So MLE helps us estimate the probability of rolling a 6 based on these observed results. Since 6 appears 5 times out of 10 rolls, the MLE estimate of $p$ would be:
$$\hat{p} = \frac{5}{10} = 0.5$$
This means that, based on the data you have collected, the estimated probability of rolling a 6 is 50%.
MLE works by finding the parameters (like $p$) that maximize the likelihood of observing the data. Here, the data is 10 die rolls, and the goal is to find the parameter (the probability of rolling a 6) that makes these 10 results most likely. If you collect more data (say, 100 rolls), MLE will update the estimate based on the new data, ensuring that the estimate aligns with the new evidence.
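As a quick numerical check of this idea (a sketch, not part of the original post), we can evaluate the binomial log-likelihood over a grid of candidate values of $p$ and confirm that it peaks at 0.5:

```python
import numpy as np

rolls = np.array([6, 2, 3, 6, 4, 6, 6, 5, 6, 1])
successes = np.sum(rolls == 6)          # 5 sixes
n = len(rolls)                           # 10 rolls

# Evaluate the log-likelihood of "6 with probability p" on a grid of candidates
p_grid = np.linspace(0.01, 0.99, 99)
log_lik = successes * np.log(p_grid) + (n - successes) * np.log(1 - p_grid)

p_mle = p_grid[np.argmax(log_lik)]
print(p_mle)   # ~0.5, matching the closed-form estimate 5/10
```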
4. Log-Likelihood Function
In practice, we work with the log-likelihood because it is mathematically easier to handle. The log-likelihood function is the natural logarithm of the likelihood:
$$\ell(w) = \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$
Here:
- $N$ is the number of data points.
- $y_i$ is the observed label for the $i$-th data point.
- $p_i$ is the predicted probability for the $i$-th data point (calculated using the logistic function).
The log-likelihood function quantifies how well the parameters explain the observed data.
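A small NumPy helper (illustrative, with made-up labels and probabilities) that computes this quantity directly:

```python
import numpy as np

def log_likelihood(y, p, eps=1e-12):
    """Bernoulli log-likelihood of labels y given predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)         # avoid log(0)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.4])
print(log_likelihood(y, p))              # closer to 0 means a better fit
```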
5. Optimization: Gradient Descent
To find the best parameters, we need to optimize the log-likelihood function. Gradient descent is commonly used for this:
- Objective: We aim to maximize the log-likelihood, which is equivalent to minimizing the negative log-likelihood.
- Gradient: The gradient (or derivative) of the log-likelihood tells us how to adjust the parameters to improve the likelihood.
- Descent: Starting with some initial values for $w_0, w_1, \dots, w_n$, we iteratively adjust the parameters in the direction of the gradient to increase the log-likelihood, eventually converging to the optimal parameters.
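As a rough illustration (the toy data and learning rate here are my own choices, not from the original post), gradient ascent on the log-likelihood repeatedly moves the weights along $X^{\top}(y - p)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: the first column of ones plays the role of the intercept w_0
X = np.array([[1, 2.0], [1, 0.5], [1, -1.0], [1, -2.5]])
y = np.array([1.0, 1.0, 0.0, 0.0])

w = np.zeros(X.shape[1])          # start from w_0 = w_1 = 0
lr = 0.1                           # learning rate

for _ in range(200):
    p = sigmoid(X @ w)             # current predicted probabilities
    w += lr * X.T @ (y - p)        # move *up* the log-likelihood gradient

print(np.round(w, 2))              # parameters that make the labels most probable
```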
3. Mathematical Principles
3.1 Basic Concept of Logistic Regression
Logistic Regression makes a prediction in three steps (a short code sketch follows this list):
- Compute a weighted sum of the input features, as in linear regression:
$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$$
where:
  - $x_1, \dots, x_n$ are the input features,
  - $w_1, \dots, w_n$ are the corresponding weights,
  - $b$ is the bias term.
- Apply the Sigmoid function to map the sum into the $(0,1)$ range as a probability:
$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$$
- Set a threshold (e.g., 0.5) for classification:
  - If $\hat{y} \ge 0.5$, predict class 1.
  - If $\hat{y} < 0.5$, predict class 0.
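A minimal NumPy sketch of these three steps, with illustrative weights and bias rather than values from the article:

```python
import numpy as np

def predict(x, w, b, threshold=0.5):
    z = np.dot(w, x) + b                    # step 1: weighted sum
    y_hat = 1.0 / (1.0 + np.exp(-z))        # step 2: sigmoid -> probability
    return int(y_hat >= threshold), y_hat   # step 3: threshold

w = np.array([0.8, -0.4])    # illustrative weights
b = -0.1                      # illustrative bias
label, prob = predict(np.array([2.0, 1.0]), w, b)
print(label, round(prob, 3))
```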

3.2 Sigmoid (S-Shaped) Function
The key to Logistic Regression is the Sigmoid (or Logistic) function, expressed as:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where $z = w_1 x_1 + \dots + w_n x_n + b$ represents the linear combination of input features.
The Sigmoid function outputs values in the $(0,1)$ range, representing the probability of class 1:

- Monotonically increasing: Larger input values yield outputs closer to 1, while smaller input values yield outputs closer to 0.
- S-shaped curve: The function is symmetric and smooth, ensuring a gradual transition between 0 and 1.
- Probability interpretation: Outputs lie in the (0,1) range, making them suitable for probability representation.
Derivative of the Sigmoid Function
$$\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$$
This derivative is useful for gradient-based optimization.
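As a quick sanity check (not part of the original text), the identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$ can be verified against a central finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7
analytic = sigmoid(z) * (1 - sigmoid(z))
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6   # central difference
print(analytic, numeric)   # the two values agree to several decimal places
```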
3.3 Decision Boundary
The decision boundary in logistic regression is linear. Given the model:
$$\hat{y} = \sigma(w_1 x_1 + w_2 x_2 + b)$$
At $\hat{y} = 0.5$, we obtain:
$$w_1 x_1 + w_2 x_2 + b = 0$$
This represents a straight-line equation in a two-dimensional space:
$$x_2 = -\frac{w_1}{w_2} x_1 - \frac{b}{w_2}$$
For higher-dimensional datasets, the decision boundary is a hyperplane that separates data into two classes.
3.4 Loss Function
Logistic regression uses the logarithmic loss function (Log Loss), also known as cross-entropy loss:
$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$
where:
- $m$ is the number of training samples,
- $y^{(i)}$ is the actual label (0 or 1) of sample $i$,
- $\hat{y}^{(i)}$ is the predicted probability for sample $i$.
Understanding Cross-Entropy Loss
- If $y=1$, the loss function reduces to $-\log \hat{y}$:
  - If $\hat{y} \to 1$, the loss is close to 0 (correct prediction).
  - If $\hat{y} \to 0$, the loss becomes very large (incorrect prediction).
- If $y=0$, the loss function reduces to $-\log(1-\hat{y})$:
  - If $\hat{y} \to 0$, the loss is close to 0 (correct prediction).
  - If $\hat{y} \to 1$, the loss becomes very large (incorrect prediction).
Why Use Cross-Entropy Instead of MSE?
- Cross-entropy loss is better suited for classification, as it directly optimizes probability estimates.
- Mean Squared Error (MSE) combined with the sigmoid yields a non-convex objective and can cause vanishing gradients, slowing down training.
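The following sketch (labels and predictions are made up for illustration) shows how cross-entropy punishes confident wrong predictions much more heavily than MSE:

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

y = np.array([1, 0, 1, 0])
confident_wrong = np.array([0.01, 0.99, 0.02, 0.98])   # badly wrong predictions

# Cross-entropy penalizes confident mistakes far more heavily than MSE
print(cross_entropy(y, confident_wrong))   # large (~4.3)
print(mse(y, confident_wrong))             # bounded (~0.97)
```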
3.5 Parameter Optimization: Gradient Descent
The goal is to find the optimal parameters w and b that minimize the loss function. The most common optimization method is gradient descent.
Computing the Gradients
The gradients of the loss function with respect to the parameters are:
$$\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)$$
Gradient Descent Update Rule
The parameters are updated using the gradient descent rule:
$$w_j := w_j - \alpha \frac{\partial J}{\partial w_j}, \qquad b := b - \alpha \frac{\partial J}{\partial b}$$
where:
- $\alpha$ is the learning rate, controlling the step size of updates.
- The process repeats iteratively until convergence.
Types of Gradient Descent
- Batch Gradient Descent (BGD):
- Uses all training data in each update.
- More stable but computationally expensive for large datasets.
- Stochastic Gradient Descent (SGD):
- Updates parameters using a single random sample per iteration.
- Faster but introduces more noise in updates.
- Mini-Batch Gradient Descent (MBGD):
- Uses a small batch of samples per update.
- Balances speed and stability.
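A compact NumPy sketch of mini-batch gradient descent for logistic regression (the data and hyperparameters are illustrative, not from the article); setting `batch_size` to the dataset size recovers BGD, and setting it to 1 recovers SGD:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=200, batch_size=2, seed=0):
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        idx = rng.permutation(len(y))                  # shuffle each epoch
        for start in range(0, len(y), batch_size):
            batch = idx[start:start + batch_size]
            y_hat = sigmoid(X[batch] @ w + b)
            error = y_hat - y[batch]
            w -= lr * X[batch].T @ error / len(batch)  # dJ/dw on the mini-batch
            b -= lr * error.mean()                     # dJ/db on the mini-batch
    return w, b

# toy data: two word-count features, binary label
X = np.array([[3, 2], [0, 0], [2, 1], [1, 0]], dtype=float)
y = np.array([1, 0, 1, 0], dtype=float)
w, b = train_logreg(X, y)
print(np.round(sigmoid(X @ w + b), 3))   # higher probabilities for the spam-like rows
```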
4. Example: Email Spam Classification
We want to classify emails as spam (1) or not spam (0) based on two features:
- Number of times "Free" appears (feature $x_1$)
- Number of times "Win" appears (feature $x_2$)
Training Data
Email | "Free" Count () | "Win" Count (X2X_2X2) | Spam? (YYY) |
1 | 3 | 2 | 1 (Spam) |
2 | 0 | 0 | 0 (Not Spam) |
3 | 2 | 1 | 1 (Spam) |
4 | 1 | 0 | 0 (Not Spam) |
We initialize model parameters:
- Weights: $w_1 = 0.5$, $w_2 = 0.5$
- Bias: $b = 0$
- Learning rate: $\alpha$ (a small positive step size)
Step 1: Compute Linear Combination z
| Email | $z = 0.5 x_1 + 0.5 x_2 + 0$ |
| --- | --- |
| 1 | $0.5 \cdot 3 + 0.5 \cdot 2 = 2.5$ |
| 2 | $0.5 \cdot 0 + 0.5 \cdot 0 = 0$ |
| 3 | $0.5 \cdot 2 + 0.5 \cdot 1 = 1.5$ |
| 4 | $0.5 \cdot 1 + 0.5 \cdot 0 = 0.5$ |
Step 2: Apply the Sigmoid Function
| Email | $z$ | $\hat{y} = \sigma(z)$ |
| --- | --- | --- |
| 1 | 2.5 | 0.924 |
| 2 | 0 | 0.500 |
| 3 | 1.5 | 0.818 |
| 4 | 0.5 | 0.622 |
These values represent the probability of an email being spam.
Step 3: Compute Loss (Cross-Entropy)
For each email, the loss is $-[y \log \hat{y} + (1-y)\log(1-\hat{y})]$:
- Email 1 ($y=1$, $\hat{y}=0.924$): $-\log(0.924) \approx 0.079$
- Email 2 ($y=0$, $\hat{y}=0.500$): $-\log(0.500) \approx 0.693$
- Email 3 ($y=1$, $\hat{y}=0.818$): $-\log(0.818) \approx 0.201$
- Email 4 ($y=0$, $\hat{y}=0.622$): $-\log(1-0.622) \approx 0.973$
The average loss over the four emails is $J \approx (0.079 + 0.693 + 0.201 + 0.973)/4 \approx 0.49$.
Step 4: Gradient Descent (Parameter Update)
Compute the gradients using $\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_i (\hat{y}^{(i)} - y^{(i)}) x_j^{(i)}$ and $\frac{\partial J}{\partial b} = \frac{1}{m}\sum_i (\hat{y}^{(i)} - y^{(i)})$. With the predictions above, the errors $\hat{y} - y$ are $-0.076$, $0.5$, $-0.182$, and $0.622$, giving:
$$\frac{\partial J}{\partial w_1} \approx 0.008, \qquad \frac{\partial J}{\partial w_2} \approx -0.084, \qquad \frac{\partial J}{\partial b} \approx 0.216$$
Update the parameters:
$$w_1 := w_1 - \alpha \frac{\partial J}{\partial w_1}, \qquad w_2 := w_2 - \alpha \frac{\partial J}{\partial w_2}, \qquad b := b - \alpha \frac{\partial J}{\partial b}$$
Step 5: Prediction on a New Email
New email:
- "Free" appears 2 times
- "Win" appears 1 time
Compute (the parameters change only slightly after one update, so we use the values from above):
$$z = 0.5 \cdot 2 + 0.5 \cdot 1 + 0 = 1.5, \qquad \hat{y} = \sigma(1.5) \approx 0.82$$
Since $\hat{y} \approx 0.82 \ge 0.5$, we classify it as spam.
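For reference, the whole worked example can be reproduced in a few lines of NumPy; the learning rate below is an assumption, since the article does not state its value:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[3, 2], [0, 0], [2, 1], [1, 0]], dtype=float)   # "Free", "Win" counts
y = np.array([1, 0, 1, 0], dtype=float)                        # spam labels
w, b = np.array([0.5, 0.5]), 0.0                                # initial parameters

y_hat = sigmoid(X @ w + b)                                      # Steps 1-2
loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(round(loss, 3))                                           # Step 3, ~0.49

grad_w = X.T @ (y_hat - y) / len(y)                             # Step 4: gradients
grad_b = np.mean(y_hat - y)
alpha = 0.1                                                     # assumed learning rate
w, b = w - alpha * grad_w, b - alpha * grad_b

new_email = np.array([2, 1], dtype=float)                       # Step 5: new email
print(round(sigmoid(new_email @ w + b), 3))                     # ~0.81 -> spam
```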
5. Logistic Regression Coding
Logistic Regression Hyperparameter Tuning Summary
1. Key Hyperparameters to Tune
- `C` (regularization strength)
  - Controls the inverse of the regularization strength (L2 regularization by default).
  - Lower values (`C < 1`) increase regularization (reduce overfitting), while higher values (`C > 1`) decrease regularization.
  - Typical range: `[0.01, 0.1, 1, 10]`
- `solver` (optimization algorithm)
  - `liblinear`: best for small datasets, supports `l1` and `l2`.
  - `lbfgs`: the default, good for medium-sized datasets, supports `l2`.
  - `saga`: efficient for large, sparse datasets, supports `l1`, `l2`, and `elasticnet`.
- `max_iter` (maximum iterations)
  - Increase the number of iterations to ensure convergence on large datasets.
  - Typical range: `[500, 1000, 2000]`
- `penalty` (regularization type)
  - `l2`: the default, works for most cases.
  - `l1`: used for feature selection (only works with `liblinear` and `saga`).
  - `elasticnet`: combines `l1` and `l2`, supported only with `saga`.

Usage examples for each option are given in the sketch below.
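A sketch of how these options are typically passed to scikit-learn's `LogisticRegression` (the specific values are illustrative, not recommendations from the article):

```python
from sklearn.linear_model import LogisticRegression

# Stronger regularization (small C), default lbfgs solver with l2 penalty
clf_small_c = LogisticRegression(C=0.1, max_iter=1000)

# l1 penalty for feature selection; liblinear supports it on small datasets
clf_l1 = LogisticRegression(C=1.0, penalty="l1", solver="liblinear")

# elasticnet is only available with the saga solver and needs l1_ratio
clf_enet = LogisticRegression(penalty="elasticnet", solver="saga",
                              l1_ratio=0.5, C=1.0, max_iter=2000)
```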
2. How to Optimize These Hyperparameters?
(1) Grid Search (`GridSearchCV`): exhaustive search for the best combination of the candidate values.
(2) Random Search (`RandomizedSearchCV`): samples a fixed number of combinations at random; efficient for large parameter spaces.
Both approaches are sketched in the code below.
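A minimal sketch of both search strategies with scikit-learn; the dataset and parameter grid below are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear", "saga"],
    "max_iter": [1000],
}

# (1) Grid search: tries every combination of the candidate values
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_)

# (2) Random search: samples a fixed number of combinations
rand = RandomizedSearchCV(LogisticRegression(), param_grid, n_iter=8,
                          cv=5, scoring="accuracy", random_state=42)
rand.fit(X, y)
print(rand.best_params_)
```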
3. Summary Table

| Parameter | Effect | Typical Range | Best Use Case |
| --- | --- | --- | --- |
| `C` | Regularization strength | 0.01 ~ 10 | Lower to prevent overfitting, higher for a closer fit |
| `solver` | Optimization algorithm | `lbfgs`, `liblinear`, `saga` | `lbfgs` as default, `saga` for large datasets |
| `max_iter` | Number of iterations | 500 ~ 2000 | Increase if convergence issues arise |
| `penalty` | Regularization type | `l1`, `l2`, `elasticnet` | `l1` for sparse features |
For best results, use `GridSearchCV` or `RandomizedSearchCV` to automatically find the optimal hyperparameters.
6. Top Logistic Regression Interview Questions & Answers
Below is a comprehensive list of interview questions related to Logistic Regression, along with detailed answers.
Q1: What is Logistic Regression?
Answer:
Logistic Regression is a supervised learning algorithm used for binary classification problems. Instead of predicting continuous values like Linear Regression, it predicts the probability of a sample belonging to a particular class using the sigmoid function.
Mathematically, the model is:
$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(w^{T}x + b)}}$$
where $P(y = 1 \mid x)$ is the probability that the output belongs to class 1.
Q2: Why can't we use Linear Regression for classification?
Answer:
Linear Regression provides continuous values, which are not suitable for classification. If we try to use it for classification:
- Unbounded Output: Linear Regression can output values beyond [0,1], making it unsuitable for probabilities.
- Poor Decision Boundaries: Linear Regression does not naturally map to distinct classes.
- Lack of Probability Interpretation: Logistic Regression outputs probabilities, making threshold-based classification more meaningful.
Q3: What is the Sigmoid function and why is it used in Logistic Regression?
Answer:
The Sigmoid function is defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where z is the linear combination of input features.
- Why is it used?
- It maps any real number to a value between 0 and 1, making it suitable for probability estimation.
- It introduces non-linearity, enabling classification.
- It is differentiable, allowing optimization via Gradient Descent.
Q4: What is the Decision Boundary in Logistic Regression?
Answer:
The decision boundary is the line (or surface in higher dimensions) that separates different classes.
For a binary classification problem, it is defined by the equation:
$$\sigma(w^{T}x + b) = 0.5, \quad \text{i.e. } w^{T}x + b = 0$$
- If the result is ≥ 0.5, classify as class 1.
- If the result is < 0.5, classify as class 0.
The decision boundary is linear unless feature transformations (like polynomial terms) are introduced.
Q5: What is the Loss Function used in Logistic Regression?
Answer:
Logistic Regression uses Log Loss (Cross-Entropy Loss):
$$J(w) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$
where:
- $\hat{y}^{(i)}$ is the predicted probability,
- $y^{(i)}$ is the actual label.
This function penalizes incorrect predictions and is convex, allowing optimization via Gradient Descent.
Q6: How are the parameters of Logistic Regression optimized?
Answer:
Logistic Regression parameters are optimized using Gradient Descent:
- Compute the gradient of the cost function with respect to each parameter.
- Update the parameters iteratively using:
$$w_j := w_j - \alpha \frac{\partial J}{\partial w_j}$$
where $\alpha$ is the learning rate.
- Repeat until convergence.
Other optimization methods include:
- Newton’s Method (Newton-Raphson)
- Stochastic Gradient Descent (SGD)
- Batch & Mini-Batch Gradient Descent
Q7: What is Regularization in Logistic Regression? Why is it needed?
Answer:
Regularization prevents overfitting by adding a penalty term to the loss function:
- L1 Regularization (Lasso): adds $\lambda \sum_j |w_j|$ to the loss.
  - Helps in feature selection (some coefficients shrink to zero).
- L2 Regularization (Ridge): adds $\lambda \sum_j w_j^2$ to the loss.
  - Helps in reducing large coefficients but does not eliminate features.
Q8: How do you evaluate a Logistic Regression model?
Answer:
Key evaluation metrics include:
- Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$
- Precision: $\frac{TP}{TP + FP}$
- Recall (Sensitivity): $\frac{TP}{TP + FN}$
- F1-Score: harmonic mean of Precision and Recall
- ROC-AUC: measures how well the model ranks positive samples above negative ones across thresholds.
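For reference, these metrics can all be computed with scikit-learn (the label and probability arrays below are placeholders):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                   # thresholded predictions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]   # predicted probabilities

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_prob))                 # uses probabilities, not labels
```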
Q9: How do you handle imbalanced datasets in Logistic Regression?
Answer:
For imbalanced data:
- Use class weights, e.g. `class_weight="balanced"` (see the sketch after this list)
- Oversampling (SMOTE) or Undersampling
- Threshold Adjustment based on ROC Curve.
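A minimal example of the class-weight option in scikit-learn (illustrative; SMOTE itself lives in the separate `imbalanced-learn` package):

```python
from sklearn.linear_model import LogisticRegression

# "balanced" reweights classes inversely to their frequency in y
clf = LogisticRegression(class_weight="balanced", max_iter=1000)

# An explicit mapping also works, e.g. penalize mistakes on class 1 more heavily
clf_manual = LogisticRegression(class_weight={0: 1, 1: 5}, max_iter=1000)
```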
Q10: What are some alternatives to Logistic Regression?
Answer:
- Naïve Bayes: Works well for text classification.
- Decision Trees: Handles nonlinear data.
- Support Vector Machines (SVMs): Works well with high-dimensional data.
- Neural Networks: Useful for complex feature interactions.
Q11: When should you NOT use Logistic Regression?
Answer:
- When the relationship is highly non-linear.
- When features are heavily correlated (Multicollinearity).
- When there are many categorical features with high cardinality.
- When dealing with imbalanced data without proper handling.
Q12: What is Multinomial Logistic Regression?
Answer:
Multinomial Logistic Regression is used for multi-class classification. Instead of using a single Sigmoid function, it applies the Softmax function:
$$P(y = k \mid x) = \frac{e^{w_k^{T} x}}{\sum_{j=1}^{K} e^{w_j^{T} x}}$$
where $K$ is the number of classes.
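A tiny NumPy illustration of the Softmax (the class scores here are arbitrary):

```python
import numpy as np

def softmax(scores):
    scores = scores - np.max(scores)   # shift for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # three class probabilities summing to 1
```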
Q13: Can Logistic Regression be used for Time Series Data?
Answer:
No, Logistic Regression assumes independent observations, whereas time series data exhibits dependencies over time. Instead, use Recurrent Neural Networks (RNNs) or Hidden Markov Models (HMMs).
- Author: Entropyobserver
- URL: https://tangly1024.com/article/1c7d698f-3512-80f8-8fa0-c949bde042fc
- Copyright: Unless otherwise stated, all articles in this blog are released under the BY-NC-SA agreement. Please indicate the source when reposting!