Logistic Regression
Mar 1, 2020
1. Introduction

Logistic Regression is a widely used supervised learning algorithm for binary classification problems. It is commonly applied in scenarios such as credit scoring, medical diagnosis, and advertisement click-through rate prediction. Despite having "regression" in its name, Logistic Regression is fundamentally a classification algorithm.
Logistic Regression is popular in industry because it is simple, easy to parallelize, and highly interpretable. At its core, Logistic Regression assumes the labels follow a particular distribution (the Bernoulli distribution for binary classification) and then uses Maximum Likelihood Estimation to fit the parameters.
Maximum Likelihood Estimation (MLE) is a statistical method that finds the parameter values that make the observed data most likely. Imagine you're a detective investigating whether a die is "loaded" (biased). You roll it 10 times and get the following results:
6, 2, 3, 6, 4, 6, 6, 5, 6, 1
You notice that 6 appears 5 times out of 10, which seems suspicious.
MLE asks: what value of the probability of rolling a 6, $p$, best explains this data? If the die were fair, each face would have a probability of $1/6$. But if it's biased, we need to estimate $p$ from the data.
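To make this concrete, here is the calculation for the die example. Treating each roll as independent and counting only "6 vs. not 6," the likelihood of observing 5 sixes in 10 rolls is:
$$L(p) = p^{5}(1-p)^{5}, \qquad \log L(p) = 5\log p + 5\log(1-p)$$
Setting the derivative to zero gives
$$\frac{5}{p} - \frac{5}{1-p} = 0 \quad\Longrightarrow\quad \hat{p} = \frac{5}{10} = 0.5,$$
which is far above the fair-die value of $1/6$, so the data strongly suggest the die is loaded.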
In logistic regression, we assume that the class label $y \in \{0, 1\}$ follows a Bernoulli distribution:
$$P(y = 1 \mid x) = \hat{y} = \sigma(w^\top x + b), \qquad P(y = 0 \mid x) = 1 - \hat{y}$$
where:
  • $x$ is the feature vector,
  • $w$ and $b$ are the model parameters,
  • $\sigma(\cdot)$ is the sigmoid function introduced below.
MLE helps find the best parameters $w$ that maximize the likelihood of the observed labels. Taking the logarithm gives the log-likelihood; maximizing it is equivalent to minimizing the negative log-likelihood, which is done using optimization techniques like gradient descent.
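Written out for a dataset of $m$ independent samples, the likelihood and the resulting objective are:
$$L(w) = \prod_{i=1}^{m} \hat{y}_i^{\,y_i}\,(1 - \hat{y}_i)^{1 - y_i}$$
$$-\log L(w) = -\sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
Dividing by $m$ yields exactly the cross-entropy loss derived in Section 2.4.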

2. Mathematical Principles

2.1 Basic Concept of Logistic Regression

Logistic Regression is a supervised learning algorithm used for binary classification. The key steps include:
  1. Computing a weighted sum of input features using linear regression:
     $$z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$$
     where:
       • $x_1, \dots, x_n$ are input features,
       • $w_1, \dots, w_n$ are the corresponding weights,
       • $b$ is the bias term.
  2. Applying the Sigmoid function to map the sum into the (0,1) range as a probability:
     $$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$$
  3. Setting a threshold (e.g., 0.5) for classification:
       • If $\hat{y} \ge 0.5$, predict class 1.
       • If $\hat{y} < 0.5$, predict class 0.
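A minimal sketch of these three steps in NumPy (the weights and inputs here are made-up illustrative values, not fitted parameters):

```python
import numpy as np

def sigmoid(z):
    """Map any real number into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, b, threshold=0.5):
    """Steps 1-3: linear combination, sigmoid, then thresholding."""
    z = X @ w + b              # step 1: weighted sum of features
    y_prob = sigmoid(z)        # step 2: probability of class 1
    return (y_prob >= threshold).astype(int), y_prob  # step 3

# Illustrative values only (assumed, not from the article)
X = np.array([[3.0, 2.0], [0.0, 0.0]])
w = np.array([0.5, 0.5])
b = 0.0
labels, probs = predict(X, w, b)
print(labels, probs)  # [1 0] [0.924 0.5]
```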

2.2 Sigmoid (S-Shaped) Function

The key to Logistic Regression is the Sigmoid (or Logistic) function, expressed as:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where $z = w^\top x + b$ represents the linear combination of input features.
The Sigmoid function outputs values in the (0,1) range, representing the probability of class 1:
  • Monotonically increasing: larger input values yield outputs closer to 1, while smaller input values yield outputs closer to 0.
  • S-shaped curve: The function is symmetric and smooth, ensuring a gradual transition between 0 and 1.
  • Probability interpretation: Outputs lie in the (0,1) range, making them suitable for probability representation.

Derivative of the Sigmoid Function

$$\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$$
This derivative is useful for gradient-based optimization.
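For completeness, the identity follows directly from the quotient rule:
$$\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\left(1 - \sigma(z)\right)$$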

2.3 Decision Boundary

The decision boundary in logistic regression is linear. Given the model:
$$\hat{y} = \sigma(w^\top x + b)$$
At $\hat{y} = 0.5$ (equivalently, $z = 0$), we obtain:
$$w^\top x + b = 0$$
This represents a straight-line equation in a two-dimensional space:
$$w_1 x_1 + w_2 x_2 + b = 0$$
For higher-dimensional datasets, the decision boundary is a hyperplane that separates data into two classes.
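As an illustration with assumed values (not from the original text), take $w_1 = 1$, $w_2 = 2$, $b = -4$:
$$x_1 + 2x_2 - 4 = 0 \quad\Longrightarrow\quad x_2 = \frac{4 - x_1}{2}$$
Points on one side of this line are predicted as class 1, points on the other side as class 0.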

2.4 Loss Function

Logistic regression uses the logarithmic loss function (Log Loss), also known as cross-entropy loss:
$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
where:
  • $m$ is the number of training samples,
  • $y_i$ is the actual label (0 or 1) of sample $i$,
  • $\hat{y}_i$ is the predicted probability for sample $i$.
Understanding Cross-Entropy Loss
  • If $y = 1$, the loss function reduces to $-\log \hat{y}$:
    • If $\hat{y} \to 1$, the loss is close to 0 (correct prediction).
    • If $\hat{y} \to 0$, the loss becomes very large (incorrect prediction).
  • If $y = 0$, the loss function reduces to $-\log(1 - \hat{y})$:
    • If $\hat{y} \to 0$, the loss is close to 0 (correct prediction).
    • If $\hat{y} \to 1$, the loss becomes very large (incorrect prediction).
Why Use Cross-Entropy Instead of MSE?
  • Cross-entropy loss is better suited for classification, as it directly optimizes probability estimates and yields a convex objective for logistic regression.
  • Mean Squared Error (MSE) makes the objective non-convex when combined with the sigmoid and can cause vanishing gradients, slowing down training (see the gradient comparison below).
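The vanishing-gradient point can be seen directly from the per-sample gradients with respect to $z$ (using the $\frac{1}{2}(\hat{y} - y)^2$ convention for MSE):
$$\frac{\partial}{\partial z}\,\text{CE} = \hat{y} - y, \qquad \frac{\partial}{\partial z}\,\text{MSE} = (\hat{y} - y)\,\sigma'(z) = (\hat{y} - y)\,\hat{y}\,(1 - \hat{y})$$
When the model is confidently wrong ($\hat{y}$ near 0 or 1), the extra $\hat{y}(1 - \hat{y})$ factor in the MSE gradient is nearly zero, so learning stalls; the cross-entropy gradient stays proportional to the error.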

2.5 Parameter Optimization: Gradient Descent

The goal is to find the optimal parameters w and b that minimize the loss function. The most common optimization method is gradient descent.

Computing the Gradients

The gradients of the loss function with respect to the parameters are:
$$\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)\, x_{i,j}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)$$

Gradient Descent Update Rule

The parameters are updated using the gradient descent rule:
$$w_j \leftarrow w_j - \alpha \frac{\partial J}{\partial w_j}, \qquad b \leftarrow b - \alpha \frac{\partial J}{\partial b}$$
where:
  • $\alpha$ is the learning rate, controlling the step size of updates.
  • The process repeats iteratively until convergence.
Types of Gradient Descent (a runnable sketch of the batch variant follows this list):
  1. Batch Gradient Descent (BGD):
      • Uses all training data in each update.
      • More stable but computationally expensive for large datasets.
  2. Stochastic Gradient Descent (SGD):
      • Updates parameters using a single random sample per iteration.
      • Faster but introduces more noise in updates.
  3. Mini-Batch Gradient Descent (MBGD):
      • Uses a small batch of samples per update.
      • Balances speed and stability.
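A minimal NumPy sketch of batch gradient descent for logistic regression, implementing the gradients and update rule above (the learning rate and iteration count are assumed illustrative values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_bgd(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent: every update uses the full training set."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_iters):
        y_hat = sigmoid(X @ w + b)   # forward pass
        error = y_hat - y            # (y_hat - y), shape (m,)
        grad_w = X.T @ error / m     # dJ/dw
        grad_b = error.mean()        # dJ/db
        w -= lr * grad_w             # update rule
        b -= lr * grad_b
    return w, b

# Tiny illustrative dataset (assumed values)
X = np.array([[3., 2.], [0., 0.], [2., 1.], [1., 0.]])
y = np.array([1., 0., 1., 0.])
w, b = fit_logistic_bgd(X, y)
print(w, b)
```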

3. Example: Email Spam Classification

We want to classify emails as spam (1) or not spam (0) based on two features:
  • Number of times "Free" appears (feature $X_1$)
  • Number of times "Win" appears (feature $X_2$)
Training Data

Email | "Free" Count ($X_1$) | "Win" Count ($X_2$) | Spam? ($Y$)
1 | 3 | 2 | 1 (Spam)
2 | 0 | 0 | 0 (Not Spam)
3 | 2 | 1 | 1 (Spam)
4 | 1 | 0 | 0 (Not Spam)
We initialize the model parameters:
  • Weights: $w_1 = 0.5$, $w_2 = 0.5$
  • Bias: $b = 0$
  • Learning rate: $\alpha = 0.1$

Step 1: Compute Linear Combination z

Using $z = w_1 X_1 + w_2 X_2 + b$:

Email | Computation | z
1 | $0.5(3) + 0.5(2) + 0$ | 2.5
2 | $0.5(0) + 0.5(0) + 0$ | 0.0
3 | $0.5(2) + 0.5(1) + 0$ | 1.5
4 | $0.5(1) + 0.5(0) + 0$ | 0.5

Step 2: Apply the Sigmoid Function

Email | z | $\hat{y} = \sigma(z)$
1 | 2.5 | 0.924
2 | 0.0 | 0.500
3 | 1.5 | 0.818
4 | 0.5 | 0.622
These values represent the probability of an email being spam.

Step 3: Compute Loss (Cross-Entropy)

For each email, $L_i = -\left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$:

Email | $y$ | $\hat{y}$ | Loss
1 | 1 | 0.924 | $-\log(0.924) \approx 0.079$
2 | 0 | 0.500 | $-\log(0.500) \approx 0.693$
3 | 1 | 0.818 | $-\log(0.818) \approx 0.201$
4 | 0 | 0.622 | $-\log(0.378) \approx 0.974$

Average loss: $J \approx (0.079 + 0.693 + 0.201 + 0.974)/4 \approx 0.49$.

Step 4: Gradient Descent (Parameter Update)

Compute gradients using $\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_i (\hat{y}_i - y_i) X_{i,j}$ and $\frac{\partial J}{\partial b} = \frac{1}{m}\sum_i (\hat{y}_i - y_i)$. The prediction errors $(\hat{y}_i - y_i)$ are $-0.076$, $0.500$, $-0.182$, $0.622$, so:
$$\frac{\partial J}{\partial w_1} = \frac{(-0.076)(3) + (0.5)(0) + (-0.182)(2) + (0.622)(1)}{4} \approx 0.008$$
$$\frac{\partial J}{\partial w_2} = \frac{(-0.076)(2) + (0.5)(0) + (-0.182)(1) + (0.622)(0)}{4} \approx -0.084$$
$$\frac{\partial J}{\partial b} = \frac{-0.076 + 0.500 - 0.182 + 0.622}{4} = 0.216$$

Update parameters with $\alpha = 0.1$:
$$w_1 \leftarrow 0.5 - 0.1(0.008) \approx 0.499, \qquad w_2 \leftarrow 0.5 - 0.1(-0.084) \approx 0.508, \qquad b \leftarrow 0 - 0.1(0.216) \approx -0.022$$

Step 5: Prediction on a New Email

New email:
  • "Free" appears 2 times
  • "Win" appears 1 time
Compute:
$$z = w_1(2) + w_2(1) + b \approx 0.499(2) + 0.508(1) - 0.022 \approx 1.48, \qquad \hat{y} = \sigma(1.48) \approx 0.81$$
Since $\hat{y} \approx 0.81 \ge 0.5$, we classify it as spam.
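A short NumPy sketch that reproduces this walkthrough end to end (the initial weights follow the values above; the 0.1 learning rate is an assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Training data: ("Free" count, "Win" count) -> spam label
X = np.array([[3., 2.], [0., 0.], [2., 1.], [1., 0.]])
y = np.array([1., 0., 1., 0.])

w, b, lr = np.array([0.5, 0.5]), 0.0, 0.1   # lr is an assumed value

# Steps 1-2: linear combination and sigmoid
y_hat = sigmoid(X @ w + b)                   # [0.924, 0.5, 0.818, 0.622]

# Step 3: average cross-entropy loss
loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Step 4: one batch gradient descent update
error = y_hat - y
w -= lr * (X.T @ error) / len(y)
b -= lr * error.mean()

# Step 5: predict a new email with "Free" x2 and "Win" x1
p_spam = sigmoid(np.array([2., 1.]) @ w + b)
print(round(loss, 3), np.round(w, 3), round(b, 3), round(p_spam, 3))
```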

4. Logistic Regression Coding

Logistic Regression Hyperparameter Tuning Summary

1. Key Hyperparameters to Tune

(A combined scikit-learn usage sketch follows this list.)
  • C (Regularization Strength)
    • Controls the inverse of the regularization strength (L2 regularization by default).
    • Lower values (C < 1) increase regularization (reduce overfitting), while higher values (C > 1) decrease it.
    • Typical range: [0.01, 0.1, 1, 10]
  • solver (Optimization Algorithm)
    • liblinear: Best for small datasets, supports L1 and L2.
    • lbfgs: Default, good for medium-sized datasets, supports L2.
    • saga: Efficient for large, sparse datasets, supports L1, L2, and elasticnet.
  • max_iter (Maximum Iterations)
    • Increase iterations to ensure convergence on large datasets.
    • Typical range: [500, 1000, 2000]
  • penalty (Regularization Type)
    • l2: Default, works for most cases.
    • l1: Used for feature selection (only works with liblinear and saga).
    • elasticnet: Combines l1 and l2, supported only with saga.
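A hedged sketch of how these four hyperparameters are set in scikit-learn (the specific values are illustrative, not recommendations):

```python
from sklearn.linear_model import LogisticRegression

# Illustrative settings: stronger regularization, saga solver,
# elastic-net penalty, and a higher iteration cap for convergence.
model = LogisticRegression(
    C=0.1,                 # lower C -> stronger regularization
    solver="saga",         # supports l1, l2, and elasticnet
    penalty="elasticnet",  # mix of l1 and l2
    l1_ratio=0.5,          # required when penalty="elasticnet"
    max_iter=1000,         # raise if the solver fails to converge
)
```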

2. How to Optimize These Hyperparameters?

(1) Grid Search (GridSearchCV) - exhaustive search for the best combination:
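A minimal sketch with an assumed parameter grid (the synthetic dataset is only there to make the snippet self-contained):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Exhaustively try every combination in the grid with 5-fold CV
param_grid = {"C": [0.01, 0.1, 1, 10], "solver": ["liblinear", "lbfgs"]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```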

(2) Random Search (RandomizedSearchCV) - efficient for large parameter spaces:
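A comparable sketch that samples a fixed number of random settings instead of trying them all (distribution choices are assumptions):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Sample 10 random settings; loguniform suits scale parameters like C
param_dist = {"C": loguniform(0.01, 10), "solver": ["liblinear", "lbfgs", "saga"]}
search = RandomizedSearchCV(
    LogisticRegression(max_iter=2000), param_dist, n_iter=10, cv=5, random_state=0
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```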


3. Summary Table

Parameter | Effect | Typical Range | Best Use Case
C | Regularization strength (inverse) | 0.01 ~ 10 | Lower for preventing overfitting, higher for a closer fit
solver | Optimization algorithm | lbfgs, liblinear, saga | lbfgs for default, saga for large datasets
max_iter | Number of iterations | 500 ~ 2000 | Increase if convergence issues arise
penalty | Regularization type | l1, l2, elasticnet | l1 for sparse features
For best results, use GridSearchCV or RandomizedSearchCV to automatically find the optimal hyperparameter combination.

5. Top Logistic Regression Interview Questions & Answers

Below is a comprehensive list of interview questions related to Logistic Regression, along with detailed answers.

Q1: What is Logistic Regression?

Answer:
Logistic Regression is a supervised learning algorithm used for binary classification problems. Instead of predicting continuous values like Linear Regression, it predicts the probability of a sample belonging to a particular class using the sigmoid function.
Mathematically, the model is:
$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(w^\top x + b)}}$$
where $P(y = 1 \mid x)$ is the probability that the output belongs to class 1.

Q2: Why can't we use Linear Regression for classification?

Answer:
Linear Regression provides continuous values, which are not suitable for classification. If we try to use it for classification:
1. Unbounded Output: Linear Regression can output values beyond [0,1], making it unsuitable for probabilities.
2. Poor Decision Boundaries: Linear Regression does not naturally map to distinct classes.
3. Lack of Probability Interpretation: Logistic Regression outputs probabilities, making threshold-based classification more meaningful.

Q3: What is the Sigmoid function and why is it used in Logistic Regression?

Answer:
The Sigmoid function is defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where $z$ is the linear combination of input features.
Why is it used?
  • It maps any real number to a value between 0 and 1, making it suitable for probability estimation.
  • It introduces non-linearity, enabling classification.
  • It is differentiable, allowing optimization via Gradient Descent.

Q4: What is the Decision Boundary in Logistic Regression?

Answer:
The decision boundary is the line (or surface in higher dimensions) that separates different classes.
For a binary classification problem, it is defined by the equation:
$$w^\top x + b = 0, \quad \text{equivalently } \sigma(w^\top x + b) = 0.5$$
  • If $\sigma(w^\top x + b) \ge 0.5$, classify as class 1.
  • If $\sigma(w^\top x + b) < 0.5$, classify as class 0.
The decision boundary is linear unless feature transformations (like polynomial terms) are introduced.

Q5: What is the Loss Function used in Logistic Regression?

Answer:
Logistic Regression uses Log Loss (Cross-Entropy Loss):
$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
where:
  • $\hat{y}_i$ is the predicted probability,
  • $y_i$ is the actual label.
This function penalizes incorrect predictions and is convex, allowing optimization via Gradient Descent.

Q6: How are the parameters of Logistic Regression optimized?

Answer:
Logistic Regression parameters are optimized using Gradient Descent:
1. Compute the gradient of the cost function with respect to each parameter.
2. Update the parameters iteratively using:
   $$\theta \leftarrow \theta - \alpha \frac{\partial J}{\partial \theta}$$
   where $\alpha$ is the learning rate.
3. Repeat until convergence.
Other optimization methods include:
  • Newton's Method (Newton-Raphson)
  • Stochastic Gradient Descent (SGD)
  • Batch & Mini-Batch Gradient Descent

Q7: What is Regularization in Logistic Regression? Why is it needed?

Answer:
Regularization prevents overfitting by adding a penalty term to the loss function:
1. L1 Regularization (Lasso): $J(w) + \lambda \sum_j |w_j|$
    • Helps in feature selection (some coefficients shrink to exactly zero).
2. L2 Regularization (Ridge): $J(w) + \lambda \sum_j w_j^2$
    • Helps in reducing large coefficients but does not eliminate features.

Q8: How do you evaluate a Logistic Regression model?

Answer:
Key evaluation metrics include:
  • Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$
  • Precision: $\frac{TP}{TP + FP}$
  • Recall (Sensitivity): $\frac{TP}{TP + FN}$
  • F1-Score: harmonic mean of Precision & Recall, $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
  • ROC-AUC: measures how well the model ranks positive samples above negative ones.

Q9: How do you handle imbalanced datasets in Logistic Regression?

Answer:
For imbalanced data (a class-weight sketch follows below):
1. Use class weights so that errors on the minority class are penalized more heavily.
2. Oversample the minority class (e.g., SMOTE) or undersample the majority class.
3. Adjust the classification threshold based on the ROC curve.
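A minimal sketch of the class-weight approach in scikit-learn:

```python
from sklearn.linear_model import LogisticRegression

# 'balanced' reweights classes inversely to their frequency,
# so errors on the rare class cost more during training.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
```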

Q10: What are some alternatives to Logistic Regression?

Answer:
  • Naïve Bayes: Works well for text classification.
  • Decision Trees: Handle nonlinear data.
  • Support Vector Machines (SVMs): Work well with high-dimensional data.
  • Neural Networks: Useful for complex feature interactions.

Q11: When should you NOT use Logistic Regression?

Answer:
  • When the relationship between the features and the label is highly non-linear.
  • When features are heavily correlated (multicollinearity).
  • When there are many categorical features with high cardinality.
  • When dealing with imbalanced data without proper handling.

Q12: What is Multinomial Logistic Regression?

Answer:
Multinomial Logistic Regression is used for multi-class classification. Instead of using a single Sigmoid function, it applies the Softmax function:
$$P(y = k \mid x) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
where $K$ is the number of classes and $z_k = w_k^\top x + b_k$.

Q13: Can Logistic Regression be used for Time Series Data?

Answer:
No, Logistic Regression assumes independent observations, whereas time series data exhibits dependencies over time. Instead, use Recurrent Neural Networks (RNNs) or Hidden Markov Models (HMMs).
               