Matrix Calculus for Machine Learning: From Gradients to Jacobians
Matrix calculus is the backbone of many machine learning algorithms, from gradient descent optimization to backpropagation in neural networks. This tutorial introduces scalar, vector, and matrix derivatives in a structured, intuitive way with practical examples and formulas.

1. Scalar Derivatives

Scalar derivatives refer to the classical derivatives of scalar functions. For example, if $f(x) = x^n$, the derivative is:
$$f'(x) = n x^{n-1}$$
This is the foundation of all further extensions into vector and matrix calculus.
Key Properties of Scalar Derivatives
  • Linearity: $\frac{d}{dx}\left[af(x) + bg(x)\right] = af'(x) + bg'(x)$
  • Product Rule: $\frac{d}{dx}\left[f(x)g(x)\right] = f'(x)g(x) + f(x)g'(x)$
  • Chain Rule: $\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$
Common Scalar Derivatives

| Function | Derivative |
| --- | --- |
| $x^n$ | $nx^{n-1}$ |
| $e^x$ | $e^x$ |
| $\ln x$ | $\frac{1}{x}$ |
| $\sin x$ | $\cos x$ |
| $\cos x$ | $-\sin x$ |

Example: Compute the derivative of
$$f(x) = 3x^4 + 2x e^x + \sin(x^2)$$

Step 1: Break into parts
Let's label each term for clarity:
$$f_1(x) = 3x^4, \quad f_2(x) = 2x e^x, \quad f_3(x) = \sin(x^2)$$
Then,
$$f(x) = f_1(x) + f_2(x) + f_3(x)$$

🧠 Step 2: Differentiate each part

🔹 $f_1(x) = 3x^4$

Using the power rule:
$$f_1'(x) = 3 \cdot 4x^{3} = 12x^3$$

🔹 $f_2(x) = 2x \cdot e^x$

Using the product rule:
$$f_2'(x) = 2 \cdot e^x + 2x \cdot e^x = 2e^x + 2x e^x$$

🔹 $f_3(x) = \sin(x^2)$

Using the chain rule:
  • Outer function: $\sin(u)$, inner function: $u = x^2$
$$f_3'(x) = \cos(x^2) \cdot 2x = 2x \cos(x^2)$$

✅ Final Answer:

$$f'(x) = 12x^3 + 2e^x + 2x e^x + 2x \cos(x^2)$$

📌 Summary of Concepts Used:

  • Power Rule for $x^4$
  • Product Rule for $x \cdot e^x$
  • Chain Rule for $\sin(x^2)$
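As a quick sanity check, the analytic result can be compared against a central finite difference. The sketch below is a minimal NumPy illustration (the evaluation point $x = 0.7$ is an arbitrary choice), not part of the original derivation.

```python
import numpy as np

# Minimal sketch: compare the analytic derivative
# f'(x) = 12x^3 + 2e^x + 2x e^x + 2x cos(x^2) with a central finite difference.
f = lambda x: 3 * x**4 + 2 * x * np.exp(x) + np.sin(x**2)
df = lambda x: 12 * x**3 + 2 * np.exp(x) + 2 * x * np.exp(x) + 2 * x * np.cos(x**2)

x, h = 0.7, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)   # central-difference approximation
print(numeric, df(x))                       # the two values should agree closely
```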
 

2. Subderivatives and Non-differentiable Functions

In machine learning, especially in deep learning, we often use functions like ReLU that are not differentiable at certain points. In such cases, we use subderivatives.

Definition of Subderivative

For a convex function $f: \mathbb{R}^n \to \mathbb{R}$, a vector $g \in \mathbb{R}^n$ is a subgradient of $f$ at point $x$ if:
$$f(y) \geq f(x) + g^T(y - x) \quad \text{for all } y \in \mathbb{R}^n$$
The set of all subgradients at $x$ is called the subdifferential, denoted $\partial f(x)$.

Example: ReLU Function

The ReLU function $f(x) = \max(0, x)$ has the following subderivative:
$$\partial f(x) = \begin{cases} 1 & \text{if } x > 0 \\ [0,1] & \text{if } x = 0 \\ 0 & \text{if } x < 0 \end{cases}$$

Example: Absolute Value Function

For $f(x) = |x|$:
$$\partial f(x) = \begin{cases} 1 & \text{if } x > 0 \\ [-1,1] & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases}$$
Subgradients allow optimization even when the function has sharp corners (non-smooth points).
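As an illustration, here is a minimal sketch of subgradient descent on $f(x) = |x|$; the starting point, the step size, and the choice of the subgradient $0$ at $x = 0$ are all assumptions made for the example.

```python
# Minimal sketch: subgradient descent on f(x) = |x|.
def subgrad_abs(x):
    """Return one valid subgradient of |x| at x."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0          # any value in [-1, 1] is a valid subgradient at x = 0

x, step = 3.0, 0.1      # arbitrary starting point and step size
for _ in range(100):
    x -= step * subgrad_abs(x)
print(x)                # ends near the minimizer x = 0 (within one step size)
```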

3. Multivariate Derivatives

This section explores derivatives when the inputs or outputs are vectors or matrices. The structure of the derivative depends on the nature of the function.

Derivative Classification by Input/Output Types

| Input Type | Output Type | Result Type | Name |
| --- | --- | --- | --- |
| Scalar | Scalar | Scalar | Ordinary derivative |
| Vector | Scalar | Vector | Gradient |
| Scalar | Vector | Vector | Vector derivative |
| Vector | Vector | Matrix | Jacobian |
| Matrix | Matrix | Tensor | Matrix derivative |

3.1 Gradient: Scalar w.r.t. Vector

Let $f(\mathbf{x})$ be a scalar-valued function with a vector input $\mathbf{x} \in \mathbb{R}^n$. The gradient is:
$$\nabla f(\mathbf{x}) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right]^T$$
Example: For $f(x_1, x_2) = x_1^2 + 2x_2^2$:
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 2x_1 \\ 4x_2 \end{bmatrix}$$
Geometric Interpretation: The gradient points in the direction of the steepest increase of the function. For the elliptical function $f(x_1, x_2) = x_1^2 + 2x_2^2$, the gradient vectors point outward from the origin, perpendicular to the elliptical level curves.
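To make the example concrete, the sketch below (a minimal NumPy illustration, with an arbitrary evaluation point) compares the analytic gradient of $f(x_1, x_2) = x_1^2 + 2x_2^2$ with a finite-difference estimate.

```python
import numpy as np

# Minimal sketch: analytic gradient of f(x1, x2) = x1^2 + 2*x2^2 vs. finite differences.
f = lambda x: x[0]**2 + 2 * x[1]**2
grad = lambda x: np.array([2 * x[0], 4 * x[1]])

x, h = np.array([1.0, -2.0]), 1e-6
numeric = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(2)])
print(grad(x), numeric)   # both should be close to [2, -8]
```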

Important Gradient Properties

  • Direction: Points toward steepest increase
  • Magnitude: Rate of change in that direction
  • Orthogonality: Perpendicular to level curves/surfaces

3.2 Scalar w.r.t. Vector: Linear Functions

For linear functions, the gradient has a simple form. If:
$$f(\mathbf{x}) = \mathbf{a}^T\mathbf{x}$$
Then the gradient is:
$$\frac{df}{d\mathbf{x}} = \mathbf{a}$$
This gives a column vector that represents the constant rate of change in each direction.

Quadratic Forms

For quadratic functions:
$$f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$$
The gradient is:
$$\nabla f(\mathbf{x}) = (A + A^T)\mathbf{x}$$
If $A$ is symmetric, this simplifies to:
$$\nabla f(\mathbf{x}) = 2A\mathbf{x}$$
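Both identities are easy to verify numerically. The sketch below is a minimal NumPy check with randomly chosen $\mathbf{a}$, $A$, and $\mathbf{x}$ (the dimensions are arbitrary).

```python
import numpy as np

# Minimal sketch: check grad(a^T x) = a and grad(x^T A x) = (A + A^T) x numerically.
rng = np.random.default_rng(0)
n = 4
a = rng.standard_normal(n)
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

def num_grad(f, x, h=1e-6):
    # central-difference estimate of the gradient of a scalar function f at x
    return np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(len(x))])

print(np.allclose(num_grad(lambda v: a @ v, x), a, atol=1e-4))                  # True
print(np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x, atol=1e-4))  # True
```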

3.3 Vector w.r.t. Scalar

If $\mathbf{y}(x) \in \mathbb{R}^n$ and $x \in \mathbb{R}$, then:
$$\frac{d\mathbf{y}}{dx} = \begin{bmatrix} \frac{dy_1}{dx} \\ \frac{dy_2}{dx} \\ \vdots \\ \frac{dy_n}{dx} \end{bmatrix}$$
Example: For $\mathbf{y}(x) = \begin{bmatrix} x^2 \\ e^x \\ \sin(x) \end{bmatrix}$:
$$\frac{d\mathbf{y}}{dx} = \begin{bmatrix} 2x \\ e^x \\ \cos(x) \end{bmatrix}$$
This type of derivative is commonly used in:
  • Dynamic systems modeling
  • Time-dependent neural networks
  • Ordinary differential equations
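For the example above, the component-wise derivative can be checked directly; the sketch below is a minimal NumPy illustration at an arbitrary point.

```python
import numpy as np

# Minimal sketch: derivative of the vector-valued y(x) = [x^2, e^x, sin x] w.r.t. scalar x.
y = lambda x: np.array([x**2, np.exp(x), np.sin(x)])
dy = lambda x: np.array([2 * x, np.exp(x), np.cos(x)])   # analytic component-wise derivative

x, h = 1.3, 1e-6
numeric = (y(x + h) - y(x - h)) / (2 * h)                # central difference, component-wise
print(np.allclose(numeric, dy(x), atol=1e-5))            # True
```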

3.4 Vector w.r.t. Vector: Jacobian Matrix

Let $\mathbf{y}(\mathbf{x}) \in \mathbb{R}^m$ and $\mathbf{x} \in \mathbb{R}^n$. The Jacobian is the matrix of all partial derivatives:
$$J_{ij} = \frac{\partial y_i}{\partial x_j}$$
The complete Jacobian matrix is:
$$J = \frac{d\mathbf{y}}{d\mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}$$
Example: For $\mathbf{y} = \begin{bmatrix} x_1^2 + x_2 \\ x_1x_2 \\ x_1 + 2x_2 \end{bmatrix}$:
$$J = \begin{bmatrix} 2x_1 & 1 \\ x_2 & x_1 \\ 1 & 2 \end{bmatrix}$$
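The Jacobian above can be verified column by column with finite differences; the following minimal NumPy sketch uses an arbitrary evaluation point.

```python
import numpy as np

# Minimal sketch: Jacobian of y(x) = [x1^2 + x2, x1*x2, x1 + 2*x2], checked numerically.
def y(x):
    return np.array([x[0]**2 + x[1], x[0] * x[1], x[0] + 2 * x[1]])

def jacobian(x):
    return np.array([[2 * x[0], 1.0],
                     [x[1],     x[0]],
                     [1.0,      2.0]])

x, h = np.array([1.5, -0.5]), 1e-6
num_J = np.column_stack([(y(x + h * e) - y(x - h * e)) / (2 * h) for e in np.eye(2)])
print(np.allclose(num_J, jacobian(x), atol=1e-5))   # True
```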

Applications of Jacobian Matrices

  • Neural Networks: Forward and backward propagation
  • Optimization: Newton's method and quasi-Newton methods
  • Numerical Analysis: Solving systems of nonlinear equations
  • Control Theory: Linearization of nonlinear systems

3.5 Matrix w.r.t. Matrix

Matrix derivatives involve functions where both inputs and outputs are matrices. These are essential for advanced machine learning techniques.

Common Matrix Derivative Formulas

| Function | Derivative |
| --- | --- |
| $\text{tr}(AX)$ | $A^T$ |
| $\text{tr}(X^TA)$ | $A$ |
| $\text{tr}(AXB)$ | $A^TB^T$ |
| $\text{tr}(X^TAX)$ | $AX + A^TX$ |
| $\log \det(X)$ | $(X^{-1})^T$ |
| $\det(X)$ | $\det(X)(X^{-1})^T$ |

Detailed Examples

Example 1: Trace of Linear Function. For $f(X) = \text{tr}(AX)$:
$$\nabla_X f = A^T$$
Example 2: Quadratic Form. For $f(X) = \text{tr}(X^TAX)$:
$$\nabla_X f = AX + A^TX = (A + A^T)X$$
If $A$ is symmetric:
$$\nabla_X f = 2AX$$
Example 3: Log-Determinant. For $f(X) = \log \det(X)$:
$$\nabla_X f = (X^{-1})^T$$
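These trace identities can also be checked element by element; the sketch below is a minimal NumPy check of $\nabla_X \text{tr}(AX) = A^T$ with random matrices.

```python
import numpy as np

# Minimal sketch: check d tr(AX)/dX = A^T element-wise with finite differences.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
X = rng.standard_normal((3, 3))

f = lambda M: np.trace(A @ M)
h = 1e-6
num_grad = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(X)
        E[i, j] = h
        num_grad[i, j] = (f(X + E) - f(X - E)) / (2 * h)

print(np.allclose(num_grad, A.T, atol=1e-5))   # True
```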

4. Chain Rule in Matrix Calculus

The chain rule generalizes to matrix calculus and is fundamental to backpropagation in neural networks.

Scalar Chain Rule

For scalar functions, if $z = f(y)$ and $y = g(x)$:
$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$

Vector Chain Rule

For vector functions:
$$\frac{d\mathbf{z}}{d\mathbf{x}} = \frac{d\mathbf{z}}{d\mathbf{y}} \cdot \frac{d\mathbf{y}}{d\mathbf{x}}$$
Where the multiplication is matrix multiplication of Jacobians.

Multi-layer Chain Rule

For a composition of functions $\mathbf{z} = f_3(f_2(f_1(\mathbf{x})))$:
$$\frac{d\mathbf{z}}{d\mathbf{x}} = \frac{d\mathbf{z}}{d\mathbf{y_2}} \cdot \frac{d\mathbf{y_2}}{d\mathbf{y_1}} \cdot \frac{d\mathbf{y_1}}{d\mathbf{x}}$$
This is the mathematical foundation of backpropagation in deep neural networks.
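As a concrete illustration (the maps $f_1(\mathbf{x}) = A\mathbf{x}$ and $f_2(\mathbf{y}) = \sin(\mathbf{y})$ are chosen for the example, not taken from the text), the sketch below multiplies the two Jacobians and compares the product with a numerical Jacobian of the composition.

```python
import numpy as np

# Minimal sketch: chain rule for z = f2(f1(x)) with f1(x) = A x and f2(y) = sin(y) element-wise.
# The chain rule gives dz/dx = diag(cos(A x)) @ A.
rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

f = lambda v: np.sin(A @ v)                 # the composed map
J_chain = np.diag(np.cos(A @ x)) @ A        # product of the two Jacobians

h = 1e-6
J_numeric = np.column_stack([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(3)])
print(np.allclose(J_chain, J_numeric, atol=1e-5))   # True
```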

5. Common Derivative Formulas (Cheat Sheet)

Vector Functions

| Function | Derivative |
| --- | --- |
| $f(\mathbf{x}) = \mathbf{a}^T \mathbf{x}$ | $\nabla f = \mathbf{a}$ |
| $f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$ | $\nabla f = (A + A^T)\mathbf{x}$ |
| $f(\mathbf{x}) = \lVert\mathbf{x}\rVert^2$ | $\nabla f = 2\mathbf{x}$ |
| $f(\mathbf{x}) = \lVert\mathbf{x} - \mathbf{a}\rVert^2$ | $\nabla f = 2(\mathbf{x} - \mathbf{a})$ |
| $f(\mathbf{x}) = \mathbf{a}^T\mathbf{x} + \mathbf{x}^TB\mathbf{x}$ | $\nabla f = \mathbf{a} + (B + B^T)\mathbf{x}$ |

Matrix Functions

| Function | Derivative |
| --- | --- |
| $\text{tr}(AX)$ | $A^T$ |
| $\text{tr}(X^TA)$ | $A$ |
| $\text{tr}(AXB)$ | $A^TB^T$ |
| $\text{tr}(X^TAX)$ | $AX + A^TX$ |
| $\log \det(X)$ | $(X^{-1})^T$ |
| $\text{tr}(X^{-1})$ | $-(X^{-2})^T$ |

Activation Functions

| Function | Derivative |
| --- | --- |
| $\text{ReLU}(x) = \max(0,x)$ | $1$ if $x > 0$, $\;0$ if $x \leq 0$ |
| $\text{Sigmoid}(x) = \sigma(x) = \frac{1}{1+e^{-x}}$ | $\sigma(x)(1-\sigma(x))$ |
| $\text{Tanh}(x)$ | $1 - \tanh^2(x)$ |
| $\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$ | $\text{softmax}(x_i)(\delta_{ij} - \text{softmax}(x_j))$ |

6. Applications in Machine Learning

6.1 Gradient Descent

The gradient descent algorithm uses the gradient to update parameters:
$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta)$$
Where:
  • $\alpha$ is the learning rate
  • $L(\theta)$ is the loss function
  • $\nabla_\theta L(\theta)$ is the gradient of the loss with respect to parameters
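The sketch below runs this update on the earlier example $L(\theta) = \theta_1^2 + 2\theta_2^2$; the starting point and learning rate are arbitrary choices for the illustration.

```python
import numpy as np

# Minimal sketch: gradient descent on L(theta) = theta1^2 + 2*theta2^2.
grad = lambda theta: np.array([2 * theta[0], 4 * theta[1]])

theta, alpha = np.array([3.0, -2.0]), 0.1   # arbitrary start and learning rate
for _ in range(200):
    theta = theta - alpha * grad(theta)
print(theta)                                 # approaches the minimizer [0, 0]
```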

6.2 Backpropagation

Backpropagation applies the chain rule to compute gradients in neural networks:
For a network with layers $f_1, f_2, \ldots, f_L$:
$$\frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial \mathbf{z}^{(l)}} \frac{\partial \mathbf{z}^{(l)}}{\partial W^{(l)}}$$
Where $\mathbf{z}^{(l)} = W^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$
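For a single layer $\mathbf{z} = W\mathbf{a} + \mathbf{b}$, the update reduces to $\partial L/\partial W = \boldsymbol{\delta}\,\mathbf{a}^T$ with $\boldsymbol{\delta} = \partial L/\partial \mathbf{z}$. The sketch below illustrates this with an assumed loss $L = \lVert\mathbf{z}\rVert^2$ and random values.

```python
import numpy as np

# Minimal sketch: weight and bias gradients for one layer z = W a + b.
rng = np.random.default_rng(3)
W = rng.standard_normal((4, 3))
a = rng.standard_normal(3)       # activations from the previous layer
b = rng.standard_normal(4)

z = W @ a + b
delta = 2 * z                    # dL/dz for the assumed loss L = ||z||^2

dW = np.outer(delta, a)          # dL/dW = delta a^T
db = delta                       # dL/db = delta
print(dW.shape, db.shape)        # (4, 3) (4,)
```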

6.3 Logistic Regression

For logistic regression with loss function:
$$L(\theta) = \sum_{i=1}^n \log(1 + e^{-y_i \theta^T \mathbf{x}_i})$$
The gradient is:
$$\nabla_\theta L = -\sum_{i=1}^n y_i \mathbf{x}_i \sigma(-y_i \theta^T \mathbf{x}_i)$$
Where $\sigma(z) = \frac{1}{1+e^{-z}}$ is the sigmoid function.
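A direct translation of this gradient into NumPy might look like the minimal sketch below; the random data is only there to make the snippet runnable, and the labels are assumed to be in $\{-1, +1\}$.

```python
import numpy as np

# Minimal sketch: loss and gradient for logistic regression with labels in {-1, +1}.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(theta, X, y):
    margins = y * (X @ theta)                    # y_i * theta^T x_i
    loss = np.sum(np.log1p(np.exp(-margins)))    # sum_i log(1 + exp(-y_i theta^T x_i))
    grad = -X.T @ (y * sigmoid(-margins))        # -sum_i y_i x_i sigma(-y_i theta^T x_i)
    return loss, grad

rng = np.random.default_rng(4)
X = rng.standard_normal((20, 3))
y = np.where(rng.standard_normal(20) > 0, 1.0, -1.0)   # synthetic labels in {-1, +1}
print(loss_and_grad(np.zeros(3), X, y))
```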

6.4 Regularization

Ridge Regression (L2 Regularization): $$L(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|^2$$
Gradient: $$\nabla_\mathbf{w} L = -2X^T(\mathbf{y} - X\mathbf{w}) + 2\lambda\mathbf{w}$$
Lasso Regression (L1 Regularization): $$L(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|_1$$
Subgradient: $$\partial_\mathbf{w} L = -2X^T(\mathbf{y} - X\mathbf{w}) + \lambda \,\text{sign}(\mathbf{w})$$
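The sketch below evaluates both expressions at a random point (a minimal NumPy illustration; the data, weights, and $\lambda$ are arbitrary).

```python
import numpy as np

# Minimal sketch: ridge gradient and one lasso subgradient at a point w.
rng = np.random.default_rng(5)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)
w = rng.standard_normal(4)
lam = 0.1

residual = y - X @ w
ridge_grad = -2 * X.T @ residual + 2 * lam * w          # gradient of the L2-penalized loss
lasso_subgrad = -2 * X.T @ residual + lam * np.sign(w)  # sign(w) picks one valid subgradient
print(ridge_grad, lasso_subgrad)
```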

6.5 Support Vector Machines

For SVM with hinge loss:
$$L(\mathbf{w}) = \sum_{i=1}^n \max(0, 1 - y_i \mathbf{w}^T \mathbf{x}_i) + \frac{\lambda}{2}\|\mathbf{w}\|^2$$
The subgradient is:
$$\partial_\mathbf{w} L = \lambda\mathbf{w} - \sum_{i: y_i \mathbf{w}^T \mathbf{x}_i < 1} y_i \mathbf{x}_i$$
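In code, the sum over margin-violating points is just a boolean mask; the sketch below is a minimal NumPy illustration with synthetic data.

```python
import numpy as np

# Minimal sketch: a subgradient of the regularized hinge loss at w.
rng = np.random.default_rng(6)
X = rng.standard_normal((30, 4))
y = np.where(rng.standard_normal(30) > 0, 1.0, -1.0)
w, lam = rng.standard_normal(4), 0.5

margins = y * (X @ w)
violating = margins < 1                               # samples with y_i w^T x_i < 1
subgrad = lam * w - X[violating].T @ y[violating]     # lam*w - sum over violators of y_i x_i
print(subgrad)
```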

7. Advanced Topics

7.1 Hessian Matrix

The Hessian matrix contains second-order derivatives:
$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$
For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$:
$$H = \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}$$

7.2 Newton's Method

Newton's method uses both the gradient and the Hessian:
$$\mathbf{x}_{t+1} = \mathbf{x}_t - H^{-1}\nabla f(\mathbf{x}_t)$$
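On the quadratic example used earlier, a single Newton step reaches the minimizer exactly; the sketch below (minimal NumPy, arbitrary starting point) shows this.

```python
import numpy as np

# Minimal sketch: one Newton step x <- x - H^{-1} grad f(x) on f(x1, x2) = x1^2 + 2*x2^2.
grad = lambda x: np.array([2 * x[0], 4 * x[1]])
H = np.array([[2.0, 0.0],
              [0.0, 4.0]])            # constant Hessian of this quadratic

x = np.array([3.0, -2.0])             # arbitrary starting point
x = x - np.linalg.solve(H, grad(x))   # solve H d = grad instead of forming H^{-1}
print(x)                              # [0, 0]: a quadratic is minimized in one Newton step
```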

7.3 Automatic Differentiation

Modern deep learning frameworks use automatic differentiation:
  • Forward Mode: Computes derivatives alongside function values
  • Reverse Mode: Backpropagates gradients (used in backpropagation)

8. Practical Implementation Tips

8.1 Numerical Stability

  • Use log-sum-exp trick for softmax derivatives
  • Implement gradient clipping to prevent exploding gradients
  • Use appropriate initialization to avoid vanishing gradients

8.2 Vectorization

Always vectorize operations when possible:
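As an illustration of the gain, the minimal sketch below computes the same least-squares gradient with an explicit Python loop and with a single matrix-vector product (the data is synthetic).

```python
import numpy as np

# Minimal sketch: the gradient of ||y - Xw||^2, computed two ways.
rng = np.random.default_rng(7)
X = rng.standard_normal((1000, 10))
y = rng.standard_normal(1000)
w = rng.standard_normal(10)

# Loop version: accumulate -2 * x_i * (y_i - x_i^T w) one sample at a time.
grad_loop = np.zeros(10)
for i in range(X.shape[0]):
    grad_loop += -2 * X[i] * (y[i] - X[i] @ w)

# Vectorized version: a single matrix-vector product.
grad_vec = -2 * X.T @ (y - X @ w)
print(np.allclose(grad_loop, grad_vec))   # True
```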

8.3 Gradient Checking

Verify analytical gradients with numerical gradients:
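A common recipe is to compare the analytic gradient with a central-difference estimate and inspect the relative error; the sketch below is a minimal NumPy version with an assumed example loss.

```python
import numpy as np

# Minimal sketch: gradient checking with central differences.
def numerical_gradient(f, x, h=1e-5):
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

f = lambda x: np.sum(x**2)       # example loss
analytic = lambda x: 2 * x       # its analytic gradient

x = np.random.default_rng(8).standard_normal(5)
num = numerical_gradient(f, x)
rel_err = np.linalg.norm(num - analytic(x)) / (np.linalg.norm(num) + np.linalg.norm(analytic(x)))
print(rel_err)                   # should be very small (roughly 1e-8 or less)
```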

9. Summary

Matrix calculus provides the mathematical foundation for understanding and implementing optimization algorithms in machine learning. Key takeaways:
  1. Scalar derivatives are the foundation for all matrix calculus
  2. Subgradients extend derivatives to non-differentiable functions
  3. Gradients point in the direction of steepest increase
  4. Jacobian matrices describe local linear approximations
  5. Chain rule enables backpropagation in neural networks
  6. Matrix derivatives are essential for advanced ML techniques
Mastery of these concepts enables you to:
  • Design and analyze optimization algorithms
  • Implement neural networks from scratch
  • Understand the mathematical foundations of ML libraries
  • Debug gradient computations effectively

10. Further Reading

Books

  • "The Matrix Cookbook" by Petersen & Pedersen - Comprehensive reference for matrix derivatives
  • "Deep Learning" by Goodfellow, Bengio, and Courville - Applications in deep learning
  • "Pattern Recognition and Machine Learning" by Bishop - Statistical perspective

Online Resources

  • CS231n: Convolutional Neural Networks (Stanford) - Practical applications
  • CS229: Machine Learning (Stanford) - Mathematical foundations
  • Matrix Calculus for Deep Learning (explained.ai) - Interactive explanations


This comprehensive guide covers the essential matrix calculus concepts needed for machine learning. Practice with concrete examples and implement the derivatives yourself to deepen your understanding.
 