Matrix Calculus for Machine Learning: From Gradients to Jacobians
Matrix calculus is the backbone of many machine learning algorithms, from gradient descent optimization to backpropagation in neural networks. This tutorial introduces scalar, vector, and matrix derivatives in a structured, intuitive way with practical examples and formulas.

1. Scalar Derivatives

Scalar derivatives refer to the classical derivatives of scalar functions. For example, if $f(x) = x^n$, the derivative is:
$$f'(x) = n x^{n-1}$$
This is the foundation of all further extensions into vector and matrix calculus.
Key Properties of Scalar Derivatives
  • Linearity: $\frac{d}{dx}\left[af(x) + bg(x)\right] = af'(x) + bg'(x)$
  • Product Rule: $\frac{d}{dx}\left[f(x)g(x)\right] = f'(x)g(x) + f(x)g'(x)$
  • Chain Rule: $\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$
Common Scalar Derivatives

| Function | Derivative |
| --- | --- |
| $x^n$ | $nx^{n-1}$ |
| $e^x$ | $e^x$ |
| $\ln x$ | $\frac{1}{x}$ |
| $\sin x$ | $\cos x$ |
| $\cos x$ | $-\sin x$ |

Example: Compute the derivative of
$$f(x) = 3x^4 + 2x e^x + \sin(x^2)$$

Step 1: Break into parts
Let's label each term for clarity:
$$f_1(x) = 3x^4, \quad f_2(x) = 2x e^x, \quad f_3(x) = \sin(x^2)$$
Then,
$$f(x) = f_1(x) + f_2(x) + f_3(x)$$

🧠 Step 2: Differentiate each part

🔹 $f_1(x) = 3x^4$

Using the power rule:
$$f_1'(x) = 3 \cdot 4x^{3} = 12x^3$$

🔹 $f_2(x) = 2x \cdot e^x$

Using the product rule:
$$f_2'(x) = 2 \cdot e^x + 2x \cdot e^x = 2e^x + 2x e^x$$

🔹 $f_3(x) = \sin(x^2)$

Using the chain rule:
  • Outer function: $\sin(u)$, inner function: $u = x^2$
$$f_3'(x) = \cos(x^2) \cdot 2x = 2x \cos(x^2)$$

✅ Final Answer:

$$f'(x) = 12x^3 + 2e^x + 2x e^x + 2x \cos(x^2)$$

📌 Summary of Concepts Used:

  • Power Rule for $x^4$
  • Product Rule for $x \cdot e^x$
  • Chain Rule for $\sin(x^2)$
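As a quick sanity check, the analytic result can be compared against a central finite difference. The sketch below is a minimal NumPy illustration (the evaluation point $x = 0.7$ is an arbitrary choice), not part of the original derivation.

```python
import numpy as np

# Minimal sketch: compare the analytic derivative
# f'(x) = 12x^3 + 2e^x + 2x e^x + 2x cos(x^2) with a central finite difference.
f = lambda x: 3 * x**4 + 2 * x * np.exp(x) + np.sin(x**2)
df = lambda x: 12 * x**3 + 2 * np.exp(x) + 2 * x * np.exp(x) + 2 * x * np.cos(x**2)

x, h = 0.7, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)   # central-difference approximation
print(numeric, df(x))                       # the two values should agree closely
```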
 

2. Subderivatives and Non-differentiable Functions

In machine learning, especially in deep learning, we often use functions like ReLU that are not differentiable at certain points. In such cases, we use subderivatives.

Definition of Subderivative

For a convex function $f: \mathbb{R}^n \to \mathbb{R}$, a vector $g \in \mathbb{R}^n$ is a subgradient of $f$ at point $x$ if:
$$f(y) \geq f(x) + g^T(y - x) \quad \text{for all } y \in \mathbb{R}^n$$
The set of all subgradients at $x$ is called the subdifferential, denoted $\partial f(x)$.

Example: ReLU Function

The ReLU function $f(x) = \max(0, x)$ has the following subderivative:
$$\partial f(x) = \begin{cases} 1 & \text{if } x > 0 \\ [0,1] & \text{if } x = 0 \\ 0 & \text{if } x < 0 \end{cases}$$

Example: Absolute Value Function

For $f(x) = |x|$:
$$\partial f(x) = \begin{cases} 1 & \text{if } x > 0 \\ [-1,1] & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases}$$
Subgradients allow optimization even when the function has sharp corners (non-smooth points).
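As an illustration, here is a minimal sketch of subgradient descent on $f(x) = |x|$; the starting point, the step size, and the choice of the subgradient $0$ at $x = 0$ are all assumptions made for the example.

```python
# Minimal sketch: subgradient descent on f(x) = |x|.
def subgrad_abs(x):
    """Return one valid subgradient of |x| at x."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0          # any value in [-1, 1] is a valid subgradient at x = 0

x, step = 3.0, 0.1      # arbitrary starting point and step size
for _ in range(100):
    x -= step * subgrad_abs(x)
print(x)                # ends near the minimizer x = 0 (within one step size)
```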

3. Multivariate Derivatives

This section explores derivatives when the inputs or outputs are vectors or matrices. The structure of the derivative depends on the nature of the function.

Derivative Classification by Input/Output Types

| Input Type | Output Type | Result Type | Name |
| --- | --- | --- | --- |
| Scalar | Scalar | Scalar | Ordinary derivative |
| Vector | Scalar | Vector | Gradient |
| Scalar | Vector | Vector | Vector derivative |
| Vector | Vector | Matrix | Jacobian |
| Matrix | Matrix | Tensor | Matrix derivative |

3.1 Gradient: Scalar w.r.t. Vector

Let $f(\mathbf{x})$ be a scalar-valued function with a vector input $\mathbf{x} \in \mathbb{R}^n$. The gradient is:
$$\nabla f(\mathbf{x}) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right]^T$$
Example: For $f(x_1, x_2) = x_1^2 + 2x_2^2$:
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 2x_1 \\ 4x_2 \end{bmatrix}$$
Geometric Interpretation: The gradient points in the direction of the steepest increase of the function. For the elliptical function $f(x_1, x_2) = x_1^2 + 2x_2^2$, the gradient vectors point outward from the origin, perpendicular to the elliptical level curves.
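To make the example concrete, the sketch below (a minimal NumPy illustration, with an arbitrary evaluation point) compares the analytic gradient of $f(x_1, x_2) = x_1^2 + 2x_2^2$ with a finite-difference estimate.

```python
import numpy as np

# Minimal sketch: analytic gradient of f(x1, x2) = x1^2 + 2*x2^2 vs. finite differences.
f = lambda x: x[0]**2 + 2 * x[1]**2
grad = lambda x: np.array([2 * x[0], 4 * x[1]])

x, h = np.array([1.0, -2.0]), 1e-6
numeric = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(2)])
print(grad(x), numeric)   # both should be close to [2, -8]
```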

Important Gradient Properties

  • Direction: Points toward steepest increase
  • Magnitude: Rate of change in that direction
  • Orthogonality: Perpendicular to level curves/surfaces

3.2 Scalar w.r.t. Vector: Linear Functions

For linear functions, the gradient has a simple form. If:
$$f(\mathbf{x}) = \mathbf{a}^T\mathbf{x}$$
Then the gradient is:
$$\frac{df}{d\mathbf{x}} = \mathbf{a}$$
This gives a column vector that represents the constant rate of change in each direction.

Quadratic Forms

For quadratic functions:
$$f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$$
The gradient is:
$$\nabla f(\mathbf{x}) = (A + A^T)\mathbf{x}$$
If $A$ is symmetric, this simplifies to:
$$\nabla f(\mathbf{x}) = 2A\mathbf{x}$$
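Both identities are easy to verify numerically. The sketch below is a minimal NumPy check with randomly chosen $\mathbf{a}$, $A$, and $\mathbf{x}$ (the dimensions are arbitrary).

```python
import numpy as np

# Minimal sketch: check grad(a^T x) = a and grad(x^T A x) = (A + A^T) x numerically.
rng = np.random.default_rng(0)
n = 4
a = rng.standard_normal(n)
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

def num_grad(f, x, h=1e-6):
    # central-difference estimate of the gradient of a scalar function f at x
    return np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(len(x))])

print(np.allclose(num_grad(lambda v: a @ v, x), a, atol=1e-4))                  # True
print(np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x, atol=1e-4))  # True
```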

3.3 Vector w.r.t. Scalar

If $\mathbf{y}(x) \in \mathbb{R}^n$ and $x \in \mathbb{R}$, then:
$$\frac{d\mathbf{y}}{dx} = \begin{bmatrix} \frac{dy_1}{dx} \\ \frac{dy_2}{dx} \\ \vdots \\ \frac{dy_n}{dx} \end{bmatrix}$$
Example: For $\mathbf{y}(x) = \begin{bmatrix} x^2 \\ e^x \\ \sin(x) \end{bmatrix}$:
$$\frac{d\mathbf{y}}{dx} = \begin{bmatrix} 2x \\ e^x \\ \cos(x) \end{bmatrix}$$
This type of derivative is commonly used in:
  • Dynamic systems modeling
  • Time-dependent neural networks
  • Ordinary differential equations
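For the example above, the component-wise derivative can be checked directly; the sketch below is a minimal NumPy illustration at an arbitrary point.

```python
import numpy as np

# Minimal sketch: derivative of the vector-valued y(x) = [x^2, e^x, sin x] w.r.t. scalar x.
y = lambda x: np.array([x**2, np.exp(x), np.sin(x)])
dy = lambda x: np.array([2 * x, np.exp(x), np.cos(x)])   # analytic component-wise derivative

x, h = 1.3, 1e-6
numeric = (y(x + h) - y(x - h)) / (2 * h)                # central difference, component-wise
print(np.allclose(numeric, dy(x), atol=1e-5))            # True
```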

3.4 Vector w.r.t. Vector: Jacobian Matrix

Let $\mathbf{y}(\mathbf{x}) \in \mathbb{R}^m$ and $\mathbf{x} \in \mathbb{R}^n$. The Jacobian is the matrix of all partial derivatives:
$$J_{ij} = \frac{\partial y_i}{\partial x_j}$$
The complete Jacobian matrix is:
$$J = \frac{d\mathbf{y}}{d\mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}$$
Example: For $\mathbf{y} = \begin{bmatrix} x_1^2 + x_2 \\ x_1x_2 \\ x_1 + 2x_2 \end{bmatrix}$:
$$J = \begin{bmatrix} 2x_1 & 1 \\ x_2 & x_1 \\ 1 & 2 \end{bmatrix}$$
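The Jacobian above can be verified column by column with finite differences; the following minimal NumPy sketch uses an arbitrary evaluation point.

```python
import numpy as np

# Minimal sketch: Jacobian of y(x) = [x1^2 + x2, x1*x2, x1 + 2*x2], checked numerically.
def y(x):
    return np.array([x[0]**2 + x[1], x[0] * x[1], x[0] + 2 * x[1]])

def jacobian(x):
    return np.array([[2 * x[0], 1.0],
                     [x[1],     x[0]],
                     [1.0,      2.0]])

x, h = np.array([1.5, -0.5]), 1e-6
num_J = np.column_stack([(y(x + h * e) - y(x - h * e)) / (2 * h) for e in np.eye(2)])
print(np.allclose(num_J, jacobian(x), atol=1e-5))   # True
```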

Applications of Jacobian Matrices

  • Neural Networks: Forward and backward propagation
  • Optimization: Newton's method and quasi-Newton methods
  • Numerical Analysis: Solving systems of nonlinear equations
  • Control Theory: Linearization of nonlinear systems

3.5 Matrix w.r.t. Matrix

Matrix derivatives involve functions where both inputs and outputs are matrices. These are essential for advanced machine learning techniques.

Common Matrix Derivative Formulas

| Function | Derivative |
| --- | --- |
| $\text{tr}(AX)$ | $A^T$ |
| $\text{tr}(X^TA)$ | $A$ |
| $\text{tr}(AXB)$ | $A^TB^T$ |
| $\text{tr}(X^TAX)$ | $AX + A^TX$ |
| $\log \det(X)$ | $(X^{-1})^T$ |
| $\det(X)$ | $\det(X)(X^{-1})^T$ |

Detailed Examples

Example 1: Trace of Linear Function. For $f(X) = \text{tr}(AX)$:
$$\nabla_X f = A^T$$
Example 2: Quadratic Form. For $f(X) = \text{tr}(X^TAX)$:
$$\nabla_X f = AX + A^TX = (A + A^T)X$$
If $A$ is symmetric:
$$\nabla_X f = 2AX$$
Example 3: Log-Determinant. For $f(X) = \log \det(X)$:
$$\nabla_X f = (X^{-1})^T$$
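These trace identities can also be checked element by element; the sketch below is a minimal NumPy check of $\nabla_X \text{tr}(AX) = A^T$ with random matrices.

```python
import numpy as np

# Minimal sketch: check d tr(AX)/dX = A^T element-wise with finite differences.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
X = rng.standard_normal((3, 3))

f = lambda M: np.trace(A @ M)
h = 1e-6
num_grad = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(X)
        E[i, j] = h
        num_grad[i, j] = (f(X + E) - f(X - E)) / (2 * h)

print(np.allclose(num_grad, A.T, atol=1e-5))   # True
```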

4. Chain Rule in Matrix Calculus

The chain rule generalizes to matrix calculus and is fundamental to backpropagation in neural networks.

Scalar Chain Rule

For scalar functions, if $z = f(y)$ and $y = g(x)$:
$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$

Vector Chain Rule

For vector functions:
$$\frac{d\mathbf{z}}{d\mathbf{x}} = \frac{d\mathbf{z}}{d\mathbf{y}} \cdot \frac{d\mathbf{y}}{d\mathbf{x}}$$
Where the multiplication is matrix multiplication of Jacobians.

Multi-layer Chain Rule

For a composition of functions $\mathbf{z} = f_3(f_2(f_1(\mathbf{x})))$:
$$\frac{d\mathbf{z}}{d\mathbf{x}} = \frac{d\mathbf{z}}{d\mathbf{y_2}} \cdot \frac{d\mathbf{y_2}}{d\mathbf{y_1}} \cdot \frac{d\mathbf{y_1}}{d\mathbf{x}}$$
This is the mathematical foundation of backpropagation in deep neural networks.
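As a concrete illustration (the maps $f_1(\mathbf{x}) = A\mathbf{x}$ and $f_2(\mathbf{y}) = \sin(\mathbf{y})$ are chosen for the example, not taken from the text), the sketch below multiplies the two Jacobians and compares the product with a numerical Jacobian of the composition.

```python
import numpy as np

# Minimal sketch: chain rule for z = f2(f1(x)) with f1(x) = A x and f2(y) = sin(y) element-wise.
# The chain rule gives dz/dx = diag(cos(A x)) @ A.
rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

f = lambda v: np.sin(A @ v)                 # the composed map
J_chain = np.diag(np.cos(A @ x)) @ A        # product of the two Jacobians

h = 1e-6
J_numeric = np.column_stack([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(3)])
print(np.allclose(J_chain, J_numeric, atol=1e-5))   # True
```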

5. Common Derivative Formulas (Cheat Sheet)

Vector Functions

| Function | Derivative |
| --- | --- |
| $f(\mathbf{x}) = \mathbf{a}^T \mathbf{x}$ | $\nabla f = \mathbf{a}$ |
| $f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$ | $\nabla f = (A + A^T)\mathbf{x}$ |
| $f(\mathbf{x}) = \lVert\mathbf{x}\rVert^2$ | $\nabla f = 2\mathbf{x}$ |
| $f(\mathbf{x}) = \lVert\mathbf{x} - \mathbf{a}\rVert^2$ | $\nabla f = 2(\mathbf{x} - \mathbf{a})$ |
| $f(\mathbf{x}) = \mathbf{a}^T\mathbf{x} + \mathbf{x}^TB\mathbf{x}$ | $\nabla f = \mathbf{a} + (B + B^T)\mathbf{x}$ |

Matrix Functions

| Function | Derivative |
| --- | --- |
| $\text{tr}(AX)$ | $A^T$ |
| $\text{tr}(X^TA)$ | $A$ |
| $\text{tr}(AXB)$ | $A^TB^T$ |
| $\text{tr}(X^TAX)$ | $AX + A^TX$ |
| $\log \det(X)$ | $(X^{-1})^T$ |
| $\text{tr}(X^{-1})$ | $-(X^{-2})^T$ |

Activation Functions

| Function | Derivative |
| --- | --- |
| $\text{ReLU}(x) = \max(0,x)$ | $1$ if $x > 0$, $\;0$ if $x \leq 0$ |
| $\text{Sigmoid}(x) = \sigma(x) = \frac{1}{1+e^{-x}}$ | $\sigma(x)(1-\sigma(x))$ |
| $\text{Tanh}(x)$ | $1 - \tanh^2(x)$ |
| $\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$ | $\text{softmax}(x_i)(\delta_{ij} - \text{softmax}(x_j))$ |

6. Applications in Machine Learning

6.1 Gradient Descent

The gradient descent algorithm uses the gradient to update parameters:
$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta)$$
Where:
  • $\alpha$ is the learning rate
  • $L(\theta)$ is the loss function
  • $\nabla_\theta L(\theta)$ is the gradient of the loss with respect to parameters
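The sketch below runs this update on the earlier example $L(\theta) = \theta_1^2 + 2\theta_2^2$; the starting point and learning rate are arbitrary choices for the illustration.

```python
import numpy as np

# Minimal sketch: gradient descent on L(theta) = theta1^2 + 2*theta2^2.
grad = lambda theta: np.array([2 * theta[0], 4 * theta[1]])

theta, alpha = np.array([3.0, -2.0]), 0.1   # arbitrary start and learning rate
for _ in range(200):
    theta = theta - alpha * grad(theta)
print(theta)                                 # approaches the minimizer [0, 0]
```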

6.2 Backpropagation

Backpropagation applies the chain rule to compute gradients in neural networks:
For a network with layers $f_1, f_2, \ldots, f_L$:
$$\frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial \mathbf{z}^{(l)}} \frac{\partial \mathbf{z}^{(l)}}{\partial W^{(l)}}$$
Where $\mathbf{z}^{(l)} = W^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$
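For a single layer $\mathbf{z} = W\mathbf{a} + \mathbf{b}$, the update reduces to $\partial L/\partial W = \boldsymbol{\delta}\,\mathbf{a}^T$ with $\boldsymbol{\delta} = \partial L/\partial \mathbf{z}$. The sketch below illustrates this with an assumed loss $L = \lVert\mathbf{z}\rVert^2$ and random values.

```python
import numpy as np

# Minimal sketch: weight and bias gradients for one layer z = W a + b.
rng = np.random.default_rng(3)
W = rng.standard_normal((4, 3))
a = rng.standard_normal(3)       # activations from the previous layer
b = rng.standard_normal(4)

z = W @ a + b
delta = 2 * z                    # dL/dz for the assumed loss L = ||z||^2

dW = np.outer(delta, a)          # dL/dW = delta a^T
db = delta                       # dL/db = delta
print(dW.shape, db.shape)        # (4, 3) (4,)
```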

6.3 Logistic Regression

For logistic regression with loss function:
$$L(\theta) = \sum_{i=1}^n \log(1 + e^{-y_i \theta^T \mathbf{x}_i})$$
The gradient is:
$$\nabla_\theta L = -\sum_{i=1}^n y_i \mathbf{x}_i \sigma(-y_i \theta^T \mathbf{x}_i)$$
Where $\sigma(z) = \frac{1}{1+e^{-z}}$ is the sigmoid function.
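A direct translation of this gradient into NumPy might look like the minimal sketch below; the random data is only there to make the snippet runnable, and the labels are assumed to be in $\{-1, +1\}$.

```python
import numpy as np

# Minimal sketch: loss and gradient for logistic regression with labels in {-1, +1}.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(theta, X, y):
    margins = y * (X @ theta)                    # y_i * theta^T x_i
    loss = np.sum(np.log1p(np.exp(-margins)))    # sum_i log(1 + exp(-y_i theta^T x_i))
    grad = -X.T @ (y * sigmoid(-margins))        # -sum_i y_i x_i sigma(-y_i theta^T x_i)
    return loss, grad

rng = np.random.default_rng(4)
X = rng.standard_normal((20, 3))
y = np.where(rng.standard_normal(20) > 0, 1.0, -1.0)   # synthetic labels in {-1, +1}
print(loss_and_grad(np.zeros(3), X, y))
```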

6.4 Regularization

Ridge Regression (L2 Regularization): $$L(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|^2$$
Gradient: $$\nabla_\mathbf{w} L = -2X^T(\mathbf{y} - X\mathbf{w}) + 2\lambda\mathbf{w}$$
Lasso Regression (L1 Regularization): $$L(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|_1$$
Subgradient: $$\partial_\mathbf{w} L = -2X^T(\mathbf{y} - X\mathbf{w}) + \lambda \,\text{sign}(\mathbf{w})$$
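The sketch below evaluates both expressions at a random point (a minimal NumPy illustration; the data, weights, and $\lambda$ are arbitrary).

```python
import numpy as np

# Minimal sketch: ridge gradient and one lasso subgradient at a point w.
rng = np.random.default_rng(5)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)
w = rng.standard_normal(4)
lam = 0.1

residual = y - X @ w
ridge_grad = -2 * X.T @ residual + 2 * lam * w          # gradient of the L2-penalized loss
lasso_subgrad = -2 * X.T @ residual + lam * np.sign(w)  # sign(w) picks one valid subgradient
print(ridge_grad, lasso_subgrad)
```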

6.5 Support Vector Machines

For SVM with hinge loss:
$$L(\mathbf{w}) = \sum_{i=1}^n \max(0, 1 - y_i \mathbf{w}^T \mathbf{x}_i) + \frac{\lambda}{2}\|\mathbf{w}\|^2$$
The subgradient is:
$$\partial_\mathbf{w} L = \lambda\mathbf{w} - \sum_{i: y_i \mathbf{w}^T \mathbf{x}_i < 1} y_i \mathbf{x}_i$$
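In code, the sum over margin-violating points is just a boolean mask; the sketch below is a minimal NumPy illustration with synthetic data.

```python
import numpy as np

# Minimal sketch: a subgradient of the regularized hinge loss at w.
rng = np.random.default_rng(6)
X = rng.standard_normal((30, 4))
y = np.where(rng.standard_normal(30) > 0, 1.0, -1.0)
w, lam = rng.standard_normal(4), 0.5

margins = y * (X @ w)
violating = margins < 1                               # samples with y_i w^T x_i < 1
subgrad = lam * w - X[violating].T @ y[violating]     # lam*w - sum over violators of y_i x_i
print(subgrad)
```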

7. Advanced Topics

7.1 Hessian Matrix

The Hessian matrix contains second-order derivatives:
$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$
For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$:
$$H = \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}$$

7.2 Newton's Method

Newton's method uses both the gradient and the Hessian:
$$\mathbf{x}_{t+1} = \mathbf{x}_t - H^{-1}\nabla f(\mathbf{x}_t)$$
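On the quadratic example used earlier, a single Newton step reaches the minimizer exactly; the sketch below (minimal NumPy, arbitrary starting point) shows this.

```python
import numpy as np

# Minimal sketch: one Newton step x <- x - H^{-1} grad f(x) on f(x1, x2) = x1^2 + 2*x2^2.
grad = lambda x: np.array([2 * x[0], 4 * x[1]])
H = np.array([[2.0, 0.0],
              [0.0, 4.0]])            # constant Hessian of this quadratic

x = np.array([3.0, -2.0])             # arbitrary starting point
x = x - np.linalg.solve(H, grad(x))   # solve H d = grad instead of forming H^{-1}
print(x)                              # [0, 0]: a quadratic is minimized in one Newton step
```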

7.3 Automatic Differentiation

Modern deep learning frameworks use automatic differentiation:
  • Forward Mode: Computes derivatives alongside function values
  • Reverse Mode: Backpropagates gradients (used in backpropagation)

8. Practical Implementation Tips

8.1 Numerical Stability

  • Use log-sum-exp trick for softmax derivatives
  • Implement gradient clipping to prevent exploding gradients
  • Use appropriate initialization to avoid vanishing gradients

8.2 Vectorization

Always vectorize operations when possible:
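As an illustration of the gain, the minimal sketch below computes the same least-squares gradient with an explicit Python loop and with a single matrix-vector product (the data is synthetic).

```python
import numpy as np

# Minimal sketch: the gradient of ||y - Xw||^2, computed two ways.
rng = np.random.default_rng(7)
X = rng.standard_normal((1000, 10))
y = rng.standard_normal(1000)
w = rng.standard_normal(10)

# Loop version: accumulate -2 * x_i * (y_i - x_i^T w) one sample at a time.
grad_loop = np.zeros(10)
for i in range(X.shape[0]):
    grad_loop += -2 * X[i] * (y[i] - X[i] @ w)

# Vectorized version: a single matrix-vector product.
grad_vec = -2 * X.T @ (y - X @ w)
print(np.allclose(grad_loop, grad_vec))   # True
```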

8.3 Gradient Checking

Verify analytical gradients with numerical gradients:
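A common recipe is to compare the analytic gradient with a central-difference estimate and inspect the relative error; the sketch below is a minimal NumPy version with an assumed example loss.

```python
import numpy as np

# Minimal sketch: gradient checking with central differences.
def numerical_gradient(f, x, h=1e-5):
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

f = lambda x: np.sum(x**2)       # example loss
analytic = lambda x: 2 * x       # its analytic gradient

x = np.random.default_rng(8).standard_normal(5)
num = numerical_gradient(f, x)
rel_err = np.linalg.norm(num - analytic(x)) / (np.linalg.norm(num) + np.linalg.norm(analytic(x)))
print(rel_err)                   # should be very small (roughly 1e-8 or less)
```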

9. Summary

Matrix calculus provides the mathematical foundation for understanding and implementing optimization algorithms in machine learning. Key takeaways:
  1. Scalar derivatives are the foundation for all matrix calculus
  2. Subgradients extend derivatives to non-differentiable functions
  3. Gradients point in the direction of steepest increase
  4. Jacobian matrices describe local linear approximations
  5. Chain rule enables backpropagation in neural networks
  6. Matrix derivatives are essential for advanced ML techniques
Mastery of these concepts enables you to:
  • Design and analyze optimization algorithms
  • Implement neural networks from scratch
  • Understand the mathematical foundations of ML libraries
  • Debug gradient computations effectively

10. Further Reading

Books

  • "The Matrix Cookbook" by Petersen & Pedersen - Comprehensive reference for matrix derivatives
  • "Deep Learning" by Goodfellow, Bengio, and Courville - Applications in deep learning
  • "Pattern Recognition and Machine Learning" by Bishop - Statistical perspective

Online Resources

  • CS231n: Convolutional Neural Networks (Stanford) - Practical applications
  • CS229: Machine Learning (Stanford) - Mathematical foundations
  • Matrix Calculus for Deep Learning (explained.ai) - Interactive explanations


This comprehensive guide covers the essential matrix calculus concepts needed for machine learning. Practice with concrete examples and implement the derivatives yourself to deepen your understanding.
 