Matrix calculus is the backbone of many machine learning algorithms, from gradient descent optimization to backpropagation in neural networks. This tutorial introduces scalar, vector, and matrix derivatives in a structured, intuitive way with practical examples and formulas.
1. Scalar Derivatives
Scalar derivatives refer to the classical derivatives of scalar functions. For example, if $f(x) = x^2$, the derivative is:
$$f'(x) = 2x$$
This is the foundation of all further extensions into vector and matrix calculus.
Key Properties of Scalar Derivatives
- Linearity: $(a f(x) + b g(x))' = a f'(x) + b g'(x)$
- Product Rule: $(f(x)g(x))' = f'(x)g(x) + f(x)g'(x)$
- Chain Rule: $(f(g(x)))' = f'(g(x)) \cdot g'(x)$
Common Scalar Derivatives
| Function | Derivative |
| --- | --- |
| $c$ | $0$ |
| $x^n$ | $n x^{n-1}$ |
| $e^x$ | $e^x$ |
| $\ln x$ | $\frac{1}{x}$ |
| $\sin x$ | $\cos x$ |
| $\cos x$ | $-\sin x$ |
Example: Compute the derivative of $f(x) = 3x^4 + 2x e^x + \sin(x^2)$
Step 1: Break into parts
Let's label each term for clarity:
$$f_1(x) = 3x^4, \qquad f_2(x) = 2x e^x, \qquad f_3(x) = \sin(x^2)$$
Then,
$$f(x) = f_1(x) + f_2(x) + f_3(x)$$
🧠 Step 2: Differentiate each part
🔹 $f_1(x) = 3x^4$
Using the power rule:
$$f_1'(x) = 3 \cdot 4x^3 = 12x^3$$
🔹 $f_2(x) = 2x \cdot e^x$
Using the product rule:
$$f_2'(x) = 2 \cdot e^x + 2x \cdot e^x = 2e^x + 2x e^x$$
🔹 $f_3(x) = \sin(x^2)$
Using the chain rule:
- Outer function: $\sin(u)$, inner function: $u = x^2$
$$f_3'(x) = \cos(x^2) \cdot 2x = 2x \cos(x^2)$$
✅ Final Answer:
$$f'(x) = 12x^3 + 2e^x + 2x e^x + 2x \cos(x^2)$$
📌 Summary of Concepts Used:
- Power Rule for $x^4$
- Product Rule for $x \cdot e^x$
- Chain Rule for $\sin(x^2)$
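As a quick sanity check, the same result can be reproduced symbolically; a minimal sketch using SymPy (the variable names are just the ones from the example above):

```python
import sympy as sp

x = sp.symbols('x')
f = 3*x**4 + 2*x*sp.exp(x) + sp.sin(x**2)

# Should print the hand-computed result: 12x^3 + 2e^x + 2x e^x + 2x cos(x^2)
print(sp.diff(f, x))
```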
2. Subderivatives and Non-differentiable Functions
In machine learning, especially in deep learning, we often use functions like ReLU that are not differentiable at certain points. In such cases, we use subderivatives.
Definition of Subderivative
For a convex function $f: \mathbb{R}^n \to \mathbb{R}$, a vector $g \in \mathbb{R}^n$ is a subgradient of $f$ at point $x$ if:
$$f(y) \geq f(x) + g^T(y - x) \quad \text{for all } y \in \mathbb{R}^n$$
The set of all subgradients at $x$ is called the subdifferential, denoted $\partial f(x)$.
Example: ReLU Function
The ReLU function $f(x) = \max(0, x)$ has the following subderivative:
$$\partial f(x) = \begin{cases}
1 & \text{if } x > 0 \\
[0,1] & \text{if } x = 0 \\
0 & \text{if } x < 0
\end{cases}$$
Example: Absolute Value Function
For $f(x) = |x|$:
$$\partial f(x) = \begin{cases}
1 & \text{if } x > 0 \\
[-1,1] & \text{if } x = 0 \\
-1 & \text{if } x < 0
\end{cases}$$
Subgradients allow optimization even when the function has sharp corners (non-smooth points).
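In code, any single element of the subdifferential can be used as a descent direction. A minimal sketch (the helper names are illustrative; 0 is chosen at the kink, a common convention):

```python
import numpy as np

def relu_subgradient(x):
    """One element of the ReLU subdifferential; 0 is chosen at x == 0."""
    return np.where(x > 0, 1.0, 0.0)

def abs_subgradient(x):
    """One element of the subdifferential of |x|; 0 is chosen at x == 0."""
    return np.sign(x)

print(relu_subgradient(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 1.]
print(abs_subgradient(np.array([-2.0, 0.0, 3.0])))   # [-1.  0.  1.]
```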
3. Multivariate Derivatives
This section explores derivatives when the inputs or outputs are vectors or matrices. The structure of the derivative depends on the nature of the function.
Derivative Classification by Input/Output Types
| Input Type | Output Type | Result Type | Name |
| --- | --- | --- | --- |
| Scalar | Scalar | Scalar | Ordinary derivative |
| Vector | Scalar | Vector | Gradient |
| Scalar | Vector | Vector | Vector derivative |
| Vector | Vector | Matrix | Jacobian |
| Matrix | Matrix | Tensor | Matrix derivative |
3.1 Gradient: Scalar w.r.t. Vector
Let $f(\mathbf{x})$ be a scalar-valued function with a vector input $\mathbf{x} \in \mathbb{R}^n$. The gradient is:
$$\nabla f(\mathbf{x}) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right]^T$$
Example:
For $f(x_1, x_2) = x_1^2 + 2x_2^2$:
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 2x_1 \\ 4x_2 \end{bmatrix}$$
Geometric Interpretation: The gradient points in the direction of the steepest increase of the function. For the elliptical function $f(x_1, x_2) = x_1^2 + 2x_2^2$, the gradient vectors point outward from the origin, perpendicular to the elliptical level curves.
Important Gradient Properties
- Direction: Points toward steepest increase
- Magnitude: Rate of change in that direction
- Orthogonality: Perpendicular to level curves/surfaces
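These properties are easy to verify numerically for the example $f(x_1, x_2) = x_1^2 + 2x_2^2$; a minimal sketch (function names are illustrative):

```python
import numpy as np

def grad_f(x):
    """Analytical gradient of f(x1, x2) = x1^2 + 2*x2^2."""
    return np.array([2.0 * x[0], 4.0 * x[1]])

x = np.array([1.0, 1.0])
g = grad_f(x)                       # [2., 4.]: direction of steepest increase
tangent = np.array([-g[1], g[0]])   # a direction along the level curve at x
print(np.dot(g, tangent))           # 0.0: the gradient is orthogonal to level curves
```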
3.2 Scalar w.r.t. Vector: Linear Functions
For linear functions, the gradient has a simple form. If:
$$f(\mathbf{x}) = \mathbf{a}^T\mathbf{x}$$
Then the gradient is:
$$\frac{df}{d\mathbf{x}} = \mathbf{a}$$
This gives a column vector that represents the constant rate of change in each direction.
Quadratic Forms
For quadratic functions:
$$f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$$
The gradient is:
$$\nabla f(\mathbf{x}) = (A + A^T)\mathbf{x}$$
If $A$ is symmetric, this simplifies to:
$$\nabla f(\mathbf{x}) = 2A\mathbf{x}$$
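The quadratic-form rule $\nabla(\mathbf{x}^T A \mathbf{x}) = (A + A^T)\mathbf{x}$ can be checked against central differences; a minimal sketch (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))   # deliberately not symmetric
x = rng.standard_normal(3)

f = lambda v: v @ A @ v
analytic = (A + A.T) @ x

eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```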
3.3 Vector w.r.t. Scalar
If $\mathbf{y}(x) \in \mathbb{R}^n$ and $x \in \mathbb{R}$, then:
$$\frac{d\mathbf{y}}{dx} = \begin{bmatrix} \frac{dy_1}{dx} \\ \frac{dy_2}{dx} \\ \vdots \\ \frac{dy_n}{dx} \end{bmatrix}$$
Example:
For $\mathbf{y}(x) = \begin{bmatrix} x^2 \\ e^x \\ \sin(x) \end{bmatrix}$:
$$\frac{d\mathbf{y}}{dx} = \begin{bmatrix} 2x \\ e^x \\ \cos(x) \end{bmatrix}$$
This type of derivative is commonly used in:
- Dynamic systems modeling
- Time-dependent neural networks
- Ordinary differential equations
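The example above can be reproduced componentwise with SymPy; a minimal sketch (names illustrative):

```python
import sympy as sp

x = sp.symbols('x')
y = sp.Matrix([x**2, sp.exp(x), sp.sin(x)])

# Elementwise derivative dy/dx = [2x, e^x, cos(x)]^T
print(y.diff(x))
```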
3.4 Vector w.r.t. Vector: Jacobian Matrix
Let $\mathbf{y}(\mathbf{x}) \in \mathbb{R}^m$ and $\mathbf{x} \in \mathbb{R}^n$. The Jacobian is the matrix of all partial derivatives:
$$J_{ij} = \frac{\partial y_i}{\partial x_j}$$
The complete Jacobian matrix is:
$$J = \frac{d\mathbf{y}}{d\mathbf{x}} = \begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{bmatrix} \in \mathbb{R}^{m \times n}$$
Example:
For $\mathbf{y} = \begin{bmatrix} x_1^2 + x_2 \\ x_1x_2 \\ x_1 + 2x_2 \end{bmatrix}$:
$$J = \begin{bmatrix}
2x_1 & 1 \\
x_2 & x_1 \\
1 & 2
\end{bmatrix}$$
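This Jacobian can be checked with a central-difference approximation; a minimal sketch (function names are illustrative):

```python
import numpy as np

def y(x):
    """y: R^2 -> R^3 from the example above."""
    x1, x2 = x
    return np.array([x1**2 + x2, x1 * x2, x1 + 2 * x2])

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian: one column per input dimension."""
    cols = [(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(len(x))]
    return np.stack(cols, axis=1)

x = np.array([2.0, 3.0])
print(numerical_jacobian(y, x))
# approximately [[2*x1, 1], [x2, x1], [1, 2]] = [[4, 1], [3, 2], [1, 2]]
```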
Applications of Jacobian Matrices
- Neural Networks: Forward and backward propagation
- Optimization: Newton's method and quasi-Newton methods
- Numerical Analysis: Solving systems of nonlinear equations
- Control Theory: Linearization of nonlinear systems
3.5 Matrix w.r.t. Matrix
Matrix derivatives involve functions where both inputs and outputs are matrices. These are essential for advanced machine learning techniques.
Common Matrix Derivative Formulas
| Function | Derivative |
| --- | --- |
| $\text{tr}(AX)$ | $A^T$ |
| $\text{tr}(X^TA)$ | $A$ |
| $\text{tr}(AXB)$ | $A^TB^T$ |
| $\text{tr}(X^TAX)$ | $AX + A^TX$ |
| $\log \det(X)$ | $(X^{-1})^T$ |
| $\det(X)$ | $\det(X)(X^{-1})^T$ |
Detailed Examples
Example 1: Trace of Linear Function
For $f(X) = \text{tr}(AX)$:
$$\frac{\partial f}{\partial X} = A^T$$
Example 2: Quadratic Form
For $f(X) = \text{tr}(X^TAX)$:
$$\frac{\partial f}{\partial X} = AX + A^TX$$
If $A$ is symmetric:
$$\frac{\partial f}{\partial X} = 2AX$$
Example 3: Log-Determinant
For $f(X) = \log \det(X)$:
$$\frac{\partial f}{\partial X} = (X^{-1})^T$$
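Any entry of these tables can be confirmed elementwise with finite differences; a minimal sketch for $\text{tr}(AX)$ (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
X = rng.standard_normal((3, 3))

f = lambda M: np.trace(A @ M)

# Finite-difference gradient, entry by entry
eps = 1e-6
grad = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(X)
        E[i, j] = eps
        grad[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.allclose(grad, A.T))  # True: d tr(AX)/dX = A^T
```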
4. Chain Rule in Matrix Calculus
The chain rule generalizes to matrix calculus and is fundamental to backpropagation in neural networks.
Scalar Chain Rule
For scalar functions $z = f(y)$ with $y = g(x)$:
$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$
Vector Chain Rule
For vector functions:
$$\frac{d\mathbf{z}}{d\mathbf{x}} = \frac{d\mathbf{z}}{d\mathbf{y}} \cdot \frac{d\mathbf{y}}{d\mathbf{x}}$$
Where the multiplication is matrix multiplication of Jacobians.
Multi-layer Chain Rule
For a composition of functions $\mathbf{z} = f_3(f_2(f_1(\mathbf{x})))$, with intermediate values $\mathbf{y_1} = f_1(\mathbf{x})$ and $\mathbf{y_2} = f_2(\mathbf{y_1})$:
$$\frac{d\mathbf{z}}{d\mathbf{x}} = \frac{d\mathbf{z}}{d\mathbf{y_2}} \cdot \frac{d\mathbf{y_2}}{d\mathbf{y_1}} \cdot \frac{d\mathbf{y_1}}{d\mathbf{x}}$$
This is the mathematical foundation of backpropagation in deep neural networks.
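A minimal sketch of the chain rule as matrix multiplication of Jacobians, for a two-layer composition (the layer functions and their hand-coded Jacobians are illustrative):

```python
import numpy as np

# z = f2(f1(x)) with f1: R^2 -> R^2, f1(x) = [x1*x2, x1 + x2]
# and f2: R^2 -> R,  f2(y) = sin(y1) + y2
def J_f1(x):
    x1, x2 = x
    return np.array([[x2, x1],
                     [1.0, 1.0]])

def J_f2(y):
    y1, _ = y
    return np.array([[np.cos(y1), 1.0]])

x = np.array([0.5, 2.0])
y1 = np.array([x[0] * x[1], x[0] + x[1]])

# dz/dx = dz/dy * dy/dx: a (1x2) Jacobian times a (2x2) Jacobian
print(J_f2(y1) @ J_f1(x))   # shape (1, 2)
```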
5. Common Derivative Formulas (Cheat Sheet)
Vector Functions
| Function | Derivative |
| --- | --- |
| $f(\mathbf{x}) = \mathbf{a}^T \mathbf{x}$ | $\nabla f = \mathbf{a}$ |
| $f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$ | $\nabla f = (A + A^T)\mathbf{x}$ |
| $f(\mathbf{x}) = \|\mathbf{x}\|^2$ | $\nabla f = 2\mathbf{x}$ |
| $f(\mathbf{x}) = \|\mathbf{x} - \mathbf{a}\|^2$ | $\nabla f = 2(\mathbf{x} - \mathbf{a})$ |
| $f(\mathbf{x}) = \mathbf{a}^T\mathbf{x} + \mathbf{x}^TB\mathbf{x}$ | $\nabla f = \mathbf{a} + (B + B^T)\mathbf{x}$ |
Matrix Functions
| Function | Derivative |
| --- | --- |
| $\text{tr}(AX)$ | $A^T$ |
| $\text{tr}(X^TA)$ | $A$ |
| $\text{tr}(AXB)$ | $A^TB^T$ |
| $\text{tr}(X^TAX)$ | $AX + A^TX$ |
| $\log \det(X)$ | $(X^{-1})^T$ |
| $\text{tr}(X^{-1})$ | $-(X^{-2})^T$ |
Activation Functions
| Function | Derivative |
| --- | --- |
| $\text{ReLU}(x) = \max(0,x)$ | $\begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$ |
| $\text{Sigmoid}(x) = \frac{1}{1+e^{-x}}$ | $\sigma(x)(1-\sigma(x))$ |
| $\text{Tanh}(x)$ | $1 - \tanh^2(x)$ |
| $\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$ | $\frac{\partial \, \text{softmax}(x_i)}{\partial x_j} = \text{softmax}(x_i)(\delta_{ij} - \text{softmax}(x_j))$ |
6. Applications in Machine Learning
6.1 Gradient Descent
The gradient descent algorithm uses the gradient to update parameters:
$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta)$$
Where:
- $\alpha$ is the learning rate
- $L(\theta)$ is the loss function
- $\nabla_\theta L(\theta)$ is the gradient of the loss with respect to parameters
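A minimal gradient descent sketch on the quadratic from Section 3.1 (the learning rate and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def grad_L(theta):
    """Gradient of L(theta) = theta1^2 + 2*theta2^2."""
    return np.array([2.0 * theta[0], 4.0 * theta[1]])

theta = np.array([5.0, -3.0])
alpha = 0.1                                # learning rate
for _ in range(100):
    theta = theta - alpha * grad_L(theta)  # theta <- theta - alpha * grad

print(theta)  # approaches the minimizer [0, 0]
```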
6.2 Backpropagation
Backpropagation applies the chain rule to compute gradients in neural networks:
For a network with layers $f_1, f_2, \ldots, f_L$:
$$\frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial \mathbf{z}^{(l)}} \frac{\partial \mathbf{z}^{(l)}}{\partial W^{(l)}}$$
Where $\mathbf{z}^{(l)} = W^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$
6.3 Logistic Regression
For logistic regression with loss function:
$$L(\theta) = \sum_{i=1}^n \log(1 + e^{-y_i \theta^T \mathbf{x}_i})$$
The gradient is:
$$\nabla_\theta L = -\sum_{i=1}^n y_i \mathbf{x}_i \sigma(-y_i \theta^T \mathbf{x}_i)$$
Where $\sigma(z) = \frac{1}{1+e^{-z}}$ is the sigmoid function.
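A minimal sketch of this gradient with labels $y_i \in \{-1, +1\}$; the synthetic data and function names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad(theta, X, y):
    """Gradient of sum_i log(1 + exp(-y_i * theta^T x_i)), y in {-1, +1}."""
    margins = y * (X @ theta)        # y_i * theta^T x_i
    weights = sigmoid(-margins)      # sigma(-y_i * theta^T x_i)
    return -X.T @ (weights * y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = np.where(rng.standard_normal(50) > 0, 1.0, -1.0)
print(logistic_grad(np.zeros(3), X, y))
```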
6.4 Regularization
Ridge Regression (L2 Regularization):
$$L(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|^2$$
Gradient:
$$\nabla_\mathbf{w} L = -2X^T(\mathbf{y} - X\mathbf{w}) + 2\lambda\mathbf{w}$$
Lasso Regression (L1 Regularization):
$$L(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|_1$$
Subgradient:
$$\partial_\mathbf{w} L = -2X^T(\mathbf{y} - X\mathbf{w}) + \lambda \text{sign}(\mathbf{w})$$
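Both update directions are one line each in code; a minimal sketch matching the formulas above (data and $\lambda$ are illustrative):

```python
import numpy as np

def ridge_grad(w, X, y, lam):
    """Gradient of ||y - Xw||^2 + lam * ||w||^2."""
    return -2.0 * X.T @ (y - X @ w) + 2.0 * lam * w

def lasso_subgrad(w, X, y, lam):
    """One subgradient of ||y - Xw||^2 + lam * ||w||_1 (sign(0) taken as 0)."""
    return -2.0 * X.T @ (y - X @ w) + lam * np.sign(w)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))
y = rng.standard_normal(20)
w = rng.standard_normal(4)
print(ridge_grad(w, X, y, lam=0.1))
print(lasso_subgrad(w, X, y, lam=0.1))
```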
6.5 Support Vector Machines
For SVM with hinge loss:
$$L(\mathbf{w}) = \sum_{i=1}^n \max(0, 1 - y_i \mathbf{w}^T \mathbf{x}_i) + \frac{\lambda}{2}\|\mathbf{w}\|^2$$
The subgradient is:
$$\partial_\mathbf{w} L = \lambda\mathbf{w} - \sum_{i: y_i \mathbf{w}^T \mathbf{x}_i < 1} y_i \mathbf{x}_i$$
7. Advanced Topics
7.1 Hessian Matrix
The Hessian matrix contains second-order derivatives:
$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$
For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$:
$$H = \nabla^2 f = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}$$
7.2 Newton's Method
Newton's method uses both the gradient and the Hessian to take curvature-aware steps:
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \left[\nabla^2 f(\mathbf{x}_t)\right]^{-1} \nabla f(\mathbf{x}_t)$$
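A minimal sketch on the quadratic $f(x_1, x_2) = x_1^2 + 2x_2^2$, which a Newton step minimizes exactly in one iteration (names illustrative):

```python
import numpy as np

def grad_f(x):
    return np.array([2.0 * x[0], 4.0 * x[1]])

def hessian_f(x):
    return np.array([[2.0, 0.0],
                     [0.0, 4.0]])

x = np.array([5.0, -3.0])
# Newton step: x <- x - H^{-1} grad f(x); solve the linear system instead of inverting
x = x - np.linalg.solve(hessian_f(x), grad_f(x))
print(x)  # [0., 0.] for this quadratic
```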
7.3 Automatic Differentiation
Modern deep learning frameworks use automatic differentiation:
- Forward Mode: Computes derivatives alongside function values (see the dual-number sketch after this list)
- Reverse Mode: Backpropagates gradients (used in backpropagation)
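To make forward mode concrete, here is a minimal dual-number sketch; the `Dual` class is illustrative, not a real library API:

```python
import math

class Dual:
    """Minimal dual number: carries a value and its derivative together."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def sin(d):
    """sin with its derivative propagated through the dual part."""
    return Dual(math.sin(d.value), math.cos(d.value) * d.deriv)

# Differentiate f(x) = x * sin(x) at x = 2 by seeding the derivative with 1
x = Dual(2.0, 1.0)
y = x * sin(x)
print(y.value, y.deriv)   # f(2) and f'(2) = sin(2) + 2*cos(2)
```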
8. Practical Implementation Tips
8.1 Numerical Stability
- Use the log-sum-exp trick for softmax and its derivatives (see the sketch after this list)
- Implement gradient clipping to prevent exploding gradients
- Use appropriate initialization to avoid vanishing gradients
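A minimal sketch of the log-sum-exp trick: subtracting the maximum before exponentiating leaves softmax unchanged but prevents overflow (the function name is illustrative):

```python
import numpy as np

def stable_softmax(x):
    """Softmax computed with the max-subtraction (log-sum-exp) trick."""
    shifted = x - np.max(x)   # shifting does not change the result, avoids overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps)

print(stable_softmax(np.array([1000.0, 1001.0, 1002.0])))  # no overflow warnings
```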
8.2 Vectorization
Always vectorize operations when possible:
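For example, a dot product written as a Python loop versus its vectorized NumPy form; a minimal illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(100_000)
b = rng.standard_normal(100_000)

# Loop version: element by element in Python, slow
total = 0.0
for i in range(len(a)):
    total += a[i] * b[i]

# Vectorized version: a single call into optimized native code
vectorized = a @ b
print(np.isclose(total, vectorized))  # True
```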
8.3 Gradient Checking
Verify analytical gradients with numerical gradients:
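A minimal gradient-checking sketch using central differences; the quadratic test function is illustrative:

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar function f."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

f = lambda x: x[0]**2 + 2 * x[1]**2
analytic_grad = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])

x = np.array([1.5, -0.5])
num, ana = numerical_gradient(f, x), analytic_grad(x)
# Relative error should be tiny (e.g. < 1e-6) if the analytical gradient is correct
print(np.linalg.norm(num - ana) / (np.linalg.norm(num) + np.linalg.norm(ana)))
```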
9. Summary
Matrix calculus provides the mathematical foundation for understanding and implementing optimization algorithms in machine learning. Key takeaways:
- Scalar derivatives are the foundation for all matrix calculus
- Subgradients extend derivatives to non-differentiable functions
- Gradients point in the direction of steepest increase
- Jacobian matrices describe local linear approximations
- Chain rule enables backpropagation in neural networks
- Matrix derivatives are essential for advanced ML techniques
Mastery of these concepts enables you to:
- Design and analyze optimization algorithms
- Implement neural networks from scratch
- Understand the mathematical foundations of ML libraries
- Debug gradient computations effectively
10. Further Reading
Books
- "The Matrix Cookbook" by Petersen & Pedersen - Comprehensive reference for matrix derivatives
- "Deep Learning" by Goodfellow, Bengio, and Courville - Applications in deep learning
- "Pattern Recognition and Machine Learning" by Bishop - Statistical perspective
Online Resources
- CS231n: Convolutional Neural Networks (Stanford) - Practical applications
- CS229: Machine Learning (Stanford) - Mathematical foundations
- Matrix Calculus for Deep Learning (explained.ai) - Interactive explanations
This comprehensive guide covers the essential matrix calculus concepts needed for machine learning. Practice with concrete examples and implement the derivatives yourself to deepen your understanding.
- Author:Entropyobserver
- URL:https://tangly1024.com/article/232d698f-3512-802d-9cd6-d5dcc37d8e91
- Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!