1. FNN (Feedforward Neural Network)
Neural networks are computational models inspired by the human brain, designed to recognize patterns and relationships in data. They consist of multiple neurons (nodes) connected in layers, which process input data and generate output. Different types of neural networks are suited to different tasks.
Traditional Neural Networks:
- Structure: Input → Output
- Characteristics: Traditional neural networks process input data and generate output but lack the ability to remember previous inputs. This means they don't consider the context of earlier inputs when making predictions.


1.1 Forward Pass
1. Inputs:
The initial data fed into the network. Depending on the task, this can be text, speech, or images. Each input feature corresponds to one neuron in the input layer.
2. Weights:
Numeric parameters that determine the strength of the connection between neurons. Each input is multiplied by its corresponding weight. Learning the optimal weights is how the network “learns.”
3. Sum Function (Weighted Sum):
The operation that combines inputs and weights:
$$z = \sum_{i} w_i x_i + b$$
where $x_i$ = input, $w_i$ = weight, $b$ = bias. This produces the total input to the neuron.
4. Bias:
A trainable parameter added to the weighted sum. It shifts the activation function so that the neuron can be activated even when all inputs are zero.
5. Activation Function:
Applies a non-linear transformation to the weighted sum + bias. This determines the output of the neuron and allows the network to model complex, non-linear relationships. Examples: Sigmoid, ReLU, Tanh.
6. Output:
The final value produced by the neuron (or output layer) after activation. This is the network’s prediction or response for the given input.
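The six parts above fit in a few lines of code. This is a minimal sketch of a single neuron; the input values, weights, and the choice of Sigmoid as the activation are assumptions picked for illustration, not values from this article.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus the bias, followed by an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Illustrative values (assumed): two input features, two weights, one bias.
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.0

# z = 1.0*0.5 + 2.0*(-0.25) + 0.0 = 0.0, and sigmoid(0.0) = 0.5
output = neuron_forward(x, w, b)
print(output)
```

Swapping `activation` for another function (ReLU, Tanh) changes only the last step; the weighted sum stays the same.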
1.2 Loss Computation
A loss function (also called a cost function) is a mathematical way to measure how “wrong” a neural network’s predictions are compared to the true labels.
- Input: predicted outputs and true targets
- Output: a single scalar value representing the error
- Goal: minimize this value during training
Common Loss Functions
a) Mean Squared Error (MSE) – for regression
Measures the squared difference between predicted and true values.
Interpretation: Penalizes larger errors more than smaller ones.
Example: true values = [3, 5, 2], predicted = [2.5, 5.5, 1.5].
Errors = (0.5², 0.5², 0.5²) = (0.25, 0.25, 0.25).
MSE = (0.25 + 0.25 + 0.25)/3 =0.25.
Interpretation: the smaller the MSE, the closer the predictions are to the targets.
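The MSE example can be checked with a short function. This is a plain-Python sketch; the function name `mse` is just a label chosen here.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

loss = mse([3, 5, 2], [2.5, 5.5, 1.5])
print(loss)  # 0.25
```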
b) Cross-Entropy Loss – for classification
Measures how well the predicted probability distribution matches the true class.
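For binary classification:
$$L = -\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right]$$
where $y$ is the true label (0 or 1), $\hat{y}$ is the predicted probability of the positive class, and $\log$ is the natural logarithm.
Example: sentiment analysis (positive = 1, negative = 0)
True label: Positive → $y = 1$, so the loss reduces to $-\log \hat{y}$
If the predicted probability $\hat{y}$ is close to 1 → Loss is close to 0 → very low, model did well
If predicted probability was 0.1 → Loss = $-\log(0.1)$ ≈ 2.30 → very high, model did poorly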
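For multi-class classification:
$$L = -\sum_{c} y_c \log \hat{y}_c$$
where $y_c$ is the one-hot true label and $\hat{y}_c$ is the predicted probability for class $c$. Only the true class contributes to the sum.
Example: 3-class image classification (Cat, Dog, Rabbit)
True class = Dog → one-hot label [0, 1, 0]
Predicted = [0.7, 0.2, 0.1]
Loss = $-\log(0.2)$ ≈ 1.61 → not good
Predicted = [0.05, 0.9, 0.05]
Loss = $-\log(0.9)$ ≈ 0.11 → much better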
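The cross-entropy examples above can be reproduced numerically. The sketch below uses the natural logarithm; the small `eps` clamp is an assumption added here to avoid `log(0)` and is not part of the definition.

```python
import math

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Loss for one binary example; p is the predicted probability of class 1."""
    p = min(max(p, eps), 1.0 - eps)  # assumed safeguard against log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Loss for one multi-class example with a one-hot target."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))

# Binary case: true label positive (1), model says 0.1
bce_bad = binary_cross_entropy(1, 0.1)                              # about 2.30

# Multi-class case: true class is Dog, i.e. [0, 1, 0]
cce_poor = categorical_cross_entropy([0, 1, 0], [0.7, 0.2, 0.1])    # about 1.61
cce_good = categorical_cross_entropy([0, 1, 0], [0.05, 0.9, 0.05])  # about 0.11

print(round(bce_bad, 2), round(cce_poor, 2), round(cce_good, 2))
```

In both cases the loss depends only on the probability assigned to the correct class, which is why a confident wrong answer is punished so heavily.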
c) Other Losses
Hinge Loss → used in SVM-like models
KL Divergence → measures difference between two probability distributions
Why Loss Computation is Crucial
- It tells the network “how wrong it is.” Without it, the network has no guidance for learning.
- It drives backpropagation. Gradients of the loss w.r.t. the weights show how to adjust each weight to reduce error.
- It allows comparison across models. On the same data and with the same loss function, lower loss = better predictions.
1.3 Backward Pass (Backpropagation)

After forward propagation and loss computation, the network needs to learn by adjusting its weights and biases. This is done through backpropagation.
- The error signal (how wrong the prediction was) is sent backward through the network.
- Using the chain rule, each weight and bias gets a gradient: a measure of how changing it would affect the loss.
- Weight Update (Gradient Descent):
After computing gradients during backpropagation, we update each weight and bias to reduce the loss. This is the heart of learning in neural networks.
Formulas
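For a weight $w$ and bias $b$:
$$w_{\text{new}} = w - \eta \frac{\partial L}{\partial w}, \qquad b_{\text{new}} = b - \eta \frac{\partial L}{\partial b}$$
Where:
$\eta$ = learning rate (controls step size)
$\frac{\partial L}{\partial w}$ = gradient of the loss w.r.t. weight
$\frac{\partial L}{\partial b}$ = gradient of the loss w.r.t. bias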
Interpretation:
- Subtracting the gradient moves the weight in the direction that decreases the loss.
- Larger learning rates = bigger steps (can overshoot), smaller = slower learning.
Step-by-Step Example
Suppose after backpropagation:
- Weight gradient: $\frac{\partial L}{\partial w} = -0.2$
- Bias gradient: $\frac{\partial L}{\partial b} = 0.1$
- Learning rate: $\eta = 0.1$
- Current weight: $w = 0.5$
- Current bias: $b = 0.2$
Update:
$$w_{\text{new}} = 0.5 - 0.1 \times (-0.2) = 0.5 + 0.02 = 0.52$$
$$b_{\text{new}} = 0.2 - 0.1 \times 0.1 = 0.2 - 0.01 = 0.19$$
✅ The weight increased slightly because the gradient was negative → moving in the direction to reduce loss.
✅ The bias decreased slightly because the gradient was positive → also reducing loss.
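The same update can be written as a one-line function. This sketch reuses the numbers from the step-by-step example; the helper name `sgd_step` is hypothetical.

```python
def sgd_step(param, grad, lr):
    """One gradient-descent update: move against the gradient, scaled by lr."""
    return param - lr * grad

w_new = sgd_step(param=0.5, grad=-0.2, lr=0.1)  # negative gradient -> weight goes up
b_new = sgd_step(param=0.2, grad=0.1, lr=0.1)   # positive gradient -> bias goes down

print(round(w_new, 4), round(b_new, 4))  # 0.52 0.19
```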
Iterative Process
- Forward pass → compute output
- Compute loss
- Backward pass → compute gradients
- Update weights and biases
- Repeat for each batch/epoch until the network converges.
- Each repetition is called an iteration (per batch).
- Going through the entire dataset once is an epoch.
- Multiple epochs → network gradually learns to predict accurately.
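To see the whole loop in motion, here is a small sketch that trains a single linear neuron on toy data. The data (points on the line y = 2x + 1), the learning rate, and the number of epochs are all assumptions chosen for illustration. The gradients are written out by hand for the MSE loss, and the full dataset is used as one batch, so each epoch here is exactly one iteration.

```python
# Toy data generated from y = 2x + 1 (assumed for illustration)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0   # initial parameters
lr = 0.05         # learning rate (assumed)
losses = []

for epoch in range(2000):
    # 1. Forward pass: compute predictions
    preds = [w * x + b for x in xs]

    # 2. Compute loss (mean squared error)
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    losses.append(loss)

    # 3. Backward pass: gradients of the MSE w.r.t. w and b
    grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / len(xs)

    # 4. Update weights and biases
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 3), round(b, 3))   # approaches 2.0 and 1.0
print(losses[0], losses[-1])      # loss shrinks over the epochs
```

With mini-batches, steps 1 to 4 would run once per batch, and one pass over all batches would make up an epoch.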
Analogy
Think of climbing down a hill in the fog:
- Loss = height of the hill at your position.
- Gradient = slope of the hill.
- Weight update = take a step downhill following the slope.
- Learning rate = step size.
After many steps (iterations), you reach the bottom (minimal loss).
2. Activation Function

An activation function is a mathematical function applied to a neuron's output after computing its weighted sum of inputs and adding a bias. It determines the neuron’s final output that is passed to the next layer.
Purpose of Activation Functions
- Introduce Non-linearity
- Why: Without activation, multiple layers just perform linear operations: $W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$, which is again a single linear layer. The network could only learn linear relationships.
- Example: Suppose we want a network to learn XOR logic:
| x1 | x2 | XOR(x1, x2) |
|----|----|-------------|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
- XOR is non-linear. Without an activation function (e.g., ReLU or Sigmoid) in the hidden layer, the network cannot model XOR, no matter how many layers. With ReLU, it can learn the correct output.
- Control Output Range
- Why: Some tasks require outputs in a specific range, e.g., probability between 0 and 1.
- Example: Binary classification using Sigmoid, $\sigma(z) = \frac{1}{1 + e^{-z}}$:
- If z = 2, output = 0.88 → interpreted as 88% probability.
- If z = -1, output = 0.27 → interpreted as 27% probability.
- Another example: Tanh maps values to [-1,1], useful when inputs should be centered around 0.
- Enhance Feature Representation
- Why: Different activations emphasize or suppress certain patterns, giving the network richer representations.
- Example:
- ReLU: Passes only positive signals → highlights “strong” features while zeroing out negative ones.
- Input [0.2, -0.5, 1.0] → ReLU → [0.2, 0, 1.0]
- Tanh: Preserves negative information and scales it → useful when both positive and negative signals carry meaning.
- Input [0.2, -0.5, 1.0] → Tanh → [0.197, -0.462, 0.762]
- Effect: By applying non-linear transformations at each layer, the network can detect complex patterns that are combinations of raw inputs.
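The three purposes above can be checked with a few lines of code. In this sketch the XOR network uses hand-picked weights (an assumption for illustration; a trained network would find its own), with two ReLU hidden neurons and a linear output. The Sigmoid and Tanh lines reproduce the numbers from the examples.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_net(x1, x2):
    """Tiny 2-2-1 network with hand-picked weights (assumed, not learned)."""
    h1 = relu(x1 + x2)         # fires when at least one input is on
    h2 = relu(x1 + x2 - 1.0)   # fires only when both inputs are on
    return h1 - 2.0 * h2       # linear output layer

# Non-linearity: ReLU in the hidden layer makes XOR representable
xor_outputs = [xor_net(a, b) for a, b in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]]
print(xor_outputs)  # [0.0, 1.0, 1.0, 0.0]

# Output range: Sigmoid maps any score into (0, 1)
print(round(sigmoid(2.0), 2), round(sigmoid(-1.0), 2))  # 0.88 0.27

# Feature representation: ReLU drops negatives, Tanh keeps and rescales them
signal = [0.2, -0.5, 1.0]
relu_out = [relu(v) for v in signal]
tanh_out = [round(math.tanh(v), 3) for v in signal]
print(relu_out)   # [0.2, 0.0, 1.0]
print(tanh_out)   # [0.197, -0.462, 0.762]
```

Removing the two `relu` calls from `xor_net` turns it into a purely linear function of x1 and x2, which cannot produce the XOR column.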

These are the functions you encounter daily in state-of-the-art NLP models like Transformers, BERT, and GPT.
- Rectifier (ReLU)
- ReLU (Rectified Linear Unit) is a common default activation function for hidden layers in modern neural networks, and the original Transformer uses it in its feed-forward layers (BERT and GPT use GELU, a smooth variant). It introduces non-linearity so the network can learn complex relationships between words and tokens. Its power comes from a simple "shut-off" mechanism that lets the network learn complex features efficiently.
- Key Mechanism: Sparsity. Its function is f(x) = max(0, x): positive values pass through unchanged and all negative values are forced to zero. This creates sparsity (sparse activation), where only a fraction of neurons are active for any given input. In a Transformer block, after self-attention, each token’s vector goes through a feed-forward network with an activation of this kind (ReLU in the original Transformer). A vector value of 0.8 stays 0.8; a negative value, e.g., -0.2, becomes 0. This sparsity helps the network learn rich, high-dimensional features efficiently.
- Primary Consequence: Neuron Specialization. Sparsity encourages individual neurons to become specialized. Instead of all neurons reacting to every input, each neuron learns to become sensitive to specific features (like a "verb detector" or an "animal detector"). This helps disentangle complex information into simpler, manageable parts.
- Key Benefits:
- Learning Complex Features: By combining the signals from these specialized neurons, deeper layers of the network can learn more abstract and complex concepts (e.g., combining "animal" and "action" features to understand "an animal is doing something").
- Computational Efficiency: ReLU is a single comparison, far cheaper than an exponential, and the many zeros it outputs can reduce computation further when the implementation exploits sparsity.
- Better Generalization: Sparsity can help reduce overfitting because the model learns more robust, general features instead of simply memorizing the training data.
In essence, ReLU allows a network to build a modular and hierarchical understanding of data by teaching its neurons to become focused specialists.
- Logistic (Sigmoid)
- Role in NLP: Rarely used in hidden layers now, but critical in:
- Gating mechanisms in LSTMs/GRUs: Sigmoid outputs values between 0 and 1, controlling how much information passes through gates.
- Binary classification outputs: Produces probabilities for “yes/no” tasks.
Sigmoid in Output Layers (Binary Decisions)
- Role: Acts as the final decision maker for binary classification tasks, converting raw scores (logits) into probabilities.
- Interpretation: Output represents the model’s confidence for a “yes/no” or “positive/negative” prediction.
Example: Sentiment Analysis
Sentence: “This movie is amazing.”
Model raw score (logit): +3.5
Sigmoid output: $\sigma(3.5) = \frac{1}{1 + e^{-3.5}} \approx 0.97$
Conclusion: The model predicts a 97% probability of positive sentiment.
Metaphor: The output is the final verdict of the model, like a judge issuing a decision.
Sigmoid in LSTM/GRU Gates (Dynamic Memory Control)
- Role: Functions as a controller for information flow inside the network rather than a final answer.
- Interpretation: Output is a gate value between 0 and 1, which multiplies another signal to decide how much information is allowed to pass.
Analogy: A water faucet valve:
- 1 → valve fully open, 100% of information passes
- 0 → valve fully closed, no information passes
- 0.6 → valve partially open, 60% passes
Example: Forget Gate in LSTM
- Sentence: “The visuals were stunning, but the plot was terrible.”
- Step 1: Reading “…visuals were stunning”
- Forget gate Sigmoid outputs ~0.95
- Old memory (positive sentiment) is mostly retained: old memory × 0.95
- Step 2: Reading “but” (contrast signal)
- Forget gate Sigmoid outputs ~0.1
- Old positive memory is mostly forgotten: old memory × 0.1
- Step 3: Reading “…plot was terrible”
- New negative sentiment memory is written, unhindered by old positive memory
Metaphor: The Sigmoid acts as a dynamic information manager, deciding selectively what to remember and what to discard, which is crucial for understanding long sequences and complex context.
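Both roles of Sigmoid come down to the same function used in two places. The sketch below is a simplification: the memory is a single number rather than a vector, and the gate logits (3.0 and -2.2) and the memory value 0.8 are hypothetical, chosen so the gates land near the ~0.95 and ~0.1 of the forget-gate example. A real LSTM computes its gate logits from the current input and the previous hidden state.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Role 1: output layer. A logit of +3.5 becomes a probability.
prob_positive = sigmoid(3.5)
print(round(prob_positive, 2))  # 0.97

# Role 2: gate. The Sigmoid output scales how much old memory survives.
def apply_forget_gate(gate_logit, old_memory):
    gate = sigmoid(gate_logit)       # value between 0 and 1
    return gate, gate * old_memory   # gate multiplies the memory

old_memory = 0.8  # hypothetical "positive sentiment" memory value

gate_keep, kept = apply_forget_gate(3.0, old_memory)     # gate about 0.95
gate_drop, dropped = apply_forget_gate(-2.2, old_memory)  # gate about 0.10

print(round(gate_keep, 2), round(kept, 2))
print(round(gate_drop, 2), round(dropped, 2))
```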
- Hyperbolic Tangent (Tanh)
- Role in NLP: Once standard in RNN/LSTM hidden layers. Outputs in [-1,1], centered around 0, which sometimes helps convergence.
- Example: In LSTM candidate memory updates, tanh decides whether to enhance (positive) or suppress (negative) memory.
- Softmax (not in the diagram but essential)
- Role in NLP: Converts logits into a probability distribution for multi-class classification (e.g., word prediction).
- Example: GPT predicting the next word in “The cat sat on the ___” assigns probabilities: mat: 0.95, chair: 0.04, sky: 0.001. Softmax ensures all probabilities sum to 1.
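Softmax itself is short to write down. In this sketch the logits for the three candidate words are hypothetical values, picked only so that the resulting probabilities are in the same ballpark as the example; a real model would produce one logit per vocabulary entry. Subtracting the maximum logit before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(logits)                           # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next words
words = ["mat", "chair", "sky"]
logits = [5.0, 1.8, -1.9]

probs = softmax(logits)
for word, p in zip(words, probs):
    print(word, round(p, 3))

best_word = words[probs.index(max(probs))]
print(best_word)  # mat
```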

3. Example Setup

We will assign specific initial values to each part of the network shown in the diagram. In a real-world scenario, the weights (W) and biases (b) are initialized randomly before training begins.
- Input Layer (Layer 0): 3 neurons
- Input Vector X = [x₁, x₂, x₃] = [0.5, 0.2, 0.9]
- Hidden Layer 1 (Layer 1): 2 neurons (s₁, s₂)
- Weight Matrix W₀→₁ (a 3x2 matrix): [[0.2, 0.3], [0.4, 0.1], [0.5, 0.6]]
- Bias Vector b₁ = [0.1, 0.2]
- Activation Function: ReLU (max(0, x))
- Hidden Layer 2 (Layer 2): 3 neurons (s₃, s₄, s₅)
- Weight Matrix W₁→₂ (a 2x3 matrix): [[0.1, 0.4, 0.6], [0.5, 0.2, 0.3]]
- Bias Vector b₂ = [0.05, 0.15, 0.25]
- Activation Function: ReLU (max(0, x))
- Output Layer (Layer 3): 2 neurons (y₁, y₂)
- Weight Matrix W₂→₃ (a 3x2 matrix): [[0.6, 0.2], [0.1, 0.4], [0.5, 0.3]]
- Bias Vector b₃ = [0.1, 0.1]
- Activation Function: Softmax (used to convert outputs into probabilities)
Forward Propagation Calculation
Now, let's follow the data as it travels through the network.
Step 1: From Input Layer to Hidden Layer 1 (Calculating s₁ & s₂)
Goal: Calculate the output value for each neuron in the first hidden layer.
- Calculate the Weighted Sum for s₁:
z_s₁ = (x₁ * w_x₁s₁) + (x₂ * w_x₂s₁) + (x₃ * w_x₃s₁) + b₁_₁
z_s₁ = (0.5 * 0.2) + (0.2 * 0.4) + (0.9 * 0.5) + 0.1
z_s₁ = 0.1 + 0.08 + 0.45 + 0.1 = 0.73
Apply ReLU Activation:
s₁ = max(0, 0.73) = 0.73
- Calculate the Weighted Sum for s₂:
z_s₂ = (x₁ * w_x₁s₂) + (x₂ * w_x₂s₂) + (x₃ * w_x₃s₂) + b₁_₂
z_s₂ = (0.5 * 0.3) + (0.2 * 0.1) + (0.9 * 0.6) + 0.2
z_s₂ = 0.15 + 0.02 + 0.54 + 0.2 = 0.91
Apply ReLU Activation:
s₂ = max(0, 0.91) = 0.91
Result: The output of Hidden Layer 1 is [s₁, s₂] = [0.73, 0.91].
Step 2: From Hidden Layer 1 to Hidden Layer 2 (Calculating s₃, s₄, & s₅)
Goal: Use the output from the previous layer, [0.73, 0.91], as the input to calculate the output values for the second hidden layer.
- Calculate s₃:
z_s₃ = (s₁ * w_s₁s₃) + (s₂ * w_s₂s₃) + b₂_₁
z_s₃ = (0.73 * 0.1) + (0.91 * 0.5) + 0.05
z_s₃ = 0.073 + 0.455 + 0.05 = 0.578
s₃ = max(0, 0.578) = 0.578
- Calculate s₄:
z_s₄ = (s₁ * w_s₁s₄) + (s₂ * w_s₂s₄) + b₂_₂
z_s₄ = (0.73 * 0.4) + (0.91 * 0.2) + 0.15
z_s₄ = 0.292 + 0.182 + 0.15 = 0.624
s₄ = max(0, 0.624) = 0.624
- Calculate s₅:
z_s₅ = (s₁ * w_s₁s₅) + (s₂ * w_s₂s₅) + b₂_₃
z_s₅ = (0.73 * 0.6) + (0.91 * 0.3) + 0.25
z_s₅ = 0.438 + 0.273 + 0.25 = 0.961
s₅ = max(0, 0.961) = 0.961
Result: The output of Hidden Layer 2 is [s₃, s₄, s₅] = [0.578, 0.624, 0.961].
Step 3: From Hidden Layer 2 to Output Layer (Calculating y₁ & y₂)
Goal: Calculate the final "raw scores" (also called logits) for the output neurons before the final activation.
- Calculate the Weighted Sum for y₁:
z_y₁ = (s₃ * w_s₃y₁) + (s₄ * w_s₄y₁) + (s₅ * w_s₅y₁) + b₃_₁
z_y₁ = (0.578 * 0.6) + (0.624 * 0.1) + (0.961 * 0.5) + 0.1
z_y₁ = 0.3468 + 0.0624 + 0.4805 + 0.1 = 0.9897
- Calculate the Weighted Sum for y₂:
z_y₂ = (s₃ * w_s₃y₂) + (s₄ * w_s₄y₂) + (s₅ * w_s₅y₂) + b₃_₂
z_y₂ = (0.578 * 0.2) + (0.624 * 0.4) + (0.961 * 0.3) + 0.1
z_y₂ = 0.1156 + 0.2496 + 0.2883 + 0.1 = 0.7535
Result: The raw scores (logits) for the output layer are [z_y₁, z_y₂] = [0.9897, 0.7535].
Step 4: Final Activation at Output Layer (Softmax)
Goal: Convert the raw scores into a probability distribution where the values sum to 1.
- Calculate the exponent of each score:
exp(z_y₁) = e^0.9897 ≈ 2.690
exp(z_y₂) = e^0.7535 ≈ 2.124
- Sum the exponents:
Sum = 2.690 + 2.124 = 4.814
- Calculate the final probability for each class:
y₁ (Prob for class 1) = 2.690 / 4.814 ≈ 0.5588 (55.88%)
y₂ (Prob for class 2) = 2.124 / 4.814 ≈ 0.4412 (44.12%)
Final Result & Explanation
After the complete forward propagation for the input [0.5, 0.2, 0.9], the network's final output is:
[y₁, y₂] ≈ [0.559, 0.441]
What does this mean?
The neural network predicts that the given input belongs to Class 1 with a probability of 55.9% and to Class 2 with a probability of 44.1%.
This detailed calculation illustrates how data is transformed layer by layer (multiplied by weights, shifted by biases, and passed through activation functions) to produce a final prediction. This is the essence of forward propagation.
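The whole calculation can be replayed in code as a sanity check. This is a plain-Python sketch using the inputs, weights, and biases from the example setup; the helper names (`dense`, `relu`, `softmax`) are labels chosen here, and each weight matrix is stored so that `W[i][j]` connects input `i` to output neuron `j`.

```python
import math

def dense(x, W, b):
    """Weighted sums for one layer: z_j = sum_i x_i * W[i][j] + b_j."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def relu(v):
    return [max(0.0, z) for z in v]

def softmax(v):
    m = max(v)
    exps = [math.exp(z - m) for z in v]
    total = sum(exps)
    return [e / total for e in exps]

X = [0.5, 0.2, 0.9]

W1 = [[0.2, 0.3], [0.4, 0.1], [0.5, 0.6]]   # 3x2, input -> hidden layer 1
b1 = [0.1, 0.2]
W2 = [[0.1, 0.4, 0.6], [0.5, 0.2, 0.3]]     # 2x3, hidden 1 -> hidden 2
b2 = [0.05, 0.15, 0.25]
W3 = [[0.6, 0.2], [0.1, 0.4], [0.5, 0.3]]   # 3x2, hidden 2 -> output
b3 = [0.1, 0.1]

h1 = relu(dense(X, W1, b1))      # about [0.73, 0.91]
h2 = relu(dense(h1, W2, b2))     # about [0.578, 0.624, 0.961]
logits = dense(h2, W3, b3)       # about [0.9897, 0.7535]
probs = softmax(logits)

print([round(p, 3) for p in probs])  # [0.559, 0.441]
```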
- Author:Entropyobserver
- URL:https://tangly1024.com/article/272d698f-3512-803e-9593-e8243cdfd1cf
- Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!