Machine translation - FNN | EntropyObserver

1. FNN

Neural networks are computational models inspired by the human brain, designed to recognize patterns and relationships in data. They consist of multiple neurons (nodes) connected in layers, which process input data and generate output. Different types of neural networks are suited to different tasks.

Traditional Neural Networks:

Structure: Input → Output

Characteristics: Traditional neural networks process input data and generate output but lack the ability to remember previous inputs. This means they don't consider the context of earlier inputs when making predictions.

1.1 Forward Pass

1. Inputs:

The initial data fed into the network. Depending on the task, this can be text, speech, or images. Each input feature corresponds to one neuron in the input layer.

2. Weights:

Numeric parameters that determine the strength of the connection between neurons. Each input is multiplied by its corresponding weight. Finding good values for these weights is how the network “learns.”

3. Sum Function (Weighted Sum):

The operation that combines inputs and weights:

$$z = \sum_{i} w_i x_i + b$$

where $x_i$ = input, $w_i$ = weight, b = bias. This produces the total input to the neuron.

4. Bias:

A trainable parameter added to the weighted sum. It shifts the activation function so that the neuron can be activated even when all inputs are zero.

5. Activation Function:

Applies a non-linear transformation to the weighted sum + bias. This determines the output of the neuron and allows the network to model complex, non-linear relationships. Examples: Sigmoid, ReLU, Tanh.

6. Output:

The final value produced by the neuron (or output layer) after activation. This is the network’s prediction or response for the given input.
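The six pieces above fit into a few lines of code. The following is a minimal sketch of a single neuron in plain Python; the input, weight, and bias values are hypothetical, and Sigmoid is chosen as the activation only for illustration.

```python
import math

def neuron_forward(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # sum function + bias
    return 1.0 / (1.0 + math.exp(-z))                       # activation

# Hypothetical values, chosen only for illustration.
inputs = [0.5, 0.2, 0.9]
weights = [0.2, 0.4, 0.5]
bias = 0.1

output = neuron_forward(inputs, weights, bias)
print(round(output, 3))  # about 0.675 (the weighted sum z is 0.73)
```

Swapping the last line of `neuron_forward` for `max(0.0, z)` would turn the same neuron into a ReLU unit.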

1.2 Loss Computation

A loss function (also called a cost function) is a mathematical way to measure how “wrong” a neural network’s predictions are compared to the true labels.

Input: predicted outputs and true targets

Output: a single scalar value representing the error

Goal: minimize this value during training

Common Loss Functions

a) Mean Squared Error (MSE) – for regression

Measures the average squared difference between predicted and true values:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Interpretation: Penalizes larger errors more than smaller ones.

Example: true values = [3, 5, 2], predicted = [2.5, 5.5, 1.5].

Squared errors = (0.5², 0.5², 0.5²) = (0.25, 0.25, 0.25).

MSE = (0.25 + 0.25 + 0.25)/3 = 0.25.

Interpretation: the smaller the MSE, the closer the predictions are to the targets.
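The worked example can be checked with a short function. This is a sketch in plain Python; the name `mean_squared_error` is just a label chosen here, not a reference to any particular library.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

mse = mean_squared_error([3, 5, 2], [2.5, 5.5, 1.5])
print(mse)  # 0.25
```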

b) Cross-Entropy Loss – for classification

Measures how well the predicted probability distribution matches the true class.

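For binary classification:

$$L = -\left[\, y \ln \hat{y} + (1 - y) \ln(1 - \hat{y}) \,\right]$$

Example: sentiment analysis (positive = 1, negative = 0)

True label: Positive → y=1

Predicted probability: $\hat{y}$ close to 1 (for instance, $\hat{y} = 0.9$)

Loss = $-\ln \hat{y}$ (about 0.105 for $\hat{y} = 0.9$) → very low, model did well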

If predicted probability was 0.1 → Loss ≈ 2.30 → very high, model did poorly
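A small sketch of binary cross-entropy in plain Python makes the asymmetry concrete. The `eps` clipping is an addition assumed here to avoid `log(0)`; it is not part of the definition.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Loss for one example; y_true is 0 or 1, p_pred is the predicted P(y=1)."""
    p = min(max(p_pred, eps), 1.0 - eps)  # clip so log() never sees 0 (assumption)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

loss_confident_wrong = binary_cross_entropy(1, 0.1)
print(round(loss_confident_wrong, 2))  # 2.3
```

Calling it with a probability near 1 for a positive label returns a value near 0, which is the "model did well" case.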

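For multi-class classification:

$$L = -\sum_{c} y_c \ln \hat{y}_c$$

where $y_c$ is the one-hot label and $\hat{y}_c$ the predicted probability for class c. Only the true class contributes to the sum.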

Example: 3-class image classification (Cat, Dog, Rabbit)

True class = Dog → one-hot label [0, 1, 0]

Predicted = [0.7, 0.2, 0.1]

Loss = −ln(0.2) ≈ 1.61 → not good

Predicted = [0.05, 0.9, 0.05]

Loss = −ln(0.9) ≈ 0.11 → much better
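The Cat/Dog/Rabbit numbers can be reproduced with a few lines of plain Python. This is a sketch; the `eps` floor is an assumption added to guard against `log(0)`.

```python
import math

def cross_entropy(one_hot, probs, eps=1e-12):
    """Multi-class cross-entropy for a single example with a one-hot label."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot, probs))

dog = [0, 1, 0]  # classes ordered as (Cat, Dog, Rabbit)
loss_bad = cross_entropy(dog, [0.7, 0.2, 0.1])
loss_good = cross_entropy(dog, [0.05, 0.9, 0.05])
print(round(loss_bad, 2), round(loss_good, 2))  # 1.61 0.11
```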

c) Other Losses

Hinge Loss → used in SVM-like models

KL Divergence → measures difference between two probability distributions

Why Loss Computation is Crucial

It tells the network “how wrong it is”

Without it, the network has no guidance for learning.

It drives backpropagation

Gradients of the loss w.r.t weights show how to adjust each weight to reduce error.

It allows comparison across models

Lower loss = better predictions.

1.3 Backward Pass (Backpropagation)

After forward propagation and loss computation, the network needs to learn by adjusting its weights and biases. This is done through backpropagation.

The error signal (how wrong the prediction was) is sent backward through the network.

Using the chain rule, each weight and bias gets a gradient, a measure of how changing it would affect the loss.
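To see the chain rule at work on the smallest possible case, here is a sketch for one linear neuron with a squared-error loss, L = (w·x + b − y)². The setup and all values are hypothetical; the finite-difference check is only there to confirm the analytic gradients.

```python
def loss(w, b, x, y):
    y_hat = w * x + b
    return (y_hat - y) ** 2

def gradients(w, b, x, y):
    """Chain rule: dL/dw = dL/dy_hat * dy_hat/dw, and likewise for b."""
    y_hat = w * x + b
    dL_dyhat = 2.0 * (y_hat - y)  # derivative of the squared error
    dyhat_dw = x                  # y_hat = w*x + b
    dyhat_db = 1.0
    return dL_dyhat * dyhat_dw, dL_dyhat * dyhat_db

# Hypothetical values for illustration.
w, b, x, y = 0.5, 0.2, 2.0, 1.0
grad_w, grad_b = gradients(w, b, x, y)

# Numerical check with central differences.
h = 1e-6
num_w = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)
num_b = (loss(w, b + h, x, y) - loss(w, b - h, x, y)) / (2 * h)
print(grad_w, num_w)
print(grad_b, num_b)
```

In a multi-layer network the same idea is applied layer by layer, reusing the gradient arriving from the layer above.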

Weight Update (Gradient Descent):

After computing gradients during backpropagation, we update each weight and bias to reduce the loss. This is the heart of learning in neural networks.

Formulas

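For a weight w and bias b:

$$w_{\text{new}} = w - \eta \frac{\partial L}{\partial w}$$

$$b_{\text{new}} = b - \eta \frac{\partial L}{\partial b}$$

Where:

$\eta$ = learning rate (controls step size)

$\frac{\partial L}{\partial w}$ = gradient of the loss w.r.t weight

$\frac{\partial L}{\partial b}$ = gradient of the loss w.r.t bias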

Interpretation:

Subtracting the gradient moves the weight in the direction that decreases the loss.

Larger learning rates = bigger steps (can overshoot), smaller = slower learning.

Step-by-Step Example

Suppose after backpropagation:

Weight gradient: $\frac{\partial L}{\partial w} = -0.2$

Bias gradient: $\frac{\partial L}{\partial b} = 0.1$

Learning rate: $\eta = 0.1$

Current weight: $w = 0.5$

Current bias: $b = 0.2$

Update:

$$w_{\text{new}} = 0.5 - 0.1 \times (-0.2) = 0.5 + 0.02 = 0.52$$

$$b_{\text{new}} = 0.2 - 0.1 \times 0.1 = 0.2 - 0.01 = 0.19$$

✅ The weight increased slightly because the gradient was negative → moving in the direction to reduce loss.

✅ The bias decreased slightly because the gradient was positive → also reducing loss.
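The same update, written as code. This is a sketch in plain Python; `gradient_step` is a hypothetical helper name.

```python
def gradient_step(param, grad, learning_rate):
    """One gradient-descent update: move against the gradient."""
    return param - learning_rate * grad

w_new = gradient_step(0.5, -0.2, 0.1)  # negative gradient, so w increases
b_new = gradient_step(0.2, 0.1, 0.1)   # positive gradient, so b decreases
print(round(w_new, 2), round(b_new, 2))  # 0.52 0.19
```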

Iterative Process

Forward pass → compute output

Compute loss

Backward pass → compute gradients

Update weights and biases

Repeat for each batch/epoch until the network converges.

Each repetition is called an iteration (per batch).

Going through the entire dataset once is an epoch.

Multiple epochs → network gradually learns to predict accurately.
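The loop above can be sketched end to end on a toy problem. The example below assumes a single linear neuron, an MSE loss, and four data points generated from y = 2x + 1; all of these are illustrative choices, not part of the text above. It uses mini-batches of two examples so that the difference between an iteration and an epoch is visible.

```python
# Toy data from y = 2x + 1 (an assumption made for this demo).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
batch_size = 2
learning_rate = 0.05

w, b = 0.0, 0.0
iterations = 0

for epoch in range(2000):                        # one epoch = one pass over all data
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad_w = grad_b = 0.0
        for x, y in batch:
            y_hat = w * x + b                    # forward pass
            error = y_hat - y                    # loss per example is error ** 2
            grad_w += 2 * error * x / len(batch) # backward pass (MSE gradient)
            grad_b += 2 * error / len(batch)
        w -= learning_rate * grad_w              # update weight
        b -= learning_rate * grad_b              # update bias
        iterations += 1                          # one iteration = one batch

print(round(w, 4), round(b, 4), iterations)
```

With these settings the parameters settle near w = 2 and b = 1 after 4000 iterations (2 per epoch).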

Analogy

Think of climbing down a hill in the fog:

Loss = height of the hill at your position.

Gradient = slope of the hill.

Weight update = take a step downhill following the slope.

Learning rate = step size.

After many steps (iterations), you reach the bottom (minimal loss).

2. Activation Function

An activation function is a mathematical function applied to a neuron's output after computing its weighted sum of inputs and adding a bias. It determines the neuron’s final output that is passed to the next layer.

Purpose of Activation Functions

Introduce Non-linearity

Why: Without activation, stacked layers collapse into a single linear map: $W_2(W_1 x + b_1) + b_2 = W' x + b'$, with $W' = W_2 W_1$ and $b' = W_2 b_1 + b_2$. The network could only learn linear relationships.

Example: Suppose we want a network to learn XOR logic:

x1	x2	XOR(x1,x2)
0	0	0
0	1	1
1	0	1
1	1	0

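XOR is not linearly separable: no single straight line splits the inputs that map to 1 from those that map to 0. Without a non-linear activation function (e.g., ReLU or Sigmoid) in the hidden layer, the network cannot model XOR, no matter how many layers it has. With ReLU, it can learn the correct output.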
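To make this concrete, here is a sketch of a tiny two-neuron ReLU network that reproduces the XOR table. The weights are picked by hand for illustration, not learned, and are only one of many possible solutions.

```python
def relu(z):
    return max(0.0, z)

def xor_net(x1, x2):
    """Two hidden ReLU neurons and a linear output, with hand-picked weights."""
    h1 = relu(x1 + x2)        # grows with the number of active inputs
    h2 = relu(x1 + x2 - 1.0)  # fires only when both inputs are 1
    return h1 - 2.0 * h2      # subtracting 2*h2 cancels the (1, 1) case

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))
```

Removing the two `relu` calls leaves `(x1 + x2) - 2 * (x1 + x2 - 1)`, which is a linear function of the inputs and no longer matches the table.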

Control Output Range

Why: Some tasks require outputs in a specific range, e.g., probability between 0 and 1.

Example: Binary classification using Sigmoid, $\sigma(z) = \frac{1}{1 + e^{-z}}$:

If z = 2, output = 0.88 → interpreted as 88% probability.

If z = -1, output = 0.27 → interpreted as 27% probability.

Another example: Tanh maps values to [-1,1], useful when inputs should be centered around 0.
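A quick check of the quoted values, as a sketch in plain Python using only the standard `math` module:

```python
import math

def sigmoid(z):
    """Squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(round(sigmoid(2), 2))   # 0.88
print(round(sigmoid(-1), 2))  # 0.27

# tanh squashes the same inputs into (-1, 1), centered on 0.
print(round(math.tanh(2), 2), round(math.tanh(-1), 2))  # 0.96 -0.76
```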

Enhance Feature Representation

Why: Different activations emphasize or suppress certain patterns, giving the network richer representations.

Example:

ReLU: Activates only positive signals → highlights “strong” features while ignoring weak/noisy ones.

Input [0.2, -0.5, 1.0] → ReLU → [0.2, 0, 1.0]

Tanh: Preserves negative information and scales it → useful when both positive and negative signals carry meaning.

Input [0.2, -0.5, 1.0] → Tanh → [0.197, -0.462, 0.762]

Effect: By applying non-linear transformations at each layer, the network can detect complex patterns that are combinations of raw inputs.
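The two transformations can be compared side by side on the same input vector. This is a small sketch in plain Python:

```python
import math

values = [0.2, -0.5, 1.0]

relu_out = [max(0.0, v) for v in values]             # negatives are zeroed
tanh_out = [round(math.tanh(v), 3) for v in values]  # negatives are kept, scaled

print(relu_out)  # [0.2, 0.0, 1.0]
print(tanh_out)  # [0.197, -0.462, 0.762]
```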

These are the functions you encounter daily in state-of-the-art NLP models like Transformers, BERT, and GPT.

Rectifier (ReLU)

ReLU (Rectified Linear Unit) is a common default activation function for hidden layers in modern neural networks, including the feed-forward blocks of the original Transformer. It introduces non-linearity so the network can learn complex relationships between words and tokens. Its power comes from a simple "shut-off" mechanism that lets the network learn complex features efficiently.

Key Mechanism: Sparsity. Its function is f(x) = max(0, x): positive values pass through unchanged, and all negative values are forced to zero. This creates sparsity (sparse activation), where only a fraction of neurons are active for any given input. In the original Transformer, after self-attention, each token’s vector goes through a feed-forward network with ReLU (BERT and GPT use GELU, a smooth variant of the same idea). A vector value of 0.8 stays 0.8; a negative value, e.g., -0.2, becomes 0. This sparsity helps the network learn rich, high-dimensional features efficiently.

Primary Consequence: Neuron Specialization. Sparsity encourages individual neurons to specialize. Instead of all neurons reacting to every input, each neuron learns to become sensitive to specific features (like a "verb detector" or an "animal detector"). This helps disentangle complex information into simpler, manageable parts.

Key Benefits:

Learning Complex Features: By combining the signals from these specialized neurons, deeper layers of the network can learn more abstract and complex concepts (e.g., combining "animal" and "action" features to understand "an animal is doing something").

Computational Efficiency: ReLU is a single comparison, and the many zeros it produces can reduce downstream computation, making models faster to train and run.

Better Generalization: Sparsity can help reduce overfitting, because the model learns more robust, general features instead of simply memorizing the training data.

In essence, ReLU allows a network to build a modular and hierarchical understanding of data by teaching its neurons to become focused specialists.

Logistic (Sigmoid)

Role in NLP: Rarely used in hidden layers now, but critical in:

Gating mechanisms in LSTMs/GRUs: Sigmoid outputs values in [0,1], controlling how much information passes through gates.
Binary classification outputs: Produces probabilities for “yes/no” tasks.

Sigmoid in Output Layers (Binary Decisions): Acts as the final decision maker for binary classification tasks, converting raw scores (logits) into probabilities. Output represents the model’s confidence for a “yes/no” or “positive/negative” prediction.

Example: Sentiment Analysis

Sentence: “This movie is amazing.”

Model raw score (logit): +3.5

Sigmoid output: $\sigma(3.5) = \frac{1}{1 + e^{-3.5}} \approx 0.97$

Conclusion: The model predicts a 97% probability of positive sentiment.

Metaphor: The output is the final verdict of the model—a judge issuing a decision.

Sigmoid in LSTM/GRU Gates (Dynamic Memory Control)

Role: Functions as a controller for information flow inside the network rather than a final answer.

Interpretation: Output is a gate value between 0 and 1, which multiplies another signal to decide how much information is allowed to pass.

Analogy: A water faucet valve:

1 → valve fully open, 100% of information passes

0 → valve fully closed, no information passes

0.6 → valve partially open, 60% passes

Example: Forget Gate in LSTM

Sentence: “The visuals were stunning, but the plot was terrible.”

Step 1: Reading “…visuals were stunning”

Forget gate Sigmoid outputs ~0.95
Old memory (positive sentiment) is mostly retained: kept memory ≈ 0.95 × old memory

Step 2: Reading “but” (contrast signal)

Forget gate Sigmoid outputs ~0.1
Old positive memory is mostly forgotten: kept memory ≈ 0.1 × old memory

Step 3: Reading “…plot was terrible”

New negative sentiment memory is written, unhindered by old positive memory

Metaphor: The Sigmoid acts as a dynamic information manager, deciding selectively what to remember and what to discard—crucial for understanding long sequences and complex context.
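The gating idea reduces to "sigmoid, then multiply". The sketch below is a deliberately simplified illustration: memory is a single number rather than a vector, and the gate logits are hypothetical values picked so that the gates land near 0.95 and 0.1. A real LSTM computes those logits from the current input and the previous hidden state.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def apply_forget_gate(gate_logit, old_memory):
    """Scale the old memory by a gate value between 0 and 1."""
    gate = sigmoid(gate_logit)
    return gate, gate * old_memory

old_memory = 1.0  # hypothetical scalar standing in for "positive sentiment"

gate_keep, kept = apply_forget_gate(2.944, old_memory)      # after "...stunning"
gate_drop, dropped = apply_forget_gate(-2.197, old_memory)  # after "but"

print(round(gate_keep, 2), round(kept, 2))     # 0.95 0.95
print(round(gate_drop, 2), round(dropped, 2))  # 0.1 0.1
```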

Hyperbolic Tangent (Tanh)

Role in NLP: Once standard in RNN/LSTM hidden layers. Outputs in [-1,1], centered around 0, which sometimes helps convergence.

Example: In LSTM candidate memory updates, tanh decides whether to enhance (positive) or suppress (negative) memory.

Softmax (not in the diagram but essential)

Role in NLP: Converts logits into a probability distribution for multi-class classification (e.g., word prediction).

Example: GPT predicting the next word in “The cat sat on the ___” assigns probabilities such as mat: 0.95, chair: 0.04, sky: 0.001, with the remainder spread over the rest of the vocabulary. Softmax ensures that the probabilities over all candidate words sum to 1.
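A minimal softmax sketch in plain Python follows. The three logits are hypothetical stand-ins for candidate words; a real language model scores its entire vocabulary. Subtracting the maximum logit before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(logits):
    """Turn arbitrary real scores into probabilities that sum to 1."""
    shift = max(logits)                        # stability: avoid huge exponents
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate words.
probs = softmax([4.0, 1.0, -2.0])
print([round(p, 3) for p in probs])
print(sum(probs))
```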

3. Example Setup

We will assign specific initial values to each part of the network shown in the diagram. In a real-world scenario, the weights (W) and biases (b) are initialized randomly before training begins.

Input Layer (Layer 0): 3 neurons

Input Vector X = [x₁, x₂, x₃] = [0.5, 0.2, 0.9]

Hidden Layer 1 (Layer 1): 2 neurons (s₁, s₂)

Weight Matrix W₀→₁ (a 3x2 matrix): [[0.2, 0.3], [0.4, 0.1], [0.5, 0.6]]
Bias Vector b₁ = [0.1, 0.2]
Activation Function: ReLU (max(0, x))

Hidden Layer 2 (Layer 2): 3 neurons (s₃, s₄, s₅)

Weight Matrix W₁→₂ (a 2x3 matrix): [[0.1, 0.4, 0.6], [0.5, 0.2, 0.3]]
Bias Vector b₂ = [0.05, 0.15, 0.25]
Activation Function: ReLU (max(0, x))

Output Layer (Layer 3): 2 neurons (y₁, y₂)

Weight Matrix W₂→₃ (a 3x2 matrix): [[0.6, 0.2], [0.1, 0.4], [0.5, 0.3]]
Bias Vector b₃ = [0.1, 0.1]
Activation Function: Softmax (used to convert outputs into probabilities)

Forward Propagation Calculation

Now, let's follow the data as it travels through the network.

Step 1: From Input Layer to Hidden Layer 1 (Calculating s₁ & s₂)

Goal: Calculate the output value for each neuron in the first hidden layer.

Calculate the Weighted Sum for s₁:

z_s₁ = (x₁ * w_x₁s₁) + (x₂ * w_x₂s₁) + (x₃ * w_x₃s₁) + b₁_₁

z_s₁ = (0.5 * 0.2) + (0.2 * 0.4) + (0.9 * 0.5) + 0.1

z_s₁ = 0.1 + 0.08 + 0.45 + 0.1 = 0.73

Apply ReLU Activation:

s₁ = max(0, 0.73) = 0.73

Calculate the Weighted Sum for s₂:

z_s₂ = (x₁ * w_x₁s₂) + (x₂ * w_x₂s₂) + (x₃ * w_x₃s₂) + b₁_₂

z_s₂ = (0.5 * 0.3) + (0.2 * 0.1) + (0.9 * 0.6) + 0.2

z_s₂ = 0.15 + 0.02 + 0.54 + 0.2 = 0.91

Apply ReLU Activation:

s₂ = max(0, 0.91) = 0.91

Result: The output of Hidden Layer 1 is [s₁, s₂] = [0.73, 0.91].

Step 2: From Hidden Layer 1 to Hidden Layer 2 (Calculating s₃, s₄, & s₅)

Goal: Use the output from the previous layer, [0.73, 0.91], as the input to calculate the output values for the second hidden layer.

Calculate s₃:

z_s₃ = (s₁ * w_s₁s₃) + (s₂ * w_s₂s₃) + b₂_₁

z_s₃ = (0.73 * 0.1) + (0.91 * 0.5) + 0.05

z_s₃ = 0.073 + 0.455 + 0.05 = 0.578

s₃ = max(0, 0.578) = 0.578

Calculate s₄:

z_s₄ = (s₁ * w_s₁s₄) + (s₂ * w_s₂s₄) + b₂_₂

z_s₄ = (0.73 * 0.4) + (0.91 * 0.2) + 0.15

z_s₄ = 0.292 + 0.182 + 0.15 = 0.624

s₄ = max(0, 0.624) = 0.624

Calculate s₅:

z_s₅ = (s₁ * w_s₁s₅) + (s₂ * w_s₂s₅) + b₂_₃

z_s₅ = (0.73 * 0.6) + (0.91 * 0.3) + 0.25

z_s₅ = 0.438 + 0.273 + 0.25 = 0.961

s₅ = max(0, 0.961) = 0.961

Result: The output of Hidden Layer 2 is [s₃, s₄, s₅] = [0.578, 0.624, 0.961].

Step 3: From Hidden Layer 2 to Output Layer (Calculating y₁ & y₂)

Goal: Calculate the final "raw scores" (also called logits) for the output neurons before the final activation.

Calculate the Weighted Sum for y₁:

z_y₁ = (s₃ * w_s₃y₁) + (s₄ * w_s₄y₁) + (s₅ * w_s₅y₁) + b₃_₁

z_y₁ = (0.578 * 0.6) + (0.624 * 0.1) + (0.961 * 0.5) + 0.1

z_y₁ = 0.3468 + 0.0624 + 0.4805 + 0.1 = 0.9897

Calculate the Weighted Sum for y₂:

z_y₂ = (s₃ * w_s₃y₂) + (s₄ * w_s₄y₂) + (s₅ * w_s₅y₂) + b₃_₂

z_y₂ = (0.578 * 0.2) + (0.624 * 0.4) + (0.961 * 0.3) + 0.1

z_y₂ = 0.1156 + 0.2496 + 0.2883 + 0.1 = 0.7535

Result: The raw scores (logits) for the output layer are [z_y₁, z_y₂] = [0.9897, 0.7535].

Step 4: Final Activation at Output Layer (Softmax)

Goal: Convert the raw scores into a probability distribution where the values sum to 1.

Calculate the exponent of each score:

exp(z_y₁) = e^0.9897 ≈ 2.690

exp(z_y₂) = e^0.7535 ≈ 2.124

Sum the exponents:

Sum = 2.690 + 2.124 = 4.814

Calculate the final probability for each class:

y₁ (Prob for class 1) = 2.690 / 4.814 ≈ 0.5588 (55.88%)

y₂ (Prob for class 2) = 2.124 / 4.814 ≈ 0.4412 (44.12%)

Final Result & Explanation

After the complete forward propagation for the input [0.5, 0.2, 0.9], the network's final output is:

[y₁, y₂] ≈ [0.559, 0.441]

What does this mean?

The neural network predicts that the given input belongs to Class 1 with a probability of 55.9% and to Class 2 with a probability of 44.1%.

This calculation shows how data is transformed layer by layer (multiplied by weights, shifted by biases, and passed through activation functions) to produce a final prediction. This is the essence of forward propagation.
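The whole walkthrough can be reproduced in a few lines. The sketch below uses plain Python lists and the weights, biases, and input from the example setup; the helper names `dense`, `relu`, and `softmax` are chosen here for illustration.

```python
import math

def dense(inputs, weights, biases):
    """weights[i][j] connects input i to output neuron j."""
    return [
        sum(inputs[i] * weights[i][j] for i in range(len(inputs))) + biases[j]
        for j in range(len(biases))
    ]

def relu(vector):
    return [max(0.0, z) for z in vector]

def softmax(logits):
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

x = [0.5, 0.2, 0.9]
W1, b1 = [[0.2, 0.3], [0.4, 0.1], [0.5, 0.6]], [0.1, 0.2]
W2, b2 = [[0.1, 0.4, 0.6], [0.5, 0.2, 0.3]], [0.05, 0.15, 0.25]
W3, b3 = [[0.6, 0.2], [0.1, 0.4], [0.5, 0.3]], [0.1, 0.1]

h1 = relu(dense(x, W1, b1))    # Step 1: [s1, s2]
h2 = relu(dense(h1, W2, b2))   # Step 2: [s3, s4, s5]
logits = dense(h2, W3, b3)     # Step 3: raw scores
probs = softmax(logits)        # Step 4: probabilities

print([round(v, 3) for v in h1])      # [0.73, 0.91]
print([round(v, 3) for v in h2])      # [0.578, 0.624, 0.961]
print([round(v, 4) for v in logits])  # [0.9897, 0.7535]
print([round(p, 3) for p in probs])   # [0.559, 0.441]
```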