The Transformer was initially introduced for machine translation, a task that involves processing two sequences (both the input and the output are sequences). Thus, the Transformer model had two parts: an encoder for processing the input and a decoder for generating the output.

Encoder
The encoder handles the input side of this task. Its main job is to "read" and "understand" the entire input sequence and compress this understanding into a set of contextualized vectors.
- Encoding
Imagine the Input Sequence is the English sentence: “The cat sat.”
Word embeddings:
V_the = [1, 0, 1, 0]
V_cat = [0, 1, 0, 1]
V_sat = [1, 1, 0, 0]
Positional encodings (toy example):
P_0 = [0, 0, 0, 0]
P_1 = [0, 0, 1, 0]
P_2 = [0, 1, 0, 0]
Final inputs = embedding + position:
X_the = [1, 0, 1, 0] + [0, 0, 0, 0] = [1, 0, 1, 0]
X_cat = [0, 1, 0, 1] + [0, 0, 1, 0] = [0, 1, 1, 1]
X_sat = [1, 1, 0, 0] + [0, 1, 0, 0] = [1, 2, 0, 0]
Input matrix (3 words × 4 dims): the three X vectors stacked as rows.
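To make this concrete, here is the same addition in a few lines of numpy (just a sketch of the toy numbers above; real models use learned embeddings and sinusoidal or learned positional encodings):

```python
import numpy as np

# Toy word embeddings (3 words × 4 dims), copied from the example above
embeddings = np.array([
    [1, 0, 1, 0],  # "The"
    [0, 1, 0, 1],  # "cat"
    [1, 1, 0, 0],  # "sat"
], dtype=float)

# Toy positional encodings for positions 0, 1, 2
positions = np.array([
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
], dtype=float)

# Final encoder input: one row per word, embedding + position
X = embeddings + positions
print(X)
# [[1. 0. 1. 0.]   "The"
#  [0. 1. 1. 1.]   "cat"
#  [1. 2. 0. 0.]]  "sat"
```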
- Multi-Head Attention: This is the core of the Encoder. The attention mechanism allows each word to "look at" all other words in the input sequence to better understand its own meaning in this specific context.
- For example, when processing the word "sat," the attention mechanism might pay more attention to "cat" (the one doing the sitting) than to "The."
- This process refines the vector for each word. The output is a new set of vectors where each vector is now enriched with contextual information from the entire sentence.
Weight matrices for Q, K, V
All of W_Q, W_K, and W_V are 4×4. For simplicity, they are chosen so that the query, key, and value vectors come out identical for each word.
Compute each component by multiplying the input matrix with the weight matrices.
Q vectors:
Q_the = [2, 0, 2, 0], Q_cat = [1, 2, 1, 2], Q_sat = [1, 2, 1, 2]
K vectors (same as Q):
K_the = [2, 0, 2, 0], K_cat = [1, 2, 1, 2], K_sat = [1, 2, 1, 2]
V vectors (same as Q):
V_the = [2, 0, 2, 0], V_cat = [1, 2, 1, 2], V_sat = [1, 2, 1, 2]
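The exact weight matrices are not spelled out here, so the 4×4 matrix below is just one possible choice that reproduces the Q = K = V vectors above; in a real model these matrices are learned:

```python
import numpy as np

X = np.array([[1, 0, 1, 0],    # "The"
              [0, 1, 1, 1],    # "cat"
              [1, 2, 0, 0]],   # "sat"
             dtype=float)

# A hypothetical 4×4 projection consistent with the toy numbers
# (here W_Q = W_K = W_V, so Q, K, and V come out identical)
W = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1]], dtype=float)

Q = X @ W   # queries
K = X @ W   # keys
V = X @ W   # values
print(Q)
# [[2. 0. 2. 0.]   "The"
#  [1. 2. 1. 2.]   "cat"
#  [1. 2. 1. 2.]]  "sat"
```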
3. Attention Scores
The attention score (the dot product Q_i · K_j) measures how much the model thinks word j is relevant to word i in the current context.
- High score → word j is important for understanding word i → more “attention” will be paid to it.
- Low score → word j is less relevant → less influence.
Think of it like this:
Word i (Query) asks: “Which words in this sentence should I pay attention to?” Each word j (Key) answers: “Here is my content.” The dot product answers: “How well does my content align with what you need?”
After applying softmax, the scores become weights (percentages) for a weighted sum over the value vectors:
- So the attention score is like a raw measure of relevance before normalization.
- Softmax turns it into a probability distribution → tells how much each word contributes to the new representation of word i.
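For reference, this is the standard scaled dot-product attention formula; the toy walkthrough below applies softmax to the raw dot products and skips the 1/√d_k scaling for simplicity:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```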
For “sat” as the query (Q_sat = [1, 2, 1, 2]):
- Q_sat · K_the = [1, 2, 1, 2] · [2, 0, 2, 0] = 4
- Q_sat · K_cat = [1, 2, 1, 2] · [1, 2, 1, 2] = 10
- Q_sat · K_sat = [1, 2, 1, 2] · [1, 2, 1, 2] = 10
Score vector for “sat”: [4, 10, 10]
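The same scores, checked with numpy (using the Q = K vectors from above):

```python
import numpy as np

Q = np.array([[2, 0, 2, 0],    # "The"
              [1, 2, 1, 2],    # "cat"
              [1, 2, 1, 2]],   # "sat"
             dtype=float)
K = Q.copy()   # K equals Q in this toy example

scores = Q @ K.T    # scores[i, j] = Q_i · K_j
print(scores[2])    # scores for "sat" as the query
# [ 4. 10. 10.]
```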
- In a standard Transformer, attention scores are computed separately in each layer.
- Encoder layers: 6 layers × 1 self-attention per layer → 6 attention computations.
- Decoder layers: 6 layers × 2 attentions per layer (masked self-attention + encoder-decoder attention) → 12 attention computations.
- Each layer takes the output of the previous layer as input, so attention scores are recalculated at every layer, progressively refining the contextual representation.
- Within each layer, multi-head attention computes multiple sets of scores in parallel, which are then combined into a single vector per word.
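As a sketch of this stacking (using PyTorch's built-in modules with standard sizes, not the toy 4-dimensional example), a 6-layer encoder simply applies the same attention + feed-forward block six times, each time on the previous layer's output:

```python
import torch
import torch.nn as nn

# One encoder layer = multi-head self-attention + feed-forward + Add & Norm
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)

# Stack 6 of them; attention scores are recomputed inside every layer
encoder = nn.TransformerEncoder(layer, num_layers=6)

x = torch.randn(3, 1, 512)   # (sequence length, batch size, d_model)
out = encoder(x)             # contextualized vectors, same shape as the input
print(out.shape)             # torch.Size([3, 1, 512])
```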
4. Softmax (Attention Weights)
Softmax: weight_j = e^(score_j) / Σ_k e^(score_k)
Approximate the exponentials:
- e^4 ≈ 54.598
- e^10 ≈ 22026.466
Sum ≈ 54.598 + 22026.466 + 22026.466 ≈ 44107.53
Weights:
- The → 54.598 / 44107.53 ≈ 0.00124
- Cat → 22026.466 / 44107.53 ≈ 0.4994
- Sat → 22026.466 / 44107.53 ≈ 0.4994
Observation: “sat” attends to “cat” and to itself almost equally, and very little to “the.”
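The same softmax step in numpy (subtracting the maximum score first is the usual trick for numerical stability and does not change the result):

```python
import numpy as np

scores = np.array([4.0, 10.0, 10.0])   # "sat" attending to "The", "cat", "sat"

# Softmax: exponentiate, then normalize so the weights sum to 1
exp_scores = np.exp(scores - scores.max())
weights = exp_scores / exp_scores.sum()
print(weights.round(4))
# [0.0012 0.4994 0.4994]
```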
5. Weighted Sum (Output Vector Z_sat)
V_the = [2,0,2,0]
V_cat = [1,2,1,2]
V_sat = [1,2,1,2]
Compute each dimension of Z_sat = 0.00124 × V_the + 0.4994 × V_cat + 0.4994 × V_sat:
- First: 0.00124 × 2 + 0.4994 × 1 + 0.4994 × 1 ≈ 0.00248 + 0.4994 + 0.4994 ≈ 1.001
- Second: 0.00124 × 0 + 0.4994 × 2 + 0.4994 × 2 ≈ 0 + 0.9988 + 0.9988 ≈ 1.998
- Third: 0.00124 × 2 + 0.4994 × 1 + 0.4994 × 1 ≈ 1.001
- Fourth: 0.00124 × 0 + 0.4994 × 2 + 0.4994 × 2 ≈ 1.998
- Z_sat ≈ [1.001, 1.998, 1.001, 1.998]. This is the contextualized vector for “sat”.
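The weighted sum in numpy, using the attention weights and value vectors from above:

```python
import numpy as np

weights = np.array([0.00124, 0.4994, 0.4994])   # attention weights for "sat"
V = np.array([[2, 0, 2, 0],                     # V_the
              [1, 2, 1, 2],                     # V_cat
              [1, 2, 1, 2]],                    # V_sat
             dtype=float)

Z_sat = weights @ V    # weighted sum of the value vectors
print(Z_sat.round(3))
# [1.001 1.998 1.001 1.998]
```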
6. Add & Norm (Residual)
This block stands for "Add & Layer Normalization." The "Add" part refers to adding the original input of the attention layer to its output (this is called a residual connection, which helps with training). "Norm" (Normalization) is a technique to stabilize the network. This is essentially a housekeeping step to ensure the model trains smoothly.
Residual Connection (Add)
Formula: output = X_sat + Z_sat (original input plus attention output)
- In our example: X_sat = [1, 2, 0, 0] and Z_sat ≈ [1.001, 1.998, 1.001, 1.998]
- Add them: X_sat + Z_sat ≈ [2.001, 3.998, 1.001, 1.998]
Why do we add the original input?
- If we only used Z_sat from attention, the model might "forget" the original information, especially during early training.
- Adding X_sat acts like a shortcut: “Here’s your original vector, I just computed an adjustment.”
- This is called a residual connection.
Layer Normalization (Norm)
Simplified formula: LayerNorm(x) = γ · (x − μ) / √(σ² + ε) + β, where μ and σ² are the mean and variance over the dimensions of x.
- Normalize the vector so that, across its dimensions, the mean is 0 and the variance is 1.
- Then optionally scale (γ) and shift (β) with learnable parameters.
- Purpose: stabilize training and prevent some dimensions from becoming too large or too small, which can cause exploding or vanishing gradients.
Analogy: Like seasoning a dish—make sure no flavor is too strong, everything is balanced.
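A minimal sketch of the whole Add & Norm step on the toy vectors, with γ = 1 and β = 0 (real models learn these two parameters):

```python
import numpy as np

X_sat = np.array([1.0, 2.0, 0.0, 0.0])            # original input vector for "sat"
Z_sat = np.array([1.001, 1.998, 1.001, 1.998])    # attention output for "sat"

# Add: residual connection
residual = X_sat + Z_sat
print(residual.round(3))    # [2.001 3.998 1.001 1.998]

# Norm: normalize across the feature dimensions (gamma = 1, beta = 0 here)
mean = residual.mean()
var = residual.var()
normed = (residual - mean) / np.sqrt(var + 1e-5)
print(normed.round(2))      # approximately [-0.23  1.61 -1.15 -0.23]
```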
7. MLPs (Feed-Forward Network)
- Each word’s contextualized vector from the attention + Add & Norm step (e.g., X_sat + Z_sat after LayerNorm) is sent independently through the same small MLP.
- Typically, the MLP has two linear layers with a non-linear activation in between (e.g., ReLU or GELU).
- Purpose: refine each word vector further, combine features, and make it more expressive.
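A sketch of such a feed-forward block in numpy (the weights here are random placeholders; in a real model they are learned, and the hidden layer is typically about 4× wider than the model dimension):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between, applied to one word vector."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # linear, then ReLU
    return hidden @ W2 + b2                 # linear back to the model dimension

d_model, d_ff = 4, 8                        # toy sizes matching the 4-dim example
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = np.array([-0.23, 1.61, -1.15, -0.23])   # normalized "sat" vector from Add & Norm
print(feed_forward(x, W1, b1, W2, b2))      # a refined 4-dim vector for "sat"
```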
Encoder Output
- After the MLP (and its own Add & Norm), you get the final encoder output vector for that word.
- This vector now encodes:
- The original meaning of “sat”
- Context from the whole sentence (“cat” is the subject)
- Non-linear transformations from the MLP
In short: this is the “contextual embedding” of the word coming out of the encoder.
Backpropagation
- During training, all parameters are updated via backpropagation:
- Attention projection weights (W_Q, W_K, W_V)
- MLP weights
- LayerNorm scale and shift parameters (γ and β)
- The gradients flow through the attention, residual connections, and MLP, updating all weights to minimize the loss.
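A hedged sketch of one training step in PyTorch (the task head and loss here are illustrative, not part of the original setup): a single backward pass produces gradients for all of the parameters listed above, and the optimizer updates them together.

```python
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=6)
head = nn.Linear(512, 10)   # illustrative classification head, not from the original model
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(head.parameters()), lr=1e-4)

x = torch.randn(3, 1, 512)             # dummy input sequence
target = torch.randint(0, 10, (3, 1))  # dummy labels

logits = head(encoder(x))              # forward pass through attention, Add & Norm, MLPs
loss = nn.functional.cross_entropy(logits.view(-1, 10), target.view(-1))
loss.backward()      # gradients flow through attention, residuals, MLPs, LayerNorm
optimizer.step()     # updates W_Q, W_K, W_V, MLP weights, and the LayerNorm γ, β
optimizer.zero_grad()
```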
- Author: Entropyobserver
- URL: https://tangly1024.com/article/272d698f-3512-80db-8fe4-e57ba9ba2762
- Copyright: All articles in this blog, unless otherwise stated, are licensed under the BY-NC-SA agreement. Please credit the source when sharing!