The Transformer was initially introduced for machine translation, a task that involves processing two sequences (both the input and the output are sequences). Thus, the Transformer model had two parts: an encoder for processing the input and a decoder for generating the output.

Encoder
The encoder handles the input side of this task. Its main job is to "read" and "understand" the entire input sequence and compress this understanding into a set of contextualized vectors.
- Encoding
Imagine the Input Sequence is the English sentence: “The cat sat.”
Word embeddings:
V_the = [1, 0, 1, 0]
V_cat = [0, 1, 0, 1]
V_sat = [1, 1, 0, 0]
Positional encodings (toy example):
P_0 = [0, 0, 0, 0]
P_1 = [0, 0, 1, 0]
P_2 = [0, 1, 0, 0]
Final inputs = embedding + position:
X_the = [1, 0, 1, 0] + [0, 0, 0, 0] = [1, 0, 1, 0]
X_cat = [0, 1, 0, 1] + [0, 0, 1, 0] = [0, 1, 1, 1]
X_sat = [1, 1, 0, 0] + [0, 1, 0, 0] = [1, 2, 0, 0]
Input matrix (3 words × 4 dims): the three X vectors stacked as rows.
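To make this concrete, here is the same addition in a few lines of numpy (just a sketch of the toy numbers above; real models use learned embeddings and sinusoidal or learned positional encodings):

```python
import numpy as np

# Toy word embeddings (3 words × 4 dims), copied from the example above
embeddings = np.array([
    [1, 0, 1, 0],  # "The"
    [0, 1, 0, 1],  # "cat"
    [1, 1, 0, 0],  # "sat"
], dtype=float)

# Toy positional encodings for positions 0, 1, 2
positions = np.array([
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
], dtype=float)

# Final encoder input: one row per word, embedding + position
X = embeddings + positions
print(X)
# [[1. 0. 1. 0.]   "The"
#  [0. 1. 1. 1.]   "cat"
#  [1. 2. 0. 0.]]  "sat"
```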
- Multi-Head Attention: This is the core of the Encoder. The attention mechanism allows each word to "look at" all other words in the input sequence to better understand its own meaning in this specific context.
- For example, when processing the word "sat," the attention mechanism might pay more attention to "cat" (the one doing the sitting) than to "The."
- This process refines the vector for each word. The output is a new set of vectors where each vector is now enriched with contextual information from the entire sentence.
Weight matrices for Q, K, V
All of W_Q, W_K, and W_V are 4×4. For simplicity, they are chosen so that the query, key, and value vectors come out identical for each word.
Compute each component by multiplying the input matrix with the weight matrices.
Q vectors:
Q_the = [2, 0, 2, 0], Q_cat = [1, 2, 1, 2], Q_sat = [1, 2, 1, 2]
K vectors (same as Q):
K_the = [2, 0, 2, 0], K_cat = [1, 2, 1, 2], K_sat = [1, 2, 1, 2]
V vectors (same as Q):
V_the = [2, 0, 2, 0], V_cat = [1, 2, 1, 2], V_sat = [1, 2, 1, 2]
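The exact weight matrices are not spelled out here, so the 4×4 matrix below is just one possible choice that reproduces the Q = K = V vectors above; in a real model these matrices are learned:

```python
import numpy as np

X = np.array([[1, 0, 1, 0],    # "The"
              [0, 1, 1, 1],    # "cat"
              [1, 2, 0, 0]],   # "sat"
             dtype=float)

# A hypothetical 4×4 projection consistent with the toy numbers
# (here W_Q = W_K = W_V, so Q, K, and V come out identical)
W = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1]], dtype=float)

Q = X @ W   # queries
K = X @ W   # keys
V = X @ W   # values
print(Q)
# [[2. 0. 2. 0.]   "The"
#  [1. 2. 1. 2.]   "cat"
#  [1. 2. 1. 2.]]  "sat"
```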
3. Attention Scores
The attention score (the dot product Q_i · K_j) measures how much the model thinks word j is relevant to word i in the current context.
- High score → word j is important for understanding word i → more “attention” will be paid to it.
- Low score → word j is less relevant → less influence.
Think of it like this:
Word i (Query) asks: “Which words in this sentence should I pay attention to?” Each word j (Key) answers: “Here is my content.” The dot product answers: “How well does my content align with what you need?”
After applying softmax, the scores become weights (percentages) for a weighted sum over the value vectors:
- So the attention score is like a raw measure of relevance before normalization.
- Softmax turns it into a probability distribution → tells how much each word contributes to the new representation of word i.
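For reference, this is the standard scaled dot-product attention formula; the toy walkthrough below applies softmax to the raw dot products and skips the 1/√d_k scaling for simplicity:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```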
For “sat” as the query (Q_sat = [1, 2, 1, 2]):
- Q_sat · K_the = [1, 2, 1, 2] · [2, 0, 2, 0] = 4
- Q_sat · K_cat = [1, 2, 1, 2] · [1, 2, 1, 2] = 10
- Q_sat · K_sat = [1, 2, 1, 2] · [1, 2, 1, 2] = 10
Score vector for “sat”: [4, 10, 10]
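The same scores, checked with numpy (using the Q = K vectors from above):

```python
import numpy as np

Q = np.array([[2, 0, 2, 0],    # "The"
              [1, 2, 1, 2],    # "cat"
              [1, 2, 1, 2]],   # "sat"
             dtype=float)
K = Q.copy()   # K equals Q in this toy example

scores = Q @ K.T    # scores[i, j] = Q_i · K_j
print(scores[2])    # scores for "sat" as the query
# [ 4. 10. 10.]
```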
- In a standard Transformer, attention scores are computed separately in each layer.
- Encoder layers: 6 layers × 1 self-attention per layer → 6 attention computations.
- Decoder layers: 6 layers × 2 attentions per layer (masked self-attention + encoder-decoder attention) → 12 attention computations.
- Each layer takes the output of the previous layer as input, so attention scores are recalculated at every layer, progressively refining the contextual representation.
- Within each layer, multi-head attention computes multiple sets of scores in parallel, which are then combined into a single vector per word.
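As a sketch of this stacking (using PyTorch's built-in modules with standard sizes, not the toy 4-dimensional example), a 6-layer encoder simply applies the same attention + feed-forward block six times, each time on the previous layer's output:

```python
import torch
import torch.nn as nn

# One encoder layer = multi-head self-attention + feed-forward + Add & Norm
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)

# Stack 6 of them; attention scores are recomputed inside every layer
encoder = nn.TransformerEncoder(layer, num_layers=6)

x = torch.randn(3, 1, 512)   # (sequence length, batch size, d_model)
out = encoder(x)             # contextualized vectors, same shape as the input
print(out.shape)             # torch.Size([3, 1, 512])
```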
4. Softmax (Attention Weights)
Softmax: weight_j = e^(score_j) / Σ_k e^(score_k)
Approximate the exponentials:
- e^4 ≈ 54.598
- e^10 ≈ 22026.466
Sum ≈ 54.598 + 22026.466 + 22026.466 ≈ 44107.53
Weights:
- The → 54.598 / 44107.53 ≈ 0.00124
- Cat → 22026.466 / 44107.53 ≈ 0.4994
- Sat → 22026.466 / 44107.53 ≈ 0.4994
Observation: “sat” attends to “cat” and to itself almost equally, and very little to “the.”
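The same softmax step in numpy (subtracting the maximum score first is the usual trick for numerical stability and does not change the result):

```python
import numpy as np

scores = np.array([4.0, 10.0, 10.0])   # "sat" attending to "The", "cat", "sat"

# Softmax: exponentiate, then normalize so the weights sum to 1
exp_scores = np.exp(scores - scores.max())
weights = exp_scores / exp_scores.sum()
print(weights.round(4))
# [0.0012 0.4994 0.4994]
```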
5. Weighted Sum (Output Vector Z_sat)
V_the = [2,0,2,0]
V_cat = [1,2,1,2]
V_sat = [1,2,1,2]
Compute each dimension of Z_sat = 0.00124 × V_the + 0.4994 × V_cat + 0.4994 × V_sat:
- First: 0.00124 × 2 + 0.4994 × 1 + 0.4994 × 1 ≈ 0.00248 + 0.4994 + 0.4994 ≈ 1.001
- Second: 0.00124 × 0 + 0.4994 × 2 + 0.4994 × 2 ≈ 0 + 0.9988 + 0.9988 ≈ 1.998
- Third: 0.00124 × 2 + 0.4994 × 1 + 0.4994 × 1 ≈ 1.001
- Fourth: 0.00124 × 0 + 0.4994 × 2 + 0.4994 × 2 ≈ 1.998
- Z_sat ≈ [1.001, 1.998, 1.001, 1.998]. This is the contextualized vector for “sat”.
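The weighted sum in numpy, using the attention weights and value vectors from above:

```python
import numpy as np

weights = np.array([0.00124, 0.4994, 0.4994])   # attention weights for "sat"
V = np.array([[2, 0, 2, 0],                     # V_the
              [1, 2, 1, 2],                     # V_cat
              [1, 2, 1, 2]],                    # V_sat
             dtype=float)

Z_sat = weights @ V    # weighted sum of the value vectors
print(Z_sat.round(3))
# [1.001 1.998 1.001 1.998]
```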
6. Add & Norm (Residual)
This block stands for "Add & Layer Normalization." The "Add" part refers to adding the original input of the attention layer to its output (this is called a residual connection, which helps with training). "Norm" (Normalization) is a technique to stabilize the network. This is essentially a housekeeping step to ensure the model trains smoothly.
Residual Connection (Add)
Formula: output = X_sat + Z_sat (original input plus attention output)
- In our example: X_sat = [1, 2, 0, 0] and Z_sat ≈ [1.001, 1.998, 1.001, 1.998]
- Add them: X_sat + Z_sat ≈ [2.001, 3.998, 1.001, 1.998]
Why do we add the original input?
- If we only used Z_sat from attention, the model might "forget" the original information, especially during early training.
- Adding X_sat acts like a shortcut: “Here’s your original vector, I just computed an adjustment.”
- This is called a residual connection.
Layer Normalization (Norm)
Simplified formula: LayerNorm(x) = γ · (x − μ) / √(σ² + ε) + β, where μ and σ² are the mean and variance over the dimensions of x.
- Normalize the vector so that, across its dimensions, the mean is 0 and the variance is 1.
- Then optionally scale (γ) and shift (β) with learnable parameters.
- Purpose: stabilize training and prevent some dimensions from becoming too large or too small, which can cause exploding or vanishing gradients.
Analogy: Like seasoning a dish—make sure no flavor is too strong, everything is balanced.
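A minimal sketch of the whole Add & Norm step on the toy vectors, with γ = 1 and β = 0 (real models learn these two parameters):

```python
import numpy as np

X_sat = np.array([1.0, 2.0, 0.0, 0.0])            # original input vector for "sat"
Z_sat = np.array([1.001, 1.998, 1.001, 1.998])    # attention output for "sat"

# Add: residual connection
residual = X_sat + Z_sat
print(residual.round(3))    # [2.001 3.998 1.001 1.998]

# Norm: normalize across the feature dimensions (gamma = 1, beta = 0 here)
mean = residual.mean()
var = residual.var()
normed = (residual - mean) / np.sqrt(var + 1e-5)
print(normed.round(2))      # approximately [-0.23  1.61 -1.15 -0.23]
```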
7. MLPs (Feed-Forward Network)
- Each word’s contextualized vector from the attention + Add & Norm step (e.g., X_sat + Z_sat after LayerNorm) is sent independently through the same small MLP.
- Typically, the MLP has two linear layers with a non-linear activation in between (e.g., ReLU or GELU).
- Purpose: refine each word vector further, combine features, and make it more expressive.
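A sketch of such a feed-forward block in numpy (the weights here are random placeholders; in a real model they are learned, and the hidden layer is typically about 4× wider than the model dimension):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between, applied to one word vector."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # linear, then ReLU
    return hidden @ W2 + b2                 # linear back to the model dimension

d_model, d_ff = 4, 8                        # toy sizes matching the 4-dim example
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = np.array([-0.23, 1.61, -1.15, -0.23])   # normalized "sat" vector from Add & Norm
print(feed_forward(x, W1, b1, W2, b2))      # a refined 4-dim vector for "sat"
```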
Encoder Output
- After the MLP (and its own Add & Norm), you get the final encoder output vector for that word.
- This vector now encodes:
- The original meaning of “sat”
- Context from the whole sentence (“cat” is the subject)
- Non-linear transformations from the MLP
In short: this is the “contextual embedding” of the word coming out of the encoder.
Backpropagation
- During training, all parameters are updated via backpropagation:
- Attention projection weights (W_Q, W_K, W_V)
- MLP weights
- LayerNorm scale and shift parameters (γ and β)
- The gradients flow through the attention, residual connections, and MLP, updating all weights to minimize the loss.
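A hedged sketch of one training step in PyTorch (the task head and loss here are illustrative, not part of the original setup): a single backward pass produces gradients for all of the parameters listed above, and the optimizer updates them together.

```python
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=6)
head = nn.Linear(512, 10)   # illustrative classification head, not from the original model
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(head.parameters()), lr=1e-4)

x = torch.randn(3, 1, 512)             # dummy input sequence
target = torch.randint(0, 10, (3, 1))  # dummy labels

logits = head(encoder(x))              # forward pass through attention, Add & Norm, MLPs
loss = nn.functional.cross_entropy(logits.view(-1, 10), target.view(-1))
loss.backward()      # gradients flow through attention, residuals, MLPs, LayerNorm
optimizer.step()     # updates W_Q, W_K, W_V, MLP weights, and the LayerNorm γ, β
optimizer.zero_grad()
```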
- Author: Entropyobserver
- URL: https://tangly1024.com/article/272d698f-3512-80db-8fe4-e57ba9ba2762
- Copyright: All articles in this blog, unless otherwise stated, are licensed under the BY-NC-SA agreement. Please credit the source when sharing!