type: Post
status: Published
date: Apr 21, 2024
tags: Deep Learning, Machine Learning
category: Technology

NLLB-200-distilled-600M with LoRA: Architecture Overview
1. Base Model: NLLB-200-distilled-600M
NLLB-200-distilled-600M is a distilled sequence-to-sequence Transformer developed by Meta AI, designed for multilingual neural machine translation across 200 languages. The model follows the standard encoder-decoder architecture with the following specifications:
- Hidden dimension: d_model = 1024
- Attention heads: 16 per layer, with d_k = d_v = 64
- Encoder: 12 transformer layers, each containing a multi-head self-attention sublayer and a position-wise feed-forward network (d_ff = 4096)
- Decoder: 12 transformer layers, each containing a masked self-attention sublayer, a cross-attention sublayer, and a feed-forward network
- Vocabulary: 256,206 tokens via SentencePiece, shared across all 200 languages
- Positional encoding: sinusoidal positional embeddings (inherited from the M2M-100 architecture on which NLLB is built)
- Output: linear projection to vocabulary size followed by softmax
The model was obtained via knowledge distillation from the larger NLLB-200-1.3B, preserving multilingual translation capability at reduced computational cost.
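The specifications above can be sanity-checked with simple arithmetic. The sketch below uses only the numbers listed in this section (no model download); the parameter counts ignore biases and layer norms, and one takeaway is that the shared embedding table alone accounts for a large share of the 600M budget.

```python
# Sanity-check the architecture numbers above (pure arithmetic, no weights
# are loaded; figures come from the spec list in this section).
d_model = 1024
n_heads = 16
d_ff = 4096
vocab_size = 256_206

# Per-head dimension: d_k = d_v = d_model / n_heads
d_k = d_model // n_heads
print(d_k)  # 64

# One attention sublayer: Q, K, V, O projections, each d_model x d_model
attn_params = 4 * d_model * d_model

# One position-wise FFN: two linear maps, d_model -> d_ff and d_ff -> d_model
ffn_params = 2 * d_model * d_ff

# The shared 256k-token embedding table dominates the parameter budget
embed_params = vocab_size * d_model
print(f"embedding table: {embed_params / 1e6:.0f}M parameters")
```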
2. LoRA Adaptation
Rather than updating all 600M parameters, we apply Low-Rank Adaptation (LoRA) to inject trainable parameters into the attention layers while keeping the rest of the model frozen.
2.1 Mathematical Formulation
For a frozen pretrained weight matrix W ∈ ℝ^(d×k), LoRA represents the weight update as a low-rank decomposition, so the adapted forward pass becomes:

h = Wx + ΔWx = Wx + (α/r)·BAx

where:
- A ∈ ℝ^(r×k) is initialized with random Gaussian
- B ∈ ℝ^(d×r) is initialized with zeros
- r is the rank (r=8 in our optimal configuration)
- α is the scaling factor (α=64), controlling the magnitude of the update
At initialization, BA = 0, so the adapted model starts identical to the base model. During training, only A and B are updated.
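The formulation above can be written out as a minimal NumPy sketch (illustrative shapes and random toy weights, not the actual NLLB matrices). It also verifies the initialization property: because B starts at zero, the adapted layer is exactly the base layer before any training step.

```python
import numpy as np

# Minimal sketch of the LoRA update from Section 2.1:
#   h = W x + (alpha / r) * B A x
rng = np.random.default_rng(0)
d, k, r, alpha = 1024, 1024, 8, 64

W = rng.standard_normal((d, k)) * 0.02   # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable, random Gaussian init
B = np.zeros((d, r))                     # trainable, zero init

def lora_forward(x):
    # Computing B @ (A @ x) keeps the cost low: two thin matmuls
    # instead of materializing the full d x k update matrix.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(k)
# At initialization B = 0, so BA = 0 and the adapted layer matches the base:
assert np.allclose(lora_forward(x), W @ x)
```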
2.2 Injection Points
LoRA is applied to all Q, K, V, O projection matrices across all attention sublayers:
| Layer Type | Count | Modules | LoRA |
| --- | --- | --- | --- |
| Encoder self-attention | 12 | Q, K, V, O | ✅ |
| Decoder masked self-attention | 12 | Q, K, V, O | ✅ |
| Decoder cross-attention | 12 | Q, K, V, O | ✅ |
| Encoder FFN | 12 | — | frozen |
| Decoder FFN | 12 | — | frozen |
Total LoRA modules: 36 layers × 4 projections = 144 adapter pairs (A, B)
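The module count can be spelled out directly. The projection names below follow the Hugging Face M2M100/NLLB implementation (`q_proj`, `k_proj`, `v_proj`, `out_proj`); treating those names as the injection targets is an assumption about the underlying code, but the arithmetic holds regardless.

```python
# LoRA wraps the four attention projections in every attention
# sublayer of the encoder and decoder, and nothing else.
attention_sublayers = {
    "encoder self-attention": 12,
    "decoder masked self-attention": 12,
    "decoder cross-attention": 12,
}
# Projection names as in the Hugging Face M2M100/NLLB implementation
projections = ["q_proj", "k_proj", "v_proj", "out_proj"]

n_adapter_pairs = sum(attention_sublayers.values()) * len(projections)
print(n_adapter_pairs)  # 144
```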
2.3 Parameter Budget
For each projection matrix of shape (1024 × 1024) with r=8, an adapter pair adds r·(d + k) = 8 × (1024 + 1024) = 16,384 trainable parameters, so the 144 pairs total 144 × 16,384 ≈ 2.36M.
Note: the paper cites ≈1M trainable parameters. The discrepancy may reflect that not all projection matrices are square (e.g., if cross-attention K/V dimensions differ), or that only a subset of layers was counted. The key point stands either way: trainable parameters remain under 0.5% of the 600M backbone.
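The budget works out as a few lines of arithmetic, assuming all 144 projections are square 1024 × 1024 matrices:

```python
# Back-of-the-envelope trainable-parameter budget for r=8 adapters
# (assumes every one of the 144 projections is square, 1024 x 1024).
d = k = 1024
r = 8
n_pairs = 144

params_per_pair = r * (d + k)           # A is r x k, B is d x r
total_trainable = n_pairs * params_per_pair
backbone = 600_000_000

print(params_per_pair)   # 16384
print(total_trainable)   # 2359296
print(f"{100 * total_trainable / backbone:.2f}% of the backbone")
```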
3. Why This Design?
Why Q, K, V, O but not FFN?
Attention projections control how the model attends to and routes information — adapting these allows the model to re-weight domain-specific token relationships. FFN layers act more as static "memory" and are less critical for domain shift.
Why α=64 with r=8?
The effective scaling ratio is α/r = 8, which amplifies the LoRA update magnitude. This is particularly important for petroleum-domain adaptation, where the lexical shift from general-domain text is large. Our hyperparameter search confirmed that α dominates performance (importance = 0.97 via fANOVA).
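The effect of α on update magnitude can be seen directly: with A and B held fixed, the update ΔW = (α/r)·BA scales linearly in α, so α=64 produces an update 8× larger than α=8 at the same rank. A toy NumPy illustration (random matrices, not trained weights):

```python
import numpy as np

# With fixed A and B, the LoRA update Delta W = (alpha / r) * B A
# scales linearly with alpha. Toy random matrices for illustration only.
rng = np.random.default_rng(1)
d, k, r = 1024, 1024, 8
A = rng.standard_normal((r, k))
B = rng.standard_normal((d, r))

def update_norm(alpha):
    # Frobenius norm of the effective weight update
    return np.linalg.norm((alpha / r) * B @ A)

# alpha=64 yields an update 8x larger than alpha=8 for identical A, B:
ratio = update_norm(64) / update_norm(8)
print(round(ratio, 6))  # 8.0
```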
Why freeze the base model?
Freezing preserves the multilingual representations learned during pre-training, preventing catastrophic forgetting while allowing efficient domain specialization with minimal compute.
- Author: NotionNext
- URL: https://tangly1024.com/article/325d698f-3512-8053-8f1d-e5cc1e9b503f
- Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!
