Machine translation - NLLB-200-distilled-600M with LoRA
Apr 21, 2024

NLLB-200-distilled-600M with LoRA: Architecture Overview

1. Base Model: NLLB-200-distilled-600M

NLLB-200-distilled-600M is a distilled sequence-to-sequence Transformer developed by Meta AI, designed for multilingual neural machine translation across 200 languages. The model follows the standard encoder-decoder architecture with the following specifications:
  • Hidden dimension: d_model = 1024
  • Attention heads: 16 per layer, with d_k = d_v = 64
  • Encoder: 12 transformer layers, each containing a multi-head self-attention sublayer and a position-wise feed-forward network (d_ff = 4096)
  • Decoder: 12 transformer layers, each containing a masked self-attention sublayer, a cross-attention sublayer, and a feed-forward network
  • Vocabulary: 256,206 tokens via SentencePiece, shared across all 200 languages
  • Positional encoding: learned absolute positional embeddings
  • Output: linear projection to vocabulary size followed by softmax
The model was obtained via knowledge distillation from the larger NLLB-200-1.3B, preserving multilingual translation capability at reduced computational cost.
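For reference, these hyperparameters can be read directly off the published checkpoint configuration. The sketch below assumes the Hugging Face transformers library and the public facebook/nllb-200-distilled-600M checkpoint:

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM

# Fetch only the configuration (no model weights) to inspect the architecture.
config = AutoConfig.from_pretrained("facebook/nllb-200-distilled-600M")

print(config.d_model)                  # 1024 hidden dimension
print(config.encoder_attention_heads)  # 16 heads per layer
print(config.encoder_layers)           # 12 encoder layers
print(config.decoder_layers)           # 12 decoder layers
print(config.encoder_ffn_dim)          # 4096 feed-forward dimension
print(config.vocab_size)               # 256206 shared SentencePiece vocabulary

# Loading the full model confirms the parameter count (~0.6B).
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```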

2. LoRA Adaptation

Rather than updating all 600M parameters, we apply Low-Rank Adaptation (LoRA) to inject trainable parameters into the attention layers while keeping the rest of the model frozen.

2.1 Mathematical Formulation

For a frozen weight matrix W ∈ ℝ^(d×k), LoRA introduces a low-rank decomposition of the weight update, so the adapted forward pass becomes
h = Wx + ΔWx = Wx + (α/r)·BAx
where:
  • A ∈ ℝ^(r×k) is initialized with random Gaussian values
  • B ∈ ℝ^(d×r) is initialized with zeros
  • r is the rank (r=8 in our optimal configuration)
  • α is the scaling factor (α=64), controlling the magnitude of the update
At initialization, BA = 0, so the adapted model starts identical to the base model. During training, only A and B are updated.
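A minimal PyTorch sketch of this formulation (illustrative only; the LoRALinear wrapper and its initialization scale below are a didactic sketch, not the exact implementation used in our experiments):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update (α/r)·BA."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 64):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze W (and bias, if any)
            p.requires_grad_(False)
        self.scaling = alpha / r           # α/r = 64/8 = 8 in our best configuration
        # A ∈ ℝ^(r×k): random Gaussian init; B ∈ ℝ^(d×r): zeros, so BA = 0 at the start.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path Wx plus the scaled low-rank path (α/r)·B(Ax).
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Example: adapt one 1024×1024 attention projection, as in NLLB's attention sublayers.
proj = LoRALinear(nn.Linear(1024, 1024, bias=False), r=8, alpha=64)
out = proj(torch.randn(2, 16, 1024))   # (batch, seq, d_model) -> same shape
```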

2.2 Injection Points

LoRA is applied to all Q, K, V, O projection matrices across all attention sublayers:
| Layer type | Count | LoRA modules |
| --- | --- | --- |
| Encoder self-attention | 12 | Q, K, V, O |
| Decoder masked self-attention | 12 | Q, K, V, O |
| Decoder cross-attention | 12 | Q, K, V, O |
| Encoder FFN | 12 | frozen (none) |
| Decoder FFN | 12 | frozen (none) |
Total LoRA modules: 36 layers × 4 projections = 144 adapter pairs (A, B)
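With the Hugging Face peft library, this injection pattern corresponds to targeting the q_proj, k_proj, v_proj and out_proj submodules of the transformers NLLB (M2M100) implementation. A hedged sketch of the setup (the dropout value here is an illustrative assumption, not a reported hyperparameter):

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

lora_config = LoraConfig(
    r=8,                      # rank of the low-rank update
    lora_alpha=64,            # scaling factor α (effective scale α/r = 8)
    lora_dropout=0.05,        # assumption: a typical small dropout, not cited in this post
    bias="none",
    task_type="SEQ_2_SEQ_LM",
    # Q, K, V, O projections in every encoder/decoder attention sublayer;
    # the FFN layers (fc1/fc2) stay frozen.
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # trainable params well under 1% of the ~600M total
```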

2.3 Parameter Budget

For each projection matrix of shape (1024 × 1024) with r = 8, LoRA adds A ∈ ℝ^(8×1024) and B ∈ ℝ^(1024×8), i.e. 2 × 8 × 1024 = 16,384 trainable parameters per projection; across 144 adapted projections this gives roughly 2.36M trainable parameters.
Note: the figure in the paper cites ≈1M trainable parameters, which may reflect that not all projection matrices are square (e.g., cross-attention K/V dimensions differ), or that only a subset of layers was counted. The key point is that the trainable parameters remain well under 0.5% of the 600M backbone.
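The same back-of-the-envelope count in a few lines of Python, assuming every adapted projection is a square 1024 × 1024 matrix:

```python
d_model, r = 1024, 8
attention_sublayers = 12 + 12 + 12           # encoder self-attn + decoder self-attn + cross-attn
projections = attention_sublayers * 4        # Q, K, V, O per sublayer

params_per_projection = r * d_model + d_model * r   # A (r×k) plus B (d×r)
total_lora_params = projections * params_per_projection

print(projections)                           # 144 adapter pairs
print(total_lora_params)                     # 2,359,296 ≈ 2.36M
print(100 * total_lora_params / 600e6)       # ≈ 0.39% of the 600M backbone
```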

3. Why This Design?

Why apply LoRA to Q, K, V, O but not the FFN?
Attention projections determine how the model attends to different tokens and how information flows through the network. By adapting these projections, the model can adjust how it focuses on domain-specific relationships between tokens. In contrast, FFN layers mainly function as a form of static knowledge storage, so modifying them is usually less important when adapting the model to a new domain.
Why use α = 64 with r = 8?
This setting results in an effective scaling factor of α/r = 8, which amplifies the contribution of the LoRA updates. This is particularly helpful for petroleum-domain adaptation, where the vocabulary and terminology differ significantly from general-domain text. Our hyperparameter search also showed that α had the strongest impact on performance (importance score = 0.97 based on fANOVA).
Why freeze the base model?
Freezing the base model helps preserve the multilingual representations learned during pre-training. This reduces the risk of catastrophic forgetting, while still allowing the model to specialize to the petroleum domain efficiently using only a small number of additional parameters.
 
 
 