type: Post
date: Apr 21, 2024
tags: Deep Learning, Machine Learning
category: Technology
status: Published

NLLB-200-distilled-600M with LoRA: Architecture Overview
1. Base Model: NLLB-200-distilled-600M
NLLB-200-distilled-600M is a distilled sequence-to-sequence Transformer developed by Meta AI, designed for multilingual neural machine translation across 200 languages. The model follows the standard encoder-decoder architecture with the following specifications:
- Hidden dimension: d_model = 1024
- Attention heads: 16 per layer, with d_k = d_v = 64
- Encoder: 12 transformer layers, each containing a multi-head self-attention sublayer and a position-wise feed-forward network (d_ff = 4096)
- Decoder: 12 transformer layers, each containing a masked self-attention sublayer, a cross-attention sublayer, and a feed-forward network
- Vocabulary: 256,206 tokens via SentencePiece, shared across all 200 languages
- Positional encoding: learned absolute positional embeddings
- Output: linear projection to vocabulary size followed by softmax
The model was obtained via knowledge distillation from the larger NLLB-200-1.3B, preserving multilingual translation capability at reduced computational cost.
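The dimension bookkeeping in the specification above can be sanity-checked in a few lines (the constants are taken from the list above, not read from the released checkpoint):

```python
# Architecture constants from the specification above (not read from the checkpoint).
d_model = 1024      # hidden dimension
n_heads = 16        # attention heads per layer
d_head = 64         # per-head dimension (d_k = d_v)
d_ff = 4096         # position-wise feed-forward inner dimension
vocab = 256_206     # shared SentencePiece vocabulary

# The heads must tile the hidden dimension exactly.
assert n_heads * d_head == d_model

# The shared embedding / output projection alone is a large share of the 600M budget.
embedding_params = vocab * d_model
print(f"embedding/output projection: {embedding_params:,} parameters")  # 262,354,944
```

Note how the shared vocabulary embedding accounts for roughly 262M parameters on its own, which is a large fraction of the 600M total.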
2. LoRA Adaptation
Rather than updating all 600M parameters, we apply Low-Rank Adaptation (LoRA) to inject trainable parameters into the attention layers while keeping the rest of the model frozen.
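In practice such a setup can be expressed as a HuggingFace PEFT configuration. The sketch below is an assumed equivalent, not the authors' training script; the module names `q_proj`, `k_proj`, `v_proj`, `out_proj` match the M2M100-style attention implementation that NLLB uses in `transformers`:

```python
# Sketch: attaching LoRA adapters with HuggingFace PEFT (assumed toolchain,
# not necessarily the authors' exact setup).
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

lora_cfg = LoraConfig(
    r=8,                 # LoRA rank
    lora_alpha=64,       # scaling factor alpha
    bias="none",
    task_type="SEQ_2_SEQ_LM",
    # Q, K, V, O projections in NLLB's (M2M100-style) attention sublayers
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights remain trainable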
2.1 Mathematical Formulation
For a frozen weight matrix W ∈ ℝ^(d×k), LoRA introduces a low-rank decomposition of the update:

W′ = W + ΔW = W + (α/r) · BA

where:
- A ∈ ℝ^(r×k) is initialized from a random Gaussian distribution
- B ∈ ℝ^(d×r) is initialized with zeros
- r is the rank (r=8 in our optimal configuration)
- α is the scaling factor (α=64), controlling the magnitude of the update
At initialization, BA = 0, so the adapted model starts identical to the base model. During training, only A and B are updated.
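This update rule can be illustrated with a few lines of NumPy (a toy-sized sketch of the math, not the actual model code):

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r, alpha = 64, 64, 8, 64           # toy sizes; the model uses d = k = 1024
W = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01   # Gaussian initialization
B = np.zeros((d, r))                     # zero initialization, so BA = 0 at the start

def lora_forward(x):
    # y = W x + (alpha / r) * B A x  --  only A and B would receive gradients
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(k)
# At initialization the adapted layer matches the frozen layer exactly.
assert np.allclose(lora_forward(x), W @ x)
```

After even one gradient step on B, the two outputs diverge, which is exactly the adaptation the training procedure exploits.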
2.2 Injection Points
LoRA is applied to all Q, K, V, O projection matrices across all attention sublayers:
| Layer type | Count | Modules | LoRA |
| --- | --- | --- | --- |
| Encoder self-attention | 12 | Q, K, V, O | ✅ |
| Decoder masked self-attention | 12 | Q, K, V, O | ✅ |
| Decoder cross-attention | 12 | Q, K, V, O | ✅ |
| Encoder FFN | 12 | — | frozen |
| Decoder FFN | 12 | — | frozen |
Total LoRA modules: 36 attention sublayers × 4 projections = 144 adapter pairs (A, B)
2.3 Parameter Budget
For each projection matrix of shape (1024 × 1024) with r = 8, the adapter adds r × (d + k) = 8 × (1024 + 1024) = 16,384 trainable parameters. Across all 144 adapter pairs this gives 144 × 16,384 = 2,359,296 ≈ 2.36M trainable parameters.
Note: the figure in the paper cites ≈1M trainable parameters, which may reflect that not all projection matrices are square (e.g., cross-attention K/V dimensions differ), or that only a subset of layers was counted. The key point is that trainable parameters remain < 0.5% of the 600M backbone.
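The budget arithmetic can be recomputed directly from the counts above (assuming, as stated, that every projection is a square 1024 × 1024 matrix):

```python
# Recomputing the adapter parameter budget from the numbers above.
d = k = 1024                        # projection matrices assumed d_model x d_model
r = 8                               # LoRA rank
n_attn_sublayers = 12 + 12 + 12     # enc self-attn + dec self-attn + dec cross-attn
n_projections = 4                   # Q, K, V, O per attention sublayer

per_pair = r * (d + k)                        # parameters in one (A, B) pair
n_pairs = n_attn_sublayers * n_projections    # total adapter pairs
total = n_pairs * per_pair

print(per_pair, n_pairs, total)     # 16384 144 2359296
print(f"{100 * total / 600e6:.2f}% of a 600M backbone")  # 0.39%
```

At ≈0.4% of the backbone, the adapters easily satisfy the < 0.5% budget even under the square-matrix assumption.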
3. Why This Design?
Why apply LoRA to Q, K, V, O but not the FFN?
Attention projections determine how the model attends to different tokens and how information flows through the network. By adapting these projections, the model can adjust how it focuses on domain-specific relationships between tokens. In contrast, FFN layers mainly function as a form of static knowledge storage, so modifying them is usually less important when adapting the model to a new domain.
Why use α = 64 with r = 8?
This setting results in an effective scaling factor of α/r = 8, which amplifies the contribution of the LoRA updates. This is particularly helpful for petroleum-domain adaptation, where the vocabulary and terminology differ significantly from general-domain text. Our hyperparameter search also showed that α had the strongest impact on performance (importance score = 0.97 based on fANOVA).
Why freeze the base model?
Freezing the base model helps preserve the multilingual representations learned during pre-training. This reduces the risk of catastrophic forgetting, while still allowing the model to specialize to the petroleum domain efficiently using only a small number of additional parameters.
- Author: NotionNext
- URL: https://tangly1024.com/article/325d698f-3512-8053-8f1d-e5cc1e9b503f
- Copyright: All articles in this blog, unless otherwise stated, are licensed under the BY-NC-SA agreement. Please credit the source when sharing!
