type: Post
status: Published
date: Apr 21, 2024
tags: Deep Learning, Machine Learning
category: Technology

NLLB-200-distilled-600M with LoRA: Architecture Overview
1. Base Model: NLLB-200-distilled-600M
NLLB-200-distilled-600M is a distilled sequence-to-sequence Transformer developed by Meta AI, designed for multilingual neural machine translation across 200 languages. The model follows the standard encoder-decoder architecture with the following specifications:
- Hidden dimension: d_model = 1024
- Attention heads: 16 per layer, with d_k = d_v = 64
- Encoder: 12 transformer layers, each containing a multi-head self-attention sublayer and a position-wise feed-forward network (d_ff = 4096)
- Decoder: 12 transformer layers, each containing a masked self-attention sublayer, a cross-attention sublayer, and a feed-forward network
- Vocabulary: 256,206 tokens via SentencePiece, shared across all 200 languages
- Positional encoding: sinusoidal positional embeddings (inherited from the M2M-100 architecture on which NLLB is built)
- Output: linear projection to vocabulary size followed by softmax
The model was obtained via knowledge distillation from the larger NLLB-200-1.3B, preserving multilingual translation capability at reduced computational cost.
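The specifications above can be sanity-checked with simple arithmetic. The sketch below uses only the numbers listed in this section (no model download); the parameter counts ignore biases and layer norms, and one takeaway is that the shared embedding table alone accounts for a large share of the 600M budget.

```python
# Sanity-check the architecture numbers above (pure arithmetic, no weights
# are loaded; figures come from the spec list in this section).
d_model = 1024
n_heads = 16
d_ff = 4096
vocab_size = 256_206

# Per-head dimension: d_k = d_v = d_model / n_heads
d_k = d_model // n_heads
print(d_k)  # 64

# One attention sublayer: Q, K, V, O projections, each d_model x d_model
attn_params = 4 * d_model * d_model

# One position-wise FFN: two linear maps, d_model -> d_ff and d_ff -> d_model
ffn_params = 2 * d_model * d_ff

# The shared 256k-token embedding table dominates the parameter budget
embed_params = vocab_size * d_model
print(f"embedding table: {embed_params / 1e6:.0f}M parameters")
```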
2. LoRA Adaptation
Rather than updating all 600M parameters, we apply Low-Rank Adaptation (LoRA) to inject trainable parameters into the attention layers while keeping the rest of the model frozen.
2.1 Mathematical Formulation
For a frozen pretrained weight matrix W ∈ ℝ^(d×k), LoRA represents the weight update as a low-rank decomposition, so the adapted forward pass becomes:

h = Wx + ΔWx = Wx + (α/r)·BAx

where:
- A ∈ ℝ^(r×k) is initialized with random Gaussian
- B ∈ ℝ^(d×r) is initialized with zeros
- r is the rank (r=8 in our optimal configuration)
- α is the scaling factor (α=64), controlling the magnitude of the update
At initialization, BA = 0, so the adapted model starts identical to the base model. During training, only A and B are updated.
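The formulation above can be written out as a minimal NumPy sketch (illustrative shapes and random toy weights, not the actual NLLB matrices). It also verifies the initialization property: because B starts at zero, the adapted layer is exactly the base layer before any training step.

```python
import numpy as np

# Minimal sketch of the LoRA update from Section 2.1:
#   h = W x + (alpha / r) * B A x
rng = np.random.default_rng(0)
d, k, r, alpha = 1024, 1024, 8, 64

W = rng.standard_normal((d, k)) * 0.02   # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable, random Gaussian init
B = np.zeros((d, r))                     # trainable, zero init

def lora_forward(x):
    # Computing B @ (A @ x) keeps the cost low: two thin matmuls
    # instead of materializing the full d x k update matrix.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(k)
# At initialization B = 0, so BA = 0 and the adapted layer matches the base:
assert np.allclose(lora_forward(x), W @ x)
```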
2.2 Injection Points
LoRA is applied to all Q, K, V, O projection matrices across all attention sublayers:
| Layer Type | Count | Modules | LoRA |
| --- | --- | --- | --- |
| Encoder self-attention | 12 | Q, K, V, O | ✅ |
| Decoder masked self-attention | 12 | Q, K, V, O | ✅ |
| Decoder cross-attention | 12 | Q, K, V, O | ✅ |
| Encoder FFN | 12 | — | frozen |
| Decoder FFN | 12 | — | frozen |
Total LoRA modules: 36 layers × 4 projections = 144 adapter pairs (A, B)
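The module count can be spelled out directly. The projection names below follow the Hugging Face M2M100/NLLB implementation (`q_proj`, `k_proj`, `v_proj`, `out_proj`); treating those names as the injection targets is an assumption about the underlying code, but the arithmetic holds regardless.

```python
# LoRA wraps the four attention projections in every attention
# sublayer of the encoder and decoder, and nothing else.
attention_sublayers = {
    "encoder self-attention": 12,
    "decoder masked self-attention": 12,
    "decoder cross-attention": 12,
}
# Projection names as in the Hugging Face M2M100/NLLB implementation
projections = ["q_proj", "k_proj", "v_proj", "out_proj"]

n_adapter_pairs = sum(attention_sublayers.values()) * len(projections)
print(n_adapter_pairs)  # 144
```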
2.3 Parameter Budget
For each projection matrix of shape (1024 × 1024) with r=8, an adapter pair adds r·(d + k) = 8 × (1024 + 1024) = 16,384 trainable parameters, so the 144 pairs total 144 × 16,384 ≈ 2.36M.
Note: the paper cites ≈1M trainable parameters. The discrepancy may reflect that not all projection matrices are square (e.g., if cross-attention K/V dimensions differ), or that only a subset of layers was counted. The key point stands either way: trainable parameters remain under 0.5% of the 600M backbone.
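The budget works out as a few lines of arithmetic, assuming all 144 projections are square 1024 × 1024 matrices:

```python
# Back-of-the-envelope trainable-parameter budget for r=8 adapters
# (assumes every one of the 144 projections is square, 1024 x 1024).
d = k = 1024
r = 8
n_pairs = 144

params_per_pair = r * (d + k)           # A is r x k, B is d x r
total_trainable = n_pairs * params_per_pair
backbone = 600_000_000

print(params_per_pair)   # 16384
print(total_trainable)   # 2359296
print(f"{100 * total_trainable / backbone:.2f}% of the backbone")
```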
3. Why This Design?
Why Q, K, V, O but not FFN?
Attention projections control how the model attends to and routes information — adapting these allows the model to re-weight domain-specific token relationships. FFN layers act more as static "memory" and are less critical for domain shift.
Why α=64 with r=8?
The effective scaling ratio is α/r = 8, which amplifies the LoRA update magnitude. This is particularly important for petroleum-domain adaptation, where the lexical shift from general-domain text is large. Our hyperparameter search confirmed that α dominates performance (importance = 0.97 via fANOVA).
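The effect of α on update magnitude can be seen directly: with A and B held fixed, the update ΔW = (α/r)·BA scales linearly in α, so α=64 produces an update 8× larger than α=8 at the same rank. A toy NumPy illustration (random matrices, not trained weights):

```python
import numpy as np

# With fixed A and B, the LoRA update Delta W = (alpha / r) * B A
# scales linearly with alpha. Toy random matrices for illustration only.
rng = np.random.default_rng(1)
d, k, r = 1024, 1024, 8
A = rng.standard_normal((r, k))
B = rng.standard_normal((d, r))

def update_norm(alpha):
    # Frobenius norm of the effective weight update
    return np.linalg.norm((alpha / r) * B @ A)

# alpha=64 yields an update 8x larger than alpha=8 for identical A, B:
ratio = update_norm(64) / update_norm(8)
print(round(ratio, 6))  # 8.0
```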
Why freeze the base model?
Freezing preserves the multilingual representations learned during pre-training, preventing catastrophic forgetting while allowing efficient domain specialization with minimal compute.
- Author: NotionNext
- URL: https://tangly1024.com/article/325d698f-3512-8053-8f1d-e5cc1e9b503f
- Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!
