Machine translation - NLLB-200-distilled-600M with LoRA
Apr 21, 2024

NLLB-200-distilled-600M with LoRA: Architecture Overview

1. Base Model: NLLB-200-distilled-600M

NLLB-200-distilled-600M is a distilled sequence-to-sequence Transformer developed by Meta AI, designed for multilingual neural machine translation across 200 languages. The model follows the standard encoder-decoder architecture with the following specifications:
  • Hidden dimension: d_model = 1024
  • Attention heads: 16 per layer, with d_k = d_v = 64
  • Encoder: 12 transformer layers, each containing a multi-head self-attention sublayer and a position-wise feed-forward network (d_ff = 4096)
  • Decoder: 12 transformer layers, each containing a masked self-attention sublayer, a cross-attention sublayer, and a feed-forward network
  • Vocabulary: 256,206 tokens via SentencePiece, shared across all 200 languages
  • Positional encoding: learned absolute positional embeddings
  • Output: linear projection to vocabulary size followed by softmax
The model was obtained via knowledge distillation from the larger NLLB-200-1.3B, preserving multilingual translation capability at reduced computational cost.
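The dimensions listed above can be sanity-checked against the 600M figure with a rough back-of-the-envelope count (a sketch only: it assumes a weight-tied shared embedding and ignores biases, layer norms, and positional embeddings):

```python
# Rough parameter count for NLLB-200-distilled-600M from the listed dimensions.
d_model, d_ff, vocab = 1024, 4096, 256206
n_enc, n_dec = 12, 12

embedding = vocab * d_model            # shared (tied) input/output embedding
attn = 4 * d_model * d_model           # Q, K, V, O projection matrices
ffn = 2 * d_model * d_ff               # two linear maps in the feed-forward net

enc_layer = attn + ffn                 # self-attention + FFN
dec_layer = 2 * attn + ffn             # masked self-attn + cross-attn + FFN

total = embedding + n_enc * enc_layer + n_dec * dec_layer
print(f"~{total / 1e6:.0f}M parameters")  # lands near the advertised 600M
```

The estimate comes out around 615M, close enough to confirm the listed hyperparameters are mutually consistent.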

2. LoRA Adaptation

Rather than updating all 600M parameters, we apply Low-Rank Adaptation (LoRA) to inject trainable parameters into the attention layers while keeping the rest of the model frozen.

2.1 Mathematical Formulation

For a frozen weight matrix W ∈ ℝ^(d×k), LoRA introduces a low-rank decomposition of the weight update:
  W' = W + ΔW = W + (α/r) · BA
where:
  • A ∈ ℝ^(r×k) is initialized with random Gaussian
  • B ∈ ℝ^(d×r) is initialized with zeros
  • r is the rank (r=8 in our optimal configuration)
  • α is the scaling factor (α=64), controlling the magnitude of the update
At initialization, BA = 0, so the adapted model starts identical to the base model. During training, only A and B are updated.
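The formulation above can be sketched in a few lines of NumPy (the class name and the Gaussian scale for A are illustrative, not from the paper):

```python
import numpy as np

class LoRALinear:
    """y = x W^T + (alpha/r) * x (BA)^T, with W frozen."""
    def __init__(self, d, k, r=8, alpha=64, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d, k))         # frozen pretrained weight
        self.A = rng.standard_normal((r, k)) * 0.01  # trainable, Gaussian init
        self.B = np.zeros((d, r))                    # trainable, zero init
        self.scale = alpha / r

    def __call__(self, x):
        # x: (batch, k) -> (batch, d); only A and B would receive gradients
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(d=1024, k=1024)
x = np.ones((2, 1024))
# Because B = 0 at initialization, the adapted layer matches the base layer.
assert np.allclose(layer(x), x @ layer.W.T)
```

This also makes the zero-initialization property concrete: the assertion holds before any training step.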

2.2 Injection Points

LoRA is applied to all Q, K, V, O projection matrices across all attention sublayers:
| Layer Type | Count | Modules | LoRA |
| --- | --- | --- | --- |
| Encoder self-attention | 12 | Q, K, V, O | ✓ |
| Decoder masked self-attention | 12 | Q, K, V, O | ✓ |
| Decoder cross-attention | 12 | Q, K, V, O | ✓ |
| Encoder FFN | 12 | | frozen |
| Decoder FFN | 12 | | frozen |
Total LoRA modules: 36 layers × 4 projections = 144 adapter pairs (A, B)
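With the Hugging Face `peft` library, this injection pattern can be expressed as a configuration sketch (assuming the `q_proj`/`k_proj`/`v_proj`/`out_proj` module names used by the NLLB/M2M100 implementation in `transformers`; not a verbatim reproduction of the paper's setup):

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

config = LoraConfig(
    r=8,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],  # Q, K, V, O
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(model, config)  # everything outside the adapters stays frozen
model.print_trainable_parameters()
```

Because `target_modules` matches by module name, all 36 attention sublayers (encoder self-attention, decoder self-attention, decoder cross-attention) receive adapters, giving the 144 pairs counted above.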

2.3 Parameter Budget

For each projection matrix of shape (1024 × 1024) with r=8, an adapter pair adds r·(k + d) = 8 × (1024 + 1024) = 16,384 parameters, so the 144 pairs contribute 144 × 16,384 ≈ 2.36M trainable parameters, under 0.4% of the backbone.
Note: the paper cites a figure of ≈1M trainable parameters, which may reflect that not all projection matrices are square (e.g., cross-attention K/V dimensions differ), or that only a subset of layers was counted. The key point is that trainable parameters remain < 0.5% of the 600M backbone.
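The budget arithmetic, spelled out (using the square-matrix assumption discussed above):

```python
# Adapter parameter budget implied by the injection table.
d = k = 1024
r = 8
per_pair = r * k + d * r      # A is (r x k), B is (d x r): 16,384 params
pairs = 36 * 4                # 36 attention sublayers x 4 projections
total = pairs * per_pair      # 2,359,296 trainable parameters
share = 100 * total / 600e6   # as a percentage of the 600M backbone
print(f"{total:,} trainable params ({share:.2f}% of 600M)")
```

Even under this upper-bound assumption of all-square projections, the trainable share stays below the 0.5% mark quoted in the text.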

3. Why This Design?

Why Q, K, V, O but not FFN? Attention projections control how the model attends to and routes information — adapting these allows the model to re-weight domain-specific token relationships. FFN layers act more as static "memory" and are less critical for domain shift.
Why α=64 with r=8? The effective scaling ratio is α/r = 8, which amplifies the LoRA update magnitude. This is particularly important for petroleum-domain adaptation, where the lexical shift from general-domain text is large. Our hyperparameter search confirmed that α dominates performance (importance = 0.97 via fANOVA).
Why freeze the base model? Freezing preserves the multilingual representations learned during pre-training, preventing catastrophic forgetting while allowing efficient domain specialization with minimal compute.