Full Parameter Fine-Tuning
Full parameter fine-tuning refers to adjusting all of a pre-trained model's weights for a specific task. In this process, the original weight matrix is optimized based on data from the current task, so the model adjusts its weights to better suit that task (e.g., sentiment analysis or image classification).
Step 1: Initialize Weights
The pre-trained model already has initial weights $W_0$, which were learned during pre-training. These weights contain some general knowledge, such as basic image features (e.g., edges, colors) or relationships between words in text.
For example, assume a small initial weight matrix $W_0$ (a concrete stand-in appears in the sketch below). This matrix contains the original parameters learned during pre-training. However, it may not be fully optimized for your specific task, so fine-tuning is required.
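To make the steps concrete, here is a minimal NumPy sketch of such a matrix. The $2 \times 2$ values are illustrative stand-ins, not weights from any real model; the same tiny matrix is reused in the sketches for Steps 2 and 3.

```python
import numpy as np

# Hypothetical 2 x 2 slice of pre-trained weights (illustrative values only).
W0 = np.array([[0.8, 0.1],
               [0.4, 0.6]])
```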
Step 2: Calculate the Update Matrix
During the fine-tuning process, we adjust the weights based on the current task's data. The update matrix $\Delta W$ represents the changes to the original weights. It is computed via backpropagation: the model calculates the gradients (derivatives of the loss function) with respect to each weight, and these gradients indicate how much each weight should change to minimize the loss.
For example, the update matrix $\Delta W$ holds one small adjustment for each of the original weights, learned during the fine-tuning process (illustrative values appear in the sketch below).
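Continuing the sketch, the update matrix can be produced by a plain SGD step. The gradient values here are made up for illustration:

```python
import numpy as np

# Backpropagation yields dL/dW for each weight; a plain SGD step scales it by
# the learning rate to produce the update matrix (gradient values are made up).
lr = 0.01
grad = np.array([[0.5, -0.2],
                 [0.1,  0.3]])
delta_W = -lr * grad
print(delta_W)  # [[-0.005  0.002]
                #  [-0.001 -0.003]]
```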
Step 3: Update the Weights
The updated weight matrix is obtained by adding the update matrix to the original weight matrix:

$$W_{\text{new}} = W_0 + \Delta W$$

The addition is element-wise: each entry of $\Delta W$ is added to the corresponding entry of $W_0$.
In full parameter fine-tuning, every weight parameter is updated (either increased or decreased). In this example, the updated weight matrix $W_{\text{new}}$ is obtained by adding the update matrix $\Delta W$ to the original weight matrix $W_0$; therefore all the parameters are updated, which is a key characteristic of full parameter fine-tuning.
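A self-contained sketch of this element-wise update, reusing the illustrative values from the earlier steps:

```python
import numpy as np

W0 = np.array([[0.8, 0.1],
               [0.4, 0.6]])              # pre-trained weights (illustrative)
delta_W = np.array([[-0.005,  0.002],
                    [-0.001, -0.003]])   # update matrix from Step 2

W_new = W0 + delta_W  # element-wise addition: all four parameters change
print(W_new)          # [[0.795 0.102]
                      #  [0.399 0.597]]
```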
LoRA (Low-Rank Adaptation)
LoRA is a method designed to efficiently fine-tune large pre-trained models with reduced computational cost. Instead of updating all the parameters of a model during fine-tuning, LoRA adds two small matrices, A and B, which adapt the pre-trained model's weights with a low-rank approximation.
Weight Update Formula:

$$W_{\text{new}} = W_0 + \Delta W = W_0 + BA$$

Where:
- $W_0 \in \mathbb{R}^{d \times d}$ is the original weight matrix.
- $B \in \mathbb{R}^{d \times r}$ is a trainable low-rank matrix.
- $A \in \mathbb{R}^{r \times d}$ is a trainable low-rank matrix.
- $r$ is the rank, which is much smaller than $d$ (typically $r \ll d$); see the code sketch below.
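As a sketch of how this decomposition looks in code, here is a minimal PyTorch module that freezes a square $d \times d$ linear layer and trains only B and A. The class name, zero/Gaussian initialization, and layer shape are assumptions for illustration, not a reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Minimal LoRA sketch: W_new = W0 + B @ A, with W0 frozen.
    def __init__(self, d: int, r: int):
        super().__init__()
        self.W0 = nn.Linear(d, d, bias=False)
        self.W0.weight.requires_grad = False             # pre-trained weights frozen
        self.B = nn.Parameter(torch.zeros(d, r))         # B starts at zero, so the
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)  # initial update BA is zero

    def forward(self, x):
        # x @ (B @ A)^T, computed in the cheap order (batch x r intermediate)
        return self.W0(x) + (x @ self.A.T) @ self.B.T

layer = LoRALinear(d=100, r=2)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 400: only B and A are trained
```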
Number of Parameters in LoRA:
The number of parameters to be trained in LoRA is:

$$d \times r + r \times d = 2dr$$

Where:
- $d$ is the original dimension of the weight matrix.
- $r$ is the rank of the low-rank adaptation matrices.
Parameter Comparison Example
Assume that:
- $d = 100$ (dimension of the original weight matrix).
- $r = 2$ (rank).
For full parameter fine-tuning: $d \times d = 100 \times 100 = 10{,}000$ trainable parameters.
For LoRA fine-tuning: $2 \times d \times r = 2 \times 100 \times 2 = 400$ trainable parameters.
This means that LoRA reduces the number of parameters to train by a factor of 25.
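The arithmetic can be checked in a few lines:

```python
d, r = 100, 2
full_params = d * d          # 10,000: every entry of the d x d update matrix
lora_params = d * r + r * d  # 400: entries of B (d x r) plus A (r x d)
print(full_params // lora_params)  # 25
```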
Matrix Representation:
- Full Parameter Fine-Tuning: $W_{\text{new}} = W_0 + \Delta W$
  - The update matrix $\Delta W$ is $100 \times 100$, and we train 10,000 parameters.
- LoRA Fine-Tuning: $W_{\text{new}} = W_0 + BA$
  - The update involves $B$ ($100 \times 2$) and $A$ ($2 \times 100$), resulting in 400 parameters to train.
Numerical Example (Partial Matrices)
Assume learned matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$. The update matrix $\Delta W$ is calculated as the product:

$$\Delta W = BA$$

Finally, the updated weight matrix is:

$$W_{\text{new}} = W_0 + BA$$
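The sketch below uses random stand-in values for B and A (not the original numeric entries); the point is the shape and rank of the resulting update:

```python
import numpy as np

d, r = 100, 2
rng = np.random.default_rng(0)
W0 = rng.normal(size=(d, d))   # pre-trained weights (random stand-ins)
B = rng.normal(size=(d, r))    # d x r, trainable
A = rng.normal(size=(r, d))    # r x d, trainable

delta_W = B @ A                # a full d x d update, but with rank at most r
W_new = W0 + delta_W
print(np.linalg.matrix_rank(delta_W))  # 2
print(B.size + A.size)                 # 400 trainable parameters
```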
Determining the Rank r
The rank r is a crucial hyperparameter in LoRA, as it controls the size of the low-rank matrices A and B. Here's how r impacts the model:
Role of r:
- r determines the dimensionality of the matrices A and B that adjust the pre-trained model's weights.
- A has dimensions $r \times d$, and B has dimensions $d \times r$, where d is the original weight matrix dimension.
How to Choose r:
- Empirical Selection: Typically, r is chosen to be relatively small. Common values are r = 2, 4, 8, or 16.
- Efficiency vs. Performance:
- If r is too small, it may not capture enough information, leading to reduced performance.
- If r is too large, the advantages of LoRA diminish, as it approaches full parameter fine-tuning in terms of the number of parameters trained, which would make it computationally expensive.
Balance Between Performance and Efficiency:
- For simple tasks or smaller models, a smaller rank such as r = 2 or r = 4 is often sufficient and much more efficient.
- For complex tasks, a larger rank such as r = 8 or r = 16 may be necessary to capture enough task-specific information and improve performance.
Experimentation:
- Selecting the optimal rank r is usually an experimental process: start with a small r, then gradually increase it and test performance on a validation set to determine the best value (see the sketch below).
- In short, r is a hyperparameter controlling the trade-off between training efficiency and model performance. It is usually kept small (2, 4, 8, or 16) so that training stays efficient while the model still performs well, and empirical testing is the best way to select it for a specific task.
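A sketch of that search loop, where train_and_evaluate is a placeholder for your own LoRA fine-tuning run:

```python
def train_and_evaluate(rank: int) -> float:
    # Placeholder: fine-tune with LoRA at this rank and return a validation
    # metric. The constant below only keeps the sketch runnable as written.
    return 0.0

candidate_ranks = [2, 4, 8, 16]
scores = {r: train_and_evaluate(r) for r in candidate_ranks}
best_r = max(scores, key=scores.get)
print(best_r, scores)
```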