Machine translation - RNN&LSTM
Recurrent Neural Networks (RNN):
  • Structure: Input → Hidden State → Output
  • Characteristics: RNNs are a special type of neural network with a feedback mechanism that carries a hidden state from one time step to the next, allowing them to retain information about previous inputs. This memory makes them particularly suited to sequential data such as language.
RNNs process a sequence one step at a time, passing information from each state to the next, which makes them a natural fit for tasks like language modeling, text generation, speech recognition, and machine translation, where the context of previous inputs is crucial.
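As a minimal sketch of this recurrence (a vanilla RNN cell with a tanh activation; the shapes and random values below are illustrative assumptions, not values from this post):
```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b_h):
    """One vanilla RNN step: mix the current input with the previous
    hidden state, then squash the result with tanh."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b_h)

# Illustrative sizes: 3-dimensional inputs and a 3-dimensional hidden state.
rng = np.random.default_rng(0)
W_x, W_h, b_h = rng.normal(size=(3, 3)), rng.normal(size=(3, 3)), np.zeros(3)

h = np.zeros(3)                       # initial hidden state h_0
for x_t in rng.normal(size=(4, 3)):   # a toy sequence of 4 input vectors
    h = rnn_step(x_t, h, W_x, W_h, b_h)
print(h)                              # the final hidden state summarizes the sequence
```
The same state h is both the output of one step and the input to the next, which is exactly the feedback loop described above.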
In Neural Machine Translation (NMT), encoding is a critical step that determines how the input sentence is transformed into a vector representation that is suitable for translation.

What is Encoding?

Encoding is the process of converting an input sentence (e.g., a sentence in one language) into a vector (numerical representation). This vector is a compressed representation of the input, containing all the information needed for translation. The encoding process occurs in the encoder part of the NMT model, which typically uses RNNs or more advanced models like Long Short-Term Memory (LSTM) networks or Gated Recurrent Units (GRUs).
Example: Encoding the Sentence "Interesting Machine Learning!"
We will illustrate how to encode the sentence "Interesting Machine Learning!" through the following steps:
  1. Input Sentence: The sentence "Interesting Machine Learning!" is in English.
  2. Tokenization: The sentence is split into words (or subwords). For example: ["Interesting", "Machine", "Learning", "!"]
  3. Word Embeddings: Each word or subword is converted into a vector using pre-trained word embeddings (e.g., Word2Vec or GloVe). The sentence is then transformed into a sequence of vectors:
      • "Interesting" → [0.12, 0.45, ..., 0.88]
      • "Machine" → [0.67, 0.24, ..., 0.98]
      • "Learning" → [0.54, 0.33, ..., 0.76]
      • "!" → [0.44, 0.87, ..., 0.55]
  4. Input to RNN Encoder: These word embedding vectors are input into the RNN. At each time step, the RNN updates its hidden state based on the input word and the previous hidden state. After processing the entire sentence, the RNN generates a final context vector that is a compressed representation of the entire sentence. This context vector contains all the crucial information for translation and is passed to the decoder for generating the translated output. (A small code sketch of this pipeline follows below.)
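A runnable sketch of this pipeline, reusing shortened 3-dimensional versions of the example embeddings above; the encoder weights are randomly initialized stand-ins, so the resulting context vector is illustrative only:
```python
import numpy as np

# Shortened 3-dimensional stand-ins for the example embeddings above.
embeddings = {
    "Interesting": np.array([0.12, 0.45, 0.88]),
    "Machine":     np.array([0.67, 0.24, 0.98]),
    "Learning":    np.array([0.54, 0.33, 0.76]),
    "!":           np.array([0.44, 0.87, 0.55]),
}

tokens = ["Interesting", "Machine", "Learning", "!"]   # step 2: tokenization

# Step 4: assumed encoder parameters (random stand-ins, not trained values).
rng = np.random.default_rng(42)
W_x, W_h, b_h = rng.normal(size=(3, 3)), rng.normal(size=(3, 3)), np.zeros(3)

h = np.zeros(3)                                        # initial hidden state
for token in tokens:
    x_t = embeddings[token]                            # step 3: look up the embedding
    h = np.tanh(W_x @ x_t + W_h @ h + b_h)             # RNN update

context_vector = h    # compressed representation handed to the decoder
print(context_vector)
```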

Sequence to Sequence (Seq2Seq)

1. Introduction
In Seq2Seq tasks like machine translation, we are given an input sequence (e.g., a sentence in one language) and tasked with generating an output sequence (e.g., the translation of that sentence in another language).
Example:
  • Input: "I love ice cream."
  • Output: "J'adore la glace." (French translation)
The goal of Seq2Seq models is to generate the most probable output sequence given the input sequence, i.e., maximizing the conditional probability $P(y \mid x)$, where:
  • x is the input (e.g., "I love ice cream.")
  • y is the target (e.g., "J'adore la glace.")
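Seq2Seq models usually factorize this probability token by token (the standard autoregressive factorization used by such models):
$$P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_1, \ldots, y_{t-1}, x)$$
The decoder therefore predicts one target token at a time, conditioned on the input sentence and on everything it has generated so far.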
 
2. Encoder-Decoder Framework
The Encoder-Decoder framework is commonly used in Seq2Seq tasks. The encoder reads the entire input sequence and encodes it into a vector (or set of vectors), which serves as the summary of the input. The decoder then uses this summary to generate the output sequence.
  • Encoder: Reads the input sequence and produces a fixed-length vector (for example, with RNN or LSTM).
  • Decoder: Generates the output sequence, using the encoder's representation and previously generated tokens.
Example:
For the sentence "I love ice cream," the encoder produces a context vector that contains information about the entire input sentence. The decoder then uses this context to generate the translated sentence "J'adore la glace."
3. Training with Cross-Entropy Loss
During training, Seq2Seq models learn to predict the next token in the sequence given the previous tokens. The cross-entropy loss is used to compare the predicted probability distribution with the actual token.
Example:
If the target sequence is "J'adore la glace" and the model assigns the correct tokens the following probabilities:
  • "J'" with a probability of 0.7,
  • "adore" with a probability of 0.6,
  • "la" with a probability of 0.8,
  • "glace" with a probability of 0.9,
then the cross-entropy loss measures how far these predicted probabilities are from the ideal value of 1 for each correct token (the worked calculation below makes this concrete).
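Treating the listed numbers as the probabilities the model assigns to each correct token, the cross-entropy for the sequence is the sum of their negative logs:
$$\mathcal{L} = -(\ln 0.7 + \ln 0.6 + \ln 0.8 + \ln 0.9) \approx 0.357 + 0.511 + 0.223 + 0.105 \approx 1.196$$
or about 0.30 per token on average; the loss shrinks toward 0 as the predicted probabilities for the correct tokens approach 1.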
 

Detailed Calculation Process for RNN in Neural Machine Translation (NMT)

In this example, we will walk through the step-by-step process of calculating the hidden states and output in a Recurrent Neural Network (RNN), which is commonly used in Neural Machine Translation (NMT).
We assume the following values for the calculation:
  • Input vectors (each word is encoded as a 3-dimensional vector):
    • $x_1$ (corresponding to the word "I")
    • $x_2$ (corresponding to the word "want")
    • $x_3$ (corresponding to the word "to")
    • $x_4$ (corresponding to the word "go")
  • Weight Matrices:
    • $W_x$ (weights from the input to the hidden state)
    • $W_h$ (weights from the previous hidden state to the current hidden state)
    • $W_y$ (weights from the hidden state to the output)
  • Biases:
    • $b_h$ (bias for the hidden state)
    • $b_y$ (bias for the output)

Step 1: Time Step 1 Calculation (Input "I")
  • Initial hidden state: $h_0$
  • Input vector: $x_1$
Calculate Hidden State:
The hidden state at time step $t$ is calculated using the following equation:
$$h_t = \tanh(W_x x_t + W_h h_{t-1} + b_h)$$
For time step 1:
$$h_1 = \tanh(W_x x_1 + W_h h_0 + b_h)$$

Step 2: Time Step 2 Calculation (Input "want")
  • Previous hidden state: $h_1$
  • Input vector: $x_2$
Calculate Hidden State:
$$h_2 = \tanh(W_x x_2 + W_h h_1 + b_h)$$

Step 3: Time Step 3 Calculation (Input "to")
  • Previous hidden state: $h_2$
  • Input vector: $x_3$
Calculate Hidden State:
$$h_3 = \tanh(W_x x_3 + W_h h_2 + b_h)$$

Step 4: Time Step 4 Calculation (Input "go")
  • Previous hidden state: $h_3$
  • Input vector: $x_4$
Calculate Hidden State:
$$h_4 = \tanh(W_x x_4 + W_h h_3 + b_h)$$


Step 5: Output Calculation
  • Final hidden state: $h_4$
Now, we calculate the output vector using the final hidden state:
$$y = W_y h_4 + b_y$$
This output vector $y$ is what the model uses for word prediction.

Step 6: Softmax Transformation
For the next step, we would typically apply the softmax function to convert the output vector into probabilities for word prediction. However, for simplicity, we will skip the softmax calculation in this example.
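Steps 1–6 can be reproduced end to end with a short script; since the concrete vectors and matrices in this walkthrough are only assumed, every number below is a made-up placeholder rather than a value from the text:
```python
import numpy as np

# Illustrative 3-dimensional embeddings for "I want to go" (assumed values).
inputs = {
    "I":    np.array([0.1, 0.2, 0.3]),
    "want": np.array([0.4, 0.1, 0.2]),
    "to":   np.array([0.3, 0.3, 0.1]),
    "go":   np.array([0.5, 0.2, 0.4]),
}

# Assumed encoder parameters.
W_x = np.array([[0.5, 0.1, 0.0],
                [0.2, 0.4, 0.1],
                [0.0, 0.3, 0.6]])
W_h = np.array([[0.3, 0.2, 0.1],
                [0.1, 0.5, 0.2],
                [0.2, 0.1, 0.4]])
b_h = np.array([0.1, 0.2, 0.3])
W_y = np.array([[0.4, 0.3, 0.2]])   # hidden state -> 1-dimensional output, for simplicity
b_y = np.array([0.05])

h = np.zeros(3)                     # Step 1 starts from h_0 = 0
for word in ["I", "want", "to", "go"]:             # Steps 1-4
    h = np.tanh(W_x @ inputs[word] + W_h @ h + b_h)
    print(word, "->", np.round(h, 3))

y = W_y @ h + b_y                   # Step 5: output from the final hidden state
print("output:", np.round(y, 3))    # Step 6 (softmax) is skipped, as in the text
```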

Let's now walk through the Decoder phase in detail using the final hidden state from the Encoder, $h_4$.
We'll use this as the initial hidden state of the Decoder, and simulate the generation of a translated output sequence:
Target sentence: <SOS> → Je → veux → aller → <EOS>

Decoder Setup
We'll use assumptions similar to those for the Encoder:
  • Word embeddings (3-dimensional) for Decoder input tokens.
  • The same RNN structure as the encoder (same dimensions and activation).
  • Each time step of the Decoder generates one output word using:
$$s_t = \tanh(W_x x_t + W_h s_{t-1} + b_h), \qquad y_t = W_y s_t + b_y$$
where $x_t$ is the embedding of the Decoder input token at step $t$ and $s_0 = h_4$ is the final hidden state of the Encoder.
Assumed Decoder Embeddings
Let’s assign embeddings for decoder input tokens:
  • <SOS> = [0.5, 0.1, 0.0]
  • Je = [0.2, 0.4, 0.1]
  • veux = [0.6, 0.3, 0.2]
  • aller = [0.7, 0.5, 0.4]
We also use the same:
  • $W_x$, $W_h$, $b_h$, $W_y$, and $b_y$ as before.
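A matching sketch of the teacher-forced Decoder pass is below; the input embeddings are the ones listed above, but the weights, the vocabulary ordering, and the initial hidden state are illustrative assumptions, so the printed scores will not reproduce the numbers in the following steps:
```python
import numpy as np

# Decoder input embeddings taken from the list above.
embed = {
    "<SOS>": np.array([0.5, 0.1, 0.0]),
    "Je":    np.array([0.2, 0.4, 0.1]),
    "veux":  np.array([0.6, 0.3, 0.2]),
    "aller": np.array([0.7, 0.5, 0.4]),
}
vocab = ["Je", "veux", "aller", "<EOS>"]          # assumed target vocabulary order

# Assumed decoder parameters ("the same as the encoder" in the text; made up here).
rng = np.random.default_rng(1)
W_x, W_h = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
b_h = np.zeros(3)
W_y, b_y = rng.normal(size=(len(vocab), 3)), np.zeros(len(vocab))

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

h = np.array([0.3, 0.2, 0.1])                     # stand-in for the encoder's h_4
inputs  = ["<SOS>", "Je", "veux", "aller"]        # teacher forcing: feed the gold words
targets = ["Je", "veux", "aller", "<EOS>"]

total_loss = 0.0
for x_word, y_word in zip(inputs, targets):
    h = np.tanh(W_x @ embed[x_word] + W_h @ h + b_h)   # decoder hidden state
    probs = softmax(W_y @ h + b_y)                     # distribution over the vocabulary
    loss = -np.log(probs[vocab.index(y_word)])         # cross-entropy for the gold word
    total_loss += loss
    print(f"{x_word:>6} -> predict {vocab[int(np.argmax(probs))]:>6}, "
          f"p({y_word}) = {probs[vocab.index(y_word)]:.3f}, loss = {loss:.3f}")
print("average loss per token:", round(total_loss / len(targets), 3))
```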

Decoder Step 1: (input = <SOS>, output = Je)
  • Initial Hidden State: $s_0 = h_4$ (the final hidden state from the Encoder)
  • Input Vector: the embedding of <SOS>, $[0.5, 0.1, 0.0]$
Compute Hidden State:
  1. Term 1: $W_x x_1$
  2. Term 2: $W_h s_0$
  3. Adding Bias: $W_x x_1 + W_h s_0 + b_h$
  4. Activation (tanh): $s_1 = \tanh(W_x x_1 + W_h s_0 + b_h)$
Compute Output:
  1. Output Calculation ($y_1 = W_y s_1 + b_y$): the model predicts "Je", since the score for "Je" is the highest.
  2. Cross-Entropy Loss: now, for calculating the cross-entropy loss at this step:
      • The true target is Je. We assume a one-hot encoding of the target word, where the probability for Je is 1 and all other words are 0.
      • The predicted probability for Je is 0.514 (the loss is worked out just below).
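With a one-hot target, the cross-entropy at this step reduces to the negative log of the probability assigned to Je, which matches the 0.666 loss reported in the summary table at the end:
$$\mathcal{L}_1 = -\ln 0.514 \approx 0.666$$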

Decoder Step 2: (input = Je, output = veux)
  • Previous Hidden State: $s_1$
  • Input Vector: the embedding of Je, $[0.2, 0.4, 0.1]$
Compute Hidden State:
  1. Term 1: $W_x x_2$
  2. Term 2: $W_h s_1$
  3. Adding Bias: $W_x x_2 + W_h s_1 + b_h$
  4. Activation (tanh): $s_2 = \tanh(W_x x_2 + W_h s_1 + b_h)$
Compute Output:
  1. Output Calculation ($y_2 = W_y s_2 + b_y$): the model predicts "veux" because the score for "veux" is the highest.
  2. Cross-Entropy Loss: For veux:
      • The true target is veux, so its one-hot probability is 1, with 0 for all others.
      • The predicted probability for veux is 0.4818.

Decoder Step 3: (input = veux, output = aller)
  • Previous Hidden State: $s_2$
  • Input Vector: the embedding of veux, $[0.6, 0.3, 0.2]$
Compute Hidden State:
  1. Term 1: $W_x x_3$
  2. Term 2: $W_h s_2$
  3. Adding Bias: $[0.27 + 0.374 + 0.1,\; 0.39 + 0.377 + 0.2,\; 0.51 + 0.511 + 0.3] = [0.744, 0.967, 1.321]$
  4. Activation (tanh): $s_3 = \tanh([0.744, 0.967, 1.321]) \approx [0.632, 0.747, 0.867]$
Compute Output:
  1. Output Calculation ($y_3 = W_y s_3 + b_y$): the model predicts "aller" because the score for "aller" is the highest.
  2. Cross-Entropy Loss: For aller:
      • The true target is aller, so its one-hot probability is 1, with 0 for all others.
      • The predicted probability for aller is 0.5031.
Decoder Step 4: (input = aller, output = <EOS>)
  • Previous Hidden State: $s_3$
  • Input Vector: the embedding of aller, $[0.7, 0.5, 0.4]$
Compute Hidden State:
  1. Term 1: $W_x x_4$
  2. Term 2: $W_h s_3$
  3. Adding Bias: $W_x x_4 + W_h s_3 + b_h$
  4. Activation (tanh): $s_4 = \tanh(W_x x_4 + W_h s_3 + b_h)$
Compute Output:
  1. Output Calculation ($y_4 = W_y s_4 + b_y$): the model predicts <EOS> because it has the highest score among all tokens.
  2. Cross-Entropy Loss: For <EOS>:
      • The true target is <EOS>, so its one-hot probability is 1, with 0 for all others.
      • The predicted probability for <EOS> is 0.5046.

Final Output Sequence
| Step | Input Word | Output Word | Logit / Score | Predicted Prob. | Cross-Entropy Loss |
| --- | --- | --- | --- | --- | --- |
| 1 | <SOS> | Je | 0.514 | 0.514 | 0.666 |
| 2 | Je | veux | 0.4818 | 0.4818 | 0.733 |
| 3 | veux | aller | 0.5031 | 0.5031 | 0.686 |
| 4 | aller | <EOS> | 0.5018 | 0.5046 | 0.683 |
A total loss of 2.768 over 4 tokens gives an average loss per token of about 0.692, which roughly corresponds to a per-token prediction confidence of about 50–52%. This shows the model is learning, since it is doing better than random guessing.
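Spelling out that arithmetic:
$$\bar{\mathcal{L}} = \frac{0.666 + 0.733 + 0.686 + 0.683}{4} = \frac{2.768}{4} \approx 0.692, \qquad e^{-0.692} \approx 0.50$$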