1. The Earliest "Word-for-Word" Translation Method
Concept:
The word-for-word translation method is one of the earliest approaches to machine translation. In this method, each word in the source language is directly replaced with its corresponding word in the target language, typically using a dictionary. However, this approach doesn't take into account grammatical structure or context, so it cannot handle word order variations or changes in meaning effectively.
Problems:
- Word Order Issues: Different languages have different word orders, and word-for-word translation leads to grammatical errors when the word order is not aligned between languages.
- Semantic Issues: Many words have multiple meanings or exist as fixed expressions. Word-for-word translation does not understand the actual context of the words, so it cannot handle nuances like idioms, synonyms, or homonyms properly.
Example:
- Spanish: Quiero ir a la playa más bonita
- Word-for-word (literal) translation: I want to go to the beach more pretty
- Correct translation: I want to go to the prettiest beach
Calculation Process:
- Quiero → I want
- ir → to go
- a la → to the
- playa → beach
- más → more
- bonita → pretty
In this literal translation, the phrase más bonita is rendered word-for-word as "more pretty," which is ungrammatical in English. The correct translation is "prettiest," because in this sentence más bonita functions as a superlative in Spanish. Word-for-word translation therefore fails to apply the grammar rules for comparative and superlative adjectives, which are essential for conveying the intended meaning in both languages.
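To make the dictionary-lookup idea concrete, here is a minimal Python sketch of word-for-word translation. The tiny Spanish-English dictionary is only illustrative, and the function deliberately ignores grammar and context, reproducing exactly the error discussed above.

```python
# A minimal sketch of word-for-word translation: each source word is looked up
# in a bilingual dictionary with no regard for grammar or context.
# The tiny dictionary below is purely illustrative.

DICTIONARY = {
    "quiero": "I want",
    "ir": "to go",
    "a": "to",
    "la": "the",
    "playa": "beach",
    "más": "more",
    "bonita": "pretty",
}

def word_for_word(sentence: str) -> str:
    words = sentence.lower().split()
    # Unknown words are passed through unchanged.
    return " ".join(DICTIONARY.get(w, w) for w in words)

print(word_for_word("Quiero ir a la playa más bonita"))
# -> "I want to go to the beach more pretty"  (ungrammatical, as discussed above)
```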
2. Rule-Based Translation Systems
Concept:
Rule-based translation systems rely on linguists to manually create a large set of language rules, such as grammatical rules, word order rules, and others, which are then applied to translate sentences from a source language to a target language. The system uses predefined rules to convert various components of the source sentence into their corresponding parts in the target language. These rules are typically based on linguistic principles, covering aspects such as word order, word class changes, tense, and voice.
Problems:
- Limited Coverage: To cover all language expressions, a massive number of rules must be written. However, colloquial expressions, slang, or non-standard language often cannot be anticipated and may not be properly translated.
- High Maintenance Cost: As more languages are added, the rule set becomes increasingly complex and large, requiring a significant amount of time and effort to update and maintain.
- Poor Adaptability: When dealing with complex sentence structures, ambiguities, or idiomatic expressions, rule-based systems struggle to produce accurate translations.
Historical Background:
Rule-based translation systems were widely used during the Cold War period, especially in government and military sectors. Linguists designed many of these systems to translate foreign documents (e.g., Russian, German) into the target language. These systems were primarily applied to technical documents, diplomatic communications, and other formal, structured texts that followed specific language patterns.
Example:
If we need to translate English into French, early rule-based methods relied on the following rules:
- Word Order Rules:
- English typically follows a subject-verb-object (SVO) structure, while French also usually follows the SVO pattern. However, certain structures (e.g., questions, inverted sentences) require special handling with additional rules.
- Word Class Rules:
- While English nouns have no gender, French nouns are either masculine or feminine. Therefore, it is necessary to apply the correct article or adjective form based on the gender of the noun. For instance, in translating "the dog," the gender in French would determine that it should be "le chien" (masculine).
Calculation Process:
Input Sentence: I want to go to the beach
Step 1: Handling Word Order
The system applies its rules to convert the words:
- I → Je
- want → veux
- to go → aller
- to the → à la
- beach → plage
Step 2: Apply Word Class Rules
Since plage ("beach") is a feminine noun in French, the system selects the feminine article, producing "à la" rather than "au".
Generated French Sentence:
Je veux aller à la plage
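Below is a minimal sketch of how such a rule-based pipeline might look in code. The lexicon, the noun-gender table, and the two hand-written rules (the verb phrase "to go" and the gendered article for "to the") are illustrative assumptions, not a description of any real system.

```python
# A minimal sketch of a rule-based pipeline: a bilingual lexicon plus a
# hand-written gender-agreement rule for "to the <noun>".
# Lexicon, genders, and rules are illustrative assumptions only.

LEXICON = {"I": "je", "want": "veux", "to go": "aller", "beach": "plage", "dog": "chien"}
GENDER = {"plage": "f", "chien": "m"}  # French noun genders

def translate(sentence: str) -> str:
    # Keep SVO order; handle "to go" and "to the <noun>" with special rules.
    tokens = sentence.split()
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "to" and i + 1 < len(tokens) and tokens[i + 1] == "go":
            out.append(LEXICON["to go"])
            i += 2
        elif tokens[i] == "to" and i + 2 < len(tokens) and tokens[i + 1] == "the":
            noun = LEXICON.get(tokens[i + 2], tokens[i + 2])
            article = "à la" if GENDER.get(noun) == "f" else "au"  # gender agreement rule
            out.append(f"{article} {noun}")
            i += 3
        else:
            out.append(LEXICON.get(tokens[i], tokens[i]))
            i += 1
    return " ".join(out).capitalize()

print(translate("I want to go to the beach"))  # -> "Je veux aller à la plage"
```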
Complex Sentence Issues:
Although simple sentences can be effectively handled by rule-based translation systems, when sentences become more complex, especially those involving idiomatic expressions or regional dialects, the system faces difficulties.
Example 1: Idiomatic Expressions
- English:
He kicked the bucket
- Literal Translation:
Il a donné un coup de pied au seau
- Problem: This is an idiom in English meaning "He passed away," but a rule-based system would fail to recognize the idiomatic meaning, resulting in a misleading translation.
Example 2: Complex Sentence Structures
- English:
The man who called you yesterday is my friend
- Literal Translation:
L'homme qui vous a appelé hier est mon ami
- This sentence is relatively simple and rule-based systems can handle it effectively. However, as the sentence becomes more complex, involving multiple clauses or additional descriptive elements, rule-based systems struggle.
For example:
- English:
The man who I met at the party, who was wearing a red shirt, is my friend
- Problem: This sentence has multiple clauses and added descriptions, which can overwhelm the rule-based system, leading to incomplete or incorrect translations.
Conclusion:
Rule-based translation systems depend on linguistically defined rules to translate sentences, making them well-suited for simple, structured texts. However, they face significant challenges in translating complex sentence structures, colloquial expressions, ambiguities, idioms, and metaphorical language. As a result, their applicability and efficiency are limited when faced with the unpredictable nature of natural language.
3. Statistical Machine Translation (SMT)
Concept: Statistical machine translation (SMT) relies on the use of parallel corpora, which are pairs of translated texts in two languages. By analyzing the frequency of word block (phrase) translations in these corpora, SMT models compute the probability of one language's word block translating into another language's word block.
Explanation of Key Terms:
- Parallel Corpus: A collection of text in two languages that are translations of each other. For example, the European Parliament's multilingual records.
Breakdown of the Translation Steps
Step 1: Chunking
- This involves breaking a sentence into smaller chunks (word blocks or phrases) that are easier to translate.
Step 2: Finding Multiple Translations for Each Chunk
- For each chunk, we look for possible translations from the parallel corpus, and assign weights based on the frequency of each translation appearing.
Step 3: Combining and Generating All Possible Sentences
- After obtaining the translations for each chunk, we combine them to generate all possible sentence structures. A language model is used to select the most natural one.
Example:
For the English sentence "I want to go to the beach today", the chunks would be broken down as follows:
- English chunks: I want | to go | to the | beach | today
- French chunks: Je veux | aller | à la | plage | aujourd'hui
Step 1: Chunking:
We split the sentence into smaller chunks that are easier to translate.
Step 2: Frequency Calculation:
Next, based on parallel corpora, we calculate the frequency of each chunk's translation. For example:
- I want → Je veux: occurs 400 times out of 500
- to go → aller: occurs 350 times out of 450
- to the → à la: occurs 480 times out of 500
- beach → plage: occurs 500 times out of 600
- today → aujourd'hui: occurs 450 times out of 500
From these counts, we can calculate translation probabilities.
Step 3: Probability Calculation:
For each chunk, the translation probability is estimated from its relative frequency in the corpus:

$$P(\text{target chunk} \mid \text{source chunk}) = \frac{\text{count}(\text{source chunk} \rightarrow \text{target chunk})}{\text{count}(\text{source chunk})}$$

For example:

$$P(\text{Je veux} \mid \text{I want}) = \frac{400}{500} = 0.8, \qquad P(\text{aller} \mid \text{to go}) = \frac{350}{450} \approx 0.78$$
Step 4: Generating All Possible Sentences:
We can now generate multiple candidate sentences by combining the translated chunks. For example:
- Je veux aller à la plage aujourd'hui
- Je veux partir à la plage aujourd'hui (incorrect phrase structure)
Each possible sentence's probability is the product of the probabilities of translating each chunk:

$$P(\text{sentence}) = \prod_{i} P(\text{target chunk}_i \mid \text{source chunk}_i)$$

Using the values we calculated earlier:

$$P = \frac{400}{500} \times \frac{350}{450} \times \frac{480}{500} \times \frac{500}{600} \times \frac{450}{500} \approx 0.448$$
This gives us a probability score for the sentence.
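As a concrete illustration of Steps 3 and 4, the sketch below turns the corpus counts quoted above into chunk translation probabilities and multiplies them to score a candidate sentence.

```python
# A minimal sketch of the phrase-table idea: translation probabilities are
# relative frequencies from a parallel corpus, and a candidate sentence's score
# is the product of its chunk probabilities. Counts are the ones quoted above.
from math import prod

# (source chunk, target chunk): (co-occurrence count, source chunk count)
COUNTS = {
    ("I want", "Je veux"): (400, 500),
    ("to go", "aller"): (350, 450),
    ("to the", "à la"): (480, 500),
    ("beach", "plage"): (500, 600),
    ("today", "aujourd'hui"): (450, 500),
}

def chunk_prob(src: str, tgt: str) -> float:
    pair_count, total = COUNTS[(src, tgt)]
    return pair_count / total

candidate = [("I want", "Je veux"), ("to go", "aller"),
             ("to the", "à la"), ("beach", "plage"), ("today", "aujourd'hui")]

score = prod(chunk_prob(s, t) for s, t in candidate)
print(f"{score:.3f}")  # ~0.448
```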
Step 5: Language Model Selection:
After generating multiple sentence candidates, a language model is used to select the most natural sentence. The language model calculates the fluency of each sentence in the target language. For example,
Je veux aller à la plage aujourd'hui
is more likely to be natural in French than the other generated candidates.

In summary, Statistical Machine Translation (SMT) computes translation probabilities by analyzing word chunks in parallel corpora. For each chunk, the system estimates how frequently it translates into a chunk in the other language and combines these estimates to generate candidate sentences. The final sentence is selected based on the highest combined probability and its fluency in the target language, as judged by a language model.
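To make the language-model selection in Step 5 concrete, here is a small sketch in which candidates are re-ranked by a toy bigram language model. The bigram probabilities are made-up illustrative numbers, not real corpus statistics.

```python
# A minimal sketch of the re-ranking step: each candidate translation is scored
# by a target-language model, and the most fluent candidate wins. The bigram
# probabilities below are made-up illustrative numbers.
from math import prod

BIGRAM_P = {
    ("veux", "aller"): 0.30,   # "veux aller" is common French
    ("veux", "partir"): 0.05,  # "veux partir à la plage" is less natural here
    ("aller", "à"): 0.40,
    ("partir", "à"): 0.10,
}

def lm_score(sentence: str) -> float:
    words = sentence.lower().split()
    # Unseen bigrams get a small default probability (crude smoothing).
    return prod(BIGRAM_P.get((a, b), 0.01) for a, b in zip(words, words[1:]))

candidates = [
    "Je veux aller à la plage aujourd'hui",
    "Je veux partir à la plage aujourd'hui",
]
print(max(candidates, key=lm_score))  # -> "Je veux aller à la plage aujourd'hui"
```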
4. Neural Machine Translation (NMT)
Neural networks are computational models inspired by the human brain, designed to recognize patterns and relationships in data. They consist of multiple neurons (nodes) connected in layers, which process input data and generate output. Different types of neural networks are suited to different tasks.
Traditional Neural Networks:
- Structure: Input → Output
- Characteristics: Traditional neural networks process input data and generate output but lack the ability to remember previous inputs. This means they don't consider the context of earlier inputs when making predictions.
Recurrent Neural Networks (RNN):
- Structure: Input → Hidden State → Output
- Characteristics: RNNs are a special type of neural network that introduces a feedback mechanism, allowing them to maintain the state of previous inputs. This memory mechanism makes RNNs particularly suited for handling sequential data, such as language.
RNNs are designed to process sequential data, passing information from one state to the next at each time step. This memory feature makes RNNs ideal for tasks like language modeling, text generation, speech recognition, and machine translation, where the context of previous inputs is crucial.
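A single RNN update can be written in a few lines of NumPy. The sketch below uses arbitrary dimensions and random weights; it only illustrates one common formulation of the recurrence, $h_t = \tanh(W_x x_t + W_h h_{t-1} + b_h)$.

```python
# A minimal sketch of a single RNN step: the new hidden state depends on the
# current input and the previous hidden state. Dimensions and weights here are
# arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W_x = rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_x x_t + W_h h_{t-1} + b_h)
    return np.tanh(W_x @ x_t + W_h @ h_prev + b_h)

h = np.zeros(hidden_dim)                          # initial hidden state
for x_t in rng.normal(size=(5, input_dim)):       # a sequence of 5 dummy inputs
    h = rnn_step(x_t, h)
print(h)  # the final hidden state summarizes the whole sequence
```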
In Neural Machine Translation (NMT), encoding is a critical step that determines how the input sentence is transformed into a vector representation that is suitable for translation.
What is Encoding?
Encoding is the process of converting an input sentence (e.g., a sentence in one language) into a vector (numerical representation). This vector is a compressed representation of the input, containing all the information needed for translation. The encoding process occurs in the encoder part of the NMT model, which typically uses RNNs or more advanced models like Long Short-Term Memory (LSTM) networks or Gated Recurrent Units (GRUs).
Example: Encoding the Sentence "Interesting Machine Learning!"
We will illustrate how to encode the sentence "Interesting Machine Learning!" through the following steps:
- Input Sentence: The sentence "Interesting Machine Learning!" is in English.
- Tokenization: The sentence is split into words (or subwords). For example:
["Interesting", "Machine", "Learning", "!"]
- Word Embeddings: Each word or subword is converted into a vector using pre-trained word embeddings (e.g., Word2Vec or GloVe). The sentence is then transformed into a sequence of vectors:
- "Interesting" → [0.12, 0.45, ..., 0.88]
- "Machine" → [0.67, 0.24, ..., 0.98]
- "Learning" → [0.54, 0.33, ..., 0.76]
- "!" → [0.44, 0.87, ..., 0.55]
- Input to RNN Encoder: These word embedding vectors are input into the RNN. At each time step, the RNN updates its hidden state based on the input word and the previous hidden state. After processing the entire sentence, the RNN generates a final context vector that is a compressed representation of the entire sentence. This context vector contains all the crucial information for translation and is passed to the decoder for generating the translated output.
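Putting the four steps together, here is a sketch of the encoder loop for "Interesting Machine Learning!". The embeddings and weights are random placeholders standing in for pre-trained vectors and trained parameters.

```python
# A minimal sketch of the encoding step: each token's embedding is fed through
# an RNN, and the last hidden state serves as the context vector.
# Embeddings and weights are random placeholders, not pre-trained vectors.
import numpy as np

rng = np.random.default_rng(1)
tokens = ["Interesting", "Machine", "Learning", "!"]
emb_dim, hidden_dim = 4, 6

embeddings = {tok: rng.normal(size=emb_dim) for tok in tokens}  # stand-in for Word2Vec/GloVe
W_x = rng.normal(size=(hidden_dim, emb_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for tok in tokens:
    h = np.tanh(W_x @ embeddings[tok] + W_h @ h + b_h)

context_vector = h  # compressed representation passed to the decoder
print(context_vector.shape)  # (6,)
```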
Sequence to Sequence (Seq2Seq)
1. Introduction
In Seq2Seq tasks like machine translation, we are given an input sequence (e.g., a sentence in one language) and tasked with generating an output sequence (e.g., the translation of that sentence in another language).
Example:
- Input: "I love ice cream."
- Output: "J'adore la glace." (French translation)
The goal of Seq2Seq models is to generate the most probable output sequence given the input sequence, i.e., to maximize the conditional probability $P(y \mid x)$, where:
- x is the input (e.g., "I love ice cream.")
- y is the target (e.g., "J'adore la glace.")
2. Encoder-Decoder Framework
The Encoder-Decoder framework is commonly used in Seq2Seq tasks. The encoder reads the entire input sequence and encodes it into a vector (or set of vectors), which serves as the summary of the input. The decoder then uses this summary to generate the output sequence.
- Encoder: Reads the input sequence and produces a fixed-length vector (for example, with RNN or LSTM).
- Decoder: Generates the output sequence, using the encoder's representation and previously generated tokens.

Example:
For the sentence "I love ice cream," the encoder produces a context vector that contains information about the entire input sentence. The decoder then uses this context to generate the translated sentence "J'adore la glace."
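The control flow of the encoder-decoder framework can be sketched as follows. The vocabulary, embeddings, and weights are random placeholders, so with untrained parameters the generated tokens are arbitrary; the point is only to show how the decoder feeds its own predictions back in until it emits <EOS>.

```python
# A minimal sketch of the encoder-decoder loop: the encoder compresses the input
# into a context vector, and the decoder generates tokens one at a time, feeding
# each prediction back in. All parameters here are random placeholders.
import numpy as np

rng = np.random.default_rng(2)
vocab = ["<SOS>", "<EOS>", "J'adore", "la", "glace", "."]
emb_dim, hid = 4, 5
E = {w: rng.normal(size=emb_dim) for w in vocab}            # target embeddings
W_x, W_h = rng.normal(size=(hid, emb_dim)), rng.normal(size=(hid, hid))
W_y = rng.normal(size=(len(vocab), hid))                    # hidden-to-vocab scores

def step(x, h):
    return np.tanh(W_x @ x + W_h @ h)

def encode(src_tokens):
    h = np.zeros(hid)
    for _ in src_tokens:
        h = step(rng.normal(size=emb_dim), h)  # source embeddings: placeholders
    return h

def greedy_decode(context, max_len=10):
    h, tok, out = context, "<SOS>", []
    for _ in range(max_len):
        h = step(E[tok], h)
        tok = vocab[int(np.argmax(W_y @ h))]   # pick the highest-scoring word
        if tok == "<EOS>":
            break
        out.append(tok)
    return out

# With untrained weights the output is arbitrary; this only shows the control flow.
print(greedy_decode(encode("I love ice cream .".split())))
```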
3. Training with Cross-Entropy Loss
During training, Seq2Seq models learn to predict the next token in the sequence given the previous tokens. The cross-entropy loss is used to compare the predicted probability distribution with the actual token.
Example:
If the target sequence is "J'adore la glace," and the model predicts:
- "J'" with a probability of 0.7,
- "adore" with a probability of 0.6,
- "la" with a probability of 0.8,
- "glace" with a probability of 0.9,
The cross-entropy loss measures how well the predicted probabilities match the true target sequence.
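With one-hot targets, the per-token cross-entropy reduces to the negative log of the probability assigned to the correct token. The sketch below applies this to the probabilities quoted in the example above.

```python
# Token-level cross-entropy with one-hot targets: loss = -log(p of correct token).
# The probabilities are the ones quoted in the example above.
import math

probs_of_correct_token = {"J'": 0.7, "adore": 0.6, "la": 0.8, "glace": 0.9}

losses = {tok: -math.log(p) for tok, p in probs_of_correct_token.items()}
total = sum(losses.values())
print(losses)                                  # e.g. -ln(0.7) ≈ 0.357
print(f"average: {total / len(losses):.3f}")   # ≈ 0.299
```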
Detailed Calculation Process for RNN in Neural Machine Translation (NMT)
In this example, we will walk through the step-by-step process of calculating the hidden states and output in a Recurrent Neural Network (RNN), which is commonly used in Neural Machine Translation (NMT).
We assume the following values for the calculation:
Assumed Values:
- Input vectors (each word is encoded as a 3-dimensional vector):
  - $x_1$ (corresponding to the word "I")
  - $x_2$ (corresponding to the word "want")
  - $x_3$ (corresponding to the word "to")
  - $x_4$ (corresponding to the word "go")
- Weight matrices:
  - $W_x$ (weights from input to hidden state)
  - $W_h$ (weights from previous hidden state to current hidden state)
  - $W_y$ (weights from hidden state to output)
- Biases:
  - $b_h$ (bias for the hidden state)
  - $b_y$ (bias for the output)
Step 1: Time Step 1 Calculation (Input "I")
- Initial hidden state: $h_0$
- Input vector: $x_1$
Calculate Hidden State:
The hidden state at time step $t$ is calculated using the following equation:
$$h_t = \tanh(W_x x_t + W_h h_{t-1} + b_h)$$
For time step 1: $h_1 = \tanh(W_x x_1 + W_h h_0 + b_h)$
Step 2: Time Step 2 Calculation (Input "want")
- Previous hidden state: $h_1$
- Input vector: $x_2$
Calculate Hidden State: $h_2 = \tanh(W_x x_2 + W_h h_1 + b_h)$
Step 3: Time Step 3 Calculation (Input "to")
- Previous hidden state: $h_2$
- Input vector: $x_3$
Calculate Hidden State: $h_3 = \tanh(W_x x_3 + W_h h_2 + b_h)$
Step 4: Time Step 4 Calculation (Input "go")
- Previous hidden state: $h_3$
- Input vector: $x_4$
Calculate Hidden State: $h_4 = \tanh(W_x x_4 + W_h h_3 + b_h)$
Step 5: Output Calculation
- Final hidden state: $h_4$
Now, we calculate the output vector using the final hidden state:
$$y = W_y h_4 + b_y$$
Thus, the output is the vector $y$.
Step 6: Softmax Transformation
For the next step, we would typically apply the softmax function to convert the output vector into probabilities for word prediction. However, for simplicity, we will skip the softmax calculation in this example.
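Since the concrete numbers for the input vectors, weight matrices, and biases are not reproduced above, the NumPy sketch below re-runs the four encoder steps with made-up placeholder values; only the 3-dimensional shapes and the update rule follow the walkthrough.

```python
# A NumPy sketch of the four encoder steps above. All numeric values here are
# illustrative assumptions; only the shapes (3-dimensional inputs and hidden
# states) and the update rule h_t = tanh(W_x x_t + W_h h_{t-1} + b_h) follow the text.
import numpy as np

# Assumed 3-dimensional embeddings for "I", "want", "to", "go"
X = [np.array([0.1, 0.2, 0.3]),
     np.array([0.4, 0.1, 0.0]),
     np.array([0.2, 0.5, 0.1]),
     np.array([0.3, 0.3, 0.6])]

# Assumed weights and biases
W_x = np.array([[0.5, 0.1, 0.0],
                [0.2, 0.4, 0.3],
                [0.1, 0.0, 0.6]])
W_h = np.array([[0.3, 0.2, 0.1],
                [0.1, 0.5, 0.2],
                [0.0, 0.3, 0.4]])
b_h = np.array([0.1, 0.2, 0.3])
W_y = np.array([[0.4, 0.1, 0.2],
                [0.3, 0.5, 0.1],
                [0.2, 0.2, 0.6]])
b_y = np.array([0.0, 0.1, 0.2])

h = np.zeros(3)  # h_0
for t, x_t in enumerate(X, start=1):
    h = np.tanh(W_x @ x_t + W_h @ h + b_h)    # h_t = tanh(W_x x_t + W_h h_{t-1} + b_h)
    print(f"h_{t} =", np.round(h, 3))

y = W_y @ h + b_y                              # output from the final hidden state
print("y  =", np.round(y, 3))
```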
Let's now walk through the Decoder phase in detail, using the final hidden state $h_4$ from the Encoder.
We'll use this as the initial hidden state of the Decoder, and simulate the generation of a translated output sequence:
Target sentence:
<SOS> → Je → veux → aller → <EOS>
Decoder Setup
We'll use similar assumptions as in the Encoder:
- Word embeddings (3-dimensional) for Decoder input tokens.
- The same RNN structure as the encoder (same dimensions and activation).
- Each time step of the Decoder generates one output word using:
$$h_t = \tanh(W_x x_t + W_h h_{t-1} + b_h), \qquad y_t = W_y h_t + b_y$$
Assumed Decoder Embeddings
Let’s assign embeddings for decoder input tokens:
- <SOS> → [0.5, 0.1, 0.0]
- Je → [0.2, 0.4, 0.1]
- veux → [0.6, 0.3, 0.2]
- aller → [0.7, 0.5, 0.4]
We also use the same:
- $W_x$, $W_h$, $b_h$, $W_y$, and $b_y$ as before.
Decoder Step 1 (input = <SOS>, output = Je)
- Initial Hidden State: the encoder's final hidden state $h_4$
- Input Vector: [0.5, 0.1, 0.0] (the embedding of <SOS>)
Compute Hidden State:
- Term 1 ($W_x x_t$):
- Term 2 ($W_h h_{t-1}$):
- Adding Bias ($+\, b_h$):
- Activation (tanh):
Compute Output:
- Output Calculation ($y_t = W_y h_t + b_y$):
The model predicts "Je" since the score for "Je" is the highest.
- Cross-Entropy Loss: To calculate the cross-entropy loss at this step:
  - The true target is Je. We assume a one-hot encoding of the target word, so the probability for Je is 1 and all other words are 0.
  - The predicted probability for Je is 0.514.
Decoder Step 2 (input = Je, output = veux)
- Previous Hidden State: the hidden state from Decoder Step 1
- Input Vector: [0.2, 0.4, 0.1] (the embedding of Je)
Compute Hidden State:
- Term 1 ($W_x x_t$):
- Term 2 ($W_h h_{t-1}$):
- Adding Bias:
- Activation (tanh):
Compute Output:
- Output Calculation:
The model predicts "veux" because the score for "veux" is the highest.
- Cross-Entropy Loss: For veux:
  - The true target is veux, so its one-hot probability is 1, with 0 for all others.
  - The predicted probability for veux is 0.4818.
Decoder Step 3 (input = veux, output = aller)
- Previous Hidden State: the hidden state from Decoder Step 2
- Input Vector: [0.6, 0.3, 0.2] (the embedding of veux)
Compute Hidden State:
- Term 1 ($W_x x_t$):
- Term 2 ($W_h h_{t-1}$):
- Adding Bias:
  $[0.27+0.374+0.1,\ 0.39+0.377+0.2,\ 0.51+0.511+0.3] = [0.744,\ 0.967,\ 1.321]$
- Activation (tanh):
Compute Output:
- Output Calculation:
The model predicts "aller" because the score for "aller" is the highest.
- Cross-Entropy Loss: For aller:
  - The true target is aller, so its one-hot probability is 1, with 0 for all others.
  - The predicted probability for aller is 0.5031.
Decoder Step 4 (input = aller, output = <EOS>)
- Previous Hidden State: the hidden state from Decoder Step 3
- Input Vector: [0.7, 0.5, 0.4] (the embedding of aller)
Compute Hidden State:
- Term 1 ($W_x x_t$):
- Term 2 ($W_h h_{t-1}$):
- Adding Bias:
- Activation (tanh):
Compute Output:
- Output Calculation:
The model predicts <EOS> because it has the highest score among all tokens.
- Cross-Entropy Loss: For <EOS>:
  - The true target is <EOS>, so its one-hot probability is 1, with 0 for all others.
  - The predicted probability for <EOS> is 0.5046.
Final Output Sequence
| Step | Input Word | Output Word | Logit / Score | Predicted Prob. | Cross-Entropy Loss |
| --- | --- | --- | --- | --- | --- |
| 1 | <SOS> | Je | 0.514 | 0.514 | 0.666 |
| 2 | Je | veux | 0.4818 | 0.4818 | 0.733 |
| 3 | veux | aller | 0.5031 | 0.5031 | 0.686 |
| 4 | aller | <EOS> | 0.5018 | 0.5046 | 0.683 |
A total loss of 2.768 over 4 tokens gives an average loss per token of ~0.692, which roughly corresponds to a prediction confidence of ~50–52%. This shows the model is learning (better than random guessing).
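The totals in the table can be checked with a few lines of Python: with one-hot targets, each loss is simply the negative log of the predicted probability listed above.

```python
# A quick check of the table above: with one-hot targets the per-token loss is
# -ln(predicted probability). The probabilities are the ones listed in the table.
import math

predicted = {"Je": 0.514, "veux": 0.4818, "aller": 0.5031, "<EOS>": 0.5046}

losses = [-math.log(p) for p in predicted.values()]
total = sum(losses)
avg = total / len(losses)
print(f"total ≈ {total:.2f}, average ≈ {avg:.2f}")   # ≈ 2.77 total, ≈ 0.69 per token
print(f"confidence ≈ {math.exp(-avg):.0%}")          # ≈ 50%, i.e. exp(-average loss)
```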
- Author: Entropyobserver
- URL: https://tangly1024.com/article/23fd698f-3512-8044-bbb0-dedd8afa486a
- Copyright: All articles in this blog, except for special statements, adopt the BY-NC-SA agreement. Please indicate the source!