This project automates French-to-English text translation using a Seq2Seq model with an attention mechanism, covering data preparation, tokenization, text vectorization, and model training. The goal is to build an efficient and accurate translation pipeline that understands French sentences and generates high-quality English translations. I followed the TensorFlow tutorials to implement the Seq2Seq architecture with attention, which enabled the model to capture language patterns and context effectively and led to improved translation quality. The result is a practical tool for automatic translation and a contribution to work in natural language processing (NLP).
We begin by installing the einops and tensorflow-text packages. These packages are essential for data manipulation and natural language processing with TensorFlow.
einops is a Python library that allows flexible and expressive manipulation of tensor axes. It makes it easy to rearrange dimensions and process data in neural networks.
tensorflow-text is an extension of TensorFlow designed specifically for natural language processing (NLP). It provides various text preprocessing functions and text encoding methods for use with NLP models.
Next, we import the libraries used throughout the project. numpy is a Python library used for numerical computation and operations on multidimensional arrays (real numbers, vectors, matrices, etc.).
typing is a Python module that provides features for annotating types in code. It is used here to specify the types of function arguments and return values.
einops has already been described above, during its installation.
matplotlib.pyplot is used to create visualizations, including graphs and plots.
matplotlib.ticker is used to manage tick marks and labels on plot axes.
Finally, we import tensorflow and tensorflow_text, which are the main libraries for building neural network models and doing natural language processing with TensorFlow.
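A minimal sketch of the corresponding import cell, assuming the conventional aliases (np, plt, tf, tf_text); the specific names pulled from typing are an assumption:

```python
import typing
from typing import Any, Tuple

import einops
import numpy as np

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import tensorflow as tf
import tensorflow_text as tf_text
```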
Next, we define a ShapeChecker class that helps us verify the shapes of tensors during data manipulation. This class is particularly useful for ensuring dimension compatibility when working with neural network models.
The ShapeChecker class has a __call__ method, which takes a tensor and a list of axis names and performs the shape check. If TensorFlow is in eager execution mode (interactive mode), the verification is performed; otherwise nothing happens, which is convenient when training models.
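A sketch of such a class, close to the version in the TensorFlow tutorial and consistent with the behaviour described above (here names is an einops-style string of axis names):

```python
class ShapeChecker:
  def __init__(self):
    # Remember the size seen so far for each named axis.
    self.shapes = {}

  def __call__(self, tensor, names, broadcast=False):
    # Only check shapes in eager (interactive) mode; inside a compiled
    # tf.function this silently does nothing.
    if not tf.executing_eagerly():
      return

    parsed = einops.parse_shape(tensor, names)
    for name, new_dim in parsed.items():
      old_dim = self.shapes.get(name, None)
      if broadcast and new_dim == 1:
        continue
      if old_dim is None:
        # First time this axis name is seen: record its size.
        self.shapes[name] = new_dim
      elif new_dim != old_dim:
        raise ValueError(f"Shape mismatch for axis '{name}': "
                         f"found {new_dim}, expected {old_dim}")
```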
We use the pathlib library to handle file paths and the tf.keras.utils.get_file() function to download the file. The download link points to a commonly used translation dataset.
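A sketch of the download step. The exact URL and archive layout are assumptions (the Anki French-English sentence pairs are a common choice for this kind of project):

```python
import pathlib

# Assumed source: the Anki French-English sentence-pair archive.
path_to_zip = tf.keras.utils.get_file(
    'fra-eng.zip',
    origin='http://www.manythings.org/anki/fra-eng.zip',
    extract=True)

# Assumed name of the extracted file containing the sentence pairs.
path_to_file = pathlib.Path(path_to_zip).parent / 'fra.txt'
```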
The file is then parsed into contexts (context_raw) and targets (target_raw), and we display the last French and English sentences to verify that the data was loaded correctly.
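A sketch of the parsing step, assuming the usual tab-separated layout of the Anki file (English sentence, French sentence, optional attribution column):

```python
def load_data(path):
  text = path.read_text(encoding='utf-8')
  lines = text.splitlines()
  pairs = [line.split('\t') for line in lines]

  # English is the target language, French is the context (source) language;
  # any extra metadata columns are ignored.
  target = np.array([pair[0] for pair in pairs])
  context = np.array([pair[1] for pair in pairs])
  return target, context

target_raw, context_raw = load_data(path_to_file)
print(context_raw[-1])  # last French sentence
print(target_raw[-1])   # last English sentence
```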
Next, we create two boolean masks, is_train and is_val, using a uniform random distribution to split the examples between the two sets. Approximately 80% of the examples go to training (is_train=True) and the rest to validation (is_val=True).
These masks are used to index the context_raw and target_raw lists.
From them we build two datasets (train_raw and val_raw) using the corresponding indices for the training and validation sets. These datasets will be used to train and validate our translation model.
The examples are grouped into batches (BATCH_SIZE) to improve the efficiency of the training process.
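A sketch of the split and batching, assuming BATCH_SIZE = 64 and a shuffle buffer the size of the dataset (both values are assumptions):

```python
BUFFER_SIZE = len(context_raw)
BATCH_SIZE = 64  # assumed value

# Boolean masks for an approximately 80/20 train/validation split.
is_train = np.random.uniform(size=(len(target_raw),)) < 0.8
is_val = ~is_train

train_raw = (
    tf.data.Dataset
    .from_tensor_slices((context_raw[is_train], target_raw[is_train]))
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE))

val_raw = (
    tf.data.Dataset
    .from_tensor_slices((context_raw[is_val], target_raw[is_val]))
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE))
```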
For text standardization, we use the tf_text.normalize_utf8() function to decompose characters into their compatibility forms (NFKD) and produce normalized Unicode text. We then convert the text to lowercase and remove any character that is not a letter of the English alphabet, a space, a period, a question mark, a comma, or an exclamation point. We also add spaces around punctuation so that punctuation marks become distinct tokens.
Finally, we add the special tokens [START] and [END] around the text to mark the beginning and end of the token sequence. This step is essential so that the translation model knows when to start and stop generating text.
Preprocessing standardizes the text and transforms it into a sequence of tokens ready to be used by the translation model.
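A sketch of this standardization function, close to the one in the TensorFlow tutorial (the function name is an assumption). Note that after the NFKD decomposition, dropping everything outside a-z also drops the combining accent marks, so French accented letters are reduced to their base letters:

```python
def tf_lower_and_split_punct(text):
  # Split accented characters into base letter + combining mark (NFKD).
  text = tf_text.normalize_utf8(text, 'NFKD')
  text = tf.strings.lower(text)
  # Keep only spaces, a-z and select punctuation.
  text = tf.strings.regex_replace(text, '[^ a-z.?!,]', '')
  # Add spaces around punctuation so each mark becomes its own token.
  text = tf.strings.regex_replace(text, '[.?!,]', r' \0 ')
  # Strip whitespace and add the start/end markers.
  text = tf.strings.strip(text)
  text = tf.strings.join(['[START]', text, '[END]'], separator=' ')
  return text
```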
Next, we define a maximum vocabulary size (max_vocab_size) that limits the number of words considered for indexing. We then create two text processors, one for the context (French) and the other for the target (English). We use the tf.keras.layers.TextVectorization layer for this, specifying the standardization function, the maximum vocabulary size, and the option ragged=True to indicate that the sequences have variable lengths.
We then adapt the text processors by calling the .adapt() method on the training dataset. This lets the text processors learn their vocabularies from the training data.
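A sketch of the two text processors and their adaptation; max_vocab_size = 5000 is an assumed value:

```python
max_vocab_size = 5000  # assumed value

context_text_processor = tf.keras.layers.TextVectorization(
    standardize=tf_lower_and_split_punct,
    max_tokens=max_vocab_size,
    ragged=True)

target_text_processor = tf.keras.layers.TextVectorization(
    standardize=tf_lower_and_split_punct,
    max_tokens=max_vocab_size,
    ragged=True)

# Learn the vocabularies from the training data only.
context_text_processor.adapt(train_raw.map(lambda context, target: context))
target_text_processor.adapt(train_raw.map(lambda context, target: target))
```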
UNITS represents the number of units (neurons) for the encoding and attention layers. This parameter is set to 256, but it can be adjusted based on the needs and complexity of the model.
In the encoder, the embedding layer (self.embedding) converts tokens (words) into dense vectors. This represents the text in a continuous space and makes it easier to learn relationships between words. The embedding size is given by the number of units (units) used to represent each word. The bidirectional RNN layer (self.rnn) then processes the embedding vectors sequentially. It takes the embedding vectors as input and returns a sequence of hidden states that capture contextual information in both the forward and backward directions of the text. The merge_mode='sum' option means that the outputs of the two directions are summed.
The call method of the encoder takes an input sequence x and performs the following operations: it converts the token identifiers into embedding vectors (self.embedding) and then passes them through the bidirectional RNN (self.rnn) to produce the encoded sequence.
The convert_input method converts raw text into its encoded representation using the encoder: it takes a text as input, converts it into tokens, and passes it through the encoder to obtain the corresponding embedding vectors.
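A sketch of the encoder, close to the TensorFlow tutorial and consistent with the description above; the choice of a GRU as the recurrent cell is an assumption:

```python
class Encoder(tf.keras.layers.Layer):
  def __init__(self, text_processor, units):
    super().__init__()
    self.text_processor = text_processor
    self.vocab_size = text_processor.vocabulary_size()
    self.units = units

    # Token ids -> dense embedding vectors.
    self.embedding = tf.keras.layers.Embedding(
        self.vocab_size, units, mask_zero=True)

    # Bidirectional RNN; forward and backward outputs are summed.
    self.rnn = tf.keras.layers.Bidirectional(
        merge_mode='sum',
        layer=tf.keras.layers.GRU(units,
                                  return_sequences=True,
                                  recurrent_initializer='glorot_uniform'))

  def call(self, x):
    x = self.embedding(x)   # (batch, s) -> (batch, s, units)
    x = self.rnn(x)         # (batch, s, units)
    return x

  def convert_input(self, texts):
    # Raw strings -> token ids -> encoded context vectors.
    texts = tf.convert_to_tensor(texts)
    if len(texts.shape) == 0:
      texts = texts[tf.newaxis]
    context = self.text_processor(texts).to_tensor()
    return self(context)
```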
Next, we define the cross-attention layer (CrossAttention). This layer allows the model to focus on specific parts of the context during translation.
- The layer uses tf.keras.layers.MultiHeadAttention (self.mha), an attention mechanism that attends to the information through several projections in parallel.
- self.layernorm is a normalization layer that improves training stability.
- self.add combines the attention output with the original input (a residual connection).
The call method of the attention layer takes a sequence x and the context context as input. It performs the following operations:
1. Compute the attention output and attention weights using tf.keras.layers.MultiHeadAttention.
2. Combine the attention output with x using self.add.
3. Normalize the result using self.layernorm.
This layer is used in the decoder to focus on the relevant context during translation generation.
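A sketch of the attention layer matching this description (caching the attention scores so they can be plotted later is an extra detail borrowed from the TensorFlow tutorial):

```python
class CrossAttention(tf.keras.layers.Layer):
  def __init__(self, units, **kwargs):
    super().__init__()
    self.mha = tf.keras.layers.MultiHeadAttention(
        key_dim=units, num_heads=1, **kwargs)
    self.layernorm = tf.keras.layers.LayerNormalization()
    self.add = tf.keras.layers.Add()

  def call(self, x, context):
    # 1. Attention output and weights.
    attn_output, attn_scores = self.mha(
        query=x, value=context, return_attention_scores=True)

    # Average over heads and cache the scores for later visualization.
    attn_scores = tf.reduce_mean(attn_scores, axis=1)
    self.last_attention_weights = attn_scores

    # 2. Residual connection, 3. normalization.
    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x
```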
The decoder has the following components:
- self.word_to_id and self.id_to_word are lookup layers that convert words to unique identifiers and back. They manage the vocabulary of the target sequences.
- self.start_token and self.end_token are the identifiers of the start and end tokens of a sequence. They indicate when to start and stop generating the translation.
- self.embedding is an embedding layer that converts token identifiers into embedding vectors.
- self.rnn is an RNN layer (GRU) that processes the target sequences.
- self.attention is the attention layer (CrossAttention) used to focus on the context while generating the translation.
- self.output_layer is a dense layer that predicts the next token from the decoder outputs.
The call method of the decoder takes the encoded context context, the input tokens x, the decoder state state, and a return_state option. It looks up the embeddings of the input tokens x, runs them through the RNN starting from state, attends to the encoded context with the attention layer, and applies the output layer to produce the logits for the next tokens. If return_state is true, the new RNN state is returned as well.
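A sketch of the decoder's constructor and call method, consistent with the components and steps listed above (the GRU settings mirror the TensorFlow tutorial and are assumptions):

```python
class Decoder(tf.keras.layers.Layer):
  def __init__(self, text_processor, units):
    super().__init__()
    self.text_processor = text_processor
    self.vocab_size = text_processor.vocabulary_size()

    # Word <-> id lookups for the target vocabulary.
    self.word_to_id = tf.keras.layers.StringLookup(
        vocabulary=text_processor.get_vocabulary(),
        mask_token='', oov_token='[UNK]')
    self.id_to_word = tf.keras.layers.StringLookup(
        vocabulary=text_processor.get_vocabulary(),
        mask_token='', oov_token='[UNK]', invert=True)
    self.start_token = self.word_to_id('[START]')
    self.end_token = self.word_to_id('[END]')

    self.embedding = tf.keras.layers.Embedding(
        self.vocab_size, units, mask_zero=True)
    self.rnn = tf.keras.layers.GRU(units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.attention = CrossAttention(units)
    self.output_layer = tf.keras.layers.Dense(self.vocab_size)

  def call(self, context, x, state=None, return_state=False):
    x = self.embedding(x)                        # token ids -> embeddings
    x, state = self.rnn(x, initial_state=state)  # process the target sequence
    x = self.attention(x, context)               # attend to the encoded context
    logits = self.output_layer(x)                # per-step vocabulary logits
    if return_state:
      return logits, state
    return logits
```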
The get_initial_state method initializes the decoder state before translation. It returns the start token for each sequence in the batch, initializes the "done" flag to false for all sequences, and returns the initial state of the RNN.
The tokens_to_text method converts tokens back into text using the inverted lookup self.id_to_word. It joins the words to form a sentence and removes the start and end tokens.
The get_next_token method predicts the next token during translation generation. It takes the context, the current token, the decoder state, the "done" flag (indicating whether a sequence is finished), and a temperature option for random generation. If the temperature is 0, the most probable token is chosen (deterministic mode); otherwise the token is sampled from the logits (stochastic mode). This method is used to iterate over the tokens and generate the complete translation word by word.
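These three helpers can be sketched as follows, continuing the Decoder class above (indented as class methods):

```python
  def get_initial_state(self, context):
    batch_size = tf.shape(context)[0]
    start_tokens = tf.fill([batch_size, 1], self.start_token)
    done = tf.zeros([batch_size, 1], dtype=tf.bool)
    embedded = self.embedding(start_tokens)
    return start_tokens, done, self.rnn.get_initial_state(embedded)[0]

  def tokens_to_text(self, tokens):
    words = self.id_to_word(tokens)
    result = tf.strings.reduce_join(words, axis=-1, separator=' ')
    # Drop the [START] / [END] markers from the final string.
    result = tf.strings.regex_replace(result, r'^ *\[START\] *', '')
    result = tf.strings.regex_replace(result, r' *\[END\] *$', '')
    return result

  def get_next_token(self, context, next_token, done, state, temperature=0.0):
    logits, state = self(context, next_token, state=state, return_state=True)

    if temperature == 0.0:
      # Deterministic mode: pick the most probable token.
      next_token = tf.argmax(logits, axis=-1)
    else:
      # Stochastic mode: sample from the temperature-scaled logits.
      logits = logits[:, -1, :] / temperature
      next_token = tf.random.categorical(logits, num_samples=1)

    # Mark finished sequences and emit only padding once they are done.
    done = done | (next_token == self.end_token)
    next_token = tf.where(done, tf.constant(0, dtype=tf.int64), next_token)
    return next_token, done, state
```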
We then define a Translator class that combines the encoder and decoder into a complete translation model. The Translator class inherits from tf.keras.Model, which lets us define the call method that performs the translation.
Its call method takes a tuple inputs containing the context and the target sequence x. It performs the following operations:
1. Encode the context (context = self.encoder(context)).
2. Decode against that context to obtain the logits (logits = self.decoder(context, x)).
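A sketch of the Translator model built from the pieces above:

```python
class Translator(tf.keras.Model):
  def __init__(self, units, context_text_processor, target_text_processor):
    super().__init__()
    self.encoder = Encoder(context_text_processor, units)
    self.decoder = Decoder(target_text_processor, units)

  def call(self, inputs):
    context, x = inputs
    context = self.encoder(context)    # 1. encode the source sentence
    logits = self.decoder(context, x)  # 2. predict the next target tokens
    return logits

UNITS = 256  # as stated above
model = Translator(UNITS, context_text_processor, target_text_processor)
```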
We also define a translation method (translate) for the Translator model. This method takes raw text as input and returns its translation using the trained model. It uses the encoder to convert the raw text into its encoded representation (context = self.encoder.convert_input(texts)), then uses the decoder to generate the translation word by word with the decoder's get_next_token method.
We also define an attention-visualization method (plot_attention). This method takes raw text as input, translates it with the model, and displays the attention weights as a matrix to show which parts of the context were used to generate each word of the translation.
The masked_loss function calculates the loss while ignoring padding tokens. This is necessary because target sequences have different lengths and are padded with padding tokens. By masking these padding tokens, the loss is computed only on the relevant tokens.
The masked_acc function likewise computes accuracy while ignoring padding tokens. Like the masked loss, it ensures that the metric is computed only on relevant tokens.
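A sketch of the two masked metrics, assuming token id 0 is the padding token (the TextVectorization convention):

```python
def masked_loss(y_true, y_pred):
  # Per-token sparse categorical cross-entropy on the logits.
  loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
      from_logits=True, reduction='none')
  loss = loss_fn(y_true, y_pred)

  # Zero out the padding positions before averaging.
  mask = tf.cast(y_true != 0, loss.dtype)
  loss *= mask
  return tf.reduce_sum(loss) / tf.reduce_sum(mask)

def masked_acc(y_true, y_pred):
  y_pred = tf.argmax(y_pred, axis=-1)
  y_pred = tf.cast(y_pred, y_true.dtype)

  match = tf.cast(y_true == y_pred, tf.float32)
  mask = tf.cast(y_true != 0, tf.float32)
  return tf.reduce_sum(match * mask) / tf.reduce_sum(mask)
```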
We then call the fit method to train the model on the training data. The training data is repeated for multiple epochs, and the validation set is used to monitor performance. We also use the tf.keras.callbacks.EarlyStopping callback to stop training if the validation loss does not improve for a certain number of consecutive epochs. This prevents unnecessary overfitting and lets us keep the best model based on its performance on the validation data.
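A sketch of the training setup. The teacher-forcing preprocessing step (tokenizing the pairs and shifting the target by one position) is not detailed in the text above and is an assumption here, as are the step counts and the patience value:

```python
def process_text(context, target):
  # Tokenize; decoder inputs are the target shifted right, labels shifted left.
  context = context_text_processor(context).to_tensor()
  target = target_text_processor(target)
  targ_in = target[:, :-1].to_tensor()
  targ_out = target[:, 1:].to_tensor()
  return (context, targ_in), targ_out

train_ds = train_raw.map(process_text, tf.data.AUTOTUNE)
val_ds = val_raw.map(process_text, tf.data.AUTOTUNE)

model.compile(optimizer='adam',
              loss=masked_loss,
              metrics=[masked_acc, masked_loss])

history = model.fit(
    train_ds.repeat(),
    epochs=100,            # upper bound; early stopping ends training earlier
    steps_per_epoch=100,   # assumed value
    validation_data=val_ds,
    validation_steps=20,   # assumed value
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=3)])
```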
Let us look more closely at the translate method of the Translator class. This method takes raw text as input and returns its translation using the trained model. In a loop, it calls the decoder's get_next_token method to generate the next token, guided by the attention weights computed by the attention layer, until the sequence is complete. Finally, it concatenates the lists of tokens to obtain the full translation and returns it.
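A sketch of this generation loop, attached to the Translator class; max_length and the method-attachment style are assumptions, and the attention-weight bookkeeping used by plot_attention is omitted here:

```python
def translate(self, texts, *, max_length=50, temperature=0.0):
  # Encode the raw input text into its context representation.
  context = self.encoder.convert_input(texts)

  next_token, done, state = self.decoder.get_initial_state(context)
  tokens = []

  for _ in range(max_length):
    # Predict one token per step until every sequence has emitted [END].
    next_token, done, state = self.decoder.get_next_token(
        context, next_token, done, state, temperature)
    tokens.append(next_token)
    if tf.reduce_all(done):
      break

  tokens = tf.concat(tokens, axis=-1)   # (batch, t)
  return self.decoder.tokens_to_text(tokens)

Translator.translate = translate
```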
The plot_attention method visualizes the attention weights during translation: it takes raw text as input, translates it with the model, and displays the attention weights as a matrix showing which parts of the context were used to generate each word of the translation.
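A simplified sketch of such a visualization, relying on the attention scores cached by the CrossAttention sketch above; the description also implies labelling the axes with the actual input and output tokens, which this sketch omits for brevity:

```python
def plot_attention(self, text, max_length=50):
  # Translate a single sentence while recording the attention weights
  # produced at each decoding step.
  assert isinstance(text, str)
  context = self.encoder.convert_input([text])
  next_token, done, state = self.decoder.get_initial_state(context)

  weights = []
  for _ in range(max_length):
    next_token, done, state = self.decoder.get_next_token(
        context, next_token, done, state)
    weights.append(self.decoder.attention.last_attention_weights)
    if tf.reduce_all(done):
      break

  attention = tf.concat(weights, axis=1)[0]  # (output_steps, input_tokens)

  fig, ax = plt.subplots(figsize=(8, 8))
  ax.matshow(attention.numpy(), cmap='viridis', vmin=0.0)
  ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
  ax.yaxis.set_major_locator(ticker.MultipleLocator(1))
  ax.set_xlabel('Input (context) tokens')
  ax.set_ylabel('Generated tokens')
  plt.show()

Translator.plot_attention = plot_attention
```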
We then call the model's plot_attention method to display the attention weights between the input text and the generated translation, first on a short example and then on a longer one.
Finally, we save the trained model using the Keras save method, along with the training history as a pickle file, for future reuse.
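A sketch of the saving step; the file names and formats are assumptions:

```python
import pickle

model.save('translator_model')  # Keras SavedModel directory (assumed path)

with open('training_history.pkl', 'wb') as f:
  pickle.dump(history.history, f)
```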