What is this post about?
In this post we will analyze an amazing paper entitled:
“Neural Machine Translation by Jointly Learning to Align and Translate”
Why? Well, I was reading the paper ‘Attention Is All You Need’ by Google and, quite frankly, I was annoyed by the whole tuning-to-death strategy throughout their paper. I noticed that they cite this paper as the foundation of their work, so I decided to read it, and I have indeed enjoyed it.
This paper was accepted at ICLR (International Conference on Learning Representations) in 2015 and has gathered over 19,000 citations. One of the authors is the legend Prof. Yoshua Bengio. I want you to appreciate the genius of their proposal, but also to think about their thought process in terms of 1) defining the problem and 2) proposing a solution.
This is a paper about learning neural machine translation models; it highlights the use of the ‘Attention’ mechanism to train a neural network for the task of English-to-French translation. Their proposal challenges the paper’s predecessors, who at the time had only been using simple encoder-decoder architectures. Let’s dig into it; I am ready when you are 😉

The Problem Statement
The authors point out a general issue with the most common neural machine translation techniques. These techniques are encoder-decoder based approaches, where the encoder (i.e., a neural network) reads a source sentence and learns to map it into a fixed-length vector. Next, the decoder network learns to output the correct translation from this fixed-length vector. The whole structure is trained in an end-to-end fashion in order to maximize the probability of producing the correct translation. What is the issue then?
The network is forced and restricted to encode the source sentence and all of its juicy information into a FIXED-LENGTH representation!
This becomes a serious issue especially as the length of the source sentence increases.
My thoughts: We know that there are only so many possible words in a sentence when it comes to human languages (e.g., French, English, Persian, …). However, I would like to think that neural translation should also be applicable to non-language sequences! For instance, translating a sequence of amino acids to their corresponding protein structure. This is where the length of the source sentence (i.e., sequence) can really grow, and the traditional encoder-decoder models will suffer, as they have to compress the sequence into a fixed-length vector.
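To make the bottleneck concrete, here is a minimal numpy sketch of a plain (non-attention) encoder, the kind of fixed-length compression the authors are criticizing. This is not the paper’s exact RNNencdec (which uses gated units); the dimensions and weights below are made up purely for illustration:

```python
import numpy as np

def encode_to_fixed_vector(x_seq, W_x, W_h, b):
    """Plain tanh RNN encoder: compress the WHOLE source sequence
    into its single final hidden state (the fixed-length vector)."""
    h = np.zeros(W_h.shape[0])
    for x_t in x_seq:                 # x_t: embedding of one source word
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h                          # same size no matter how long x_seq is

# Toy, made-up dimensions: embedding size 4, hidden size 8
emb_dim, hid_dim = 4, 8
rng = np.random.default_rng(0)
W_x = rng.normal(size=(hid_dim, emb_dim))
W_h = rng.normal(size=(hid_dim, hid_dim))
b = np.zeros(hid_dim)

short_sentence = rng.normal(size=(5, emb_dim))   # a 5-word sentence (as embeddings)
long_sentence = rng.normal(size=(80, emb_dim))   # an 80-word sentence
print(encode_to_fixed_vector(short_sentence, W_x, W_h, b).shape)  # (8,)
print(encode_to_fixed_vector(long_sentence, W_x, W_h, b).shape)   # (8,) -- same bottleneck
```

Whether the source has 5 words or 80, the decoder only ever sees that one fixed-size vector; that is exactly the restriction this paper attacks.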
Now let’s see how Prof. Bengio and his colleagues have gone about fixing this issue.
Learning to Align and Translate
Make no mistake, in this paper we still have the idea of encoders and decoders. However, the authors propose a method where the decoder uses the ‘Attention’ mechanism, which relieves the encoder from the burden of being forced to encode ‘ALL’ of the information in the source sentence/sequence into a fixed-length vector. In addition, in their proposed architecture, a bidirectional RNN is used as the encoder, and the decoder is responsible for searching through the source sentence/sequence (i.e., learning where to focus its attention in the input!) while decoding the correct translation.
My thoughts: It is astonishing to me that the works prior to this paper had not thought of using a bidirectional RNN! I mean, you would like to learn patterns in your sequence from every goddamn direction possible, right? Having said that, I am not sure whether, while reading a sentence like ‘I am terribly hungry’, our brain cares about the opposite direction in the sequence: ‘hungry terribly am I’!
Now let’s explore the decoder and then the encoder. It makes it easier to discuss the decoder first, trust me.
The Decoder
Don’t panic, OK? This is the architecture of the decoder:

Before doing anything, let me define each variable in the figure in plain English. Then we will study the math and how these relate to one another. Then you will realize the genius in their design:
- $x_j$: the $j$-th word in the input sequence $(x_1, \dots, x_{T_x})$.
- $y_t$: the output translation word at time $t$.
- $s_t$: the hidden state of an RNN (Recurrent Neural Network) model at time $t$. (NOTE: when in state $s_t$, the RNN model produces $y_t$.)
- $h_j$: an annotation vector (yes! A bloody vector!) for the whole sequence $(x_1, \dots, x_{T_x})$, however, with a strong focus on $x_j$ (and its proximity). This is generated by the encoder network, which I will dissect in the next section. So for now, think of it as a rich representation of $x_j$.
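As a quick cheat sheet, here is the same notation written out as array shapes. These dimensionalities are assumptions for illustration only; the paper’s actual sizes differ:

```python
# Cheat sheet (illustrative shapes only):
#   T_x  -- length of the source sentence, T_y -- length of the translation
#   x_j  -- j-th source word, as an embedding vector of shape (emb_dim,)
#   y_t  -- t-th target word emitted by the decoder
#   s_t  -- decoder RNN hidden state at time t, shape (n,)
#   h_j  -- annotation of x_j from the bidirectional encoder, shape (2n,)
#           (forward and backward states concatenated -- see the encoder section)
#   c_t  -- context vector at decoder step t, a weighted sum of the h_j's, shape (2n,)
```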
Now, let’s understand this whole decoder model. Mathematically speaking, the model is pushed to maximize the probability of producing the correct $y_t$ at time $t$. This probability depends on a few factors, as shown below:

$$p(y_t \mid y_1, \dots, y_{t-1}, \mathbf{x}) = g(y_{t-1}, s_t, c_t)$$

You will notice that this is a recursive definition, and also that we are representing this probability as a function $g$, whose output depends on $y_{t-1}$, $s_t$, and $c_t$. Indeed, $g$ is responsible for emitting $y_t$.
Note: The output translated word at time stamp $t$ depends on the translated word at time stamp $t-1$ (remember: this is a recursive definition, so $y_{t-1}$ also depends on $y_{t-2}$, and so on) and on the current state of the RNN model at time $t$. So you can say that the translated word at time stamp $t$ depends on all of the already-translated words, and on that mysterious $c_t$ at time $t$.
So, $s_t$ is the hidden state of our RNN decoder at time $t$. It is defined as:

$$s_t = f(s_{t-1}, y_{t-1}, c_t)$$

In other words, the hidden state of the decoder RNN at time $t$ depends on the hidden state at $t-1$, the translated word at $t-1$, and the value of the context vector $c_t$ at time $t$.
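If it helps to see this as code, here is a minimal sketch of one decoder step. It is a simplification under assumptions: I stand in for the paper’s $f$ with a plain tanh RNN cell and for $g$ with a linear layer plus softmax (the actual model uses gated hidden units and a maxout output layer), and all sizes and random weights are made up:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decoder_step(s_prev, y_prev_emb, c_t, params):
    """One decoder step: s_t = f(s_{t-1}, y_{t-1}, c_t), then p(y_t) = g(y_{t-1}, s_t, c_t)."""
    W_s, W_y, W_c, b, W_out, b_out = params
    # f: simplified tanh recurrence over previous state, previous word, and context
    s_t = np.tanh(W_s @ s_prev + W_y @ y_prev_emb + W_c @ c_t + b)
    # g: score the whole target vocabulary from (s_t, y_{t-1}, c_t), then softmax
    logits = W_out @ np.concatenate([s_t, y_prev_emb, c_t]) + b_out
    p_y_t = softmax(logits)          # distribution over the target vocabulary
    return s_t, p_y_t

# Toy dimensions (hypothetical): hidden 8, embedding 4, context 16, vocabulary 50
n, emb, ctx, V = 8, 4, 16, 50
rng = np.random.default_rng(0)
params = (rng.normal(size=(n, n)), rng.normal(size=(n, emb)),
          rng.normal(size=(n, ctx)), np.zeros(n),
          rng.normal(size=(V, n + emb + ctx)), np.zeros(V))
s_t, p = decoder_step(np.zeros(n), rng.normal(size=emb), rng.normal(size=ctx), params)
print(p.sum())   # ~1.0, a proper probability distribution over the vocabulary
```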
NOTE: Unlike traditional encoder-decoder neural translators, here, for producing EVERY translated word $y_t$ at time $t$, there is a distinct context vector $c_t$ at time $t$.
$c_t$ defines the context of the input sequence at time $t$; however, it does not use the input sequence directly. It uses a sequence of annotations, $(h_1, \dots, h_{T_x})$, to which the encoder network maps the input sequence (you will see how smartly it does so!).
Note: You might be asking yourself, what the heck is going on with those two rows of $h$’s in the encoder’s plot? Trust me, I will tell you all about it in the next section. For now, focus on understanding the meaning of these annotations.
Each one of these annotation vectors, $h_j$, contains information about the entire input sequence of length $T_x$, however, with a strong focus on the parts surrounding the $j$-th element of the input sequence.
Note: There is a very good reason why the keyword ‘surrounding’ is used here. You will see how the encoder makes this happen in the next section.
Now let’s define exactly how the context vector is computed using these annotations, mathematically speaking:

$$c_t = \sum_{j=1}^{T_x} \alpha_{tj} h_j$$

In plain English: at every time stamp $t$, we are computing a weighted sum across ALL of the annotation vectors, each of which has information about the ENTIRE input sequence. But then, how are these weights learned?
The weight of each annotation vector $h_j$ is $\alpha_{tj}$. It is learned by the model, and what it means is: at time $t$, when I want to produce $y_t$, how much should I pay attention (this is what the authors refer to as learning to align) to the word $x_j$ and its surroundings in the input sequence (i.e., as captured by the corresponding annotation $h_j$)? In fact, the decoder learns to assign these weights appropriately, given the annotations produced by the encoder. Mathematically speaking, the authors insist that these weights sum up to 1, hence they have used a softmax in the definition of these weights:

$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T_x} \exp(e_{tk})}$$
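Before we get to what those $e$ scores actually are, here is the softmax mechanics with some made-up scores for a three-word source sentence:

$$e_{t1} = 2.0,\quad e_{t2} = 0.5,\quad e_{t3} = -1.0 \;\Longrightarrow\; \alpha_{t1} = \frac{e^{2.0}}{e^{2.0}+e^{0.5}+e^{-1.0}} \approx 0.786,\quad \alpha_{t2} \approx 0.175,\quad \alpha_{t3} \approx 0.039$$

The weights sum to 1, and the source position with the largest score grabs most of the attention.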
I know! What the heck is $e_{tj}$? It is the score produced by the alignment model. In plain English, it is a score that tells us: how well do the inputs around position $j$ and the translated output word $y_t$ at time step $t$ match? In other words, how important are the inputs around position $j$ in producing the correct translation $y_t$? Effectively, this implements a mechanism of attention in the decoder!
Mathematically, this is represented as:

$$e_{tj} = a(s_{t-1}, h_j)$$

and $a(\cdot)$ is parametrized as a feedforward neural network, which is jointly trained alongside all of the other components of the entire model. You can see that the score at time $t$ really depends on the previous state of the RNN decoder, $s_{t-1}$ (just before emitting $y_t$), and the annotation $h_j$.
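For the curious, the appendix of the paper spells this feedforward network out as a single-hidden-layer tanh scorer:

$$a(s_{t-1}, h_j) = v_a^{\top} \tanh\!\big(W_a s_{t-1} + U_a h_j\big)$$

where $W_a$, $U_a$ and $v_a$ are weights learned jointly with everything else.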
Note: The alignment model is very interesting, as it directly computes a soft alignment (i.e., aligning $y_t$ not just with the input at position $j$ but with the inputs around it as well). Since it is a feed-forward neural net, the gradient of the cost of the ENTIRE model can flow through it, all the way back to the encoder network. So, the gradients can be used to train the entire model, including the alignment model.
So we can say that $\alpha_{tj}$ is the probability (since it comes out of a softmax and the weights sum up to 1) that the target word $y_t$ is aligned with, or translated from, a source word $x_j$. Then how can we explain the context vector $c_t$ in plain English? $c_t$ is indeed the expected annotation over all of the annotations $h_1, \dots, h_{T_x}$. This is an expectation defined by the weights $\alpha_{tj}$, where $\sum_{j=1}^{T_x} \alpha_{tj} = 1$.
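Putting the last few equations together, here is a minimal numpy sketch of one attention step: score every annotation against the previous decoder state with the tanh alignment model above, softmax the scores into weights, and take the expected annotation as the context vector. The sizes and random weights are purely illustrative:

```python
import numpy as np

def attention_step(s_prev, H, W_a, U_a, v_a):
    """Compute the attention weights and context vector c_t from the previous
    decoder state s_prev and the annotation matrix H of shape (T_x, 2n)."""
    # Alignment scores e_{tj} = v_a^T tanh(W_a s_{t-1} + U_a h_j), one per source position
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])
    # Softmax -> attention weights that sum to 1
    e = e - e.max()
    alpha = np.exp(e) / np.exp(e).sum()
    # Context vector: expected annotation under the weights alpha
    c_t = alpha @ H                      # shape (2n,)
    return alpha, c_t

# Toy sizes (hypothetical): T_x = 6 source words, n = 8 per direction, alignment dim = 10
T_x, n, d = 6, 8, 10
rng = np.random.default_rng(0)
H = rng.normal(size=(T_x, 2 * n))        # annotations from the BiRNN encoder
s_prev = rng.normal(size=n)
W_a, U_a, v_a = rng.normal(size=(d, n)), rng.normal(size=(d, 2 * n)), rng.normal(size=d)

alpha, c_t = attention_step(s_prev, H, W_a, U_a, v_a)
print(alpha.round(3), alpha.sum())       # weights over the 6 source positions, summing to 1
print(c_t.shape)                         # (16,)
```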
One last important point before we finish this section: by letting the decoder have an attention mechanism, the encoder does not need to encode the entire input sequence into a fixed-length vector. Rather, the information can be spread throughout the annotation vectors, which the decoder can then selectively pay attention to. In other words, the decoder learns where to focus its attention, given the annotations covering the entire input sequence.
Let’s now start talking about that mysterious encoder and see how it manages to generate all of these annotations.
ENCODER: BIDIRECTIONAL RNN FOR ANNOTATING SEQUENCES
As we mentioned, we would like the annotation vector for each word to summarize both the preceding words and the following words. This is why the encoder network uses a bidirectional RNN (BiRNN) to capture a summary of the sequence in two directions. A BiRNN has two main components:
1) Forward RNN: this RNN reads the input sequence as it is ordered (from $x_1$ to $x_{T_x}$) and produces a sequence of forward hidden states $(\overrightarrow{h}_1, \dots, \overrightarrow{h}_{T_x})$. But this is only half of the picture.
2) Backward RNN: this RNN reads the input sequence in reverse (from $x_{T_x}$ to $x_1$) and computes a sequence of backward hidden states $(\overleftarrow{h}_{T_x}, \dots, \overleftarrow{h}_1)$.
Then the final annotation for a given word $x_j$ can be produced by concatenating the forward hidden state $\overrightarrow{h}_j$ and the backward hidden state $\overleftarrow{h}_j$, that is, $h_j = \big[\overrightarrow{h}_j^{\top}; \overleftarrow{h}_j^{\top}\big]^{\top}$. This is interesting, as now the annotation $h_j$ contains the summaries of both the preceding words and the following words.
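Here is a minimal numpy sketch of how such annotations could be produced. I use plain tanh RNN cells for readability (the paper uses gated recurrent units), and all sizes and weights are made up:

```python
import numpy as np

def rnn_pass(X, W_x, W_h, b):
    """Simple tanh RNN over a sequence of embeddings X of shape (T_x, emb);
    returns all hidden states, shape (T_x, n)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in X:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

def birnn_annotations(X, fwd_params, bwd_params):
    """Annotation h_j = [forward state at j ; backward state at j]."""
    h_fwd = rnn_pass(X, *fwd_params)              # reads x_1 ... x_Tx
    h_bwd = rnn_pass(X[::-1], *bwd_params)[::-1]  # reads x_Tx ... x_1, then re-aligned to positions
    return np.concatenate([h_fwd, h_bwd], axis=1) # shape (T_x, 2n)

# Toy sizes (hypothetical): 6 words, embedding 4, hidden 8 per direction
T_x, emb, n = 6, 4, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(T_x, emb))
make = lambda: (rng.normal(size=(n, emb)), rng.normal(size=(n, n)), np.zeros(n))
H = birnn_annotations(X, make(), make())
print(H.shape)   # (6, 16) -- one 2n-dimensional annotation per source word
```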
My thoughts: Remember that the $j$-th element of the forward hidden states is indeed an annotation for $x_j$. However, remember that it captures a summary of the ENTIRE input sequence (while reading from $x_1$ to $x_{T_x}$). It is as if it captures a left-to-right summary of the logic of the input sequence when read from left to right. It is the same story with the $j$-th element of the backward annotations, $\overleftarrow{h}_j$; however, it summarizes the entire sequence while trying to understand its logic when reading from right to left.
Now here is an annoying question: if the concatenated $h_j$ captures the summary of the entire sequence, then WHY is it an annotation for the $j$-th element of the input sequence? I think the authors explain this by saying that: ‘Due to the tendency of RNNs to better represent recent inputs, the annotation $h_j$ will be focused on the words around $x_j$.’ As a result, the forward $\overrightarrow{h}_j$ captures the entire input sequence while focusing on $x_j$ and on how the sequence developed from left to right to reach $x_j$, i.e., focusing on the left-hand proximity of $x_j$ (since RNNs better remember recent inputs); likewise, the backward $\overleftarrow{h}_j$ focuses on the right-hand proximity of $x_j$.
Finally, these annotations are used by the attention mechanism (i.e., the weighted sum of these annotations, where the weights are computed by the feed-forward alignment network) to produce the context vector, which is then used by the decoder to produce the current translated word at time stamp $t$.
Experiment Setting
We can divide the experiment setup into the following:
- What is the task? The experiments are for the task of English-to-French translation.
- What is the dataset? The ACL WMT ’14 bilingual corpora are used.
- What is the metric of evaluation? The BLEU score is used as the performance evaluation metric (see the quick sketch right after this list).
- What are the models used? Two types of models are used:
- A classic RNN Encoder-Decoder model, referred to as RNNencdec (proposed by Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of EMNLP 2014). In particular, different variations of this model are used, based on the maximum length of the training sentences. For example, RNNencdec-50 means that the RNNencdec model is trained using sentences of up to 50 words in length. In addition, the encoder and decoder of RNNencdec have 1000 hidden units each.
- The proposed model, referred to as RNNsearch. Again, different variations of this model are used, based on the maximum length of the training sentences. For example, RNNsearch-50 means that the RNNsearch model is trained using sentences of up to 50 words in length. To make the comparison with the RNNencdec models fair, the encoder of RNNsearch consists of forward and backward RNNs, each having 1000 hidden units. The decoder again has 1000 hidden units.
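As a side note on the metric: BLEU scores a candidate translation by its n-gram overlap with one or more reference translations. Here is a toy sketch with NLTK (assuming it is installed); this is not the exact tooling the authors used, and the sentences are made up:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sits", "on", "the", "mat"]]   # list of reference token lists
candidate = ["the", "cat", "sat", "on", "the", "mat"]

# Smoothing avoids a zero score when some higher-order n-grams have no overlap
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```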
Quantitative Results
Take a look at the figure below. See how the traditional encoder-decoder RNNs (i.e., RNNenc) drop in their performance as the length of the input sequence increases.
Interestingly, this might be due to the fact that RNNenc models are always forced to compress a summary of the input sequence into a fixed-length context vector. Hence, they struggle to do so when the input sequence becomes longer and longer! They fail to compress those long sequences into a fixed-length vector!
Note how fast the performance of the RNNenc models (i.e., RNNenc-30 and RNNenc-50) drops. See how RNNsearch-50 maintains a stable and solid performance irrespective of the input sequence length (I think it would drop eventually if we went beyond sequences of length 60; however, it is beating all the others pretty impressively!).

Conclusions
The key takeaway: by letting the decoder learn where to look in the source sentence through a learned attention (alignment) mechanism, the encoder is no longer forced to squeeze everything into one fixed-length vector. The result, RNNsearch, holds up on long sentences where the classic RNNencdec falls apart, and the whole thing, alignment model included, is trained jointly in an end-to-end fashion. No wonder this idea became the foundation for so much of what followed.


Author: Mehran
Dr. Mehran H. Bazargani is a researcher and educator specialising in machine learning and computational neuroscience. He earned his Ph.D. from University College Dublin, where his research centered on semi-supervised anomaly detection through the application of One-Class Radial Basis Function (RBF) Networks. His academic foundation was laid with a Bachelor of Science degree in Information Technology, followed by a Master of Science in Computer Engineering from Eastern Mediterranean University, where he focused on molecular communication facilitated by relay nodes in nano wireless sensor networks. Dr. Bazargani’s research interests are situated at the intersection of artificial intelligence and neuroscience, with an emphasis on developing brain-inspired artificial neural networks grounded in the Free Energy Principle. His work aims to model human cognition, including perception, decision-making, and planning, by integrating advanced concepts such as predictive coding and active inference. As a NeuroInsight Marie Skłodowska-Curie Fellow, Dr. Bazargani is currently investigating the mechanisms underlying hallucinations, conceptualising them as instances of false inference about the environment. His research seeks to address this phenomenon in neuropsychiatric disorders by employing brain-inspired AI models, notably predictive coding (PC) networks, to simulate hallucinatory experiences in human perception.