Attention Is All You Need

Introduction

In the last three years, the Transformer architecture has become an influential paradigm within deep learning. It has been applied prolifically in natural language processing (NLP), is beginning to see promising applications in computer vision (CV), and is used in many other modalities and fields of deep learning. The Transformer was introduced in “Attention Is All You Need” [1] by Vaswani et al. (hereafter AAYN), which applies the architecture to machine translation.

Historically, the work in AAYN was done when recurrent neural networks (RNNs) were the dominant force in NLP; common variants included the long short-term memory (LSTM) [5] and the gated recurrent unit (GRU) [6]. However, these models have a fundamental problem: they compute step by step along the length of a sequence, so they cannot be parallelized easily. Additionally, RNNs struggle to learn long-range dependencies. To rectify these issues, the authors propose the key idea (and title): attention is all you need. Although there had been previous work on attention, most of it combined attention with RNNs, so it retained the drawbacks of recurrence.

Task – Machine Translation

In AAYN, the primary goal of the model is translation, making this a sequence-to-sequence generation problem. In particular, the authors focus on the English-to-German and English-to-French tasks from WMT 2014. Machine translation models are trained on bitext: data where each sample consists of the same sentence in the source and the target language. The model is then evaluated on a test set where it has to translate unseen sentences. The outputs are compared against several human reference translations, which are used to compute the BLEU score, defined as follows [2,7].

Definition of BLEU from [2]
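In symbols, the definition in [2] combines a brevity penalty with the geometric mean of the n-gram precisions, where precision_i is (roughly) the fraction of i-grams in the output that also appear in a reference, with clipping:

$$\text{BLEU} = \min\!\left(1,\; \exp\!\left(1 - \frac{\textit{reference-length}}{\textit{output-length}}\right)\right) \cdot \left(\prod_{i=1}^{4} \textit{precision}_i\right)^{1/4}$$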


Here, the key things to note are the brevity penalty and the n-gram overlap. If the model outputs something very short, it has a high probability of completely overlapping with n-grams in a reference translation. To penalize this, the brevity penalty is added: when the output translation is shorter than the reference translation, the exponent is negative, which makes the brevity term smaller than 1. If the reverse is true, the brevity term is 1 due to the minimum and has no effect. The other important term in BLEU is the n-gram overlap, which measures how well an output matches the references. The different n-gram lengths measure different things: unigrams measure adequacy, while longer n-grams measure fluency. Note that this definition allows a candidate output to combine parts from different reference translations and still receive a good score.

BLEU score interpretation from [2]


Preliminaries

The Transformer model uses an encoder-decoder sequence-to-sequence architecture. It can be described mathematically as follows:

  • Input: a length-n sequence of symbol representations (x_1, ..., x_n)
  • Encoder: produces latent representations z = (z_1, ..., z_n)
  • Decoder: uses z to produce a length-m output sequence (y_1, ..., y_m)

Model

The Transformer is the following model:

In the model, the encoder is on the left, and it feeds into the decoder on the right. In AAYN, the encoder and decoder are each stacked six times. We will examine the construction of both the encoder and decoder, so first let’s look at how the pieces of each layer are built.

Scaled Dot-Product Attention

Attention is at the core of the Transformer architecture. The intuition behind this approach is that it lets the model decide which other symbols in the sequence are most important to look at for the task at hand. In AAYN, the attention mechanism is implemented using multiplicative (dot-product) attention. The major modification in the paper is to scale the dot products by dividing by the square root of the key dimension. This is due to the authors' observation that the dot product grows too large in magnitude when the number of dimensions is high, pushing the softmax into regions with very small gradients and limiting the model's efficacy.
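In the paper's notation, where Q, K, and V are the query, key, and value matrices described below and d_k is the key dimension:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$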

Dot-product attention works by learning query, key, and value projections of some input. The queries and keys are used to compute the attention weights: how much weight the model gives each token in the sequence. The product QK^T produces a sequence-length-by-sequence-length matrix of logits. A softmax is then applied, which turns the logits into probability distributions (one for each symbol in the sequence). These probabilities are used to compute a weighted average of the values V. V contains a representation for each symbol in the sequence, so the attention distribution decides how much attention one symbol i should pay to any other symbol j. A visual of this is shown later in the results section.
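As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product attention; the variable names Q, K, V follow the paper, while the toy shapes below are arbitrary:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    # (seq_len, seq_len) matrix of logits: one row of scores per query position.
    scores = Q @ K.T / np.sqrt(d_k)
    # Each row becomes a probability distribution over the sequence.
    weights = softmax(scores, axis=-1)
    # Weighted average of the value vectors.
    return weights @ V

# Toy usage: a sequence of 5 tokens with d_k = d_v = 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 8)
```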

Multi-Head Attention

The authors notice that the weighted average in attention prevents the model from looking at different representation subspaces—it can’t consider multiple different parts of the sequence without averaging them. To fix this, the authors propose multi-head attention.

Multi-head attention essentially allows the model to look at multiple things at the same time. Each attention head can learn to look for different things, such as connecting adjectives to nouns or connecting verbs and objects. Naively using multi-head attention, however, would increase the computational costs of the model.

To address this issue, the authors divide the representation dimension of each head by h, where h is the number of heads. This keeps the total number of parameters in the attention mechanism the same, and the total computational cost similar to single-head attention with full dimensionality. In the base model, each head uses d_k = d_v = d_model/h = 64, with h = 8.

The output representations from each head are concatenated together to create a vector of the original length, d_model. This concatenation is then passed through one more linear projection.
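Here is a minimal NumPy sketch of the split-project-concatenate pattern. The projection matrices W_Q, W_K, W_V, W_O are learned parameters in the real model; here they are just random placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """X: (seq_len, d_model). Each W_*: (d_model, d_model). Returns (seq_len, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // h  # each head works in a smaller subspace
    # Project once, then split the last dimension into h heads.
    Q = (X @ W_Q).reshape(seq_len, h, d_head).transpose(1, 0, 2)  # (h, seq_len, d_head)
    K = (X @ W_K).reshape(seq_len, h, d_head).transpose(1, 0, 2)
    V = (X @ W_V).reshape(seq_len, h, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                    # (h, seq_len, d_head)
    # Concatenate the heads back to d_model and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O

# Toy usage with d_model = 512 and h = 8 heads (so d_head = 64, as in the base model).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 512))
W_Q, W_K, W_V, W_O = (rng.normal(size=(512, 512)) * 0.02 for _ in range(4))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, h=8).shape)  # (5, 512)
```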

Positional Embeddings

One key issue of only using attention is that, according to attention, one symbol in a sequence is the same distance away from any other symbol. To allow the model to determine distance, the authors introduce positional embeddings.


Visualization of sinusoidal positional embeddings from [3]. Each column is a positional embedding.

Equations for the sinusoidal position embeddings
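For reference, the equations from [1] are

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

where pos is the position in the sequence and i indexes the embedding dimension.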

The authors selected a sinusoidal embedding function because, for any fixed offset, the embedding at a shifted position can be represented as a linear function of the original embedding. However, many positional embeddings are possible, and the authors also experiment with learned positional embeddings (achieving similar results). The sinusoidal embeddings are used in the paper because the authors hypothesize that they will allow the model to extrapolate to sequence lengths longer than those seen during training. Note that later work suggests that word order is not necessarily as important as intuition suggests [8].
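A short NumPy sketch of how these embeddings can be generated (the function name and shapes are my own):

```python
import numpy as np

def sinusoidal_positional_embeddings(max_len, d_model):
    """Return a (max_len, d_model) matrix; row `pos` is the embedding added to token `pos`."""
    positions = np.arange(max_len)[:, None]       # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # even dimension indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)   # cosine on odd dimensions
    return pe

pe = sinusoidal_positional_embeddings(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```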

Types of Attention

In AAYN, three types of attention are used:

  • Encoder-decoder attention: Queries Q come from the previous decoder layer; keys K and values V come from the encoder output. This allows the decoder to look at the source-language input in order to translate it.

  • Encoder self-attention layer: Each position can attend to every other position in the previous layer of the encoder.

  • Decoder self-attention layer: Same as encoder self-attention, but connections to positions the model should not yet have seen are masked out (set to negative infinity before the softmax); see the masking sketch after this list.

    • This maintains the autoregressive property of the model by preventing the model from looking at words it hasn’t seen yet.
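To make the decoder masking concrete, here is a small NumPy sketch that sets the logits for future positions to negative infinity before the softmax (the helper names are my own):

```python
import numpy as np

def causal_mask(seq_len):
    """(seq_len, seq_len) mask: 0 where position j <= i, -inf where j is in the future."""
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)  # 1s strictly above the diagonal
    return np.where(mask == 1, -np.inf, 0.0)

def masked_softmax_rows(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

# Adding the mask to the attention logits zeroes out the weight on future tokens.
scores = np.random.default_rng(0).normal(size=(4, 4))
weights = masked_softmax_rows(scores + causal_mask(4))
print(np.round(weights, 2))  # upper-triangular entries (future positions) are 0
```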

Why Self-Attention?

Self-attention allows the model to learn dependencies between different symbols in the sequence. The paper compares layer types along three axes: complexity per layer, number of sequential operations, and maximum path length between any two positions (here n is the sequence length, d the representation dimension, and k the convolution kernel width):

  • Self-attention: O(n² · d) per layer, O(1) sequential operations, O(1) maximum path length.

  • Recurrent: O(n · d²) per layer, O(n) sequential operations, O(n) maximum path length.

  • Convolutional: O(k · n · d²) per layer, O(1) sequential operations, O(log_k(n)) maximum path length.

Note that self-attention achieves the best complexity in terms of both maximum path length and sequential operations (the latter indicates parallelizability). Additionally, the complexity per layer is low when n is significantly smaller than d, which often occurs in practice (although many researchers are working on settings where this assumption no longer holds).

Putting It All Together

Alright, we’ve looked at all the pieces. Now, let’s put everything together!

The Encoder

The encoder combines all the parts we talked about and is stacked N times (N = 6 in the base Transformer model). A couple of notable things we haven't mentioned yet:

  • The model uses either byte-pair or word-piece tokenization to create token sequences from raw character strings.

  • The model uses residual connections: the input of each sub-layer is added to its output, followed by layer normalization.

  • The model uses a simple position-wise feed-forward network (two linear layers with a ReLU in between) after attention; see the sketch after this list.
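To make the residual-plus-layer-norm pattern and the feed-forward sub-layer concrete, here is a rough NumPy sketch. The parameter shapes and the dummy attention function are my own stand-ins, not the paper's actual implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Two linear layers with a ReLU in between, applied independently at each position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_sublayers(x, attention_fn, ffn_params):
    # Sub-layer 1: self-attention, wrapped in a residual connection and layer norm.
    x = layer_norm(x + attention_fn(x))
    # Sub-layer 2: position-wise feed-forward network, same residual + norm pattern.
    x = layer_norm(x + feed_forward(x, *ffn_params))
    return x

# Toy usage with d_model = 512 and d_ff = 2048 (the base-model sizes);
# the attention function is a dummy identity just to show the wiring.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 512))
ffn_params = (rng.normal(size=(512, 2048)) * 0.02, np.zeros(2048),
              rng.normal(size=(2048, 512)) * 0.02, np.zeros(512))
print(encoder_sublayers(x, attention_fn=lambda h: h, ffn_params=ffn_params).shape)  # (5, 512)
```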

The Decoder

The decoder is mostly the same as the encoder. However, it uses masking to hide symbols the model shouldn't have seen yet in its input. For example, if I say “The cat sat”, then the model needs to generate the next word. This is called autoregression. If the model generates “on”, then “The cat sat on” is fed back into the model to generate the next word (maybe “the”). When we train the model, we feed in the full sentence the model is learning to generate, so we can't let it see what's coming.
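To make the autoregressive loop concrete, here is a tiny sketch of greedy decoding at inference time. The `model` here is a dummy placeholder that returns a canned next token; a real Transformer would return the most probable next token given the source sentence and the prefix generated so far:

```python
def greedy_decode(model, source_tokens, bos="<s>", eos="</s>", max_len=50):
    """Generate a translation one token at a time, feeding each output back into the model."""
    output = [bos]
    for _ in range(max_len):
        next_token = model(source_tokens, output)  # most likely next token given the prefix
        output.append(next_token)
        if next_token == eos:
            break
    return output

# Dummy "model" that just echoes a fixed translation, to show the interface.
canned = ["The", "cat", "sat", "on", "the", "mat", "</s>"]
dummy_model = lambda src, prefix: canned[len(prefix) - 1]
print(greedy_decode(dummy_model, source_tokens=["El", "gato", "se", "sentó"]))
```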

In addition, the decoder has an attention mechanism that looks at the source sequence it is translating (the encoder-decoder attention described earlier). This lets the decoder connect the source input to its translated output; for example, “gato” (cat in Spanish) might be connected to “cat” in the example above using attention.

Training

The training details for the model are as follows:

  • Sentences are encoded (convert a string of characters to a sequence of symbols):

    • English-German uses byte-pair encoding with a vocabulary of about 37,000 tokens on 4.5M sentence pairs.
    • English-French uses word-piece encoding with a vocabulary of 32,000 tokens on 36M sentence pairs.
  • Batch size is chosen dynamically so that each batch contains approximately 25,000 source tokens and 25,000 target tokens.

  • 8 NVIDIA P100 GPUs are used to train the model.

    • Base models trained for 12 hours, big models for 3.5 days.
  • Adam optimizer is used with a special learning rate:

    • Linear warmup followed by inverse-square-root decay; see the sketch after this list.
  • Regularization: Dropout of 0.1 is applied to the output of each sub-layer (before it is added to the residual) and to the sums of the embeddings and positional encodings. Label smoothing is also used.

  • The last 5 checkpoints are averaged for the base model (the last 20 for the big model). Beam search is used to select the best translation.
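The learning-rate schedule from [1] can be written as a small function; warmup_steps = 4000 is the value used in the paper:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate from [1]: linear warmup for `warmup_steps`, then inverse-square-root decay."""
    step = max(step, 1)  # avoid step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises linearly to its peak at step 4000, then decays as 1/sqrt(step).
print(transformer_lr(1), transformer_lr(4000), transformer_lr(100000))
```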

The most interesting detail is that the batch size is dynamic, so that the source and target tokens each number approximately 25,000. The averaging of the last checkpoints and the warmup-then-decay learning rate schedule are also noteworthy.

Results

The details are finally out of the way! Let’s look at some pretty pictures and tables.

As can be seen in the figure above, the Transformer sets new state-of-the-art BLEU scores with much less training computation. Additionally, the “big” variant even beats some expensive ensemble models! We can see the effect of different hyperparameters (such as those that distinguish the base and big models) on performance in the following table:

Alright, now time for some visualizations from the paper:

As shown in the visualizations, the attention heads learn different, meaningful behaviors, such as anaphora resolution or connecting determiners to the words they modify. In fact, the authors show that the Transformer generalizes beyond translation: they apply it directly to English constituency parsing (extracting the syntactic structure of a sentence in the form of a tree).

Results are competitive with previous methods, even without task-specific fine-tuning.

Final Takeaway

Transformers precipitated a major change in the landscape of natural language processing and even other fields like computer vision. They led to even more powerful general language models, such as BERT [4]. The paper can be summarized by the following points.

  • Motivation: RNNs are not easily parallelizable and don't learn long-range dependencies well.

  • Models that only use attention are more effective and train faster.

  • Transformer can generalize to other tasks.

  • Multi-head attention helps address some of the problems of traditional attention. It allows multiple different attention tasks to be learned.

  • Transformers have a constant-length dependency path between any two positions.

References

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

[2] BLEU score definition: https://cloud.google.com/translate/automl/docs/evaluate

[3] https://kazemnejad.com/blog/transformer_architecture_positional_encoding/

[4] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[5] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.

[6] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

[7] Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002, July). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (pp. 311-318).

[8] Sinha, K., Jia, R., Hupkes, D., Pineau, J., Williams, A., & Kiela, D. (2021). Masked language modeling and the distributional hypothesis: Order word matters pre-training for little. arXiv preprint arXiv:2104.06644.