Attention is All You Need Whitepaper

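“Attention Is All You Need” (Vaswani et al., 2017) is the paper that introduced the Transformer, a sequence-to-sequence model for natural language processing. The model keeps the familiar encoder-decoder architecture but builds it entirely from attention mechanisms, dispensing with the recurrent and convolutional layers used by earlier models. Despite this simpler design, it achieved state-of-the-art results on the WMT 2014 English-to-German and English-to-French machine translation tasks, while being more parallelizable and requiring less time to train.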

Sequences and Encoding

The Transformer operates on sequences of input tokens, such as the words or subword units of a sentence or paragraph. Each token is mapped to a high-dimensional vector (an embedding) that the model can work with.

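  • Word Embeddings: The first step in a Transformer is to convert words into vectors using an embedding layer. This is a learnable parameter matrix with one row per vocabulary entry; multiplying it by a one-hot encoded word is equivalent to looking up that word’s row, which yields a dense vector.
  • Positional Encoding: Unlike recurrent networks such as LSTMs, the Transformer has no inherent notion of sequential order, because self-attention by itself ignores the positions of its inputs. It therefore adds positional encodings (in the paper, fixed sine and cosine functions of the position) to the word embeddings to incorporate the order of words in a sequence.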
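The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the toy sizes, the random `embedding` matrix (which would be learned in a real model), and the function name `sinusoidal_positional_encoding` are assumptions made for the example. It uses the sine/cosine encoding from the paper and assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sine/cosine position signals.

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8  # toy sizes, chosen only for illustration

# Stand-in for the learnable embedding matrix: one row per vocabulary entry.
embedding = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([3, 1, 4, 1])  # a toy "sentence" of token indices

# Row lookup is equivalent to multiplying a one-hot vector by the matrix.
# The paper also scales the embeddings by sqrt(d_model) before adding positions.
x = embedding[token_ids] * np.sqrt(d_model)
x = x + sinusoidal_positional_encoding(len(token_ids), d_model)
```

Note that token 1 appears twice in the toy input, at positions 1 and 3. Its embedding row is the same both times, but the two rows of `x` differ because different positional encodings were added.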

Self-Attention Mechanism

At the heart of the Transformer model is the self-attention mechanism. It lets the model weigh how relevant every other word in a sequence is when computing the representation of a given word.

  • Scaled Dot-Product Attention: This is the specific attention function used by the Transformer. Each word is projected into a query, a key, and a value vector. For each word, the output is a weighted sum of the value vectors of all words in the sequence, where the weights are obtained by taking the dot product of that word’s query with every key, dividing by the square root of the key dimension, and applying a softmax.
  • Multi-Head Attention: To capture different types of relationships in the data, the Transformer uses multiple ‘attention heads’. Each head applies scaled dot-product attention to its own learned linear projection of the input, and the results are concatenated and linearly transformed to form the final output.
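As a rough sketch of both ideas, the NumPy code below computes softmax(QK^T / sqrt(d_k)) V and then wraps it in a multi-head layer. The function names, the toy sizes, and the random projection matrices `w_q`, `w_k`, `w_v`, `w_o` (learned in a real model) are assumptions for illustration. The sketch assumes `d_model` is divisible by `num_heads` and omits biases, dropout, and masking.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q, k: (..., seq, d_k); v: (..., seq, d_v). Returns (output, weights)."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)  # (..., seq, seq)
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    return weights @ v, weights

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x of shape (seq, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    heads, _ = scaled_dot_product_attention(q, k, v)  # (heads, seq, d_head)
    # Concatenate the heads back into (seq, d_model), then mix them linearly.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2  # toy sizes
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
_, attn = scaled_dot_product_attention(x, x, x)  # single head, no projections
```

Here `out` has the same shape as `x`, and each row of `attn` is a probability distribution over the positions that the corresponding word attends to.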

The Encoder-Decoder Architecture

The Transformer follows an encoder-decoder structure.

  • Encoders: Each encoder layer applies a self-attention mechanism followed by a position-wise feed-forward network, with a residual connection and layer normalization around each of these two sub-layers. The output of each layer is the input to the next. This stack of layers (six in the paper) makes up the ‘Encoder’ part of the model.
  • Decoders: Each decoder layer has the same self-attention and feed-forward sub-layers, plus an extra sub-layer between them that performs multi-head attention over the encoder’s output (known as Encoder-Decoder Attention). The stack of decoder layers forms the ‘Decoder’ part of the model.
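The data flow through the two stacks can be sketched as follows. To keep the example short it makes several simplifying assumptions that are not in the paper: attention is single-head, layer normalization has no learned gain or bias, there is no dropout, and all parameters are random stand-ins. The helper names (`init_attn`, `init_ffn`, and so on) are hypothetical. The decoder’s self-attention uses a causal mask so that each position only attends to itself and earlier positions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q_in, kv_in, p, mask=None):
    """Single-head attention: queries from q_in, keys and values from kv_in."""
    q, k, v = q_in @ p["w_q"], kv_in @ p["w_k"], kv_in @ p["w_v"]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # blocked positions get ~zero weight
    return softmax(scores) @ v

def feed_forward(x, p):
    """Position-wise network: the same two linear maps applied at every position."""
    return np.maximum(0.0, x @ p["w1"] + p["b1"]) @ p["w2"] + p["b2"]

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_layer(x, p):
    x = layer_norm(x + attention(x, x, p["self_attn"]))  # self-attention sub-layer
    return layer_norm(x + feed_forward(x, p["ffn"]))      # feed-forward sub-layer

def decoder_layer(y, memory, p):
    causal = np.tril(np.ones((len(y), len(y)), dtype=bool))
    y = layer_norm(y + attention(y, y, p["self_attn"], mask=causal))
    y = layer_norm(y + attention(y, memory, p["cross_attn"]))  # encoder-decoder attention
    return layer_norm(y + feed_forward(y, p["ffn"]))

def init_attn(rng, d):
    return {name: rng.normal(scale=d ** -0.5, size=(d, d)) for name in ("w_q", "w_k", "w_v")}

def init_ffn(rng, d, d_ff):
    return {"w1": rng.normal(scale=d ** -0.5, size=(d, d_ff)), "b1": np.zeros(d_ff),
            "w2": rng.normal(scale=d_ff ** -0.5, size=(d_ff, d)), "b2": np.zeros(d)}

rng = np.random.default_rng(0)
d_model, d_ff, num_layers = 8, 16, 2  # toy sizes
enc_params = [{"self_attn": init_attn(rng, d_model),
               "ffn": init_ffn(rng, d_model, d_ff)} for _ in range(num_layers)]
dec_params = [{"self_attn": init_attn(rng, d_model),
               "cross_attn": init_attn(rng, d_model),
               "ffn": init_ffn(rng, d_model, d_ff)} for _ in range(num_layers)]

src = rng.normal(size=(6, d_model))  # stand-in for embedded source tokens
tgt = rng.normal(size=(4, d_model))  # stand-in for embedded target tokens

memory = src
for p in enc_params:   # the output of each encoder layer feeds the next
    memory = encoder_layer(memory, p)

out = tgt
for p in dec_params:   # every decoder layer attends to the final encoder output
    out = decoder_layer(out, memory, p)
```

Every decoder layer reads the output of the last encoder layer (`memory`), not the output of the encoder layer at the same depth.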

Final Linear and Softmax Layer

The output of the top decoder layer passes through a final linear layer that projects it to the size of the vocabulary. A softmax then turns these scores into a probability distribution over the next output token at each position.
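A minimal sketch of this last step, assuming a random projection matrix `w_vocab` and toy sizes (both hypothetical; the projection is learned in a real model):

```python
import numpy as np

def output_probabilities(decoder_out, w_vocab, b_vocab):
    """Project decoder states to vocabulary logits, then apply a softmax."""
    logits = decoder_out @ w_vocab + b_vocab               # (seq, vocab_size)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 10                       # toy sizes
decoder_out = rng.normal(size=(4, d_model))       # stand-in for the top decoder output
w_vocab = rng.normal(size=(d_model, vocab_size))  # final linear layer weights
b_vocab = np.zeros(vocab_size)

probs = output_probabilities(decoder_out, w_vocab, b_vocab)
next_token = int(probs[-1].argmax())  # greedy choice for the token after the last position
```

Each row of `probs` sums to 1. Greedy `argmax` is only one way to choose the next token; the paper uses beam search at inference time.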

Loss Function and Backpropagation

The model is trained by minimizing a loss function, typically the cross-entropy between the predicted token probabilities and the actual target tokens. The gradient of this loss with respect to every parameter is computed with the backpropagation algorithm, and the model’s parameters (weights) are then updated by an optimization algorithm such as Adam, the optimizer used in the paper.
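The training loop can be illustrated on a deliberately tiny stand-in model: a single linear layer followed by a softmax, for which the gradient can be written by hand instead of running backpropagation through a full Transformer. The data is random, the names are hypothetical, and the Adam hyperparameters are common defaults rather than the values or the learning-rate schedule used in the paper.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, targets):
    """Mean negative log-probability assigned to the correct tokens."""
    return -np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-12))

def adam_step(param, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; state holds the step count and both moment estimates."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected second moment
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

rng = np.random.default_rng(0)
n, d_model, vocab_size = 20, 8, 5                # toy sizes
features = rng.normal(size=(n, d_model))         # stand-in for decoder outputs
targets = rng.integers(0, vocab_size, size=n)    # stand-in for the correct tokens
w = np.zeros((d_model, vocab_size))
state = {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}

losses = []
for _ in range(200):
    probs = softmax(features @ w)                # forward pass
    losses.append(cross_entropy(probs, targets))
    # Backward pass: for softmax + cross-entropy, dLoss/dlogits = (probs - one_hot) / n.
    grad_logits = probs.copy()
    grad_logits[np.arange(n), targets] -= 1.0
    grad_w = features.T @ grad_logits / n
    w = adam_step(w, grad_w, state)              # parameter update
```

With all-zero initial weights the model predicts a uniform distribution, so the first loss equals log(vocab_size); the recorded losses then fall as the weights are updated.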


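Masking

Because the decoder generates its output one token at a time, it must not see future tokens when predicting the current one. During training, where the whole target sequence is available at once, this is enforced by a technique called masking. It is applied in the self-attention mechanism of the decoders: the attention scores for all later positions are set to negative infinity before the softmax, so those positions receive zero weight.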
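A small sketch of such a mask, using a lower-triangular boolean matrix (the function names are illustrative, and the all-zero scores simply make the effect easy to read):

```python
import numpy as np

def causal_mask(seq_len):
    """True where attention is allowed: position i may look at positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)  # block future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)                        # exp(-inf) is exactly 0
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # equal raw scores everywhere
weights = masked_softmax(scores, causal_mask(4))
# weights[0] is [1, 0, 0, 0]; weights[1] is [0.5, 0.5, 0, 0];
# weights[3] is [0.25, 0.25, 0.25, 0.25]
```

The first position can only attend to itself, while the last position spreads its attention over the whole sequence. Everything above the diagonal is exactly zero.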