Let’s talk about Transformers. And no, I’m not talking about the half-car-half-robot-movie-I-never-watched. I promise you, this is a lot cooler.

Introduced in a 2017 paper called “Attention Is All You Need” by researchers at Google, the Transformer is a novel type of AI model architecture that decisively outperformed the previous state-of-the-art models. It kicked off the most recent wave of advancements in AI.

Today, I’m explaining Transformers comprehensively. I’m diving into the step-by-step mathematical procedure a Transformer follows to read in a sequence of words and output a ‘smart’ sequence of words as a response.

To understand this article, it helps to have a solid grasp of Linear Algebra and (some) familiarity with the basics of Deep Learning.

Why someone should read this article

This article could be characterized as dense and long. It very well might be boring at times. But I promise you, it will do a great job of explaining how transformers work.

So why should you read this article?

You should read this article if you have a specific desire to understand transformers. Maybe it’s for a university class. Maybe it’s to fulfill a passion for the subject. Heck, maybe it’s because you’re going insane with boredom and the endless cat videos have provided no relief.

Regardless, this is not meant to be an entertaining article. It’s meant to be informative. Treat it as such.

Introduction / Overview

Transformers belong to a field of AI called Natural Language Processing (or NLP for short). We call the models in this field Language Models, because they operate on language. What does that mean?

It means the inputs of our model are words, and the outputs should also be words. However, our Language Models are mathematical: how can they operate on words instead of numbers?

The answer is that when we feed the model a sentence, we must first turn our words into a sequence of tokens, or a list of numbers. Essentially, the sentence will be sliced up into little bits, and each “bit” will be represented by its own number.

This is important, as our Transformer Model does not know how to operate on text. We need to map our words to numbers, which the model will then be able to understand. After the model operates on these numbers, it will output its own set of numbers. These numbers will then be translated back into words using the same “dictionary” that we used to turn words into numbers.
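
As a minimal sketch of that idea (a toy word-level “dictionary”, not the subword byte-pair-encoding tokenizer real GPT models actually use), here is what turning words into numbers and back might look like:

```python
# A toy word-level tokenizer. This is a simplified sketch of the idea;
# real GPT models use a subword byte-pair-encoding tokenizer instead.

text = "the dog barked up a tree"

# Build the "dictionary": each unique word gets its own number.
vocab = {word: idx for idx, word in enumerate(sorted(set(text.split())))}
inverse_vocab = {idx: word for word, idx in vocab.items()}

def encode(sentence: str) -> list[int]:
    """Slice the sentence into words and map each word to its number."""
    return [vocab[word] for word in sentence.split()]

def decode(tokens: list[int]) -> str:
    """Translate numbers back into words using the same dictionary."""
    return " ".join(inverse_vocab[t] for t in tokens)

tokens = encode("the dog barked")
print(tokens)          # [3, 2, 1]
print(decode(tokens))  # "the dog barked"
```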

There are different flavors of the Transformer architecture: encoder-only models, encoder-decoder models, and decoder-only models. For chatbots such as ChatGPT, the decoder-only architecture is used, so that is what we will focus on.

GPT is a Prediction Model

In Language Models, there is a distinction between how “training” is done and how “inference” is done. During training, we want our Language Model to accurately predict the next word. Remember how the Transformer Model outputs a set of numbers? More specifically, for each input token, the model outputs a discrete probability distribution over its vocabulary, representing its guess at what the next token is.
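
To make that concrete, here is a minimal sketch with a made-up six-word vocabulary and made-up scores. At each position the model emits one raw score (a “logit”) per vocabulary entry, and a softmax turns those scores into probabilities:

```python
import numpy as np

# Hypothetical raw scores (logits) the model emits at ONE position,
# one score per entry in a tiny made-up 6-word vocabulary.
logits = np.array([1.2, 0.3, 3.1, -0.5, 0.9, 0.0])

# Softmax turns the raw scores into a discrete probability distribution.
probs = np.exp(logits) / np.sum(np.exp(logits))

print(probs)           # every entry lies in [0, 1]
print(probs.sum())     # ~1.0, so it is a valid probability distribution
print(probs.argmax())  # index 2: the model's top guess for the next token
```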

If the sentence “The dog barked up a tree” is passed into the model during training, then the model is trying to predict that “dog” comes after “The”, that “barked” comes after “The dog”, and so on.
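
In other words, the training targets are just the input tokens shifted one position to the left. A quick sketch of that pairing:

```python
sentence = "The dog barked up a tree".split()

# During training, the target at each position is simply the next token.
inputs  = sentence[:-1]  # ["The", "dog", "barked", "up", "a"]
targets = sentence[1:]   # ["dog", "barked", "up", "a", "tree"]

for i, target in enumerate(targets):
    context = " ".join(sentence[: i + 1])
    print(f"after '{context}' -> predict '{target}'")
# after 'The' -> predict 'dog'
# after 'The dog' -> predict 'barked'
# ... and so on, up to predicting 'tree' after 'The dog barked up a'
```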