Understanding Transformer Model Process in Simple Terms

How does the transformer model perform training and inference?

Learning the Internal Mechanisms of Transformer Model

2025-4-10

transformer, attention mechanism, masking

Transformer Model

Problem Description

How can we understand the training process of the transformer model?

Problem Analysis

1. Encoder Layer

Let's say our training data is:

const data = [{text:'Cats like to eat fish<eos>'},{text:'Dogs like to eat bones<eos>'}]
  • Then we use a tokenizer to split our training data into tokens
const trainData = [
  ['Cats','like','to','eat','fish','<eos>'],
  ['Dogs','like','to','eat','bones','<eos>']
]
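For concreteness, here is a minimal sketch of what such a tokenizer could look like (a hypothetical whitespace split, not the subword tokenizer a real transformer would typically use; the tokenize function is made up for this example):

// Hypothetical toy tokenizer: split on whitespace and keep <eos> as its own token
const data = [{text:'Cats like to eat fish<eos>'},{text:'Dogs like to eat bones<eos>'}]

function tokenize(text) {
  // Put a space before <eos> so it survives the whitespace split as a separate token
  return text.replace('<eos>', ' <eos>').trim().split(/\s+/)
}

const trainData = data.map(item => tokenize(item.text))
console.log(trainData)
// [ [ 'Cats', 'like', 'to', 'eat', 'fish', '<eos>' ],
//   [ 'Dogs', 'like', 'to', 'eat', 'bones', '<eos>' ] ]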
  • Next we capture the contextual relationships between words by calculating attention scores (illustrated below with the first sample)

For example, Cats scores 0 with itself, 2 with like, 3 with to, 4 with eat, 5 with fish, and so on:

        Cats  like  to  eat  fish  <eos>
Cats     0     2    3    4    5     0
like     1     0    5    6    7     0
to       2     3    0    7    8     0
eat      3     4    5    0    9     0
fish     4     5    6    7    0     0
<eos>    4     5    6    7    0     0
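In a real model these scores are not written by hand; they come from dot products between learned query and key vectors, one pair per token. The sketch below uses made-up 2-dimensional vectors purely to show the shape of that computation:

// Sketch: attention scores as dot products between query and key vectors (toy numbers, not learned values)
const tokens = ['Cats','like','to','eat','fish','<eos>']

// Pretend these came from the token embeddings multiplied by the learned W_Q and W_K matrices
const queries = [[1,0],[0,1],[1,1],[2,1],[1,2],[0,0]]
const keys    = [[0,1],[1,1],[1,0],[0,2],[2,0],[1,2]]

const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0)

// scores[i][j] answers: how strongly should token i attend to token j?
const scores = queries.map(q => keys.map(k => dot(q, k)))
scores.forEach((row, i) => console.log(tokens[i], row))   // one row of scores per token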
  • After normalizing these scores (for example with softmax), each token carries a weight and we get a weighted representation of the sample; remember that the output dimension is still the same as the input dimension
['Cats*0.2','like*0.2','to*0.1','eat*0.3','fish*0.1','<eos>*0.1']
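Those weights come from applying softmax to each row of the score table, so every row sums to 1. A minimal sketch, reusing the Cats row from the table above:

// Sketch: softmax turns one row of raw attention scores into weights that sum to 1
function softmax(row) {
  const max = Math.max(...row)                  // subtract the max for numerical stability
  const exps = row.map(s => Math.exp(s - max))
  const total = exps.reduce((a, b) => a + b, 0)
  return exps.map(e => e / total)
}

const catsScores = [0, 2, 3, 4, 5, 0]           // the Cats row of the score table
const catsWeights = softmax(catsScores)
console.log(catsWeights.map(w => w.toFixed(2)))
// ≈ [ '0.00', '0.03', '0.09', '0.23', '0.64', '0.00' ]  (the full-precision weights sum to 1)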

2. Decoder Layer

  • The decoding process is similar to the encoder's, but with some changes: the decoder input becomes the following data
['<sos>','Cats','like','to','eat','fish']
  • Notice that our <eos> is gone and <sos> has been added at the beginning; this is called the right-shift operation (a minimal sketch follows)
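A minimal sketch of that right shift, assuming the target sequence is the tokenized first sample:

// Sketch: the right shift drops <eos> from the end and prepends <sos>
const target = ['Cats','like','to','eat','fish','<eos>']
const decoderInput = ['<sos>', ...target.slice(0, -1)]
console.log(decoderInput)
// [ '<sos>', 'Cats', 'like', 'to', 'eat', 'fish' ]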
  • Why do we need to do this? Remember the attention scores from before? In the decoder we apply a special operation to them, called masking:
        <sos>  Cats   like   to     eat    fish
<sos>    0     -inf   -inf   -inf   -inf   -inf
Cats     0      0     -inf   -inf   -inf   -inf
like     0      0      0     -inf   -inf   -inf
to       0      0      0      0     -inf   -inf
eat      0      0      0      0      0     -inf
fish     0      0      0      0      0      0
  • -inf stands for an extremely small value whose attention weight becomes 0 after softmax, i.e. no attention; so <sos> cannot see the word Cats, and no position can see the future tokens it will be asked to predict when we calculate the loss (see the sketch below)
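Here is a minimal sketch of how such a causal (look-ahead) mask could be built and applied; the raw scores are dummy values, and the point is only that adding -Infinity before softmax drives the weights of future positions to exactly 0:

// Sketch: build a causal mask for a sequence of length n and apply it before softmax
const n = 6  // ['<sos>','Cats','like','to','eat','fish']

// mask[i][j] = 0 when position i may attend to position j (j <= i), -Infinity otherwise
const mask = Array.from({ length: n }, (_, i) =>
  Array.from({ length: n }, (_, j) => (j <= i ? 0 : -Infinity))
)

function softmax(row) {
  const max = Math.max(...row.filter(Number.isFinite))
  const exps = row.map(s => Math.exp(s - max))   // exp(-Infinity) === 0, so masked positions vanish
  const total = exps.reduce((a, b) => a + b, 0)
  return exps.map(e => e / total)
}

// Dummy uniform scores: every position would attend equally if there were no mask
const rawScores = Array.from({ length: n }, () => Array(n).fill(1))
const maskedWeights = rawScores.map((row, i) => softmax(row.map((s, j) => s + mask[i][j])))
console.table(maskedWeights)
// Row 0 (<sos>) attends only to itself; row 5 (fish) attends to all six positions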
  • We need the loss to continuously adjust the parameters: each position of the decoder input is trained to predict the next token of the target sequence
input=['<sos>','Cats','like','to','eat','fish']
label=['Cats','like','to','eat','fish','<eos>']
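To make the loss concrete, here is a rough sketch of teacher forcing with a cross-entropy loss: at position i the decoder sees input[0..i] and should assign a high probability to label[i]. The probabilities below are invented just to show the calculation:

// Sketch: cross-entropy loss for next-token prediction (teacher forcing)
const input = ['<sos>','Cats','like','to','eat','fish']
const label = ['Cats','like','to','eat','fish','<eos>']

// Pretend the model assigned these probabilities to the correct next token at each position
const probOfCorrect = [0.40, 0.55, 0.70, 0.65, 0.80, 0.60]   // made-up numbers

// Loss = average of -log(p_correct); training nudges the weights to push it toward 0
const loss = probOfCorrect.reduce((sum, p) => sum - Math.log(p), 0) / probOfCorrect.length
console.log(loss.toFixed(3))   // ≈ 0.506 -- lower means better predictions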

3. Final Output

  • We have a vocabulary, the same vocabulary the tokenizer was originally built on. The decoder output is projected onto this vocabulary and turned into probabilities, and through it we can tell which word is being predicted
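A final sketch, with a hypothetical tiny vocabulary and made-up scores, of how the decoder output at one position is turned into a predicted word: project it to vocabulary-sized logits and take the highest-scoring entry (softmax would not change which entry wins, since it preserves order).

// Sketch: map the final decoder output at one position to a word in the vocabulary
const vocab = ['<sos>','<eos>','Cats','Dogs','like','to','eat','fish','bones']

// Pretend these scores (logits) came from the final linear projection layer for one position
const logits = [0.1, 0.3, 0.2, 0.1, 0.2, 0.1, 0.4, 2.5, 0.6]   // made-up numbers

const predictedIndex = logits.indexOf(Math.max(...logits))
console.log(vocab[predictedIndex])   // 'fish' -- look the winning index up in the vocabulary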