Understanding Transformer Model Process in Simple Terms

How does the transformer model perform training and inference?

Learning the Internal Mechanisms of Transformer Model

2025-4-10

transformer, attention mechanism, masking

Transformer Model

Problem Description

How can we understand the training process of the transformer model?

Problem Analysis

1. Encoder Layer

Let's say our training data is:

const data = [{text:'Cats like to eat fish<eos>'},{text:'Dogs like to eat bones<eos>'}]
  • Then we use a tokenizer to split our training data into tokens
const trainData = [
  ['Cats','like','to','eat','fish','<eos>'],
  ['Dogs','like','to','eat','bones','<eos>']
]
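For concreteness, here is a minimal sketch of what such a tokenizer could look like (a hypothetical whitespace split, not the subword tokenizer a real transformer would typically use; the tokenize function is made up for this example):

// Hypothetical toy tokenizer: split on whitespace and keep <eos> as its own token
const data = [{text:'Cats like to eat fish<eos>'},{text:'Dogs like to eat bones<eos>'}]

function tokenize(text) {
  // Put a space before <eos> so it survives the whitespace split as a separate token
  return text.replace('<eos>', ' <eos>').trim().split(/\s+/)
}

const trainData = data.map(item => tokenize(item.text))
console.log(trainData)
// [ [ 'Cats', 'like', 'to', 'eat', 'fish', '<eos>' ],
//   [ 'Dogs', 'like', 'to', 'eat', 'bones', '<eos>' ] ]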
  • Next we capture the contextual relationships between words by calculating attention scores (illustrated below with the first sample)

For example, Cats scores 0 with itself, 2 with like, 3 with to, 4 with eat, 5 with fish, and so on:

        Cats  like  to  eat  fish  <eos>
Cats     0     2    3    4    5     0
like     1     0    5    6    7     0
to       2     3    0    7    8     0
eat      3     4    5    0    9     0
fish     4     5    6    7    0     0
<eos>    4     5    6    7    0     0
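In a real model these scores are not written by hand; they come from dot products between learned query and key vectors, one pair per token. The sketch below uses made-up 2-dimensional vectors purely to show the shape of that computation:

// Sketch: attention scores as dot products between query and key vectors (toy numbers, not learned values)
const tokens = ['Cats','like','to','eat','fish','<eos>']

// Pretend these came from the token embeddings multiplied by the learned W_Q and W_K matrices
const queries = [[1,0],[0,1],[1,1],[2,1],[1,2],[0,0]]
const keys    = [[0,1],[1,1],[1,0],[0,2],[2,0],[1,2]]

const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0)

// scores[i][j] answers: how strongly should token i attend to token j?
const scores = queries.map(q => keys.map(k => dot(q, k)))
scores.forEach((row, i) => console.log(tokens[i], row))   // one row of scores per token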
  • After normalizing these scores (for example with softmax), each token carries a weight and we get a weighted representation of the sample; remember that the output dimension is still the same as the input dimension
['Cats*0.2','like*0.2','to*0.1','eat*0.3','fish*0.1','<eos>*0.1']
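Those weights come from applying softmax to each row of the score table, so every row sums to 1. A minimal sketch, reusing the Cats row from the table above:

// Sketch: softmax turns one row of raw attention scores into weights that sum to 1
function softmax(row) {
  const max = Math.max(...row)                  // subtract the max for numerical stability
  const exps = row.map(s => Math.exp(s - max))
  const total = exps.reduce((a, b) => a + b, 0)
  return exps.map(e => e / total)
}

const catsScores = [0, 2, 3, 4, 5, 0]           // the Cats row of the score table
const catsWeights = softmax(catsScores)
console.log(catsWeights.map(w => w.toFixed(2)))
// ≈ [ '0.00', '0.03', '0.09', '0.23', '0.64', '0.00' ]  (the full-precision weights sum to 1)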

2. Decoder Layer

  • The decoding process is similar to the encoder's, but with some changes: the decoder input becomes the following data
['<sos>','Cats','like','to','eat','fish']
  • Notice that our <eos> is gone and <sos> has been added at the beginning; this is called the right-shift operation (a minimal sketch follows)
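A minimal sketch of that right shift, assuming the target sequence is the tokenized first sample:

// Sketch: the right shift drops <eos> from the end and prepends <sos>
const target = ['Cats','like','to','eat','fish','<eos>']
const decoderInput = ['<sos>', ...target.slice(0, -1)]
console.log(decoderInput)
// [ '<sos>', 'Cats', 'like', 'to', 'eat', 'fish' ]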
  • Why do we need to do this? Remember the attention scores from before? In the decoder we apply a special operation to them, called masking:
        <sos>  Cats   like   to     eat    fish
<sos>    0     -inf   -inf   -inf   -inf   -inf
Cats     0      0     -inf   -inf   -inf   -inf
like     0      0      0     -inf   -inf   -inf
to       0      0      0      0     -inf   -inf
eat      0      0      0      0      0     -inf
fish     0      0      0      0      0      0
  • -inf stands for an extremely small value whose attention weight becomes 0 after softmax, i.e. no attention; so <sos> cannot see the word Cats, and no position can see the future tokens it will be asked to predict when we calculate the loss (see the sketch below)
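Here is a minimal sketch of how such a causal (look-ahead) mask could be built and applied; the raw scores are dummy values, and the point is only that adding -Infinity before softmax drives the weights of future positions to exactly 0:

// Sketch: build a causal mask for a sequence of length n and apply it before softmax
const n = 6  // ['<sos>','Cats','like','to','eat','fish']

// mask[i][j] = 0 when position i may attend to position j (j <= i), -Infinity otherwise
const mask = Array.from({ length: n }, (_, i) =>
  Array.from({ length: n }, (_, j) => (j <= i ? 0 : -Infinity))
)

function softmax(row) {
  const max = Math.max(...row.filter(Number.isFinite))
  const exps = row.map(s => Math.exp(s - max))   // exp(-Infinity) === 0, so masked positions vanish
  const total = exps.reduce((a, b) => a + b, 0)
  return exps.map(e => e / total)
}

// Dummy uniform scores: every position would attend equally if there were no mask
const rawScores = Array.from({ length: n }, () => Array(n).fill(1))
const maskedWeights = rawScores.map((row, i) => softmax(row.map((s, j) => s + mask[i][j])))
console.table(maskedWeights)
// Row 0 (<sos>) attends only to itself; row 5 (fish) attends to all six positions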
  • We need the loss to continuously adjust the parameters: each position of the decoder input is trained to predict the next token of the target sequence
input=['<sos>','Cats','like','to','eat','fish']
label=['Cats','like','to','eat','fish','<eos>']
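To make the loss concrete, here is a rough sketch of teacher forcing with a cross-entropy loss: at position i the decoder sees input[0..i] and should assign a high probability to label[i]. The probabilities below are invented just to show the calculation:

// Sketch: cross-entropy loss for next-token prediction (teacher forcing)
const input = ['<sos>','Cats','like','to','eat','fish']
const label = ['Cats','like','to','eat','fish','<eos>']

// Pretend the model assigned these probabilities to the correct next token at each position
const probOfCorrect = [0.40, 0.55, 0.70, 0.65, 0.80, 0.60]   // made-up numbers

// Loss = average of -log(p_correct); training nudges the weights to push it toward 0
const loss = probOfCorrect.reduce((sum, p) => sum - Math.log(p), 0) / probOfCorrect.length
console.log(loss.toFixed(3))   // ≈ 0.506 -- lower means better predictions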

3. Final Output

  • We have a vocabulary, the same vocabulary the tokenizer was originally built on. The decoder output is projected onto this vocabulary and turned into probabilities, and through it we can tell which word is being predicted
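A final sketch, with a hypothetical tiny vocabulary and made-up scores, of how the decoder output at one position is turned into a predicted word: project it to vocabulary-sized logits and take the highest-scoring entry (softmax would not change which entry wins, since it preserves order).

// Sketch: map the final decoder output at one position to a word in the vocabulary
const vocab = ['<sos>','<eos>','Cats','Dogs','like','to','eat','fish','bones']

// Pretend these scores (logits) came from the final linear projection layer for one position
const logits = [0.1, 0.3, 0.2, 0.1, 0.2, 0.1, 0.4, 2.5, 0.6]   // made-up numbers

const predictedIndex = logits.indexOf(Math.max(...logits))
console.log(vocab[predictedIndex])   // 'fish' -- look the winning index up in the vocabulary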