## Input embedding

The input embedding sub-layer converts the input tokens to vectors of dimension $d_{\text{model}} = 512$ using learned embeddings in the original Transformer model. The structure of the input embedding is classical.

The embedding sub-layer works like other standard transduction models. A tokenizer will transform a sentence into tokens. Each tokenizer has its own methods, but the results are similar. For example, a tokenizer applied to the sequence “the Transformer is an innovative NLP model!” will produce a list of tokens whose exact form depends on the model. The tokenizer in this example normalized the string to lower case and split it into subparts. A tokenizer will generally also provide an integer representation of the tokens that will be used for the embedding process.
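The two steps described above (text to tokens, tokens to integers) can be sketched with a toy tokenizer. This is a minimal illustration only: `toy_tokenize` and `build_vocab` are hypothetical helpers, not the tokenizer of any real model, and unlike a subword tokenizer they keep each word whole.

```python
import re

def toy_tokenize(text):
    """Lowercase the text, then split it into word and punctuation tokens."""
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())

def build_vocab(tokens):
    """Give each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

sentence = "the Transformer is an innovative NLP model!"
tokens = toy_tokenize(sentence)
vocab = build_vocab(tokens)             # hypothetical vocabulary built on the fly
token_ids = [vocab[t] for t in tokens]  # integer representation fed to the embedding

print(tokens)
print(token_ids)
```

A real tokenizer uses a fixed, pre-trained vocabulary, so its integer ids would differ from the ones produced here.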

There is not enough information in the tokenized text at this point to go further. The tokenized text must be embedded.
The Transformer contains a learned embedding sub-layer. Many embedding methods can be applied to the tokenized input.
I chose the skip-gram architecture of the word2vec embedding approach that Google made available in 2013 to illustrate the embedding sub-layer of the Transformer. A skip-gram model focuses on a center word in a window of words and predicts the context words. For example, if word(i) is the center word in a two-step window, a skip-gram model will analyze word(i-2), word(i-1), word(i+1), and word(i+2). Then the window will slide and repeat the process. A skip-gram model generally contains an input layer, weights, a hidden layer, and an output containing the word embeddings of the tokenized input words.
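The sliding window can be made concrete with a short sketch that only generates the (center word, context word) training pairs; it does not train a word2vec model. The function name `skipgram_pairs` and the short token list are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every position in the token list."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                # clip the window at the start
        hi = min(len(tokens), i + window + 1)  # clip the window at the end
        for j in range(lo, hi):
            if j != i:                         # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the black cat sat on the couch".split()
pairs = skipgram_pairs(tokens, window=2)

# Context words seen when "cat" is the center word: word(i-2), word(i-1), word(i+1), word(i+2)
print([context for center, context in pairs if center == "cat"])
```

Words near the edges of the sequence simply get fewer context words, since the window is clipped rather than padded in this sketch.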
Suppose we need to perform embedding for the following sentence:
The black cat sat on the couch and the brown dog slept on the rug.
We will focus on two words, black and brown. The word embedding vectors of these two words should be similar.
Since we must produce a vector of size $d_{\text{model}} = 512$ for each word, we will obtain a size 512 vector embedding for each word. The word black is now represented by 512 dimensions. Other embedding methods could be used, and $d_{\text{model}}$ could have a higher number of dimensions.
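As a rough sketch of what "one 512-dimensional vector per word" looks like in practice, the snippet below builds an embedding table and compares two rows with cosine similarity. The vectors are random placeholders (an assumption made for illustration), so the similarity printed here carries no meaning; with trained embeddings, the vectors for black and brown would be expected to score higher than unrelated words. The helper names are hypothetical.

```python
import math
import random

D_MODEL = 512  # embedding size used in the original Transformer

def init_embeddings(vocab, d_model=D_MODEL, seed=0):
    """Create one random d_model-sized vector per word (untrained placeholder)."""
    rng = random.Random(seed)
    return {word: [rng.gauss(0.0, 1.0) for _ in range(d_model)] for word in vocab}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sentence = "the black cat sat on the couch and the brown dog slept on the rug"
vocab = sorted(set(sentence.split()))
embeddings = init_embeddings(vocab)

black, brown = embeddings["black"], embeddings["brown"]
print(len(black))                       # number of dimensions per word
print(cosine_similarity(black, brown))  # not meaningful for random vectors
```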

## Positional encoding

We enter this positional encoding function of the Transformer with no idea of the position of a word in a sequence.

We cannot create independent positional vectors, because they would have a high cost on the training speed of the Transformer and would make the attention sub-layers very complex to work with. The idea is to add a positional encoding value to the input embedding instead of having additional vectors to describe the position of a token in a sequence.
We also know that the Transformer expects a fixed size $d_{\text{model}} = 512$ (or another constant value for the model) for each vector of the output of the positional encoding function.
If we go back to the sentence we used in the word embedding sub-layer, we can see that black and brown may be similar, but they are far apart:
The black cat sat on the couch and the brown dog slept on the rug.
The word black is in position 2, $pos = 2$, and the word brown is in position 10, $pos = 10$.
Our problem is to find a way to add a value to the word embedding of each word so that it carries that information. However, we need to add a value to each of the $d_{\text{model}} = 512$ dimensions! For each word embedding vector, we need to find a way to provide information for every index $i$ in the range $[0, 512)$ of the word embedding vectors of black and brown.

There are many ways to achieve this goal. The designers found a clever way to use a unit sphere to represent positional encoding with sine and cosine values, which therefore remain small (between $-1$ and $1$) but are still very useful.
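The sine and cosine scheme from the original Transformer paper assigns, for a position $pos$ and an even dimension index $2i$, the values $PE(pos, 2i) = \sin\left(pos / 10000^{2i/d_{\text{model}}}\right)$ and $PE(pos, 2i+1) = \cos\left(pos / 10000^{2i/d_{\text{model}}}\right)$. The sketch below computes these values for $pos = 2$ and $pos = 10$ and adds them to an embedding element by element. The function names are hypothetical, and the all-zero vector stands in for a real word embedding.

```python
import math

D_MODEL = 512  # assumed even, as in the original Transformer

def positional_encoding(pos, d_model=D_MODEL):
    """Sinusoidal positional encoding vector for one position."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):                # i is the even dimension index
        angle = pos / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)                   # even dimensions use sine
        pe[i + 1] = math.cos(angle)               # odd dimensions use cosine
    return pe

def add_position(embedding, pos):
    """Add the positional encoding to a word embedding of the same size."""
    pe = positional_encoding(pos, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]

pe_black = positional_encoding(2)    # black is at pos = 2
pe_brown = positional_encoding(10)   # brown is at pos = 10

placeholder_embedding = [0.0] * D_MODEL  # stand-in for a trained embedding
encoded_black = add_position(placeholder_embedding, 2)

print(len(encoded_black))
print(min(pe_brown), max(pe_brown))  # every value stays within [-1, 1]
```

Because the output keeps the same size as the embedding, no extra vectors are needed to carry the position, which matches the constraint described above.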

