## Hierarchical Softmax

Mikolov et al. also present hierarchical softmax as a much more efficient alternative to the normal softmax. In practice, hierarchical softmax tends to work better for infrequent words, while negative sampling works better for frequent words and lower-dimensional vectors.
Hierarchical softmax uses a binary tree to represent all words in the vocabulary. Each leaf of the tree is a word, and there is a unique path from the root to each leaf. In this model, there is no output representation for words. Instead, each inner node of the tree (every node that is not a leaf) is associated with a vector that the model learns.
In this model, the probability of a word $w$ given a vector $w_i$, $P\left(w \mid w_i\right)$, is equal to the probability of a random walk starting at the root and ending at the leaf node corresponding to $w$. The main advantage of computing the probability this way is that the cost is only $O(\log (|V|))$ for a balanced tree, corresponding to the length of the path, rather than the $O(|V|)$ needed to normalize a full softmax.

Let’s introduce some notation. Let $L(w)$ be the number of nodes in the path from the root to the leaf $w$, not counting the leaf itself. For instance, $L\left(w_2\right)$ in Figure 4 is 3. Let’s write $n(w, i)$ for the $i$-th node on this path, with associated vector $v_{n(w, i)}$. So $n(w, 1)$ is the root, while $n(w, L(w))$ is the parent of $w$; by convention, $n(w, L(w)+1)$ is the leaf $w$ itself. Now for each inner node $n$, we arbitrarily choose one of its children and call it $\operatorname{ch}(n)$ (e.g. always the left node). Then, with one factor per branching decision on the path, we can compute the probability as
$$P\left(w \mid w_i\right)=\prod_{j=1}^{L(w)} \sigma\left([n(w, j+1)=\operatorname{ch}(n(w, j))] \cdot v_{n(w, j)}^T v_{w_i}\right)$$
where
$$[x]=\left\{\begin{array}{ll} 1 & \text{if } x \text{ is true} \\ -1 & \text{otherwise} \end{array}\right.$$
and $\sigma(\cdot)$ is the sigmoid function.
This formula is fairly dense, so let’s examine it more closely.
First, we are computing a product of terms based on the shape of the path from the root $(n(w, 1))$ to the leaf $(w)$. If we assume $\operatorname{ch}(n)$ is always the left node of $n$, then the term $[n(w, j+1)=\operatorname{ch}(n(w, j))]$ equals $1$ when the path goes left and $-1$ when it goes right.

Furthermore, the term $[n(w, j+1)=\operatorname{ch}(n(w, j))]$ provides normalization. At a node $n$, if we sum the probabilities of going to the left and to the right child, we can check that for any value of $v_n^T v_{w_i}$,
$$\sigma\left(v_n^T v_{w_i}\right)+\sigma\left(-v_n^T v_{w_i}\right)=1$$
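As a quick check of this identity, write $x = v_n^T v_{w_i}$ and use the standard definition $\sigma(x)=1 /\left(1+e^{-x}\right)$ (the shorthand $x$ is introduced here only for the derivation):
$$\sigma(x)+\sigma(-x)=\frac{1}{1+e^{-x}}+\frac{1}{1+e^{x}}=\frac{e^{x}}{e^{x}+1}+\frac{1}{1+e^{x}}=\frac{e^{x}+1}{e^{x}+1}=1$$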
The normalization also ensures that $\sum_{w=1}^{|V|} P\left(w \mid w_i\right)=1$, just as in the original softmax.
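The path product and its normalization can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not part of the original notes: a toy vocabulary of four words, a balanced tree with inner nodes named `root`, `a` and `b`, random 5-dimensional vectors, and the helper names `PATHS` and `hs_probability`. Each word's path is stored as a list of (inner node, went left) pairs, one per branching decision, with $\operatorname{ch}(n)$ taken to be the left child.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical balanced tree over 4 words (an assumption for this sketch).
# Each path lists (inner node, went_left) for every branching decision
# from the root down to the word's leaf.
PATHS = {
    "w1": [("root", True), ("a", True)],
    "w2": [("root", True), ("a", False)],
    "w3": [("root", False), ("b", True)],
    "w4": [("root", False), ("b", False)],
}


def hs_probability(word, v_in, node_vecs, paths=PATHS):
    """P(word | w_i) as a product of one sigmoid per branching decision."""
    p = 1.0
    for node, went_left in paths[word]:
        # The bracket term: +1 if the path goes to ch(n) (left), -1 otherwise.
        sign = 1.0 if went_left else -1.0
        p *= sigmoid(sign * dot(node_vecs[node], v_in))
    return p


rng = random.Random(0)
dim = 5
# Random vectors stand in for learned parameters.
node_vecs = {n: [rng.gauss(0, 1) for _ in range(dim)] for n in ("root", "a", "b")}
v_in = [rng.gauss(0, 1) for _ in range(dim)]

probs = {w: hs_probability(w, v_in, node_vecs) for w in PATHS}
total = sum(probs.values())
print(probs)
print(total)  # equals 1 up to floating-point rounding
```

Each probability costs one sigmoid per node on the path (two here), rather than a sum over the whole vocabulary, and the leaf probabilities sum to one because the two branches at every inner node do.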

## Natural Language Processing with Deep

Keyphrases: Global Vectors for Word Representation (GloVe). Intrinsic and extrinsic evaluations. Effect of hyperparameters on analogy evaluation tasks. Correlation of human judgment with word vector distances. Dealing with ambiguity in words using contexts. Window classification.
This set of notes first introduces the GloVe model for training word vectors. Then it extends our discussion of word vectors (interchangeably called word embeddings) by seeing how they can be evaluated intrinsically and extrinsically. As we proceed, we discuss the example of word analogies as an intrinsic evaluation technique and how it can be used to tune word embedding techniques. We then discuss training model weights/parameters and word vectors for extrinsic tasks. Lastly we motivate artificial neural networks as a class of models for natural language processing tasks.

So far, we have looked at two main classes of methods to find word embeddings. The first set are count-based and rely on matrix factorization (e.g. LSA, HAL). While these methods effectively leverage global statistical information, they are primarily used to capture word similarities and do poorly on tasks such as word analogy, indicating a sub-optimal vector space structure. The other set of methods are shallow window-based (e.g. the skip-gram and the CBOW models), which learn word embeddings by making predictions in local context windows. These models demonstrate the capacity to capture complex linguistic patterns beyond word similarity, but fail to make use of the global co-occurrence statistics.

In comparison, GloVe consists of a weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of statistics. The model produces a word vector space with meaningful sub-structure. It shows state-of-the-art performance on the word analogy task, and outperforms other current methods on several word similarity tasks.
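To make "global word-word co-occurrence counts" concrete, here is a minimal sketch of how such counts might be gathered from a corpus. The toy corpus, the window size, the use of plain unweighted counts, and the helper name `cooccurrence_counts` are all assumptions for illustration; this is not the GloVe reference implementation.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context) pair occurs within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[(word, tokens[j])] += 1
    return counts


# Toy corpus (hypothetical), already tokenized.
corpus = [
    "i like deep learning".split(),
    "i like nlp".split(),
    "i enjoy flying".split(),
]

X = cooccurrence_counts(corpus, window=1)
print(X[("i", "like")])   # pair seen in two sentences
print(X[("i", "deep")])   # never adjacent, so the count is 0
```

Unlike a window-based model that sees one context at a time, the counts are accumulated over the whole corpus once, and a model such as GloVe is then fit to these aggregate statistics.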

