Research Paper

A Neural Probabilistic Language Model

One of the first papers to introduce word embeddings (learned distributed representations) as a concept, and neural networks as a general solution to language modelling.

The content below consists of raw annotations with comments. It is suggested to view the paper PDF first.

December 23, 2024

A Neural Probabilistic Language Model

In bigram modelling (as in Karpathy's first makemore video), or in any traditional count-based approach, the number of possible cases grows exponentially with the context length, which makes the probabilistic modelling more and more expensive.

Example: the bigram model had a 27x27 array because it only considered 2 chars at a time. If we were to consider 3 chars to make the model better, the array would grow to 27x27x27.
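To make the growth concrete, here is a small sketch that counts the cells in a count table for different context lengths and compares them with the size of an embedding table. The function names are my own, and the embedding size of 30 is just the figure used in these notes.

```python
def count_table_cells(vocab_size, chars_per_case):
    """Cells in a dense count table that looks at `chars_per_case` chars at a time."""
    return vocab_size ** chars_per_case


def embedding_table_cells(vocab_size, embedding_dim):
    """Cells in a lookup table holding one vector per vocabulary item."""
    return vocab_size * embedding_dim


for k in (2, 3, 4, 5):
    print(k, count_table_cells(27, k))
# 2 729
# 3 19683
# 4 531441
# 5 14348907

print(embedding_table_cells(27, 30))  # 810
```

The count table is multiplied by 27 for every extra character of context, while the embedding table does not depend on the context length at all (the network on top of it does grow, but only linearly).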

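This paper solves that by using vector embeddings: each word (each character, in the makemore setting) is represented by a learned feature vector of small fixed size, e.g. 30, so a context is described by a few such vectors instead of one table cell per possible combination.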

I am yet to understand this fully (underlined in the abstract), but this model trains the network weights AND the vector embeddings. The loss is reduced jointly with respect to both.
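A minimal sketch of what "training both" can mean, under heavy simplifying assumptions: a context of one previous token, no hidden layer, and plain SGD. This is not the paper's architecture, only an illustration that the gradient of the same loss flows into the embedding table (called C here) and into the output weights (called U here). The toy data and all function names are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(C, U, prev, target, lr=0.1):
    """One SGD step on -log P(target | prev). Updates C and U in place."""
    x = C[prev][:]  # embedding lookup (copied so gradients use the old value)
    logits = [sum(u * xj for u, xj in zip(row, x)) for row in U]
    p = softmax(logits)
    loss = -math.log(p[target])
    # Gradient of the loss w.r.t. the logits: p - one_hot(target)
    dlogits = [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]
    # Gradient w.r.t. the embedding, computed before U is modified
    dx = [sum(dlogits[i] * U[i][j] for i in range(len(U))) for j in range(len(x))]
    # Update the output weights ...
    for i, row in enumerate(U):
        for j in range(len(row)):
            row[j] -= lr * dlogits[i] * x[j]
    # ... and the embedding of the context token, from the same loss
    for j in range(len(x)):
        C[prev][j] -= lr * dx[j]
    return loss


def demo(epochs=300, seed=0):
    rng = random.Random(seed)
    V, m = 3, 2  # toy vocabulary size and embedding size
    C = [[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(V)]
    U = [[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(V)]
    data = [(0, 1), (1, 2), (2, 0)]  # (previous token, next token) pairs
    losses = []
    for _ in range(epochs):
        epoch_loss = sum(train_step(C, U, prev, tgt) for prev, tgt in data)
        losses.append(epoch_loss / len(data))
    return losses
```

Calling demo() returns the average loss per epoch. The point is only that a single loss value drives the updates of both C and U inside train_step.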

Curse Of Dimensionality

“curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training.”

The general meaning is different:
https://en.wikipedia.org/wiki/Curse_of_dimensionality

Contextual meaning:
A very simple concept. In bigram modelling (as in Karpathy's first makemore video), or in any traditional count-based approach, the number of possible cases increases exponentially with the context length. In that huge space the actual data points from training are sparsely populated, so most cases are never observed. This makes the probability estimates unreliable and the table more and more expensive to store.

“The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations.”

“When modeling continuous variables, we obtain generalization more easily (e.g. with smooth classes of functions like multi-layer neural networks or Gaussian mixture models) because the function to be learned can be expected to have some local smoothness properties.”

“For discrete spaces, the generalization structure is not as obvious: any change of these discrete variables may have a drastic impact on the value of the function to be estimated, and when the number of values that each discrete variable can take is large, most observed objects are almost maximally far from each other in hamming distance.”
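A quick sketch of the Hamming distance point: treated as sequences of discrete symbols, two sentences with similar meaning can still differ in almost every position. The two sentences below are chosen for illustration and the function name is my own.

```python
def hamming_distance(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


s1 = "the cat is walking in the bedroom".split()
s2 = "a dog was running in a room".split()
print(hamming_distance(s1, s2), "of", len(s1))  # 6 of 7
```

Only the word "in" matches, so a model that sees words as unrelated symbols finds these two sentences almost maximally far apart, even though they are close in meaning.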

“Typically researchers have used n = 3, i.e. trigrams, and obtained state-of-the-art results, but see Goodman (2001) for how combining many tricks can yield to substantial improvements.”

“by adding a weight decay penalty.”

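This is plain L2 regularization, the same penalty used in ridge regression.
We add a term to the loss function which keeps the magnitude of the weights in check.
It simply discourages large weight values.
Ridge regression form (squared-error data term plus the penalty):
loss = (1/2) * sum((y_predicted - y_true)^2) + (lambda/2) * sum(w^2)
In the paper the data term is the log-likelihood rather than the squared error, but the penalty has the same form.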
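The formula above can be sketched directly in code. This is the squared-error (ridge) version from these notes, not the paper's log-likelihood objective, and the function names are my own.

```python
def l2_penalised_loss(y_pred, y_true, weights, lam):
    """Squared-error loss plus an L2 (weight decay) penalty, as in the formula above."""
    data_term = 0.5 * sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    penalty = 0.5 * lam * sum(w ** 2 for w in weights)
    return data_term + penalty


def l2_penalty_gradient(weights, lam):
    """Gradient of the penalty alone: d/dw of (lam/2) * w^2 is lam * w."""
    return [lam * w for w in weights]
```

The gradient of the penalty is proportional to the weight itself, so every update shrinks each weight a little towards zero. That shrinking is where the name "weight decay" comes from.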

“because the probability function is a smooth function of these feature values, a small change in the features will induce a small change in the probability.”

“perplexity”

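Perplexity is a measure of uncertainty in the value of a sample from a discrete probability distribution. For a language model it is the exponential of the average negative log-likelihood per word, i.e. the geometric average of 1/P(word | context) over the test text. Lower is better.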
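A minimal sketch of that computation, assuming we already have the probability the model assigned to each actual next token in a test text (the function name and the inputs are hypothetical):

```python
import math


def perplexity(probs):
    """exp of the average negative log-probability assigned to the true tokens."""
    avg_nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(avg_nll)


# A model that is uniformly unsure over 27 characters has perplexity of about 27.
print(perplexity([1 / 27] * 10))
```

One way to read the number: a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k options at every step.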

Follow only the solid green arrows first. The dotted lines are tangential concepts and are explained later.

“In the above model, the number of free parameters only scales linearly with V, the number of words in the vocabulary. It also only scales linearly with the order n: the scaling factor could be reduced to sub-linear if more sharing structure were introduced, e.g. using a time-delay neural network or a recurrent neural network (or a combination of both).”
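The linear scaling can be checked by counting parameters piece by piece. The sketch below follows my reading of the architecture: an embedding table C, a tanh hidden layer with weights H and bias d, an output layer with weights U and bias b, and optional direct connections W from the embeddings to the output. The example sizes are made up.

```python
def free_parameters(V, n, m, h, direct=True):
    """Count parameters for vocabulary size V, order n, embedding size m, h hidden units."""
    context = (n - 1) * m        # concatenated embeddings of the n-1 context words
    C = V * m                    # embedding table
    H = h * context              # hidden layer weights
    d = h                        # hidden layer bias
    U = V * h                    # hidden-to-output weights
    b = V                        # output bias
    W = V * context if direct else 0  # optional direct embedding-to-output weights
    return C + H + d + U + b + W


# Hypothetical sizes, for illustration only
print(free_parameters(V=10_000, n=5, m=30, h=50))  # 2016050
print(10_000 ** 5)  # cells in a dense 5-gram count table, for comparison
```

Every term is at most linear in V and at most linear in n, which adds up to V(1 + nm + h) + h(1 + (n-1)m) with the direct connections included. A count table over the same contexts has V^n cells instead.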