Language Modeling in the Limit

Many people seem surprised that Large Language Models (LLMs), which simply predict the next word, act intelligently and can solve many challenging tasks. In my opinion, the success of LLMs is not surprising. In this blog post, I will explain why I think their success was obvious.

So, around two years ago, during a discussion with some colleagues, I coined the term “GPT-\(\infty\)” to represent language modeling in the limit of infinite power and infinite data.

I think that this “in the limit” thinking is helpful for understanding the success of modern Large Language Models. So first, let us discuss what language modeling is and what it would mean to take language modeling to the extreme “in the limit.”

What is Language Modeling?

So, what is generative language modeling? A language model is a probabilistic model over a sequence of words, usually written as follows:

\(P(w_i | w_{i-1}, w_{i-2}, \ldots, w_0) \propto \exp(f(w_i, w_{i-1}, w_{i-2}, \ldots, w_0))\)

Here, \(f(\cdots)\) is a function that returns a real number and is typically learned from data. The “\(\propto \exp(\cdots)\)” in this equation converts the real number returned from \(f(\cdots)\) into a probability distribution over possible next tokens \(w_i\). A long sequence of words can be generated by repeatedly evaluating \(P(w_i | w_{i-1}, w_{i-2}, \ldots, w_0)\) to determine the most likely next word in the sequence, conditioned on the previously generated words.

For example, suppose that we have the sequence “The moon is made of,” and we want to determine the next word. We will evaluate \(P(\text{rocks} | \text{The moon is made of})\), as well as \(P(\text{cheese} | \text{The moon is made of})\). Both the words “rocks” and “cheese” are given some probability of being the next word. In the case of generation, we choose the word that has a high probability, either greedily or according to some randomized process.

\(P(\text{rocks} | \text{The moon is made of}) > P(\text{cheese} | \text{The moon is made of})\)
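As a minimal sketch of this step, the snippet below turns scores into a distribution and then picks a word both greedily and by sampling. The scores are made-up numbers standing in for \(f(\cdots)\), and the three-word vocabulary is an assumption for illustration only.

```python
import math
import random

def softmax(scores):
    """Turn real-valued scores f(w_i, context) into a probability distribution."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

# Hypothetical scores for candidate next words after "The moon is made of".
scores = {"rocks": 2.0, "cheese": 1.0, "dust": 0.5}
probs = softmax(scores)

# Greedy decoding: always take the most probable word.
greedy_word = max(probs, key=probs.get)

# Randomized decoding: draw a word in proportion to its probability.
rng = random.Random(0)
sampled_word = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
```

Greedy decoding always returns the same word for the same context, while sampling can return any word that has non-zero probability.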

When we want to evaluate the probability of a longer sequence of words, we can multiply the \(P(w_i | \cdots)\) terms together, where each \(P(\cdots)\) is used to evaluate a single word. For example,

\(P(\text{the} | \emptyset) * P(\text{moon} | \text{the}) * P(\text{is} | \text{the moon}) * P(\text{made} | \text{the moon is}) *\)
\(P(\text{of} | \text{the moon is made}) * P(\text{rocks} | \text{the moon is made of})\)

models the probability of the phrase “the moon is made of rocks.”
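The same product can be sketched in a few lines of code. The conditional probabilities in the table are invented for illustration, and summing logarithms instead of multiplying raw probabilities is a common practical choice to avoid underflow on long sequences, not something the math above requires.

```python
import math

# Hypothetical values of P(word | everything before it); the keys are
# (context as a tuple of words, next word).
cond_prob = {
    ((), "the"): 0.2,
    (("the",), "moon"): 0.01,
    (("the", "moon"), "is"): 0.3,
    (("the", "moon", "is"), "made"): 0.05,
    (("the", "moon", "is", "made"), "of"): 0.9,
    (("the", "moon", "is", "made", "of"), "rocks"): 0.4,
}

def sequence_log_prob(words, table):
    """Sum log P(w_i | w_0 .. w_{i-1}) over every position in the sequence."""
    total = 0.0
    for i, word in enumerate(words):
        context = tuple(words[:i])
        total += math.log(table[(context, word)])
    return total

phrase = "the moon is made of rocks".split()
log_p = sequence_log_prob(phrase, cond_prob)
p = math.exp(log_p)  # probability of the whole phrase
```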

Backing Off A Language Model

When building language models, researchers have to make many decisions to develop a model that is tractable. For example, there are several ways that the function \(f(\cdots)\) can be defined. These days, \(f(\cdots)\) is usually a large neural network (a transformer), which is trained using gradient descent. However, neural nets are not the only way to define a language model. For example, early n-gram language models defined \(f(\cdots)\) as the ratio between counts of phrases that appeared in a corpus. A bigram model, for instance, only conditions on the previous word and is defined as follows:

\(f(w_i, w_{i-1}) = \frac{\text{count}(w_i, w_{i-1} )}{\text{count}(w_{i-1} ) + \epsilon}\)

We can observe that the bigram model is very backed off, as it only depends on the previous word. Hence, the probability distribution represented by \(P(w_i | w_{i-1}, w_{i-2}, \ldots, w_0)\) is the same as \(P(w_i | w_{i-1})\) under the bigram model. As a result, when people write the \(P(\cdot)\) equations using backed-off language models in a paper, they will usually remove the \(w_{i-2}, \ldots, w_0\) from the equation they are writing. For example:

\(P(\text{the} | \emptyset) * P(\text{moon} | \text{the}) * P(\text{is} | \text{moon}) * P(\text{made} | \text{is}) * P(\text{of} | \text{made}) * P(\text{rocks} | \text{of})\)

Writing the \(P(w_i | \cdots)\) this way indicates which parts the \(f(\cdots)\) function is able to use. Engineers will claim that this approximation is “good enough” for whichever task they are attempting to solve. Note that just because we choose to ignore \(w_{i-2}, w_{i-3}, \ldots, w_{0}\) does not mean that they do not exist.

\(P(w_i | w_{i-1}, w_{i-2}, \ldots, w_0) \approx P(w_i | w_{i-1}) \propto \exp(f(w_i, w_{i-1}))\)
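To make the count-based definition concrete, here is a small sketch of the bigram ratio computed from a toy corpus. The three sentences and the value of \(\epsilon\) are assumptions for illustration. Classic n-gram models typically use this ratio directly as the probability estimate, so the code stops at the ratio.

```python
from collections import Counter

# A toy corpus, invented for illustration.
corpus = [
    "the moon is made of rocks",
    "the moon is bright",
    "the cheese is made of milk",
]

unigram = Counter()
bigram = Counter()
for sentence in corpus:
    words = sentence.split()
    unigram.update(words)
    bigram.update(zip(words, words[1:]))  # pairs of (previous word, next word)

EPSILON = 1e-9  # keeps the denominator non-zero for unseen previous words

def f(w_i, w_prev):
    """count(w_prev followed by w_i) / (count(w_prev) + epsilon)."""
    return bigram[(w_prev, w_i)] / (unigram[w_prev] + EPSILON)

score_moon = f("moon", "the")      # "the moon" appears 2 times, "the" 3 times
score_cheese = f("cheese", "the")  # "the cheese" appears once
score_unseen = f("rocks", "the")   # "the rocks" never appears
```

Notice that `f` has no way to look further back than `w_prev`, which is exactly what it means for the model to be backed off.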

Language Modeling In The Limit

So, now that we have a basic understanding of language modeling and of backing off a language model, what does it mean to do language modeling in the limit? Let us imagine a theoretical language model that does not back off from anything. In other words, it conditions on absolutely everything.

\(P(w_i | \text{EVERYTHING BEFORE } w_i) \propto \exp(f(w_i, \text{EVERYTHING BEFORE } w_i))\)

Furthermore, in the limit, we will say that this language model has been trained with everything. This includes all data that existed in the past and all data that will exist in the future (hence everything). Because the infinite language model has access to all data, this means it always accurately predicts the next word.

For example, the infinite language model conditions on who is speaking, which changes the probability of the next word. In the case of the moon rocks example, if we are modeling a scientist vs a 5-year-old, then there are likely to be different answers.

\begin{align*} P(\text{rocks} | \ldots\text{made of}, speaker=\text{5-year-old}) &< P(\text{cheese} | \ldots\text{made of}, speaker=\text{5-year-old}) \\ P(\text{rocks} | \ldots\text{made of}, speaker=\text{scientist}) &> P(\text{cheese} | \ldots\text{made of}, speaker=\text{scientist}) \end{align*}
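A small sketch of why extra conditioning helps: the counts below are invented observations of the next word after “The moon is made of,” tagged by speaker. Ignoring the speaker gives a nearly even split, while conditioning on the speaker gives a more peaked (lower entropy) distribution for each speaker.

```python
import math
from collections import Counter, defaultdict

# Hypothetical (speaker, next word) observations, invented for illustration.
observations = (
    [("scientist", "rocks")] * 9 + [("scientist", "cheese")] * 1
    + [("5-year-old", "cheese")] * 8 + [("5-year-old", "rocks")] * 2
)

def normalize(counts):
    """Convert raw counts into a probability distribution."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def entropy(dist):
    """Shannon entropy in bits; lower means a more peaked distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Backed-off model: ignores who is speaking.
marginal = normalize(Counter(word for _, word in observations))

# Model that also conditions on the speaker.
by_speaker = defaultdict(Counter)
for speaker, word in observations:
    by_speaker[speaker][word] += 1
conditional = {s: normalize(c) for s, c in by_speaker.items()}
```

Here `marginal` is roughly an even split between the two words, while each entry of `conditional` leans strongly toward one word. In the limit of conditioning on everything, the distribution collapses onto a single word.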

As a more extreme example, suppose I prompt the infinite language model with “For breakfast, I ate,” and the language model completes the word “eggs.”

\begin{align*} “\text{eggs}” = \underset{w_i}{\text{argmax }} P\left( w_i \left| \begin{array}{c} \text{For breakfast I ate}, \\ speaker=\text{Matthew}, \\ day=\text{February 20th}, \\ \vdots \end{array} \right. \right) \end{align*}

Here, the language model knows that I (Matthew) am the person speaking. It also knows what the day is. It even has information about what I actually ate, and it knows that I will answer this statement truthfully. Note that what I actually ate is not recorded anywhere. I did not write it down on a piece of paper or tweet about it online. In other words, the infinite language model has access to all data, not just the data that is online.

It might be better to call the infinite language model an omnipotent model in that it has access to everything and even knows the next word with 100% accuracy. Hence, it is not really appropriate to think of the “infinite language model” as a probabilistic or learned model.

Rather, the thing that we are interested in is that our “omnipotent and infinite” model is a function \(f_\infty(\cdots)\) that takes the entire world prior to \(w_i\) as an argument and returns a real-valued number that selects the next word.

Using the “In The Limit” Language Model

So how do we make use of the \(f_\infty(\cdots)\) function?

Neural networks are universal function approximators. This means that for any continuous function \(\hat{f}:\mathbb{R}^n \to \mathbb{R}\) that takes a real-valued vector (\(x \in \mathbb{R}^n\)) from a bounded region as an argument and returns a real value (\(y \in \mathbb{R}\)) as the result, there exists a neural network \(f_{\text{neural}}:\mathbb{R}^n \to \mathbb{R}\) that can approximate the function \(\hat{f}\) to a desired degree of accuracy.

To train the neural network \(f_{\text{neural}}:\mathbb{R}^n \to \mathbb{R}\), one simply needs to collect many input-output samples \(\langle x, y \rangle\) from the function \(\hat{f}\), and then use those samples to train the \(f_{\text{neural}}\) function. This is exactly what we already do when training a neural network!
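As a rough sketch of this recipe, the snippet below collects samples \(\langle x, y \rangle\) from a stand-in function and fits a tiny one-hidden-layer network to them with plain gradient descent. The target function (a sine), the network size, the learning rate, and the number of steps are all arbitrary choices for illustration.

```python
import math
import random

def target(x):
    # Stand-in for the unknown function f-hat; chosen only for illustration.
    return math.sin(x)

rng = random.Random(0)
samples = [(x, target(x)) for x in (rng.uniform(-2.0, 2.0) for _ in range(40))]

H = 8  # number of hidden units (arbitrary)
w = [rng.uniform(-0.5, 0.5) for _ in range(H)]  # input -> hidden weights
b = [rng.uniform(-0.5, 0.5) for _ in range(H)]  # hidden biases
v = [rng.uniform(-0.5, 0.5) for _ in range(H)]  # hidden -> output weights
c = 0.0                                         # output bias

def predict(x):
    return sum(v[j] * math.tanh(w[j] * x + b[j]) for j in range(H)) + c

def mse():
    return sum((predict(x) - y) ** 2 for x, y in samples) / len(samples)

initial_mse = mse()
lr = 0.05
n = len(samples)
for _ in range(500):
    gw, gb, gv, gc = [0.0] * H, [0.0] * H, [0.0] * H, 0.0
    for x, y in samples:
        h = [math.tanh(w[j] * x + b[j]) for j in range(H)]
        err = sum(v[j] * h[j] for j in range(H)) + c - y
        # Gradients of the half squared error, accumulated over the batch.
        gc += err
        for j in range(H):
            gv[j] += err * h[j]
            dh = err * v[j] * (1.0 - h[j] ** 2)
            gw[j] += dh * x
            gb[j] += dh
    for j in range(H):
        w[j] -= lr * gw[j] / n
        b[j] -= lr * gb[j] / n
        v[j] -= lr * gv[j] / n
    c -= lr * gc / n
final_mse = mse()
```

The network never sees the definition of `target`. It only sees samples, and the error on those samples shrinks as training proceeds.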

In other words, if we collect a lot of samples from the “omnipotent and infinite” language model \(f_\infty\) and use them to train \(f_{\text{neural}}\), then we can approximate this function. Thankfully this is easy! Every piece of text that exists is a sample from the \(f_\infty\) model.

For example, suppose that we prompt the \(f_\infty\) model with “reddit post written by bob123 on July 16, 2006”. The \(f_\infty\) model will know exactly what bob123 posted, and it will generate “Religion and politics are always interesting.” This sentence can then be used as training data for the \(f_{\text{neural}}\) model.

\begin{align*} P_\infty\left(\text{“Religion and politics are always interesting.”} \left| \begin{array}{c} speaker=\text{bob123}, \\ source=\text{reddit}, \\ date=\text{July 16, 2006}, \\ \vdots \end{array} \right. \right) = 1 \end{align*}

Hence, gathering data to train \(f_{\text{neural}}\) can be done by simply collecting all written text and training a neural language model in the usual way.

Furthermore, we can create better approximations to \(f_\infty\) in two ways: by conditioning on more of the input to \(f_\infty\), which means creating larger prompts for Large Language Models, and by creating larger neural networks! In other words, bigger equals better.

Is the “In The Limit” Model Intelligent?

Large language models seem to act intelligently. So a natural question is: is the \(f_\infty\) function that neural language models approximate also intelligent? Admittedly, this question is a bit ill-formed. The \(f_\infty\) model is “omnipotent and infinite.” It does not need to be intelligent. It already knows everything.

For example, suppose that I have a time machine and go back in time to December 6, 1941, and predict that on December 7, 1941, Japanese planes will attack Pearl Harbor in Hawaii. From the perspective of the people living in 1941, I would appear to be a very intelligent person as I have apparently analyzed a bunch of data and accurately predicted the future. However, knowing that I was a person living in 2024 with knowledge of history, the only thing that I have done is recall a fact that I read in a history textbook.

What is Intelligence?

So if \(f_\infty\) is not intelligent, is training \(f_{\text{neural}}\) to approximate \(f_\infty\) going to enable us to build intelligent agents?

First, let us try to define what it means for a system to be intelligent. A plausible definition of intelligence could be having the ability to predict the future and then using those predictions to exhibit behavior that is favorable to the agent. I am using the term “favorable behavior” loosely here in that the behavior does not have to entirely come from a “conscious” decision by the agent.

For example, suppose that the agent is a hunter and is trying to catch an animal that it wants to eat. If the agent can accurately predict where the animal is going to be, then the agent is going to have a better chance of catching the animal. The favorable behavior in this case is getting resources (food) that are needed to survive. This behavior is driven by some innate instinct to get food and survive.

A more modern version of predicting the future might instead look like trying to predict the price of a stock. For example, if an agent can accurately predict the price of Nvidia stock, then it can place trades (bets against other traders) that will be profitable. In the case of the \(f_\infty\) model, the model is omnipotent, so it knows the answer. Hence, it would win every bet. However, an agent that is not omnipotent and is limited by the constraints of time will have to read through corporate reports and compare trends with historical data to make an “educated guess” about the future stock price. This is conceptually what a human financial analyst does. Here, we say that the agent is acting intelligently because the agent does not have direct access to the future (like \(f_\infty\)) and instead must use its existing sources of knowledge to make an “educated guess.” Hence, the agent must represent its guesses using a probability distribution over potential outcomes:

\(P_{\text{neural}}(\text{nvidia stock closed at \$900 at the end of 2024} | \cdots) = .001 \)
\(\vdots \)
\(P_{\text{neural}}(\text{nvidia stock closed at \$800 at the end of 2024} | \cdots) = .001 \)
\(\vdots \)
\(P_{\text{neural}}(\text{nvidia stock closed at \$600 at the end of 2024} | \cdots) = .001\)

This is in contrast to \(P_\infty(\cdots)\) that knows the correct answer and creates a “distribution” that is entirely peaked at the correct answer:

\(P_{\infty}(\text{nvidia stock closed at \$900 at the end of 2024} | \cdots) = 0\)
\(\vdots \)
\(P_{\infty}(\text{nvidia stock closed at \$875.23 at the end of 2024} | \cdots) = 1\)
\(P_{\infty}(\text{nvidia stock closed at \$700 at the end of 2024} | \cdots) = 0\)
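One common way to measure how far a spread-out \(P_{\text{neural}}\) is from the peaked \(P_\infty\) is cross-entropy, which is also the usual training loss for neural language models. The sketch below uses a made-up distribution over a handful of closing prices. Because \(P_\infty\) puts all of its mass on one outcome, the cross-entropy reduces to the negative log probability that \(P_{\text{neural}}\) assigned to what actually happened.

```python
import math

# Hypothetical P_neural over a few possible closing prices (invented numbers).
p_neural = {600.0: 0.10, 700.0: 0.20, 800.0: 0.30, 875.23: 0.25, 900.0: 0.15}

# P_infinity is entirely peaked at the outcome that actually occurs.
p_inf = {price: 0.0 for price in p_neural}
p_inf[875.23] = 1.0

def cross_entropy(p_true, p_model):
    """-sum_x p_true(x) * log p_model(x), skipping outcomes with zero true mass."""
    return -sum(pt * math.log(p_model[x]) for x, pt in p_true.items() if pt > 0)

loss = cross_entropy(p_inf, p_neural)  # equals -log(0.25)
best = cross_entropy(p_inf, p_inf)     # a perfect match has zero loss
```

The closer \(P_{\text{neural}}\) moves its mass toward the true outcome, the smaller `loss` becomes, and it reaches zero only when the two distributions coincide.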


So, in conclusion, Large Language Models are being trained to approximate the “in the limit” \(f_\infty\) function by being trained on the collection of all text. As we build larger LLMs and use more data, we will train neural nets to better approximate \(f_\infty\), which will make the agents appear more intelligent. Because LLMs, like humans, are unable to know the future, they must deal with ambiguity and instead create a distribution of potential outcomes. Hence, LLMs exhibit seemingly intelligent behavior.

Finally, I note that this “in the limit” argument says nothing about what the optimal architecture is, or whether transformers, or even neural networks, are the best way to approximate \(f_\infty\). Rather, my claim is that there exists a function \(f_\infty\) that, when approximated well, will result in agents that act intelligently.