Language Modeling in the Limit

I think that there are many people who are surprised that Large Language Models (LLMs) that predict the next word act intelligently and are capable of solving many challenging tasks. In my opinion, the success of LLMs is not surprising. In this blog post, I will explain why I think LLM’s success was obvious.

So, around two years ago, during a discussion with some colleges, I coined the term “GPT-\(\infty\)” to represent language modeling in the limit of infinite power and infinite data.

And I think that this “in the limit” thinking is helpful in understanding the success of modern Large Language Models. So first, let us discuss what language modeling is and what it would mean to take language modeling to the extreme “in the limit.”

What is Language Modeling?

So, what is generative language modeling? Language modeling is a probabilistic model over a sequence of words, usually written as follows:

\(P(w_i | w_{i-1}, w_{i-2}, \ldots, w_0) \propto \exp(f(w_i, w_{i-1}, w_{i-2}, \ldots, w_0))\)

Where \(f(\cdots)\) is a function that returns a real number and is typically learned from data. The “\(\propto \exp(\cdots)\)” in this equation converts the real number returned from \(f(\cdots)\) into a probability distribution over possible next tokens \(w_i\). A long sequence of words can be generated by repetitively evaluating \(P(w_i | w_{i-1}, w_{i-2}, \ldots, w_0)\) to determine what is the most likely next word in the sequence conditioned on the previously generated words.

For example, suppose that we have the sequence “The moon is made of,” and we want to determine the next word. We will evaluate \(P(\text{rocks} | \text{The moon is made of})\), as well as \(P(\text{cheese} | \text{The moon is made of})\). Both the words “rocks” and “cheese” are given some probability of being the next word. In the case of generation, we choose the word that has a high probability, either greedily or according to some randomized process.

\(P(\text{rocks} | \text{The moon is made of}) > P(\text{cheese} | \text{The moon is made of})\)

In the case that we want to evaluate the probability of a longer sequence of words, we can multiply the \(P(w_i | \cdots)\) together, where each \(P(\cdots)\) is used to evaluate a single word. For example,

\(P(\text{the} | \emptyset) * P(\text{moon} | \text{the}) * P(\text{is} | \text{the moon}) * P(\text{made} | \text{the moon is}) *\)
\(P(\text{of} | \text{the moon is made}) * P(\text{rocks} | \text{the moon is made of})\)

models the probability of the phrase “the moon is made of rocks.”

Backing Off A Language Model

When building language models, researchers have to make many decisions to develop a model that is tractable. For example, there are several ways that the function \(f(\cdots)\) can be defined. These days, \(f(\cdots)\) is usually a large neural network (a transformer), which is trained using gradient descent. However, neural nets are not the only way to define a language model. For example, early n-gram language modes defined \(f(\cdots)\) as the ratio between counts of phases that appeared in a corpus. For example, a bi-gram model only conditions on the previous word and is defined as follows:

\(f(w_i, w_{i-1}) = \frac{\text{count}(w_i, w_{i-1} )}{\text{count}(w_{i-1} ) + \epsilon}\)

We can observe that the bi-gram model is very backed off, as it only depends on the previous word. Hence, the probability distribution represented \(P(w_i | w_{i-1}, w_{i-2}, \ldots, w_0)\) is the same as \(P(w_i | w_{i-1})\) under the bigram model. As a result, when people write the \(P(\cdot)\) equations using backed off language models in a paper, they will usually remove the \(w_{i-2}, \ldots, w_0\) from the equation they are writing. For example:

\(P(\text{the} | \emptyset) * P(\text{moon} | \text{the}) * P(\text{is} | \text{moon}) * P(\text{made} | \text{is}) * P(\text{of} | \text{made}) * P(\text{rocks} | \text{of})\)

Writing the \(P(w_i | \cdots)\) this way indicates which parts the \(f(\cdots)\) function is able to use. Engineers will claim that this approximation is “good enough” for whichever task they are attempting to solve. Note, just because we choose to ignore the \(w_{i-2}, w_{i-3}, \ldots, w_{0}\) does not mean it does not exist.

\(P(w_i | w_{i-1}, w_{i-2}, \ldots, w_0) \approx P(w_i | w_{i-1}) \propto \exp(f(w_i, w_{i-1}))\)

Language Modeling In The Limit

So, now that we have a basic understanding of language modeling, and backing off a language model, what does it mean to language model in the limit? Let us imagine a theoretical language model that does not back off from anything. In other words, it conditions on absolutely everything.

\(P(w_i | \text{EVERYTHING BEFORE } w_i) \propto \exp(f(w_i, \text{EVERYTHING BEFORE } w_i))\)

Furthermore, in the limit, we will say that this language model has been trained with everything. This includes all data that existed in the past and all data that will exist in the future (hence everything). Because the infinite language model has access to all data, this means it always accurately predicts the next word.

For example, the infinite language model conditions on who is speaking which changes the probability of the next word. In the case of the moon rocks example, if we are modeling a scientist vs a 5-year-old, then there are likely to be different answers

\begin{align*} P(\text{rocks} | \ldots\text{made of}, speaker=\text{5-year-old}) &< P(\text{cheese} | \ldots\text{made of}, speaker=\text{5-year-old}) \\ P(\text{rocks} | \ldots\text{made of}, speaker=\text{scientist}) &> P(\text{cheese} | \ldots\text{made of}, speaker=\text{scientist}) \end{align*}

As a more extreme example, suppose I prompt the infinite language model with “For breakfast, I ate,” and the language model completes the word “eggs.”

\begin{align*} “\text{eggs}” = \underset{w_i}{\text{argmax }} P\left( w_i \left| \begin{array}{c} \text{For breakfast I ate}, \\ speaker=\text{matthew}, \\ day=\text{feburary 20th}, \\ \vdots \end{array} \right. \right) \end{align*}

Here, the language model knows that I (Matthew) am the person speaking. It also knows what the day is. It even has information about what I actually ate, and it knows that I will answer this statement truthfully. Note that what I actually ate is not recorded anywhere. I did not write it down on a piece of paper or tweet about it online. In other words, the infinite language model has access to all data, not just the data that is online.

It might be better to call the infinite language model an omnipotent model in that it has access to everything and even knows the next word with 100% accuracy. Hence, it is not really appropriate to think of the “infinite language model” as a probabilistic or learned model.

Rather, the thing that we are interested in is that our “omnipotent and infinite” model is a function \(f_\infty(\cdots)\) that takes the entire world prior to \(w_i\) as an argument and returns a real-valued number that selects the next word.

Using the “In The Limit” Language Model

So how do we make use of the \(f_\infty(\cdots)\) function?

Neural networks are universal function approximators. This means that for any function \(\hat{f}:\mathbb{R}^n \to \mathbb{R}\), that takes a real-valued vector (\(x \in \mathbb{R}^n\)) as an argument and returns a real value (\(y \in \mathbb{R}\)) as the result, there exists a neural network \(f_{\text{neural}}:\mathbb{R}^n \to \mathbb{R}\), that can approximate the function \(\hat{f}\) to a desired degree of accuracy.

To train the neural network \(f_{\text{neural}}:\mathbb{R}^n \to \mathbb{R}\), one simply needs to collect many inputs and outputs samples \(\langle x, y \rangle\) from the function \(\hat{f}\), and then use those samples to train the \(f_{\text{neural}}\) function. This is exactly what we already do when training a neural network!

In other words, if we collect a lot of samples from the “omnipotent and infinite” language model \(f_\infty\) and use that to train \(f_{\text{neural}}\), then we can approximate this function. Thankfully this is easy! All text that exists are samples from the \(f_\infty\) model.

For example, suppose that we prompt the \(f_\infty\) model with “reddit post written by bob123 on July 16, 2006”. The \(f_\infty\) model will exactly know the post by b0b123, and the \(f_\infty\) model will generate “Religion and politics are always interesting.” This sentence can then be used as training data for the \(f_{\text{neural}}\) model.

\begin{align*} P_\infty\left(\text{“Religion and politics are always interesting.”} \left| \begin{array}{c} speaker=\text{bob123}, \\ source=\text{reddit}, \\ date=\text{July 16, 2006}, \\ \vdots \end{array} \right. \right) = 1 \end{align*}

Hence, gathering data to train \(f_{\text{neural}}\) can be done by simply collecting all written text and training a neural language model in the usual way.

Furthermore, we will create better approximations to \(f_\infty\) by conditioning on more of the input to \(f_\infty\), meaning creating larger prompts for Large Language Models, and by making the approximation better by creating larger neural networks! In other words, bigger equals better.

Is the “In The Limit” Model Intelligent?

Large language models seem to act intelligently. So a natural question is the \(f_\infty\) function, that neural language models approximate, also intelligent? Admittedly, this question is a bit ill-formed. The \(f_\infty\) model is “omnipotent and infinite.” It does not need to be intelligent. It already knows everything.

For example, suppose that I have a time machine and go back in time to December 6, 1941, and predict that on December 7, 1941, Japanese planes will attack Pearl Harbor in Hawaii. From the perspective of the people living in 1941, I would appear to be a very intelligent person as I have apparently analyzed a bunch of data and accurately predicted the future. However, knowing that I was a person living in 2024 with knowledge of history, the only thing that I have done is recall a fact that I read in a history textbook.

What is Intelligence?

So if \(f_\infty\) is not intelligent, is training \(f_{\text{neural}}\) to approximate \(f_\infty\) going to enable us to build intelligent agents?

First, let us try to define what it means for a system to be intelligent. A plausible definition of intelligence could be having the ability to predict the future and then using those predictions to exhibit behavior that is favorable to the agent. I am using the term “favorable behavior” loosely here in that the behavior does not have to entirely come from a “conscious” decision by the agent.

For example, suppose that the agent is a hunter and is trying to catch an animal that it wants to eat. If the agent can accurately predict where the animal is going to be, then the agent is going to have a better chance of catching the animal. The favorable behavior in this case is getting resources (food) that are needed to survive. This behavior is driven by some innate instinct to get food and survive.

A more modern version of predicting the future these days might instead look like trying to predict the price of a stock. For example, if an agent can accurately predict the price of Nvidia stock, then it can place trades (bets against other traders) that will be profitable. In the case of the \(f_\infty\) model, the model is omnipotent, so it knows the answer. Hence, it would win every bet. However, if we are not an omnipotent agent and are limited by the constraints of time, then the agent will have to read through corporate reports and compare trends with historical data to make an “educated guess” about the future stock price. This is conceptually what a human financial analyst does. Here, we say that the agent is acting intelligently because the agent does not have direct access to the future (like \(f_\infty\)) and instead must use its existing sources of knowledge to make an “educated guess.” Hence, the agent must represent their guesses using a probability distribution of potential outcomes:

\(P_{\text{neural}}(\text{nvidia stock closed at \$900 at the end of 2024} | \cdots) = .001 \)
\(\vdots \)
\(P_{\text{neural}}(\text{nvidia stock closed at \$800 at the end of 2024} | \cdots) = .001 \)
\(\vdots \)
\(P_{\text{neural}}(\text{nvidia stock closed at \$600 at the end of 2024} | \cdots) = .001\)

This is in contrast to \(P_\infty(\cdots)\) that knows the correct answer and creates a “distribution” that is entirely peaked at the correct answer:

\(P_{\infty}(\text{nvidia stock closed at \$900 at the end of 2024} | \cdots) = 0\)
\(\vdots \)
\(P_{\infty}(\text{nvidia stock closed at \$875.23 at the end of 2024} | \cdots) = 1\)
\(P_{\infty}(\text{nvidia stock closed at \$700 at the end of 2024} | \cdots) = 0\)


So, in conclusion, Large Language Models are being trained to approximate the “in the limit” \(f_\infty\) function by being trained on the collection of all text. As we build larger LLMs and use more data, we will train neural nets to better approximate \(f_\infty\), which will make the agents appear more intelligent. Because LLMs, like humans, are unable to know the future, they must deal with ambiguity and instead create a distribution of potential outcomes. Hence, LLMs exhibit seemingly intelligent behavior.

Finally, I note that this “in the limit” argument says nothing about what is the optimal architecture or that transformers, or even neural networks, are the best way to approximate \(f_\infty\). Rather, my claim is that there exists a function \(f_\infty\) such that, when approximated well, will result in agents that act intelligently.

PhD Done!!!

My PhD on Declarative Programming Via Term Rewriting, which I completed at Johns Hopkins University, is done.

This research project was about developing a declarative, weighted logic programming language for machine learning, artificial intelligence, and natural language processing applications. The programming language we were researching was interesting because it allowed for programs that do interesting probabilistic and symbolic reasoning to be concisely expressed in a few lines of code. It accomplishes this by allowing the programmer to leave out as many details as possible about how the program is executed. This is similar to a database, where the database needs to automatically figure out how to retrieve the data given a high-level declarative query.

To make this work, I created a relational algebra, which is capable of representing entire programs. To make the system flexible enough to find a good execution strategy, I created a term rewriting approach that includes hundreds of rewrite rules to run the program. This is similar to rewriting an expression like “2 + 3” as “5” where both of these expressions are semantically equivalent.

To make the term rewriting relational algebra approach tractable, I additionally had to redesign many of the traditional approaches that programming languages are implemented. For example, I created a new way to think about memoization (dynamic programming) to make it work with our system. Additionally, I created a (JIT) compiler for our term rewrite system because the naive implementation was too slow for real world use.

In the end, this was an interesting research project. However, I think that this work was set a bit too firmly in the realm of symbolic systems for AI (the AI paradigm of yesteryear). Hence, I do not know if this is applicable to modern only big neural AI that is dominating. Eventually, I do think that this work may see some use. The reason is that while pure neural creates really cool demonstrations, it will also fabricate information. This creates an issue when these systems are deployed into applications, and that is a problem for their usability in industry. Hence, having a system that incorporates weighted reasoning (necessary for neural networks), and symbolic reasoning into a single system is a very powerful programming paradigm.

The dissertation document and recording of the defense are available at:

Dissertation Abstract

I present a new approach to implementing weighted logic programming languages. I first present a bag-relational algebra that is expressive enough to capture the desired denotational semantics, directly representing the recursive conjunctions, disjunctions, and aggregations that are specified by a source program. For the operational semantics, I develop a term-rewriting system that executes a program by simplifying its corresponding algebraic expression.

I have used this approach to create the first complete implementation of the Dyna programming language. A Dyna program consists of rules that define a potentially infinite and cyclic computation graph, which is queried to answer data-dependent questions. Dyna is a unified declarative framework for machine learning and artificial intelligence researchers that supports dynamic programming, constraint logic programming, reactive programming, and object-oriented programming. I have further modernized Dyna to support functional programming with lambda closures and embedded domain-specific languages.

The implementation includes a front-end that translates Dyna programs to bag-relational expressions, a Python API, hundreds of term rewriting rules, and a procedural engine for determining which rewrite rules to apply. The rewrite rules generalize techniques used in constraint logic programming. In practice, our system is usually able to provide simple answers to queries.

Mixing disparate programming paradigms is not without challenges. We had to rethink the classical techniques used to implement logic programming languages. This includes the development of a novel approach for memoization (dynamic programming) that supports partial memoization of fully or partially simplified algebraic expressions, which may contain delayed, unevaluated constraints. Furthermore, real-world Dyna programs require fast and efficient execution. For this reason, I present a novel approach to just-in-time (JIT) compile sequences of term rewrites using a custom tracing JIT.

Redmagic Meta-tracing JIT

This is a newly published repository from an experiment that I started during the summer to attempt to optimize Python’s runtime without impacting compatibility with C models and existing libraries.

There currently exists frameworks such as PyPy and Truffle/Graal which will generate a JIT when you design your programming language inside of their framework and use their compiler to generate a custom binary which implements a JIT for a given language.  One problem with these frameworks is that they require that an existing language is reimplemented essentially from scratch using a subset of Python or Java respectively.  Additionally, once a programming language is reimplemented, any existing modules which interface with internal aspects of the interpreter (any python C module) will not be compatible and will have to be rewritten.

Redmagic is similar to PyPy and Truffle/Graal in that it tries to be a framework for creating a JIT however it is different in that it tries to work with an existing C or C++ interpreter requiring only a few annotations inside the code to identify loops at the user level of the program. (cPython example)  It then generates traces of a user’s program by switching between letting the language’s interpreter run normally and on top of a virtual machine which records all instructions and branching directions.  Unlike other JITs it does not reorganize how memory is laid out and even goes as far as simulating pushing and popping of C level stack frames.  This means that at any point while running inside the virtual machine or inside the generated code, the system can resume the “normal” code simply by jumping to the corresponding instruction in the original interpreter.  This does come with the downside that it inherits all memory layout and pointer dereferencing performance issues that were present in the original interpreter.

Results: After much work on Redmagic, I have observed very slight (<2%) performance improvements in limited cases when working with cPython.  The issues of memory layouts (explored further in this post) seem to contribute significantly to Python’s performance issues and those are not addressed at all by the Redmagic implementation.  Additionally, it is not clear that these would be easy to address from Redmagic itself given that it is looking at a program from the level of x86 assembly instructions vs higher level representations where data-structures can be easily recovered.  I believe that there might be cases, possibly in math routines, where memory layouts have already been optimized and possibly allowing for more clever uses of branching could prove beneficial to performance.

Cost of Abstractions

This post is from a recent conversation that I had about the cost of abstractions in various languages. In choosing a programming language for a project, it is important to choose a language that gives a good mix of “time to develop” and “performance.” The first tends to be obvious from experience. However, for many, the performance of a programming language (and its implementations) is somewhat opaque. I am going to demonstrate the differences between Python, Java, and C++, as this gives a good sample of the differences between programming implementations. We can think of Python as representative of the general class of interpreter-based programming languages such as Ruby and Perl, and Javascript before 2005. Java represents the current state of the art in JIT technology and should generalize to general JVM targeted languages and even languages such as Javascript, which have seen countless hours invested in developing new implementations of the language. Finally, C++ is a “low level” language, however, it features a number of unique features, such as templates and stack allocation, which allows for zero cost abstractions which is sparsely seen in other languages.

What is abstraction

Abstraction in this post means the ability of the programmer to wrap a series of methods and data into some object which we are going to use throughout our program. An example of abstraction might be a class LogNumber which abstracts away the fact that we are representing some numbers inside of a log space instead of a linear space. This is fairly standard practice in machine learning applications and might even be necessary in some cases where the magnitude of the number is smaller than the range of a floating point number. If we have this abstraction then we can write an expression such as: LogNumber + LinearNumber and have the LinearNumber automatically converted into log space and thus we will get the correct behavior from our program. If we are programming without this abstraction and simply using primitive types, and we accidentally write float_in_log + float_in_linear we are going to get a number that is neither in log space or linear space and this is a bug. One of the most famous examples of this bug is with the Mars Climate Orbiter crashed due to a units mismatch between standard and metric system.

For the remainder of this post, we consider a simplified “LogNumber” class, A which is using a single 32 bit signed integer and getter and setter methods. One could easily imagine there being multiple getter and setter methods for when one wants to set a linear number to a log space number etc.

# Python
class A:
  def __init__(self, a):
    self._value = a
  def get(self):
    return self._value
  def set(self, a):
    self._value = a

class A {
  private int value;
  public A(int a) { value = a; }
  public int get() { return value; }
  public void set(int a) { value = a; }

class A {
  int32_t value;
  A(int32_t a) : value(a) {}
  int32_t get() { return value; }
  void set(int32_t a) { value = a; }

Memory overhead

First, let us consider the impact on memory that this abstraction has in each of these languages. Memory makes since as a place to start it is generally understood that the less memory that we use the more that we can fit into ram. Additionally, on modern hardware, accessing main memory tends to be a major bottle neck for performance with a read from main memory taking ~100 cycles while a local processor caches takes 10-20 cycles. As such, having an object which is twice as large means that we can fit half as many objects inside our processor cache and thus will end up performing costly accesses to main memory more often. Remember, while reading this next section, we are just wrapping a single 4 byte integer.

# Python
a_python = A(5)

A a_java = new A(5);

A *a_ptr_cpp = new A(5);

A a_stack_cpp(5);

Python: First for Python, in constructing the A class, we are going first to create a PyInstanceObject which contains 3 pointers plus the PyObject_HEAD which is 3 pointers plus a size_t bringing the size of this struct up to 56 bytes plus malloc overhead. (Note: the malloc overhead depends on the implementation of malloc used and could even be changed at runtime using something like LD_PRELOAD). Next, to actually save _value we have to construct the hash map that is going to back all the elements for this class instance. This means creating a PyDictObject which is a PyObject_HEAD, 3 size_t, 2 pointers, and 8 preallocated hash map slots for which each slot contains a size_t and 2 pointers bringing up the size of this object to 264 + malloc overhead. The number 5 in this case is a small integer and thus is already interned inside the python interpreter and the hash map key _value would be shared between all instances of this object and so we are not going to count it. Finally, we are going to have to maintain a pointer on Python’s stack to the object a_python that we just constructed. All together then this brings our total size to 328 + 2 * malloc overhead. Note: We could reduce the object size somewhat by using Python’s __slots__ feature which should avoid allocating an oversized hash map to back this object, however our object size is still going to be in the 100s of byte range.

Java: Next with Java, there is an 8 byte header that is added to all objects which contains information about the type of object and object state w.r.t. locking etc.  (more info) Additionally, Java will round the size of objects up to the next 8 byte increment which means that there will be 4 wasted bytes inside this particular object. Finally, we need to store a pointer to this object on our stack. One of the nice tricks in Java for saving memory is pointer compression where a pointer will be 4 bytes instead of 8 which is a significant savings given that Java programs tend to use a lot of pointers.  (pointer compression info). Together, this means that there will be 20 bytes in memory corresponding to this object.

C++: In the first case with a_ptr_cpp we can see that the object size will be 4 bytes as one would expect given the C++ standard. There will also be the additional malloc overhead associated with this object. This brings the total size for the a_ptr example to 12 + malloc overhead when including the stack pointer.
In the second case with a_stack_cpp we have this being directly allocated on the stack which means that we have no pointer or malloc overhead or pointer to the object, thus the total size is only 4 bytes. The fact that we are able to take a primitive type (int32_t) and wrap it inside some class and still consume the exact same amount of memory is what it means to have a zero cost abstraction.


Once C++ is compiled with some optimizations or a Java program has been running for long enough, it is feasible that both a_ptr_cpp and a_java are entirely stored inside a register saving 8 and 4 bytes respectively. Moreover, in the a_stack_cpp case we might have the integer value stored in a register, which means that the memory used in this case is 0 bytes.

Escape Analysis

For the C++ and Java case if a compiler can prove during optimizations that a dynamically allocated structure is not going to leave a method, then it is possible for that compiler to transform that object into a stack allocated object. This means that even for a_java and a_ptr_cpp could end up having zero memory consumption in the right circumstances.

  A a_java = new A(5);


  A *a_ptr_cpp = new A(5);
  cout &lt;&lt; a_ptr_cpp->get();
  delete a_ptr_cpp;


Array overhead

Arrays are basic primitive in nearly all languages when it comes to efficiently storing a number of similar objects. In the case of numerical processing applications, our main performance overheads are going to come down to how efficiently we store numbers and are able to access these values.

# Python
arr_python = [A(i) for i in range(10)]

A[] arr_java = new A[10];
for(int i = 0; i &lt; arr_java.length; i++)
  arr_java[i] = new A(i);

A *arr_ptr_cpp[] = new A*[10];
for(int i = 0; i &lt; 10; i++)
  arr_ptr_cpp[i] = new A(i);

A arr_obj_cpp[] = new A[10];
for(int i = 0; i &lt; 10; i++)
  arr_obj_cpp[i] = A(i);

std::vector&lt;A>; vec_obj_cpp(10);
for(int i = 0; i &lt; vec_obj_cpp.size(); i++)
  vec_obj_cpp[i] = A(i);

A arr_stack_cpp[10];
for(int i = 0; &lt; 10; i++)
  arr_stack_cpp[10] = A(i);

Python: First for python, when constructing a list we create a PyListObject which contains the PyObject_HEAD, a pointer to the head of the list and a size_t for the size of the list for 48 bytes. The list itself is simply a list of pointers to objects. Therefore the list object on its own will be 128 + 2 * malloc overhead. For every object, we are then going to have the same as we had above with 348 per object. This brings the total of this to 3608 bytes + 22 * malloc overhead.

Java: With Java, there are essentially two types of arrays. The first is an array of a primitive type (such as int[]) where the value of the integers will actually be stored inside the array itself. These arrays can be considered very efficient since there is a small overhead when storing a large number of primitive values. However, in this case, our array does not contain primitive types and instead must has to be an array of pointers. This means that the array itself will have the 12 byte overhead for each object created and then 10 pointers for which each is 4 bytes, making the list itself 48 bytes which is divisible by 8 so we do not have any additional wasted space here. Each object in the list will then be 16 bytes as before and we have a reference on the stack bringing the total for this up to 212 bytes.

C++: Finally with C++, we can see that there exist a number of possible implementations for an array of objects for which each of these have slightly different memory implications. In the arr_ptr_cpp case we are creating an array of pointers which is conceptually equivalent to the Python and Java implementations. Including the stack reference to this, it takes 120 + 11 * malloc overheads bytes in memory. Note: we are are going to have to explicitly deallocate all objects inside the array when freeing arr_obj_cpp which requires additional cognitive overhead when writing this program. In the second C++ example arr_obj_cpp we are allocating space inside the array for our object instead of using a pointer to reference the array. This means that the size of the array will only be 40 bytes bringing this implementations memory consumption to 48 + 1 * malloc overhead when including the stack pointer. The third case with vec_obj_cpp uses the C++ standard library and would be a more acceptable way of writing the arr_obj_cpp example and will use 2 additional size_t (16 bytes) to track the array’s size and amount allocated (64 + 1 * malloc overhead bytes). The final case is constructing an array directly on C++’s stack which means that we are avoiding the stack pointer reference as well as the malloc overhead and this will only require 40 bytes.
Again, we can see that C++ is capable of a zero overhead abstraction as in the last 3 examples the only overhead from directly using int32_t was O(1) bookkeeping.

Performance overhead

While the way in which data is stored is important, we also care about how these languages are going to perform when we go to execute our program.

# Python
a_python = A(5)

A a_java = new A(5);

A *a_ptr_cpp = new A(5);
cout &lt;&lt; a_ptr_cpp->get();

A a_stack_cpp(5);
cout &lt;&lt; a_stack_cpp.get();

Python: First, Python is the simplest language in the comparison, given that it is a simple bytecode interpreter. This means that it will not be performing any static compile time optimizations or attempting to compile away this abstraction.
Looking at just the a.set(6) line, we can easily look at the Python’s bytecode using the dis library:

# Calling a.set
3 12 LOAD_FAST 0 (a)
15 LOAD_ATTR 1 (set)
18 LOAD_CONST 2 (6)

# Method a.set implementation
7 0 LOAD_FAST 1 (a)
3 LOAD_FAST 0 (self)
6 STORE_ATTR 0 (_value)
9 LOAD_CONST 0 (None)

We can easily see that there are 10 python bytecode instructions. Some of these instructions such as STORE_ATTR and LOAD_ATTR represent hash table lookups to identify the _value and set slots. Additionally, nearly all of these instructions represent writes into memory given that Python uses reference counting, and by loading objects onto Python’s stack requires incrementing the reference count. To see full implementations of these byte codes you can look at ceval.c. From prior experience with python’s internals, I would guess that this line represents about 2000-7000 CPU instructions with over 30 writes to memory and 100 reads.

Java: With Java, it is a bit more difficult to exactly measure how much overhead there is when evaluating this instruction. This is because the JIT will be constantly generating new versions of this code which are more efficient. If we were to evaluate this instruction exactly once, then our bytecode will be running on an interpreter which is conceptually similar to python, but by not having reference counting and not backing every object with a hash map means that this is going to be significantly more efficient and about 100-500 instructions. After we have executed this statement a few times, the JIT will generate more optimized code. The exact code that is generated depends on a number of factors including if at the call site for a.set(6) if there exist multiple implementations of set that we must dispatch to (multiple classes implementing the same interface). Assuming that there will only be a single implementation of set, then the JVM will end up inlining and optimizing this down to 1 instruction which writes to memory.

C++: When using a pointer in C++, we can see statically that there is only a single implementation of A::set and so the compiler will likely inline this method call. This means that similar to the Java case it will have 1 instruction to perform the write into memory. In the second case with a_stack_cpp we might represent this instruction as a single write into a register. While both of these cases are going to be a single instruction, the later of writing into a register instead of memory will be much more “efficient.” Additionally, in a larger context, we could imagine that the compiler completely removes the representation of a_stack_cpp and just inlines the value 6 or 5 whatever is appropriate.

Again, a_stack_cpp is giving a zero cost abstraction while a_ptr_cpp and a_java are giving us a low cost abstraction w.r.t. the number of instructions evaluated.

Meta template programming

The advantages of zero cost overheads really shine when combined with templates and meta programming techniques. By having templated classes, we are able to construct a single class which implements many basic features that we are able to reuse.  For example, suppose that we had a LogNumber class such that we could have LogNumber<LinearNumber<float> > and LogNumber<LinearNumber<double> > which use the same same which class implementation with two different sizes of numbers. We could even extend this to have LogNumber<LogNumber<LinearNumber<float> > > which would allow for us to easily represent a number in a log log space without having to reimplemented our LogNumber class.

If we implemented this templated class in Java, then we would have that every template parameter would require a pointer and another class instance. This means that in the LogNumber<LogNumber<LinearNumber<Float> > > we would require that there are 4 class instances and thus would require 4 pointer indirection to actually reach the floating point value and consume 64 bytes of memory.

In C++, this templated class does not use a pointer indirection when using a template. This means that the size of LogNumber<LogNumber<LinearNumber<float> > > and LinearNumber<float> will be exactly the same size. Additionally, accessing the stored number will only take a single memory access instruction and can be placed on the stack or used inside an array as shown above.


Abstraction is a very important concept when writing larger programs. Having zero cost abstraction available means that we can have both performant programs while reducing the mental burden on the programmer. When programming in a language such as Java or Python, we have to make a choice between developing the most performant program in that language and having a clean and reusable abstraction.

We see the power of zero cost abstractions aggressively used inside libraries such as Eigen, Dlib and Tensorflow as these libraries care immensely about the computational and storage efficiency. Using C++’s template system, these libraries contain a single implementation for a generic routine which can be customized by switching out a few minor methods without losing any performance.

Theoretical online voting system

With the election a few days away, I found myself recently looking at the state of voting in America and contemplating that there is still no online-based voting system in place. The main arguments against online voting or digital-based voting has been that it would be hard to verify and would require a computer security expert to identify if something has been tampered with.

Now to create a system that is “provably incorruptible” would be very difficult and impracticable to expect average poll workers to verify the correctness of such a system, however, there is probably a widely unexplored range of systems that are better than our current system but still have some easy to verify properties. In this post, I attempt to create a voting system which no worse then our current voting system with respect to voter fraud and ensuring the votes are counted.

First, let’s consider the state of our current voting system, specifically the voting-by-mail system. Step 1 is to go online and register your address along with some identifying voter information. At a later point in time, the state will mail a ballot to your address which contains a “voting key” which maps various voting positions (who you would like for some office or position on a proposition) to some number \([1, 200]\). To vote, you bubble in your corresponding chosen numbers, wrap your ballot in more paper called a “secrecy sleeve,” put this in another envelope and mail it to the ballot counting location. Presumably, once your ballot arrives, someone will check the identifying information on the mailing envelope to prevent duplication and then pass the ballot and secrecy sleeve to someone else who is just going to count the votes. This two-level operation would prevent people from knowing who you voted for, assuming that the first poll works don’t look at the ballot inside the secrecy sleeve. In terms of ensuring that your vote is counted, we have to then trust the second poll worker to count the votes correctly. We might use more than one person for this second part to prevent errors etc.

Now in making a new system, we have to consider what possible vulnerabilities exist in the current system, as those could still be allowed in the new system:

  1. Trusting the United states postal services (USPS) to properly deliver mail — If your ballot never makes it back to the polling place, then it will essentially be lost (there might be some ways to identify that it is lost, but still no real/easy recourse for ensuring that it gets counted)
  2. The USPS needs to get you your ballot to you in the first place — If the ballot was sent to the wrong address, it is possible that someone fills in the ballot in your name, forges your signature, and then mails it back in
  3. People are trusted to bubble in their choice correctly — Eg, they are at least able to understand that given some “number” on a “ballot key,” they are supposed to transfer that number correctly to the ballot itself
  4. A malicious poll worker could prevent a vote from getting counted that they didn’t agree with — Given that your vote is easily identifiable on the ballot, it is trivial for someone to reject all ballots which have bubbled in number 10 (ideally, there are two or more people to double check that this does not happen)

Given this set of vulnerabilities in our current system, lets now try to design a better system that allows for internet voting:

Our first steps would be very similar to the current voting system, where someone goes online and registers with their mailing address. The state would then mail out a “ballot key” to the provided address. The reason that we would still require that something is mailed out is that there is currently no good way to identify a citizen online in a secure way, however, like the current vote-by-mail system, it is acceptable to trust the USPS as a “broker of identities.” Now our vote by internet ballot key will be a bit different from existing ballots where each vote is represented by \([1, 200]\) and instead have a number in \([0, 2^{256}]\), additionally, instead of having a single number (say 10) represent a position on the ballot, each voter would be given a unique number for each position on the ballot. (A sample ballot is at the end of this post) We can then use a simple website to collect the keys which represent a person’s choice. Given that each user has different codes generated for their ballot, we can use untrusted channels to communicate these codes to the vote-counting authority. Additionally, we do not have to worry about “suppressing” the vote that a poll worker disagrees with since the intermediate communication mechanisms don’t even know which vote was cast. All they know is that they are responsible for is communicating some number to the voting authority.  Even if voter’s computer was infected with a computer virus, it would be unable to change your vote since it only knows the key that was entered representing your choice, while the other keys would only be present on the paper ballot key that was mailed to your address.

Some properties of this system:

  1. We are still trusting the USPS to properly identify people and communicate information with them securely. (Same as before)
  2. Submitting a vote for someone else still depends on your receiving or intercepting their ballot and “forging” a signature (Same as before)
  3. The intermediaries do not know your vote (better than before) — Now your vote is a number that is specific to you, so the only people who will know the vote are the person who generated the “voting key” and whoever has the voting key
    1. The intermediaries can not suppress your vote based on who you voted for — They do not who you voted for, so it can not be suppressed based on this reason
    2. Your vote can not be changed after the fact — Changing your vote would require that the malicious intermediary have your “voting key book,” which was printed by the state and mailed by the USPS (which is a trusted medium)
    3. Your computer (now technically also an intermediary) can not change your vote even if it was infected with a virus —  your computer does not know the alternate keys you were provided since they were printed and mailed, so it can not switch between them.
  4. The number that you have to enter is a lot longer (worse) — Currently, you only enter some number \([1, 200]\), however, a 256 bit number is notably longer.  Given how people are already used to entering 16 digit credit card numbers, this might not be such a big issue.  We could even include checksums to limit erroneously entering something (bitcoin already uses a 32-bit checksum on all addresses)

Some might point out that one could set up false voting websites to try to confuse voters or perform a DOS attack on the voting website. First, with false websites, we could follow the trend of some banking websites where an image is displayed to ensure that you are on the correct website. However, we might make it some confirmation code that is sufficiently long that it would be difficult to counterfeit and easy to print on a “ballot key.” For the DOS attack, we already know how to build systems that can deal with DOS attacks. Additionally, if we have a confirmation code system that confirms that a vote has been recorded, then any mechanism which takes a voting key and returns the confirmation code is as good as any other. This means you could have voting via email or even text message, which are “more difficult” to perform a DOS attack against or allow for third-party websites to spring up to collect votes as they would have to be still backed by the state vote recording authority.

Sample theoretical ballot key:

Politician Office key Confirmation code
T Sandwich President NMosFJizjPgUV2BKEhGE rjvUZzKZVAFCyqPy7w3t FuT8VDz3z
Giant D President Tru4oZn9y3RMnxAsb7g 5Gqs7Fu13FX4ExaQSer6y bFcCf4MJA
None of the above President LaGeinvoBUduEbovp5z JDQJ6DQEdgSqZWgXzArhi xjzEahMdi

(These politician names are based on this current season of south park.)

TL;DR: Online voting where you are still mailed your ballot via USPS, and your ballot contains keys that we consider “secure,” and you only submit one key that corresponds to your vote.

Update / additional background info / other posts on this topic:

While the mathematical concepts in these schemes are sound, it would be difficult to convince the public at large.  In these cases, people would have to just generally trust that someone has done the correct thing with designing the voting systems.  From an academic point of view, if these systems are implemented correctly, there wouldn’t even be a need for there to be vote checkers since they would “have to be correct.”

Estimating the n percentile of a set

Here is an interesting idea that I had recently, this is just a high level concept about how it would work, there are no proofs for error bounds or quality, in fact there would be a handful of orderings of sets which would product terrible results.


To accurately compute the \(n^{th}\) percentile value of a given set of values, one ends up having to sort the values which if they are not integers, can take \(O(n \log n)\) time.  However, then getting value itself is trivial since it is simply going to the correct place in the sorted values.  I am thinking that one should be able to compute an estimate for the \(n^{th}\) value for a randomly ordered set of elements in \(O(n)\) time.

First the basic idea, imagine that you have a set of elements \(X = \{ x_i \}\).  Then if we had this set sorted \(S\), then finding the \(n^{th} (0 \le n \le 1)\) percentile out of \(X\) would simply become \(s_{n * |S|}\).  This implies that we have \(n * |S|\) elements less than \(s_{n *|S|}\) and \(|S|*(1 – n)\) elements greater.  From this point we can imagine constructing two sets, \(\alpha = \{ x \in X : x < s_{n * |S| } \}, \beta = \{ x \in X : x > s_{n * |S|} \}\) which represent the elements greater and less then the \(n^{th}\) value.  This also means that \(\frac{|\alpha|}{|\alpha| + |\beta|} \approx n\).  Now using this concept for \(\alpha\) and \(\beta\), we can attempt to construct these sets while iterating through \(X\) by having a current estimate for the value \(s\), and then tracking the elements current in each set.  This essentially becomes if \(\frac{|\alpha|}{|\alpha| + |\beta|} > n + \epsilon\) then take the current value of \(s\) and insert it into \(\beta\), then take the largest element out of \(\alpha\) and set it equal to \(s\).  In the case of \(\frac{|\alpha|}{|\alpha| + |\beta|} < n – \epsilon\) we simply do the reverse by inserting the current value of \(s\) into \(\alpha\), and then removing the smallest value from \(\beta\) and setting it equal to \(s\).

Now the problem has been reduce to splitting \(X\) into two different sets and keeping them sorted somehow to be able to get and remove the largest/smallest elements.  However, this would give an exact answer to finding the \(n^{th}\) percentile.  Now given that we want to find an estimate, we can imagine capping the size of these sets to \(k\), where \(k\) is a small number such as \(2\).  Then instead of tracking the elements themselves, we are simply counting the number of elements that are greater or less than the current \(s\) value.  Additionally, we have the sets tracking the \(k\) elements that are largest but still less then \(s\), and the smallest but greater then \(s\).  As we iterate through the set, we can track the \(k\) values in \(\alpha, \beta\) and the size of \(\alpha, \beta\) accordingly, and when we want to change the value of \(s\) to keep \(\frac{|\alpha|}{|\alpha| + |\beta|} \approx n\), then we just take the new value from \(\alpha\) or \(\beta\) respectively and update the cardinality of each set.

An additional extension to this algorithm could be for an \(n \approx .999\) then the size of \(\beta\) would only be \(\frac{1}{1000}\) the size of the original data set.  Keeping track of largest \(.001\) of the data set would not be linear in the side of the data set, however it could take out a considerable chunk of computation depending on how large or small the value of \(n\) is.

The Hackathon paradigm

Today I was looking at a lot of the different applications that I normally use on my phone and through my web browser.  If I was talking to someone who had never experienced either of these before, they might believe that I generally have a single specific device for a specific task, and that in terms of functionality there would be little overlap of major features.  However, for anyone that has experienced either of these mediums, they are aware of the wide variety of applications and programs that duplicated the functions of other applications.

My complain is two-pronged on this issue of applications that start or continue with a Hackathon paradigm.  First, the old Unix philosophy says do one thing and do one thing well.  On this specific point, I believe that many applications start out with the right intentions, however over time there is a significant feature creep effect that takes place.  I believe that this is the result of “Hackathon project” becoming more than “Hackathon projects.”  The developer of these application feel that they are going to form a company with a project that in terms of complexity should really be no more than a side project.  Essentially what I am saying, is to develop and maintain your application X, it _might_ take 2 hours a week once it is launched.  However, these individuals choose to attempt to make a 50 hour, startup styled, work week out of these types of projects.

My second issue with the “Hackathon projects” is don’t assume that something that you can write in 24 hours is not easily copied.  There are a lot of very complex and difficult problem that exist in the world today.  Nearly all of these types of problems can not be solved in a short period of time.  Additionally, if a product can be made in 24 hours given the proper tools and skills, then it is only a matter of time before there is a large number of people who will be competing with you.  Some might even have better products given that they were able to replicate your product in such a short period of time, and then vastly improve upon there after.

With these two issues, I am not saying that Hackathons are bad,Hackathons provide a relativity easy way to create demos to show of skills.  However when it comes to publishing a product I believe that people should think a little more about what they are going to create and invest enough time into the product such that it is not just going to be another 1 of 100 “identical” products.


It seems a little strange that I have not yet written a blog entry on my own project JSApp.US

For those that do not know what it is, it is a platform to write applications quickly in Javascript using the node.js platform and then quickly deploy these applications to the web.  To some extent this is my attempt to relive the days of appjet, but with a new and more powerful platform.

At it stands now, it is my most popular site ever, it laded its self to the reddit/programming page, and was on the top of hackernews for about 3 hours.  It seems to be very successful for now and even I have a few applications that are using it, such as, it is powering the little visitor count that I have placed on the side bar.

There are a few new features that I am still planing to implement.  The first feature is going be a sorta profile that will allow for people to “show off” their applications that they have on JSApp.US.  I also have plans to make it so that one can require in files from other users.  This should end up like how github handles it with users, the basis will look like: require(‘other_user/’).

Dropbox as back-end for git

I have been using Dropbox for a good amount of time now, and something that I have recently started using Dropbox for is as a back-end system for private git repositories.  While Github is great for public repos, if you just want to have a private project synced between a few computers or with another friend on Dropbox then the steps are more or less straight forward.

1) Open up a terminal and cd into your dropbox folder,

2) You have to make a bare git repository that is used to hold all of the git objects, this can be done by issuing the “git init –bare repo.git” command.  This will make a new directory called repo.git with all the supporting files.

3) At this point all that is left to do is link up your working git directory with the Dropbox one, the easiest way to do this is cd into the repo.git directory.  Then Issue the ‘pwd’ command, this will give you the absolute path (git will not use relative paths for some reason).  Then copy the output from pwd and then cd into your working directory, and issue the command “git remote add origin file://[Paste in what ever came from pwd]”

4) Now that your working directory is linked with the directory on Dropbox issue the command “git push origin master” to load all your files into the dropbox.  One thing that you will notice is that this will run a lot faster than when working with a remote repo over the internet.  The reason is that you are not pushing files onto the intern but into another folder on your computer.  If you look at you dropbox icon at this point you will see that it is syncing a number of files.  If you have worked up a larger number of changed between syncing with dropbox you might see that it is syncing in excess of 200 files with the server.

5) To load the project onto another computer you can use the “git clone file://[path to the dropbox repo.git folder]”  One thing to note before pulling any changes is to make sure that dropbox has finished syncing all of the files.  If you trying and pull or clone out of the Dropbox directory before it is done syncing then you can run into problem as git will look for files that are not there yet as dropbox has not finished syncing, so as a not of caution, wait for dropbox to finish syncing before cloning or pulling.

When you want to share a project with someone else, then all you have to do is share the repo.git folder with that person, they can then run a clone command of that file to make their own working directory.  While this does give an easy way to have free private 2 GB of git syncing space, there is no cool Github like web interface for looking at code.  And as another note, anyone that can modify the repo.git folder will be an ‘admin’ to the project as all the server is really is just syncing file between computers.