This post came out of a recent conversation I had about the cost of abstractions in various languages. When choosing a programming language for a project, it is important to pick one that gives a good mix of “time to develop” and “performance.” The first tends to be obvious from experience. However, for many, the performance of a programming language (and its implementations) is somewhat opaque. I am going to demonstrate the differences between Python, Java, and C++, as these give a good sample of the design space of language implementations. We can think of Python as representative of the general class of interpreter-based languages such as Ruby, Perl, and Javascript before 2005. Java represents the current state of the art in JIT technology and should generalize to other JVM-targeted languages, and even to languages such as Javascript that have seen countless hours invested in new implementations. Finally, C++ is a “low level” language; however, it has a number of distinctive features, such as templates and stack allocation, which allow for zero cost abstractions of a kind rarely seen in other languages.
What is abstraction
Abstraction in this post means the ability of the programmer to wrap a series of methods and data into some object which we are going to use throughout our program. An example of abstraction might be a class LogNumber
which abstracts away the fact that we are representing some numbers in log space instead of linear space. This is fairly standard practice in machine learning applications and may even be necessary in cases where the magnitude of the number is too small to represent in a floating point number. If we have this abstraction, then we can write an expression such as: LogNumber + LinearNumber
and have the LinearNumber
automatically converted into log space and thus we will get the correct behavior from our program. If we are programming without this abstraction and simply using primitive types, and we accidentally write float_in_log + float_in_linear
we are going to get a number that is neither in log space nor in linear space, which is a bug. One of the most famous examples of this kind of bug is the Mars Climate Orbiter, which crashed due to a units mismatch between US customary and metric units.
For the remainder of this post, we consider a simplified “LogNumber” class, A, which wraps a single 32 bit signed integer behind getter and setter methods. One could easily imagine having multiple getters and setters for, e.g., setting a log space number from a linear number.
# Python
class A:
    def __init__(self, a):
        self._value = a

    def get(self):
        return self._value

    def set(self, a):
        self._value = a
// Java
class A {
    private int value;

    public A(int a) { value = a; }

    public int get() { return value; }

    public void set(int a) { value = a; }
}
// C++
class A {
private:
    int32_t value;

public:
    A(int32_t a) : value(a) {}

    int32_t get() { return value; }

    void set(int32_t a) { value = a; }
};
Memory overhead
First, let us consider the impact this abstraction has on memory in each of these languages. Memory makes sense as a place to start since it is generally understood that the less memory we use, the more we can fit into RAM. Additionally, on modern hardware, accessing main memory tends to be a major bottleneck for performance, with a read from main memory taking ~100 cycles while the local processor caches take 10-20 cycles. As such, an object which is twice as large means that we can fit half as many objects inside our processor cache, and thus we will end up performing costly accesses to main memory more often. Remember, while reading this next section, that we are just wrapping a single 4 byte integer.
# Python
a_python = A(5)
// Java
A a_java = new A(5);
// C++
A *a_ptr_cpp = new A(5);
A a_stack_cpp(5);
Python: First, in constructing an instance of the A class, Python creates a PyInstanceObject, which contains 3 pointers plus the PyObject_HEAD (itself 3 pointers plus a size_t), bringing the size of this struct to 56 bytes plus malloc overhead. (Note: the malloc overhead depends on the implementation of malloc used and could even be changed at runtime using something like LD_PRELOAD.) Next, to actually save _value, we have to construct the hash map that backs all the attributes of this class instance. This means creating a PyDictObject, which is a PyObject_HEAD, 3 size_t, 2 pointers, and 8 preallocated hash map slots, where each slot contains a size_t and 2 pointers, bringing the size of this object up to 264 bytes + malloc overhead. The number 5 in this case is a small integer and thus is already interned inside the Python interpreter, and the hash map key _value would be shared between all instances of this class, so we are not going to count them. Finally, we have to maintain a pointer on Python’s stack to the object a_python that we just constructed. Altogether, this brings our total size to 328 bytes + 2 * malloc overhead. Note: we could reduce the object size somewhat by using Python’s __slots__ feature, which avoids allocating an oversized hash map to back the object; however, our object size is still going to be in the 100s-of-bytes range.
Java: Next, with Java, there is an 8 byte header added to all objects which contains information about the type of the object and its state w.r.t. locking etc. Additionally, Java will round the size of objects up to the next 8 byte increment, which means that there will be 4 wasted bytes inside this particular object. Finally, we need to store a pointer to this object on our stack. One of the nice tricks Java uses for saving memory is pointer compression, where a pointer is 4 bytes instead of 8, a significant saving given that Java programs tend to use a lot of pointers. Together, this means that there will be 20 bytes in memory corresponding to this object.
C++: In the first case with a_ptr_cpp
we can see that the object size will be 4 bytes, as one would expect given the C++ standard. There will also be the additional malloc overhead associated with this object. This brings the total size for the a_ptr_cpp example to 12 bytes + malloc overhead when including the stack pointer.
In the second case with a_stack_cpp
we have the object allocated directly on the stack, which means there is no malloc overhead and no pointer to the object; thus the total size is only 4 bytes. The fact that we are able to take a primitive type (int32_t
) and wrap it inside some class and still consume the exact same amount of memory is what it means to have a zero cost abstraction.
Registers
Once C++ is compiled with some optimizations or a Java program has been running for long enough, it is feasible that both a_ptr_cpp
and a_java
are entirely stored inside a register saving 8 and 4 bytes respectively. Moreover, in the a_stack_cpp
case we might have the integer value stored in a register, which means that the memory used in this case is 0 bytes.
Escape Analysis
For C++ and Java, if the compiler can prove during optimization that a dynamically allocated structure never leaves a method, then it is possible for the compiler to transform that object into a stack allocated one. This means that even a_java
and a_ptr_cpp
could end up having zero memory consumption in the right circumstances.
// Java
{
    A a_java = new A(5);
    a_java.set(6);
    System.out.print(a_java.get());
    // we never let anything outside this block get the reference to a_java
}
// C++
{
    A *a_ptr_cpp = new A(5);
    a_ptr_cpp->set(6);
    cout << a_ptr_cpp->get();
    delete a_ptr_cpp;
    // the compiler is able to see where this object is constructed and destroyed
    // if we didn't delete it, the leak could still be considered an escape in C++
}
Array overhead
Arrays are a basic primitive in nearly all languages for efficiently storing a number of similar objects. In numerical processing applications, our main performance overheads come down to how efficiently we can store numbers and access those values.
# Python
arr_python = [A(i) for i in range(10)]
// Java
A[] arr_java = new A[10];
for (int i = 0; i < arr_java.length; i++)
    arr_java[i] = new A(i);
// C++
// note: A needs a default constructor for new A[10] and
// std::vector<A>(10) below to compile
A **arr_ptr_cpp = new A*[10];
for (int i = 0; i < 10; i++)
    arr_ptr_cpp[i] = new A(i);

A *arr_obj_cpp = new A[10];
for (int i = 0; i < 10; i++)
    arr_obj_cpp[i] = A(i);

std::vector<A> vec_obj_cpp(10);
for (size_t i = 0; i < vec_obj_cpp.size(); i++)
    vec_obj_cpp[i] = A(i);

A arr_stack_cpp[10];
for (int i = 0; i < 10; i++)
    arr_stack_cpp[i] = A(i);
Python: First, for Python, when constructing a list we create a PyListObject, which contains the PyObject_HEAD, a pointer to the head of the list, and a size_t for the size of the list, totaling 48 bytes. The list itself is simply an array of pointers to objects. Therefore the list on its own will be 128 bytes + 2 * malloc overhead. Every object then costs the 320 bytes we computed above (the pointer to it is already counted in the list), bringing the total to 3328 bytes + 22 * malloc overhead.
Java: With Java, there are essentially two types of arrays. The first is an array of a primitive type (such as int[]) where the values of the integers are stored inside the array itself. These arrays are very efficient since there is only a small fixed overhead when storing a large number of primitive values. However, in this case, our array does not contain a primitive type and instead has to be an array of pointers. This means that the array itself will have an object header plus 10 compressed pointers of 4 bytes each, making the list itself 48 bytes, which is divisible by 8 so no space is wasted on padding. Each object in the list is then 16 bytes as before, and we have a reference on the stack, bringing the total up to 212 bytes.
C++: Finally, with C++, we can see that there exist a number of possible implementations for an array of objects, each of which has slightly different memory implications. In the arr_ptr_cpp case, we are creating an array of pointers, which is conceptually equivalent to the Python and Java implementations. Including the stack reference, this takes 128 bytes + 11 * malloc overhead. Note: we are going to have to explicitly deallocate every object inside the array when freeing arr_ptr_cpp, which requires additional cognitive overhead when writing the program. In the second C++ example, arr_obj_cpp, we are allocating space for the objects inside the array itself instead of storing pointers to them. This means that the array will only be 40 bytes, bringing this implementation’s memory consumption to 48 bytes + 1 * malloc overhead when including the stack pointer. The third case, vec_obj_cpp, uses the C++ standard library and would be a more idiomatic way of writing the arr_obj_cpp example; it uses 2 additional size_t (16 bytes) to track the array’s size and allocated capacity (64 bytes + 1 * malloc overhead). The final case constructs the array directly on C++’s stack, which avoids both the stack pointer reference and the malloc overhead, and so requires only 40 bytes.
Again, we can see that C++ is capable of a zero overhead abstraction: in the last 3 examples, the only overhead relative to directly using int32_t was O(1) bookkeeping.
Performance overhead
While the way in which data is stored is important, we also care about how these languages are going to perform when we go to execute our program.
# Python
a_python = A(5)
a_python.set(6)
print(a_python.get())
// Java
A a_java = new A(5);
a_java.set(6);
System.out.print(a_java.get());
// C++
A *a_ptr_cpp = new A(5);
a_ptr_cpp->set(6);
cout << a_ptr_cpp->get();

A a_stack_cpp(5);
a_stack_cpp.set(6);
cout << a_stack_cpp.get();
Python: First, Python is the simplest language in this comparison, given that CPython is a straightforward bytecode interpreter. This means that it performs no static compile time optimizations and makes no attempt to compile away this abstraction.
Looking at just the a.set(6)
line, we can easily inspect Python’s bytecode using the dis module:
# Calling a.set
  3          12 LOAD_FAST                0 (a)
             15 LOAD_ATTR                1 (set)
             18 LOAD_CONST               2 (6)
             21 CALL_FUNCTION            1
             24 POP_TOP

# Method a.set implementation
  7           0 LOAD_FAST                1 (a)
              3 LOAD_FAST                0 (self)
              6 STORE_ATTR               0 (_value)
              9 LOAD_CONST               0 (None)
             12 RETURN_VALUE
We can easily see that there are 10 Python bytecode instructions. Some of these instructions, such as STORE_ATTR and LOAD_ATTR, represent hash table lookups to identify the _value and set slots. Additionally, nearly all of these instructions write to memory: Python uses reference counting, so merely loading an object onto Python’s stack requires incrementing its reference count. To see the full implementations of these bytecodes you can look at ceval.c. From prior experience with Python’s internals, I would guess that this line represents about 2000-7000 CPU instructions, with over 30 writes to memory and 100 reads.
Java: With Java, it is a bit more difficult to measure exactly how much overhead there is when evaluating this statement, because the JIT is constantly generating new, more efficient versions of the code. If we were to evaluate this statement exactly once, our bytecode would run on an interpreter conceptually similar to Python’s, but having no reference counting and no hash map backing every object makes it significantly more efficient, at about 100-500 instructions. After we have executed this statement a few times, the JIT will generate more optimized code. The exact code generated depends on a number of factors, including whether the call site for a.set(6) has to dispatch to multiple implementations of set (multiple classes implementing the same interface). Assuming that there is only a single implementation of set, the JVM will end up inlining and optimizing this down to 1 instruction which writes to memory.
C++: When using a pointer in C++, we can see statically that there is only a single implementation of A::set
and so the compiler will likely inline this method call. This means that, similar to the Java case, it takes 1 instruction to perform the write into memory. In the second case, with a_stack_cpp
we might represent this instruction as a single write into a register. While both of these cases take a single instruction, the latter, writing into a register instead of memory, will be much more “efficient.” Additionally, in a larger context, we could imagine the compiler completely removing the representation of a_stack_cpp and simply inlining the value 6 or 5, whichever is appropriate.
Again, a_stack_cpp
is giving a zero cost abstraction while a_ptr_cpp
and a_java
are giving us a low cost abstraction w.r.t. the number of instructions evaluated.
Template metaprogramming
The advantages of zero cost abstractions really shine when combined with templates and metaprogramming techniques. With templated classes, we can construct a single class implementing many basic features and reuse it widely. For example, suppose that we had a LogNumber
class such that we could have LogNumber<LinearNumber<float> >
and LogNumber<LinearNumber<double> >
which use the same class implementation with two different sizes of numbers. We could even extend this to have LogNumber<LogNumber<LinearNumber<float> > >
which would allow us to easily represent a number in log-log space without having to reimplement our LogNumber class.
If we implemented this templated class in Java (as a generic class), every type parameter would require a pointer to another class instance, since Java generics only work with boxed reference types. This means that LogNumber<LogNumber<LinearNumber<Float> > > would require 4 class instances, and thus 4 pointer indirections to actually reach the floating point value, consuming 64 bytes of memory.
In C++, this templated class requires no pointer indirection. This means that LogNumber<LogNumber<LinearNumber<float> > > and LinearNumber<float> will be exactly the same size. Additionally, accessing the stored number takes only a single memory access instruction, and the value can be placed on the stack or inside an array as shown above.
Conclusion
Abstraction is a very important concept when writing larger programs. Having zero cost abstractions available means that we can write performant programs while reducing the mental burden on the programmer. In a language such as Java or Python, we have to choose between writing the most performant program possible in that language and having a clean, reusable abstraction.
We see zero cost abstractions used aggressively inside libraries such as Eigen, Dlib and Tensorflow, as these libraries care immensely about computational and storage efficiency. Using C++’s template system, these libraries contain a single implementation of a generic routine which can be customized by swapping out a few small methods without losing any performance.