Intel Xeon Phi for “cheap”

(This work and post date from early 2015. Some aspects may still be useful, e.g. the kernel patch for lower-end motherboards.)

Recently Intel has been selling a version of their Xeon Phi coprocessor under a promotional deal at 90% off.  This means that one can get a device with 8GB of RAM (on the coprocessor) and 228 hardware threads (57 physical cores, each with 4 hyper-threads) at a reasonable price of ~$200.

When I first purchased the Phi, I was planning to put it into a somewhat old desktop system that I had lying around. However, the motherboard did not support the major requirement of “Above 4G decoding” on the PCI bus.  4G decoding deals with how the system allocates memory resources to devices on the PCI bus.  Unlike consumer-level GPUs, the Phi presents all 8GB of its memory as a memory-mapped region to the host computer.  (more about 4G decoding)

Based on some research into this obscure feature, it appeared that most “modern” motherboards have some support for it.  I decided to get an Asus h97m-plus, which is fairly cheap and fit the computer tower that I already had on hand.  While this motherboard does list above 4G decoding in its BIOS and manual, I am not sure the feature has been properly tested: unlike Asus's higher-end motherboards, there was no mention of this motherboard specifically working with above 4G decoding.

From examining the early boot sequence, it appeared that the Linux kernel either tries to find an alignment for each PCI device equal in size to the requested memory region (8GB in this case), or depends on the BIOS to perform the PCI allocation before booting.  On the higher-end motherboards that the Phi was known to work with, it appears that the “more powerful BIOSes” were allocating memory for the Phi themselves. On this lower-end motherboard, the BIOS was unable to handle a request to allocate 8GB of memory and fell back on the kernel to perform the allocation.  Following this observation, I made a small kernel patch (here) which changes requests for an alignment larger than the maximal supported size so that they are simply aligned at the maximal supported size.  With the components in this computer, the Phi ends up aligned to a 4GB boundary even with this change, and it still functions correctly.
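The idea behind the patch can be illustrated with a small model. This is only a sketch of the clamping logic described above, not the kernel patch itself: the function names, the 4GB maximum, and the example addresses are all assumptions made for illustration.

```python
GIB = 1 << 30  # bytes in one GiB


def clamp_alignment(requested_align: int, max_align: int) -> int:
    """Cap a requested alignment at the largest alignment we can satisfy."""
    return min(requested_align, max_align)


def align_up(addr: int, align: int) -> int:
    """Round addr up to the next multiple of align (align is a power of two)."""
    return (addr + align - 1) & ~(align - 1)


def place_region(first_free: int, size: int, max_align: int) -> int:
    """Choose a start address for a region that asks to be aligned to its size.

    Hypothetical helper: the region requests size-aligned placement, and the
    request is clamped to max_align instead of being rejected.
    """
    align = clamp_alignment(size, max_align)
    return align_up(first_free, align)


if __name__ == "__main__":
    # Assumed numbers: an 8 GiB region, first free address at 9 GiB,
    # and a maximum supported alignment of 4 GiB.
    start = place_region(first_free=9 * GIB, size=8 * GIB, max_align=4 * GIB)
    print(f"region placed at {start // GIB} GiB")
```

With these assumed numbers the region lands on a 4GB boundary that is not an 8GB boundary, which mirrors the situation described above where the Phi still worked.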

The next challenge, once the Phi was communicating with the computer, was to prevent the chip from overheating.  The discounted versions of the Phi did not include any fans, as they were designed for use in server environments.  Additionally, being a 300+W accelerator, the card is capable of generating a lot of heat.  As such, many “typical” fan solutions that I tried failed to keep the chip cool for longer than a few minutes.  I eventually landed on the high-powered tornado fan, which can move over 80 cubic feet of air a minute.  I ended up having to zip tie this over one end of the card to ensure that there was enough directed airflow to keep it functional.  (Warning to future users: this fan actually does sound like a tornado, constantly.)
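For a rough sense of scale, the airflow needed to carry away a given amount of heat can be estimated from the heat capacity of air. This is a back-of-the-envelope sketch, not a measurement of this setup: it assumes all of the air actually passes through the heatsink, uses approximate air properties, and the 20 K temperature rise is an arbitrary choice.

```python
AIR_DENSITY = 1.2            # kg/m^3, approximate for room-temperature air
AIR_SPECIFIC_HEAT = 1005.0   # J/(kg*K), approximate
M3_PER_S_TO_CFM = 2118.88    # cubic feet per minute in one m^3/s


def required_airflow_cfm(power_watts: float, delta_t_kelvin: float) -> float:
    """Ideal airflow (CFM) to remove power_watts with a delta_t_kelvin air temperature rise."""
    m3_per_s = power_watts / (AIR_DENSITY * AIR_SPECIFIC_HEAT * delta_t_kelvin)
    return m3_per_s * M3_PER_S_TO_CFM


if __name__ == "__main__":
    # Assumed values: 300 W of heat, air allowed to warm by 20 K.
    print(f"{required_airflow_cfm(300.0, 20.0):.1f} CFM")
```

Under these assumptions the ideal figure comes out at roughly 26 CFM. A real fan has to push that much air through a dense heatsink rather than around it, which is one plausible reason ordinary case fans fell short and directed airflow mattered.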

Having had the entire system functional for over a year now, I have managed to use the Phi for a handful of computations.  While there is a decent opportunity for improved performance, this chip really requires that you design customized software for it specifically.  This is especially true given that the Phi is less popular than graphics cards with CUDA, where many mathematical tools and frameworks already have a customized backend targeting CUDA that requires limited effort on the user’s part.  While this chip has the nice promise of being able to execute normal x86 instructions, this seems to be of fairly limited use, since the only compiler that will target the chip and use its specialized vector instructions is Intel’s own compiler (similar in nature to CUDA).  This makes it fairly difficult to natively run any non-trivial program on this chip, as any external libraries require their own porting effort.  (Used as an accelerator that a host program offloads embedded methods to, similar to CUDA, this chip works fine. The difficulty only arises if you are trying to run a program without the host’s involvement.)



Photos of the setup: