“Scale and vectorize”: these are the two prerequisites I have read about again and again while searching for ways to make a program run faster on Xeon Phi. The principle of scaling is clear to me, and the raytracing applications I am testing do scale well on our server processors and on Xeon Phi, as I will show in this post. But the overall performance on Xeon Phi is still worse than on the server CPUs.
As another approach to getting a fast Xeon Phi application, I tried to offload the Geometric Algebra raytracer from my master's thesis to Xeon Phi. The offloaded code is based not on the C++ version of my software but on the OpenCL code that I tested on AMD and NVIDIA cards while working on the thesis. This code uses only scalars and arrays, so it avoids a limitation of the MIC compiler: it cannot offload objects that can't be copied with a simple memcpy. In fact, this code runs much faster on both the CPU and the Xeon Phi than the original C++ version, but it is clearly much harder to understand and to maintain. As an example of the performance, I present the following scene:
It consists of 6500 triangles, uses shadow rays and has recursion level 2. The resulting rendering times are:
- Xeon (C/OpenCL-based version): 0.40 s
- Xeon (C++ version): 1.05 s
- Xeon Phi (C/OpenCL-based version): 1.05 s
So once again the Xeon Phi cannot match the performance of the two server processors. But why? First I did a scaling analysis, with the following result:
So with 120 threads the Xeon Phi runs about 80 times faster than single-threaded. I think that is a pretty good value, and not the reason why the Xeon CPU is so much faster. That's why I tested the second prerequisite, “vectorization”, in order to find the performance issue. And there it is: it makes no measurable difference to the runtime whether I compile with vectorization or with the -no-vec flag, which suppresses the generation of vector code. The SIMD units of the Phi cores seem to be completely unused by my program. I have tried a few things so far but have not achieved anything in this direction.
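To make the vectorization point more concrete, here is a minimal sketch of a loop shape that compilers can usually turn into SIMD code: the data is stored in separate flat arrays, every iteration is independent, and the loop body has no branches. The function name `dot_batch` and the array layout are assumptions made for illustration; this is not code from my raytracer.

```cpp
#include <cstddef>

// Hypothetical helper (not from the raytracer): computes n dot products
// between direction vectors (dx, dy, dz) and normals (nx, ny, nz).
// Each component lives in its own flat array, so consecutive iterations
// read consecutive memory locations, which is what SIMD loads want.
void dot_batch(const float* dx, const float* dy, const float* dz,
               const float* nx, const float* ny, const float* nz,
               float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        // No branches and no dependency between iterations.
        out[i] = dx[i] * nx[i] + dy[i] * ny[i] + dz[i] * nz[i];
    }
}
```

If a loop like this shows the same runtime with and without vectorization, the hot path is probably somewhere else, or the time is dominated by memory access rather than arithmetic. The compiler's vectorization report is the place to check which loops were actually vectorized.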
I also did a few tests on thread affinity. scatter performs a few percent faster than compact, but such small differences can't explain the Xeon Phi's poor rendering times.
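The difference between the two affinity types can be illustrated with a deliberately simplified model. It assumes a card with 60 cores and 4 hardware threads per core (these numbers are an assumption for the example, not taken from my measurements) and only mimics the general idea: compact packs consecutive threads onto the same core, scatter spreads them round-robin over all cores. The real runtime's placement rules are more detailed than this.

```cpp
#include <cstddef>
#include <set>

// Simplified model: compact fills all hardware threads of one core
// before moving on to the next core.
int compact_core(int thread_id, int threads_per_core) {
    return thread_id / threads_per_core;
}

// Simplified model: scatter hands out threads round-robin over the cores.
int scatter_core(int thread_id, int num_cores) {
    return thread_id % num_cores;
}

// Number of distinct cores that receive at least one thread (compact).
std::size_t cores_used_compact(int num_threads, int threads_per_core) {
    std::set<int> cores;
    for (int t = 0; t < num_threads; ++t) {
        cores.insert(compact_core(t, threads_per_core));
    }
    return cores.size();
}

// Number of distinct cores that receive at least one thread (scatter).
std::size_t cores_used_scatter(int num_threads, int num_cores) {
    std::set<int> cores;
    for (int t = 0; t < num_threads; ++t) {
        cores.insert(scatter_core(t, num_cores));
    }
    return cores.size();
}
```

Under these assumptions, 120 threads occupy 30 cores with compact and all 60 cores with scatter, so scatter gives each thread a larger share of a core. That would fit a small advantage for scatter, but the model says nothing about how large the effect is in practice.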