First real life experiment with the Xeon Phi IV

In this part I will present some results, which I got from using OpenCL on the Xeon Phi. In another blog entry I described some problems when using the Xeon Phi together with OpenCL [1]. This problems were solved now and I will first report, how I reached this. After that I will summarize the performace measurements and their results.

Getting OpenCL running
My problem was not to install OpenCL but to initialize it correctly. So everything I explained in [1] was correct and the installation was complete. First I used an example code from the Internet, to make sure, that the Phi is really registered as OpenCL device [2]. This created me the following output:

platform count: 1
device count: 2
1. Device: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
1.1 Hardware version: OpenCL 1.2 (Build 56860)
1.2 Software version: 1.2
1.3 OpenCL C version: OpenCL C 1.2
1.4 Parallel compute units: 32
2. Device: Intel(R) Many Integrated Core Acceleration Card
2.1 Hardware version: OpenCL 1.2
2.2 Software version: 1.2
2.3 OpenCL C version: OpenCL C 1.2 (Build 56860)
2.4 Parallel compute units: 236

At this point I could be sure, that the Phi is ready to work, but the next question was how to use it. After a while of reading the OpenCL Documentation I got a hint to the device type CL_DEVICE_TYPE_ACCELERATOR and with that I was able to calculate on Phi. The following code shows how I initialize my device (declarations and error handling ommited).

context = clCreateContextFromType(cprops,                                                                                                                                                                          
    CL_DEVICE_TYPE_ACCELERATOR,                                                                                                                                                                                
    NULL,                                                                                                                                                                                                      
    NULL,                                                                                                                                                                                                      
    &status);
 
status = clGetContextInfo(context,                                                                                                                                                                                         
    CL_CONTEXT_DEVICES,                                                                                                                                                      
    0,                                                                                                                                                                       
    NULL,                                                                                                                                                                    
    &deviceListSize);
 
devices = (cl_device_id *)malloc(deviceListSize);
 
commandQueue = clCreateCommandQueue(                                                                                                                                                                                       
    context,                                                                                                                                                                                
    devices[0],                                                                                                                                                                             
    CL_QUEUE_PROFILING_ENABLE,                                                                                                                                                              
    &status);

Now with device[0] I can do the rest of the initialization work.

OpenCL Results
For comparision between the installed Intel Xeon E5-2670 and the Xeon Phi I used a raytracer which operates within Geometric Algebra (GA). It was developed in my Master Thesis [3] and modified and ported to linux for this test. I will show the testscenes and present the results in the following. The profiling was done like in my thesis by using the OpenCL framework’s methods [4]. The resolution for every scene is 1024*1024.

Raptor
raptor
This small dinosaur consists of 100000 triangles. The model was only raycasted so that no reflection rays where used. The Xeon Phi needed 2,98 seconds to render this image, the Xeon E5 only 2,58.

3 Bunnys and an Elephant
3bunnys1ele
This scene consists of 16150 triangles (each bunny 4968, elephant the rest). The Xeon E5 needs 2,47 seconds and the Xeon Phi 2,17. So at least the Phi can outperform the E5.

CowSphere
Cowsphere
This rather small scene consists of only about 6000 triangles, but the calculation is dominated by the high amount of shadow. It was tested in two ways. First with use of bounding spheres to reduce the account of ray-triangle intersection tests (like all scenes until now) and in a second step without bounding volumes. In this case for every pixel (e.g. the corresponding eye-ray to it) a test with every triangle in the scene has to be done. For the first variant the Phi renders the picture within 2,11 sec, while the E5 is ready after 2,96 seconds. For the second, the Phi needs 4,76 sec and the E5 6,46 seconds.

Kittens
Kitten
This last scene is the most complex one. Each kitten is built by 137098 triangles and the bounding spheres are disabled, while the scene is not only raycasted but completely raytraced. So for every of the 1048576 pixels over 250000 triangles must be tested for intersection (in case of a hit this value doubles). The Phi finishes this task after 142 seconds, while the Xeon E5-2670 has the result after 177 seconds.

Summary
It looks like it was in the other three parts of this serie before: Without changing or rewriting existing code it seems impossible to exploit the Xeon Phi’s potential. In contrast to my tries in offload and native mode with C++ code the Phi is able to render faster than the server processor in most of the scenes, but his advance is not that big. More than ever if I take the results on my AMD HD6970 from my Master Thesis into account, the calculation of the scenes on the Xeon Phi is slow.

Sources:
[1] http://www.theismus.de/HPCBlog/?p=81
[2] http://dhruba.name/2012/08/14/opencl-cookbook-listing-all-devices-and-their-critical-attributes/
[
3] http://www.gaalop.de/wp-content/uploads/Masterarbeit-Michael-Burger.pdf
[
4] http://software.intel.com/sites/landingpage/opencl/optimization-guide/Profiling_Operations_Using_OpenCL_Profiling_Events.htm