First real life experiment with the Xeon Phi IV

In this part I will present some results I got from using OpenCL on the Xeon Phi. In an earlier blog entry I described some problems with using the Xeon Phi together with OpenCL [1]. These problems have since been solved, and I will first report how I managed that. After that I will summarize the performance measurements and their results.

Getting OpenCL running
My problem was not installing OpenCL but initializing it correctly, so everything I explained in [1] was accurate and the installation was complete. First I used example code from the Internet to make sure that the Phi is really registered as an OpenCL device [2]. This produced the following output:

platform count: 1
device count: 2
1. Device: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
1.1 Hardware version: OpenCL 1.2 (Build 56860)
1.2 Software version: 1.2
1.3 OpenCL C version: OpenCL C 1.2
1.4 Parallel compute units: 32
2. Device: Intel(R) Many Integrated Core Acceleration Card
2.1 Hardware version: OpenCL 1.2
2.2 Software version: 1.2
2.3 OpenCL C version: OpenCL C 1.2 (Build 56860)
2.4 Parallel compute units: 236

At this point I could be sure that the Phi is ready to work, but the next question was how to use it. After a while of reading the OpenCL documentation I found a hint about the device type CL_DEVICE_TYPE_ACCELERATOR, and with that I was able to run calculations on the Phi. The following code shows how I initialize my device (declarations and error handling omitted).

context = clCreateContextFromType(cprops,
    CL_DEVICE_TYPE_ACCELERATOR,
    NULL,
    NULL,
    &status);

// First, query only the size of the device list...
status = clGetContextInfo(context,
    CL_CONTEXT_DEVICES,
    0,
    NULL,
    &deviceListSize);

devices = (cl_device_id *)malloc(deviceListSize);

// ...then fetch the actual device IDs so that devices[0] is valid.
status = clGetContextInfo(context,
    CL_CONTEXT_DEVICES,
    deviceListSize,
    devices,
    NULL);

commandQueue = clCreateCommandQueue(
    context,
    devices[0],
    CL_QUEUE_PROFILING_ENABLE,
    &status);

Now with devices[0] I can do the rest of the initialization work.

OpenCL Results
For a comparison between the installed Intel Xeon E5-2670 and the Xeon Phi I used a raytracer which operates within Geometric Algebra (GA). It was developed in my Master's thesis [3] and was modified and ported to Linux for this test. In the following I will show the test scenes and present the results. The profiling was done as in my thesis, using the OpenCL framework's event methods [4]. The resolution for every scene is 1024×1024.
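Because the command queue above was created with CL_QUEUE_PROFILING_ENABLE, the kernel time can be read from OpenCL events as described in [4]. A minimal sketch of such a measurement (kernel is a placeholder for the raytracing kernel; this is not the thesis code):

size_t globalSize[2] = {1024, 1024};  // one work-item per pixel
cl_event ev;
status = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL,
                                globalSize, NULL, 0, NULL, &ev);
clWaitForEvents(1, &ev);

cl_ulong start, end;
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &end, NULL);
double seconds = (double)(end - start) * 1e-9;  // timestamps are in ns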

Raptor
[Image: raptor]
This small dinosaur consists of 100000 triangles. The model was only raycasted, so no reflection rays were used. The Xeon Phi needed 2.98 seconds to render this image, the Xeon E5 only 2.58.

3 Bunnies and an Elephant
[Image: 3bunnys1ele]
This scene consists of 16150 triangles (4968 for each bunny, the rest for the elephant). The Xeon E5 needs 2.47 seconds and the Xeon Phi 2.17. So the Phi can at least outperform the E5 here.

CowSphere
[Image: Cowsphere]
This rather small scene consists of only about 6000 triangles, but the calculation is dominated by the large shadowed areas. It was tested in two ways: first with bounding spheres to reduce the number of ray-triangle intersection tests (like all scenes so far), and in a second step without bounding volumes. In that case, for every pixel (i.e. its corresponding eye ray) a test against every triangle in the scene has to be done. For the first variant the Phi renders the picture within 2.11 seconds, while the E5 is done after 2.96 seconds. For the second, the Phi needs 4.76 seconds and the E5 6.46 seconds.
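To illustrate the first variant, here is a hedged sketch of a bounding-sphere pre-test (placeholder types, not the thesis code): an object's triangles are only tested when the eye ray hits the object's bounding sphere first.

typedef struct { double o[3], d[3]; } Ray;     /* origin, direction (d normalized) */
typedef struct { double c[3], r;    } Sphere;  /* centre, radius */

/* Conservative hit test: solves |o + t*d - c|^2 = r^2 and only
   checks whether a real solution exists (the sign of t is ignored). */
int hitsBoundingSphere(const Ray *ray, const Sphere *s)
{
    double b = 0.0, cc = 0.0;
    for (int i = 0; i < 3; ++i) {
        double oc = ray->o[i] - s->c[i];
        b  += oc * ray->d[i];
        cc += oc * oc;
    }
    return b * b - (cc - s->r * s->r) >= 0.0;
}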

Kittens
[Image: Kitten]
This last scene is the most complex one. Each kitten is built from 137098 triangles, the bounding spheres are disabled, and the scene is not only raycasted but fully raytraced. So for every one of the 1048576 pixels, more than 250000 triangles must be tested for intersection (in case of a hit this value doubles). The Phi finishes this task after 142 seconds, while the Xeon E5-2670 delivers the result after 177 seconds.

Summary
It looks like it did in the other three parts of this series: without changing or rewriting existing code it seems impossible to exploit the Xeon Phi's potential. In contrast to my attempts in offload and native mode with C++ code, the Phi is able to render faster than the server processor in most of the scenes, but its lead is not that big. Especially if I take the results on my AMD HD6970 from my Master's thesis into account, the calculation of the scenes on the Xeon Phi is slow.

Sources:
[1] http://www.theismus.de/HPCBlog/?p=81
[2] http://dhruba.name/2012/08/14/opencl-cookbook-listing-all-devices-and-their-critical-attributes/
[3] http://www.gaalop.de/wp-content/uploads/Masterarbeit-Michael-Burger.pdf
[4] http://software.intel.com/sites/landingpage/opencl/optimization-guide/Profiling_Operations_Using_OpenCL_Profiling_Events.htm

First real life experiment with the Xeon Phi III

After reporting my experiences with the native mode of the Phi, I have now taken my first steps in offload mode. I used the same raytracer as in the first two parts of this article series. A few pitfalls revealed themselves during my attempts to get the application running in offload mode.

Changes needed in the raytracer
First of all I had to modify the code so that it compiles with the offload pragmas. There I noticed some difficulties, especially for C++ code.

The first issue is that the Phi has to know about the classes that are used. For the raytracing procedure and the existing code this affects all classes of the project. To calculate the colour of a pixel, the Phi must know the scene and its objects, in this implementation called Shapes. Additionally it needs to know the Color class, which it produces as output. Furthermore, LightRays, Vectors and Points are required. So I had to tell the compiler to offload all of these classes. This is done by surrounding the class definitions with the offload attribute pragma:

#pragma offload_attribute (push,target(mic))
// includes

class Color
{
  public:
  …

  private:
  …
};
#pragma offload_attribute (pop)

These changes had to be applied to all header files, so you need more than just one pragma to offload code to the Phi.

Another problem was the fact that my image array for the calculated picture was declared as img[Height][Width][3]. I wrote some small examples with multi-dimensional arrays and tried to fill them from the Phi, but this resulted in crashes during execution. I don't know whether this was my error or whether the Phi (or the compiler) isn't capable of dealing with such constructs. So I had to change the code to use a 1D array instead. Offloading this array and filling it with test data was no problem.
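For illustration, the index mapping of the flattened array looks like this (the helper function is my own sketch, not the project's code):

/* img[y][x][c] becomes a flat buffer of height * width * 3 bytes */
static inline void set_pixel(unsigned char *img1d, int width,
                             int x, int y, int c, unsigned char value)
{
    img1d[(y * width + x) * 3 + c] = value;  /* row-major, 3 channels */
}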

After this step I took the actual rendering loop and offloaded it to the Phi with:

#pragma offload target (mic) in(argc) \
    out(img1d : length(HEIGHT*WIDTH*3))

The last problem was the biggest one. After offloading the class structure and changing the layout of the output image, the code compiled. Trying to execute it ended in a crash, for which the Phi returned:

offload error: process on the device 0 was terminated by signal 11

Since I don't know yet how to debug the Phi, I located the problem by commenting out code and uncommenting it step by step. The reason for the crashes was quite evident once I thought about it afterwards: the class structure and its functions are copied to the Phi, but not all of their members. Simple ints and doubles are copied automatically, but the list of shapes was empty. I searched the Internet for a long time for an easy way to copy a whole instance of a class to the Phi, but I wasn't successful. A look into one of Intel's own examples destroyed my hope altogether. You can find it in the samples directory of Intel's 2013 version of the Composer.

There you can find an example of offloading a struct to the Phi, commented with:

// The first version of the offload implementation does not support copying
// of classes/structs that are not simply bit-wise copyable
//
// Sometimes it is necessary to get a struct across
//
// This needs to be done by transporting the members individually
// and reconstructing the struct on the other side

So this means for me: I would have to decompose the whole Shapes class and its inheritors into simple arrays or single variables, copy them separately and reassemble them on the Phi. I rejected this approach because of the amount of work. Instead I used a second method: I enlarged the code region within the offload pragma so that it additionally includes the creation of the scene. The scene is thus instantiated by a single Xeon Phi core and placed directly in the Phi's RAM. The results are presented in the next section.

A last problem I had to deal with was writing the resulting image to file. Since the array was a member of the Raytracer class, and this class was now instantiated on the Phi directly, it was not possible to access this data after the offload region. But that is necessary, because the stream must be written to the hard disk of my host system. So I had to allocate the output array first, pass it within the offload pragma as an out-parameter, and internally link it to the member variable of the raytracer inside the offloaded code. Then, after the region, I write the stream to file.
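Put together, the pattern looks roughly like this (Scene, Raytracer and the helper functions are placeholders for the project's real classes; a sketch, not the actual code):

unsigned char *img1d = (unsigned char *)malloc(HEIGHT * WIDTH * 3);

#pragma offload target(mic) out(img1d : length(HEIGHT * WIDTH * 3))
{
    Scene scene;                 // built directly in the Phi's RAM,
    buildScene(scene);           // so no class instance has to be
                                 // bit-wise copied over the bus
    Raytracer rt(scene);
    rt.setImageBuffer(img1d);    // link the out-array to the member
    rt.render();                 // fills img1d on the Phi
}

writeImageToFile(img1d, "out.ppm");  // back on the host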

Results
Because of the code changes, these results are not comparable to the older ones from parts I and II. But again, only the time for the actual rendering loop is measured. I also changed the scene a little bit. The measured times are more than disappointing.

Xeons on Host:
1 Thread: 46.238833 sec
2 Threads: 23.850293 sec
4 Threads: 12.371241 sec
8 Threads: 6.942405 sec
16 Threads: 4.752595 sec
32 Threads: 3.586519 sec

Xeon Phi:
30 Threads: 34.608027 sec
40 Threads: 27.258293 sec
60 Threads: 24.582100 sec
120 Threads: 18.004286 sec
240 Threads: 15.859062 sec

Xeon Phi (native):
30 Threads: 29.427415 sec
40 Threads: 22.920789 sec
60 Threads: 21.599124 sec
120 Threads: 14.557700 sec
240 Threads: 13.837122 sec

The native version is slightly faster than the offloaded one, and both are much slower than the run on the host. The new scene can be seen in the following picture:

[Image: offload]

As a next step I will try to find better solutions for debugging the Phi than commenting and uncommenting. I will test the Eclipse plugin which is shipped with Intel's MPSS package.

Sources:
http://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/index.htm

First real life experiment with the Xeon Phi

After executing and editing some of Intel's tutorials in the /opt/intelcomposer_xe_2013/Samples/en_US/C++/mic_samples/intro_sampleC directory, I did a practical experiment with a raytracer on the Xeon Phi in native execution mode. For that I used a simple, open-source C++ raytracer which I downloaded from [1].

The main problem of this raytracer was the structure of the loop that runs over all pixels of the image, which can normally be parallelized in an easy way. In this case, however, the original implementation created dependencies between single loop iterations and made an OpenMP parallelization impossible. The main reason was that the structure of the for-loops wasn't OpenMP compatible (no initialization, two increments), and the calculation of the local i and j coordinates depended on the previous iteration.
Another problem was the creation of the final output image, which was realized as a stream writing to file in every iteration, so that the correctness of the picture depended on a systematic run through the pixels.
After these issues were resolved, a simple OpenMP parallelization was done over the outer pixel loop (over the height of the image).
The resulting new RayTracer.cpp can be downloaded by clicking on it.
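In essence, the restructured loop now has the canonical form OpenMP expects, derives the pixel coordinates from the loop counters, and writes into a buffer instead of streaming to file. A minimal sketch (render_pixel stands in for the per-pixel raytracing; this is not the actual implementation):

void render_pixel(unsigned char *img, int x, int y);  // placeholder

void render(unsigned char *img, int width, int height)
{
    // Outer loop over the image rows; iterations are now independent.
    #pragma omp parallel for
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            render_pixel(img, x, y);
}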

Another change was made in the main routine. Since the original scene was too simple to test scalability, more objects were added to it. Now a loop creates 300 spheres, which are slightly displaced to form a pipe-like structure. The new image looks like this:

[Image: Raytracer]

The modified main.cpp can also be downloaded here.

This example was compiled with icc and the -O3, -ipo and -xHost flags and benchmarked on one cluster node with two Intel Xeon E5-2670 (2.6 GHz) processors based on Sandy Bridge. Every CPU has eight physical cores plus Hyper-Threading, so cat /proc/cpuinfo lists 32 processors.
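For reference, the host build command probably looked roughly like this (the -openmp flag is my assumption, since the code uses OpenMP; a native Phi build would use -mmic instead of -xHost):

icc -O3 -ipo -xHost -openmp main.cpp RayTracer.cpp -o raytracer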

For time measurement the omp_get_wtime function is used. I only take into account the time which elapses between the start and the end of the pixel loop, so the serial part of the application is neglected. The results are listed below, followed by a sketch of the measurement:

  • 1 thread: 50.147235 sec
  • 2 threads: 25.685468 sec
  • 4 threads: 14.723042 sec
  • 8 threads: 9.780591 sec
  • 16 threads: 7.495562 sec
  • 32 threads: 5.663061 sec
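
The measurement itself is a simple wall-clock bracket around the parallel loop (a fragment assuming the render sketch from above plus <omp.h> and <stdio.h>):

double t0 = omp_get_wtime();
render(img, WIDTH, HEIGHT);       // only the parallel pixel loop
double t1 = omp_get_wtime();
printf("%f sec\n", t1 - t0);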

So in the beginning there is very good scaling of the raytracer, as one would expect from theory, but the scaling between 8 and 16 threads is rather poor. That the difference between 16 and 32 threads isn't very big becomes clear when you take into account that 16 of the threads are only running via Hyper-Threading and not on additional physical cores.

The results on the Phi are the following:

  • 1 Thread: 800.992084 sec
  • 2 Threads: 419.042969 sec
  • 4 Threads: 230.156072 sec
  • 8 Threads: 148.853595 sec
  • 15 Threads: 81.311708 sec
  • 30 Threads: 50.757255 sec
  • 60 Threads: 32.278490 sec
  • 120 Threads: 22.553419 sec
  • 240 Threads: 15.653570 sec

As one can see, the single-core performance of the Phi is very poor, which is expected given the architecture of the card's less complex cores, clocked at 1 GHz. But I didn't expect the gap to be as huge as it is. The scaling in the very beginning is very good, but it degrades quickly with an increasing number of threads. So in total, the Xeon Phi in this setup isn't able to reach the performance of the two Xeon server processors.

One curious thing turns up when you look at the average core utilization of the Phi via Intel's monitoring tool. From the start of the raytracer all cores are used, but during the calculation the average usage decreases over time. You can see this in the following screenshot:

[Image: PerCoreUtil]

This behaviour is unexpected because, following the theory of raytracing, all threads can take part in the calculation of the final image until it is rendered completely. The next step will be to investigate where this behaviour comes from and, in general, to address the relatively low raytracing performance of the Phi.

Sources:
[1] http://sourceforge.net/projects/simpleraytracer/