Performance decrease since pre-production samples?

I have now finished reading the book Intel Xeon Phi Coprocessor – High-Performance Programming by Jim Jeffers and James Reinders. I hoped to find the reason why I am not able to reach good performance results on the Xeon Phi, but I am still a bit confused about that. So I decided to test the two example programs from the book and to check whether the given results are comparable to the performance on our Xeon Phi. Since the source code and the output are printed completely in the book, the calculation times should be nearly the same, because our model of the Phi also has 61 cores, like the pre-production sample in the book. So let's have a look at the two programs.

9-point stencil algorithm
This small program applies a blur filter to a given image represented as a 2D array. The influence of all 8 neighboring points on a center point is taken into account, so for each point a weighted sum of 9 addends must be calculated. Since there are two image buffers which are swapped at the end of each iteration, every pixel can be calculated independently of the others. That's why a simple parallelization can be realized by:

#pragma omp parallel for private(x)
for(y=1; y < HEIGHT-1; y++) {
    for(x=1; x < WIDTH-1; x++) {
        ....
    }
}

To help the compiler with the vectorization it is only necessary to add a #pragma ivdep, which tells it to ignore assumed vector dependencies. With that, the compiler vectorizes the inner loop:

#pragma omp parallel for private(x)
for(y=1; y < HEIGHT-1; y++) {
    #pragma ivdep
    for(x=1; x < WIDTH-1; x++) {
        ....
    }
}
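
The elided loop body is essentially one weighted sum over the 3x3 neighbourhood of each pixel. As a rough sketch of what it could look like (my own formulation, not the book's exact code; it assumes flat image buffers img_in/img_out and a weight array w[9]):

img_out[y*WIDTH + x] =
      w[0]*img_in[(y-1)*WIDTH + x-1] + w[1]*img_in[(y-1)*WIDTH + x] + w[2]*img_in[(y-1)*WIDTH + x+1]
    + w[3]*img_in[ y   *WIDTH + x-1] + w[4]*img_in[ y   *WIDTH + x] + w[5]*img_in[ y   *WIDTH + x+1]
    + w[6]*img_in[(y+1)*WIDTH + x-1] + w[7]*img_in[(y+1)*WIDTH + x] + w[8]*img_in[(y+1)*WIDTH + x+1];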

After these code changes the authors reach the following execution times on the Xeon Phi:

  • 122 threads:  8.772 s
  • 244 threads: 12.696 s

The program compiled here with the same flags and the same setup for our Phi (scatter scheduling) leads to:

  • 122 threads: 12.664 s
  • 244 threads: 19.998 s
  • (240 threads: 17.181 s)

So in the case of 122 threads our Phi needs 44% more time to finish its work; in the case of 244 threads the increase is even 57%! The peculiar behaviour when using the maximum number of threads will be investigated below. But even with 240 threads our Phi is much slower than the reference in the book (35% difference).

Diffusion
Here a program is examined which simulates the diffusion of a solute through a volume of liquid over time. This happens in 3D space. The calculation is very similar to the image filter example from above, with the main difference that a 3D array is used now. Here you take six neighboring grid cells into account (above, below, in front, behind, left and right), so for every entry you have a weighted sum with seven addends. After optimizing for scaling and vectorization, the code looks like:

#pragma omp parallel
{
    ....
    #pragma omp for collapse(2)
    for(z=0; z < nz; z++) {
        for(y=0; y < ny; y++) {
            #pragma ivdep
            for(x=0; x < nx; x++) {
                ....
            }
        }
    }
}
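
The elided inner body is again one weighted sum per grid cell, this time over the cell itself and its six neighbours. A sketch for an interior point (my own formulation with flat arrays f1/f2 and coefficients cc, ce, cw, cn, cs, ct, cb; the boundary handling of the real code is left out):

f2[z*nx*ny + y*nx + x] =
      cc * f1[z*nx*ny + y*nx + x]
    + cw * f1[z*nx*ny + y*nx + x-1]   + ce * f1[z*nx*ny + y*nx + x+1]
    + cs * f1[z*nx*ny + (y-1)*nx + x] + cn * f1[z*nx*ny + (y+1)*nx + x]
    + cb * f1[(z-1)*nx*ny + y*nx + x] + ct * f1[(z+1)*nx*ny + y*nx + x];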

The results in the book are:

  • 122 threads: 25.369 s
  • 244 threads: 18.664 s

With our Phi I am able to achieve the following times:

  • 122 threads: 22.661 s
  • 244 threads: 29.849 s
  • (240 threads: 20.419 s)

For me it was very strange to notice that the execution times, especially for 240 threads, show a big variability. The fastest of 10 runs finished after 20.419 s and the slowest one needed 31.580 s, although I was the only user on the Phi. In contrast, for 122 threads the fastest execution finished after 22.661 s, the slowest one after 23.796 s. For 244 threads the behaviour of the Phi is again completely different from the result in the book. And if one looks at the output of the Phi's monitoring software, one can see the reason for it:

[Screenshots of the Phi's monitoring software: average core utilization with 240 threads and with 244 threads]

So the average core utilization decreases dramatically if you follow the book's recommendation to use all available cores in native mode and all cores minus one when running in offload mode. Perhaps a change in the software leads to this behaviour? I also measured the fastest execution time on the two server processors of the host system (which is not done in the book). With 16 threads they needed 30.900 seconds and so they took “only” 50% more time than the Xeon Phi, and this in an application which should be able to use all the compute power the Xeon Phi offers.

Summary
Strange. That’s all I have in mind when I think about this situation. I am using the same code as the book, the same compiler flags and a Phi product with the same feature set as the pre-production sample in the book. I’m running the code in native mode, so that the driver, the MPSS and so on can’t have an impact on the performance. The only things I can see that differ from the book are the Linux version on the Phi (the latest available one) and the newer versions of the Intel compiler and the OpenMP library. But can this cause such big performance differences?

First real-life experiment with the Xeon Phi V

“Scale and Vectorize”. These are the two prerequisites I have read about several times by now when searching for ways to make a program run faster on the Xeon Phi. The principle of scaling is clear to me, and the raytracing applications which I am testing on the Xeon Phi do scale well on our server processors and also on the Xeon Phi, as I will show in this post. But the overall performance on the Xeon Phi is still worse than on the server CPUs.

As another approach to get a fast Xeon Phi application I tried to offload the Geometric Algebra raytracer of my master thesis [2] to the Xeon Phi. The offloaded code is not based on the C++ version of my software but on the OpenCL code which I tested on AMD and NVidia cards during the programming phase of my master thesis. This code uses only scalars and arrays and so it overcomes the limitation of the mic compiler that it is not able to offload objects which can't be copied by a simple memcpy [1]. In fact this code runs much faster on the CPU and the Xeon Phi than the original C++ version, but it is clearly much harder to understand and to maintain. As an example for the performance I present, after the short offload sketch below, the following scene:
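
This is roughly the kind of offload the flat-array code boils down to (my own illustration, not the actual raytracer code; render_pixels and the array layout are made up). Because only plain arrays of scalars cross the host/coprocessor boundary, a simple in/out clause with a length is sufficient:

/* Code that runs on the coprocessor must be marked for the mic target. */
__attribute__((target(mic)))
void render_pixels(float *tris, int num_tris, float *pixels, int num_pixels);

void render_on_phi(float *tris, int num_tris, float *pixels, int num_pixels)
{
    /* tris: 3 vertices x 3 coordinates per triangle; pixels: one float per pixel */
    #pragma offload target(mic) in(tris : length(9 * num_tris)) out(pixels : length(num_pixels))
    {
        render_pixels(tris, num_tris, pixels, num_pixels);
    }
}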

bunny

It consists of 6500 triangles, uses shadow rays and has recursion level 2. The resulting rendering times are:

  • Xeon (C/OpenCL-based version): 0.40 sec
  • Xeon (C++ version): 1.05 sec
  • Xeon Phi (C/OpenCL-based version): 1.05 sec

So the Xeon Phi again isn't able to reach the performance of the two server processors. But why? First I did a scaling analysis with the following result:

scaling

So with 120 threads the Xeon Phi performs about 80 times faster than single-threaded. I think that is a pretty good value and not the reason why the Xeon CPU is so much faster. That's why I tested the second prerequisite, "vectorization", in order to find the performance issue. And there it is: it doesn't have a measurable influence on the runtime whether I compile with vectorization or with the -no-vec flag to suppress the generation of vector code. The SIMD units of the Phi cores seem to be completely unused by my program. I have tried a few things so far but wasn't able to achieve anything in this direction.

I also did a few tests on the thread affinity. scatter performs a few percent faster than compact, but these small differences can't be the explanation for the Xeon Phi's bad rendering times.

Sources:
[1] http://www.theismus.de/HPCBlog/?p=59
[2] http://www.gaalop.de/wp-content/uploads/Masterarbeit-Michael-Burger.pdf

First real-life experiment with the Xeon Phi IV

In this part I will present some results which I got from using OpenCL on the Xeon Phi. In another blog entry I described some problems when using the Xeon Phi together with OpenCL [1]. These problems have been solved now, and I will first report how I achieved this. After that I will summarize the performance measurements and their results.

Getting OpenCL running
My problem was not installing OpenCL but initializing it correctly, so everything I explained in [1] was correct and the installation was complete. First I used example code from the Internet to make sure that the Phi is really registered as an OpenCL device [2]. It produced the following output:

platform count: 1
device count: 2
1. Device: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
1.1 Hardware version: OpenCL 1.2 (Build 56860)
1.2 Software version: 1.2
1.3 OpenCL C version: OpenCL C 1.2
1.4 Parallel compute units: 32
2. Device: Intel(R) Many Integrated Core Acceleration Card
2.1 Hardware version: OpenCL 1.2
2.2 Software version: 1.2
2.3 OpenCL C version: OpenCL C 1.2 (Build 56860)
2.4 Parallel compute units: 236
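
The enumeration code itself is not reproduced here, but a minimal loop in the spirit of [2] could look like this (error handling omitted, the variable names are my own):

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_uint platform_count, device_count;
    cl_platform_id platform;
    cl_device_id devices[16];

    clGetPlatformIDs(0, NULL, &platform_count);
    clGetPlatformIDs(1, &platform, NULL);
    printf("platform count: %u\n", platform_count);

    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 16, devices, &device_count);
    printf("device count: %u\n", device_count);

    for (cl_uint i = 0; i < device_count; i++) {
        char name[256], version[64];
        cl_uint units;
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        clGetDeviceInfo(devices[i], CL_DEVICE_VERSION, sizeof(version), version, NULL);
        clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(units), &units, NULL);
        printf("%u. Device: %s\n", i + 1, name);
        printf("%u.1 Hardware version: %s\n", i + 1, version);
        printf("%u.4 Parallel compute units: %u\n", i + 1, units);
    }
    return 0;
}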

At this point I could be sure that the Phi is ready to work, but the next question was how to use it. After a while of reading the OpenCL documentation I got a hint to the device type CL_DEVICE_TYPE_ACCELERATOR, and with that I was able to calculate on the Phi. The following code shows how I initialize my device (declarations and error handling omitted):

context = clCreateContextFromType(cprops,
    CL_DEVICE_TYPE_ACCELERATOR,
    NULL,
    NULL,
    &status);

/* first call only queries the size of the device list */
status = clGetContextInfo(context,
    CL_CONTEXT_DEVICES,
    0,
    NULL,
    &deviceListSize);

devices = (cl_device_id *)malloc(deviceListSize);

/* second call fills the allocated buffer with the device handles */
status = clGetContextInfo(context,
    CL_CONTEXT_DEVICES,
    deviceListSize,
    devices,
    NULL);

commandQueue = clCreateCommandQueue(
    context,
    devices[0],
    CL_QUEUE_PROFILING_ENABLE,
    &status);

Now with devices[0] I can do the rest of the initialization work.
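
The remaining steps are the usual OpenCL host boilerplate. Roughly, and with placeholder names (kernelSource, width, height and the kernel name "trace" are only illustrative, not my actual code):

const char *kernelSource = "...";   /* OpenCL C source of the raytracing kernel */
size_t width = 1024, height = 1024;

cl_program program = clCreateProgramWithSource(context, 1,
    (const char **)&kernelSource, NULL, &status);
status = clBuildProgram(program, 1, &devices[0], "", NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "trace", &status);

/* one output buffer for the rendered image, passed as kernel argument 0 */
cl_mem outBuf = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
    width * height * sizeof(cl_float4), NULL, &status);
status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &outBuf);

/* launch one work item per pixel and wait for the result */
size_t globalSize[2] = { width, height };
cl_event event;
status = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL,
    globalSize, NULL, 0, NULL, &event);
clFinish(commandQueue);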

OpenCL Results
For the comparison between the installed Intel Xeon E5-2670 and the Xeon Phi I used a raytracer which operates within Geometric Algebra (GA). It was developed in my master thesis [3] and modified and ported to Linux for this test. I will show the test scenes and present the results in the following. The profiling was done as in my thesis by using the OpenCL framework's profiling methods [4]. The resolution for every scene is 1024x1024.
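
The timing itself uses the kind of event-based profiling described in [4]: since the command queue was created with CL_QUEUE_PROFILING_ENABLE, the start and end timestamps of a kernel can be read from its event. A minimal sketch, assuming a cl_event returned by clEnqueueNDRangeKernel:

cl_ulong t_start, t_end;
clWaitForEvents(1, &event);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &t_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &t_end, NULL);
double seconds = (double)(t_end - t_start) * 1.0e-9; /* timestamps are in nanoseconds */
printf("kernel time: %.3f s\n", seconds);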

Raptor
raptor
This small dinosaur consists of 100000 triangles. The model was only raycasted, so no reflection rays were used. The Xeon Phi needed 2.98 seconds to render this image, the Xeon E5 only 2.58.

3 Bunnies and an Elephant
3bunnys1ele
This scene consists of 16150 triangles (each bunny 4968, the elephant the rest). The Xeon E5 needs 2.47 seconds and the Xeon Phi 2.17. So at least here the Phi can outperform the E5.

CowSphere
Cowsphere
This rather small scene consists of only about 6000 triangles, but the calculation is dominated by the large amount of shadow. It was tested in two ways: first with bounding spheres to reduce the number of ray-triangle intersection tests (as in all scenes so far), and in a second step without bounding volumes. In that case, for every pixel (i.e. the eye ray corresponding to it) a test against every triangle in the scene has to be done. For the first variant the Phi renders the picture within 2.11 seconds, while the E5 is ready after 2.96 seconds. For the second, the Phi needs 4.76 seconds and the E5 6.46 seconds.
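
The bounding-sphere test is a cheap pre-test: a ray only has to be checked against an object's triangles if it hits the object's bounding sphere at all. A sketch of such a test (my own illustration, not the thesis code; it assumes a normalized ray direction and treats the ray as an infinite line):

int ray_hits_sphere(const float orig[3], const float dir[3],
                    const float center[3], float radius)
{
    float oc[3] = { center[0] - orig[0], center[1] - orig[1], center[2] - orig[2] };
    float t  = oc[0]*dir[0] + oc[1]*dir[1] + oc[2]*dir[2];     /* projection of oc onto the ray */
    float d2 = oc[0]*oc[0] + oc[1]*oc[1] + oc[2]*oc[2] - t*t;  /* squared distance from center to the ray */
    return d2 <= radius * radius;
}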

Kittens
Kitten
This last scene is the most complex one. Each kitten is built from 137098 triangles and the bounding spheres are disabled, while the scene is not only raycasted but completely raytraced. So for each of the 1048576 pixels more than 250000 triangles must be tested for intersection (in case of a hit this value doubles). The Phi finishes this task after 142 seconds, while the Xeon E5-2670 has the result after 177 seconds.

Summary
It looks like it did in the other three parts of this series before: without changing or rewriting existing code it seems impossible to exploit the Xeon Phi's potential. In contrast to my attempts in offload and native mode with C++ code, the Phi is able to render faster than the server processors in most of the scenes, but its lead is not that big. Especially if I take the results on my AMD HD6970 from my master thesis into account, the calculation of the scenes on the Xeon Phi is slow.

Sources:
[1] http://www.theismus.de/HPCBlog/?p=81
[2] http://dhruba.name/2012/08/14/opencl-cookbook-listing-all-devices-and-their-critical-attributes/
[3] http://www.gaalop.de/wp-content/uploads/Masterarbeit-Michael-Burger.pdf
[4] http://software.intel.com/sites/landingpage/opencl/optimization-guide/Profiling_Operations_Using_OpenCL_Profiling_Events.htm