Performance decrease since pre-production samples?

I have now finished reading the book Intel Xeon Phi Coprocessor – High-Performance Programming by Jim Jeffers and James Reinders. I had hoped to find the reason why I am not able to reach good performance results on the Xeon Phi, but I am still a bit confused about that. So I decided to run the two example programs from the book and to check whether the given results are comparable to the performance of our Xeon Phi. Since the source code and the output are printed completely in the book, the calculation times should be nearly the same, because our model of the Phi has 61 cores just like the pre-production sample in the book. So let's have a look at the two programs.

9-point stencil algorithm
This small program applies a blur filter to a given image represented as a 2D array. The influence of all 8 neighboring points on a center point is taken into account, so for each point a weighted sum of 9 addends must be calculated. Since there are two image buffers which are swapped at the end of each iteration, every pixel can be calculated independently of the others. That's why a simple parallelization can be realized by:

#pragma omp parallel for private(x)
for(y=1; y < HEIGHT-1; y++) {
    for(x=1; x < WIDTH-1; x++) {
        ....
    }
}

To help the compiler with the vectorization it is only necessary to add a #pragma ivdep, which tells the compiler to ignore assumed vector dependencies. The compiler then vectorizes the inner loop.

#pragma omp parallel for private(x)
for(y=1; y < HEIGHT-1; y++) {
    #pragma ivdep
    for(x=1; x < WIDTH-1; x++) {
        ....
    }
}
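
For illustration, the elided loop body could look roughly like this. This is only a sketch; the array and weight names (img_in, img_out, w_center, w_side, w_diag) are my assumptions, not the book's code:

/* 9-point stencil body: each output pixel is a weighted sum of the
   input pixel and its 8 neighbors. img_in and img_out are the two
   buffers that get swapped after every iteration. */
img_out[y][x] =
    w_center * img_in[y][x] +
    w_side   * (img_in[y][x-1] + img_in[y][x+1] +
                img_in[y-1][x] + img_in[y+1][x]) +
    w_diag   * (img_in[y-1][x-1] + img_in[y-1][x+1] +
                img_in[y+1][x-1] + img_in[y+1][x+1]);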

With these code changes the authors reach the following execution times on their Xeon Phi:

  • 122 threads:  8.772s
  • 244 threads: 12.696s

The program compiled here with the same flags and the same setup on our Phi (scatter affinity) leads to:

  • 122 threads: 12.664s
  • 244 threads: 19.998s
  • (240 threads: 17.181s)

So in the case of 122 threads our Phi needs 44% more time to finish its work; in the case of 244 threads the increase is even 57%! The peculiar behaviour when using the maximal number of threads will be investigated below. But even with 240 threads our Phi is much slower than the reference in the book (35% difference).

Diffusion
Here a program is examined which simulates the diffusion of a solute through a volume of liquid over time. This happens in 3D space. The calculation is very similar to the image filter example from above, with the main difference that a 3D array is used now. Here you take the six neighboring grid cells into account (above, below, in front, behind, left and right), so you have for every entry a weighted sum with seven addends. After optimizing for scaling and vectorization the code looks like:

#pragma omp parallel
{
....
    #pragma omp for collapse(2)
    for (z = 0; z < nz; z++) {
        for (y = 0; y < ny; y++) {
            #pragma ivdep
            for (x = 0; x < nx; x++) {
            ....
            }
        }
    }
}
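
The elided body could, for example, be a seven-point update like the following. Again a sketch with assumed names (f1, f2 for the two buffers, cc/cw/ce/cs/cn/cb/ct for the weights); boundary handling is omitted:

/* Inner-loop body: a weighted sum over the cell and its six neighbors,
   reading from f1 and writing to f2; the two buffers are swapped after
   every time step. Cells on the boundary need index clamping. */
f2[z][y][x] = cc * f1[z][y][x]
            + cw * f1[z][y][x-1] + ce * f1[z][y][x+1]
            + cs * f1[z][y-1][x] + cn * f1[z][y+1][x]
            + cb * f1[z-1][y][x] + ct * f1[z+1][y][x];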

The results in the book are:

  • 122 threads: 25.369s
  • 244 threads: 18.664s

With our Phi I am able to achieve the following times:

  • 122 threads: 22.661s
  • 244 threads: 29.849s
  • (240 threads: 20.419s)

For me it was very strange to notice that the execution times, especially for 240 threads, show big variability. The fastest of 10 runs finished after 20.419s and the slowest one needed 31.580s, although I was the only user on the Phi. In contrast, for 122 threads the fastest execution finished after 22.661s and the slowest one after 23.796s. For 244 threads the behaviour of the Phi is again completely different from the result in the book. And if one looks at the output of the Phi's monitoring software, one can see the reason for it:

[Screenshots of the monitoring software: average core utilization with 240 threads and with 244 threads]

So the average core utilization decreases dramatically if you follow the recommendation of the book to use all available cores in native mode (and all cores minus one when running in offload mode). Perhaps a change in the software leads to this behaviour? I also measured the fastest execution time on the two server processors on the board (which is not done in the book). With 16 threads they needed 30.900 seconds, so they took “only” 50% more time than the Xeon Phi. And this in an application which should be capable of using all the compute power the Xeon Phi offers.

Summary
Strange. That’s all I have in mind when I think about this situation. I am using the same code as the book, the same compiler flags and a production Phi with the same feature set as the pre-production sample in the book. I’m running the code in native mode, so the driver, the MPSS and so on can’t have an impact on the performance. The only things I can see that differ from the book are the Linux version on the Phi (the latest available one) and the newer versions of the Intel compiler and the OpenMP library. But can this cause such big performance differences?

First real-life experiment with the Xeon Phi

After executing and editing some of Intel’s tutorials in the /opt/intel/composer_xe_2013/Samples/en_US/C++/mic_samples/intro_sampleC directory, I did a practical experiment with a raytracer on the Xeon Phi in native execution mode. For that I used a simple, open-source C++ raytracer which I downloaded from [1].

The main problem with this raytracer was the structure of the loop that runs over all pixels of the image, which can normally be parallelized in an easy way. In this case, however, the original implementation created dependencies between single loop iterations and made an OpenMP parallelization impossible. The main reason was that the structure of the for-loops wasn’t OpenMP compatible (no initialization, two increments) and that the calculation of the i and j local coordinate parameters depended on the previous iteration.
Another problem was the creation of the final output image, which was realized as a stream writing to the file in every iteration, so that the correctness of the picture depended on a systematic run through the pixels.
After these issues were resolved, a simple OpenMP parallelization was done over the outer pixel loop (over the height of the image), as sketched below.
The resulting new RayTracer.cpp can be downloaded by clicking on it.
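
For illustration, the restructured loop could look roughly like this. It is only a sketch with assumed names (Color, TracePixel, WriteImage); the actual code is in the downloadable file:

#include <vector>

// Both loops are now in OpenMP's canonical form, the local coordinates
// are computed from the loop indices instead of being carried over from
// the previous iteration, and the pixels go into a buffer that is
// written to the file once at the end, in a serial step.
std::vector<Color> image(width * height);

#pragma omp parallel for
for (int j = 0; j < height; j++) {           // rows are independent now
    for (int i = 0; i < width; i++) {
        double u = (i + 0.5) / width;        // derived from the indices
        double v = (j + 0.5) / height;
        image[j * width + i] = TracePixel(u, v);
    }
}

WriteImage("out.bmp", image, width, height); // serial output pass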

Another change was made in the main routine. Since the original scene was too simple to test scalability, more objects were added to it: a loop now creates 300 spheres which are slightly displaced to form a pipe-like structure. The new image looks like this:

[Image: the rendered scene with 300 spheres]

The modified main.cpp can also be downloaded here.
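
The added loop could look roughly like this. Again a sketch: the Sphere and Scene interfaces and the displacement formula are my assumptions, the actual code is in the downloadable main.cpp:

#include <cmath>

// 300 spheres placed on a circle and slightly displaced along the
// depth axis, so that together they form a pipe-like structure.
const int kNumSpheres = 300;
for (int n = 0; n < kNumSpheres; n++) {
    double angle = 2.0 * M_PI * n / kNumSpheres;
    Vector3 center(10.0 * std::cos(angle),
                   10.0 * std::sin(angle),
                   0.1 * n);                 // small shift per sphere
    scene.AddSphere(Sphere(center, 1.0));
}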

This example was then compiled with icc and the -O3, -ipo and -xHost flags and benchmarked on one cluster node with two Intel Xeon E5-2670 (2.6 GHz) processors based on Sandy Bridge. Each CPU has eight physical cores plus Hyper-Threading, so cat /proc/cpuinfo lists 32 processors.

For time measurement the omp_get_wtime function is used. Only the time that elapses between the start and the end of the pixel loop is taken into account, so the serial part of the application is neglected. The measurement pattern is sketched below.
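
A minimal sketch of that pattern (a fragment with assumed variable names, not the original code):

#include <omp.h>
#include <cstdio>

// Only the parallel pixel loop is timed; the serial setup before it
// and the file output after it are excluded from the measurement.
double t_start = omp_get_wtime();
#pragma omp parallel for
for (int j = 0; j < height; j++) {
    // ... trace all pixels of row j ...
}
double t_end = omp_get_wtime();
std::printf("pixel loop: %f sec\n", t_end - t_start);

The results are: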

  • 1 thread: 50.147235 sec
  • 2 threads: 25.685468 sec
  • 4 threads: 14.723042 sec
  • 8 threads: 9.780591 sec
  • 16 threads: 7.495562 sec
  • 32 threads: 5.663061 sec

So in the beginning the raytracer scales very well, as one would expect from theory (a speedup of 1.95 with 2 threads and 3.4 with 4 threads). But the scaling between 8 and 16 threads is rather poor. That the difference between 16 and 32 threads isn't very large becomes clear when you take into account that 16 of the 32 threads are only running via Hyper-Threading and not on their own physical cores.

The results on the Phi are the following ones:

  • 1 thread: 800.992084 sec
  • 2 threads: 419.042969 sec
  • 4 threads: 230.156072 sec
  • 8 threads: 148.853595 sec
  • 15 threads: 81.311708 sec
  • 30 threads: 50.757255 sec
  • 60 threads: 32.278490 sec
  • 120 threads: 22.553419 sec
  • 240 threads: 15.653570 sec

As one can see, the single-core performance of the Phi is very poor, which is expected given the architecture of the less complex, 1 GHz clocked cores of the card. But I didn't expect the gap to be as huge as it is (a factor of 16 compared to one Sandy Bridge core). The scaling in the very beginning is very good, but decreases quickly with an increasing number of threads. So in total the Xeon Phi in this setup isn't able to reach the performance of the two Xeon server processors.

One curious thing turns out when you look at the average core utilization of the Phi via Intel's monitoring tool. From the start of the raytracer all cores are used, but during the calculation the average usage decreases over time. You can see this in the following screenshot:

[Screenshot: average per-core utilization decreasing over time]

This behaviour is unexpected, because following the theory of raytracing, all threads can take part in the calculation of the final image until it is rendered completely. The next step would be to investigate where this behaviour comes from and, in general, to solve the issue of the relatively low raytracing performance of the Phi.
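
One plausible (but so far unverified) explanation would be load imbalance: rows that hit many spheres take much longer than background rows, so with a static row distribution more and more threads run out of work towards the end of the render. A first experiment to test this could be dynamic scheduling:

// Hand out rows in small chunks to whichever thread is free, instead
// of assigning a fixed block of rows to each thread up front.
#pragma omp parallel for schedule(dynamic, 4)
for (int j = 0; j < height; j++) {
    // ... trace all pixels of row j ...
}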

Sources:
[1] http://sourceforge.net/projects/simpleraytracer/