Performance decrease since pre-production samples?

I have now finished reading the book Intel Xeon Phi Coprocessor – High-Performance Programming by Jim Jeffers and James Reinders. I had hoped to find the reason why I am not able to reach good performance results on the Xeon Phi, but I am still a bit confused about that. So I decided to run the two example programs from the book and check whether the published results are comparable to the performance of our Xeon Phi. Since the source code and the output are printed completely in the book, the calculation times should be nearly the same, because our model of the Phi has 61 cores just like the pre-production sample in the book. So let’s have a look at the two programs.

9-point stencil algorithm
This small program applies a blur filter to a given image represented as a 2D array. The influence of all 8 neighboring points on a center point is taken into account, so for each point a weighted sum of 9 addends must be calculated. Since there are two image buffers which are swapped at the end of each iteration, every pixel can be calculated independently of the others. That’s why a simple parallelization can be realized by:

#pragma omp parallel for private(x)
for(y=1; y < HEIGHT-1; y++) {
    for(x=1; x < WIDTH-1; x++) {
        ....
    }
}

To help the compiler with the vectorization it is only necessary to add a #pragma ivdep, so that the compiler vectorizes the inner loop:

#pragma omp parallel for private(x)
for(y=1; y < HEIGHT-1; y++) {
    #pragma ivdep
    for(x=1; x < WIDTH-1; x++) {
        ....
    }
}
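
For orientation, here is a minimal sketch of what the elided loop body computes; the buffer names, the weight variables and the image dimensions are my own assumptions for illustration, not the exact listing from the book.

#define WIDTH  5900   /* assumed image size, not the value used in the book */
#define HEIGHT 5900

/* One blur iteration: every interior pixel becomes a weighted sum of itself
 * and its 8 neighbors; the caller swaps fin and fout after each iteration. */
void stencil_9pt(const float *fin, float *fout,
                 float wc, float we, float wd)
{
    int x, y;
    #pragma omp parallel for private(x)
    for (y = 1; y < HEIGHT - 1; y++) {
        #pragma ivdep
        for (x = 1; x < WIDTH - 1; x++) {
            fout[y * WIDTH + x] =
                wd * fin[(y - 1) * WIDTH + (x - 1)]   /* diagonal neighbors */
              + wd * fin[(y - 1) * WIDTH + (x + 1)]
              + wd * fin[(y + 1) * WIDTH + (x - 1)]
              + wd * fin[(y + 1) * WIDTH + (x + 1)]
              + we * fin[(y - 1) * WIDTH +  x     ]   /* edge neighbors */
              + we * fin[(y + 1) * WIDTH +  x     ]
              + we * fin[ y      * WIDTH + (x - 1)]
              + we * fin[ y      * WIDTH + (x + 1)]
              + wc * fin[ y      * WIDTH +  x     ];  /* center point */
        }
    }
}

The key property is the one described above: the output buffer is never read within the same sweep, so all pixels are independent and the inner loop has no loop-carried dependency, which is exactly what #pragma ivdep asserts to the compiler.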

With these code changes the authors reach the following execution times on the Xeon Phi:

  • 122 threads: 8,772s
  • 244 threads: 12,696s

The program compiled here with the same flags and the same setup on our Phi (scatter affinity) leads to:

  • 122 threads: 12,664s
  • 244 threads: 19,998s
  • (240 threads: 17,181s)

So in the case of 122 threads our Phi needs 44% more time to finish its work. In the case of 244 threads the increase is even 57%! The special behaviour when using the maximum number of threads will be investigated below. But even with 240 threads the run is much slower than the reference in the book (a 35% difference).

Diffusion
Here a program is examined which simulates the diffusion of a solute through a volume of liquid over time. This happens in 3D space. The calculation is very similar to the image filter example from above, with the main difference that a 3D array is used now. Here you take six neighboring grid cells into account (above, below, in front, behind, left and right), so for every entry you have a weighted sum with seven addends. After optimizing for scaling and vectorization the code looks like this:

#pragma omp parallel
{
    ....
    #pragma omp for collapse(2)
    for(z=0; z < nz; z++) {
        for(y=0; y < ny; y++) {
            #pragma ivdep
            for(x=0; x < nx; x++) {
                ....
            }
        }
    }
}
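
To make the structure a bit more concrete, here is a rough sketch of what the elided inner body of such a 7-point diffusion update can look like; the array and coefficient names and the simple boundary clamping are my own assumptions for illustration, not the exact listing from the book.

/* One time step of the 7-point diffusion stencil (sketch).
 * f1 is the input field, f2 the output field; cc is the center coefficient
 * and ce/cw/cn/cs/ct/cb weight the six neighboring cells. Boundary cells
 * simply reuse the center value, which is one common simplification. */
void diffusion_step(const float *f1, float *f2,
                    int nx, int ny, int nz,
                    float cc, float ce, float cw,
                    float cn, float cs, float ct, float cb)
{
    #pragma omp parallel
    {
        int x, y, z;
        #pragma omp for collapse(2)
        for (z = 0; z < nz; z++) {
            for (y = 0; y < ny; y++) {
                #pragma ivdep
                for (x = 0; x < nx; x++) {
                    int c = x + y * nx + z * nx * ny;        /* center cell */
                    int w = (x == 0)      ? c : c - 1;       /* left        */
                    int e = (x == nx - 1) ? c : c + 1;       /* right       */
                    int n = (y == 0)      ? c : c - nx;      /* in front    */
                    int s = (y == ny - 1) ? c : c + nx;      /* behind      */
                    int b = (z == 0)      ? c : c - nx * ny; /* below       */
                    int t = (z == nz - 1) ? c : c + nx * ny; /* above       */
                    f2[c] = cc * f1[c]
                          + ce * f1[e] + cw * f1[w]
                          + cn * f1[n] + cs * f1[s]
                          + ct * f1[t] + cb * f1[b];
                }
            }
        }
    }
}

The collapse(2) clause merges the z and y loops into a single iteration space of nz*ny chunks, so the 240+ threads get enough work items to share, while #pragma ivdep again tells the compiler that the innermost x loop carries no dependencies and can be vectorized.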

The results in the book are:

  • 122 threads: 25,369s
  • 244 threads: 18,664s

With our Phi I am able to achieve the following times:

  • 122 threads: 22,661s
  • 244 threads: 29,849s
  • (240 threads: 20,419s)

For me it was very strange to notice that the execution times, especially for 240 threads, show a big variability. The fastest run out of 10 finished after 20,419s and the slowest one needed 31,580s, although I was the only user on the Phi. In contrast, for 122 threads the fastest execution finished after 22,661s and the slowest one after 23,796s. For 244 threads the behaviour of the Phi is again completely different from the result in the book. And if one looks at the output of the Phi’s monitoring software one can see the reason for it:

[Screenshots from the Phi’s monitoring software showing the core utilization with 240 threads and with 244 threads]
So the average core utilization decreases dramatically if you follow the recommendation of the book to use all available cores in native mode and all cores minus one when running in offload mode. Perhaps a change in the software leads to this behaviour? I also measured the fastest execution time on the two server processors on the board (which is not done in the book). With 16 threads they needed 30,900 seconds, so they took “only” 50% more time than the Xeon Phi. And this in an application which should be able to use all the compute power the Xeon Phi offers.

Summary
Strange. That’s all I have in mind when I think about this situation. I am using the same code as the book, the same compiler flags and a Phi product with the same feature set as the pre-production sample in the book. I’m running the code in native mode, so the driver, the MPSS and so on can’t have an impact on the performance. The only things I can see that differ from the book are the Linux version on the Phi (the latest available one) and the newer versions of the Intel compiler and the OpenMP library. But can this cause such big performance differences?

10 thoughts on “Performance decrease since pre-production samples?”

  1. Hi,

    I think I know what changed in the 9-point stencil sample: the default loop scheduling mode in OpenMP.

    I downloaded the code from http://lotsofcores.com/article/example-code-book-chapters-2-4 and compiled the sample sten2d9pt_omp_xphi, which corresponds to the case that you are dealing with. My result: 16.6 seconds with 240 threads (coprocessor stepping B1).

    Then I went into the code and changed the openmp pragma to the following:
    #pragma omp parallel for private(x) schedule(dynamic)
    Recompiled, re-ran, the time is now: 11.3 seconds. This recovers about 50% of the performance loss that you observed.

    My guess is that the default OpenMP scheduling mode was “dynamic” with the OpenMP library that the authors used, and now OpenMP uses static or something else if the scheduling mode is not specified.

    Nice blog!

    Andrey

  2. Hi Andrey,
    many thanks for your comment! I tried it and you’re right. For the stencil example I now reach nearly the same times as in the book. For the diffusion example the results are still very unstable, both with static and with dynamic scheduling.
    Michael

  3. Hi Michael,

    Can it be that the OpenMP thread affinity affects the runtime and causes the instability of the diffusion example? That would explain the difference between 240 and 244 threads. Not sure if that explains the result in the book, but this is what I got:

    – For the example “diffusion_omp_vect_xphi” from the book the runtime on all cores fluctuates between 18 and 21 seconds.
    – Then I set the environment variable “KMP_AFFINITY=compact,granularity=none” (for micnativeloadex, I passed this variable using the -e argument). This caused runtime to go down to 15 s, and I reproduced this result in several runs.

    I have also heard the advice to use KMP_PLACE_THREADS in offload applications, but have not seen cases yet where that is important ( http://software.intel.com/en-us/forums/topic/369175 ).

    Andrey

  4. Hi Andrey,
    thanks again for your comment.
    I have to try this again. Until now I have only tested compact and scatter without setting the granularity (via export KMP_AFFINITY). But I will try to repeat the tests this week and post the results here.
    Sorry for the late reply, but WordPress didn’t inform me about your comment.

    Michael

  5. Hi Andrey,
    you are right. With these KMP_AFFINITY settings I got the following runtimes:
    14,842 s
    14,858 s
    14,630 s
    Many thanks!
    At the moment I am benchmarking several sparse matrix-vector multiplication variants on the Phi. I will write about this within the next few weeks.

    Michael

  6. yeah, we also found large variation in times for the given code on one MIC card, but not so with another… this was when we were the only user on either card, only one card was being used at a time and nothing else was running on the host…

    • Hi hec,
      thank you also for your comment. We will get several new Phis for the institute within the next three months. Then I can test if they behave in a different way.

  7. The work that you have done here is amazing. I also read the Phi book, although I’m not as familiar with this processor as you are. After reading these comments I was curious whether you were able to get the Xeon Phi to perform in the range of 2 to 3 times faster than two Xeon 2660s (8 Xeon cores at 2 GHz).

    Also, are you using blender (internal or cycles) for the ray tracing app?

    • Hi,
      I wasn’t able to get better performance than with our two Xeon 2670 processors for any of the apps I ported to the Phi. Only the examples from the book run faster, with the configurations posted here by Andrey.
      I’m not using blender. The GA-Raytracer was fully developed from scratch and does not include any libraries.

        Thank you so much for your response. I’ve had many inquiries about its performance with larger graphics rendering applications. It also seems to depend a lot on the performance of the individual operations in a program, in combination with the command line arguments used to optimize for the Phi, because of the nature of the program code itself. I hope that in the near future Intel will release more in-depth and insightful documentation regarding this.
