First real life experiment with the Xeon Phi II

In the previous part if this article [1] I mentioned that the next step would be an analyses of the bad scalability and so performance of the raytracer on Xeon Phi. As a first step for this I used the Intel vTuneAmplifier to search for hotspots in the code algorithm. But there seem to be no abnormalities in the execution flow. But it is conspicuous that the Amplifier states, that the CPU time in the running threads is rather low. On the other side, the overall summation of the results looks pretty good:

simultaneous_threads

 

simultaneous_cpus

So I decided to use Intel Inspector first. The normal analyses reported no errors. After that I increased the search depth and anayses form. Since the analyses wasn’t finished after seven minutes I terminated it. I got two data race errors. One within the Shading and one within the Rendering function. So I disabled the shading and tried to eliminate the data race in Rendering with changing my OpenMP clause. Thats why the resulting image looks like:

trace_without_shade

The results on the double Xeon Server processor system are:

  • 1 Thread: 35.769255 sec
  • 2 Threads: 18.427898 sec
  • 4 Threads: 10.145121 sec
  • 8 Threads: 6.403982 sec
  • 16 Threads: 3.907015 sec
  • 32 Threads: 3.667761 sec

The corresponding values for the Phi are the following:

  • 1 Thread: 573.422251 sec
  • 2 Threads: 288.445928 sec
  • 4 Threads: 156.622805 sec
  • 8 Threads: 98.961222 sec
  • 15 Threads: 54.404671 sec
  • 30 Threads: 34.617849 sec
  • 60 Threads: 22.450361 sec
  • 120 Threads: 15.535183 sec
  • 240 Threads: 10.441986 sec

You can see that the overall performance in comparision to the original version with shading increases, but the scaling problem remains the same. So the Phi still isn’t able to outperform the dual socket cluster node. Though the shading is not the problem for the scaling or the general performance on Phi. The search goes on…

Sources:
[1] http://www.theismus.de/HPCBlog/?p=17