After I reported my experiences with native mode of the Phi I now did my first steps in the offload mode. I used the same raytracer as in the first two parts of this article serie. A few pitfalls revealed during my tries to get the application running in offload mode.
Changes needed in the raytracer
First off all I had to modify the code, so that it is able to compile with the offload pragmas. There I noticed some difficulties especially for C++ Code.
First off all there is the issue that the Phi can get to know the used classes. For the raytracing procedure and the existing code this affects all classes of the project. So for calculating the colour of a pixel the Phi must now the scene and ist objects, in this implementation called Shapes. Additionally it needs to know what the Color class is, which it should have as output. Furthermore Lights, Rays, Vectors are Points are required. So I had to tell the compiler for all of this classes, that he has to offload them. This is done by surrounding the class definitions with the offload attribute pragma:
#pragma offload_attribute (push,target(mic))
// includes
…
class Color
{
public:
…
private:
…
};
#pragma offload_attribute (pop)
This changes had to be applied to all header files, so you need more than just one pragma to offload code to the Phi.
Another problem was the fact that my image array for the calculated picture was declared as img[Height][Width][3]. I did some small examples with multi dimensional arrays and try to fill them von Phi, but this results in crashes during execution. I don’t know if this was my error or if the Phi ( / the compiler) isn’t capable of dealing with such constructs. So I had to change the code so that he uses a 1D array now. To offload and fill this with testdata was no problem.
After this step I took the actual rendering loop and offloaded it to Phi with:
#pragma offload target (mic) in(argc) \
out(img1d : length(HEIGHT*WIDTH*3))
The last problem was the biggest one. After offloading the class structure and changing the structure of the output image the code could be compiled. Trying to execute it ended up with a crash, that Phi returned:
offload error: process on the device 0 was terminated by signal 11
Since I don’t know how to debug the Phi at the moment, I located the problem by commenting out code and uncommenting it step by step. The reason for the crashes was very evident, when I think to it afterwards. The classed structure and there function is copied to Phi yes, but not all their members. Simple ints and doubles are copied automatically but the list for the shapes was empty. I searched in the internet for a lot of time for finding an easy way to copy the hole instance of a class to Phi, but I wasn’t successful on that. A look in one of Intels own examples destroyed my hope altogether. You can find it in the directory of Intels 2013 version of the Composer:
There you can find and example of offloading a struct to Phi. Commented with:
// The first version of the offload implementation does not support copying
// of classes/structs that are not simply bit-wise copyable
//
// Sometimes it is necessary to get a struct across
//
// This needs to be done by transporting the members individually
// and reconstructing the struct on the other side
So this means for me: I would have to decompose the hole shapes class and its inharitors to simple arrays or single variables, to copy them separately and to reassemble it on the Phi. I refused this way because of the amount of work. So I used a second method: I increased the code region within the offload pragma so that it additionally includes the creation of the scene. So the scene is instantiated from a single Xeon Phi Core and directly put in Phi’s RAM. The results are presented in the next section.
A last problem I was engaged in was the writing of the resulting image to file. Since the array was a member of the Raytracer class and this class was instantiated on the Phi directly, it was not possible to write this data after the offload region. But this must be done so that the stream is written to the hard disk of my host system. So I had to instantiate the output array first, pass it within the offload pragma as out-parameter and internally copy / link it to the member variable of the raytracer in the space of the offloaded code. Then after the region I write the stream to file.
Results
Since the code changes these results are not comparable to the older once from parts I and II. But again only the time for the actual rendering loop is measured. I also changed the scene a litte bit. But the reached times are more than disappointing.
Xeons on Host:
1 Thread: 46.238833 sec
2 Threads: 23.850293 sec
4 Threads: 12.371241 sec
8 Threads: 6.942405 sec
16 Threads: 4.752595 sec
32 Threads: 3.586519 sec
Xeon Phi:
30 Threads: 34.608027 sec
40 Threads: 27.258293 sec
60 Threads: 24.582100 sec
120 Threads: 18.004286 sec
240 Threads: 15.859062 sec
Xeon Phi (native)
30 Threads: 29.427415 sec
40 Threads: 22.920789 sec
60 Threads: 21.599124 sec
120 Threads: 14.557700 sec
240: Threads: 13.837122 sec
The native version is slightly faster than the offloaded one and both are much slower than the run on the host. The new scene can be seen at the following picture:

In a next step I will first try to find better solutions for debugging the Phi than commenting and uncommenting. I will test the eclipse plugin which is shipped with the mpss package from Intel.
Sources:
http://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/index.htm