Composer 2013 Update 3 and OpenCL

I had the problem that after installing the Update 3 for the Composer 2013 I wasn’t able to execute my OpenCL programs any more. Debugging told me that already the function clGetPlatformIDs returns error status -1001 and so no OpenCL device in my system could be found. I realized that I was able to run applications which where compiled with gcc. But at the point where I sourced the intel compilervariables via

source /opt/intel/composer_xe_2013/bin/ intel64

my OpenCL devices disappeared. I played a little bit around but only could reach a point where I recognized that their is no error, when I comment out the debugger variables file in But at this point I ran out of ideas (and especially knowledge of the instrinsics of the Intel software and configuration). But in Intel forum I got the solution from Evgeny Fiksman [1]:

The problem may happen because of mismatch in TBB binaries. After sourcing of compilervars the LD_LIBRARY_PATH is updated with a path to TBB binaries provided with ICC installation.

Since LD_LIBRARY_PATH overides RUNPATH used in the OpenCL binaries, the TBB binaries from the ICC installation are used.

Please include the path wherein OpenCL binaries are installed before ICC path in the LD_LIBRARY_PATH.

So I need the following command sequence to get OpenCL running together with the Update 3 of the composer:

source /opt/intel/bin/ intel64
export LD_LIBRARY_PATH=/opt/intel/opencl-1.2-3.0.56860/lib64:$LD_LIBRARY_PATH

So if someone has the same issue: try this.


Performance decrease since pre production samples?

Now I finished reading the book Intel Xeon Phi Coprocessor – High-Performance Programming from Jim Jeffers and James Reinders. I hoped to find the reason why I am not able to reach good performance results on Xeon Phi, but I am still a bit confused about that. So I decided to test the two example programs from the book and to look if the given results are comparable to the performance on our Xeon Phi. Since the source code and the output is printed completely in the book, the calculation times should be nearly the same because our model of Phi has also 61 cores like the pre production sample in the book. So lets have a look at the two programs.

9 point stencil algorithm
This small program applies a blur filter to a given image represented as 2D array. The influence of all 8 neighborpoints to a center point is taken into account. So for each point a weighted sum of 9 addends must be calculated. Since their are two image buffers which are swapped at the end of each iteration, every pixel can be calculated independently from the others. Thats’s why a simple parallelization can be realized by:

#pragma omp parallel for private(x)
for(y=1; y < HEIGHT-1; y++) {
    for(x=1; x < WIDTH-1; x++) {

To help the compiler with the vectorization it is only necessary to add a #pragma ivdep. So the compiler vectorizes the inner loop.

#pragma omp parallel for private(x)
for(y=1; y < HEIGHT-1; y++) {
    #pragma ivdep
    for(x=1; x < WIDTH-1; x++) {

After this code changes the authors reach the following execution time on the Xeon Phi:

  • 122 threads:  8,772s
  • 244 threads: 12,696s

The program compiled here with the same flags and the same setup for our Phi (scatter scheduling) leeds to:

  • 122 threads: 12,664s
  • 244 threads: 19,998s
  • (240 threads: 17,181s)

So in the case of 122 threads our Phi needs 44% more time to finish its work. In the case of 244 threads the increase is even 57%! The special behaviour of using the maximal number of threads will be investigated below. But even 240 threads are much slower than the reference in the book (35% difference).

Here a program is examined which simulates the diffusion of a solute through a volume of liquid over time. This happens in 3D space. The calculcation is very simular to the image filter example from above with the main difference of a 3D array now. Here you take six neighboring grid cells into account (above, below, in front, behind, left and right). So you have for every entry a weighed sum with seven addends. After optimizing for scaling and vectorizing your code looks like:

#pragma omp parallel for
     #pragma omp for collapse(2)
    for(z=0; z < nz; z++) {
        for(y=0; y < ny; y++) {
        #pragma ivdep
            for(x=0; x < nx; x++) {

The results in the book are:

  • 122 threads: 25,369s
  • 244 threads: 18,664s

With our Phi I am able to achieve the following times:

  • 122 threads: 22,661s
  • 244 threads: 29,849s
  • (240 threads): 20,419s

For me it was very strange to notice, that the execution times especially for 240 threads have big variability. The fastest run within 10 was finished after 20,419s and the slowest one needed 31,580s, although I was the only user on Phi. In contrast for 122 threads the fastest execution finished after 22,661s, the slowest one after 23,796s. For 244 threads the behaviour of the Phi ist again completely different from the result of the book. And if one looks at the output of Phi’s monitoring software one can see the reason for it:

240 Threads

240 Threads

244 Threads

244 Threads

So the average core utilization decreases dramatically if you follow the recommendation of the book to use all available cores in native mode and all-1 when running in offload mode. Perhaps a change in the software leeds to this behaviour? I also measured the fastest execution time on the two server processors on the board (which is not done in the book). For 16 threads they needed 30,900 seconds and so they took “only” 50% more time than the Xeon Phi. And this in an application which should be capable of using all compute power which the Xeon Phi offers.

Strange. That’s all I have in mind when I am thinking about this situation. I am using the same code like the book, the same compiler flags and a Phi product with the same featureset as the pre production sample in the book. I’m running the code in native mode so that driver, mpss and so on can’t have an impact on the performance. The only things I can see that differs from the book is the linux-version on the Phi (latest available one) and the new versions of Intel compiler and OpenMP library. But can this cause such big performance differences?

First real life experiment with the Xeon Phi V

“Scale and Vectorize”. These are two prerequisites which I read several times until now, when I searched for ways to make a program run faster on Xeon Phi. The principle of scaling is clear to me and my raytracing applications which I am testing on Xeon Phi do scale well on our server processors and also on Xeon Phi, as I will show in this post. But the overall performance on Xeon Phi is still worse than on the Server CPUs.

As another approach to get a fast Xeon Phi application I tried to offload the Geometric Algebra raytracer of my master thesis [2] to Xeon Phi. The offloaded code is not based on the C++ version of my software but on the OpenCL code which I tested on AMD and NVidia cards during the programming phase of my master thesis. This code uses only scalars and arrays and so it overcomes the limitation of the mic compiler, that it is not able to offload objects which can’t be copied by a simple memcpy [1]. In fact this code runs much faster on CPU and the Xeon Phi than the original C++ version, but it is clear, that it is much harder to understand and to maintain. As an example for the performance I present the following scene here:



It consists of 6500 triangles, uses shadow rays and has recursion level 2. The resulting rendering time is:

  • Xeon (C/OpenCL-based version): 0,40 sec
  • Xeon (C++ version): 1,05 sec
  • Xeon Phi (C/OpenCL-based version): 1,05 sec

So the Xeon Phi again isn’t able to reach the performance of both server processors. But why? First I did a scaling analyses with the following result:



So for 120 threads the Xeon Phi performs about 80 times faster than single threaded. I think that is a pretty good value and not the reason why the Xeon CPU ist such much faster. Thats why I tested the second prerequisite “vectorization” in order to find the performance issue. And there it is. It doesn’t have a measurable influence on the runtime whether to compile with vectorization or with the -no-vec flag to suppress the generation of vector code. The SIMD units of the Phi cores seem to be completely unused by my programm. I tried a few things until now but weren’t able to reach anything in this direction.

I also did a few tests on the thread affinity. scatter performs some percent faster than compact, but this small differences can’t be the explanation for Xeon Phi’s bad rendering times.


First real life experiment with the Xeon Phi IV

In this part I will present some results, which I got from using OpenCL on the Xeon Phi. In another blog entry I described some problems when using the Xeon Phi together with OpenCL [1]. This problems were solved now and I will first report, how I reached this. After that I will summarize the performace measurements and their results.

Getting OpenCL running
My problem was not to install OpenCL but to initialize it correctly. So everything I explained in [1] was correct and the installation was complete. First I used an example code from the Internet, to make sure, that the Phi is really registered as OpenCL device [2]. This created me the following output:

platform count: 1
device count: 2
1. Device: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
1.1 Hardware version: OpenCL 1.2 (Build 56860)
1.2 Software version: 1.2
1.3 OpenCL C version: OpenCL C 1.2
1.4 Parallel compute units: 32
2. Device: Intel(R) Many Integrated Core Acceleration Card
2.1 Hardware version: OpenCL 1.2
2.2 Software version: 1.2
2.3 OpenCL C version: OpenCL C 1.2 (Build 56860)
2.4 Parallel compute units: 236

At this point I could be sure, that the Phi is ready to work, but the next question was how to use it. After a while of reading the OpenCL Documentation I got a hint to the device type CL_DEVICE_TYPE_ACCELERATOR and with that I was able to calculate on Phi. The following code shows how I initialize my device (declarations and error handling ommited).

context = clCreateContextFromType(cprops,                                                                                                                                                                          
status = clGetContextInfo(context,                                                                                                                                                                                         
devices = (cl_device_id *)malloc(deviceListSize);
commandQueue = clCreateCommandQueue(                                                                                                                                                                                       

Now with device[0] I can do the rest of the initialization work.

OpenCL Results
For comparision between the installed Intel Xeon E5-2670 and the Xeon Phi I used a raytracer which operates within Geometric Algebra (GA). It was developed in my Master Thesis [3] and modified and ported to linux for this test. I will show the testscenes and present the results in the following. The profiling was done like in my thesis by using the OpenCL framework’s methods [4]. The resolution for every scene is 1024*1024.

This small dinosaur consists of 100000 triangles. The model was only raycasted so that no reflection rays where used. The Xeon Phi needed 2,98 seconds to render this image, the Xeon E5 only 2,58.

3 Bunnys and an Elephant
This scene consists of 16150 triangles (each bunny 4968, elephant the rest). The Xeon E5 needs 2,47 seconds and the Xeon Phi 2,17. So at least the Phi can outperform the E5.

This rather small scene consists of only about 6000 triangles, but the calculation is dominated by the high amount of shadow. It was tested in two ways. First with use of bounding spheres to reduce the account of ray-triangle intersection tests (like all scenes until now) and in a second step without bounding volumes. In this case for every pixel (e.g. the corresponding eye-ray to it) a test with every triangle in the scene has to be done. For the first variant the Phi renders the picture within 2,11 sec, while the E5 is ready after 2,96 seconds. For the second, the Phi needs 4,76 sec and the E5 6,46 seconds.

This last scene is the most complex one. Each kitten is built by 137098 triangles and the bounding spheres are disabled, while the scene is not only raycasted but completely raytraced. So for every of the 1048576 pixels over 250000 triangles must be tested for intersection (in case of a hit this value doubles). The Phi finishes this task after 142 seconds, while the Xeon E5-2670 has the result after 177 seconds.

It looks like it was in the other three parts of this serie before: Without changing or rewriting existing code it seems impossible to exploit the Xeon Phi’s potential. In contrast to my tries in offload and native mode with C++ code the Phi is able to render faster than the server processor in most of the scenes, but his advance is not that big. More than ever if I take the results on my AMD HD6970 from my Master Thesis into account, the calculation of the scenes on the Xeon Phi is slow.


Xeon Phi, Xeon E5 and OpenCL under Ubuntu 11.10

Since the idbc_mic debugger still denies to work together with the Xeon Phi I did another experiment today and tried to get the Phi running with OpenCL. So I downloaded the latest version of the Intel OpenCL SDK 2013 and the drivers for processors and the one for Xeon Phi from: Like for the MPSS Intels software only supports RedHat and SuSe Enterprise. So I downloaded the three archives and unzipped them one by one with tar xvf. After that I executed in all of the three unzipped directories these both loops:

for f in *.rpm; sudo alien –scripts $f; done


for f in *.deb; sudo dpkg -i $f, done

to convert and install the packages for the two drivers and the general software. Within that process I didn’t get any errors and everything seems to be installed.

Since all linked linux sample projects from: lead to dead links, I had to take the Windows samples package from: and to modify the MedianFilter project, so that it runs under Linux. All code, except the OpenCL kernel, is placed in one cpp file. The Makefile looks like this:

rm=/bin/rm -f
INCLUDES=-I. -I../common -I/usr/include/ -I/opt/intel/opencl/include
LIBS=-l OpenCL


SRCS = ocl.cpp

OBJS = ocl.o

$(rm) $@
$(CC) $(CFLAGS) -c $*.cpp

all: $(PROGNAME)


$(rm) $(OBJS) $(PROGNAME) core *~

But when trying to compile this, I get a warning that the numactl library wasn’t found on which the OpenCL libs are depended. A try to run the program anyway ended up in a crash because the needed *.so file couldn’t be found (no wonder :) ). So I downloaded the numcatl package for Fedora 15 from:, converted it to *.deb with the help of alien and installed it with dpgk. After that I found the Ubuntu solution from: But it isn’t installed yet.

After that my program compiled and could be run. But the functions to find the OpenCL devices in the system only are aware of the Xeon E5 processors on the board and not on the Xeon Phi, although its driver is installed. I’m still searching a solution for this.

My problem was not to install OpenCL but to initialize it correctly. So everything I explained here was correct and the installation was complete. First I used an example code from the Internet, to make sure, that the Phi is really registered as OpenCL device [1]. This created me the following output:

platform count: 1
device count: 2
1. Device: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
1.1 Hardware version: OpenCL 1.2 (Build 56860)
1.2 Software version: 1.2
1.3 OpenCL C version: OpenCL C 1.2 
1.4 Parallel compute units: 32
2. Device: Intel(R) Many Integrated Core Acceleration Card
2.1 Hardware version: OpenCL 1.2 
2.2 Software version: 1.2
2.3 OpenCL C version: OpenCL C 1.2 (Build 56860)
2.4 Parallel compute units: 236

At this point I could be sure, that the Phi is ready to work, but the next question was how to use it. After a while of reading the OpenCL Documentation I got a hint to the device type CL_DEVICE_TYPE_ACCELERATOR and with that I was able to calculate on Phi. The following code shows how I initialize my device (declarations and error handling ommited).

context = clCreateContextFromType(cprops,                                                                                                                                                                          
status = clGetContextInfo(context,                                                                                                                                                                                         
devices = (cl_device_id *)malloc(deviceListSize);
commandQueue = clCreateCommandQueue(                                                                                                                                                                                       

Now with device[0] I can do the rest of the initialization work.



Installing the new 2.1.5889-16 of MPSS on Ubuntu 11.10

Because of troubles with the debug software I updated my MPSS version with the hope, that this will resolve (some) problems. So I downloaded my the latest version from and followed my own tutorial from [1]. The only thing, that you have to change is the spec-file, if you want the Intel tools to show you the correct version of it. The appropriate file can be downloaded here: intel-mic-kmod-5889-16.

I did a complete uninstall of the previous version and after a restart I installed the newer one. Everything runs like before. Not worse, but still not better…


Updating Phi’s bootloader and flash version

Since I still have trouble with Intels Debugging tools on Xeon Phi I decided to do an update day today. On march 21 Intel released a new driver version 2.1.5889-16 which I installed and will blog in a seperate post. After that I did the update of bootload and flash. To do that you have to no the stepping of your Xeon Phi. So you first have to execute /opt/intel/mic/bin/micinfo and look for the line Coprocessor Stepping. In my case this is B1. After that you can follow Intels description of the process given in [1], the readme of the MPSS package. I reprint it here in a shortened way:

This configuration is required for “SMC Firmware Version 1.7″ or earlier. Execute /opt/intel/mic/bin/micinfo to identify the SMC firmware version installed on the card. 1) Set the coprocessor to a ‘ready’ state. user_prompt> sudo micctrl -r user_prompt> sudo micctrl -w user_prompt> mic0: ready 2) Update the SMC firmware only for installed B0 or B1 steppings of Intel(R) Xeon Phi(TM) hardware. Skip to step 3 if C0 stepping is present user_prompt>sudo /opt/intel/mic/bin/micflash \ -Update /opt/intel/mic/flash/<EXT_HP2_SMC_Bootloader_version> -device all Where <EXT_HP2_SMC_Bootloader_version> represents an SMC firmware file, which for this release is named EXT_HP2_SMC_Bootloader_1_8_4326.css_ab. The “ab” postfix means that the image applies to A and B steppings of the coprocessor. No reboot is necessary at this point. 3) The bootloader update is limited in functionality. The flash update must be run after the bootloader update for full card management availability (Section 7.3 “Intel(R) Xeon Phi(TM) Coprocessor Flash Update”). 7.3 Intel(R) Xeon Phi(TM) Coprocessor Flash Update. Notes: o Flash image files that end with .smc contain both the flash and smc firmware. o For in depth information about micflash options and tools check the micflash man pages. ******************************************************************************** * WARNING: Pay special attention when selecting the version of the Intel(R) * * Xeon Phi(TM) coprocessor flash image that will be flashed while performing * * the steps described in this section. Flashing the incorrect ROM file to an * * Intel(R) Xeon Phi(TM) coprocessor WILL cause the coprocessor to become * * unresponsive and will require it to be replaced. * ******************************************************************************** Steps: 1) Determine the flash image to install: Flash images are stored at the target directory, usually located at /opt/intel/mic/flash/. To choose the right flash image, refer to the following table that relates to each stepping of the Intel(R) Xeon Phi(TM) coprocessor: Stepping | Flash ROM Name ———-+—————————————— B0 | EXT_HP2_B0_0385-01.rom.smc B1 | EXT_HP2_B1_0385-01.rom.smc C0 | EXT_HP2_C0_0385-01.rom.smc 2) Set all coprocessors on ready state. user_prompt> sudo micctrl -r user_prompt> sudo micctrl -w mic0: ready 3) Install the flash image. In single coprocessor systems or multi-coprocessor systems with the same coprocessor stepping use: user_prompt> sudo /opt/intel/mic/bin/micflash -Update \ /opt/intel/mic/flash/ -device all In multi-coprocessor systems with different stepping values, update the flash for each coprocessor, specifying each card separately. For example, for B0 and C0 coprocessors installed in slots 0 and 1 use: user_prompt> sudo /opt/intel/mic/bin/micflash -Update \ /opt/intel/mic/flash/EXT_HP2_B0_0385-01.rom.smc -device 0 user_prompt> sudo /opt/intel/mic/bin/micflash -Update \ /opt/intel/mic/flash/EXT_HP2_C0_0385-01.rom.smc -device 1 4) Power cycle the host for the changes to take effect. Flash will be loaded after a host reboot. In a virtual environment, a reboot by micflash will only reboot the virtual machine.

Now micinfo shows me Driver Version: 5889-16, Flash Version:, SMC Boot Loader Version: 1.8.4326 and uOS Version: This was the first thing, that I did with Phi, that completely runs like it is described in the documentation.

Since everything seems to be up to date now I will again try to get the debug tools running.


First real life experiment with the Xeon Phi III

After I reported my experiences with native mode of the Phi I now did my first steps in the offload mode. I used the same raytracer as in the first two parts of this article serie. A few pitfalls revealed during my tries to get the application running in offload mode.

Changes needed in the raytracer
First off all I had to modify the code, so that it is able to compile with the offload pragmas. There I noticed some difficulties especially for C++ Code.

First off all there is the issue that the Phi can get to know the used classes. For the raytracing procedure and the existing code this affects all classes of the project. So for calculating the colour of a pixel the Phi must now the scene and ist objects, in this implementation called Shapes. Additionally it needs to know what the Color class is, which it should have as output. Furthermore LightsRays, Vectors are Points are required. So I had to tell the compiler for all of this classes, that he has to offload them. This is done by surrounding the class definitions with the offload attribute pragma:

#pragma offload_attribute (push,target(mic))
// includes

class Color

#pragma offload_attribute (pop)

This changes had to be applied to all header files, so you need more than just one pragma to offload code to the Phi.

Another problem was the fact that my image array for the calculated picture was declared as img[Height][Width][3]. I did some small examples with multi dimensional arrays and try to fill them von Phi, but this results in crashes during execution. I don’t know if this was my error or if the Phi ( / the compiler) isn’t capable of dealing with such constructs. So I had to change the code so that he uses a 1D array now. To offload and fill this with testdata was no problem.

After this step I took the actual rendering loop and offloaded it to Phi with:

#pragma offload target (mic) in(argc) \
    out(img1d : length(HEIGHT*WIDTH*3))

The last problem was the biggest one. After offloading the class structure and changing the structure of the output image the code could be compiled. Trying to execute it ended up with a crash, that Phi returned:

offload error: process on the device 0 was terminated by signal 11

Since I don’t know how to debug the Phi at the moment, I located the problem by commenting out code and uncommenting it step by step. The reason for the crashes was very evident, when I think to it afterwards. The classed structure and there function is copied to Phi yes, but not all their members. Simple ints and doubles are copied automatically but the list for the shapes was empty. I searched in the internet for a lot of time for finding an easy way to copy the hole instance of a class to Phi, but I wasn’t successful on that. A look in one of Intels own examples destroyed my hope altogether. You can find it in the directory of Intels 2013 version of the Composer:

There you can find and example of offloading a struct to Phi. Commented with:

// The first version of the offload implementation does not support copying
// of classes/structs that are not simply bit-wise copyable
// Sometimes it is necessary to get a struct across
// This needs to be done by transporting the members individually
// and reconstructing the struct on the other side

So this means for me: I would have to decompose the hole shapes class and its inharitors to simple arrays or single variables, to copy them separately and to reassemble it on the Phi. I refused this way because of the amount of work. So I used a second method: I increased the code region within the offload pragma so that it additionally includes the creation of the scene. So the scene is instantiated from a single Xeon Phi Core and directly put in Phi’s RAM. The results are presented in the next section.

A last problem I was engaged in was the writing of the resulting image to file. Since the array was a member of the Raytracer class and this class was instantiated on the Phi directly, it was not possible to write this data after the offload region. But this must be done so that the stream is written to the hard disk of my host system. So I had to instantiate the output array first, pass it within the offload pragma as out-parameter and internally copy / link it to the member variable of the raytracer in the space of the offloaded code. Then after the region I write the stream to file.

Since the code changes these results are not comparable to the older once from parts I and II. But again only the time for the actual rendering loop is measured. I also changed the scene a litte bit. But the reached times are more than disappointing.

Xeons on Host:
1 Thread: 46.238833 sec
2 Threads: 23.850293 sec
4 Threads: 12.371241 sec
8 Threads: 6.942405 sec
16 Threads: 4.752595 sec
32 Threads: 3.586519 sec

Xeon Phi:
30 Threads: 34.608027 sec
40 Threads: 27.258293 sec
60 Threads: 24.582100 sec
120 Threads: 18.004286 sec
240 Threads: 15.859062 sec

Xeon Phi (native)
30 Threads: 29.427415 sec
40 Threads: 22.920789 sec
60 Threads: 21.599124 sec
120 Threads: 14.557700 sec
240: Threads: 13.837122 sec

The native version is slightly faster than the offloaded one and both are much slower than the run on the host. The new scene can be seen at the following picture:


In a next step I will first try to find better solutions for debugging the Phi than commenting and uncommenting. I will test the eclipse plugin which is shipped with the mpss package from Intel.


Trying to get the Phi running under Ubuntu 12.04 LTS with kernel 3.5.0-26

As I reported in [1] I was able to get the Xeon Phi running on Ubuntu 11.10 with the 2.1.5889-14 mpss version and the latest available kernel for 11.10. After a complete crash of our cluster node we tried to use the chance to upgrade to 12.04 LTS with kernel 3.5 and I tried for two days to get the Phi running on this configuration.

The first thing when trying to follow my description in [1] is that the compilation in step 9

For compiling the kernel module go to /root/rpmbuild/SPECS. Here execute the command rpmbuild –bb intel-mic-kmod.spec. The sources are being compiled and a rpm-file is created. It is placed in /root/rpmbuild/RPMS/x86_64.

fails with several compile errors. This is caused due three changes in the linux kernel header files:

  1. The tty_driver structure has no field minor_num any more. It was removed because it was never used. The Intel software writes it at one position in linvcon.c but never reads this value. So I commented this writing out.
  2. The poll_table structure renamed her entry key to _key. So I changed this in all uses in the files miscif_api.c and miscif_nm.c.
  3. The SYSTEM_SUSPEND_DISK state was removed. It is only checked in micsif_nm.c and the only command that follows is a break. So I commented out those to lines.

So I extraced the driver code from the tar archiv to direction dirorg and copied these file to a new directory name dirpatch. There I applied the code changes described above. After that I created a second patch file with the diff command:
diff -uNr dirorg dirpatch > kernel35.patch
and added his execution/application to the spec file. I also changed the original spec file to remove the inherent error, that the name of the original first patch file there was wrong. The files are available in Kernel35 Patches. If you use this spec file to install the 2.1.5889-14 MPSS version on Ubuntu 11 as descripted in [1] you can skip step 7 from there and ignore the original patch (intel-mic-mpss21up1-kmodspecfile.patch) for the *.spec file.

With the kernel35.patch, the original intel-mic-mpss21up1-kmod-2.1.4982.patch  and the modified intel-mic-kmod.spec it was possible to compile the kmod kernel module and create the *.deb file following steps 8-10 from [1]. After finishing the installation process with this selfmade package, it is possible to get the status of the Phi via micctrl –status. It says ready and you can execute micctrl –initdefaults. The ssh keys are copied.

But then the disillusion. When trying to start the mpss daemon everything behaves normally and the Phi reports online. After 14 seconds (time taken from log file) the status changes to loss and the Phi tries to reboot until he reaches ready state. There he stays some time and after a while the hole system crashes. This behaviour is reproducible.

Nothing of my tries could change this situation. Warm or cold restarts or a complete new installation of the software (with repeating compiling). To uninstall the Intel stuff I generated a list with all installed packages with:
dpkg –get-selections > installed-software
and uninstalled one Intel package after the other with
dpkg -r PACKAGE
dpkg –purge PACKAGE

As last idea I tried to use the old Intel software version 2.1.4982-15. But with the same result in the end. As intermediate problem there was additionally a conflict of the intel-mic-gdb and the intel-mic-gpl package, where I had to force dpkg to install it, although they were writing to same positions.

After that I gave up the plan to use the new linux for the moment and went back to 11.10 with kernel revision 3.0.0-32, where I only needed about 30 minutes to get the Phi running, following my own step by step manual from [1].


First real life experiment with the Xeon Phi II

In the previous part if this article [1] I mentioned that the next step would be an analyses of the bad scalability and so performance of the raytracer on Xeon Phi. As a first step for this I used the Intel vTuneAmplifier to search for hotspots in the code algorithm. But there seem to be no abnormalities in the execution flow. But it is conspicuous that the Amplifier states, that the CPU time in the running threads is rather low. On the other side, the overall summation of the results looks pretty good:




So I decided to use Intel Inspector first. The normal analyses reported no errors. After that I increased the search depth and anayses form. Since the analyses wasn’t finished after seven minutes I terminated it. I got two data race errors. One within the Shading and one within the Rendering function. So I disabled the shading and tried to eliminate the data race in Rendering with changing my OpenMP clause. Thats why the resulting image looks like:


The results on the double Xeon Server processor system are:

  • 1 Thread: 35.769255 sec
  • 2 Threads: 18.427898 sec
  • 4 Threads: 10.145121 sec
  • 8 Threads: 6.403982 sec
  • 16 Threads: 3.907015 sec
  • 32 Threads: 3.667761 sec

The corresponding values for the Phi are the following:

  • 1 Thread: 573.422251 sec
  • 2 Threads: 288.445928 sec
  • 4 Threads: 156.622805 sec
  • 8 Threads: 98.961222 sec
  • 15 Threads: 54.404671 sec
  • 30 Threads: 34.617849 sec
  • 60 Threads: 22.450361 sec
  • 120 Threads: 15.535183 sec
  • 240 Threads: 10.441986 sec

You can see that the overall performance in comparision to the original version with shading increases, but the scaling problem remains the same. So the Phi still isn’t able to outperform the dual socket cluster node. Though the shading is not the problem for the scaling or the general performance on Phi. The search goes on…


First real life experiment with the Xeon Phi

After executing and editing some of Intel’s tutorials in the /opt/intelcomposer_xe_2013/Samples/en_US/C++/mic_samples/intro_sampleC directory I did a practial experiment with a raytracer on Xeon Phi in native execution mode. For that I used a simple, open source C++ raytracer which I downloaded from [1].

The main problem of this raytracer was the structure of the loop, which runs over all pixels of the image and can normally be parallized in an easy way. But in this case the original implementation created dependencies between single loop runs and made an OpenMP parallelization impossible due other issues. Main reason for that was, that the structure of the for-loops wasn’t OpenMP compatible (no initialization, two increments) and the calculation of the i- and j- local coordinates parameters was depended on the previous iteration.
Another problem was the creation of the final output image, which was realized as a stream writing to file in every iteration, so that the correctness of the picture was depended on a systematic run through the pixels.
After this issues were resolved a simple OpenMP parallelization was done over the outer pixel loop (over height of image).
The resulting new  RayTracer.cpp can be downloaded by clicking on it.

Another change was done in the main routine. Since the original scene was to simple to test scalability, more objects were added to it. Now a loop creates 300 spheres which were slightly displaced to form a pipe similar structure. The new image looks like this:


The modified main.cpp can also be downloaded here.

This example now was compiled with icc and the -O3, -ipo and -xHost flags and benchmarked on one cluster node with two Intel Xeon E5-2670 (2,6 GHz) processors on Sandybridge base. Every CPU has eight physical cores and Hyperthreading. So cat /proc/cpuinfo lists 32 processors.

For time messurement the omp_get_wtime function is used. I take only the time into account which elapses between the start and the end of the pixel loop. So the serial part of the application is neglected. The results are:

  • 1 thread: 50.147235 sec
  • 2 threads: 25.685468 sec
  • 4 threads: 14.723042 sec
  • 8 threads: 9.780591 sec
  • 16 threads: 7.495562 sec
  • 32 threads: 5.663061 sec

So in the beginning there is a very good scaling of the raytracer how one can expect it from theory. But the scaling between 8 and 16 threads is rather poor. That the difference between 16 and 32 wouldn’t be very huge becomes clear when you take into account that 16 of the threads are only running with Hyperthreading and not on real, physical cores.

The results on the Phi are the following ones:

  • 1  Thread: 800.992084 sec
  • 2 Threads: 419.042969 sec
  • 4 Threads: 230.156072 sec
  • 8 Threads: 148.853595 sec
  • 15 Threads: 81.311708 sec
  • 30 Threads: 50.757255 sec
  • 60 Threads: 32.278490 sec
  • 120 Threads: 22.553419 sec
  • 240 Threads: 15.653570 sec

As one can see that the single core performance of the Phi is very poor which is obvious because of the architecture of the less complex, 1 Ghz clocked cores of the card. But I didn’t expect it to be as huge as it is now. The scaling in the very beginning is very good, but decreases fast with increasing number of threads. So in total the Xeon Phi in this setup isn’t able to reach the performance of the 2 Xeon Server processors.

One curious thing turns out when you look at the average core utilization of the Phi via Intel monitoring tool. From start of the raytracer on all cores are used, but during the calculation the average usage is decreasing with the time. You can see this on the following screenshot:


This behaviour is unexpected, because following the theory of raytracing, all threads can take part in the calculation of the final image until it is rendered completely. The next step would be to investigate, where this behaviour comes from and in general to solve the issue of the relativly low performance of the Phi raytracing.


Online tutorials and the -mmic / -(no)-offload compiler-flags

After reading the first tutorials of using the Xeon Phi and some experiments with it in reality, it seems that I missunderstood its use. Also some of the tutorials (not from Intel itself) seem to be wrong at this point.

The -mmic compiler-flag is only needed if an application shall run in native mode on the Xeon Phi. The generated executable can’t be started on the host system, but only on the Phi. To run an application completely on the Phi:

  1. Compile the app with icc -mmic -OTHER_NEEDED_FLAGS
  2. Copy the binary to Phi via scp EXENAME mic0:/tmp/
  3. Login on the Phi with ssh mic0
  4. Go to directory with cd /tmp
  5. Start application ./EXENAME

For offload versions of your application you must not use the -mmic flag! The compiler will complain about unknown pragmas and stop compiling. Only icc without -mmic recongnizes #pragma offload and generates usable hybrid code.

The usage of the -offload flag changed since the early versions of the mic-capable compilers. The -offload is obsolete. So now it is the standard behaviour that the compiler generates both, mic and host cpu code for the offloaded regions. If you want to surpress the generation of mic code you have to use -no-offload flag. It helped me a lot to become aware of this fact.

Installing Intel Xeon Phi on Ubuntu 11

Since our cluster-node is running under Ubuntu 11.10, it was not that easy to get our Xeon Phi running in this system. That’s why we decided to give here a short how-to on this issue in the hope that it can help somebody else who is confronted with the same or similar problems. It is based on a descripition from[1] but differs in some point. One reason is, that I used a newer version of the Intel driver. All steps were done as root, because some of the directories I needed where placed in the /root directory.

  1. First step is to download the driver from the Intel homepage: and go to “Tools&Downloads” and choice: Software Drivers: Intel® Manycore Platform Software Stack (Intel® MPSS). I used the version KNC_gold_update_2-2.1.5889-14 for SuSe SLES11 SP2 and you have to use it also on systems with newer kernels because it is designed for a 3.0xx kernel version.
  2. Unzip driver and MPSS via tar xvf KNC_gold_update_2-2.1.5889-14-suse-11.2.tar.
  3. For the next step you need the package alien. If it is not installed you have to install it via apt-get install alien.
  4. Go to the unzip directory at topmost level, where all nine *.rpm files are stored. Convert all packages to *.deb except intel_mic_kmod-2.1… because this package contains the driver kernel module, which needs to be compiled for your specific kernel manually. To convert all other packages use alien –scripts PACKAGENAME.deb. Alternatively you can use a for loop in your bash:
    for f in *.deb; do sudo alien –scripts $f; done
    to do the work for you. Before that you should delete the kmod rmp package. You won’t need it anymore.
    After this step you have eight *.deb packages in the directory where you have unzipped the files.
  5. Next step is the unpacking of the drivercode. Go to the subdirectory src. For this step you need the rpm-package. If it isen’t installed yet you have to do this with apt-get install first. Then unpack the files via rpm2cpio intel-mic-kmod-2.1.5889-14.suse.src.rpm | cpio -idmv. You will get to files. The first one with the driver data is called intel-mic-kmod-2.1.5889.tar.bz2 and the spec file intel-mic-kmod.spec.
  6. Now you will need the two files from zip archiv intel-mic-mpss-kmod-patch_usable. In this case you get a intel-mic-kmod.spec file for the compilation process of the kernel driver code and a  intel-mic-mpss21up1-kmod*.patch file which modifies the code before compilation to be compatible with the Ubuntu kernel.
    You can also download original versions from [1]. Then you have the two files mentioned above and additionally a intel-mic-mpss21up1-kmodspecfile.patch file. But then you have to do the changes to the *.spec file, which are described in the following step by hand.
    The files in the archive linked above are ready to use. So you can extract them and skip step 7. But you should read it to know, what was changed in the file.
  7. The original spec-file-patch can’t be automatically applied to intel-mic-kmod.spec because Intel changed the spec-file content for the 5889-14 release. So I had to integrate the changes by hand into intel-mic-kmod.spec. So I had to:
    - add the g in the line which starts with %define kernel abbrev. Like you can see it in the specfile patch.
    - add the line which starts with Patch1:. But here is an error in the original specfile patch. I had to fill in the correct name of the second file in the archive to the “Source0″ line which is: intel-mic-mpss21up1-kmod-2.1.4982.patch.
    - comment out the line which begins with BuildRequires:.
    - add the %patch1 -p1 like you can see it in the patch-file.
  8. Now you have to move all the needed files in the correct directory to build the kernel module. So copy the edited spec-file to /root/rpmbuild/SPECS. The remaining (code) patch file and the intel-mic-kmod-2.1.5889.tar.bz2 must be placed in /root/rpmbuild/SOURCES.
  9. For compiling the kernel module go to /root/rpmbuild/SPECS. Here execute the command rpmbuild –bb intel-mic-kmod.spec. The sources are being compiled and a rpm-file is created. It is placed in /root/rpmbuild/RPMS/x86_64.
  10. Convert the created rpm file again to *.deb with using alien –scripts on it. Then copy the *.deb file to the directory where you unzipped the Intel stuff and where the other *.deb files are.
  11. Go to the Intel stuff folder and install all *.deb file with apt-get -f install. I did it in alphabetical order. Alternativly you can use a for loop in your bash:
    for f in *.deb; do sudo dpkg -i $f; done
    to do the work for you.
  12. Since the control tools for the Phi are placed in /usr/lib64 by default, you have to tell Ubuntu, that it should look there for the binaries. Create a new file named mic.conf in /etc/ and write the line
    usr/lib64 into it. Now execute the ldconfig command after that. [1] claims, that you can now use the Intel tools, like micctrl.
    One pitfall I (and another person [2]) had, when the systems tells that the is missing, was the fact, that there are two versions of this file on your harddisk. One as it is stated in [2] at opt/intel/mic/filesystem/base/lib64/ and another one in /usr/lib64/. The first one is a version which is compiled for the Linux version that is running on the Phi and not of use for miccntrl. So you have to add the /usr/lib64/ path (for example like it is described in [3]. After that micctrl –status should work.
  13. In the next step you have to change the init script for the mpss services, so that it works under Ubuntu. So start editing the file /etc/init.d/mpss. Just as desribed in [1] replace the line that reads startproc -t 1 $exec with these two:
    [ -d "/var/lock/subsys" ] || mkdir /var/lock/subsys
    start-stop-daemon --start --exec $exec 
  14. This passage can be exactly done, as [1] states. Open the file /etc/network/interfaces for edit an append:
    iface mic0 inet static
    to it. Save and close. For the network configuration I had also to run ifup mic0.
  15. Now everything should be prepared to start the Phi. mpss service. For that type service mpss start. The message Starting mpss stack: will appear. This can last some time, so wait and don’t press any keys. After a while you will get some missing rc_* messages but get the control of the console back. First test if the kernel module is loaded by typing lsmod and looking for a module named mic. After that type micctrl –initdefaults and micctrl –status. This call shall return you that the Xeon Phi is online now and ready to use.