Mar 19 2011

I’m going to start with a piece of generic C/C++ code and walk through the steps of working out where the time is going and how to use the profile results to speed things up.

The best place to start is with a version of the code you need to profile that takes about 30 seconds to a minute to run, so that you don’t spend an eternity waiting for each run to complete. Keep turnaround times short to begin with. If you find that you are not capturing the data properly, or that something is missing, it can be useful to extend the run time. Only after several iterations of profiling, optimising and running do I move to much longer run times (e.g. hours).

dwyer3@lyra:~/Projects/HPC_Tutorials/Profiling/Water> make
icpc -O2  -c SPH_Water.cpp -o SPH_Water.o 
icpc -O2 SPH_Water.o -o Water -limf 
Binary created!!
dwyer3@lyra:~/Projects/HPC_Tutorials/Profiling/Water> ./Water
Timing: 38263 milliseconds - 249 frames
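
The Timing line above is printed by the program itself. The source of SPH_Water.cpp is not shown in this post, so the following is only a minimal sketch of how such a measurement could be produced with std::chrono. The names simulate_frame and time_frames are hypothetical stand-ins, not functions from the real code.

```cpp
#include <chrono>

// Hypothetical stand-in for one frame of simulation work.
// The real per-frame code in SPH_Water.cpp is not shown in this post.
double simulate_frame(int frame) {
    double acc = 0.0;
    for (int i = 1; i <= 100000; ++i) {
        acc += static_cast<double>(frame) / i;
    }
    return acc;
}

// Runs a fixed number of frames and returns the elapsed wall-clock time
// in milliseconds. The per-frame results are summed into `checksum` so
// the compiler cannot discard the work.
long long time_frames(int frames, double &checksum) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    checksum = 0.0;
    for (int f = 0; f < frames; ++f) {
        checksum += simulate_frame(f);
    }
    const auto stop = clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
```

With something like this in place, the frame count can be raised or lowered until a run lands in the 30 to 60 second range.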

The next thing to do is to turn on the profiling flags.

Compile your program with -pg

dwyer3@lyra:~/Projects/HPC_Tutorials/Profiling/Water> make
icpc -O2 -pg  -c SPH_Water.cpp -o SPH_Water.o 
icpc -O2 SPH_Water.o -o Water -limf -pg
Binary created!!

Run the newly compiled code.

dwyer3@lyra:~/Projects/HPC_Tutorials/Profiling/Water> ./Water
Timing: 38057 milliseconds - 249 frames

To start getting profiling information out, run the following commands:

gprof Water gmon.out -p > p.txt
gprof Water gmon.out -q > q.txt

Examining gprof Outputs

Figure 1: gprof Binary gmon.out -p

It can be seen from Fig. 1 (examining p.txt) that one method takes up almost all of the time: of the 38 seconds of runtime, 99.8% was spent in the function Run(). This is not helpful for this particular case, and q.txt tells a similar story.

Don’t go thinking that gprof is useless; it isn’t. This is just not a great example for it. A gprof capture from another project is presented in Fig. 2.

Figure 2: gprof Intel Power Function

Figure 2 shows that the std::pow function is being called many, many times. Also of interest is that 3.57% of program runtime is spent just making calls to the function. All std::pow is doing, in this particular case, is offloading the work to the Intel math library for its implementation of the power function. The user (me) must have compiled with -limf and #include "mathimf.h" instead of the usual -lm and #include "math.h". The Intel math version of pow took an additional 22.5% of program runtime, so roughly 26% (3.57 + 22.5) of program runtime was spent in the power function. Is there a way to avoid calling std::pow and go straight to Intel’s version?
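
One possible way to cut this overhead, if the exponents at the hot call sites happen to be small non-negative integers, is to replace the pow call with repeated multiplication. That is an assumption: the post does not say what the exponents are. The helper name ipow below is hypothetical and is not part of the Intel math library; it is a sketch of exponentiation by squaring, not a drop-in replacement for std::pow.

```cpp
// Hypothetical helper: raises `base` to a non-negative integer power
// using exponentiation by squaring (about log2(exp) multiplications).
// Not valid for fractional or negative exponents.
inline double ipow(double base, unsigned int exp) {
    double result = 1.0;
    while (exp > 0) {
        if (exp & 1u) {
            result *= base;  // this bit of the exponent is set
        }
        base *= base;        // square for the next bit
        exp >>= 1u;
    }
    return result;
}
```

The result can differ from std::pow in the last few bits for some inputs, so it is worth checking that the output is still acceptable and re-profiling to confirm the change actually helps.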

Figure 3: gprof Physics Code

Figure 3 is better in that it shows a breakdown across multiple methods. No one method is consuming a large proportion of compute time.

Figure 4: gprof Massive Rendering

There are two methods in Figure 4 that are taking up most of the time. One is being called about 1.7 billion times, while the other is called a mere 1000 times. Which do you work on?
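
One way to frame that question is average cost per call: a function's self time divided by its call count, both of which appear in the gprof flat profile. The helper below is a small sketch of that arithmetic. The name seconds_per_call is hypothetical, and any numbers fed to it should come from your own profile; none are taken from Figure 4.

```cpp
// Average self time per call, in seconds.
// Returns 0 when the function was never called, to avoid dividing by zero.
double seconds_per_call(double self_seconds, unsigned long long calls) {
    if (calls == 0) {
        return 0.0;
    }
    return self_seconds / static_cast<double>(calls);
}
```

A function with a tiny per-call cost and a huge call count tends to reward reducing the number of calls or making each call cheaper to invoke, while one with a large per-call cost and few calls tends to reward optimising its body.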

Getting back to Figure 1: we need more information, so we now turn to other profiling tools, the best one being Intel’s VTune (Amplifier).

Intel VTune Amplifier

To start Intel VTune Amplifier:

module load intel
module load intel/vtune
amplxe-gui

You should be greeted with the screen shown in Figure 5.

Figure 5: Starting VTune

When you want to profile your code for the first time, you will want to create a new project:
File -> New -> Project

Figure 6: Creating a project

You will also want to make sure that VTune can find your source code. It is not necessary, but I really, really recommend it. This doesn’t work 100% of the time; VTune still contains a lot of bugs, but keep trying. Sometimes you will need to point to the source code, run the profiling, and repeat until VTune eventually locates your source. Other times you need to exit VTune, start it again, and define a new project.

Update: It seems this has been mostly fixed. I find that I have to go through every entry in the “Search directories for” combobox, selecting where the code is and ticking the subdirectory checkbox for each. Once all of that is done, don’t click OK there; instead go back to the “Target” tab and click OK.

Another thing is not to compile the binary with any optimisation flags; compile only with “-g” (which seems to me a bit of a problem, but I’m not going into that).

Figure 7: Pointing to your source code

You should have defined your project adequately for now.
Press OK and start a new tuning event via:
File -> New -> Analysis

At this point you could run any of the options, but I have found that you are better off creating a tuning event yourself.
Click “New” at the bottom left of the GUI.

Figure 8: Custom Analysis

Once you have done this, click START on the right.

The results:

Figure 9: Initial Results

If you double-click on the time-consuming entry (in my case |>Water), and VTune was able to locate your source code, you will be able to scroll through your source line by line, with a counter on each line indicating how much CPU time was spent on it. VERY USEFUL. If you see assembly code, VTune was unable to locate your source code and you may need to go back to Figure 7 and rerun the custom analysis.

Figure 10: Full Source Results

Other Examples:

Figure 11 is interesting. It shows the VTune results for the code in Figure 4. The thing to note is that more time is spent manipulating the points array than on the actual computation. This is where you would start to examine cache misses and the like. That sort of exercise is beyond the scope of this investigation. Stay tuned.

Figure 11: VTune results for the code in Figure 4. Note that this was done on an old version of the code. It highlights a neat point that one day I’ll get around to talking about.
