Mar 302011
 

Just did a pretty neat trick for some CPU code I had to speed up. The solution here comes as a result of GPU programming where ‘if’ statements in code are bad. Well, branching is bad anyway, but on a GPU the evil is really exposed.

This snippet is called a lot. According to gprof, 99.61% of program time is spent in the method that contains the below code. As a benchmark for arguments sake, we will say that the below code total runtime is 26 seconds

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
    for (int i=0;i<F->sizeQ; i++)
    {
        for (int j=i+1;j<F->sizeQ; j++)
        {
            bool sum = true; 
            // Only sum over different values for self node (ie. 1 and -1)
            if (F->Q[i].integer_array[index] != F->Q[j].integer_array[index])
            {
                for (int m=0;m<index;m++)
                {    
                    if (F->Q[i].integer_array[m]!=F->Q[j].integer_array[m])
                    {
                        sum=false;
                        continue;
                    }
                }
                for (int n=index+1;n<F->in_node_cnt;n++)
                {
                    if (F->Q[i].integer_array[n]!=F->Q[j].integer_array[n])
                    {
                        sum = false;
                        continue;
                    }
                }
            }
            else
            {
                sum = false;
                continue;
            }
            if (sum)
            {
                // Do some computational work that barely registers on a 1 hour profile.
                // Code REMOVED
            }
 
        }
    }

There are quite a few things here that are examples of pretty poor programming. Or, rather the author wrote the code and just left it at that once it worked.

To make this posting a little faster, I’ll summarise the changes I made before I did the cool trick.

1. Change the continue statements on lines [14,22] to break – Effect 16.6 seconds
2. Instead of the caching mess that is the F->Q[i].array[j] statements (these arrays are constantly deleted and new’ed else where in the code), I formulate int **array – Effect: 12 seconds
3. gcov had if statement on line 19 being called many more times than line 11, so reorder the loops. Effect: 11.3 seconds
4. The “pivot” index in the new int **array variable is moved to array[0] (on creation), so can merge i=1:index-1 and i=index+1 into single loop: Effect: 10.2 seconds
5. array[0] or “index” is removed completely into a boolean array, AND
6. Inspection of the elements of remaining int **array show that values are only ever ‘-1’ and ‘1’ – now a boolean array, AND
7. Using the reverse of the trick I use with Binary Arrays the int **array is converted to a single int *array: Effect: 5.0 seconds

At this stage, the code now looks like:

/*
 *  Take to boolean array of length n and create a single integer
 *
 *  Still looking for a better way to do this.
 */
int PackBools (bool* v, int n)
{
    int mask = 1, result = 0;
    for (int i = 0; i < n; i++, mask <<= 1)
    {
        if (v[i] == true) 
            result |= mask;
    }
    return result;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
    for (int i = 0; i<F_sizeQ; i++)
    {        
        counter = 0;
        // Add all elements on left of pivot
        for (int j = 0; j < index; j++)
        {
            if (F->q[i].node_vals_ptr[j] > 0)
                array[counter] = true;    
            else
                array[counter] = false;        
            counter++;    
        }
 
        // Put the pivot in the first element of the row
        if (F->q[i].node_vals_ptr[index] > 0)
            doWork[i] = true;    
        else
            doWork[i] = false;        
 
 
        // Add all elements on right of pivot
        for (int j = index+1; j < F_maxSize; j++)
        {
            if (F->q[i].node_vals_ptr[j] > 0)
                array[counter] = true;    
            else
                array[counter] = false;        
            counter++;             
        }
        // Turn the boolean array back into a single integer
        final[i] = PackBools(array, F_maxSize-1);
    }
 
    for ( int i = 0; i < F_sizeQ; i++ )
    {
        for ( int j = i+1; j < F_sizeQ; ++j )
        {
            if (doWork[i] ^ doWork[j])
            {
                if ( final[i] == final[j])
                {
                    // Do that barely registering compute work
                    // Code removed
                }
            }   
        }
    }

The VTune results here say that still 90% of the runtime is being spent on lines 36 (0.95s), 38 (2.23s) and 40 (1.36s).

To remove the if statements, replace all code from line 34 to 47 with:

34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
    bool toProcess;
    int *processIndexes = new int[F_sizeQ*(F_sizeQ+1)/2];
    counter = 0;
    for ( int i = 0; i < F_sizeQ; i++ )
    {
        for ( int j = i+1; j < F_sizeQ; ++j )
        {
            toProcess = (final[i] == final[j]) &&  ( doWork[i]^doWork[j] );
            processIndexes[counter] = j*F_sizeQ + i;
 
            // Increment the counter if the result is valid
            counter += toProcess;
        }
    }
 
    int i, j;
    for (int m = 0; m<counter; m++)
    {
        i = processIndexes[m]/F_sizeQ;
        j = processIndexes[m]%F_sizeQ;
 
        // Do the, now registering, compute work
    }

Total Compute time is now down to 1.5 seconds.

I thought the counter increment and array indexing was cool. I now only have an array of indexes that need to be processed. On line 41, the most common ‘if’ fail will short circuit early so no array increment occurs.

It should be noted that the nested loops above are still responsible for most of the runtime but when I parallelised the rest of the code, running on a compute node (dual X5650’s) with hyperthreading turned on made the code run 25% slower than running without the hyperthreading. That is running with 24 logical cpus compared to running 12 cpus. One iteration of the loop on line 39 takes three (3) cycles.

The rest of the code above was then fine tuned and parallelised with pthread.h. The result was that the jobs should take about 155 hours to complete. That is a big reduction on the original, estimated, Matlab code time of 186, 000 hours (21.2 years).

 Leave a Reply

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

(required)

(required)

Human Conf Test * Time limit is exhausted. Please reload CAPTCHA.