Feb 16 2011
 

Please note: part 2 is done and part 3 is on the way. In those I actually talk about real examples.

What is vectorisation?

Vectorisation is a special case of parallelisation in which a program that by default performs one operation at a time on a single thread is modified to perform multiple operations simultaneously, by applying one instruction to several data elements at once.
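To make the idea concrete, here is a minimal sketch in plain C++ of the shape a vectorised loop takes. The function names are made up for illustration, and the second function only mimics by hand what a vectoriser might produce (a main loop doing four independent additions per pass, plus a scalar remainder loop); it does not use real SIMD instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar form: one addition per loop iteration.
std::vector<float> add_scalar(const std::vector<float>& a,
                              const std::vector<float>& b)
{
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        c[i] = a[i] + b[i];
    }
    return c;
}

// Illustration only: four independent additions per pass, which is the
// pattern a vectoriser could map onto one 4-wide SIMD add, followed by a
// scalar loop for the elements left over when n is not a multiple of 4.
std::vector<float> add_four_at_a_time(const std::vector<float>& a,
                                      const std::vector<float>& b)
{
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<float> c(n);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        c[i + 0] = a[i + 0] + b[i + 0];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; ++i)
    {
        c[i] = a[i] + b[i];
    }
    return c;
}
```

Both functions give the same answer; the point is only that the four additions inside the main loop do not depend on each other, which is what lets hardware do them at once.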

[Figure: Vectorisation]

Little update given the error with the Z subscripts:

The vectorizer detects operations in the program that can be done in parallel, and then converts the sequential operations into a single SIMD instruction that processes 2, 4, 8 or up to 16 elements in parallel, depending on the data type.  The options -vec and -no-vec enable or disable vectorization and the transformations enabled for vectorization; the default is enabled.  You can target particular processors with the -x and -ax flags.
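The "2, 4, 8 or up to 16 elements" figures follow from dividing the register width by the element size. A small sketch, assuming 128-bit registers (the helper name lanes and the 128-bit default are my assumptions for illustration, not something the compiler exposes):

```cpp
#include <cstddef>

// Number of elements of type T that fit in one vector register.
// The 128-bit default is an assumption (the width of an SSE XMM register).
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits = 128)
{
    return register_bits / (8 * sizeof(T));
}

// On a typical platform with 8-byte double, 4-byte float and 2-byte short:
//   lanes<double>() is 2, lanes<float>() is 4,
//   lanes<short>()  is 8, lanes<char>()  is 16.
```

So the narrower the data type, the more elements a single instruction can process.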

Vectorisation reports can be turned on with -vec-report[n], where n is one of:

  • 0 – no diagnostic information
  • 1 – vectorized loops (default)
  • 2 – vectorized and non-vectorized loops
  • 3 – vectorized and non-vectorized loops, plus any proven or assumed data dependencies
  • 4 – non-vectorized loops
  • 5 – non-vectorized loops and the reason why they were not vectorized
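Since the reports talk about "proven or assumed data dependencies", here is a rough sketch of what those can look like in code. The function names are hypothetical, and whether a given compiler vectorises any of these depends on the compiler and flags; this only shows the patterns.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Independent iterations: each one reads and writes only x[i],
// so several of them can safely be executed at once.
void scale(std::vector<float>& x, float factor)
{
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        x[i] *= factor;
    }
}

// Proven dependence: iteration i reads x[i - 1], which iteration i - 1
// has just written, so the iterations have to run in order.
void running_sum(std::vector<float>& x)
{
    for (std::size_t i = 1; i < x.size(); ++i)
    {
        x[i] += x[i - 1];
    }
}

// Assumed dependence: nothing here says that out and in do not overlap.
// If a caller passes out == in + 1, each iteration reads what the previous
// one wrote, so a compiler may have to assume the worst.
void add_one(float* out, const float* in, std::size_t n)
{
    assert(out != nullptr && in != nullptr);
    for (std::size_t i = 0; i < n; ++i)
    {
        out[i] = in[i] + 1.0f;
    }
}
```

The third case is the sneaky one: the loop body looks harmless, and the dependence only exists for some callers.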

Typically, you would use the following compile line to begin vectorising your code:

icpc -O2 -xHost -vec-report3

Let’s examine a very simple case:

#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
 
struct timeval start, end;   
 
 
/*
 *  Please note that this is not a great random generator.
 *
 *  Please contact me if you are after one.
 */
float closed_interval_rand(float x0, float x1)
{
    return x0 + (x1 - x0) * rand() / ((float) RAND_MAX);
}
 
 
/*
 * PrintElapsedTime
 *
 * Do you really need an explanation for this one?
 */
 
void PrintElapsedTime(timeval start, timeval end)
{
    long mtime, seconds, useconds; 
    seconds  = end.tv_sec  - start.tv_sec;
    useconds = end.tv_usec - start.tv_usec;
    mtime = ((seconds) * 1000 + useconds/1000.0) + 0.5;
    fprintf(stdout, "Elapsed Time: %ld milliseconds\n", mtime);
}
 
 
int main()
{
    // What size matrix?
    int n = 1500;
 
    // Declare
    float **A;
    float **B;
    float **C;
 
    // Allocate
    A = new float*[n];     
    for (int i = 0; i<n; i++)
    {
        A[i] = new float[n];
    }
    B = new float*[n];
    for (int i = 0; i<n; i++)
    {
        B[i] = new float[n];
    }
    C = new float*[n];
    for (int i = 0; i<n; i++)
    {
        C[i] = new float[n]();  // value-initialised to zero, since C is accumulated into with +=
    }
 
    // Create some dummy data
    for (int i = 0; i<n; i++)
    {
        for (int j = 0; j<n; j++)
        {
            A[i][j] = closed_interval_rand(0.0, 1.0);
            B[i][j] = closed_interval_rand(0.0, 1.0);
        }
    }
 
    // Start the timer
    gettimeofday(&start, NULL);
 
    // Start a matrix multiplication
    // Please note: I realise that this is not a cache-friendly implementation
    // Demonstration purposes only :)
    for (int i = 0; i<n; i++)
    {
        for (int j = 0; j<n; j++)
        {
            for (int k = 0; k<n; k++)
            {
                C[i][j] += A[i][k]*B[k][j]; 
            }
        }
    }
 
    // Stop the timer and print the results
    gettimeofday(&end, NULL);
    PrintElapsedTime(start, end);
 
    return 0;
}

First off, we will compile the code with no vectorisation so we can see the impact.

icpc -O2 -xHost -no-vec Main.cpp -o Go

Elapsed Time: 5655 milliseconds

Now compile again with vectorisation enabled and a report requested:

icpc -O2 -xHost -vec-report2 Main.cpp -o Go -limf

Main.cpp(47) (col. 5): remark: loop was not vectorized: existence of vector dependence.
Main.cpp(52) (col. 5): remark: loop was not vectorized: existence of vector dependence.
Main.cpp(57) (col. 5): remark: loop was not vectorized: existence of vector dependence.
Main.cpp(63) (col. 5): remark: loop was not vectorized: not inner loop.
Main.cpp(65) (col. 9): remark: loop was not vectorized: existence of vector dependence.
Main.cpp(78) (col. 5): remark: loop was not vectorized: not inner loop.
Main.cpp(82) (col. 13): remark: loop was not vectorized: not inner loop.
Main.cpp(80) (col. 9): remark: PERMUTED LOOP WAS VECTORIZED.

Elapsed Time: 1254 milliseconds
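The "PERMUTED LOOP WAS VECTORIZED" remark says the compiler reordered the loop nest before vectorising it. The report does not show the result, so what follows is only a hand-written sketch of what such an interchange might look like, assuming the j loop was moved innermost. The Matrix alias and function names are mine, and I use std::vector rather than the raw pointers of the listing to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Original i-j-k order: the inner loop over k walks down a column of B
// (B[k][j]), jumping to a different row on every iteration.
Matrix multiply_ijk(const Matrix& A, const Matrix& B)
{
    const std::size_t n = A.size();
    assert(B.size() == n);
    Matrix C(n, std::vector<float>(n, 0.0f));
    for (std::size_t i = 0; i < n; ++i)
    {
        for (std::size_t j = 0; j < n; ++j)
        {
            for (std::size_t k = 0; k < n; ++k)
            {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    return C;
}

// Permuted i-k-j order: the inner loop over j walks along a row of B and a
// row of C with unit stride, and A[i][k] is constant inside it. Consecutive
// elements are exactly what a SIMD load or store wants.
Matrix multiply_ikj(const Matrix& A, const Matrix& B)
{
    const std::size_t n = A.size();
    assert(B.size() == n);
    Matrix C(n, std::vector<float>(n, 0.0f));
    for (std::size_t i = 0; i < n; ++i)
    {
        for (std::size_t k = 0; k < n; ++k)
        {
            const float a = A[i][k];
            for (std::size_t j = 0; j < n; ++j)
            {
                C[i][j] += a * B[k][j];
            }
        }
    }
    return C;
}
```

Both orders compute the same product; only the memory access pattern of the innermost loop changes.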

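Now, I’m testing this on an X5650 (cat /proc/cpuinfo), a Westmere-generation Xeon whose widest vector registers are the 128 bit SSE (XMM) registers; it has no 256 bit AVX registers, and the MMX registers are only 64 bits wide. The code above uses float (32 bit), so four values fit in a register and I should expect an approx 4x speedup from vectorisation alone. Going from 5655 to 1254 milliseconds is about 4.5x, so the loop is vectorising about as well as can be hoped. Is the code now running as fast as the hardware allows?
To answer that, I should use VTune Amplifier to examine what is going on.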

[Figure: VTune General Exploration]

It can be seen from the above diagram that my CPI (cycles per instruction) is pretty high.  Now is not the time to go into it, but I should be expecting this figure to be at most 0.125 on this processor; currently my instructions are taking at least 4 times longer to complete. The last-level cache (LLC) metric indicates that a high number of cycles were spent waiting for LLC load misses to be serviced.

Please note that vectorisation is not limited to the Intel Compiler; the GNU Compiler can do it as well, but I am not as familiar with it.

Vectorisation II
