Jun 14, 2011
 

Ok, so I had some code that created a matrix where each row was a binary increment on the previous row.  Given a number, B, the matrix formed had B columns and 2**B rows.  This matrix then underwent various manipulations and then some multiplication.

For example, if B=3, the matrix is:
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1

Not so bad really, except: what if B went as high as 30?  That’s right, 2**30 rows multiplied by B columns of doubles. Or, specifically, 30*1073741824*8 bytes.
240 GiB (roughly 258 GB) of memory required.
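
To make the scaling concrete, here is a minimal sketch of the naive approach that materialises the whole matrix up front. This is my own illustration, not the original code, and the names are hypothetical:

#include <cstdio>
#include <cstddef>

int main()
{
    const int B = 3;                                  // try B = 30: the allocation alone is the problem
    const std::size_t rows = std::size_t(1) << B;

    double *M = new double[rows * B];                 // 2^B rows * B columns of doubles
    for (std::size_t r = 0; r < rows; ++r)
        for (int c = 0; c < B; ++c)
            M[r * B + c] = (r >> (B - 1 - c)) & 1;    // row r is just r written in binary

    std::printf("%zu bytes\n", rows * B * sizeof(double));
    delete[] M;
    return 0;
}

For B = 3 this is harmless; for B = 30 the new[] alone is asking for roughly 240 GiB.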

There has to be a better way.  Well, there is, and it isn’t just swapping char for double (that still needs 30 GiB, about 32.2 GB, and it introduces casting later during the computation):

Generate each row of the matrix on the fly. Below is code that takes an integer and converts it into a binary array (of chars). It takes about 2 cycles per call.

/*
 *  For each byte, put the least significant bit into 'h', ... , most significant into 'a'
 */
typedef struct packed_byte 
{
    unsigned char a : 1;
    unsigned char b : 1;
    unsigned char c : 1;
    unsigned char d : 1;
    unsigned char e : 1;
    unsigned char f : 1;
    unsigned char g : 1;
    unsigned char h : 1;
} packed_byte;
 
/*
 *  For a given integer, put the least significant byte into b3, ... , most significant byte into b0
 */
typedef struct packed_int 
{
    packed_byte b0;
    packed_byte b1;
    packed_byte b2;
    packed_byte b3;
} packed_int;
 
/*
 * The union here is the neat trick ... zero cost with the cpu
 *
 *  Not much in life is free .... except bit operators :)
 */
typedef union 
{
    unsigned int i;
    packed_int b;
} packed;
 
/*
 *  This is the time consuming bit - roughly 2 cycles for each call
 *
 *  I'm still hunting through the Intel Architecture Optimization Reference Guide
 *  to figure it out exactly (and to speed it up).
 */
inline void GetBitArray(char *a, packed_int b)
{
    a[7]  = b.b3.a;
    a[6]  = b.b3.b;
    a[5]  = b.b3.c;
    a[4]  = b.b3.d;
    a[3]  = b.b3.e;
    a[2]  = b.b3.f;
    a[1]  = b.b3.g;
    a[0]  = b.b3.h;
 
    a[15] = b.b2.a;
    a[14] = b.b2.b;
    a[13] = b.b2.c;
    a[12] = b.b2.d;
    a[11] = b.b2.e;
    a[10] = b.b2.f;
    a[9]  = b.b2.g;
    a[8]  = b.b2.h;
 
    a[23] = b.b1.a;
    a[22] = b.b1.b;
    a[21] = b.b1.c;
    a[20] = b.b1.d;
    a[19] = b.b1.e;
    a[18] = b.b1.f;
    a[17] = b.b1.g;
    a[16] = b.b1.h;
 
    a[31] = b.b0.a;
    a[30] = b.b0.b;
    a[29] = b.b0.c;
    a[28] = b.b0.d;
    a[27] = b.b0.e;
    a[26] = b.b0.f;
    a[25] = b.b0.g;
    a[24] = b.b0.h;
}
 
#include <cmath>   // for pow() -- (1 << B) would avoid the double-to-int cast

int main(int argc, char **argv)
{
    int B = 3;
    int num_combs = pow(2, B);
    packed p;
    char *bits = new char[32];
    for (int i = 0; i < num_combs; i++)
    {
        p.i = i;
        GetBitArray(bits, p.b);
        // 'bits' is now an array with the structure of each
        // row of the matrix required (padded with '0's out to 32 entries).
    }
    delete[] bits;
    return 0;
}

Now, some of those structs don’t exactly look pretty and neither does that GetBitArray function, but it works and it works fast. The job now uses next to no memory and runs much faster. Will post back some speedup times a bit later on.
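
For anyone wanting to reproduce the comparison, the obvious baseline is a plain shift-and-mask version (the approach the comments below reckon is too slow). This is my own sketch, not from the post; note that bit-field and byte layout inside the union is compiler and platform dependent, so check the assumed output ordering against GetBitArray on your machine.

// Hypothetical shift-and-mask equivalent of GetBitArray, for benchmarking.
// Assumes bits[k] should end up holding bit k of the integer (LSB first),
// which is the ordering the struct comments above describe.
inline void GetBitArrayShift(char *bits, unsigned int i)
{
    for (int k = 0; k < 32; ++k)
        bits[k] = (i >> k) & 1;
}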

Links

Intel Architecture Optimization Reference Guide
Other nice bit manipulation tricks

  10 Responses to “Binary Permutation Generator”

  1. Unrelated thought. Someone needs to introduce a bit type. And a pointer to this type.

    • +1

      That GetBitArray sucks. If I need 64-bit I have to code that up. You can’t use arrays of bit-fields, so everything has to be manually coded.

      I’ve used this code for heaps of things and cannot believe how difficult it is to get the 0’s and 1’s out of a stored type. Bit shifting is too slow.

  2. pow(2,B) is ok if you call it once. If you aim to loop, I imagine you’d assign 1 << B to num_combs. Does this also save an implicit cast, I wonder?

  3. Been a while since I’ve coded some C. Can you macro GetBitArray?

  4. Would be interesting to see how (in)efficient Ruby is, by way of comparison. For example:


    B = 3
    RowSize = 4

    for i in 0...(1<<B)
      a = i.to_s(2).rjust(RowSize, "0").split(//).to_a # Ruby 1.8.6
      #a = i.to_s(2).rjust(RowSize, "0").chars.to_a # Ruby 1.8.7 ?
      puts a.inspect
    end

    • Any way to inspect the assembly to count the instructions and latencies?

      One iteration for another project made more than 7 trillion calls to getbitarray.

      • I don’t think Ruby will be faster! But you have to love the language for its brevity.

        Ruby profiler, with instructions, here: http://ruby-prof.rubyforge.org/

        In C++, it would be nice to have enough memory to use, for example, a per-byte lookup table or something… If anything, another nice comparison…

        • I was talking to an old computer science guy this morning who suggested something new:
          Create an array of masks, then just AND the index with the mask for the particular bit you require.

          Will work well for when I only need the n’th bit. Will code that one up a bit later.

          • Cool. Something like:


            B = 3
            RowSize = 8
            b = Array.new(RowSize)
            for i in 0...(1<<B)
              for j in 0...RowSize
                b[RowSize-j-1] = i & (1<<j) == 0 ? 0 : 1
              end
              puts b.inspect
            end
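
            In C, the same mask-array lookup might look roughly like this (a hypothetical sketch; the names are mine, not code from the thread):

            // Precompute one mask per bit; a single AND then tests the n'th bit of the index.
            static unsigned int mask[32];

            void InitMasks(void)
            {
                for (int n = 0; n < 32; ++n)
                    mask[n] = 1u << n;
            }

            inline int GetBit(unsigned int i, int n)
            {
                return (i & mask[n]) ? 1 : 0;
            }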

  5. Or, for an array of ints:


    a = i.to_s(2).rjust(RowSize, "0").scan(/\d/).map { |j| j.to_i } # Ruby 1.8.6
