Monday, July 2, 2007

fls implementation

fls stands for find last bit set for an unsigned int.
For example, the function signature looks like unsigned fls( unsigned ); meaning that it returns the index of some unsigned integer that you pass to it.

There are a lot of approaches to this. (I am not going to point out the binary search/scan across the unsigned integer, though it is the classical solution)

First and foremost, I will explain what x & ~(x-1) actually means.
When x is a power of 2, the check if x & (x-1) == 0 is a trick that is employed to verify that x is indeed a power of 2. But why? What is actually taking place...

The simple fact is that x-1 is actually taking the right most (LSB) that is a '1' and converting it to a '0' and setting all bits after (to the right) it to a '1'.

This invariably explains why if x is a power of 2 that it sets all values behind that bit to a '1' and the bit that is initially set to a '0'. Therefore, if that is the case, then what does x & ~(x-1) mean?

Well, that's simple. If x-1 yields the lsb that is set to a '0' that means that is the first value that is not the same in x and x-1. That also explains that every bit above (msb) it are the same and everything below (lsb) it are the opposite. Therefore, if you flip the bits of x-1, which is ~(x-1), and AND (&) it with x, then the bits above the lsb that is set are different, and the bits that were set to '1's below the lsb that was set get set to '0's. But the lsb that was set in x & the lsb that was a '0' in x-1 when flipped gets turned to a '1' again, and therefore is the only bit that is the same.

So there we have it, x & ~(x-1) actually yields only one bit set, namely that of lsb that was a '1'.
Therefore, if we know this, there is a simple algorithm to perform:


unsigned fls( unsigned x ) {
return (unsigned)log2( (double) (x & ~(x-1) );
}


This will give us the index of the last bit that is set.

However, this case is slow because it uses the SIMD on Intel's architecture. Context switches and first-use faults yields this operation to be very slow, compared to the other solutions that exist that obviously try to speed the fls() operation up.

However, there is another solution. A solution that is quite unique. Something that is quite a hack, but the speed is incredibly astounding. Let's consider a char, call it c.
If c is split into 3-bit chunks, and only one bit is set in c, then it's rather easy to find the index by dividing by 2. For example,

100 = 4
4/2 = 2 which happens to be the index of the bit set

010 = 2
2/2 = 1 which happens to be the index of the bit set

001 = 1
1/2 = 0 which happens to be the index of the bit set

Therefore, we proved it for small numbers, shifting with correct values will give the correct result by adding the shift offset. This is index quite fast; however dividing by 2 is the same as right shifting by 1. So instead of right shifting by multiples of 3 and then dividing by 2, we can right shift by the multiple of 3 + 1. The implementation is below:

#define BITS 8
unsigned fls( unsigned x ) {
int y = 0;
x &= ~(x-1);
for( int i = 0; i < y; ++i ) {
y = (x>>(BITS*i)) & 0xff;

// make sure it's not zero (0)
if( !y ) continue;

int ret = (y&07) ? (y&amp;amp;amp;07)>>1 :
(y&070) ? ((y&070)>>4) + 3 :
(y&0700) ? ((y&0700)>>7) + 6;

return ret + (BITS*i);
}
return 0;
}


Therefore, (BITS*i) shifts the starting index to start on the valid byte assignment.

No comments: