Re: [Algorithms] pattern search for large integer
From: Jason H. <jas...@di...> - 2007-03-26 05:28:50
Drakie Awita wrote:
> I have a large list of integers; each integer is 128 bits in size.
> How can I answer this query efficiently:
>
> Find the integers whose bit X equals 1, bit Y equals 0, and bit Z
> equals 1 (0 <= X < Y < Z <= 127); i.e., find the integers in a large
> list matching the bit pattern "*1*0*1*", where the '1', '0', and '1'
> appear at fixed bit positions X, Y, and Z. The values of X, Y, and Z
> are arbitrary within [0, 127] and show no observable pattern.
>
> The brute-force method of iterating through the list with a bit-mask
> check works, but it is O(n), which is not practical when the list
> holds millions of integers and there are multiple bit patterns to
> match against the same list.
>
> Is there a faster algorithm?

You do not specify much about the parameters of the problem: whether
the bit masks are chosen dynamically or come from a finite (and
precomputable) set of possibilities, or whether there is a meaningful
way to store only the elements that matter to you during these
searches.

Assuming the worst, that the integers are generated dynamically and
the masks are composed dynamically at query time, you can probably
still do a little better. Consider storing the data in a swizzled
(bit-sliced) format, where all the (integer & 1) bits are stored in
one packed buffer whose bit indices correspond to the indices of the
original integers. Do the same for (integer & 2), and so on, leaving
you with 128 buffers of bits; a rough sketch follows at the end of
this message. This means, however, that for 128-bit integers each new
value must be masked apart and pushed onto 128 different buffers. That
may or may not be acceptable, depending on the generation rate versus
the query rate of your data.

A query then reduces to combining three of those buffers word by word
(second sketch below). This is a lot faster than the naive scan, even
though it is still O(n) in the worst case: on large sparse sets you
can trivially reject a wholly zero word, and the cache coherency of
the packed buffers will easily make it an order of magnitude faster,
even if you are reduced to O(n) barrel shifting.

If you can preprocess the data further and know there is a finite set
of queries that matter more than the others, build a Markov-like
chain: any X that is 1 keeps a list of buckets whose Y is 0 and whose
Z is 1. Pre-cache only the patterns that cover the largest amount of
valid data, and remove their members from the list that the remaining
queries search, effectively shrinking the n in the O(n) scans (third
sketch below).

There are many approaches. It all depends on your usage pattern, and
on how critical a path this is.

Thanks,
JH
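
P.S. To make the swizzled layout concrete, here is a minimal C++
sketch. All of the names (BitSliceSet, plane, push) are my own
invention, and it assumes 64-bit words; treat it as an illustration of
the layout rather than a tuned implementation.

#include <cstdint>
#include <vector>

// Bit-sliced ("swizzled") storage: plane[b] packs bit b of every
// integer, 64 per uint64_t.  Integer i's bit b lives in
// plane[b][i / 64] at bit position (i % 64).
struct BitSliceSet {
    std::vector<uint64_t> plane[128]; // one packed buffer per bit position
    std::size_t count = 0;            // number of integers stored

    // Append one 128-bit value given as low/high 64-bit halves.
    // As noted above, every insertion touches all 128 planes.
    void push(uint64_t lo, uint64_t hi) {
        std::size_t word = count / 64, bit = count % 64;
        for (int b = 0; b < 128; ++b) {
            if (word >= plane[b].size())
                plane[b].push_back(0);
            uint64_t v = (b < 64) ? (lo >> b) : (hi >> (b - 64));
            plane[b][word] |= (v & 1) << bit;
        }
        ++count;
    }
};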
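
Querying the structure above for the "*1*0*1*" pattern is then one
AND / AND-NOT pass over three planes, with the trivial rejection of
zero words mentioned earlier. This sketch builds on BitSliceSet above
and assumes C++20 for std::countr_zero; on older compilers something
like GCC's __builtin_ctzll does the same job.

#include <bit> // C++20, for std::countr_zero

// Matches are exactly the positions where plane[X] has a 1,
// plane[Y] a 0, and plane[Z] a 1, so one masking operation covers
// 64 integers at a time.
std::vector<std::size_t> query(const BitSliceSet& s, int X, int Y, int Z) {
    std::vector<std::size_t> hits;
    std::size_t words = (s.count + 63) / 64;
    for (std::size_t w = 0; w < words; ++w) {
        uint64_t m = s.plane[X][w] & ~s.plane[Y][w] & s.plane[Z][w];
        if (m == 0) continue;          // trivially reject a wholly zero word
        while (m) {                    // enumerate the surviving indices
            hits.push_back(w * 64 + std::countr_zero(m));
            m &= m - 1;                // clear the lowest set bit
        }
    }
    return hits;
}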
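
And one simple reading of the pre-caching idea, as a memo cache over
hot (X, Y, Z) patterns. Again everything here is hypothetical, and the
step of removing cached members from the residual scan list is left
out for brevity.

#include <map>
#include <tuple>

// Hypothetical cache for the query patterns that matter most: compute
// the hit list once via the query() sketch above, then serve repeats
// of the same (X, Y, Z) pattern without rescanning the planes.
struct QueryCache {
    std::map<std::tuple<int, int, int>, std::vector<std::size_t>> cached;

    const std::vector<std::size_t>&
    get(const BitSliceSet& s, int X, int Y, int Z) {
        auto key = std::make_tuple(X, Y, Z);
        auto it = cached.find(key);
        if (it == cached.end())
            it = cached.emplace(key, query(s, X, Y, Z)).first;
        return it->second;
    }
};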