The data in the Web 1T corpus is ordered, but we currently treat every entry in it as entirely independent. If we were to track the selected nodes from each preceding query and only process anew from the token where the queries differed, considerable speedups may be possible, as many hashtable lookups would be avoided.
This would require storing the tokens from the previous line as well as the nodes they matched. The previous tokens always remain in the buffer(s), so by keeping two tstart arrays we can alternate between them from one query to the next. We would need n node storage tables, holding up to 2^x nodes at each position x = 1..n, and the number of matching nodes at each position would also have to be stored. A simple strcmp or strcasecmp on each token would suffice to establish whether it matches the corresponding token of the previous query, and we would resume processing from the previous level as soon as we hit the first differing token.
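A minimal sketch of how this caching might fit together is given below. The node_t type and the find_child() and process_from_level() routines are hypothetical stand-ins for the existing hashtable lookup and query-processing code, and for brevity a single matched node is kept per level rather than the up-to-2^x nodes per position discussed above; only tstart, the strcmp/strcasecmp comparison, and the alternation of the two arrays come from the description itself.

    /* Sketch of query-to-query prefix caching.  node_t, find_child() and
     * process_from_level() are placeholder stand-ins for the real code. */
    #include <string.h>

    #define MAX_N 5                     /* Web 1T n-grams span at most 5 tokens */

    typedef struct { int id; } node_t;  /* placeholder node type */

    /* Stand-in for the existing hashtable lookup at one level. */
    static node_t *find_child(node_t *parent, const char *token)
    { (void)parent; (void)token; return NULL; }

    /* Stand-in for the existing per-query processing. */
    static void process_from_level(int level, node_t **nodes, int ntok)
    { (void)level; (void)nodes; (void)ntok; }

    static char   *tstart[2][MAX_N];    /* two token-pointer arrays, alternated per query */
    static node_t *matched[MAX_N];      /* one matched node per level in this sketch */
    static int     prev_ntok = 0;       /* token count of the previous query */
    static int     cur = 0;             /* which tstart array holds the current query */

    /* Assumes the tokenizer has already filled tstart[cur][0..ntok-1]. */
    void run_query(int ntok)
    {
        int prev = cur ^ 1;             /* the other array still points at the previous tokens */
        int level = 0;

        /* Skip every leading token identical to the previous query;
         * strcasecmp() would be used here for case-insensitive matching. */
        while (level < ntok && level < prev_ntok &&
               strcmp(tstart[cur][level], tstart[prev][level]) == 0)
            level++;

        /* Redo hashtable lookups only from the first differing token onward. */
        for (int i = level; i < ntok; i++)
            matched[i] = find_child(i ? matched[i - 1] : NULL, tstart[cur][i]);

        process_from_level(level, matched, ntok);

        prev_ntok = ntok;
        cur ^= 1;                       /* tokenize the next query into the other array */
    }

On the first query prev_ntok is zero, so the comparison loop is skipped and every level is looked up from scratch; on later queries, only the levels after the shared prefix incur hashtable lookups.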