[PyIndexer] MySQL indexing performance
From: Marcus C. <ma...@wr...> - 2001-12-17 14:47:56
A few notes on index population in the MySQL implementation:

0. The time to index each document really does need to be improved,
especially since reads on the textindex table are largely blocked
during that time.

1. All the indexing is already within a transaction block per document.
LOAD DATA should still increase INSERT performance, as might using
MySQL's extended INSERT syntax.

2. Indexing on only the first 32 characters of dictionary.word, instead
of the full length, resulted in an overall 5% increase in performance,
with a < 1% loss of performance in the 'words to wordids' block. It is
not yet clear why the lookup got _slower_; that would take more tests,
including tests with other key lengths... (Only one test performed.)

3. Deferring creation of the textindex indexes over prev_textindex_id
and (word_id, prev_word_id) -- in other words, initially creating
indexes only for the identity columns and dictionary.word -- did not
produce a pronounced improvement (< 5%) in textindex generation with
0.7M rows. Overall performance actually suffered (by 5%). This could be
because MySQL creates a temporary working copy of the table so that it
can index it in a consistent state. In any case, this wouldn't be
feasible in real-world practice... (Only one test performed.)

4. I haven't yet played with recording the document count for
dictionary words, but that will increase the time a bit. How much
depends...

More positive test results on searching to follow...

Cheers
-- Marcus
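To make point 1 concrete, here is a rough sketch of the three INSERT
strategies. The column list is an assumption based on the column names
mentioned elsewhere in this note, and the file path is hypothetical:

```sql
-- Current approach: one INSERT per row, inside a per-document
-- transaction. Each statement is parsed and planned separately.
INSERT INTO textindex (word_id, prev_word_id) VALUES (17, 12);
INSERT INTO textindex (word_id, prev_word_id) VALUES (23, 17);

-- MySQL's extended INSERT syntax: many rows in one statement,
-- so parsing and index maintenance are amortized over the batch.
INSERT INTO textindex (word_id, prev_word_id)
VALUES (17, 12), (23, 17), (31, 23);

-- LOAD DATA INFILE: MySQL's fastest bulk-load path; rows come
-- from a tab-separated file written out by the indexer.
LOAD DATA INFILE '/tmp/textindex.tsv'
INTO TABLE textindex (word_id, prev_word_id);
```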
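The prefix index from point 2 would look something like the following
(the index names are assumptions; only one of the two indexes would
exist at a time):

```sql
-- Full-length index on the word column (the baseline):
CREATE INDEX idx_word_full ON dictionary (word);

-- Prefix index covering only the first 32 characters: shorter key
-- entries make the index cheaper to maintain during bulk inserts,
-- at the cost of extra row lookups when two words share the same
-- 32-character prefix -- a plausible explanation for the small
-- slowdown seen in the 'words to wordids' lookup.
CREATE INDEX idx_word_prefix ON dictionary (word(32));
```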
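Point 3's deferred-indexing experiment amounts to something like the
DDL below. This is only a sketch -- the actual PyIndexer schema and
index names may differ:

```sql
-- Create textindex with only the identity (primary key) index...
CREATE TABLE textindex (
    textindex_id      INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    word_id           INT NOT NULL,
    prev_word_id      INT NOT NULL,
    prev_textindex_id INT NOT NULL
);

-- ...bulk-load all the documents, then build the secondary
-- indexes in a single pass at the end:
CREATE INDEX idx_prev  ON textindex (prev_textindex_id);
CREATE INDEX idx_words ON textindex (word_id, prev_word_id);
```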