Thanks for your work - your library is gold!
I need to compare 150,000 names against a list of 13,000 names. I am using QGrams (QGramsDistance.BatchCompareSet), which appears to give me the best matching behaviour – but each call takes around three seconds (I load up the 13,000 strings, then loop through the 150,000 calling BatchCompareSet). I suspect the bulk of the time is BatchCompareSet repeatedly tokenizing the 13,000 strings on every call. Is there a way in your library to cache the tokenized list and pass it in to improve performance? I don’t think BatchCompareSets is any help, as it looks like it simply compares the first row of each array with the other.
BatchCompareSet is quick code; as you say, the tokens are not currently cached. The tokenisation occurs in the compare function itself (and not for all metrics), so no caching is employed. It was written with simpler functions in mind, without thought for caching.
The changes you suggest would improve your performance, but under these circumstances, if you know you are only using one metric (your case), I would instead recommend stripping out the comparison function and putting the loop after the tokenisation inside the function itself. More hassle perhaps, but there is no way to add caching at the moment (no time), and if done routinely it would slow the normal operation of the function for simple comparisons. Good luck - if you do do anything interesting with this, please send the source, as I would be grateful.
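To illustrate the "tokenise once, then loop" idea, here is a rough sketch in Python (not the library's actual API; the helper names and the Dice-style similarity formula are my own illustrative choices, not necessarily the exact Q-Grams formula used by the library):

```python
from collections import Counter

def qgrams(s, q=3, pad="#"):
    # Pad the string so leading/trailing characters form full q-grams,
    # then slide a window of length q across it.
    s = pad * (q - 1) + s + pad * (q - 1)
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def qgram_similarity(c1, c2):
    # Dice-style overlap over q-gram multisets (illustrative only).
    common = sum((c1 & c2).values())
    total = sum(c1.values()) + sum(c2.values())
    return 2 * common / total if total else 1.0

# Tokenise the 13,000-name reference list ONCE, outside the loop.
refs = ["smith", "smyth", "jones"]
cached = [(name, Counter(qgrams(name))) for name in refs]

def best_match(query, cached):
    # Only the query is tokenised per call; the reference tokens are reused.
    qc = Counter(qgrams(query))
    return max(cached, key=lambda rc: qgram_similarity(qc, rc[1]))

print(best_match("smithe", cached)[0])  # expected: "smith"
```

The point is that the expensive tokenisation of the large list moves out of the inner loop entirely: for 150,000 queries against 13,000 names, the reference list is tokenised once rather than 150,000 times.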
Another method (again, no time) would be even better: for Q-Grams in particular, simply treat the input strings as char arrays and, starting at position 'n', count 'q' characters for comparison. No tokenisation would then be needed (i.e. no memory allocation), which would hugely speed things up. If you want to look at this, please let me know if you need advice, but I will likely not have time to be actively involved.
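A rough Python sketch of that windowing idea (illustrative only; the real payoff comes in a language where scanning char arrays directly avoids per-call token allocations, which Python slicing does not fully capture):

```python
def qgram_overlap(a, b, q=3):
    # Count q-character windows of `a` that also occur as a window of `b`,
    # scanning positions directly instead of materialising token lists.
    if len(a) < q or len(b) < q:
        return 0
    matches = 0
    used = [False] * (len(b) - q + 1)  # don't match a window of b twice
    for i in range(len(a) - q + 1):
        for j in range(len(b) - q + 1):
            if not used[j] and a[i:i + q] == b[j:j + q]:
                used[j] = True
                matches += 1
                break
    return matches

print(qgram_overlap("smith", "smiths"))  # 3: "smi", "mit", "ith" all occur
```

The overlap count can then feed whatever similarity normalisation you prefer; the key property is that nothing is allocated per comparison beyond the small `used` marker array.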