As a note to self (and others):
I've experienced large differences in speed when calculating similarities between a large number of word pairs given via the --file option to similarity.pl. In my case every word was compared to every other word, e.g. the file looked like this:
roll roll
roll cutting
roll feeding
roll length
...
cutting cutting
cutting feeding
cutting length
...
feeding feeding
feeding length
...
Don't know why, but it is much faster if you sort the file by the second field (sort +1 -2 inputfile > outputfile).
Best regards,
Paul-Armand Verhaegen
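For anyone who wants to try the reordering described above: `sort +1 -2` is the older key syntax, and newer versions of sort may not accept it. The equivalent with the current syntax is `sort -k2,2 inputfile > outputfile`. The same reordering can also be sketched in Python. This is only an illustration, not part of similarity.pl: the function name `sort_pairs_by_second_word` is made up for this example, and it assumes every non-blank line holds exactly two whitespace-separated words.

```python
def sort_pairs_by_second_word(lines):
    """Return word-pair lines ordered by their second field.

    Assumption: each non-blank line is "word1 word2".
    Python's sort is stable, so pairs sharing a second word keep
    their original relative order (sort(1) may break ties differently).
    """
    pairs = [line.split() for line in lines if line.strip()]
    pairs.sort(key=lambda pair: pair[1])
    return [" ".join(pair) for pair in pairs]


# Small in-memory example using the words from the post.
example = [
    "roll roll",
    "roll cutting",
    "roll feeding",
    "cutting cutting",
    "cutting feeding",
    "feeding feeding",
]
sorted_example = sort_pairs_by_second_word(example)
```

After sorting, all pairs with the same second word sit next to each other, which is the layout reported above to run much faster. Why it is faster is not explained in this thread.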
This is a fascinating observation, and off the top of my head I can't explain this behavior...but, we'll check into it!
Cordially,
Ted