[text-similarity-users] Taking many similarity measurements between two corpora

I need pairwise similarity measurements between two corpora (M x N), and I'm wondering how efficiently Text::Similarity can do this.

Here's the task.  We're transitioning a piece of hardware from one program to another.  The hardware was built to the old program's requirements (roughly 300 of them).  The new program has its own requirements (roughly 600).  Each requirement is about 100 words.

I'm supporting a gap analysis.  One task in the gap analysis can be stated as:
*       For each old requirement, find up to 3 new requirements that are most similar to it.

Example: Suppose I have an old requirement that reads "The Delivery-unit shall fold to a stowage volume that will fit within the Transport Bag dimensions of 48 by 20 by 14 inches and allow space for foam cushioning material."  Then I want to find any new requirements that are talking about delivery-units, stowage volume, dimensions, transport bags or foam cushioning.

To do this, I want the pairwise similarity scores between all the old and new requirements, roughly 300x600 = 180,000 comparisons.  I suspect that invoking
$score->[278]->[459] = $mod->getSimilarity ($reqtOld_278, $reqtNew_459);
isn't the best way to do this.  E.g., it would call sanitizeString on each old requirement 600 times.
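
To make the concern concrete, here is a minimal sketch of the restructuring I have in mind: clean up each requirement once (M + N clean-ups instead of one per comparison), then score every pair from the cached word counts and keep the best 3. It does not use Text::Similarity's internals. The names tokenize, overlap_score and top_matches are hypothetical, and the Dice-style word overlap is only a stand-in for whatever measure the module actually computes.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module's clean-up step: lowercase,
# split on non-alphanumerics, and count words. Not Text::Similarity code.
sub tokenize {
    my ($text) = @_;
    my %count;
    $count{$_}++ for grep { length } split /[^a-z0-9]+/, lc $text;
    return \%count;
}

# Hypothetical score: Dice coefficient over bag-of-words counts (0 to 1).
sub overlap_score {
    my ($x, $y) = @_;
    my ($shared, $len_x, $len_y) = (0, 0, 0);
    $len_x += $_ for values %$x;
    $len_y += $_ for values %$y;
    return 0 unless $len_x && $len_y;
    for my $w (keys %$x) {
        next unless exists $y->{$w};
        $shared += $x->{$w} < $y->{$w} ? $x->{$w} : $y->{$w};
    }
    return 2 * $shared / ($len_x + $len_y);
}

# For each old requirement, return up to $k [new_index, score] pairs,
# best first, skipping new requirements with no overlap at all.
sub top_matches {
    my ($old, $new, $k) = @_;
    my @old_tok = map { tokenize($_) } @$old;   # M clean-ups, done once
    my @new_tok = map { tokenize($_) } @$new;   # N clean-ups, done once
    my @result;
    for my $i (0 .. $#old_tok) {
        my @scores = map { overlap_score($old_tok[$i], $_) } @new_tok;
        my @ranked = sort { $scores[$b] <=> $scores[$a] || $a <=> $b }
                     grep { $scores[$_] > 0 } 0 .. $#scores;
        splice @ranked, $k if @ranked > $k;
        push @result, [ map { [ $_, $scores[$_] ] } @ranked ];
    }
    return \@result;
}
```

Called as top_matches(\@old_reqts, \@new_reqts, 3), this still makes all 180,000 comparisons, but each one works from precomputed word counts rather than re-cleaning the raw text.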

Am I missing something?  Is there already a way to iterate efficiently over such a pair of corpora?

David Throop