From: Throop, D. R. (JSC-ER)[J. Technology] <dav...@na...> - 2011-07-12 17:02:20
|
Thanks! I would have responded faster, but your message wandered into my spam filter, where I just noticed it.

I tore your code apart and was able to cache the computation that only affected one of the strings (the tokenization etc) so that the only computation inside the 600 x 300 loop was the direct comparison.

Eventually I'll ask NASA for the forms to allow me to post it on CPAN.

David Throop

-----Original Message-----
From: Ted Pedersen [mailto:tpederse@d.umn.edu]
Sent: Wednesday, June 15, 2011 4:38 PM
To: tex...@li...
Subject: Re: [text-similarity-users] Taking many similarity measurements between two corpora

Hi David,

Nice question, and unfortunately I don't think there is a particularly better way to do what you propose, other than a long series of pairwise comparisons. That said, I ran something of the same dimensionality that you want to do (600 x 300) and the following script took 2.5 hours on a 5 year old desktop...so, if this isn't something you need to do on a regular basis, maybe it works out ok....

Below is my timing output...

ted@linux-zxku:~> time bash runit.sh

real 156m55.322s
user 124m11.270s
sys 24m30.416s
ted@linux-zxku:~>

And then there is the script I ran - I just took a file and made 600 individual 1 line files, and then did a bunch of pairwise similarities with our command line tool. Using the API would in effect result in the same thing...

ted@linux-zxku:~> more runit.sh
-----------------------
for line in {1..600..1}
do
  head -$line text | tail -1 > text.$line
done

for linea in {1..600..1}
do
  for lineb in {1..300..1}
  do
    text_similarity.pl --type Text::Similarity::Overlaps text.$linea text.$lineb >> text.output
  done
done

for line in {1..600..1}
do
  rm text.$line
done
-----------

I hope this helps...please feel free to let us know of any additional questions that might arise.

Cordially,
Ted

On Tue, Jun 14, 2011 at 3:52 PM, Throop, David R.
(JSC-ER)[Jacobs Technology] <dav...@na...> wrote:
> I need the pairwise similarity measurements between two corpora, MxN. I
> wondered about the efficiency of this in Text::Similarity.
>
> Here's the task. We're transitioning a piece of hardware from one program
> to another. The hardware was built to the old program's requirements,
> (roughly 300 old requirements.) The new program has its own requirements
> (roughly 600 requirements.) Each requirement is ~ 100 words.
>
> I'm supporting a gap analysis. One task in the gap analysis can be stated
> as
>
> For each old requirement, find up to 3 new requirements which are
> most-similar to the old requirement.
>
>
> Example: Suppose I have an old requirement that reads "The Delivery-unit
> shall fold to a stowage volume that will fit within the Transport Bag
> dimensions of 48 by 20 by 14 inches and allow space for foam cushioning
> material." Then I want to find any new requirements that are talking about
> delivery-units, stowage volume, dimensions, transport bags or foam
> cushioning.
>
> To do this, I want the pairwise similarity scores between all the old and
> new requirements, roughly 300x600 = 180,000 comparisons. I suspect that
> invoking
> $score->[278]->[459] = $mod->getSimilarity ($reqtOld_278, $reqtNew_459);
> isn't the best way to do this. E.g., it would call sanitizeString on each
> old requirement 600 times.
>
> Am I missing something? Is there already a way to iterate efficiently over
> such a pair of corpora?
>
> David Throop
>
>
> ------------------------------------------------------------------------------
> EditLive Enterprise is the world's most technically advanced content
> authoring tool. Experience the power of Track Changes, Inline Image
> Editing and ensure content is compliant with Accessibility Checking.
> http://p.sf.net/sfu/ephox-dev2dev
> _______________________________________________
> text-similarity-users mailing list
> tex...@li...
> https://lists.sourceforge.net/lists/listinfo/text-similarity-users
>
>

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
|
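For anyone reading this thread in the archive, the caching idea David describes (tokenize each requirement once, so the M x N loop only does the comparison) can be sketched roughly as below. This is not David's code, which was never posted here, and it does not use Text::Similarity at all: it counts shared distinct words in a single awk process, which is only a simplified stand-in for the scores Text::Similarity::Overlaps reports. The function names `score_pairs` and `top_matches` and the one-requirement-per-line input files are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: counts shared distinct words, a simplified stand-in for the
# scores that Text::Similarity::Overlaps reports.

# score_pairs NEW_FILE OLD_FILE  (hypothetical helper)
# Prints "old_line new_line shared_words" for every pair sharing at least one word.
score_pairs() {
  awk '
    # Lowercase the text, replace punctuation with spaces, split into words[].
    function tokenize(text, words) {
      text = tolower(text)
      gsub(/[^a-z0-9]+/, " ", text)
      return split(text, words, " ")
    }
    # First file (new requirements): tokenize each line once and cache its word set.
    NR == FNR {
      n = tokenize($0, w)
      for (i = 1; i <= n; i++) newset[FNR, w[i]] = 1
      n_new = FNR
      next
    }
    # Second file (old requirements): tokenize once, then compare against the cache.
    {
      split("", uniq)                      # portable way to empty an array
      n = tokenize($0, w)
      for (i = 1; i <= n; i++) uniq[w[i]] = 1
      for (j = 1; j <= n_new; j++) {
        shared = 0
        for (word in uniq) if ((j, word) in newset) shared++
        if (shared > 0) print FNR, j, shared
      }
    }
  ' "$1" "$2"
}

# top_matches NEW_FILE OLD_FILE [K]  (hypothetical helper)
# Keeps the K best-scoring new lines for each old line (default 3).
top_matches() {
  score_pairs "$1" "$2" |
    sort -k1,1n -k3,3nr -k2,2n |
    awk -v k="${3:-3}" 'count[$1]++ < k'
}
```

With the old requirements in `old.txt` and the new ones in `new.txt` (hypothetical file names, one requirement per line), `top_matches new.txt old.txt` prints up to three lines per old requirement in the form "old_line new_line shared_words". Because each line is tokenized exactly once and everything runs in one process, the per-pair cost is only the set lookups, rather than a fresh `text_similarity.pl` start-up for each of the 180,000 pairs.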