From: Throop, D. R. (JSC-ER)[J. Technology] <dav...@na...> - 2011-07-12 17:02:20
Thanks! I would have responded faster, but your message wandered into my spam filter, where I just noticed it.
I tore your code apart and was able to cache the computation that only affected one of the strings (tokenization, etc.), so that the only computation inside the 600 x 300 loop was the direct comparison. Eventually I'll ask NASA for the forms to allow me to post it on CPAN.
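A minimal sketch of that caching idea in shell, not the actual code (which is not posted here): each file is tokenized once up front, and the inner loop only compares the cached token lists. A simple shared-word count stands in for the Text::Similarity measure, and the names tokenize, overlap, and score_all are hypothetical.

```shell
# Sketch only: a plain shared-word count replaces the real similarity measure.
# tokenize, overlap and score_all are hypothetical helper names.

# Split a file into lowercase words, one per line, sorted and deduplicated.
# LC_ALL=C keeps the ordering consistent between sort and comm.
tokenize() {
    LC_ALL=C tr -cs '[:alnum:]' '\n' < "$1" \
        | LC_ALL=C tr '[:upper:]' '[:lower:]' \
        | LC_ALL=C sort -u \
        | sed '/^$/d'
}

# Count the words shared by two already-tokenized files.
overlap() {
    LC_ALL=C comm -12 "$1" "$2" | wc -l | tr -d ' '
}

# Pass 1 tokenizes every file once (M + N tokenizations, not M x N).
# Pass 2 is the M x N loop, which now does only the comparison.
# Prints one line per pair: "<new_id> <old_id> <shared_word_count>".
score_all() {
    local new_dir=$1 old_dir=$2 cache f a b
    cache=$(mktemp -d)
    mkdir "$cache/new" "$cache/old"
    for f in "$new_dir"/*; do
        tokenize "$f" > "$cache/new/$(basename "$f")"
    done
    for f in "$old_dir"/*; do
        tokenize "$f" > "$cache/old/$(basename "$f")"
    done
    for a in "$cache"/new/*; do
        for b in "$cache"/old/*; do
            printf '%s %s %s\n' "$(basename "$a")" "$(basename "$b")" "$(overlap "$a" "$b")"
        done
    done
    rm -r "$cache"
}
```

With one requirement per file, something like `score_all new_reqts old_reqts > scores.txt` would produce the full table; the directory names are placeholders.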
David Throop
-----Original Message-----
From: Ted Pedersen [mailto:tpederse@d.umn.edu]
Sent: Wednesday, June 15, 2011 4:38 PM
To: tex...@li...
Subject: Re: [text-similarity-users] Taking many similarity measurements between two corpora
Hi David,
Nice question, and unfortunately I don't think there is a particularly
better way to do what you propose, other than a long series of
pairwise comparisons.
That said, I ran something of the same dimensionality that you want to
do (600 x 300), and the following script took about 2.6 hours on a
5-year-old desktop. So if this isn't something you need to do on a
regular basis, maybe it works out OK.
Below is my timing output...
ted@linux-zxku:~> time bash runit.sh
real 156m55.322s
user 124m11.270s
sys 24m30.416s
ted@linux-zxku:~>
And then there is the script I ran - I just took a file and made 600
individual 1 line files, and then did a bunch of pairwise similarities
with our command line tool. Using the API would in effect result in
the same thing...
ted@linux-zxku:~> more runit.sh
-----------------------
for line in {1..600..1}
do
head -$line text | tail -1 > text.$line
done
for linea in {1..600..1}
do
for lineb in {1..300..1}
do
text_similarity.pl --type Text::Similarity::Overlaps \
text.$linea text.$lineb >> text.output
done
done
for line in {1..600..1}
do
rm text.$line
done
-----------
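For the task in the original question (up to 3 most-similar new requirements for each old requirement), the scores then need to be ranked. A rough sketch, assuming the scores have been collected as lines of "<new_id> <old_id> <score>"; that layout and the top3 name are assumptions for illustration, not necessarily what text_similarity.pl writes to text.output.

```shell
# Sketch only: assumes each input line is "<new_id> <old_id> <score>".
# top3 is a hypothetical helper name.

# For each old requirement, print the up-to-3 highest-scoring new
# requirements as "<old_id> <new_id> <score>".
top3() {
    # Group by old id (column 2), then order by score (column 3), highest first.
    LC_ALL=C sort -k2,2 -k3,3nr "$1" | awk '
        NR == 1 || $2 != prev { prev = $2; n = 0 }   # new old id: reset counter
        n < 3 { print $2, $1, $3; n++ }              # keep at most 3 per old id
    '
}
```

Calling `top3 scores.txt` (a placeholder file name) prints at most three lines per old requirement, best match first.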
I hope this helps...please feel free to let us know of any additional
questions that might arise.
Cordially,
Ted
On Tue, Jun 14, 2011 at 3:52 PM, Throop, David R. (JSC-ER)[Jacobs
Technology] <dav...@na...> wrote:
> I need the pairwise similarity measurements between two corpora, MxN. I
> wondered about the efficiency of this in Text::Similarity.
>
> Here's the task. We're transitioning a piece of hardware from one program
> to another. The hardware was built to the old program's requirements,
> (roughly 300 old requirements.) The new program has its own requirements
> (roughly 600 requirements.) Each requirement is ~ 100 words.
>
> I'm supporting a gap analysis. One task in the gap analysis can be stated
> as
>
> For each old requirement, find up to 3 new requirements which are
> most-similar to the old requirement.
>
>
> Example: Suppose I have an old requirement that reads "The Delivery-unit
> shall fold to a stowage volume that will fit within the Transport Bag
> dimensions of 48 by 20 by 14 inches and allow space for foam cushioning
> material." Then I want to find any new requirements that are talking about
> delivery-units, stowage volume, dimensions, transport bags or foam
> cushioning.
>
> To do this, I want the pairwise similarity scores between all the old and
> new requirements, roughly 300x600 = 180,000 comparisons. I suspect that
> invoking
> $score->[278]->[459] = $mod->getSimilarity ($reqtOld_278, $reqtNew_459);
> isn't the best way to do this. E.g, it would call sanitizeString on each
> old requirement 600 times.
>
> Am I missing something? Is there already a way to iterate efficiently over
> such a pair of corpora?
>
> David Throop
>
>
> ------------------------------------------------------------------------------
> EditLive Enterprise is the world's most technically advanced content
> authoring tool. Experience the power of Track Changes, Inline Image
> Editing and ensure content is compliant with Accessibility Checking.
> http://p.sf.net/sfu/ephox-dev2dev
> _______________________________________________
> text-similarity-users mailing list
> tex...@li...
> https://lists.sourceforge.net/lists/listinfo/text-similarity-users
>
>
--
Ted Pedersen
http://www.d.umn.edu/~tpederse