From: Ted P. <tpederse@d.umn.edu> - 2011-06-15 21:38:32
Hi David,
Nice question. Unfortunately, I don't think there is a much better way
to do what you propose than a long series of pairwise comparisons.
That said, I ran something of the same dimensionality that you want to
do (600 x 300), and the script below took about 2.6 hours on a
5-year-old desktop. So if this isn't something you need to do on a
regular basis, maybe it works out OK.
Below is my timing output:
ted@linux-zxku:~> time bash runit.sh
real 156m55.322s
user 124m11.270s
sys 24m30.416s
ted@linux-zxku:~>
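To put that "real" figure in per-comparison terms, here is a small arithmetic sketch. The function name ms_per_comparison is made up for illustration; it just divides the wall-clock time by the 600 x 300 comparisons.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Text::Similarity): average wall-clock
# milliseconds per comparison.
# Arguments: minutes, seconds, number of comparisons.
ms_per_comparison() {
    LC_ALL=C awk -v m="$1" -v s="$2" -v n="$3" \
        'BEGIN { printf "%.1f\n", (m * 60 + s) / n * 1000 }'
}

# "real 156m55.322s" spread over 600 x 300 = 180000 comparisons
ms_per_comparison 156 55.322 $((600 * 300))   # prints 52.3
```

So each comparison cost roughly 52 ms of wall-clock time on that machine.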
And here is the script I ran. I just took a file and made 600
individual one-line files, and then did a bunch of pairwise similarities
with our command-line tool. Using the API would in effect result in
the same thing.
ted@linux-zxku:~> more runit.sh
-----------------------
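# Split the input file "text" into 600 one-line files: text.1 ... text.600
for line in {1..600..1}
do
    head -n "$line" text | tail -n 1 > "text.$line"
done
# Run the 600 x 300 pairwise comparisons, appending each result to text.output
for linea in {1..600..1}
do
    for lineb in {1..300..1}
    do
        text_similarity.pl --type Text::Similarity::Overlaps \
            "text.$linea" "text.$lineb" >> text.output
    done
done
# Remove the temporary one-line files
for line in {1..600..1}
do
    rm "text.$line"
done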
-----------
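For the "up to 3 most-similar new requirements per old requirement" step in your question, here is a minimal sketch of the ranking. It assumes the scores have already been collected into a tab-separated file with one line per pair: old id, new id, score. That file layout and the function name top3_per_old are assumptions for illustration, not something text_similarity.pl produces by itself.

```shell
#!/usr/bin/env bash
# Hypothetical input: one line per comparison, "old_id<TAB>new_id<TAB>score".
# This layout is an assumption, not the output format of text_similarity.pl.
top3_per_old() {
    # Order by old id, then by score from highest to lowest,
    # and keep only the first 3 lines seen for each old id.
    LC_ALL=C sort -t $'\t' -k1,1n -k3,3nr "$1" |
        awk -F '\t' '$1 != prev { prev = $1; n = 0 } ++n <= 3'
}

# Tiny made-up example: old requirement 1 has four candidates, 2 has one.
scores=$(mktemp)
printf '1\t10\t0.2\n1\t11\t0.9\n1\t12\t0.5\n1\t13\t0.7\n2\t10\t0.1\n' > "$scores"
top3_per_old "$scores"
rm "$scores"
```

An old requirement with fewer than 3 scored candidates simply yields fewer lines, which matches the "up to 3" wording.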
I hope this helps...please feel free to let us know of any additional
questions that might arise.
Cordially,
Ted
On Tue, Jun 14, 2011 at 3:52 PM, Throop, David R. (JSC-ER)[Jacobs
Technology] <dav...@na...> wrote:
> I need the pairwise similarity measurements between two corpora, MxN. I
> wondered about the efficiency of this in Text::Similarity.
>
> Here’s the task. We’re transitioning a piece of hardware from one program
> to another. The hardware was built to the old program’s requirements,
> (roughly 300 old requirements.) The new program has its own requirements
> (roughly 600 requirements.) Each requirement is ~ 100 words.
>
> I’m supporting a gap analysis. One task in the gap analysis can be stated
> as
>
> For each old requirement, find up to 3 new requirements which are
> most-similar to the old requirement.
>
>
> Example: Suppose I have an old requirement that reads “The Delivery-unit
> shall fold to a stowage volume that will fit within the Transport Bag
> dimensions of 48 by 20 by 14 inches and allow space for foam cushioning
> material.” Then I want to find any new requirements that are talking about
> delivery-units, stowage volume, dimensions, transport bags or foam
> cushioning.
>
> To do this, I want the pairwise similarity scores between all the old and
> new requirements, roughly 300x600 = 180,000 comparisons. I suspect that
> invoking
> $score->[278]->[459] = $mod->getSimilarity ($reqtOld_278, $reqtNew_459);
> isn’t the best way to do this. E.g., it would call sanitizeString on each
> old requirement 600 times.
>
> Am I missing something? Is there already a way to iterate efficiently over
> such a pair of corpora?
>
> David Throop
--
Ted Pedersen
http://www.d.umn.edu/~tpederse