Re: [text-similarity-users] Taking many similarity measurements between two corpora

Hi David,

Nice question. Unfortunately, I don't think there is a substantially
better way to do what you propose than a long series of pairwise
comparisons.

That said, I ran something of the same dimensionality that you want to
do (600 x 300), and the following script took about 2.6 hours (157
minutes) on a 5-year-old desktop. So if this isn't something you need
to do on a regular basis, maybe it works out OK.

Below is my timing output...

ted@linux-zxku:~> time bash runit.sh

real    156m55.322s
user    124m11.270s
sys     24m30.416s
ted@linux-zxku:~>
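
For scale, the wall-clock time above averages out to roughly 52 ms per
comparison. That figure includes starting text_similarity.pl afresh for
every pair. A quick check of the arithmetic (the variable names are
only for illustration):

```shell
# Wall-clock ("real") time from the run above: 156m55.322s
real_seconds=$(awk 'BEGIN { printf "%.3f", 156 * 60 + 55.322 }')

# 600 x 300 pairwise comparisons
comparisons=$((600 * 300))

# Average cost of one comparison, in milliseconds
per_pair_ms=$(awk -v s="$real_seconds" -v n="$comparisons" \
        'BEGIN { printf "%.1f", 1000 * s / n }')

echo "$comparisons comparisons, $per_pair_ms ms each"
# prints: 180000 comparisons, 52.3 ms each
```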

And here is the script I ran. I just took a file and made 600
individual one-line files, and then ran a batch of pairwise similarities
with our command-line tool. Using the API would in effect amount to
the same thing...

ted@linux-zxku:~> more runit.sh

-----------------------

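# Split the input file "text" into one-line files text.1 .. text.600
for line in {1..600}
do
        head -n "$line" text | tail -n 1 > "text.$line"
done

# Compare each of the 600 files against the first 300, appending the
# output of every comparison to text.output
for linea in {1..600}
do
        for lineb in {1..300}
        do
                text_similarity.pl --type Text::Similarity::Overlaps \
                        "text.$linea" "text.$lineb" >> text.output
        done
done

# Remove the temporary one-line files
for line in {1..600}
do
        rm "text.$line"
done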

-----------
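
For the gap analysis itself, each score has to be tied back to the pair
that produced it, which the loop above does not record. Below is a
minimal sketch of the "up to 3 most similar" step. It assumes a
hypothetical file scores.tsv with one tab-separated line per pair in
the form old_id, new_id, score. That layout is an assumption for
illustration, not something text_similarity.pl writes by itself, and
the sample scores are made up.

```shell
# Hypothetical sample data: old_id <TAB> new_id <TAB> score
printf '%s\t%s\t%s\n' \
        1 10 0.20 \
        1 11 0.75 \
        1 12 0.05 \
        1 13 0.40 \
        2 10 0.10 \
        2 11 0.30 > scores.tsv

tab=$(printf '\t')

# Sort by old_id ascending, then by score descending, and keep only
# the first 3 rows seen for each old_id.
top3=$(sort -t "$tab" -k1,1n -k3,3nr scores.tsv |
        awk -F '\t' '++seen[$1] <= 3')

echo "$top3"
rm scores.tsv
```

With the sample data, old requirement 1 keeps new requirements 11, 13
and 10 (in that order) and drops 12, while old requirement 2 keeps both
of its rows. To build such a file from the loop above, one option would
be to write "$linea", "$lineb" and the tool's output on one line per
pair, assuming the tool prints only the score on standard output.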

I hope this helps...please feel free to let us know of any additional
questions that might arise.

Cordially,
Ted

On Tue, Jun 14, 2011 at 3:52 PM, Throop, David R. (JSC-ER)[Jacobs
Technology] <dav...@na...> wrote:
> I need the pairwise similarity measurements between two corpora, MxN.   I
> wondered about the efficiency of this in Text::Similarity.
>
> Here’s the task.  We’re transitioning a piece of hardware from one program
> to another.  The hardware was built to the old program’s requirements,
> (roughly 300 old requirements.)  The new program has its own requirements
> (roughly 600 requirements.)  Each requirement is ~ 100 words.
>
> I’m supporting a gap analysis.  One task in the gap analysis can be stated
> as
>
> For each old requirement, find up to 3 new requirements which are
> most-similar to the old requirement.
>
>
> Example: Suppose I have an old requirement that reads “The Delivery-unit
> shall fold to a stowage volume that will fit within the Transport Bag
> dimensions of 48 by 20 by 14 inches and allow space for foam cushioning
> material.”  Then I want to find any new requirements that are talking about
> delivery-units, stowage volume, dimensions, transport bags or foam
> cushioning.
>
> To do this, I want the pairwise similarity scores between all the old and
> new requirements, roughly 300x600 = 180,000 comparisons.  I suspect that
> invoking
> $score->[278]->[459] = $mod->getSimilarity ($reqtOld_278, $reqtNew_459);
> isn’t the best way to do this.   E.g, it would call sanitizeString on each
> old requirement 600 times.
>
> Am I missing something?  Is there already a way to iterate efficiently over
> such a pair of corpora?
>
> David Throop

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse