We are currently developing a plagiarism detection tool and are thinking about using the SimMetrics package to evaluate word similarity.
The specific use we have in mind is comparing documents that hold very little text, e.g. slides.
Can you recommend something or give some thoughts?
If there is very little text and you are checking against other similar submissions, then it may be a good idea. But is this on a slide-by-slide basis, or is it a large number of slides compared against another large number of slides?
We would have a lot of slides in a database (we use Lucene), and the idea would be to compare ONE given slide against the database to come up with exact or partial matches.
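
To make the "one slide against the database" idea concrete, here is a rough sketch in plain Python. It does not use SimMetrics or Lucene; it is only a stand-in that scores word-bigram overlap (Jaccard similarity) between the query slide and each stored slide. The function names, the in-memory list standing in for the database, and the 0.3 threshold are all assumptions made for illustration.

```python
import re


def tokens(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(text, n=2):
    """Return the set of n-word shingles (as tuples) for a text.

    Texts shorter than n words yield a single shingle holding all their words.
    """
    words = tokens(text)
    if not words:
        return set()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def find_matches(query, slides, threshold=0.3):
    """Score one query slide against every stored slide.

    Returns (score, slide) pairs at or above the threshold, best first.
    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    query_shingles = shingles(query)
    scored = [(jaccard(query_shingles, shingles(s)), s) for s in slides]
    hits = [pair for pair in scored if pair[0] >= threshold]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


# Hypothetical stand-in for the slide database.
stored_slides = [
    "Plagiarism detection compares documents for copied text",
    "Lucene is a full-text search library",
]

query_slide = "Plagiarism detection compares slides for copied text"
for score, slide in find_matches(query_slide, stored_slides):
    print(f"{score:.2f}  {slide}")
```

A score of 1.0 would be an exact match after normalisation, and anything between the threshold and 1.0 a partial match. With a large database, scanning every slide like this would probably be too slow, so the index would first have to narrow down the candidates and a comparison like this would only run on those.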