From: Adam R. <ad...@ex...> - 2011-06-23 11:37:34
If you have these modules working, they could be added to eXist-db as extensions for everyone to use. Would you like to add them and become the de facto maintainer?

On 16 June 2011 16:38, Markus Kaindl <mar...@bs...> wrote:
> still unsure whether eXist already offers a possibility to do this but
> at least I got the aforementioned module working.
>
> in case anyone is interested: I stored the following two modules
>
> http://experimental.zorba-xquery.com/xqdoc/modules/www.zorba-xquery.com_modules_data-cleaning_string-similarity.xq
> and
> http://experimental.zorba-xquery.com/xqdoc/modules/www.zorba-xquery.com_modules_data-cleaning_set-similarity.xq
>
> in my test folder. you have to activate eXist's built-in math module
> (cf. conf.xml).
> edit the first file: instead of using the predefined math module you
> use eXist's:
>
> import module namespace math = "http://exist-db.org/xquery/math";
>
> plus, you have to use your local copy of the module [I named it set.xq
> for convenience]:
>
> import module namespace set =
> "http://www.zorba-xquery.com/modules/data-cleaning/set-similarity"
> at "set.xq";
>
> after that the module should work just fine! the following calculations
> are now possible:
>
> * cosine-ngrams ( $s1 as xs:string, $s2 as xs:string, $n as xs:integer ) as xs:double
>   Cosine similarity coefficient between sets of character n-grams
>   extracted from two strings.
>
> * cosine-tokens ( $s1 as xs:string, $s2 as xs:string, $r as xs:string ) as xs:double
>   Cosine similarity coefficient between sets of tokens extracted from two
>   strings.
>
> * dice-ngrams ( $s1 as xs:string, $s2 as xs:string, $n as xs:integer ) as xs:double
>   Dice similarity coefficient between sets of character n-grams extracted
>   from two strings.
>
> * dice-tokens ( $s1 as xs:string, $s2 as xs:string, $r as xs:string ) as xs:double
>   Dice similarity coefficient between sets of tokens extracted from two
>   strings.
>
> * editdistance ( $s1 as xs:string, $s2 as xs:string ) as xs:integer
>   Edit distance between two strings.
>
> * jaccard-ngrams ( $s1 as xs:string, $s2 as xs:string, $n as xs:integer ) as xs:double
>   Jaccard similarity coefficient between sets of character n-grams
>   extracted from two strings.
>
> * jaccard-tokens ( $s1 as xs:string, $s2 as xs:string, $r as xs:string ) as xs:double
>   Jaccard similarity coefficient between sets of tokens extracted from
>   two strings.
>
> * jaro ( $s1 as xs:string, $s2 as xs:string ) as xs:double
>   Jaro similarity coefficient between two strings.
>
> * jarowinkler ( $s1 as xs:string, $s2 as xs:string, $prefix as xs:integer, $fact as xs:double ) as xs:double
>   Jaro-Winkler similarity coefficient between two strings.
>
> * ngrams ( $s as xs:string, $n as xs:integer ) as xs:string*
>
> Cheers,
> Markus
>
>
>>>> "Markus Kaindl" <mar...@bs...> 16.6.2011 16:19 >>>
>
> hi list,
>
> does eXist ship with a built-in function to compute the similarity of
> two strings?
> I am thinking of something like cosine, dice, ngrams or the Levenshtein
> edit distance.
>
> I already tried to use the module from
> http://experimental.zorba-xquery.com/xqdoc/www.zorba-xquery.com_modules_data-cleaning_string-similarity.html
> but couldn't get it to run because of a problem with the math module
> ("failed to load module 'http://www.zorba-xquery.com/modules/math' from
> 'http://www.zorba-xquery.com/modules/math. Source not found.") I added
> the math module to my database but it seems like some parts would need to
> be rewritten.
>
> this [on the page of the module] could also be the source of the
> problem:
> "The logic contained in this module is not specific to any particular
> XQuery implementation, although it requires the trigonometic functions
> of XQuery 1.1 or a math extension function for computing sqrt()."
>
> I could continue to try to get this working (e.g. with eXist's math
> module) but maybe there is a simpler way.
> I know that Lucene offers the
> possibility to look for similar items. perhaps there is something like
> that for a pure string comparison?
>
> thanks for your help,
> markus
>
>
> ------------------------------------------------------------------------------
> EditLive Enterprise is the world's most technically advanced content
> authoring tool. Experience the power of Track Changes, Inline Image
> Editing and ensure content is compliant with Accessibility Checking.
> http://p.sf.net/sfu/ephox-dev2dev
> _______________________________________________
> Exist-open mailing list
> Exi...@li...
> https://lists.sourceforge.net/lists/listinfo/exist-open
>

--
Adam Retter

eXist Developer
{ United Kingdom }
ad...@ex...
irc://irc.freenode.net/existdb
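
For anyone reading this thread later and wondering what the n-gram coefficients in Markus's list actually compute, here is a minimal sketch in plain XQuery 1.0. It is an illustration only: the `local:` functions are hypothetical names written for this note, not the Zorba module's source, and the module may differ in details such as how it treats duplicate n-grams or short strings. It needs no math module, since the Jaccard coefficient involves no sqrt().

```xquery
xquery version "1.0";

(: Illustration only: local helper functions, not the Zorba module's code. :)

(: All character n-grams of $s, in order of appearance.
   Returns the empty sequence when $s is shorter than $n. :)
declare function local:ngrams($s as xs:string, $n as xs:integer) as xs:string* {
  for $i in 1 to string-length($s) - $n + 1
  return substring($s, $i, $n)
};

(: Jaccard coefficient: size of the intersection divided by the size of
   the union of the two sets of distinct n-grams. :)
declare function local:jaccard-ngrams($s1 as xs:string, $s2 as xs:string,
                                      $n as xs:integer) as xs:double {
  let $a := distinct-values(local:ngrams($s1, $n))
  let $b := distinct-values(local:ngrams($s2, $n))
  let $shared := count($a[. = $b])
  let $union := count(distinct-values(($a, $b)))
  return
    (: Assumption: two strings with no n-grams at all score 0. :)
    if ($union eq 0) then 0e0 else xs:double($shared div $union)
};

local:jaccard-ngrams("night", "nacht", 2)
```

With bigrams, "night" gives ni, ig, gh, ht and "nacht" gives na, ac, ch, ht, so one bigram is shared out of seven distinct ones and the query returns 1/7, roughly 0.14. The Dice coefficient would use the same two sets but divide twice the shared count by the sum of the two set sizes.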