carrot2-cvscommits Mailing List for Carrot2 (Page 397)
Brought to you by: dawidweiss, stachoo
This list is closed, nobody may subscribe to it.
From: <daw...@us...> - 2004-02-15 18:50:46
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src/com/stachoodev/carrot/filter/cluster/lsicluster
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv27056

Modified Files:
	LsiClusteringStrategy.java

Log Message:
The debugging matrix dump was still present in the code. Apologies.

Index: LsiClusteringStrategy.java
===================================================================
RCS file: /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src/com/stachoodev/carrot/filter/cluster/lsicluster/LsiClusteringStrategy.java,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** LsiClusteringStrategy.java	10 Feb 2004 15:27:40 -0000	1.3
--- LsiClusteringStrategy.java	15 Feb 2004 18:43:21 -0000	1.4
***************
*** 210,223 ****
  // The SVD
- 
- // dump the matrix.
- try {
- java.io.ObjectOutputStream os = new java.io.ObjectOutputStream (new java.io.FileOutputStream("f:\\matrix"));
- os.writeObject(tdMatrix);
- os.close();
- }
- catch (Exception e) {
- }
- 
  SingularValueDecomposition svd = tdMatrix.svd();
--- 210,213 ----
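The fix above simply deletes a hardcoded dump of the term-document matrix to f:\matrix. An alternative that avoids the problem entirely is to gate such dumps behind a system property, so a forgotten debug aid can never run in a release build. A minimal sketch of that pattern (the carrot2.debug.dumpMatrix property and the class/method names below are hypothetical, not part of Carrot2):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class MatrixDump {
    // Hypothetical flag: off by default, so production builds never write dumps.
    static final boolean DUMP_MATRICES = Boolean.getBoolean("carrot2.debug.dumpMatrix");

    // Serializes the matrix only when the debug property is set.
    static void dumpIfEnabled(Serializable matrix, File target) {
        if (!DUMP_MATRICES) {
            return;
        }
        try (ObjectOutputStream os = new ObjectOutputStream(new FileOutputStream(target))) {
            os.writeObject(matrix);
        } catch (IOException e) {
            // A debugging aid must never break the clustering pipeline.
        }
    }

    public static void main(String[] args) throws IOException {
        File target = File.createTempFile("matrix", ".bin");
        // Without -Dcarrot2.debug.dumpMatrix=true the call is a no-op,
        // so the temporary file stays empty.
        dumpIfEnabled(new double[][] { { 1.0, 2.0 } }, target);
        System.out.println(target.length() == 0L);
    }
}
```

The property check also avoids the platform-specific "f:\\matrix" path that made the original dump fail silently everywhere except one developer machine.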
From: <daw...@us...> - 2004-02-15 17:30:18
Update of /cvsroot/carrot2/deploy/cron
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv10022

Modified Files:
	update-demo-cron.sh update-demo.sh update-docs.sh update-website.sh

Log Message:
Changed absolute paths for the new installation.

Index: update-demo-cron.sh
===================================================================
RCS file: /cvsroot/carrot2/deploy/cron/update-demo-cron.sh,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** update-demo-cron.sh	30 Sep 2003 16:57:39 -0000	1.1
--- update-demo-cron.sh	15 Feb 2004 17:22:45 -0000	1.2
***************
*** 7,9 ****
  /home/dweiss/carrot2/deploy/cron/update-demo.sh
- 
--- 7,8 ----

Index: update-demo.sh
===================================================================
RCS file: /cvsroot/carrot2/deploy/cron/update-demo.sh,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** update-demo.sh	15 Feb 2004 16:54:08 -0000	1.6
--- update-demo.sh	15 Feb 2004 17:22:46 -0000	1.7
***************
*** 4,21 ****
  # This is a Bash script for Cron that updates, compiles and
  # runs a carrot2 demo
- # at http://ophelia.cs.put.poznan.pl:2001
  #
  cd /home/dweiss/carrot2/deploy
- 
  JAVA_HOME=/usr/java/j2sdk
  JAVACMD=${JAVA_HOME}/bin/java
- ANT_HOME=/usr/java/ant
- PATH=${PATH}:/home/dweiss/xep
- 
- export PATH
- export JAVA_HOME
  export JAVACMD
- export ANT_HOME
  
  # update the code
--- 4,13 ----
***************
*** 34,37 ****
--- 26,30 ----
  fi
  done
+ # update the tests (if possible)
  for counter in `seq 1 10`; do
***************
*** 71,80 ****
  # copy webapps to nightly binaries folder.
! rm -f /carrot/www/static/download/nightly/*.war
! rm -f /carrot/www/static/download/nightly/*.zip
! cp /home/dweiss/carrot2/runtime/context-webapps/*.war /carrot/www/static/download/nightly/
! zip /carrot/www/static/download/nightly/shared-libraries.zip /home/dweiss/carrot2/runtime/shared/lib/*.jar
! 
  # override webapps if needed.
  cp -f /home/dweiss/carrot2/override-modules/*.war /home/dweiss/carrot2/runtime/context-webapps/
--- 64,73 ----
  # copy webapps to nightly binaries folder.
! rm -f /srv/www/vhosts/carrot/static/download/nightly/*.war
! rm -f /srv/www/vhosts/carrot/static/download/nightly/*.zip
! cp /home/dweiss/carrot2/runtime/context-webapps/*.war /srv/www/vhosts/carrot/static/download/nightly/
! zip /srv/www/vhosts/carrot/static/download/nightly/shared-libraries.zip /home/dweiss/carrot2/runtime/shared/lib/*.jar
! 
  # override webapps if needed.
  cp -f /home/dweiss/carrot2/override-modules/*.war /home/dweiss/carrot2/runtime/context-webapps/

Index: update-docs.sh
===================================================================
RCS file: /cvsroot/carrot2/deploy/cron/update-docs.sh,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** update-docs.sh	24 Nov 2003 22:05:56 -0000	1.3
--- update-docs.sh	15 Feb 2004 17:22:46 -0000	1.4
***************
*** 9,21 ****
  cd /home/dweiss/carrot2/deploy
- 
  JAVA_HOME=/usr/java/j2sdk
  JAVACMD=${JAVA_HOME}/bin/java
- ANT_HOME=/usr/java/ant
- PATH=${PATH}:/home/dweiss/xep
- 
- export PATH
- export JAVA_HOME
  export JAVACMD
- export ANT_HOME
  
  for counter in `seq 1 40`; do
--- 9,14 ----

Index: update-website.sh
===================================================================
RCS file: /cvsroot/carrot2/deploy/cron/update-website.sh,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** update-website.sh	24 Nov 2003 22:05:56 -0000	1.7
--- update-website.sh	15 Feb 2004 17:22:46 -0000	1.8
***************
*** 8,20 ****
  cd /home/dweiss/carrot2/deploy
- 
  JAVA_HOME=/usr/java/j2sdk
  JAVACMD=${JAVA_HOME}/bin/java
- ANT_HOME=/usr/java/ant
- PATH=${PATH}:/home/dweiss/xep
- 
- export PATH
- export JAVA_HOME
  export JAVACMD
- export ANT_HOME
  
  for counter in `seq 1 40`; do
--- 8,13 ----
From: <daw...@us...> - 2004-02-15 17:01:29
Update of /cvsroot/carrot2/deploy/cron
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv3770/cron

Modified Files:
	update-demo.sh

Log Message:

Index: update-demo.sh
===================================================================
RCS file: /cvsroot/carrot2/deploy/cron/update-demo.sh,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** update-demo.sh	24 Nov 2003 22:05:56 -0000	1.5
--- update-demo.sh	15 Feb 2004 16:54:08 -0000	1.6
***************
*** 50,53 ****
--- 50,63 ----
  done
+ # stop tomcat.
+ ant -f build.demo.xml stop.tomcat
+ 
+ # zip tomcat logs.
+ mkdir -p /home/dweiss/carrot2/logs-tomcat
+ zip -r /home/dweiss/carrot2/logs-tomcat/tomcat-logs-`date +%Y-%m-%d_%H-%M` /home/dweiss/carrot2/runtime/logs/*
+ 
+ ant -f build.demo.xml clean.webapps
+ ant -f build.demo.xml copy.logs
+ 
  if ant -Dno.cvsupdate=true -f build.demo.xml \
***************
*** 58,68 ****
  then
  # stop tomcat first, restart it in 'success' mode.
- ant -f build.demo.xml stop.tomcat
- ant -f build.demo.xml clean.webapps
- ant -f build.demo.xml copy.logs
  ant -f build.demo.xml copy.webapps
  # copy webapps to nightly binaries folder.
  rm -f /carrot/www/static/download/nightly/*.war
  cp /home/dweiss/carrot2/runtime/context-webapps/*.war /carrot/www/static/download/nightly/
  # run tomcat in the background, wait and test it after a couple of minutes
  (ant -f build.demo.xml start.tomcat.success)&
--- 68,82 ----
  then
  # stop tomcat first, restart it in 'success' mode.
  ant -f build.demo.xml copy.webapps
+ 
  # copy webapps to nightly binaries folder.
  rm -f /carrot/www/static/download/nightly/*.war
+ rm -f /carrot/www/static/download/nightly/*.zip
  cp /home/dweiss/carrot2/runtime/context-webapps/*.war /carrot/www/static/download/nightly/
+ zip /carrot/www/static/download/nightly/shared-libraries.zip /home/dweiss/carrot2/runtime/shared/lib/*.jar
+ 
+ # override webapps if needed.
+ cp -f /home/dweiss/carrot2/override-modules/*.war /home/dweiss/carrot2/runtime/context-webapps/
+ 
  # run tomcat in the background, wait and test it after a couple of minutes
  (ant -f build.demo.xml start.tomcat.success)&
***************
*** 77,83 ****
  else
  # stop tomcat first, restart it in 'failure' mode.
- ant -f build.demo.xml stop.tomcat
- ant -f build.demo.xml clean.webapps
- ant -f build.demo.xml copy.logs
  ant -f build.demo.xml start.tomcat.failure
  fi
--- 91,94 ----
From: <daw...@us...> - 2004-02-15 17:01:29
Update of /cvsroot/carrot2/deploy
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv3770

Modified Files:
	tomcat.xml

Log Message:

Index: tomcat.xml
===================================================================
RCS file: /cvsroot/carrot2/deploy/tomcat.xml,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** tomcat.xml	23 Sep 2003 15:42:20 -0000	1.1
--- tomcat.xml	15 Feb 2004 16:54:08 -0000	1.2
***************
*** 26,30 ****
  <echo message="Starting Tomcat (Catalina) in ${catalina.home}"/>
! <java classname="org.apache.catalina.startup.Bootstrap" fork="yes">
  <jvmarg value="-Dcatalina.home=${tomcat.home}"/>
  <arg value="-config"/>
--- 26,34 ----
  <echo message="Starting Tomcat (Catalina) in ${catalina.home}"/>
! <java classname="org.apache.catalina.startup.Bootstrap" fork="yes"
!     output="/home/dweiss/carrot2/logs-cron/catalina-stdout.log"
!     error="/home/dweiss/carrot2/logs-cron/catalina-stderr.log"
!     append="true"
! >
  <jvmarg value="-Dcatalina.home=${tomcat.home}"/>
  <arg value="-config"/>
From: <daw...@us...> - 2004-02-10 18:25:00
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/Jama
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv15302/src-test/Jama

Modified Files:
	SingularValueDecomposition.java

Log Message:
An infinite-loop test (Jama).

Index: SingularValueDecomposition.java
===================================================================
RCS file: /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/Jama/SingularValueDecomposition.java,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** SingularValueDecomposition.java	10 Feb 2004 15:27:41 -0000	1.1
--- SingularValueDecomposition.java	10 Feb 2004 18:21:29 -0000	1.2
***************
*** 2,19 ****
  import Jama.util.*;
  
! /** Singular Value Decomposition.
! <P>
! For an m-by-n matrix A with m >= n, the singular value decomposition is
! an m-by-n orthogonal matrix U, an n-by-n diagonal matrix S, and
! an n-by-n orthogonal matrix V so that A = U*S*V'.
! <P>
! The singular values, sigma[k] = S[k][k], are ordered so that
! sigma[0] >= sigma[1] >= ... >= sigma[n-1].
! <P>
! The singular value decompostion always exists, so the constructor will
! never fail. The matrix condition number and the effective numerical
! rank can be computed from this decomposition.
! */
  public class SingularValueDecomposition implements java.io.Serializable {
--- 2,25 ----
  import Jama.util.*;
  
! /*
! This is a patched version that prevents underfull busy infinite loop
! problem.
+ Sent to me by Jiazheng Shi to whom I would like to express my gratitude.
+ */
+ 
+ /** Singular Value Decomposition.
+ <P>
+ For an m-by-n matrix A with m >= n, the singular value decomposition is
+ an m-by-n orthogonal matrix U, an n-by-n diagonal matrix S, and
+ an n-by-n orthogonal matrix V so that A = U*S*V'.
+ <P>
+ The singular values, sigma[k] = S[k][k], are ordered so that
+ sigma[0] >= sigma[1] >= ... >= sigma[n-1].
+ <P>
+ The singular value decompostion always exists, so the constructor will
+ never fail. The matrix condition number and the effective numerical
+ rank can be computed from this decomposition.
+ */
  public class SingularValueDecomposition implements java.io.Serializable {
***************
*** 38,43 ****
  */
  private int m, n;
! 
! private final static int MAX_ITER = 600;
  
  /* ------------------------
--- 44,49 ----
  */
  private int m, n;
! 
! private final static int MAX_ITER = 6000;
  
  /* ------------------------
***************
*** 250,253 ****
--- 256,260 ----
  int iter = 0;
  double eps = Math.pow(2.0,-52.0);
+ double tiny = Math.pow(2.0,-966.0);
  while (p > 0) {
  int k,kase;
***************
*** 272,276 ****
  break;
  }
! if (Math.abs(e[k]) <= eps*(Math.abs(s[k]) + Math.abs(s[k+1]))) {
  e[k] = 0.0;
  break;
--- 279,284 ----
  break;
  }
! if (Math.abs(e[k]) <=
!     tiny + eps*(Math.abs(s[k]) + Math.abs(s[k+1]))) {
  e[k] = 0.0;
  break;
***************
*** 287,292 ****
  double t = (ks != p ? Math.abs(e[ks]) : 0.) +
  (ks != k+1 ? Math.abs(e[ks-1]) : 0.);
! 
! if (Math.abs(s[ks]) <= eps*t) {
  s[ks] = 0.0;
  break;
--- 295,299 ----
  double t = (ks != p ? Math.abs(e[ks]) : 0.) +
  (ks != k+1 ? Math.abs(e[ks-1]) : 0.);
! if (Math.abs(s[ks]) <= tiny + eps*t) {
  s[ks] = 0.0;
  break;
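The core of the patch above is the tiny term added to both convergence tests. Once the singular values shrink into the denormalized range, eps*(|s[k]| + |s[k+1]|) can itself underflow to zero, so the unguarded comparison never fires and the QR iteration spins forever. A standalone sketch of the guarded test (the class and method names are mine; only the constants and the comparison come from the patch):

```java
public class UnderflowGuard {
    // Constants as in the patched Jama: machine epsilon and a denormal-range floor.
    static final double EPS = Math.pow(2.0, -52.0);
    static final double TINY = Math.pow(2.0, -966.0);

    // Patched test: may e[k] be flushed to zero given its neighbors s[k], s[k+1]?
    static boolean negligiblePatched(double e, double s0, double s1) {
        return Math.abs(e) <= TINY + EPS * (Math.abs(s0) + Math.abs(s1));
    }

    // Original test: the right-hand side underflows to 0.0 for denormal inputs,
    // so the condition is never satisfied and the loop never converges.
    static boolean negligibleOriginal(double e, double s0, double s1) {
        return Math.abs(e) <= EPS * (Math.abs(s0) + Math.abs(s1));
    }

    public static void main(String[] args) {
        double denormal = Math.pow(2.0, -1060.0); // below the normal range (2^-1022)
        System.out.println(negligibleOriginal(denormal, denormal, denormal)); // false
        System.out.println(negligiblePatched(denormal, denormal, denormal));  // true
    }
}
```

The bumped MAX_ITER (600 to 6000) is only a second line of defense; the tiny floor is what actually breaks the busy loop, because any off-diagonal element smaller than 2^-966 now gets flushed to zero regardless of how small the neighboring singular values are.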
From: <daw...@us...> - 2004-02-10 18:24:59
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/jama-test
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv15302/src-test/jama-test

Modified Files:
	badmatrix

Log Message:
An infinite-loop test (Jama).

Index: badmatrix
===================================================================
RCS file: /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/jama-test/badmatrix,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
Binary files /tmp/cvsuFoMER and /tmp/cvsTAebA0 differ
From: <daw...@us...> - 2004-02-10 18:24:58
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/com/stachoodev/carrot/filter/cluster/lsicluster
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv15302/src-test/com/stachoodev/carrot/filter/cluster/lsicluster

Added Files:
	TestJamaSVDInfiniteLoopTest.java

Log Message:
An infinite-loop test (Jama).

--- NEW FILE: TestJamaSVDInfiniteLoopTest.java ---
/*
 * Carrot2 Project
 * Copyright (C) 2002-2003, Dawid Weiss
 * Portions (C) Contributors listen in carrot2.CONTRIBUTORS file.
 * All rights reserved.
 *
 * Refer to full text of the licence "carrot2.LICENCE" in the root folder
 * of CVS checkout or at:
 * http://www.cs.put.poznan.pl/dweiss/carrot2.LICENCE
 */
package com.stachoodev.carrot.filter.cluster.lsicluster;

import java.io.ObjectInputStream;

import junit.framework.TestCase;

import org.apache.log4j.Logger;

import Jama.Matrix;
import Jama.SingularValueDecomposition;

/**
 * Jama's underflow bug.
 * @author Dawid Weiss
 */
public class TestJamaSVDInfiniteLoopTest extends TestCase {

    public TestJamaSVDInfiniteLoopTest(String arg0) {
        super(arg0);
    }

    public void testJamaUnderflow() throws Exception {
        org.apache.log4j.BasicConfigurator.configure();
        Logger logger = Logger.getLogger("tests.performance");

        System.out.println(this.getClass().getClassLoader().getResource("jama-test/badmatrix"));
        ObjectInputStream is = new ObjectInputStream(
            this.getClass().getClassLoader().getResourceAsStream("jama-test/badmatrix"));
        Matrix matrix = (Matrix) is.readObject();
        is.close();

        new SingularValueDecomposition(matrix);
    }
}
From: <daw...@us...> - 2004-02-10 18:18:20
Update of /cvsroot/carrot2/carrot2/lib
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv13758

Modified Files:
	Jama-1.0.1-patched.jar

Log Message:
Another patch to Jama; this time it SOLVES the underflow problem :)

Index: Jama-1.0.1-patched.jar
===================================================================
RCS file: /cvsroot/carrot2/carrot2/lib/Jama-1.0.1-patched.jar,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
Binary files /tmp/cvsbNyIfi and /tmp/cvsDH5Fxx differ
From: <daw...@us...> - 2004-02-10 15:31:07
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5837

Modified Files:
	.classpath build.xml

Log Message:
[new] Feature extraction now uses the carrot tokenizer.
[bugfix] Long queries will now terminate with a runtime exception. This is caused by some bug (?) in SVD decomposition. A new version of Jama (patched by us) has to be downloaded too.
[refactoring] Various small code refactorings.

Index: .classpath
===================================================================
RCS file: /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/.classpath,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -C2 -d -r1.1.1.1 -r1.2
*** .classpath	19 Sep 2003 10:16:25 -0000	1.1.1.1
--- .classpath	10 Feb 2004 15:27:41 -0000	1.2
***************
*** 2,5 ****
--- 2,6 ----
  <classpath>
  <classpathentry kind="src" path="src"/>
+ <classpathentry kind="src" path="src-test"/>
  <classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER"/>
  <classpathentry kind="var" path="CARROT2_CVS/lib/commons-beanutils.jar"/>
***************
*** 14,28 ****
  <classpathentry kind="var" path="CARROT2_CVS/lib/dweiss-utils.jar"/>
  <classpathentry kind="var" path="CARROT2_CVS/lib/gnu-regexp-1.1.4.jar"/>
- <classpathentry kind="var" path="CARROT2_CVS/lib/jaxp.jar"/>
  <classpathentry kind="var" path="CARROT2_CVS/lib/saxon.jar"/>
  <classpathentry kind="var" path="CARROT2_CVS/lib/struts.jar"/>
  <classpathentry kind="var" path="CARROT2_CVS/lib/xercesImpl.jar"/>
  <classpathentry kind="var" path="CARROT2_CVS/lib/xml-apis.jar"/>
- <classpathentry kind="var" path="CARROT2_CVS/lib/junit.jar"/>
  <classpathentry kind="var" path="CARROT2_CVS/lib/carrot2-shared-lib.jar"/>
! <classpathentry kind="lib" path="lib/FSA.jar"/>
! <classpathentry kind="lib" path="lib/Jama-1.0.1.jar"/>
! <classpathentry kind="lib" path="lib/junit.jar"/>
! <classpathentry kind="lib" path="lib/stemming.jar"/>
  <classpathentry kind="output" path="tmp/build/WEB-INF/classes"/>
  </classpath>
--- 15,27 ----
  <classpathentry kind="var" path="CARROT2_CVS/lib/dweiss-utils.jar"/>
  <classpathentry kind="var" path="CARROT2_CVS/lib/gnu-regexp-1.1.4.jar"/>
  <classpathentry kind="var" path="CARROT2_CVS/lib/saxon.jar"/>
  <classpathentry kind="var" path="CARROT2_CVS/lib/struts.jar"/>
  <classpathentry kind="var" path="CARROT2_CVS/lib/xercesImpl.jar"/>
  <classpathentry kind="var" path="CARROT2_CVS/lib/xml-apis.jar"/>
  <classpathentry kind="var" path="CARROT2_CVS/lib/carrot2-shared-lib.jar"/>
! <classpathentry kind="var" path="CARROT2_CVS/lib/compile-time-only/junit.jar"/>
! <classpathentry kind="var" path="CARROT2_CVS/lib/FSA.jar"/>
! <classpathentry kind="var" path="CARROT2_CVS/lib/lametyzator.jar"/>
! <classpathentry kind="var" path="CARROT2_CVS/lib/Jama-1.0.1-patched.jar"/>
  <classpathentry kind="output" path="tmp/build/WEB-INF/classes"/>
  </classpath>

Index: build.xml
===================================================================
RCS file: /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/build.xml,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** build.xml	7 Feb 2004 08:02:17 -0000	1.4
--- build.xml	10 Feb 2004 15:27:41 -0000	1.5
***************
*** 135,138 ****
--- 135,189 ----
  <!-- }}} -->
+ 
+ <!-- ##################################### -->
+ <!-- ### {{{ RUN LOCAL TESTS ### -->
+ <!-- ##################################### -->
+ 
+ <target name="test" depends="compile" >
+     <!-- Compile test cases and copy resources first. -->
+     <javac destdir = "${build.dir}/WEB-INF/classes"
+         debug = "${java.debug}"
+         optimize = "${java.optimize}"
+         deprecation = "on"
+         includeantruntime = "false"
+         includejavaruntime = "false"
+     >
+         <classpath refid="classpath.dependencies" />
+         <classpath location="${carrot2.cvs.dir}/lib/compile-time-only/servlet.jar" />
+         <classpath location="${carrot2.cvs.dir}/lib/compile-time-only/junit.jar" />
+ 
+         <!-- add source code paths. -->
+         <src path="src-test" />
+     </javac>
+ 
+     <!-- copy any non-java files (resources) from the source path. -->
+     <copy toDir="${build.dir}/WEB-INF/classes">
+         <fileset dir="src-test">
+             <exclude name="**/*.java"/>
+         </fileset>
+     </copy>
+ 
+     <!-- Run JUnit tests. -->
+     <junit dir="${build.dir}/WEB-INF/classes" fork="true" printsummary="true"
+         errorproperty="junit.error" failureproperty="junit.failure"
+         haltonerror="true" haltonfailure="true">
+ 
+         <formatter type="plain"/>
+ 
+         <classpath refid="classpath.dependencies" />
+         <classpath location="${carrot2.cvs.dir}/lib/compile-time-only/servlet.jar" />
+         <classpath location="${carrot2.cvs.dir}/lib/compile-time-only/junit.jar" />
+         <classpath location="${build.dir}/WEB-INF/classes" />
+ 
+         <batchtest todir="${build.dir}">
+             <fileset dir="${build.dir}/WEB-INF/classes">
+                 <include name="**/TestDataFilesClusteringTest.class" />
+             </fileset>
+         </batchtest>
+     </junit>
+ 
+ </target>
+ <!-- }}} -->
+ 
  <!-- ##################################### -->
***************
*** 147,151 ****
  <exclude name="WEB-INF/lib/**" />
  </fileset>
! 
  <classes dir="${build.dir}/WEB-INF/classes">
  <exclude name="**/*Test"/>
--- 198,202 ----
  <exclude name="WEB-INF/lib/**" />
  </fileset>
! 
  <classes dir="${build.dir}/WEB-INF/classes">
  <exclude name="**/*Test"/>
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/com/stachoodev/carrot/filter/cluster/lsicluster
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5837/src-test/com/stachoodev/carrot/filter/cluster/lsicluster

Added Files:
	TestDataFilesClusteringTest.java LsiClusteringStrategyTest.java

Log Message:
[new] Feature extraction now uses the carrot tokenizer.
[bugfix] Long queries will now terminate with a runtime exception. This is caused by some bug (?) in SVD decomposition. A new version of Jama (patched by us) has to be downloaded too.
[refactoring] Various small code refactorings.

--- NEW FILE: TestDataFilesClusteringTest.java ---
/*
 * Carrot2 Project
 * Copyright (C) 2002-2003, Dawid Weiss
 * Portions (C) Contributors listen in carrot2.CONTRIBUTORS file.
 * All rights reserved.
 *
 * Refer to full text of the licence "carrot2.LICENCE" in the root folder
 * of CVS checkout or at:
 * http://www.cs.put.poznan.pl/dweiss/carrot2.LICENCE
 */
package com.stachoodev.carrot.filter.cluster.lsicluster;

import com.stachoodev.carrot.filter.cluster.common.*;
import com.stachoodev.util.log.TimeLogger;

import junit.framework.TestCase;

import java.util.*;
import java.io.*;

import org.apache.log4j.Logger;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.put.util.xml.JDOMHelper;

/**
 * @author Dawid Weiss
 */
public class TestDataFilesClusteringTest extends TestCase {

    public TestDataFilesClusteringTest(String arg0) {
        super(arg0);
    }

    public void testClusteringOfDataFiles() throws Exception {
        org.apache.log4j.BasicConfigurator.configure();
        Logger logger = Logger.getLogger("tests.performance");

        File dataDir = new File("data");
        if (!dataDir.exists() || !dataDir.isDirectory()) {
            fail("'data' directory not available: " + dataDir.getAbsolutePath());
        }

        File [] tests = dataDir.listFiles(new FilenameFilter() {
            public boolean accept(File dir, String name) {
                return name.endsWith(".xml");
            }
        });

        for (int f = 0; f < tests.length; f++) {
            SAXBuilder builder = new SAXBuilder();
            FileInputStream is = new FileInputStream(tests[f]);
            Document doc;
            try {
                doc = builder.build(is);
            } finally {
                is.close();
            }

            List documentList = JDOMHelper.getElements("searchresult/document", doc.getRootElement());
            logger.info("Clustering: " + tests[f].getName() + ", " + documentList.size() + " documents.");

            TimeLogger tlogger = new TimeLogger();
            tlogger.start();

            Map lingoOptions = new HashMap();
            lingoOptions.put("stemmer.english", "com.dawidweiss.carrot.filter.stemming.porter.PorterStemmer");
            lingoOptions.put("stemmer.polish", "com.dawidweiss.carrot.filter.stemming.lametyzator.Lametyzator");
            lingoOptions.put("preprocessing.class",
                // "com.stachoodev.carrot.filter.cluster.common.MultilingualPreprocessingStrategy");
                CarrotLibTokenizerPreprocessingStrategy.class.getName());
            lingoOptions.put("lsi.threshold.clusterAssignment", "0.150");
            lingoOptions.put("lsi.threshold.candidateCluster", "0.775");

            MultilingualClusteringContext clusteringContext =
                new MultilingualClusteringContext(new File("../.."), lingoOptions);

            for (Iterator j = documentList.iterator(); j.hasNext();) {
                Element document = (Element) j.next();
                String title = document.getChildText("title");
                String snippet = document.getChildText("snippet");
                clusteringContext.addSnippet(new Snippet(document.getAttributeValue("id"), title, snippet));
            }

            // Query
            clusteringContext.setQuery(doc.getRootElement().getChildText("query"));

            // Cluster
            ClusteringResults clusteringResults;
            try {
                clusteringResults = clusteringContext.cluster();
            } catch (Exception e) {
                logger.error("Error in clustering.", e);
                continue;
            }
            Cluster [] clusters = clusteringResults.getClusters();

            tlogger.logElapsedAndStart(logger, "clustering " + documentList.size() + " results. "
                + ", features: " + clusteringContext.getFeatures().length
                + ", clusters: " + clusters.length);

            if (logger.isEnabledFor(org.apache.log4j.Level.DEBUG)) {
                StringBuffer buf = new StringBuffer();
                for (int j = 0; j < clusters.length; j++) {
                    buf.append(Arrays.asList(clusters[j].getLabels()));
                    buf.append("\n\t" + clusters[j].getSnippets()[0].getId());
                    buf.append("\n\n");
                }
                logger.debug(buf);
            }
        }
    }
}

--- NEW FILE: LsiClusteringStrategyTest.java ---
/*
 * Carrot2 Project
 * Copyright (C) 2002-2003, Dawid Weiss
 * Portions (C) Contributors listen in carrot2.CONTRIBUTORS file.
 * All rights reserved.
 *
 * Refer to full text of the licence "carrot2.LICENCE" in the root folder
 * of CVS checkout or at:
 * http://www.cs.put.poznan.pl/dweiss/carrot2.LICENCE
 */
package com.stachoodev.carrot.filter.cluster.lsicluster;

import com.stachoodev.carrot.filter.cluster.common.*;
import com.stachoodev.util.log.TimeLogger;

import junit.framework.TestCase;

import java.util.*;
import java.io.*;

import org.apache.log4j.Logger;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.put.util.xml.JDOMHelper;

/**
 * @author Dawid Weiss
 */
public class LsiClusteringStrategyTest extends TestCase {

    public LsiClusteringStrategyTest(String arg0) {
        super(arg0);
    }

    public void testClustering() throws Exception {
        org.apache.log4j.BasicConfigurator.configure();
        Logger logger = Logger.getLogger("tests.performance");

        SAXBuilder builder = new SAXBuilder();
        Document doc = builder.build(this.getClass().getClassLoader()
            .getResourceAsStream("data/data-mining.xml"));
            // .getResourceAsStream("longq.xml"));
        List documentList = JDOMHelper.getElements("searchresult/document", doc.getRootElement());

        for (int i = 50; i < documentList.size(); i += 50) {
            TimeLogger tlogger = new TimeLogger();
            tlogger.start();

            Map lingoOptions = new HashMap();
            lingoOptions.put("stemmer.english", "com.dawidweiss.carrot.filter.stemming.porter.PorterStemmer");
            lingoOptions.put("stemmer.polish", "com.dawidweiss.carrot.filter.stemming.lametyzator.Lametyzator");
            lingoOptions.put("preprocessing.class",
                // "com.stachoodev.carrot.filter.cluster.common.MultilingualPreprocessingStrategy");
                CarrotLibTokenizerPreprocessingStrategy.class.getName());
            // lingoOptions.put("feature.extraction.strategy",
            //     MultilingualFeatureExtractionStrategyWithCutoff.class.getName());
            lingoOptions.put("lsi.threshold.clusterAssignment", "0.150");
            lingoOptions.put("lsi.threshold.candidateCluster", "0.775");

            MultilingualClusteringContext clusteringContext =
                new MultilingualClusteringContext(new File("."), lingoOptions);

            int max = i;
            for (Iterator j = documentList.iterator(); j.hasNext() && max > 0; max--) {
                Element document = (Element) j.next();
                String title = document.getChildText("title");
                String snippet = document.getChildText("snippet");
                clusteringContext.addSnippet(new Snippet(document.getAttributeValue("id"), title, snippet));
            }

            // Query
            clusteringContext.setQuery(doc.getRootElement().getChildText("query"));

            // Cluster
            ClusteringResults clusteringResults = clusteringContext.cluster();
            Cluster [] clusters = clusteringResults.getClusters();

            tlogger.logElapsedAndStart(logger, "clustering " + i + " results. "
                + ", features: " + clusteringContext.getFeatures().length
                + ", clusters: " + clusters.length);

            StringBuffer buf = new StringBuffer();
            for (int j = 0; j < clusters.length; j++) {
                buf.append(Arrays.asList(clusters[j].getLabels()));
                buf.append("\n\t" + clusters[j].getSnippets()[0].getId());
                buf.append("\n\n");
            }
            logger.debug(buf);
        }
    }
}
From: <daw...@us...> - 2004-02-10 15:31:06
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src/com/stachoodev/util/log
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5837/src/com/stachoodev/util/log

Modified Files:
	TimeLogger.java

Log Message:
[new] Feature extraction now uses the carrot tokenizer.
[bugfix] Long queries will now terminate with a runtime exception. This is caused by some bug (?) in SVD decomposition. A new version of Jama (patched by us) has to be downloaded too.
[refactoring] Various small code refactorings.

Index: TimeLogger.java
===================================================================
RCS file: /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src/com/stachoodev/util/log/TimeLogger.java,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** TimeLogger.java	30 Sep 2003 11:35:46 -0000	1.2
--- TimeLogger.java	10 Feb 2004 15:27:41 -0000	1.3
***************
*** 101,105 ****
  return numberFormat.format(elapsed / 1000.0f) + " sec.";
  
! case UNIT_MILISECONDS:default:
  return Long.toString(elapsed) + " msec.";
  }
--- 101,106 ----
  return numberFormat.format(elapsed / 1000.0f) + " sec.";
  
! case UNIT_MILISECONDS:
! default:
  return Long.toString(elapsed) + " msec.";
  }
From: <daw...@us...> - 2004-02-10 15:31:06
|
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src/com/stachoodev/carrot/filter/cluster/lsicluster In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5837/src/com/stachoodev/carrot/filter/cluster/lsicluster Modified Files: LsiClusteringStrategy.java DummyClusteringStrategy.java Removed Files: DummyClusteringStrategyTest.java LsiClusteringStrategyTest.java Log Message: [new] Feature extraction now uses carrot tokenizer [bugfix] long queries now will terminate with a runtime exception. This is caused by some bug (?) in SVD decomposition. A new version of Jama (patched by us) has to be downloaded too. [refactoring] various small code refactorings. Index: LsiClusteringStrategy.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src/com/stachoodev/carrot/filter/cluster/lsicluster/LsiClusteringStrategy.java,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** LsiClusteringStrategy.java 30 Sep 2003 11:35:46 -0000 1.2 --- LsiClusteringStrategy.java 10 Feb 2004 15:27:40 -0000 1.3 *************** *** 95,107 **** // Init parameters ! if (clusteringContext.getParameter("lsi.threshold.clusterAssignment") != null) { - String value = (String) clusteringContext.getParameter( - "lsi.threshold.clusterAssignment" - ); - try { ! clusterAssignmentThreshold = Double.parseDouble(value); } catch (NumberFormatException e) --- 95,104 ---- // Init parameters ! Object value; ! if ( (value = clusteringContext.getParameter("lsi.threshold.clusterAssignment")) != null) { try { ! clusterAssignmentThreshold = Double.parseDouble(unwrapString(value)); } catch (NumberFormatException e) *************** *** 112,124 **** } ! if (clusteringContext.getParameter("lsi.threshold.candidateCluster") != null) { - String value = (String) clusteringContext.getParameter( - "lsi.threshold.candidateCluster" - ); - try { ! 
candidateClusterThreshold = Double.parseDouble(value); } catch (NumberFormatException e) --- 109,117 ---- } ! if ((value = clusteringContext.getParameter("lsi.threshold.candidateCluster")) != null) { try { ! candidateClusterThreshold = Double.parseDouble(unwrapString(value)); } catch (NumberFormatException e) *************** *** 145,148 **** --- 138,151 ---- } + /** + * Unwraps a String out of a list, if needed. + */ + private String unwrapString(Object value) { + if (value instanceof List) { + return (String) ((List) value).get(0); + } else { + return (String) value; + } + } /** *************** *** 179,183 **** // Create TD matrix TdMatrixBuildingStrategy tdMatrixBuildingStrategy = new TfidfTdMatrixBuildingStrategy( ! 2, 400 * 200 ); --- 182,186 ---- // Create TD matrix TdMatrixBuildingStrategy tdMatrixBuildingStrategy = new TfidfTdMatrixBuildingStrategy( ! 2, 250 * 150 ); *************** *** 207,210 **** --- 210,224 ---- // The SVD + + // dump the matrix. + try { + java.io.ObjectOutputStream os = new java.io.ObjectOutputStream (new java.io.FileOutputStream("f:\\matrix")); + os.writeObject(tdMatrix); + os.close(); + } + catch (Exception e) { + } + + SingularValueDecomposition svd = tdMatrix.svd(); Index: DummyClusteringStrategy.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src/com/stachoodev/carrot/filter/cluster/lsicluster/DummyClusteringStrategy.java,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** DummyClusteringStrategy.java 30 Sep 2003 11:35:46 -0000 1.2 --- DummyClusteringStrategy.java 10 Feb 2004 15:27:40 -0000 1.3 *************** *** 62,74 **** int termCount = tdMatrix.getRowDimension(); - int docCount = tdMatrix.getColumnDimension(); int clusterCount = 2; - // Check dimensions - boolean transposed = false; - if (tdMatrix.getColumnDimension() > tdMatrix.getRowDimension()) { - transposed = true; tdMatrix = 
tdMatrix.transpose(); } --- 62,69 ---- --- DummyClusteringStrategyTest.java DELETED --- --- LsiClusteringStrategyTest.java DELETED --- |
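The parameter-handling change in this commit can be sketched in isolation: context parameters such as lsi.threshold.clusterAssignment may arrive either as a plain String or as a List wrapping one, so thresholds are parsed through an unwrapping helper with a fallback on bad input. A minimal standalone sketch (class name, method names, and default values here are illustrative, not the actual Carrot2 API):

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the parameter-unwrapping pattern introduced in this commit:
// a parameter value may be a String or a List whose first element is the
// String, so numeric thresholds are parsed defensively.
public class ParamUnwrapDemo {
    /** Unwraps a String out of a list, if needed. */
    static String unwrapString(Object value) {
        if (value instanceof List) {
            return (String) ((List) value).get(0);
        }
        return (String) value;
    }

    /** Parses a threshold parameter, keeping the default on null or bad input. */
    static double parseThreshold(Object value, double defaultValue) {
        if (value == null) {
            return defaultValue;
        }
        try {
            return Double.parseDouble(unwrapString(value));
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseThreshold("0.25", 0.15));                // plain String
        System.out.println(parseThreshold(Arrays.asList("0.30"), 0.15)); // list-wrapped
        System.out.println(parseThreshold("oops", 0.15));                // falls back
    }
}
```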
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src/com/stachoodev/carrot/filter/cluster/common In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5837/src/com/stachoodev/carrot/filter/cluster/common Modified Files: DefaultClusteringContext.java MultilingualPreprocessingStrategy.java AbstractClusteringContext.java ClusteringResults.java TfidfTdMatrixBuildingStrategy.java AbstractSnippetsIntWrapper.java MultilingualClusteringContext.java Added Files: CarrotLibTokenizerPreprocessingStrategy.java Log Message: [new] Feature extraction now uses carrot tokenizer [bugfix] long queries now will terminate with a runtime exception. This is caused by some bug (?) in SVD decomposition. A new version of Jama (patched by us) has to be downloaded too. [refactoring] various small code refactorings. --- NEW FILE: CarrotLibTokenizerPreprocessingStrategy.java --- /* * Carrot2 Project * Copyright (C) 2002-2003, Dawid Weiss * Portions (C) Contributors listen in carrot2.CONTRIBUTORS file. * All rights reserved. 
* * Refer to full text of the licence "carrot2.LICENCE" in the root folder * of CVS checkout or at: * http://www.cs.put.poznan.pl/dweiss/carrot2.LICENCE */ package com.stachoodev.carrot.filter.cluster.common; import com.dawidweiss.carrot.filter.stemming.DirectStemmer; import com.dawidweiss.carrot.tokenizer.Tokenizer; import com.dawidweiss.carrot.util.StringUtils; import org.apache.log4j.Logger; import java.util.*; /** * @author Dawid Weiss */ public final class CarrotLibTokenizerPreprocessingStrategy implements PreprocessingStrategy { /** Logger */ protected static final Logger logger = Logger.getLogger( CarrotLibTokenizerPreprocessingStrategy.class); /** Linguistic information */ protected Map stemSets; protected Map inflectedSets; protected Map stopWordSets; protected Map nonStopWordSets; protected Map stemmers; protected Set strongWords; protected Set queryWords; protected Set lowCaseWords; protected Map caseCheck; /** */ protected Map inflectedFreqSets; /** * @see java.lang.Object#Object() */ public CarrotLibTokenizerPreprocessingStrategy() { } /** * @see com.stachoodev.carrot.filter.cluster.common.PreprocessingStrategy#preprocess(com.stachoodev.carrot.filter.cluster.common.Snippet) */ public Snippet [] preprocess(AbstractClusteringContext clusteringContext) { Tokenizer tokenizer = Tokenizer.getTokenizer(); Snippet [] snippets = clusteringContext.getSnippets(); Snippet [] preprocessedSnippets = new Snippet[snippets.length]; stopWordSets = ((MultilingualClusteringContext) clusteringContext).getStopWordSets(); nonStopWordSets = ((MultilingualClusteringContext) clusteringContext).getNonStopWordSets(); stemSets = ((MultilingualClusteringContext) clusteringContext).getStemSets(); inflectedSets = ((MultilingualClusteringContext) clusteringContext).getInflectedSets(); strongWords = ((MultilingualClusteringContext) clusteringContext).getStrongWords(); queryWords = ((MultilingualClusteringContext) clusteringContext).getQueryWords(); stemmers = 
((MultilingualClusteringContext) clusteringContext).getStemmers(); inflectedFreqSets = new HashMap(); lowCaseWords = new HashSet(); caseCheck = new HashMap(); // Clean and guess language for (int i = 0; i < snippets.length; i++) { preprocessedSnippets[i] = preprocess(snippets[i], tokenizer); } // Change "unidentified" to the most common language HashMap languageFreq = new HashMap(); String mostCommonLanguage = MultilingualClusteringContext.UNIDENTIFIED_LANGUAGE_NAME; int maxLanguageFreq = 0; for (int i = 0; i < preprocessedSnippets.length; i++) { if (!languageFreq.containsKey(preprocessedSnippets[i].getLanguage())) { languageFreq.put(preprocessedSnippets[i].getLanguage(), new Integer(1)); if ( (maxLanguageFreq < 1) && !preprocessedSnippets[i].getLanguage().equals( MultilingualClusteringContext.UNIDENTIFIED_LANGUAGE_NAME ) ) { maxLanguageFreq = 1; mostCommonLanguage = preprocessedSnippets[i].getLanguage(); } } else { int freq = ((Integer) languageFreq.get(preprocessedSnippets[i].getLanguage())) .intValue(); languageFreq.put(preprocessedSnippets[i].getLanguage(), new Integer(freq + 1)); if ( (maxLanguageFreq < (freq + 1)) && !preprocessedSnippets[i].getLanguage().equals( MultilingualClusteringContext.UNIDENTIFIED_LANGUAGE_NAME ) ) { maxLanguageFreq = freq + 1; mostCommonLanguage = preprocessedSnippets[i].getLanguage(); } } } for (int i = 0; i < snippets.length; i++) { if ( preprocessedSnippets[i].getLanguage().equals( MultilingualClusteringContext.UNIDENTIFIED_LANGUAGE_NAME ) ) { preprocessedSnippets[i].setLanguage(mostCommonLanguage); } preprocessedSnippets[i] = stemming(preprocessedSnippets[i]); } // Create inflectedSets Iterator languages = inflectedFreqSets.keySet().iterator(); while (languages.hasNext()) { String language = (String) languages.next(); HashMap inflectedFreq = (HashMap) inflectedFreqSets.get(language); HashMap inflected = new HashMap(); inflectedSets.put(language, inflected); Iterator stems = inflectedFreq.keySet().iterator(); while 
(stems.hasNext()) { String stem = (String) stems.next(); HashMap inflectedForStem = (HashMap) inflectedFreq.get(stem); if (inflectedForStem != null) { int maxFreq = 0; String bestInflected = stem; Iterator inflecteds = inflectedForStem.keySet().iterator(); while (inflecteds.hasNext()) { String infl = (String) inflecteds.next(); Integer freq = (Integer) inflectedForStem.get(infl); if (freq.intValue() > maxFreq) { maxFreq = freq.intValue(); bestInflected = infl; } } inflected.put(stem, bestInflected); } } } return preprocessedSnippets; } /** * Method clean. */ protected Snippet preprocess(Snippet snippet, Tokenizer tokenizer) { String title = tokenizeAndClean(snippet.getTitle(), tokenizer); String body = tokenizeAndClean(snippet.getBody(), tokenizer); String language = guessLanguage( (title.equals("") ? "" : (title + " ")) + "." + (body.equals("") ? "" : (" " + body)) ); Snippet preprocessedSnippet = new Snippet(snippet.getId(), title, body, language); return preprocessedSnippet; } /** * @param snippet * * @return */ protected Snippet stemming(Snippet snippet) { Snippet stemmedSnippet = new Snippet( snippet.getId(), stemming(snippet.getTitle(), snippet.getLanguage(), true), stemming(snippet.getBody(), snippet.getLanguage(), false), snippet.getLanguage() ); return stemmedSnippet; } /** * @param text * @param language * @param strong */ private String stemming(String text, String language, boolean strong) { StringBuffer stringBuffer = new StringBuffer(); StringTokenizer stringTokenizer = new StringTokenizer(text); DirectStemmer stemmer = (DirectStemmer) stemmers.get(language); HashMap stems = (HashMap) stemSets.get(language); HashSet stopWords = (HashSet) stopWordSets.get(language); HashSet nonStopWords = (HashSet) nonStopWordSets.get(language); if (!inflectedFreqSets.containsKey(language)) { inflectedFreqSets.put(language, new HashMap()); } HashMap inflectedFreq = (HashMap) inflectedFreqSets.get(language); while (stringTokenizer.hasMoreTokens()) { String token = 
stringTokenizer.nextToken(); if (token.equals(".")) { if (stringBuffer.length() > 0) { stringBuffer.append(" ."); } continue; } // Remove one-character-long terms if ( (token.length() < 2) && ((stopWords == null) || ((stopWords != null) && !stopWords.contains(token.toLowerCase()))) ) { continue; } // Remove overly long terms if (token.length() > 25) { continue; } // Case processing if (StringUtils.capitalizedRatio(token) > 0.5) { if (lowCaseWords.contains(token.toLowerCase())) { token = token.toLowerCase(); } } else { token = token.toLowerCase(); } // Stemming if ( (stemmer != null) && !language.equalsIgnoreCase( MultilingualClusteringContext.UNIDENTIFIED_LANGUAGE_NAME ) && !stopWords.contains(token) ) { String stem; if (!stems.containsKey(token)) { synchronized (stemmer) { stem = stemmer.getStem(token.toCharArray(), 0, token.length()); } if (stem != null) // ineffective ! { stems.put(token, stem); } else { stems.put(token, token); } } else { stem = (String) stems.get(token); } if (!inflectedFreq.containsKey(stem)) { inflectedFreq.put(stem, new HashMap()); } HashMap inflectedForStem = (HashMap) inflectedFreq.get(stem); if (!inflectedForStem.containsKey(token)) { inflectedForStem.put(token, new Integer(1)); } else { Integer freq = (Integer) inflectedForStem.get(token); inflectedForStem.put(token, new Integer(freq.intValue() + 1)); } token = (String) stems.get(token); } // Strong terms if (strong) { strongWords.add(token); } // Non-stop words if ( language.equalsIgnoreCase(MultilingualClusteringContext.UNIDENTIFIED_LANGUAGE_NAME) || !stopWords.contains(token) ) { nonStopWords.add(token); } if (stringBuffer.length() == 0) { stringBuffer.append(token); } else { stringBuffer.append(" "); stringBuffer.append(token); } } return stringBuffer.toString(); } /** * @param snippet */ private String guessLanguage(String text) { StringTokenizer stringTokenizer = new StringTokenizer(text); HashMap stopWordFrequencies = new HashMap(); String language = 
MultilingualClusteringContext.UNIDENTIFIED_LANGUAGE_NAME; int maxStopWordFrequency = 0; while (stringTokenizer.hasMoreTokens()) { String token = stringTokenizer.nextToken(); if (StringUtils.capitalizedRatio(token) > 0.5) { continue; } if (token.equals(".")) { continue; } Iterator keys = stopWordSets.keySet().iterator(); while (keys.hasNext()) { String key = (String) keys.next(); HashSet stopWords = (HashSet) stopWordSets.get(key); if (stopWords.contains(token)) { if (!stopWordFrequencies.containsKey(key)) { stopWordFrequencies.put(key, new Integer(1)); if (1 > maxStopWordFrequency) { maxStopWordFrequency = 1; language = key; } } else { int stopWordFrequency = ((Integer) stopWordFrequencies.get(key)).intValue(); stopWordFrequencies.put(key, new Integer(stopWordFrequency + 1)); if ((stopWordFrequency + 1) > maxStopWordFrequency) { maxStopWordFrequency = stopWordFrequency + 1; language = key; } } } } } // Check for "draws" boolean draw = false; HashSet values = new HashSet(); for (Iterator val = stopWordFrequencies.values().iterator(); val.hasNext();) { Integer v = (Integer) val.next(); if (v.intValue() == maxStopWordFrequency) { if (!values.contains(v)) { values.add(v); } else { draw = true; break; } } } return (draw ? MultilingualClusteringContext.UNIDENTIFIED_LANGUAGE_NAME : language); } /** * Tokenizes the input text and returns a "cleaned" version containing only recognizable tokens * and sequence markers. */ private String tokenizeAndClean(String text, Tokenizer tokenizer) { StringBuffer stringBuffer = new StringBuffer(text.length()); tokenizer.restartTokenizerOn(text); int [] tokenType = { 0 }; String tokenImage; String tokenImageLowerCase; int lastAddedType = Tokenizer.TYPE_SENTENCEMARKER; while ( (tokenImage = tokenizer.getNextToken(tokenType)) != null) { tokenImageLowerCase = tokenImage.toLowerCase(); switch (tokenType[0]) { case Tokenizer.TYPE_PERSON: // Pick the last contiguous component of a person's name. 
int i = tokenImage.length()-1; outerLoop: while (i>=0) { switch (tokenImage.charAt(i)) { case ' ': case '.': case '\'': // O'Brian -- maybe we should skip this? i++; break outerLoop; default: } i--; } tokenImage = tokenImage.substring(i); case Tokenizer.TYPE_TERM: Object previousTokenImage = caseCheck.get(tokenImageLowerCase); if (previousTokenImage == null) { caseCheck.put(tokenImageLowerCase, tokenImage); } else { if (!tokenImage.equals(previousTokenImage)) { lowCaseWords.add(tokenImageLowerCase); } } if (lastAddedType == Tokenizer.TYPE_TERM) stringBuffer.append(' '); stringBuffer.append(tokenImage); lastAddedType = Tokenizer.TYPE_TERM; break; case Tokenizer.TYPE_EMAIL: case Tokenizer.TYPE_URL: // ignore these. break; case Tokenizer.TYPE_SENTENCEMARKER: if (lastAddedType != Tokenizer.TYPE_SENTENCEMARKER) { stringBuffer.append(" . "); lastAddedType = Tokenizer.TYPE_SENTENCEMARKER; } default: // ignore unknown. break; } } return stringBuffer.toString(); } } Index: DefaultClusteringContext.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src/com/stachoodev/carrot/filter/cluster/common/DefaultClusteringContext.java,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** DefaultClusteringContext.java 30 Sep 2003 11:35:46 -0000 1.2 --- DefaultClusteringContext.java 10 Feb 2004 15:27:40 -0000 1.3 *************** *** 46,50 **** { snippets = new ArrayList(); - additionalData = new HashMap(); stems = new HashMap(); inflected = new HashMap(); --- 46,49 ---- *************** *** 141,180 **** /** - * Method putData. - * - * @param key - * @param data - */ - public void putData(Object key, Object data) - { - additionalData.put(key, data); - } - - - /** - * Method getData. - * - * @param key - * - * @return Object - */ - public Object getData(Object key) - { - return additionalData.get(key); - } - - - /** - * Method removeData. 
- * - * @param key - */ - public void removeData(Object key) - { - additionalData.remove(key); - } - - - /** * Returns the snippets. * --- 140,143 ---- Index: MultilingualPreprocessingStrategy.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src/com/stachoodev/carrot/filter/cluster/common/MultilingualPreprocessingStrategy.java,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** MultilingualPreprocessingStrategy.java 30 Sep 2003 11:35:46 -0000 1.2 --- MultilingualPreprocessingStrategy.java 10 Feb 2004 15:27:40 -0000 1.3 *************** *** 36,40 **** /** The sentence delimiters over which phrases cannot be spanned */ ! private static final String [] DEFAULT_SENTENCE_DELIMITERS = { ".", "?", "!", "|", ";" }; /** Sentence delimiters */ --- 36,41 ---- /** The sentence delimiters over which phrases cannot be spanned */ ! private static final String [] DEFAULT_SENTENCE_DELIMITERS = ! { ".", "?", "!", "|", ";" }; /** Sentence delimiters */ *************** *** 222,229 **** /** * Method clean. - * - * @param string - * - * @return String */ protected Snippet preprocess(Snippet snippet) --- 223,226 ---- *************** *** 492,497 **** /** ! * Regular expression for matching numbers. TODO: this code should be perhaps replaced with ! * tokenizer class from carrot-utils */ private static final RE numberPattern; --- 489,493 ---- /** ! * Regular expression for matching numbers. 
*/ private static final RE numberPattern; Index: AbstractClusteringContext.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src/com/stachoodev/carrot/filter/cluster/common/AbstractClusteringContext.java,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** AbstractClusteringContext.java 30 Sep 2003 11:35:46 -0000 1.2 --- AbstractClusteringContext.java 10 Feb 2004 15:27:40 -0000 1.3 *************** *** 4,8 **** * Carrot2 Project * Copyright (C) 2002-2003, Dawid Weiss ! * Portions (C) Contributors listen in carrot2.CONTRIBUTORS file. * All rights reserved. * --- 4,8 ---- * Carrot2 Project * Copyright (C) 2002-2003, Dawid Weiss ! * Portions (C) Contributors listed in carrot2.CONTRIBUTORS file. * All rights reserved. * *************** *** 21,29 **** /** ! * @author stachoo To change this generated comment go to Window>Preferences>Java>Code ! * Generation>Code Template ! */ ! /** ! * @author stachoo */ public abstract class AbstractClusteringContext --- 21,25 ---- /** ! * @author Stanisław Osiński */ public abstract class AbstractClusteringContext *************** *** 44,50 **** protected Feature [] features; - /** Additional information that specific algorithms may wish to store. */ - protected HashMap additionalData; - /** Clustering strategy */ protected ClusteringStrategy clusteringStrategy; --- 40,43 ---- *************** *** 62,66 **** { snippets = new ArrayList(); - additionalData = new HashMap(); strongWords = new HashSet(); parameters = new HashMap(); --- 55,58 ---- *************** *** 85,124 **** /** - * Method putData. - * - * @param key - * @param data - */ - public void putData(Object key, Object data) - { - additionalData.put(key, data); - } - - - /** - * Method getData. - * - * @param key - * - * @return Object - */ - public Object getData(Object key) - { - return additionalData.get(key); - } - - - /** - * Method removeData. 
- * - * @param key - */ - public void removeData(Object key) - { - additionalData.remove(key); - } - - - /** * Returns the snippets. * --- 77,80 ---- *************** *** 226,239 **** public Object getParameter(Object key) { ! LinkedList param = (LinkedList) parameters.get(key); ! ! if (param != null) ! { ! return (String) param.get(0); ! } ! else ! { ! return null; ! } } --- 182,186 ---- public Object getParameter(Object key) { ! return parameters.get(key); } Index: ClusteringResults.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src/com/stachoodev/carrot/filter/cluster/common/ClusteringResults.java,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** ClusteringResults.java 30 Sep 2003 11:35:46 -0000 1.2 --- ClusteringResults.java 10 Feb 2004 15:27:40 -0000 1.3 *************** *** 15,40 **** package com.stachoodev.carrot.filter.cluster.common; - - import com.stachoodev.util.suffixarrays.wrapper.Substring; - import java.util.Properties; - - /** ! * */ public class ClusteringResults { - /** */ private Cluster [] clusters; - /** */ - private Substring [] keywords; - - /** */ - private Properties techInfo; - - /** */ - private double [][] termTermMatrix; - /** * Method ClusteringResults. --- 15,25 ---- package com.stachoodev.carrot.filter.cluster.common; /** ! * @author Stanislaw Osinski */ public class ClusteringResults { private Cluster [] clusters; /** * Method ClusteringResults. *************** *** 44,70 **** public ClusteringResults(Cluster [] clusters) { - this(clusters, null); - } - - - /** - * Method ClusteringResults. - * - * @param keywords - */ - public ClusteringResults(Substring [] keywords) - { - this(null, keywords); - } - - - /** - * Method ClusteringResults. 
- * - * @param keywords - */ - public ClusteringResults(Cluster [] clusters, Substring [] keywords) - { - this.keywords = keywords; this.clusters = clusters; } --- 29,32 ---- *************** *** 80,136 **** } - - /** - * Returns the keywords. - * - * @return Substring[] - */ - public Substring [] getKeywords() - { - return keywords; - } - - - /** - * Returns the techInfo. - * - * @return Properties - */ - public Properties getTechInfo() - { - return techInfo; - } - - - /** - * Sets the techInfo. - * - * @param techInfo The techInfo to set - */ - public void setTechInfo(Properties techInfo) - { - this.techInfo = techInfo; - } - - - /** - * Returns the termTermMatrix. - * - * @return double[][] - */ - public double [][] getTermTermMatrix() - { - return termTermMatrix; - } - - - /** - * Sets the termTermMatrix. - * - * @param termTermMatrix The termTermMatrix to set - */ - public void setTermTermMatrix(double [][] termTermMatrix) - { - this.termTermMatrix = termTermMatrix; - } } --- 42,44 ---- Index: TfidfTdMatrixBuildingStrategy.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src/com/stachoodev/carrot/filter/cluster/common/TfidfTdMatrixBuildingStrategy.java,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** TfidfTdMatrixBuildingStrategy.java 30 Sep 2003 11:35:46 -0000 1.2 --- TfidfTdMatrixBuildingStrategy.java 10 Feb 2004 15:27:40 -0000 1.3 *************** *** 71,75 **** while ( ! !features[rows].isStopWord() && (features[rows].getTf() >= minimumTd) && ((maximumSize < 1) || (size <= maximumSize)) ) --- 71,75 ---- while ( ! 
features.length > rows && !features[rows].isStopWord() && (features[rows].getTf() >= minimumTd) && ((maximumSize < 1) || (size <= maximumSize)) ) Index: AbstractSnippetsIntWrapper.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src/com/stachoodev/carrot/filter/cluster/common/AbstractSnippetsIntWrapper.java,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** AbstractSnippetsIntWrapper.java 19 Sep 2003 10:18:39 -0000 1.1.1.1 --- AbstractSnippetsIntWrapper.java 10 Feb 2004 15:27:40 -0000 1.2 *************** *** 97,101 **** if (documents[i].length() > 0) { ! stringBuffer.append(" | "); stringBuffer.append(documents[i]); } --- 97,103 ---- if (documents[i].length() > 0) { ! stringBuffer.append(' '); ! stringBuffer.append(DOCUMENT_DELIMITER); ! stringBuffer.append(' '); stringBuffer.append(documents[i]); } Index: MultilingualClusteringContext.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src/com/stachoodev/carrot/filter/cluster/common/MultilingualClusteringContext.java,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** MultilingualClusteringContext.java 30 Sep 2003 11:35:46 -0000 1.2 --- MultilingualClusteringContext.java 10 Feb 2004 15:27:40 -0000 1.3 *************** *** 52,70 **** public static final String UNIDENTIFIED_LANGUAGE_NAME = "unidentified"; - /** - * This constructor will probably be used for test purposes only. - */ - public MultilingualClusteringContext() - { - this(new File(System.getProperty("user.dir"))); - } - /** * @param dataDir */ ! public MultilingualClusteringContext(File dataDir) { this.dataDir = dataDir; stopWordSets = new HashMap(); --- 52,64 ---- public static final String UNIDENTIFIED_LANGUAGE_NAME = "unidentified"; /** * @param dataDir */ ! 
public MultilingualClusteringContext(File dataDir, Map params) { this.dataDir = dataDir; + if (params != null) + super.setParameters(params); stopWordSets = new HashMap(); *************** *** 80,85 **** initLanguageProcessing(); ! preprocessingStrategy = new MultilingualPreprocessingStrategy(); ! featureExtractionStrategy = new MultilingualFeatureExtractionStrategy(); clusteringStrategy = new LsiClusteringStrategy(); } --- 74,114 ---- initLanguageProcessing(); ! Object value; ! if ( (value = this.getParameter("preprocessing.class")) != null) { ! if (value instanceof List) { ! value = ((List) value).get(0); ! } ! try ! { ! preprocessingStrategy = (PreprocessingStrategy) Thread.currentThread() ! .getContextClassLoader().loadClass((String) value).newInstance(); ! } ! catch (Exception e) ! { ! logger.warn("Preprocessing strategy instantiation error",e); ! throw new RuntimeException("Preprocessing strategy could not be loaded: " ! + value + ", " + e.toString()); ! } ! } else { ! preprocessingStrategy = new CarrotLibTokenizerPreprocessingStrategy(); ! } ! ! if ((value = this.getParameter("feature.extraction.strategy")) != null) { ! try ! { ! featureExtractionStrategy = (FeatureExtractionStrategy) ! Thread.currentThread().getContextClassLoader() ! .loadClass((String) value).newInstance(); ! } ! catch (Exception e) ! { ! logger.warn("Feature extraction strategy instantiation error",e); ! throw new RuntimeException("Feature extraction strategy could not be loaded: " ! + value + ", " + e.toString()); ! } ! } else { ! featureExtractionStrategy = new MultilingualFeatureExtractionStrategy(); ! } ! clusteringStrategy = new LsiClusteringStrategy(); } |
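The pluggable-strategy change in MultilingualClusteringContext above can be sketched in isolation: if a parameter such as preprocessing.class names a class, it is loaded through the context class loader and instantiated reflectively; otherwise a built-in default is used. The interface and class names below are illustrative stand-ins, not the actual Carrot2 types:

```java
// Sketch of the reflective strategy-loading pattern from this commit.
// The nested interface and default implementation are hypothetical.
public class StrategyLoaderDemo {
    interface PreprocessingStrategy {
        String preprocess(String text);
    }

    /** Built-in default used when no parameter names a strategy class. */
    public static class LowerCaseStrategy implements PreprocessingStrategy {
        public String preprocess(String text) { return text.toLowerCase(); }
    }

    static PreprocessingStrategy loadStrategy(Object value) {
        if (value instanceof java.util.List) {
            value = ((java.util.List) value).get(0); // parameters may be list-wrapped
        }
        if (value == null) {
            return new LowerCaseStrategy();          // fall back to the default
        }
        try {
            // Load through the context class loader, as the commit does.
            return (PreprocessingStrategy) Thread.currentThread()
                .getContextClassLoader().loadClass((String) value).newInstance();
        } catch (Exception e) {
            throw new RuntimeException("Preprocessing strategy could not be loaded: "
                + value + ", " + e);
        }
    }

    public static void main(String[] args) {
        PreprocessingStrategy s = loadStrategy("StrategyLoaderDemo$LowerCaseStrategy");
        System.out.println(s.preprocess("Carrot2")); // prints "carrot2"
    }
}
```

Failing fast with a RuntimeException (rather than silently ignoring a misspelled class name) matches the commit's behavior in both the preprocessing and feature-extraction branches.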
From: <daw...@us...> - 2004-02-10 15:31:06
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/Jama In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5837/src-test/Jama Added Files: SingularValueDecomposition.java Log Message: [new] Feature extraction now uses carrot tokenizer [bugfix] long queries now will terminate with a runtime exception. This is caused by some bug (?) in SVD decomposition. A new version of Jama (patched by us) has to be downloaded too. [refactoring] various small code refactorings. --- NEW FILE: SingularValueDecomposition.java --- package Jama; import Jama.util.*; /** Singular Value Decomposition. <P> For an m-by-n matrix A with m >= n, the singular value decomposition is an m-by-n orthogonal matrix U, an n-by-n diagonal matrix S, and an n-by-n orthogonal matrix V so that A = U*S*V'. <P> The singular values, sigma[k] = S[k][k], are ordered so that sigma[0] >= sigma[1] >= ... >= sigma[n-1]. <P> The singular value decompostion always exists, so the constructor will never fail. The matrix condition number and the effective numerical rank can be computed from this decomposition. */ public class SingularValueDecomposition implements java.io.Serializable { /* ------------------------ Class variables * ------------------------ */ /** Arrays for internal storage of U and V. @serial internal storage of U. @serial internal storage of V. */ private double[][] U, V; /** Array for internal storage of singular values. @serial internal storage of singular values. */ private double[] s; /** Row and column dimensions. @serial row dimension. @serial column dimension. */ private int m, n; private final static int MAX_ITER = 600; /* ------------------------ Constructor * ------------------------ */ /** Construct the singular value decomposition @param A Rectangular matrix @return Structure to access U, S and V. */ public SingularValueDecomposition (Matrix Arg) { // Derived from LINPACK code. // Initialize. 
double[][] A = Arg.getArrayCopy(); m = Arg.getRowDimension(); n = Arg.getColumnDimension(); int nu = Math.min(m,n); s = new double [Math.min(m+1,n)]; U = new double [m][nu]; V = new double [n][n]; double[] e = new double [n]; double[] work = new double [m]; boolean wantu = true; boolean wantv = true; // Reduce A to bidiagonal form, storing the diagonal elements // in s and the super-diagonal elements in e. int nct = Math.min(m-1,n); int nrt = Math.max(0,Math.min(n-2,m)); for (int k = 0; k < Math.max(nct,nrt); k++) { if (k < nct) { // Compute the transformation for the k-th column and // place the k-th diagonal in s[k]. // Compute 2-norm of k-th column without under/overflow. s[k] = 0; for (int i = k; i < m; i++) { s[k] = Maths.hypot(s[k],A[i][k]); } if (s[k] != 0.0) { if (A[k][k] < 0.0) { s[k] = -s[k]; } for (int i = k; i < m; i++) { A[i][k] /= s[k]; } A[k][k] += 1.0; } s[k] = -s[k]; } for (int j = k+1; j < n; j++) { if ((k < nct) & (s[k] != 0.0)) { // Apply the transformation. double t = 0; for (int i = k; i < m; i++) { t += A[i][k]*A[i][j]; } t = -t/A[k][k]; for (int i = k; i < m; i++) { A[i][j] += t*A[i][k]; } } // Place the k-th row of A into e for the // subsequent calculation of the row transformation. e[j] = A[k][j]; } if (wantu & (k < nct)) { // Place the transformation in U for subsequent back // multiplication. for (int i = k; i < m; i++) { U[i][k] = A[i][k]; } } if (k < nrt) { // Compute the k-th row transformation and place the // k-th super-diagonal in e[k]. // Compute 2-norm without under/overflow. e[k] = 0; for (int i = k+1; i < n; i++) { e[k] = Maths.hypot(e[k],e[i]); } if (e[k] != 0.0) { if (e[k+1] < 0.0) { e[k] = -e[k]; } for (int i = k+1; i < n; i++) { e[i] /= e[k]; } e[k+1] += 1.0; } e[k] = -e[k]; if ((k+1 < m) & (e[k] != 0.0)) { // Apply the transformation. 
for (int i = k+1; i < m; i++) {
   work[i] = 0.0;
}
for (int j = k+1; j < n; j++) {
   for (int i = k+1; i < m; i++) {
      work[i] += e[j]*A[i][j];
   }
}
for (int j = k+1; j < n; j++) {
   double t = -e[j]/e[k+1];
   for (int i = k+1; i < m; i++) {
      A[i][j] += t*work[i];
   }
}
}
if (wantv) {

   // Place the transformation in V for subsequent
   // back multiplication.

   for (int i = k+1; i < n; i++) {
      V[i][k] = e[i];
   }
}
}
}

// Set up the final bidiagonal matrix of order p.

int p = Math.min(n,m+1);
if (nct < n) {
   s[nct] = A[nct][nct];
}
if (m < p) {
   s[p-1] = 0.0;
}
if (nrt+1 < p) {
   e[nrt] = A[nrt][p-1];
}
e[p-1] = 0.0;

// If required, generate U.

if (wantu) {
   for (int j = nct; j < nu; j++) {
      for (int i = 0; i < m; i++) {
         U[i][j] = 0.0;
      }
      U[j][j] = 1.0;
   }
   for (int k = nct-1; k >= 0; k--) {
      if (s[k] != 0.0) {
         for (int j = k+1; j < nu; j++) {
            double t = 0;
            for (int i = k; i < m; i++) {
               t += U[i][k]*U[i][j];
            }
            t = -t/U[k][k];
            for (int i = k; i < m; i++) {
               U[i][j] += t*U[i][k];
            }
         }
         for (int i = k; i < m; i++) {
            U[i][k] = -U[i][k];
         }
         U[k][k] = 1.0 + U[k][k];
         for (int i = 0; i < k-1; i++) {
            U[i][k] = 0.0;
         }
      } else {
         for (int i = 0; i < m; i++) {
            U[i][k] = 0.0;
         }
         U[k][k] = 1.0;
      }
   }
}

// If required, generate V.

if (wantv) {
   for (int k = n-1; k >= 0; k--) {
      if ((k < nrt) & (e[k] != 0.0)) {
         for (int j = k+1; j < nu; j++) {
            double t = 0;
            for (int i = k+1; i < n; i++) {
               t += V[i][k]*V[i][j];
            }
            t = -t/V[k+1][k];
            for (int i = k+1; i < n; i++) {
               V[i][j] += t*V[i][k];
            }
         }
      }
      for (int i = 0; i < n; i++) {
         V[i][k] = 0.0;
      }
      V[k][k] = 1.0;
   }
}

// Main iteration loop for the singular values.

int pp = p-1;
int iter = 0;
double eps = Math.pow(2.0,-52.0);
while (p > 0) {
   int k,kase;

   // Here is where a test for too many iterations would go.
   if (iter > MAX_ITER) {
      throw new RuntimeException("Infinite loop in SVD.");
   }

   // This section of the program inspects for
   // negligible elements in the s and e arrays.  On
   // completion the variables kase and k are set as follows.

   // kase = 1     if s(p) and e[k-1] are negligible and k<p
   // kase = 2     if s(k) is negligible and k<p
   // kase = 3     if e[k-1] is negligible, k<p, and
   //              s(k), ..., s(p) are not negligible (qr step).
   // kase = 4     if e(p-1) is negligible (convergence).

   for (k = p-2; k >= -1; k--) {
      if (k == -1) {
         break;
      }
      if (Math.abs(e[k]) <= eps*(Math.abs(s[k]) + Math.abs(s[k+1]))) {
         e[k] = 0.0;
         break;
      }
   }
   if (k == p-2) {
      kase = 4;
   } else {
      int ks;
      for (ks = p-1; ks >= k; ks--) {
         if (ks == k) {
            break;
         }
         double t = (ks != p ? Math.abs(e[ks]) : 0.) +
                    (ks != k+1 ? Math.abs(e[ks-1]) : 0.);
         if (Math.abs(s[ks]) <= eps*t) {
            s[ks] = 0.0;
            break;
         }
      }
      if (ks == k) {
         kase = 3;
      } else if (ks == p-1) {
         kase = 1;
      } else {
         kase = 2;
         k = ks;
      }
   }
   k++;

   // Perform the task indicated by kase.

   switch (kase) {

      // Deflate negligible s(p).

      case 1: {
         double f = e[p-2];
         e[p-2] = 0.0;
         for (int j = p-2; j >= k; j--) {
            double t = Maths.hypot(s[j],f);
            double cs = s[j]/t;
            double sn = f/t;
            s[j] = t;
            if (j != k) {
               f = -sn*e[j-1];
               e[j-1] = cs*e[j-1];
            }
            if (wantv) {
               for (int i = 0; i < n; i++) {
                  t = cs*V[i][j] + sn*V[i][p-1];
                  V[i][p-1] = -sn*V[i][j] + cs*V[i][p-1];
                  V[i][j] = t;
               }
            }
         }
      }
      break;

      // Split at negligible s(k).

      case 2: {
         double f = e[k-1];
         e[k-1] = 0.0;
         for (int j = k; j < p; j++) {
            double t = Maths.hypot(s[j],f);
            double cs = s[j]/t;
            double sn = f/t;
            s[j] = t;
            f = -sn*e[j];
            e[j] = cs*e[j];
            if (wantu) {
               for (int i = 0; i < m; i++) {
                  t = cs*U[i][j] + sn*U[i][k-1];
                  U[i][k-1] = -sn*U[i][j] + cs*U[i][k-1];
                  U[i][j] = t;
               }
            }
         }
      }
      break;

      // Perform one qr step.

      case 3: {

         // Calculate the shift.

         double scale = Math.max(Math.max(Math.max(Math.max(
                 Math.abs(s[p-1]),Math.abs(s[p-2])),Math.abs(e[p-2])),
                 Math.abs(s[k])),Math.abs(e[k]));
         double sp = s[p-1]/scale;
         double spm1 = s[p-2]/scale;
         double epm1 = e[p-2]/scale;
         double sk = s[k]/scale;
         double ek = e[k]/scale;
         double b = ((spm1 + sp)*(spm1 - sp) + epm1*epm1)/2.0;
         double c = (sp*epm1)*(sp*epm1);
         double shift = 0.0;
         if ((b != 0.0) | (c != 0.0)) {
            shift = Math.sqrt(b*b + c);
            if (b < 0.0) {
               shift = -shift;
            }
            shift = c/(b + shift);
         }
         double f = (sk + sp)*(sk - sp) + shift;
         double g = sk*ek;

         // Chase zeros.

         for (int j = k; j < p-1; j++) {
            double t = Maths.hypot(f,g);
            double cs = f/t;
            double sn = g/t;
            if (j != k) {
               e[j-1] = t;
            }
            f = cs*s[j] + sn*e[j];
            e[j] = cs*e[j] - sn*s[j];
            g = sn*s[j+1];
            s[j+1] = cs*s[j+1];
            if (wantv) {
               for (int i = 0; i < n; i++) {
                  t = cs*V[i][j] + sn*V[i][j+1];
                  V[i][j+1] = -sn*V[i][j] + cs*V[i][j+1];
                  V[i][j] = t;
               }
            }
            t = Maths.hypot(f,g);
            cs = f/t;
            sn = g/t;
            s[j] = t;
            f = cs*e[j] + sn*s[j+1];
            s[j+1] = -sn*e[j] + cs*s[j+1];
            g = sn*e[j+1];
            e[j+1] = cs*e[j+1];
            if (wantu && (j < m-1)) {
               for (int i = 0; i < m; i++) {
                  t = cs*U[i][j] + sn*U[i][j+1];
                  U[i][j+1] = -sn*U[i][j] + cs*U[i][j+1];
                  U[i][j] = t;
               }
            }
         }
         e[p-2] = f;
         iter = iter + 1;
      }
      break;

      // Convergence.

      case 4: {

         // Make the singular values positive.

         if (s[k] <= 0.0) {
            s[k] = (s[k] < 0.0 ? -s[k] : 0.0);
            if (wantv) {
               for (int i = 0; i <= pp; i++) {
                  V[i][k] = -V[i][k];
               }
            }
         }

         // Order the singular values.

         while (k < pp) {
            if (s[k] >= s[k+1]) {
               break;
            }
            double t = s[k];
            s[k] = s[k+1];
            s[k+1] = t;
            if (wantv && (k < n-1)) {
               for (int i = 0; i < n; i++) {
                  t = V[i][k+1]; V[i][k+1] = V[i][k]; V[i][k] = t;
               }
            }
            if (wantu && (k < m-1)) {
               for (int i = 0; i < m; i++) {
                  t = U[i][k+1]; U[i][k+1] = U[i][k]; U[i][k] = t;
               }
            }
            k++;
         }
         iter = 0;
         p--;
      }
      break;
   }
}
}

/* ------------------------
   Public Methods
 * ------------------------ */

/** Return the left singular vectors
@return     U
*/

public Matrix getU () {
   return new Matrix(U,m,Math.min(m+1,n));
}

/** Return the right singular vectors
@return     V
*/

public Matrix getV () {
   return new Matrix(V,n,n);
}

/** Return the one-dimensional array of singular values
@return     diagonal of S.
*/

public double[] getSingularValues () {
   return s;
}

/** Return the diagonal matrix of singular values
@return     S
*/

public Matrix getS () {
   Matrix X = new Matrix(n,n);
   double[][] S = X.getArray();
   for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
         S[i][j] = 0.0;
      }
      S[i][i] = this.s[i];
   }
   return X;
}

/** Two norm
@return     max(S)
*/

public double norm2 () {
   return s[0];
}

/** Two norm condition number
@return     max(S)/min(S)
*/

public double cond () {
   return s[0]/s[Math.min(m,n)-1];
}

/** Effective numerical matrix rank
@return     Number of nonnegligible singular values.
*/

public int rank () {
   double eps = Math.pow(2.0,-52.0);
   double tol = Math.max(m,n)*s[0]*eps;
   int r = 0;
   for (int i = 0; i < s.length; i++) {
      if (s[i] > tol) {
         r++;
      }
   }
   return r;
}
}
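The `rank()` method above depends only on the array of singular values and the matrix dimensions, so its tolerance rule (`tol = max(m,n) * s[0] * eps`, counting values strictly above `tol`) can be exercised in isolation. The sketch below is a hypothetical standalone class, not part of Jama; only the logic is copied from the method above.

```java
public class RankDemo {
    // Mirrors Jama's rank(): machine epsilon for double,
    // tolerance scaled by the largest singular value s[0]
    // and the larger matrix dimension.
    public static int rank(double[] s, int m, int n) {
        double eps = Math.pow(2.0, -52.0);
        double tol = Math.max(m, n) * s[0] * eps;
        int r = 0;
        for (double v : s) {
            if (v > tol) {
                r++;
            }
        }
        return r;
    }

    public static void main(String[] args) {
        // Singular values sorted descending, as Jama returns them;
        // for a 3x3 matrix tol is about 3.3e-15, so the last value
        // is treated as numerically zero.
        double[] s = {5.0, 2.0, 1e-18};
        System.out.println(rank(s, 3, 3)); // prints 2
    }
}
```

Note that `rank()` returns the *effective numerical* rank: a singular value that is nonzero but drowned in rounding noise does not count, which is usually what an LSI-style application wants.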
From: <daw...@us...> - 2004-02-10 15:31:06
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src/com/stachoodev/carrot/filter/cluster
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5837/src/com/stachoodev/carrot/filter/cluster

Modified Files:
	MultilingualLsiClustererRequestProcessor.java
Log Message:
[new] Feature extraction now uses the carrot tokenizer.
[bugfix] Long queries will now terminate with a runtime exception. This is caused by some bug (?) in SVD decomposition. A new version of Jama (patched by us) has to be downloaded too.
[refactoring] Various small code refactorings.

Index: MultilingualLsiClustererRequestProcessor.java
===================================================================
RCS file: /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src/com/stachoodev/carrot/filter/cluster/MultilingualLsiClustererRequestProcessor.java,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** MultilingualLsiClustererRequestProcessor.java	30 Sep 2003 11:35:46 -0000	1.2
--- MultilingualLsiClustererRequestProcessor.java	10 Feb 2004 15:27:41 -0000	1.3
***************
*** 69,73 ****
  // Prepare data
  MultilingualClusteringContext clusteringContext = new MultilingualClusteringContext(
! new File(getServletConfig().getServletContext().getRealPath(""))
  );
--- 69,73 ----
  // Prepare data
  MultilingualClusteringContext clusteringContext = new MultilingualClusteringContext(
! new File(getServletConfig().getServletContext().getRealPath("")), new HashMap()
  );
From: <daw...@us...> - 2004-02-10 15:31:05
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/jama-test
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5837/src-test/jama-test

Added Files:
	badmatrix
Log Message:
[new] Feature extraction now uses the carrot tokenizer.
[bugfix] Long queries will now terminate with a runtime exception. This is caused by some bug (?) in SVD decomposition. A new version of Jama (patched by us) has to be downloaded too.
[refactoring] Various small code refactorings.

--- NEW FILE: badmatrix ---
(This appears to be a binary file; contents omitted.)
From: <daw...@us...> - 2004-02-10 15:29:52
Update of /cvsroot/carrot2/carrot2/lib
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv7436

Modified Files:
	jama.dep.xml
Added Files:
	Jama-1.0.1-patched.jar
Removed Files:
	Jama-1.0.1.jar
Log Message:
Patched version of Jama (throws a runtime exception after a couple of dozen iterations to prevent deadlocks).

--- NEW FILE: Jama-1.0.1-patched.jar ---
(This appears to be a binary file; contents omitted.)

Index: jama.dep.xml
===================================================================
RCS file: /cvsroot/carrot2/carrot2/lib/jama.dep.xml,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** jama.dep.xml	6 Feb 2004 18:16:29 -0000	1.1
--- jama.dep.xml	10 Feb 2004 15:26:28 -0000	1.2
***************
*** 1,5 ****
  <component name="jama">
  <!-- Files that this component is composed of -->
! <file location="Jama-1.0.1.jar" />
  </component>
--- 1,9 ----
+ <!--
+ The SVD decomposition in Jama has been patched to throw a runtimeexception if
+ the decomposition goes beyond a certain number of iterations.
+ -->
  <component name="jama">
  <!-- Files that this component is composed of -->
! <file location="Jama-1.0.1-patched.jar" />
  </component>

--- Jama-1.0.1.jar DELETED ---
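The log message above describes the essence of the patch: bound the number of QR sweeps and fail fast with a `RuntimeException` instead of spinning forever. The sketch below illustrates that guard pattern on a generic fixed-point loop; it is a standalone illustration, not the patched Jama code, and the `MAX_ITER` value here is an assumption (the constant actually compiled into `Jama-1.0.1-patched.jar` is not shown in this commit).

```java
import java.util.function.DoubleUnaryOperator;

public class ConvergenceGuard {
    // Assumed cap for illustration; the real patched value is not
    // visible in this diff ("a couple of dozen iterations").
    static final int MAX_ITER = 75;

    // Generic guarded iteration. The SVD patch applies the same idea
    // inside SingularValueDecomposition's main loop: count iterations
    // and throw rather than deadlock when convergence never happens.
    public static double iterate(DoubleUnaryOperator step, double x, double tol) {
        int iter = 0;
        while (Math.abs(step.applyAsDouble(x) - x) > tol) {
            if (iter > MAX_ITER) {
                throw new RuntimeException("Infinite loop in SVD.");
            }
            x = step.applyAsDouble(x);
            iter++;
        }
        return x;
    }

    public static void main(String[] args) {
        // Converges well inside the cap (roughly 50 iterations).
        System.out.println(iterate(Math::cos, 1.0, 1e-9)); // approx. 0.739085
        // Never converges: the guard fires instead of hanging.
        try {
            iterate(x -> x + 1.0, 0.0, 0.5);
        } catch (RuntimeException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Callers of the patched decomposition therefore need to be prepared to catch a `RuntimeException` on pathological inputs (such as the `badmatrix` test data committed alongside this change) rather than assume the call always returns.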
From: <daw...@us...> - 2004-02-10 15:28:45
Update of /cvsroot/carrot2/carrot2
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv7079

Modified Files:
	.cvsignore history.xml
Log Message:
no message

Index: .cvsignore
===================================================================
RCS file: /cvsroot/carrot2/carrot2/.cvsignore,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -C2 -d -r1.1.1.1 -r1.2
*** .cvsignore	19 Sep 2003 10:14:51 -0000	1.1.1.1
--- .cvsignore	10 Feb 2004 15:25:20 -0000	1.2
***************
*** 1,2 ****
--- 1,3 ----
  tmp
  local-build.properties
+ lib-src

Index: history.xml
===================================================================
RCS file: /cvsroot/carrot2/carrot2/history.xml,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** history.xml	9 Feb 2004 20:48:22 -0000	1.14
--- history.xml	10 Feb 2004 15:25:21 -0000	1.15
***************
*** 9,12 ****
--- 9,28 ----
  <history>
  <changelist>
+ <date>2004-02-10</date>
+ <committer>dawid</committer>
+
+ <change component="carrot2.filter.lingo-clusterer" type="new">
+ Carrot shared tokenizer used for feature extraction.
+ </change>
+
+ <change component="carrot2.filter.lingo-clusterer" type="bugfix">
+ If Jama falls into an infinite SVD decomposition, it will
+ throw a runtime exception. This situation has happened before
+ (more -- seems to be quite common). Download a patched Jama too.
+ I have contacted Jama's authors at NIST, we will see what they say.
+ </change>
+ </changelist>
+
+ <changelist>
  <date>2004-02-09</date>
  <committer>dawid</committer>
***************
*** 21,25 ****
  It is now fixed.
  </change>
-
  </changelist>
--- 37,40 ----
From: <daw...@us...> - 2004-02-10 15:23:29
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/jama-test
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5228/src-test/jama-test

Log Message:
Directory /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/jama-test added to the repository
From: <daw...@us...> - 2004-02-10 15:23:28
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/com/stachoodev/carrot/filter/cluster
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5228/src-test/com/stachoodev/carrot/filter/cluster

Log Message:
Directory /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/com/stachoodev/carrot/filter/cluster added to the repository
From: <daw...@us...> - 2004-02-10 15:23:28
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/data
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5228/src-test/data

Log Message:
Directory /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/data added to the repository
From: <daw...@us...> - 2004-02-10 15:23:28
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/com/stachoodev/carrot/filter
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5228/src-test/com/stachoodev/carrot/filter

Log Message:
Directory /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/com/stachoodev/carrot/filter added to the repository
From: <daw...@us...> - 2004-02-10 15:23:28
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/com/stachoodev/carrot/filter/cluster/lsicluster
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5228/src-test/com/stachoodev/carrot/filter/cluster/lsicluster

Log Message:
Directory /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/com/stachoodev/carrot/filter/cluster/lsicluster added to the repository
From: <daw...@us...> - 2004-02-10 15:23:27
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/com/stachoodev/carrot
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5228/src-test/com/stachoodev/carrot

Log Message:
Directory /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/com/stachoodev/carrot added to the repository
From: <daw...@us...> - 2004-02-10 15:23:27
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/com/stachoodev
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5228/src-test/com/stachoodev

Log Message:
Directory /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/com/stachoodev added to the repository