carrot2-cvscommits Mailing List for Carrot2 (Page 398)
Brought to you by: dawidweiss, stachoo
This list is closed, nobody may subscribe to it.
From: <daw...@us...> - 2004-02-10 15:23:27
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/Jama
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5228/src-test/Jama

Log Message:
Directory /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/Jama added to the repository

From: <daw...@us...> - 2004-02-10 15:23:26
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5228/src-test

Log Message:
Directory /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test added to the repository

From: <daw...@us...> - 2004-02-10 15:23:26
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/com
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5228/src-test/com

Log Message:
Directory /cvsroot/carrot2/carrot2/components/filters/clustering/lingo-clustering/src-test/com added to the repository

From: <daw...@us...> - 2004-02-09 20:52:23
Update of /cvsroot/carrot2/carrot2/components/carrot2-shared-lib/src/com/dawidweiss/carrot/tokenizer
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv8916/components/carrot2-shared-lib/src/com/dawidweiss/carrot/tokenizer

Modified Files:
    ParseException.java SimpleCharStream.java Token.java
    TokenMgrError.java Tokenizer.java TokenizerImpl.java
    TokenizerImpl.jj TokenizerImplConstants.java
    TokenizerImplTokenManager.java
Log Message:
[change], component: carrot2-shared-lib
The tokenizer now recognizes numeric types and a wider range of URLs (and www addresses).

Index: ParseException.java
===================================================================
RCS file: /cvsroot/carrot2/carrot2/components/carrot2-shared-lib/src/com/dawidweiss/carrot/tokenizer/ParseException.java,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -C2 -d -r1.1.1.1 -r1.2
*** ParseException.java 19 Sep 2003 10:14:54 -0000 1.1.1.1
--- ParseException.java 9 Feb 2004 20:48:22 -0000 1.2
***************
*** 1,248 ****
-
-
- /*
- * Carrot2 Project
- * Copyright (C) 2002-2003, Dawid Weiss
- * Portions (C) Contributors listen in carrot2.CONTRIBUTORS file.
- * All rights reserved.
- *
- * Refer to full text of the licence "carrot2.LICENCE" in the root folder
- * of CVS checkout or at:
- * http://www.cs.put.poznan.pl/dweiss/carrot2.LICENCE
- */
-
-
  /* Generated By:JavaCC: Do not edit this line. ParseException.java Version 2.1 */
  package com.dawidweiss.carrot.tokenizer;
-
  /**
! * This exception is thrown when parse errors are encountered. You can explicitly create objects of
! * this exception type by calling the method generateParseException in the generated parser. You
! * can modify this class to customize your error reporting mechanisms so long as you retain the
! * public fields.
  */
! public class ParseException
! extends Exception
! {
! /**
! * This constructor is used by the method "generateParseException" in the generated parser.
!
* Calling this constructor generates a new object of this type with the fields ! * "currentToken", "expectedTokenSequences", and "tokenImage" set. The boolean flag ! * "specialConstructor" is also set to true to indicate that this constructor was used to ! * create this object. This constructor calls its super class with the empty string to force ! * the "toString" method of parent class "Throwable" to print the error message in the form: ! * ParseException: <result of getMessage> ! */ ! public ParseException( ! Token currentTokenVal, int [][] expectedTokenSequencesVal, String [] tokenImageVal ! ) ! { ! super(""); ! specialConstructor = true; ! currentToken = currentTokenVal; ! expectedTokenSequences = expectedTokenSequencesVal; ! tokenImage = tokenImageVal; ! } ! ! ! /** ! * The following constructors are for use by you for whatever purpose you can think of. ! * Constructing the exception in this manner makes the exception behave in the normal way - ! * i.e., as documented in the class "Throwable". The fields "errorToken", ! * "expectedTokenSequences", and "tokenImage" do not contain relevant information. The JavaCC ! * generated code does not use these constructors. ! */ ! public ParseException() ! { ! super(); ! specialConstructor = false; ! } ! ! ! public ParseException(String message) ! { ! super(message); ! specialConstructor = false; ! } ! ! /** ! * This variable determines which constructor was used to create this object and thereby ! * affects the semantics of the "getMessage" method (see below). ! */ ! protected boolean specialConstructor; ! ! /** ! * This is the last token that has been consumed successfully. If this object has been created ! * due to a parse error, the token followng this token will (therefore) be the first error ! * token. ! */ ! public Token currentToken; ! ! /** ! * Each entry in this array is an array of integers. Each array of integers represents a ! 
* sequence of tokens (by their ordinal values) that is expected at this point of the parse. ! */ ! public int [][] expectedTokenSequences; ! ! /** ! * This is a reference to the "tokenImage" array of the generated parser within which the parse ! * error occurred. This array is defined in the generated ...Constants interface. ! */ ! public String [] tokenImage; ! ! /** ! * This method has the standard behavior when this object has been created using the standard ! * constructors. Otherwise, it uses "currentToken" and "expectedTokenSequences" to generate a ! * parse error message and returns it. If this object has been created due to a parse error, ! * and you do not catch it (it gets thrown from the parser), then this method is called during ! * the printing of the final stack trace, and hence the correct error message gets displayed. ! */ ! public String getMessage() ! { ! if (!specialConstructor) ! { ! return super.getMessage(); ! } ! ! String expected = ""; ! int maxSize = 0; ! ! for (int i = 0; i < expectedTokenSequences.length; i++) ! { ! if (maxSize < expectedTokenSequences[i].length) ! { ! maxSize = expectedTokenSequences[i].length; ! } ! ! for (int j = 0; j < expectedTokenSequences[i].length; j++) ! { ! expected += (tokenImage[expectedTokenSequences[i][j]] + " "); ! } ! ! if (expectedTokenSequences[i][expectedTokenSequences[i].length - 1] != 0) ! { ! expected += "..."; ! } ! ! expected += (eol + " "); ! } ! String retval = "Encountered \""; ! Token tok = currentToken.next; ! for (int i = 0; i < maxSize; i++) ! { ! if (i != 0) ! { ! retval += " "; ! } ! if (tok.kind == 0) ! { ! retval += tokenImage[0]; ! break; ! } ! retval += add_escapes(tok.image); ! tok = tok.next; ! } ! retval += ("\" at line " + currentToken.next.beginLine + ", column " ! + currentToken.next.beginColumn); ! retval += ("." + eol); ! if (expectedTokenSequences.length == 1) ! { ! retval += ("Was expecting:" + eol + " "); ! } ! else ! { ! retval += ("Was expecting one of:" + eol + " "); ! 
} ! retval += expected; ! return retval; } ! /** The end of line string for this machine. */ ! protected String eol = System.getProperty("line.separator", "\n"); ! ! /** ! * Used to convert raw characters to their escaped version when these raw version cannot be ! * used as part of an ASCII string literal. ! */ ! protected String add_escapes(String str) ! { ! StringBuffer retval = new StringBuffer(); ! char ch; ! ! for (int i = 0; i < str.length(); i++) { ! switch (str.charAt(i)) ! { ! case 0: ! ! continue; ! ! case '\b': ! retval.append("\\b"); ! ! continue; ! ! case '\t': ! retval.append("\\t"); ! ! continue; ! ! case '\n': ! retval.append("\\n"); ! ! continue; ! ! case '\f': ! retval.append("\\f"); ! ! continue; ! ! case '\r': ! retval.append("\\r"); ! ! continue; ! ! case '\"': ! retval.append("\\\""); ! ! continue; ! ! case '\'': ! retval.append("\\\'"); ! ! continue; ! ! case '\\': ! retval.append("\\\\"); ! ! continue; ! ! default: ! ! if (((ch = str.charAt(i)) < 0x20) || (ch > 0x7e)) ! { ! String s = "0000" + Integer.toString(ch, 16); ! retval.append("\\u" + s.substring(s.length() - 4, s.length())); ! } ! else ! { ! retval.append(ch); ! } ! ! continue; ! } } - return retval.toString(); - } } --- 1,192 ---- /* Generated By:JavaCC: Do not edit this line. ParseException.java Version 2.1 */ package com.dawidweiss.carrot.tokenizer; /** ! * This exception is thrown when parse errors are encountered. ! * You can explicitly create objects of this exception type by ! * calling the method generateParseException in the generated ! * parser. ! * ! * You can modify this class to customize your error reporting ! * mechanisms so long as you retain the public fields. */ ! public class ParseException extends Exception { ! /** ! * This constructor is used by the method "generateParseException" ! * in the generated parser. Calling this constructor generates ! * a new object of this type with the fields "currentToken", ! * "expectedTokenSequences", and "tokenImage" set. 
The boolean ! * flag "specialConstructor" is also set to true to indicate that ! * this constructor was used to create this object. ! * This constructor calls its super class with the empty string ! * to force the "toString" method of parent class "Throwable" to ! * print the error message in the form: ! * ParseException: <result of getMessage> ! */ ! public ParseException(Token currentTokenVal, ! int[][] expectedTokenSequencesVal, ! String[] tokenImageVal ! ) ! { ! super(""); ! specialConstructor = true; ! currentToken = currentTokenVal; ! expectedTokenSequences = expectedTokenSequencesVal; ! tokenImage = tokenImageVal; ! } ! /** ! * The following constructors are for use by you for whatever ! * purpose you can think of. Constructing the exception in this ! * manner makes the exception behave in the normal way - i.e., as ! * documented in the class "Throwable". The fields "errorToken", ! * "expectedTokenSequences", and "tokenImage" do not contain ! * relevant information. The JavaCC generated code does not use ! * these constructors. ! */ ! public ParseException() { ! super(); ! specialConstructor = false; ! } ! public ParseException(String message) { ! super(message); ! specialConstructor = false; ! } ! /** ! * This variable determines which constructor was used to create ! * this object and thereby affects the semantics of the ! * "getMessage" method (see below). ! */ ! protected boolean specialConstructor; ! /** ! * This is the last token that has been consumed successfully. If ! * this object has been created due to a parse error, the token ! * followng this token will (therefore) be the first error token. ! */ ! public Token currentToken; ! /** ! * Each entry in this array is an array of integers. Each array ! * of integers represents a sequence of tokens (by their ordinal ! * values) that is expected at this point of the parse. ! */ ! public int[][] expectedTokenSequences; ! /** ! * This is a reference to the "tokenImage" array of the generated ! 
* parser within which the parse error occurred. This array is ! * defined in the generated ...Constants interface. ! */ ! public String[] tokenImage; ! /** ! * This method has the standard behavior when this object has been ! * created using the standard constructors. Otherwise, it uses ! * "currentToken" and "expectedTokenSequences" to generate a parse ! * error message and returns it. If this object has been created ! * due to a parse error, and you do not catch it (it gets thrown ! * from the parser), then this method is called during the printing ! * of the final stack trace, and hence the correct error message ! * gets displayed. ! */ ! public String getMessage() { ! if (!specialConstructor) { ! return super.getMessage(); } + String expected = ""; + int maxSize = 0; + for (int i = 0; i < expectedTokenSequences.length; i++) { + if (maxSize < expectedTokenSequences[i].length) { + maxSize = expectedTokenSequences[i].length; + } + for (int j = 0; j < expectedTokenSequences[i].length; j++) { + expected += tokenImage[expectedTokenSequences[i][j]] + " "; + } + if (expectedTokenSequences[i][expectedTokenSequences[i].length - 1] != 0) { + expected += "..."; + } + expected += eol + " "; + } + String retval = "Encountered \""; + Token tok = currentToken.next; + for (int i = 0; i < maxSize; i++) { + if (i != 0) retval += " "; + if (tok.kind == 0) { + retval += tokenImage[0]; + break; + } + retval += add_escapes(tok.image); + tok = tok.next; + } + retval += "\" at line " + currentToken.next.beginLine + ", column " + currentToken.next.beginColumn; + retval += "." + eol; + if (expectedTokenSequences.length == 1) { + retval += "Was expecting:" + eol + " "; + } else { + retval += "Was expecting one of:" + eol + " "; + } + retval += expected; + return retval; + } ! /** ! * The end of line string for this machine. ! */ ! protected String eol = System.getProperty("line.separator", "\n"); ! ! /** ! * Used to convert raw characters to their escaped version ! 
* when these raw version cannot be used as part of an ASCII ! * string literal. ! */ ! protected String add_escapes(String str) { ! StringBuffer retval = new StringBuffer(); ! char ch; ! for (int i = 0; i < str.length(); i++) { ! switch (str.charAt(i)) { ! case 0 : ! continue; ! case '\b': ! retval.append("\\b"); ! continue; ! case '\t': ! retval.append("\\t"); ! continue; ! case '\n': ! retval.append("\\n"); ! continue; ! case '\f': ! retval.append("\\f"); ! continue; ! case '\r': ! retval.append("\\r"); ! continue; ! case '\"': ! retval.append("\\\""); ! continue; ! case '\'': ! retval.append("\\\'"); ! continue; ! case '\\': ! retval.append("\\\\"); ! continue; ! default: ! if ((ch = str.charAt(i)) < 0x20 || ch > 0x7e) { ! String s = "0000" + Integer.toString(ch, 16); ! retval.append("\\u" + s.substring(s.length() - 4, s.length())); ! } else { ! retval.append(ch); ! } ! continue; } + } + return retval.toString(); + } } Index: SimpleCharStream.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/carrot2-shared-lib/src/com/dawidweiss/carrot/tokenizer/SimpleCharStream.java,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** SimpleCharStream.java 19 Sep 2003 10:14:54 -0000 1.1.1.1 --- SimpleCharStream.java 9 Feb 2004 20:48:23 -0000 1.2 *************** *** 1,485 **** - - - /* - * Carrot2 Project - * Copyright (C) 2002-2003, Dawid Weiss - * Portions (C) Contributors listen in carrot2.CONTRIBUTORS file. - * All rights reserved. - * - * Refer to full text of the licence "carrot2.LICENCE" in the root folder - * of CVS checkout or at: - * http://www.cs.put.poznan.pl/dweiss/carrot2.LICENCE - */ - - /* Generated By:JavaCC: Do not edit this line. SimpleCharStream.java Version 2.1 */ package com.dawidweiss.carrot.tokenizer; - /** ! * An implementation of interface CharStream, where the stream is assumed to contain only ASCII ! * characters (without unicode processing). 
*/ public final class SimpleCharStream { ! public static final boolean staticFlag = false; ! int bufsize; ! int available; ! int tokenBegin; ! public int bufpos = -1; ! private int [] bufline; ! private int [] bufcolumn; ! private int column = 0; ! private int line = 1; ! private boolean prevCharIsCR = false; ! private boolean prevCharIsLF = false; ! private java.io.Reader inputStream; ! private char [] buffer; ! private int maxNextCharInd = 0; ! private int inBuf = 0; ! ! private final void ExpandBuff(boolean wrapAround) ! { ! char [] newbuffer = new char[bufsize + 2048]; ! int [] newbufline = new int[bufsize + 2048]; ! int [] newbufcolumn = new int[bufsize + 2048]; ! ! try ! { ! if (wrapAround) ! { ! System.arraycopy(buffer, tokenBegin, newbuffer, 0, bufsize - tokenBegin); ! System.arraycopy(buffer, 0, newbuffer, bufsize - tokenBegin, bufpos); ! buffer = newbuffer; ! System.arraycopy(bufline, tokenBegin, newbufline, 0, bufsize - tokenBegin); ! System.arraycopy(bufline, 0, newbufline, bufsize - tokenBegin, bufpos); ! bufline = newbufline; ! System.arraycopy(bufcolumn, tokenBegin, newbufcolumn, 0, bufsize - tokenBegin); ! System.arraycopy(bufcolumn, 0, newbufcolumn, bufsize - tokenBegin, bufpos); ! bufcolumn = newbufcolumn; ! maxNextCharInd = (bufpos += (bufsize - tokenBegin)); ! } ! else ! { ! System.arraycopy(buffer, tokenBegin, newbuffer, 0, bufsize - tokenBegin); ! buffer = newbuffer; ! System.arraycopy(bufline, tokenBegin, newbufline, 0, bufsize - tokenBegin); ! bufline = newbufline; ! System.arraycopy(bufcolumn, tokenBegin, newbufcolumn, 0, bufsize - tokenBegin); ! bufcolumn = newbufcolumn; ! maxNextCharInd = (bufpos -= tokenBegin); ! } ! } ! catch (Throwable t) { ! throw new Error(t.getMessage()); ! } ! bufsize += 2048; ! available = bufsize; ! tokenBegin = 0; ! } ! private final void FillBuff() ! throws java.io.IOException ! { ! if (maxNextCharInd == available) ! { ! if (available == bufsize) ! { ! if (tokenBegin > 2048) ! { ! bufpos = maxNextCharInd = 0; ! 
available = tokenBegin; ! } ! else if (tokenBegin < 0) ! { ! bufpos = maxNextCharInd = 0; ! } ! else ! { ! ExpandBuff(false); ! } ! } ! else if (available > tokenBegin) ! { ! available = bufsize; ! } ! else if ((tokenBegin - available) < 2048) ! { ! ExpandBuff(true); ! } ! else ! { ! available = tokenBegin; ! } } ! ! int i; ! ! try { ! if ((i = inputStream.read(buffer, maxNextCharInd, available - maxNextCharInd)) == -1) ! { ! inputStream.close(); ! throw new java.io.IOException(); ! } ! else ! { ! maxNextCharInd += i; ! } ! return; ! } ! catch (java.io.IOException e) ! { ! --bufpos; ! backup(0); ! if (tokenBegin == -1) ! { ! tokenBegin = bufpos; ! } ! throw e; } ! } ! ! ! public final char BeginToken() ! throws java.io.IOException ! { ! tokenBegin = -1; ! ! char c = readChar(); ! tokenBegin = bufpos; ! ! return c; ! } ! private final void UpdateLineColumn(char c) ! { ! column++; ! if (prevCharIsLF) ! { ! prevCharIsLF = false; ! line += (column = 1); ! } ! else if (prevCharIsCR) { ! prevCharIsCR = false; ! ! if (c == '\n') ! { ! prevCharIsLF = true; ! } ! else ! { ! line += (column = 1); ! } } ! switch (c) { ! case '\r': ! prevCharIsCR = true; ! ! break; ! ! case '\n': ! prevCharIsLF = true; ! ! break; ! ! case '\t': ! column--; ! column += (8 - (column & 07)); ! ! break; ! ! default: ! break; } ! bufline[bufpos] = line; ! bufcolumn[bufpos] = column; ! } ! ! ! public final char readChar() ! throws java.io.IOException ! { ! if (inBuf > 0) ! { ! --inBuf; ! if (++bufpos == bufsize) ! { ! bufpos = 0; ! } ! return buffer[bufpos]; ! } ! if (++bufpos >= maxNextCharInd) { ! FillBuff(); } ! char c = buffer[bufpos]; ! ! UpdateLineColumn(c); ! ! return (c); ! } ! ! ! /** ! * @see #getEndColumn ! * @deprecated ! */ ! public final int getColumn() ! { ! return bufcolumn[bufpos]; ! } ! ! ! /** ! * @see #getEndLine ! * @deprecated ! */ ! public final int getLine() ! { ! return bufline[bufpos]; ! } ! ! ! public final int getEndColumn() ! { ! return bufcolumn[bufpos]; ! } ! ! ! 
public final int getEndLine() ! { ! return bufline[bufpos]; ! } ! ! ! public final int getBeginColumn() ! { ! return bufcolumn[tokenBegin]; ! } ! ! ! public final int getBeginLine() ! { ! return bufline[tokenBegin]; ! } ! public final void backup(int amount) ! { ! inBuf += amount; ! if ((bufpos -= amount) < 0) ! { ! bufpos += bufsize; ! } ! } ! public SimpleCharStream(java.io.Reader dstream, int startline, int startcolumn, int buffersize) ! { ! inputStream = dstream; ! line = startline; ! column = startcolumn - 1; ! available = bufsize = buffersize; ! buffer = new char[buffersize]; ! bufline = new int[buffersize]; ! bufcolumn = new int[buffersize]; ! } ! public SimpleCharStream(java.io.Reader dstream, int startline, int startcolumn) ! { ! this(dstream, startline, startcolumn, 4096); ! } ! public SimpleCharStream(java.io.Reader dstream) ! { ! this(dstream, 1, 1, 4096); ! } ! public void ReInit(java.io.Reader dstream, int startline, int startcolumn, int buffersize) ! { ! inputStream = dstream; ! line = startline; ! column = startcolumn - 1; ! if ((buffer == null) || (buffersize != buffer.length)) ! { ! available = bufsize = buffersize; ! buffer = new char[buffersize]; ! bufline = new int[buffersize]; ! bufcolumn = new int[buffersize]; ! } ! prevCharIsLF = prevCharIsCR = false; ! tokenBegin = inBuf = maxNextCharInd = 0; ! bufpos = -1; ! } ! public void ReInit(java.io.Reader dstream, int startline, int startcolumn) ! { ! ReInit(dstream, startline, startcolumn, 4096); ! } ! public void ReInit(java.io.Reader dstream) ! { ! ReInit(dstream, 1, 1, 4096); ! } ! public SimpleCharStream( ! java.io.InputStream dstream, int startline, int startcolumn, int buffersize ! ) ! { ! this(new java.io.InputStreamReader(dstream), startline, startcolumn, 4096); ! } ! public SimpleCharStream(java.io.InputStream dstream, int startline, int startcolumn) ! { ! this(dstream, startline, startcolumn, 4096); ! } ! public SimpleCharStream(java.io.InputStream dstream) ! { ! 
this(dstream, 1, 1, 4096); ! } ! public void ReInit(java.io.InputStream dstream, int startline, int startcolumn, int buffersize) { ! ReInit(new java.io.InputStreamReader(dstream), startline, startcolumn, 4096); } ! public void ReInit(java.io.InputStream dstream) ! { ! ReInit(dstream, 1, 1, 4096); ! } ! ! public void ReInit(java.io.InputStream dstream, int startline, int startcolumn) ! { ! ReInit(dstream, startline, startcolumn, 4096); ! } ! public final String GetImage() ! { ! if (bufpos >= tokenBegin) ! { ! return new String(buffer, tokenBegin, bufpos - tokenBegin + 1); ! } ! else ! { ! return new String(buffer, tokenBegin, bufsize - tokenBegin) ! + new String(buffer, 0, bufpos + 1); ! } ! } ! public final char [] GetSuffix(int len) ! { ! char [] ret = new char[len]; ! if ((bufpos + 1) >= len) ! { ! System.arraycopy(buffer, bufpos - len + 1, ret, 0, len); ! } ! else ! { ! System.arraycopy(buffer, bufsize - (len - bufpos - 1), ret, 0, len - bufpos - 1); ! System.arraycopy(buffer, 0, ret, len - bufpos - 1, bufpos + 1); ! } ! return ret; ! } ! public void Done() ! { ! buffer = null; ! bufline = null; ! bufcolumn = null; ! } ! /** ! * Method to adjust line and column numbers for the start of a token.<BR> ! */ ! public void adjustBeginLineColumn(int newLine, int newCol) ! { ! int start = tokenBegin; ! int len; ! if (bufpos >= tokenBegin) ! { ! len = bufpos - tokenBegin + inBuf + 1; ! } ! else ! { ! len = bufsize - tokenBegin + bufpos + 1 + inBuf; ! } ! int i = 0; ! int j = 0; ! int k = 0; ! int nextColDiff = 0; ! int columnDiff = 0; ! while ((i < len) && (bufline[j = start % bufsize] == bufline[k = ++start % bufsize])) { ! bufline[j] = newLine; ! nextColDiff = (columnDiff + bufcolumn[k]) - bufcolumn[j]; ! bufcolumn[j] = newCol + columnDiff; ! columnDiff = nextColDiff; ! i++; } ! if (i < len) ! { ! bufline[j] = newLine++; ! bufcolumn[j] = newCol + columnDiff; ! ! while (i++ < len) ! { ! if (bufline[j = start % bufsize] != bufline[++start % bufsize]) ! { ! 
bufline[j] = newLine++; ! } ! else ! { ! bufline[j] = newLine; ! } ! } ! } - line = bufline[j]; - column = bufcolumn[j]; - } } --- 1,401 ---- /* Generated By:JavaCC: Do not edit this line. SimpleCharStream.java Version 2.1 */ package com.dawidweiss.carrot.tokenizer; /** ! * An implementation of interface CharStream, where the stream is assumed to ! * contain only ASCII characters (without unicode processing). */ + public final class SimpleCharStream { ! public static final boolean staticFlag = false; ! int bufsize; ! int available; ! int tokenBegin; ! public int bufpos = -1; ! private int bufline[]; ! private int bufcolumn[]; ! private int column = 0; ! private int line = 1; ! private boolean prevCharIsCR = false; ! private boolean prevCharIsLF = false; ! private java.io.Reader inputStream; ! private char[] buffer; ! private int maxNextCharInd = 0; ! private int inBuf = 0; ! private final void ExpandBuff(boolean wrapAround) ! { ! char[] newbuffer = new char[bufsize + 2048]; ! int newbufline[] = new int[bufsize + 2048]; ! int newbufcolumn[] = new int[bufsize + 2048]; ! try ! { ! if (wrapAround) { ! System.arraycopy(buffer, tokenBegin, newbuffer, 0, bufsize - tokenBegin); ! System.arraycopy(buffer, 0, newbuffer, ! bufsize - tokenBegin, bufpos); ! buffer = newbuffer; ! System.arraycopy(bufline, tokenBegin, newbufline, 0, bufsize - tokenBegin); ! System.arraycopy(bufline, 0, newbufline, bufsize - tokenBegin, bufpos); ! bufline = newbufline; + System.arraycopy(bufcolumn, tokenBegin, newbufcolumn, 0, bufsize - tokenBegin); + System.arraycopy(bufcolumn, 0, newbufcolumn, bufsize - tokenBegin, bufpos); + bufcolumn = newbufcolumn; ! maxNextCharInd = (bufpos += (bufsize - tokenBegin)); } ! else { ! System.arraycopy(buffer, tokenBegin, newbuffer, 0, bufsize - tokenBegin); ! buffer = newbuffer; ! System.arraycopy(bufline, tokenBegin, newbufline, 0, bufsize - tokenBegin); ! bufline = newbufline; ! System.arraycopy(bufcolumn, tokenBegin, newbufcolumn, 0, bufsize - tokenBegin); ! 
bufcolumn = newbufcolumn; ! maxNextCharInd = (bufpos -= tokenBegin); } ! } ! catch (Throwable t) ! { ! throw new Error(t.getMessage()); ! } ! bufsize += 2048; ! available = bufsize; ! tokenBegin = 0; ! } ! private final void FillBuff() throws java.io.IOException ! { ! if (maxNextCharInd == available) ! { ! if (available == bufsize) { ! if (tokenBegin > 2048) ! { ! bufpos = maxNextCharInd = 0; ! available = tokenBegin; ! } ! else if (tokenBegin < 0) ! bufpos = maxNextCharInd = 0; ! else ! ExpandBuff(false); } + else if (available > tokenBegin) + available = bufsize; + else if ((tokenBegin - available) < 2048) + ExpandBuff(true); + else + available = tokenBegin; + } ! int i; ! try { ! if ((i = inputStream.read(buffer, maxNextCharInd, ! available - maxNextCharInd)) == -1) { ! inputStream.close(); ! throw new java.io.IOException(); } + else + maxNextCharInd += i; + return; + } + catch(java.io.IOException e) { + --bufpos; + backup(0); + if (tokenBegin == -1) + tokenBegin = bufpos; + throw e; + } + } ! public final char BeginToken() throws java.io.IOException ! { ! tokenBegin = -1; ! char c = readChar(); ! tokenBegin = bufpos; ! return c; ! } ! private final void UpdateLineColumn(char c) ! { ! column++; ! if (prevCharIsLF) ! { ! prevCharIsLF = false; ! line += (column = 1); ! } ! else if (prevCharIsCR) ! { ! prevCharIsCR = false; ! if (c == '\n') { ! prevCharIsLF = true; } + else + line += (column = 1); + } ! switch (c) ! { ! case '\r' : ! prevCharIsCR = true; ! break; ! case '\n' : ! prevCharIsLF = true; ! break; ! case '\t' : ! column--; ! column += (8 - (column & 07)); ! break; ! default : ! break; ! } + bufline[bufpos] = line; + bufcolumn[bufpos] = column; + } ! public final char readChar() throws java.io.IOException ! { ! if (inBuf > 0) ! { ! --inBuf; ! if (++bufpos == bufsize) ! bufpos = 0; ! return buffer[bufpos]; ! } ! if (++bufpos >= maxNextCharInd) ! FillBuff(); + char c = buffer[bufpos]; ! UpdateLineColumn(c); ! return (c); ! 
} + /** + * @deprecated + * @see #getEndColumn + */ ! public final int getColumn() { ! return bufcolumn[bufpos]; ! } ! /** ! * @deprecated ! * @see #getEndLine ! */ ! public final int getLine() { ! return bufline[bufpos]; ! } ! public final int getEndColumn() { ! return bufcolumn[bufpos]; ! } + public final int getEndLine() { + return bufline[bufpos]; + } ! public final int getBeginColumn() { ! return bufcolumn[tokenBegin]; ! } + public final int getBeginLine() { + return bufline[tokenBegin]; + } ! public final void backup(int amount) { ! inBuf += amount; ! if ((bufpos -= amount) < 0) ! bufpos += bufsize; ! } + public SimpleCharStream(java.io.Reader dstream, int startline, + int startcolumn, int buffersize) + { + inputStream = dstream; + line = startline; + column = startcolumn - 1; ! available = bufsize = buffersize; ! buffer = new char[buffersize]; ! bufline = new int[buffersize]; ! bufcolumn = new int[buffersize]; ! } + public SimpleCharStream(java.io.Reader dstream, int startline, + int startcolumn) + { + this(dstream, startline, startcolumn, 4096); + } ! public SimpleCharStream(java.io.Reader dstream) ! { ! this(dstream, 1, 1, 4096); ! } ! public void ReInit(java.io.Reader dstream, int startline, ! int startcolumn, int buffersize) ! { ! inputStream = dstream; ! line = startline; ! column = startcolumn - 1; ! if (buffer == null || buffersize != buffer.length) { ! available = bufsize = buffersize; ! buffer = new char[buffersize]; ! bufline = new int[buffersize]; ! bufcolumn = new int[buffersize]; } + prevCharIsLF = prevCharIsCR = false; + tokenBegin = inBuf = maxNextCharInd = 0; + bufpos = -1; + } + public void ReInit(java.io.Reader dstream, int startline, + int startcolumn) + { + ReInit(dstream, startline, startcolumn, 4096); + } ! public void ReInit(java.io.Reader dstream) ! { ! ReInit(dstream, 1, 1, 4096); ! } ! public SimpleCharStream(java.io.InputStream dstream, int startline, ! int startcolumn, int buffersize) ! { ! 
this(new java.io.InputStreamReader(dstream), startline, startcolumn, 4096); ! } ! public SimpleCharStream(java.io.InputStream dstream, int startline, ! int startcolumn) ! { ! this(dstream, startline, startcolumn, 4096); ! } + public SimpleCharStream(java.io.InputStream dstream) + { + this(dstream, 1, 1, 4096); + } ! public void ReInit(java.io.InputStream dstream, int startline, ! int startcolumn, int buffersize) ! { ! ReInit(new java.io.InputStreamReader(dstream), startline, startcolumn, 4096); ! } + public void ReInit(java.io.InputStream dstream) + { + ReInit(dstream, 1, 1, 4096); + } + public void ReInit(java.io.InputStream dstream, int startline, + int startcolumn) + { + ReInit(dstream, startline, startcolumn, 4096); + } + public final String GetImage() + { + if (bufpos >= tokenBegin) + return new String(buffer, tokenBegin, bufpos - tokenBegin + 1); + else + return new String(buffer, tokenBegin, bufsize - tokenBegin) + + new String(buffer, 0, bufpos + 1); + } ! public final char[] GetSuffix(int len) ! { ! char[] ret = new char[len]; ! if ((bufpos + 1) >= len) ! System.arraycopy(buffer, bufpos - len + 1, ret, 0, len); ! else ! { ! System.arraycopy(buffer, bufsize - (len - bufpos - 1), ret, 0, ! len - bufpos - 1); ! System.arraycopy(buffer, 0, ret, len - bufpos - 1, bufpos + 1); ! } ! return ret; ! } + public void Done() + { + buffer = null; + bufline = null; + bufcolumn = null; + } ! /** ! * Method to adjust line and column numbers for the start of a token.<BR> ! */ ! public void adjustBeginLineColumn(int newLine, int newCol) ! { ! int start = tokenBegin; ! int len; + if (bufpos >= tokenBegin) + { + len = bufpos - tokenBegin + inBuf + 1; + } + else + { + len = bufsize - tokenBegin + bufpos + 1 + inBuf; + } ! int i = 0, j = 0, k = 0; ! int nextColDiff = 0, columnDiff = 0; ! while (i < len && ! bufline[j = start % bufsize] == bufline[k = ++start % bufsize]) ! { ! bufline[j] = newLine; ! nextColDiff = columnDiff + bufcolumn[k] - bufcolumn[j]; ! 
bufcolumn[j] = newCol + columnDiff; ! columnDiff = nextColDiff; ! i++; ! } ! if (i < len) ! { ! bufline[j] = newLine++; ! bufcolumn[j] = newCol + columnDiff; ! while (i++ < len) { ! if (bufline[j = start % bufsize] != bufline[++start % bufsize]) ! bufline[j] = newLine++; ! else ! bufline[j] = newLine; } + } ! line = bufline[j]; ! column = bufcolumn[j]; ! } } Index: Token.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/carrot2-shared-lib/src/com/dawidweiss/carrot/tokenizer/Token.java,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** Token.java 19 Sep 2003 10:14:54 -0000 1.1.1.1 --- Token.java 9 Feb 2004 20:48:23 -0000 1.2 *************** *** 1,102 **** - - - /* - * Carrot2 Project - * Copyright (C) 2002-2003, Dawid Weiss - * Portions (C) Contributors listen in carrot2.CONTRIBUTORS file. - * All rights reserved. - * - * Refer to full text of the licence "carrot2.LICENCE" in the root folder - * of CVS checkout or at: - * http://www.cs.put.poznan.pl/dweiss/carrot2.LICENCE - */ - - /* Generated By:JavaCC: Do not edit this line. Token.java Version 2.1 */ package com.dawidweiss.carrot.tokenizer; - /** * Describes the input token stream. */ - public class Token - { - /** - * An integer that describes the kind of this token. This numbering system is determined by - * JavaCCParser, and a table of these numbers is stored in the file ...Constants.java. - */ - public int kind; - - /** - * beginLine and beginColumn describe the position of the first character of this token; - * endLine and endColumn describe the position of the last character of this token. - */ - public int beginLine; ! /** ! * beginLine and beginColumn describe the position of the first character of this token; ! * endLine and endColumn describe the position of the last character of this token. ! */ ! public int beginColumn; ! /** ! 
* beginLine and beginColumn describe the position of the first character of this token; ! * endLine and endColumn describe the position of the last character of this token. ! */ ! public int endLine; ! /** ! * beginLine and beginColumn describe the position of the first character of this token; ! * endLine and endColumn describe the position of the last character of this token. ! */ ! public int endColumn; ! /** The string image of the token. */ ! public String image; ! /** ! * A reference to the next regular (non-special) token from the input stream. If this is the ! * last token from the input stream, or if the token manager has not read tokens beyond this ! * one, this field is set to null. This is true only if this token is also a regular token. ! * Otherwise, see below for a description of the contents of this field. ! */ ! public Token next; ! /** ! * This field is used to access special tokens that occur prior to this token, but after the ! * immediately preceding regular (non-special) token. If there are no such special tokens, ! * this field is set to null. When there are more than one such special token, this field ! * refers to the last of these special tokens, which in turn refers to the next previous ! * special token through its specialToken field, and so on until the first special token ! * (whose specialToken field is null). The next fields of special tokens refer to other ! * special tokens that immediately follow it (without an intervening regular token). If there ! * is no such token, this field is null. ! */ ! public Token specialToken; ! /** ! * Returns the image. ! */ ! public final String toString() ! { ! return image; ! } - /** - * Returns a new Token object, by default. However, if you want, you can create and return - * subclass objects based on the value of ofKind. Simply add the cases to the switch for all - * those special cases. 
For example, if you have a subclass of Token called IDToken that you - * want to create if ofKind is ID, simlpy add something like : case MyParserConstants.ID : - * return new IDToken(); to the following switch statement. Then you can cast matchedToken - * variable to the appropriate type and use it in your lexical actions. - */ - public static final Token newToken(int ofKind) - { - switch (ofKind) - { - default: - return new Token(); - } - } } --- 1,81 ---- /* Generated By:JavaCC: Do not edit this line. Token.java Version 2.1 */ package com.dawidweiss.carrot.tokenizer; /** * Describes the input token stream. */ ! public class Token { ! /** ! * An integer that describes the kind of this token. This numbering ! * system is determined by JavaCCParser, and a table of these numbers is ! * stored in the file ...Constants.java. ! */ ! public int kind; ! /** ! * beginLine and beginColumn describe the position of the first character ! * of this token; endLine and endColumn describe the position of the ! * last character of this token. ! */ ! public int beginLine, beginColumn, endLine, endColumn; ! /** ! * The string image of the token. ! */ ! public String image; ! /** ! * A reference to the next regular (non-special) token from the input ! * stream. If this is the last token from the input stream, or if the ! * token manager has not read tokens beyond this one, this field is ! * set to null. This is true only if this token is also a regular ! * token. Otherwise, see below for a description of the contents of ! * this field. ! */ ! public Token next; ! /** ! * This field is used to access special tokens that occur prior to this ! * token, but after the immediately preceding regular (non-special) token. ! * If there are no such special tokens, this field is set to null. ! * When there are more than one such special token, this field refers ! * to the last of these special tokens, which in turn refers to the next ! 
* previous special token through its specialToken field, and so on ! * until the first special token (whose specialToken field is null). ! * The next fields of special tokens refer to other special tokens that ! * immediately follow it (without an intervening regular token). If there ! * is no such token, this field is null. ! */ ! public Token specialToken; ! /** ! * Returns the image. ! */ ! public final String toString() ! { ! return image; ! } + /** + * Returns a new Token object, by default. However, if you want, you + * can create and return subclass objects based on the value of ofKind. + * Simply add the cases to the switch for all those special cases. + * For example, if you have a subclass of Token called IDToken that + * you want to create if ofKind is ID, simlpy add something like : + * + * case MyParserConstants.ID : return new IDToken(); + * + * to the following switch statement. Then you can cast matchedToken + * variable to the appropriate type and use it in your lexical actions. + */ + public static final Token newToken(int ofKind) + { + switch(ofKind) + { + default : return new Token(); + } + } } Index: TokenMgrError.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/carrot2-shared-lib/src/com/dawidweiss/carrot/tokenizer/TokenMgrError.java,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** TokenMgrError.java 19 Sep 2003 10:14:54 -0000 1.1.1.1 --- TokenMgrError.java 9 Feb 2004 20:48:23 -0000 1.2 *************** *** 1,172 **** - - - /* - * Carrot2 Project - * Copyright (C) 2002-2003, Dawid Weiss - * Portions (C) Contributors listen in carrot2.CONTRIBUTORS file. - * All rights reserved. - * - * Refer to full text of the licence "carrot2.LICENCE" in the root folder - * of CVS checkout or at: - * http://www.cs.put.poznan.pl/dweiss/carrot2.LICENCE - */ - - /* Generated By:JavaCC: Do not edit this line. 
TokenMgrError.java Version 2.1 */ package com.dawidweiss.carrot.tokenizer; ! ! public class TokenMgrError ! extends Error { ! /* ! * Ordinals for various reasons why an Error of this type can be thrown. ! */ ! ! /** Lexical error occured. */ ! static final int LEXICAL_ERROR = 0; ! /** An attempt wass made to create a second instance of a static token manager. */ ! static final int STATIC_LEXER_ERROR = 1; ! /** Tried to change to an invalid lexical state. */ ! static final int INVALID_LEXICAL_STATE = 2; ! /** Detected (and bailed out of) an infinite loop in the token manager. */ ! static final int LOOP_DETECTED = 3; ! /** Indicates the reason why the exception is thrown. It will have one of the above 4 values. */ ! int errorCode; ! /** ! * Replaces unprintable characters by their espaced (or unicode escaped) equivalents in the ! * given string ! */ ! protected static final String addEscapes(String str) ! { ! StringBuffer retval = new StringBuffer(); ! char ch; ! for (int i = 0; i < str.length(); i++) { ! switch (str.charAt(i)) ! { ! case 0: ! ! continue; ! ! case '\b': ! retval.append("\\b"); ! ! continue; ! ! case '\t': ! retval.append("\\t"); ! ! continue; ! ! case '\n': ! retval.append("\\n"); ! ! continue; ! ! case '\f': ! retval.append("\\f"); ! ! continue; ! ! case '\r': ! retval.append("\\r"); ! ! continue; ! ! case '\"': ! retval.append("\\\""); ! ! continue; ! ! case '\'': ! retval.append("\\\'"); ! ! continue; ! ! case '\\': ! retval.append("\\\\"); ! ! continue; ! ! default: ! ! if (((ch = str.charAt(i)) < 0x20) || (ch > 0x7e)) ! { ! String s = "0000" + Integer.toString(ch, 16); ! retval.append("\\u" + s.substring(s.length() - 4, s.length())); ! } ! else ! { ! retval.append(ch); ! } ! ! continue; ! } } ! return retval.toString(); ! } ! ! ! /** ! * Returns a detailed message for the Error when it is thrown by the token manager to indicate ! * a lexical error. Parameters : EOFSeen : indicates if EOF caused the lexicl error ! 
* curLexState : lexical state in which this error occured errorLine : line number when the ! * error occured errorColumn : column number when the error occured errorAfter : prefix that ! * was seen before this error occured curchar : the offending character Note: You can ! * customize the lexical error message by modifying this method. ! */ ! private static final String LexicalError( ! boolean EOFSeen, int lexState, int errorLine, int errorColumn, String errorAfter, ! char curChar ! ) ! { ! return ("Lexical error at line " + errorLine + ", column " + errorColumn ! + ". Encountered: " ! + (EOFSeen ? "<EOF> " ! : (("\"" + addEscapes(String.valueOf(curChar)) + "\"") + " (" + (int) curChar ! + "), ")) + "after : \"" + addEscapes(errorAfter) + "\""); ! } ! ! ! /** ! * You can also modify the body of this method to customize your error messages. For example, ! * cases like LOOP_DETECTED and INVALID_LEXICAL_STATE are not of end-users concern, so you can ! * return something like : "Internal Error : Please file a bug report .... " from this method ! * for such cases in the release version of your parser. ! */ ! public String getMessage() ! { ! return super.getMessage(); ! } ! /* ! * Constructors of various flavors follow. ! */ ! public TokenMgrError() ! { ! } ! public TokenMgrError(String message, int reason) ! { ! super(message); ! errorCode = reason; ! } ! public TokenMgrError( ! boolean EOFSeen, int lexState, int errorLine, int errorColumn, String errorAfter, ! char curChar, int reason ! ) ! { ! this(LexicalError(EOFSeen, lexState, errorLine, errorColumn, errorAfter, curChar), reason); ! } } --- 1,133 ---- /* Generated By:JavaCC: Do not edit this line. TokenMgrError.java Version 2.1 */ package com.dawidweiss.carrot.tokenizer; ! public class TokenMgrError extends Error { ! /* ! * Ordinals for various reasons why an Error of this type can be thrown. ! */ ! /** ! * Lexical error occured. ! */ ! static final int LEXICAL_ERROR = 0; ! /** ! 
* An attempt wass made to create a second instance of a static token manager. ! */ ! static final int STATIC_LEXER_ERROR = 1; ! /** ! * Tried to change to an invalid lexical state. ! */ ! static final int INVALID_LEXICAL_STATE = 2; ! /** ! * Detected (and bailed out of) an infinite loop in the token manager. ! */ ! static final int LOOP_DETECTED = 3; ! /** ! * Indicates the reason why the exception is thrown. It will have ! * one of the above 4 values. ! */ ! int errorCode; ! /** ! * Replaces unprintable characters by their espaced (or unicode escaped) ! * equivalents in the given string ! */ ! protected static final String addEscapes(String str) { ! StringBuffer retval = new StringBuffer(); ! char ch; ! for (int i = 0; i < str.length(); i++) { ! switch (str.charAt(i)) { ! case 0 : ! continue; ! case '\b': ! retval.append("\\b"); ! continue; ! case '\t': ! retval.append("\\t"); ! continue; ! case '\n': ! retval.append("\\n"); ! continue; ! case '\f': ! retval.append("\\f"); ! continue; ! case '\r': ! retval.append("\\r"); ! continue; ! case '\"': ! retval.append("\\\""); ! continue; ! case '\'': ! retval.append("\\\'"); ! continue; ! case '\\': ! retval.append("\\\\"); ! continue; ! default: ! if ((ch = str.charAt(i)) < 0x20 || ch > 0x7e) { ! String s = "0000" + Integer.toString(ch, 16); ! retval.append("\\u" + s.substring(s.length() - 4, s.length())); ! } else { ! retval.append(ch); ! } ! continue; } + } + return retval.toString(); + } ! /** ! * Returns a detailed message for the Error when it is thrown by the ! * token manager to indicate a lexical error. ! * Parameters : ! * EOFSeen : indicates if EOF caused the lexicl error ! * curLexState : lexical state in which this error occured ! * errorLine : line number when the error occured ! * errorColumn : column number when the error occured ! * errorAfter : prefix that was seen before this error occured ! * curchar : the offending character ! 
* Note: You can customize the lexical error message by modifying this method. ! */ ! private static final String LexicalError(boolean EOFSeen, int lexState, int errorLine, int errorColumn, String errorAfter, char curChar) { ! return("Lexical error at line " + ! errorLine + ", column " + ! errorColumn + ". Encountered: " + ! (EOFSeen ? "<EOF> " : ("\"" + addEscapes(String.valueOf(curChar)) + "\"") + " (" + (int)curChar + "), ") + ! "after : \"" + addEscapes(errorAfter) + "\""); ! } ! /** ! * You can also modify the body of this method to customize your error messages. ! * For example, cases like LOOP_DETECTED and INVALID_LEXICAL_STATE are not ! * of end-users concern, so you can return something like : ! * ! * "Internal Error : Please file a bug report .... " ! * ! * from this method for such cases in the release version of your parser. ! */ ! public String getMessage() { ! return super.getMessage(); ! } + /* + * Constructors of various flavors follow. + */ ! public TokenMgrError() { ! } + public TokenMgrError(String message, int reason) { + super(message); + errorCode = reason; + } ! public TokenMgrError(boolean EOFSeen, int lexState, int errorLine, int errorColumn, String errorAfter, char curChar, int reason) { ! this(LexicalError(EOFSeen, lexState, errorLine, errorColumn, errorAfter, curChar), reason); ! 
} } Index: Tokenizer.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/carrot2-shared-lib/src/com/dawidweiss/carrot/tokenizer/Tokenizer.java,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** Tokenizer.java 19 Sep 2003 10:14:54 -0000 1.1.1.1 --- Tokenizer.java 9 Feb 2004 20:48:23 -0000 1.2 *************** *** 49,52 **** --- 49,58 ---- public static final int TYPE_SENTENCEMARKER = 0x0005; + /** INNER-SENTENCE PUNCTUATION MARK */ + public static final int TYPE_PUNCTUATION = 0x0006; + + /** Numeric sequence */ + public static final int TYPE_NUMERIC = 0x0007; + /** * Use factory method to acquire instances of this class. *************** *** 127,131 **** --- 133,144 ---- case TokenizerImplConstants.SENTENCEMARKER: tokenTypeHolder[0] = TYPE_SENTENCEMARKER; + break; + case TokenizerImplConstants.PUNCTUATION: + tokenTypeHolder[0] = TYPE_PUNCTUATION; + break; + + case TokenizerImplConstants.NUMERIC: + tokenTypeHolder[0] = TYPE_NUMERIC; break; Index: TokenizerImpl.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/carrot2-shared-lib/src/com/dawidweiss/carrot/tokenizer/TokenizerImpl.java,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** TokenizerImpl.java 19 Sep 2003 10:14:54 -0000 1.1.1.1 --- TokenizerImpl.java 9 Feb 2004 20:48:23 -0000 1.2 *************** *** 1,261 **** - - - /* - * Carrot2 Project - * Copyright (C) 2002-2003, Dawid Weiss - * Portions (C) Contributors listen in carrot2.CONTRIBUTORS file. - * All rights reserved. - * - * Refer to full text of the licence "carrot2.LICENCE" in the root folder - * of CVS checkout or at: - * http://www.cs.put.poznan.pl/dweiss/carrot2.LICENCE - */ - - /* Generated By:JavaCC: Do not edit this line. TokenizerImpl.java */ package com.dawidweiss.carrot.tokenizer; - /** ! 
* Implementation of abstract Tokenizer class generated by JavaCC parser generator. Based on ! * examples from the Egothor project (www.egothor.org). */ ! class TokenizerImpl ! implements TokenizerImplConstants ! { ! public TokenizerImplTokenManager token_source; ! SimpleCharStream jj_input_stream; ! public Token token; ! public Token jj_nt; ! private int jj_ntk; ! private int jj_gen; ! private final int [] jj_la1 = new int[0]; ! private final int [] jj_la1_0 = { }; ! ! public TokenizerImpl(java.io.InputStream stream) ! { ! jj_input_stream = new SimpleCharStream(stream, 1, 1); ! token_source = new TokenizerImplTokenManager(jj_input_stream); ! token = new Token(); ! jj_ntk = -1; ! jj_gen = 0; ! ! for (int i = 0; i < 0; i++) ! { ! jj_la1[i] = -1; ! } ! } ! public void ReInit(java.io.InputStream stream) ! { ! jj_input_stream.ReInit(stream, 1, 1); ! token_source.ReInit(jj_input_stream); ! token = new Token(); ! jj_ntk = -1; ! jj_gen = 0; ! for (int i = 0; i < 0; i++) ! { ! jj_la1[i] = -1; ! } ! } ! public TokenizerImpl(java.io.Reader stream) ! { ! jj_input_stream = new SimpleCharStream(stream, 1, 1); ! token_source = new TokenizerImplTokenManager(jj_input_stream); ! token = new Token(); ! jj_ntk = -1; ! jj_gen = 0; ! for (int i = 0; i < 0; i++) ! { ! jj_la1[i] = -1; ! } ! } ! public void ReInit(java.io.Reader stream) ! { ! jj_input_stream.ReInit(stream, 1, 1); ! token_source.ReInit(jj_input_stream); ! token = new Token(); ! jj_ntk = -1; ! jj_gen = 0; ! for (int i = 0; i < 0; i++) ! { ! jj_la1[i] = -1; ! } ! } ! public TokenizerImpl(TokenizerImplTokenManager tm) ! { ! token_source = tm; ! token = new Token(); ! jj_ntk = -1; ! jj_gen = 0; ! for (int i = 0; i < 0; i++) ! { ! jj_la1[... [truncated message content] |
From: <daw...@us...> - 2004-02-09 20:52:13
|
Update of /cvsroot/carrot2/carrot2
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv8916

Modified Files:
    history.xml
Log Message:
[change], component: carrot2-shared-lib
The tokenizer now recognizes numeric types and a wider range of url (and
www addresses).

Index: history.xml
===================================================================
RCS file: /cvsroot/carrot2/carrot2/history.xml,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** history.xml 8 Feb 2004 20:47:49 -0000 1.13
--- history.xml 9 Feb 2004 20:48:22 -0000 1.14
***************
*** 9,12 ****
--- 9,28 ----
  <history>
      <changelist>
+         <date>2004-02-09</date>
+         <committer>dawid</committer>
+
+         <change component="carrot2-shared-lib" type="change">
+         The tokenizer now recognizes numeric types and a wider range of url (and
+         www addresses).
+         </change>
+
+         <change component="carrot2.input.snippet-reader" type="bugfix">
+         HTML entities were emitted to the XML stream in the CDATA block.
+         It is now fixed.
+         </change>
+
+     </changelist>
+
+     <changelist>
          <date>2004-02-08</date>
          <committer>dawid</committer>
|
From: <daw...@us...> - 2004-02-09 20:52:13
|
Update of /cvsroot/carrot2/carrot2/components/carrot2-shared-lib
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv8916/components/carrot2-shared-lib

Modified Files:
    carrot2-shared-lib.dep.xml
Log Message:
[change], component: carrot2-shared-lib
The tokenizer now recognizes numeric types and a wider range of url (and
www addresses).

Index: carrot2-shared-lib.dep.xml
===================================================================
RCS file: /cvsroot/carrot2/carrot2/components/carrot2-shared-lib/carrot2-shared-lib.dep.xml,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** carrot2-shared-lib.dep.xml 6 Feb 2004 18:16:14 -0000 1.1
--- carrot2-shared-lib.dep.xml 9 Feb 2004 20:48:22 -0000 1.2
***************
*** 11,13 ****
--- 11,14 ----
      <dependency name="put-utils" />
      <dependency name="log4j" />
+     <dependency name="gnu-regexp" />
  </component>
|
From: <daw...@us...> - 2004-02-09 20:51:41
|
Update of /cvsroot/carrot2/carrot2/lib
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv8916/lib

Modified Files:
    carrot2-shared-lib.dep.xml carrot2-shared-lib.jar
Log Message:
[change], component: carrot2-shared-lib
The tokenizer now recognizes numeric types and a wider range of url (and
www addresses).

Index: carrot2-shared-lib.dep.xml
===================================================================
RCS file: /cvsroot/carrot2/carrot2/lib/carrot2-shared-lib.dep.xml,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** carrot2-shared-lib.dep.xml 6 Feb 2004 18:16:27 -0000 1.1
--- carrot2-shared-lib.dep.xml 9 Feb 2004 20:48:23 -0000 1.2
***************
*** 11,13 ****
--- 11,14 ----
      <dependency name="put-utils" />
      <dependency name="log4j" />
+     <dependency name="gnu-regexp" />
  </component>

Index: carrot2-shared-lib.jar
===================================================================
RCS file: /cvsroot/carrot2/carrot2/lib/carrot2-shared-lib.jar,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
Binary files /tmp/cvsi3YUkq and /tmp/cvsM4NdkB differ
|
From: <daw...@us...> - 2004-02-09 20:51:41
|
Update of /cvsroot/carrot2/carrot2/components/carrot2-shared-lib/src/com/dawidweiss/carrot/util
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv8916/components/carrot2-shared-lib/src/com/dawidweiss/carrot/util

Added Files:
    HTMLTextStripper.java
Log Message:
[change], component: carrot2-shared-lib
The tokenizer now recognizes numeric types and a wider range of url (and
www addresses).

--- NEW FILE: HTMLTextStripper.java ---

/*
 * Carrot2 Project
 * Copyright (C) 2002-2003, Dawid Weiss
 * All rights reserved.
 *
 * Refer to full text of the licence "carrot2.LICENCE" in the root folder
 * of CVS checkout or at:
 * http://www.cs.put.poznan.pl/dweiss/carrot2.LICENCE
 */
package com.dawidweiss.carrot.util;

import java.util.HashMap;
import java.util.Map;

import gnu.regexp.RE;
import gnu.regexp.REException;

/**
 * Utility class for stripping HTML tags and decoding some of HTML
 * entities.
 *
 * Instances of this class are not guaranteed to be thread-safe.
 */
public class HTMLTextStripper {

    /**
     * Returns an instance of the serializer. The instance is not thread-safe,
     * but can be reused many times (and should be).
     */
    public static HTMLTextStripper getInstance() {
        return new HTMLTextStripper();
    }

    /**
     * Use static <code>getInstance</code> method.
     */
    private HTMLTextStripper() {
    }

    /**
     * Strips all HTML tags from a string. Inserts a blank space for all tags it removes.
     */
    private static final String pattern = "(<.*?>)|(<script.*?/script>)";
    private static final RE patternMatch;
    private static final Map namedEntities;

    static {
        try {
            patternMatch = new RE( pattern, RE.REG_DOT_NEWLINE | RE.REG_ICASE );
        } catch (REException e) {
            throw new java.lang.Error( "RegExp pattern does not compile!" );
        }

        namedEntities = new HashMap();
        namedEntities.put("amp", "&");
        namedEntities.put("lt", "<");
        namedEntities.put("gt", ">");
        namedEntities.put("quot", "\"");
        namedEntities.put("apos", "'");
    }

    /**
     * Returns a textual representation of a block of HTML code.
     * SLOOOOOW implementation right now.
     */
    public String htmlToText(String html) {
        if (html != null) {
            String plain = patternMatch.substituteAll(html, " ");

            // now substitute character entities and
            // named entities
            StringBuffer buf = new StringBuffer( plain.length() );
            int max = plain.length();
            for (int i=0;i<max;i++) {
                if (plain.charAt(i) == '&') {
                    int j;
                    int maxlookahead = Math.min( max, i+20);
                    for (j = i+1; j<maxlookahead;j++) {
                        if (plain.charAt(j) == ';') {
                            break;
                        }
                    }
                    if (j==maxlookahead) {
                        // no end-of-entity semicolon?
                        // just place the ampersand then.
                        buf.append('&');
                    } else {
                        if (plain.charAt(i+1)=='#') {
                            try {
                                if (plain.charAt(i+2)=='x' || plain.charAt(i+2) == 'X') {
                                    // hex
                                    int value = Integer.parseInt( plain.substring(i+3, j), 16);
                                    buf.append((char) value);
                                } else {
                                    // dec
                                    int value = Integer.parseInt( plain.substring(i+2, j), 10);
                                    buf.append((char) value);
                                }
                            } catch (NumberFormatException f) {
                                // ignore wrong entities.
                            }
                        } else {
                            // named entity?
                            Object named;
                            if ((named = namedEntities.get( plain.substring(i+1, j)))!=null) {
                                buf.append(named);
                            } else {
                                // unrecognized named entity.
                            }
                        }
                        // go to the end of entity declaration
                        i = j;
                    }
                } else {
                    buf.append(plain.charAt(i));
                }
            }
            return buf.toString();
        }
        else return html;
    }
}
|
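The entity-decoding pass of htmlToText in the file above can be tried in isolation with a sketch like the following. This is a standalone approximation, not the committed class: it leaves out the gnu.regexp tag stripping, and the names EntityDecodingSketch and decode are invented for illustration. It follows the same rules as the posted code: a lookahead of at most 20 characters for the closing semicolon, a literal ampersand when none is found, decimal and hexadecimal numeric entities, five named entities, and silent dropping of anything unrecognized.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical standalone sketch of the entity-decoding step only.
final class EntityDecodingSketch {
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
    }

    static String decode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int max = text.length();
        for (int i = 0; i < max; i++) {
            char c = text.charAt(i);
            if (c != '&') {
                out.append(c);
                continue;
            }
            // Look for the terminating ';' within the next 19 characters.
            int limit = Math.min(max, i + 20);
            int semi = -1;
            for (int j = i + 1; j < limit; j++) {
                if (text.charAt(j) == ';') {
                    semi = j;
                    break;
                }
            }
            if (semi < 0) {
                // No semicolon in range: treat the ampersand as plain text.
                out.append('&');
                continue;
            }
            String body = text.substring(i + 1, semi);
            if (body.startsWith("#")) {
                try {
                    boolean hex = body.length() > 1
                            && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
                    int value = hex
                            ? Integer.parseInt(body.substring(2), 16)
                            : Integer.parseInt(body.substring(1), 10);
                    out.append((char) value);
                } catch (NumberFormatException e) {
                    // Malformed numeric entity: dropped, as in the posted code.
                }
            } else {
                String named = NAMED.get(body);
                if (named != null) {
                    out.append(named);
                }
                // Unrecognized named entities are dropped.
            }
            i = semi; // continue after the entity
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("abc&amp;&lt;&gt;&quot;&apos;def"));
        System.out.println(decode("abc&#65;def"));
        System.out.println(decode("abc & typical not entity."));
    }
}
```

One consequence of the design is worth noting: because unknown or malformed entities are removed rather than passed through, text such as "abc&namedEntity;def" comes out as "abcdef".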
From: <daw...@us...> - 2004-02-09 20:51:40
|
Update of /cvsroot/carrot2/carrot2/components/carrot2-shared-lib/src-test/com/dawidweiss/carrot/util
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv8916/components/carrot2-shared-lib/src-test/com/dawidweiss/carrot/util

Added Files:
    HTMLTextStripperTest.java
Log Message:
[change], component: carrot2-shared-lib
The tokenizer now recognizes numeric types and a wider range of url (and
www addresses).

--- NEW FILE: HTMLTextStripperTest.java ---

/*
 * Carrot2 Project
 * Copyright (C) 2002-2003, Dawid Weiss
 * Portions (C) Contributors listen in carrot2.CONTRIBUTORS file.
 * All rights reserved.
 *
 * Refer to full text of the licence "carrot2.LICENCE" in the root folder
 * of CVS checkout or at:
 * http://www.cs.put.poznan.pl/dweiss/carrot2.LICENCE
 */
package com.dawidweiss.carrot.util;

import junit.framework.TestCase;

/**
 * The <code>HTMLTextStripperTest</code> test cases.
 */
public class HTMLTextStripperTest extends TestCase {

    public HTMLTextStripperTest(String s) {
        super(s);
    }

    public HTMLTextStripperTest() {
        super();
    }

    public void testSimpleStrings() {
        String [][] pairs = new String [][] {
            {"no changes here!", "no changes here!"},
            {"", ""}
        };
        compare( pairs );
    }

    public void testCorrectTags() {
        String [][] pairs = new String [][] {
            {"abc <here is a tag> def", "abc def"},
            {"abc <start>def</start> gh", "abc def gh"}
        };
        compare( pairs );
    }

    public void testStandardEntities() {
        String [][] pairs = new String [][] {
            {"abc&amp;&lt;&gt;&quot;&apos;def", "abc&<>\"'def"}
        };
        compare( pairs );
    }

    public void testNumericDecimalEntities() {
        String [][] pairs = new String [][] {
            {"abc&#65;def", "abcAdef"}
        };
        compare( pairs );
    }

    public void testNumericHexEntities() {
        String [][] pairs = new String [][] {
            {"abc&#x41;def", "abcAdef"}
        };
        compare( pairs );
    }

    public void testMissingNamedEntities() {
        String [][] pairs = new String [][] {
            {"abc&namedEntity;def", "abcdef"}
        };
        compare( pairs );
    }

    public void testIncorrectNumericalEntities() {
        String [][] pairs = new String [][] {
            {"abc&#abc;def", "abcdef"}
        };
        compare( pairs );
    }

    public void testAmpersandNotAnEntity() {
        String [][] pairs = new String [][] {
            {"abc & typical not entity.", "abc & typical not entity."},
            {"&&&&", "&&&&" }
        };
        compare( pairs );
    }

    private final void compare( String [][] pairs ) {
        for (int i=0;i<pairs.length;i++) {
            assertEquals(
                normalize(pairs[i][1]),
                normalize(HTMLTextStripper.getInstance().htmlToText(pairs[i][0])));
        }
    }

    private String normalize(String t) {
        String p = t.trim();
        t = "";
        for (int i=0;i<p.length();i++) {
            if (t.length() > 0) {
                if (Character.isWhitespace(p.charAt(i))
                    && Character.isWhitespace(t.charAt(t.length()-1))) {
                    continue;
                }
            }
            t = t + p.charAt(i);
        }
        System.out.println(t);
        return t;
    }
}
|
From: <daw...@us...> - 2004-02-09 20:51:40
|
Update of /cvsroot/carrot2/carrot2/components/carrot2-shared-lib/src-test/com/dawidweiss/carrot/tokenizer In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv8916/components/carrot2-shared-lib/src-test/com/dawidweiss/carrot/tokenizer Modified Files: TokenizerImplTest.java Log Message: [change], component: carrot2-shared-lib The tokenizer now recognizes numeric types and a wider range of url (and www addresses). Index: TokenizerImplTest.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/carrot2-shared-lib/src-test/com/dawidweiss/carrot/tokenizer/TokenizerImplTest.java,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** TokenizerImplTest.java 19 Sep 2003 10:14:57 -0000 1.1.1.1 --- TokenizerImplTest.java 9 Feb 2004 20:48:23 -0000 1.2 *************** *** 89,95 **** break; default: ! typeName = "UNRECOGNIZED?!"; } --- 89,105 ---- break; + case Tokenizer.TYPE_PERSON: + typeName = "PERSON"; + break; + case Tokenizer.TYPE_PUNCTUATION: + typeName = "PUNCTIATION"; + break; + + case Tokenizer.TYPE_NUMERIC: + typeName = "NUMERIC"; + break; default: ! typeName = "UNRECOGNIZED?! = " + type; } *************** *** 100,104 **** public void test_Tokenizer_TYPE_TERM() { ! String test = " simple terms: simpleterm 9numterm numerm99x \"quoted string\""; TokenImage [] tokens = { --- 110,114 ---- public void test_Tokenizer_TYPE_TERM() { ! String test = " simple terms simpleterm 9numterm numerm99x \"quoted string\""; TokenImage [] tokens = { *************** *** 118,122 **** public void test_Tokenizer_TYPE_EMAIL() { ! String test = "e-mails: dw...@go... daw...@go... bu...@so... me...@me... bu...@ya..."; TokenImage [] tokens = { --- 128,132 ---- public void test_Tokenizer_TYPE_EMAIL() { ! String test = "e-mails dw...@go... daw...@go... bu...@so... me...@me... bu...@ya..."; TokenImage [] tokens = { *************** *** 136,141 **** { String test = ! 
" urls: http://www.google.com http://www.cs.put.poznan.pl/index.jsp?query=term&query2=term " ! + " ftp://ftp.server"; TokenImage [] tokens = { --- 146,151 ---- { String test = ! " urls http://www.google.com http://www.cs.put.poznan.pl/index.jsp?query=term&query2=term " ! + " ftp://ftp.server www.google.com not.an.url go2.pl/mail http://www.digimine.com/usama/datamine/."; TokenImage [] tokens = { *************** *** 144,148 **** new TokenImage( "http://www.cs.put.poznan.pl/index.jsp?query=term&query2=term", Tokenizer.TYPE_URL ! ), new TokenImage("ftp://ftp.server", Tokenizer.TYPE_URL) }; --- 154,171 ---- new TokenImage( "http://www.cs.put.poznan.pl/index.jsp?query=term&query2=term", Tokenizer.TYPE_URL ! ), ! new TokenImage("ftp://ftp.server", Tokenizer.TYPE_URL), ! new TokenImage("www.google.com", Tokenizer.TYPE_URL), ! ! new TokenImage("not", Tokenizer.TYPE_TERM), ! new TokenImage(".", Tokenizer.TYPE_SENTENCEMARKER), ! new TokenImage("an", Tokenizer.TYPE_TERM), ! new TokenImage(".", Tokenizer.TYPE_SENTENCEMARKER), ! new TokenImage("url", Tokenizer.TYPE_TERM), ! ! new TokenImage("go2.pl/mail", Tokenizer.TYPE_URL), ! ! new TokenImage("http://www.digimine.com/usama/datamine/", Tokenizer.TYPE_URL), ! new TokenImage(".", Tokenizer.TYPE_SENTENCEMARKER) }; *************** *** 153,160 **** public void test_Tokenizer_TYPE_PERSON() { ! String test = " O'J'Simpson and D.Weiss and D. Weiss and E.A.Bloober"; TokenImage [] tokens = { ! new TokenImage("O'J'Simpson", Tokenizer.TYPE_PERSON), new TokenImage("and", Tokenizer.TYPE_TERM), new TokenImage("D.Weiss", Tokenizer.TYPE_PERSON), --- 176,187 ---- public void test_Tokenizer_TYPE_PERSON() { ! String test = " O'J'Simpson and D.Weiss and D. Weiss and E.A.Bloober and SentenceEnD. Bloober"; TokenImage [] tokens = { ! new TokenImage("O", Tokenizer.TYPE_TERM), ! new TokenImage("'", Tokenizer.TYPE_PUNCTUATION), ! new TokenImage("J", Tokenizer.TYPE_TERM), ! new TokenImage("'", Tokenizer.TYPE_PUNCTUATION), ! 
new TokenImage("Simpson", Tokenizer.TYPE_TERM), new TokenImage("and", Tokenizer.TYPE_TERM), new TokenImage("D.Weiss", Tokenizer.TYPE_PERSON), *************** *** 162,166 **** new TokenImage("D. Weiss", Tokenizer.TYPE_PERSON), new TokenImage("and", Tokenizer.TYPE_TERM), ! new TokenImage("E.A.Bloober", Tokenizer.TYPE_PERSON) }; --- 189,197 ---- new TokenImage("D. Weiss", Tokenizer.TYPE_PERSON), new TokenImage("and", Tokenizer.TYPE_TERM), ! new TokenImage("E.A.Bloober", Tokenizer.TYPE_PERSON), ! new TokenImage("and", Tokenizer.TYPE_TERM), ! new TokenImage("SentenceEnD", Tokenizer.TYPE_TERM), ! new TokenImage(".", Tokenizer.TYPE_SENTENCEMARKER), ! new TokenImage("Bloober", Tokenizer.TYPE_TERM) }; *************** *** 171,175 **** public void test_Tokenizer_TYPE_TERM_acronyms() { ! String test = " acronyms: I.B.M. S.C. z o.o. AT&T garey&johnson&willet"; TokenImage [] tokens = { --- 202,206 ---- public void test_Tokenizer_TYPE_TERM_acronyms() { ! String test = " acronyms I.B.M. S.C. z o.o. AT&T garey&johnson&willet"; TokenImage [] tokens = { *************** *** 187,190 **** --- 218,241 ---- } + public void test_Tokenizer_TYPE_NUMERIC() + { + String test = " numeric 127 0 12.87 12,12 12-2003/23 term2003 2003term "; + TokenImage [] tokens = + { + new TokenImage("numeric", Tokenizer.TYPE_TERM), + + new TokenImage("127", Tokenizer.TYPE_NUMERIC), + new TokenImage("0", Tokenizer.TYPE_NUMERIC), + new TokenImage("12.87", Tokenizer.TYPE_NUMERIC), + new TokenImage("12,12", Tokenizer.TYPE_NUMERIC), + new TokenImage("12-2003/23", Tokenizer.TYPE_NUMERIC), + new TokenImage("term2003", Tokenizer.TYPE_TERM), + new TokenImage("2003term", Tokenizer.TYPE_TERM) + + }; + + compareTokenArrays(test, tokens); + } + private static void compareTokenArrays(String test, TokenImage [] expectedTokens) *************** *** 219,221 **** --- 270,298 ---- } } + + + public static void main(String [] args) throws Exception { + if (args.length > 0) { + for (int i=0;i<args.length;i++) { + java.io.File f = new 
java.io.File( args[i] ); + if (f.canRead()) { + byte [] fufu = new byte [(int) f.length()]; + java.io.FileInputStream is = new java.io.FileInputStream(f); + is.read(fufu); + is.close(); + + Tokenizer t = Tokenizer.getTokenizer(); + t.restartTokenizerOn(new String( fufu, "UTF-8")); + int [] type = {0}; + String image; + while ((image = t.getNextToken(type)) != null) { + TokenImage timage = new TokenImage(image, type[0]); + System.out.println( timage ); + } + } else { + System.err.println("Cannot read: " + f.getAbsolutePath()); + } + } + } + } } |
From: <daw...@us...> - 2004-02-09 20:51:12
|
Update of /cvsroot/carrot2/carrot2/components/inputs/snippet-reader/src/org/put/snippetreader/readers/HtmlMultipage In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv8815/snippet-reader/src/org/put/snippetreader/readers/HtmlMultipage Modified Files: HttpMultiPageReader.java Log Message: [bugfix], component: carrot2.input.snippet-reader HTML entities were emitted to the XML stream in the CDATA block. It is now fixed Index: HttpMultiPageReader.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/inputs/snippet-reader/src/org/put/snippetreader/readers/HtmlMultipage/HttpMultiPageReader.java,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** HttpMultiPageReader.java 19 Jan 2004 21:30:34 -0000 1.3 --- HttpMultiPageReader.java 9 Feb 2004 20:47:49 -0000 1.4 *************** *** 57,61 **** public byte [] getFirstResultsPage(String query, int resultsNeeded, String encoding, Element pageInfo) throws IOException { - String inputEncoding = encoding; String outputEncoding = encoding; --- 57,60 ---- |
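A note on the log message above: inside a CDATA section entity references are not expanded, so text such as `&amp;` reaches the consumer literally. The sketch below contrasts escaping element text with wrapping it in CDATA. It is only an illustration under stated assumptions: the class and method names are hypothetical and are not the helpers used in the Carrot2 sources.

```java
public final class XmlTextEscapeSketch {
    private XmlTextEscapeSketch() {}

    /** Hypothetical helper: escapes the markup-significant characters of XML element text. */
    public static String escapeXmlText(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /** Hypothetical helper: wraps the text verbatim in a CDATA section (nothing is decoded or escaped). */
    public static String wrapInCdata(String text) {
        return "<![CDATA[" + text + "]]>";
    }

    public static void main(String[] args) {
        // Plain text is escaped once, so an XML parser gives back the original characters.
        System.out.println("<title>" + escapeXmlText("Tom & Jerry <1940>") + "</title>");
        // An HTML entity placed in CDATA stays as the five literal characters "&amp;".
        System.out.println("<title>" + wrapInCdata("Tom &amp; Jerry") + "</title>");
    }
}
```

With escaping, the parser on the receiving side sees the original characters again; with CDATA, any entity already present in the input is passed through untouched.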
From: <daw...@us...> - 2004-02-09 20:51:12
|
Update of /cvsroot/carrot2/carrot2/components/inputs/snippet-reader/src/org/put/snippetreader/readers In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv8815/snippet-reader/src/org/put/snippetreader/readers Modified Files: WebSnippetReader.java Log Message: [bugfix], component: carrot2.input.snippet-reader HTML entities were emitted to the XML stream in the CDATA block. It is now fixed Index: WebSnippetReader.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/inputs/snippet-reader/src/org/put/snippetreader/readers/WebSnippetReader.java,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** WebSnippetReader.java 19 Jan 2004 21:30:34 -0000 1.3 --- WebSnippetReader.java 9 Feb 2004 20:47:49 -0000 1.4 *************** *** 25,30 **** --- 25,35 ---- import org.put.util.text.HtmlHelper; import org.put.util.xml.JDOMHelper; + + import com.dawidweiss.carrot.util.HTMLTextStripper; + import com.dawidweiss.carrot.util.XMLSerializerHelper; + import gnu.regexp.RE; import java.io.*; + import java.net.URL; import java.util.Vector; *************** *** 39,42 **** --- 44,49 ---- HttpMultiPageReader reader; RegExpSnippetExtractor extractor; + String baseURL; + String relativeBaseURL; /** *************** *** 49,52 **** --- 56,69 ---- config = configuration; + URL serviceURL = new URL(JDOMHelper.getStringFromJDOM("/service/request/service#url", configuration, true)); + baseURL = serviceURL.getProtocol() + "://" + serviceURL.getHost() + ( serviceURL.getPort() == -1 ? 
"" : ":" + serviceURL.getPort()) + "/"; + relativeBaseURL = JDOMHelper.getStringFromJDOM("/service/request/service#url", configuration, true); + if (relativeBaseURL.lastIndexOf('/') > 0) { + relativeBaseURL = relativeBaseURL.substring(0,relativeBaseURL.lastIndexOf('/') + 1); + } + + log.debug("Base service URL: " + baseURL); + log.debug("Base relative service URL: " + relativeBaseURL); + FormActionInfo actionInfo = new FormActionInfo( JDOMHelper.getElement("/service/request", config) *************** *** 116,119 **** --- 133,138 ---- int nosummary = 0; int recognized = 0; + HTMLTextStripper htmlStripper = HTMLTextStripper.getInstance(); + XMLSerializerHelper xmlSerializer = XMLSerializerHelper.getInstance(); public void snippetHasNoTitle() *************** *** 158,176 **** "<document id=\"" + recognized + "\">\n\t<title>" ); ! outputStream.write( ! xmlencode(HtmlHelper.removeHtmlTags(s.getTitle())) ! ); outputStream.write("</title>\n"); ! outputStream.write("\t<url>"); ! outputStream.write(xmlencode(s.getDocumentURL())); ! outputStream.write("</url>\n"); if (s.getSummary() != null) { outputStream.write("\t<snippet>"); ! outputStream.write( ! xmlencode(HtmlHelper.removeHtmlTags(s.getSummary())) ! ); outputStream.write("</snippet>\n"); } --- 177,203 ---- "<document id=\"" + recognized + "\">\n\t<title>" ); ! xmlSerializer.writeValidXmlText(outputStream, ! htmlStripper.htmlToText(s.getTitle()),false); outputStream.write("</title>\n"); ! outputStream.write("\t<url><![CDATA["); ! String docUrl = s.getDocumentURL(); ! if (docUrl.startsWith("/")) { ! outputStream.write( baseURL ); ! outputStream.write( docUrl ); ! } else if (docUrl.indexOf(':') < 0) { ! outputStream.write( relativeBaseURL ); ! outputStream.write(docUrl); ! } else { ! outputStream.write(docUrl); ! } ! outputStream.write("]]></url>\n"); if (s.getSummary() != null) { outputStream.write("\t<snippet>"); ! xmlSerializer.writeValidXmlText(outputStream, ! htmlStripper.htmlToText(s.getSummary()),false); ! 
outputStream.write("</snippet>\n"); } *************** *** 184,193 **** } } - - - private String xmlencode(String x) - { - return "<![CDATA[" + x + "]]>"; - } } ); --- 211,214 ---- |
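The diff above resolves document URLs in three cases: a link starting with "/" is appended to the base URL, a link without ':' is appended to the relative base URL, and anything else is written unchanged. Since baseURL is built with a trailing "/", the first case appears to produce a double slash after the host. The sketch below shows the same three cases using `java.net.URI.resolve` from the JDK instead of string concatenation. This is a swapped-in technique and not the code in the commit; the class name, method name and example URLs are made up for illustration.

```java
import java.net.URI;

public final class SnippetUrlResolveSketch {
    private SnippetUrlResolveSketch() {}

    /**
     * Resolves a link found in a result page against the URL of the service
     * that produced the page. Absolute links are returned unchanged.
     */
    public static String resolve(String serviceUrl, String docUrl) {
        return URI.create(serviceUrl).resolve(docUrl).toString();
    }

    public static void main(String[] args) {
        String service = "http://example.com/search/results"; // hypothetical service URL
        System.out.println(resolve(service, "/doc/1"));             // root-relative link
        System.out.println(resolve(service, "page2.html"));         // link relative to the service path
        System.out.println(resolve(service, "http://other.org/a")); // already absolute
    }
}
```

Note that `URI.create` throws `IllegalArgumentException` on input that is not a valid URI, so links scraped from real pages may need cleaning first.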
From: <daw...@us...> - 2004-02-08 20:57:25
|
Update of /cvsroot/carrot2/carrot2/components/filters/clustering/ahc-clustering/web/WEB-INF In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv10558 Modified Files: log4j.properties Log Message: log4j appender now propagates to the root instead of throwing stuff at the console. Index: log4j.properties =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/filters/clustering/ahc-clustering/web/WEB-INF/log4j.properties,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** log4j.properties 19 Sep 2003 10:16:25 -0000 1.1.1.1 --- log4j.properties 8 Feb 2004 20:54:10 -0000 1.2 *************** *** 1,10 **** # logging components definition. ! log4j.logger.com.mwroblewski.carrot = INFO,ahc.console #log4j.logger.com.mwroblewski.carrot.filter.ahcfilter.AHCFilter = DEBUG #log4j.logger.com.mwroblewski.carrot.filter.ahcfilter.ahc.AHC = DEBUG #log4j.logger.com.mwroblewski.carrot.filter.termsfilter.TermsFilter = DEBUG - - log4j.appender.ahc.console = org.apache.log4j.ConsoleAppender - log4j.appender.ahc.console.layout = org.apache.log4j.PatternLayout - log4j.appender.ahc.console.layout.ConversionPattern = %d %-5p %-20c{2} %3x - %m%n \ No newline at end of file --- 1,6 ---- # logging components definition. ! log4j.logger.com.mwroblewski.carrot = INFO #log4j.logger.com.mwroblewski.carrot.filter.ahcfilter.AHCFilter = DEBUG #log4j.logger.com.mwroblewski.carrot.filter.ahcfilter.ahc.AHC = DEBUG #log4j.logger.com.mwroblewski.carrot.filter.termsfilter.TermsFilter = DEBUG |
From: <daw...@us...> - 2004-02-08 20:51:24
|
Update of /cvsroot/carrot2/carrot2/components/controllers/carrot2-web-controller In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv9059/components/controllers/carrot2-web-controller Modified Files: .classpath Log Message: no message Index: .classpath =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/controllers/carrot2-web-controller/.classpath,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** .classpath 19 Sep 2003 10:14:57 -0000 1.1.1.1 --- .classpath 8 Feb 2004 20:48:18 -0000 1.2 *************** *** 14,27 **** <classpathentry kind="var" path="CARROT2_CVS/lib/dweiss-utils.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/gnu-regexp-1.1.4.jar"/> - <classpathentry kind="var" path="CARROT2_CVS/lib/jaxp.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/saxon.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/struts.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/xercesImpl.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/xml-apis.jar"/> - <classpathentry kind="var" path="CARROT2_CVS/lib/junit.jar"/> <classpathentry kind="lib" path="lib/bsf.jar"/> <classpathentry kind="lib" path="lib/bsh-1.2b7.jar"/> - <classpathentry kind="lib" path="lib/struts.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/carrot2-shared-lib.jar"/> <classpathentry kind="output" path="tmp/build/WEB-INF/classes"/> </classpath> --- 14,25 ---- <classpathentry kind="var" path="CARROT2_CVS/lib/dweiss-utils.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/gnu-regexp-1.1.4.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/saxon.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/struts.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/xercesImpl.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/xml-apis.jar"/> <classpathentry kind="lib" path="lib/bsf.jar"/> <classpathentry kind="lib" path="lib/bsh-1.2b7.jar"/> <classpathentry kind="var" 
path="CARROT2_CVS/lib/carrot2-shared-lib.jar"/> + <classpathentry kind="var" path="CARROT2_CVS/lib/compile-time-only/junit.jar"/> <classpathentry kind="output" path="tmp/build/WEB-INF/classes"/> </classpath> |
From: <daw...@us...> - 2004-02-08 20:51:24
|
Update of /cvsroot/carrot2/carrot2/components/carrot2-shared-lib In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv9059/components/carrot2-shared-lib Modified Files: .classpath build.xml Log Message: no message Index: .classpath =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/carrot2-shared-lib/.classpath,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** .classpath 6 Feb 2004 18:16:14 -0000 1.2 --- .classpath 8 Feb 2004 20:48:18 -0000 1.3 *************** *** 16,26 **** <classpathentry kind="var" path="CARROT2_CVS/lib/dweiss-utils.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/gnu-regexp-1.1.4.jar"/> - <classpathentry kind="var" path="CARROT2_CVS/lib/jaxp.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/saxon.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/struts.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/xercesImpl.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/xml-apis.jar"/> - <classpathentry kind="var" path="CARROT2_CVS/lib/junit.jar"/> <classpathentry kind="var" path="ECLIPSE_HOME/plugins/org.apache.ant_1.5.3/ant.jar"/> <classpathentry kind="output" path="tmp/build/classes"/> </classpath> --- 16,25 ---- <classpathentry kind="var" path="CARROT2_CVS/lib/dweiss-utils.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/gnu-regexp-1.1.4.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/saxon.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/struts.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/xercesImpl.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/xml-apis.jar"/> <classpathentry kind="var" path="ECLIPSE_HOME/plugins/org.apache.ant_1.5.3/ant.jar"/> + <classpathentry kind="var" path="CARROT2_CVS/lib/compile-time-only/junit.jar"/> <classpathentry kind="output" path="tmp/build/classes"/> </classpath> Index: build.xml =================================================================== RCS 
file: /cvsroot/carrot2/carrot2/components/carrot2-shared-lib/build.xml,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** build.xml 6 Feb 2004 18:16:14 -0000 1.4 --- build.xml 8 Feb 2004 20:48:18 -0000 1.5 *************** *** 189,192 **** --- 189,193 ---- <fileset dir="${build.dir}/classes"> <include name="com/dawidweiss/carrot/adapters/**" /> + <include name="com/dawidweiss/carrot/util/XMLSerializerHelper*.*" /> <exclude name="com/dawidweiss/carrot/adapters/Test*" /> <exclude name="com/dawidweiss/carrot/adapters/*.xml" /> |
From: <daw...@us...> - 2004-02-08 20:51:05
|
Update of /cvsroot/carrot2/carrot2 In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv8933 Modified Files: build.xml history.xml issues.txt Log Message: no message Index: build.xml =================================================================== RCS file: /cvsroot/carrot2/carrot2/build.xml,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** build.xml 6 Feb 2004 18:16:14 -0000 1.4 --- build.xml 8 Feb 2004 20:47:49 -0000 1.5 *************** *** 67,72 **** <ant antfile="build.xml" dir="components/carrot2-shared-lib" inheritall="false" target="clean" /> <ant antfile="build.xml" dir="components/carrot2-shared-lib" inheritall="false" target="build"> - <property name="distribution.dir" location="lib" /> </ant> <!-- copy the dependency specification --> <copy file="components/carrot2-shared-lib/carrot2-shared-lib.dep.xml" --- 67,73 ---- <ant antfile="build.xml" dir="components/carrot2-shared-lib" inheritall="false" target="clean" /> <ant antfile="build.xml" dir="components/carrot2-shared-lib" inheritall="false" target="build"> </ant> + <copy file="components/carrot2-shared-lib/tmp/dist/carrot2-shared-lib.jar" + tofile="lib/carrot2-shared-lib.jar" overwrite="true" /> <!-- copy the dependency specification --> <copy file="components/carrot2-shared-lib/carrot2-shared-lib.dep.xml" Index: history.xml =================================================================== RCS file: /cvsroot/carrot2/carrot2/history.xml,v retrieving revision 1.12 retrieving revision 1.13 diff -C2 -d -r1.12 -r1.13 *** history.xml 8 Feb 2004 11:01:50 -0000 1.12 --- history.xml 8 Feb 2004 20:47:49 -0000 1.13 *************** *** 12,15 **** --- 12,26 ---- <committer>dawid</committer> + <change component="carrot2.input.snippet-reader" type="bugfix"> + Some of Google result pages were not recognized due to a silly regexp + error (minimum distance beteen 'of about' and 'Search took' was set to + 20 characters, while it may be more than that). 
+ </change> + + <change component="carrot2.input.snippet-reader" type="bugfix"> + When a template was not recognized, the component returned 0-sized + result. Now it returns a HTTP 500 error together with a result. + </change> + <change component="global" type="new"> Added an Egothor Search Engine input component adapter. Index: issues.txt =================================================================== RCS file: /cvsroot/carrot2/carrot2/issues.txt,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** issues.txt 6 Feb 2004 18:16:14 -0000 1.2 --- issues.txt 8 Feb 2004 20:47:49 -0000 1.3 *************** *** 3,6 **** --- 3,8 ---- ------------ + 1. + Massimo Miccoli reports: *************** *** 9,10 **** --- 11,16 ---- so Carrot2 can work. Is not a true solution for security reason, but it work. + + 2. + + TreeSnippetMiner component has a bug where it falls into an infinite busy loop. |
From: <daw...@us...> - 2004-02-08 20:50:07
|
Update of /cvsroot/carrot2/carrot2/components/filters/linguistic/pl-eng-stemming/src/com/dawidweiss/carrot/filter/stemming/porter In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv8343/src/com/dawidweiss/carrot/filter/stemming/porter Added Files: PorterStemmer.java Log Message: Sources added again, small weird bug fixed (fsa file was read in chunks of 900 bytes... buffered reader added and now it works faster). |
From: <daw...@us...> - 2004-02-08 20:50:06
|
Update of /cvsroot/carrot2/carrot2/components/filters/linguistic/pl-eng-stemming/src/com/dawidweiss/carrot/filter/stemming/lametyzator In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv8343/src/com/dawidweiss/carrot/filter/stemming/lametyzator Added Files: Lametyzator.java polski.fsa Log Message: Sources added again, small weird bug fixed (fsa file was read in chunks of 900 bytes... buffered reader added and now it works faster). |
From: <daw...@us...> - 2004-02-08 20:50:06
|
Update of /cvsroot/carrot2/carrot2/components/filters/linguistic/pl-eng-stemming/src/com/dawidweiss/carrot/filter/stemming In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv8343/src/com/dawidweiss/carrot/filter/stemming Added Files: DirectStemmer.java Log Message: Sources added again, small weird bug fixed (fsa file was read in chunks of 900 bytes... buffered reader added and now it works faster). |
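The log message above mentions that the fsa file was read in small chunks and that a buffered reader was added. The committed code is not shown in this message, so the following is only a minimal sketch of the general technique: wrap the stream in a `BufferedInputStream` and loop until end of stream, because a single `read` call may return fewer bytes than requested. All names here are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public final class BufferedReadSketch {
    private BufferedReadSketch() {}

    /** Reads the whole stream into memory through a buffer, then closes it. */
    public static byte[] readFully(InputStream in) throws IOException {
        try (InputStream buffered = new BufferedInputStream(in, 8192)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            // read() may deliver fewer bytes than chunk.length; loop until -1 (end of stream).
            while ((n = buffered.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Convenience wrapper for in-memory data; converts the checked exception. */
    public static byte[] roundTrip(byte[] data) {
        try {
            return readFully(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[10000];
        System.out.println(roundTrip(data).length + " bytes read");
    }
}
```

The same loop applies to a `FileInputStream`: calling `read(byte[])` once and assuming the array is full is not guaranteed to work.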
From: <daw...@us...> - 2004-02-08 20:47:06
|
Update of /cvsroot/carrot2/carrot2/components/inputs/snippet-reader/src/org/put/snippetreader In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv7948/src/org/put/snippetreader Modified Files: XmlRpcServlet.java Log Message: [bugfix] google description fixed. [bugfix] null-results on exception fixed. Index: XmlRpcServlet.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/inputs/snippet-reader/src/org/put/snippetreader/XmlRpcServlet.java,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** XmlRpcServlet.java 30 Sep 2003 11:35:47 -0000 1.2 --- XmlRpcServlet.java 8 Feb 2004 20:43:57 -0000 1.3 *************** *** 325,328 **** --- 325,335 ---- { log.error("Exception when processing request.", e); + if (res.isCommitted()==false) { + // send error code + res.resetBuffer(); + res.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR, + "Could not process request: " + e.toString()); + } + return; } } |
From: <daw...@us...> - 2004-02-08 20:47:05
|
Update of /cvsroot/carrot2/carrot2/components/inputs/snippet-reader In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv7948 Modified Files: .classpath Log Message: [bugfix] google description fixed. [bugfix] null-results on exception fixed. Index: .classpath =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/inputs/snippet-reader/.classpath,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** .classpath 19 Sep 2003 10:20:37 -0000 1.1.1.1 --- .classpath 8 Feb 2004 20:43:57 -0000 1.2 *************** *** 14,25 **** <classpathentry kind="var" path="CARROT2_CVS/lib/dweiss-utils.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/gnu-regexp-1.1.4.jar"/> - <classpathentry kind="var" path="CARROT2_CVS/lib/jaxp.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/saxon.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/struts.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/xercesImpl.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/xml-apis.jar"/> - <classpathentry kind="var" path="CARROT2_CVS/lib/junit.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/carrot2-shared-lib.jar"/> <classpathentry kind="lib" path="lib/xmlrpc-1.1.jar"/> <classpathentry kind="output" path="tmp/build/WEB-INF/classes"/> </classpath> --- 14,24 ---- <classpathentry kind="var" path="CARROT2_CVS/lib/dweiss-utils.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/gnu-regexp-1.1.4.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/saxon.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/struts.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/xercesImpl.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/xml-apis.jar"/> <classpathentry kind="var" path="CARROT2_CVS/lib/carrot2-shared-lib.jar"/> <classpathentry kind="lib" path="lib/xmlrpc-1.1.jar"/> + <classpathentry kind="var" path="CARROT2_CVS/lib/compile-time-only/junit.jar"/> <classpathentry 
kind="output" path="tmp/build/WEB-INF/classes"/> </classpath> |
From: <daw...@us...> - 2004-02-08 20:47:05
|
Update of /cvsroot/carrot2/carrot2/components/inputs/snippet-reader/web/services In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv7948/web/services Modified Files: google-cs.xml google-pl.xml google.xml Log Message: [bugfix] google description fixed. [bugfix] null-results on exception fixed. Index: google-cs.xml =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/inputs/snippet-reader/web/services/google-cs.xml,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** google-cs.xml 19 Jan 2004 21:30:34 -0000 1.2 --- google-cs.xml 8 Feb 2004 20:43:57 -0000 1.3 *************** *** 54,58 **** <number-of-matched-documents> <regexpression> ! <match><![CDATA[of( about)?[^S]{0,20}Search]]></match> <replace regexp="[^0123456789]*" with="" /> </regexpression> --- 54,58 ---- <number-of-matched-documents> <regexpression> ! <match><![CDATA[of( about)?[^S]{0,40}Search]]></match> <replace regexp="[^0123456789]*" with="" /> </regexpression> Index: google-pl.xml =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/inputs/snippet-reader/web/services/google-pl.xml,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** google-pl.xml 19 Jan 2004 21:30:35 -0000 1.2 --- google-pl.xml 8 Feb 2004 20:43:57 -0000 1.3 *************** *** 55,59 **** <number-of-matched-documents> <regexpression> ! <match><![CDATA[of( about)?[^S]{0,20}Search]]></match> <replace regexp="[^0123456789]*" with="" /> </regexpression> --- 55,59 ---- <number-of-matched-documents> <regexpression> ! 
<match><![CDATA[of( about)?[^S]{0,40}Search]]></match> <replace regexp="[^0123456789]*" with="" /> </regexpression> Index: google.xml =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/inputs/snippet-reader/web/services/google.xml,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** google.xml 19 Jan 2004 21:30:35 -0000 1.2 --- google.xml 8 Feb 2004 20:43:57 -0000 1.3 *************** *** 53,57 **** <number-of-matched-documents> <regexpression> ! <match><![CDATA[of( about)?[^S]{0,20}Search]]></match> <replace regexp="[^0123456789]*" with="" /> </regexpression> --- 53,57 ---- <number-of-matched-documents> <regexpression> ! <match><![CDATA[of( about)?[^S]{0,40}Search]]></match> <replace regexp="[^0123456789]*" with="" /> </regexpression> |
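The effect of widening the window from {0,20} to {0,40} in the three service descriptors can be checked with a small test. The sketch below uses `java.util.regex` as a stand-in for the gnu.regexp library the component uses, so it assumes both engines treat this pattern the same way. The header text is an invented example and not a captured Google page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class MatchWindowSketch {
    private MatchWindowSketch() {}

    public static final Pattern OLD_PATTERN = Pattern.compile("of( about)?[^S]{0,20}Search");
    public static final Pattern NEW_PATTERN = Pattern.compile("of( about)?[^S]{0,40}Search");

    // Invented header: 35 characters separate "of about" from "Search".
    public static final String HEADER =
        "Results 1 - 10 of about 12,300,000 for carrot clustering. Search took 0.12 seconds";

    /** Returns the digits of the matched region, or null if the pattern is not found. */
    public static String matchedCount(Pattern pattern, String page) {
        Matcher m = pattern.matcher(page);
        if (!m.find()) {
            return null;
        }
        // Mirrors the <replace regexp="[^0123456789]*" with="" /> step of the descriptor.
        return m.group().replaceAll("[^0123456789]*", "");
    }

    public static void main(String[] args) {
        System.out.println("old: " + matchedCount(OLD_PATTERN, HEADER));
        System.out.println("new: " + matchedCount(NEW_PATTERN, HEADER));
    }
}
```

With this header the old pattern finds nothing, because the gap is longer than 20 characters, while the new one matches and yields the document count.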
Update of /cvsroot/carrot2/carrot2/components/inputs/treeSnippetMiner/src/com/paulodev/carrot/treeSnippetMiner/frequentTreeMiner In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv7534/src/com/paulodev/carrot/treeSnippetMiner/frequentTreeMiner Modified Files: FreqSubtreeMiner.java TreeExpansionElementOccurence.java Log Message: Small refactorings (eclipse) Index: FreqSubtreeMiner.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/inputs/treeSnippetMiner/src/com/paulodev/carrot/treeSnippetMiner/frequentTreeMiner/FreqSubtreeMiner.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** FreqSubtreeMiner.java 25 Sep 2003 22:13:33 -0000 1.1 --- FreqSubtreeMiner.java 8 Feb 2004 20:42:36 -0000 1.2 *************** *** 83,89 **** maxExpansion = ex; maxSize = ex.getTreeSize(); - System.out.println("Max: " + maxSize); - // System.out.println("Max: " + ex); - // printResult(ex); } } --- 83,86 ---- *************** *** 160,164 **** } dictionary.put(rootNodeDict.getName(), rootNodeDict); - System.out.println(dictionary); // only 100% support --- 157,160 ---- *************** *** 169,173 **** { TreeExpansion toEnum = (TreeExpansion)e.nextElement(); ! System.out.println("Enumerating: " + toEnum); if ( (maxExpansion == null) || ( ( (DictNodeOccurence)dictionary.get(toEnum.getName())). --- 165,169 ---- { TreeExpansion toEnum = (TreeExpansion)e.nextElement(); ! if ( (maxExpansion == null) || ( ( (DictNodeOccurence)dictionary.get(toEnum.getName())). 
Index: TreeExpansionElementOccurence.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/inputs/treeSnippetMiner/src/com/paulodev/carrot/treeSnippetMiner/frequentTreeMiner/TreeExpansionElementOccurence.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** TreeExpansionElementOccurence.java 25 Sep 2003 22:13:33 -0000 1.1 --- TreeExpansionElementOccurence.java 8 Feb 2004 20:42:36 -0000 1.2 *************** *** 164,168 **** { parent.setBound(bound); - // System.out.println("Mam granicę dla: " + node.getPosition() + " " + node.getName() + "=" + bound); } return true; --- 164,167 ---- |
From: <daw...@us...> - 2004-02-08 20:45:47
|
Update of /cvsroot/carrot2/carrot2/components/inputs/treeSnippetMiner/src/org/put/snippetreader/readers/HtmlMultipage In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv7534/src/org/put/snippetreader/readers/HtmlMultipage Modified Files: HttpMultiPageReader.java Log Message: Small refactorings (eclipse) Index: HttpMultiPageReader.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/inputs/treeSnippetMiner/src/org/put/snippetreader/readers/HtmlMultipage/HttpMultiPageReader.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** HttpMultiPageReader.java 25 Sep 2003 22:13:34 -0000 1.1 --- HttpMultiPageReader.java 8 Feb 2004 20:42:36 -0000 1.2 *************** *** 11,15 **** import java.io.*; import java.util.*; - import org.put.util.exception.*; import org.jdom.Element; import org.put.util.net.http.*; --- 11,14 ---- *************** *** 17,22 **** import org.put.util.xml.JDOMHelper; - import org.put.util.text.HtmlHelper; - import org.put.util.exception.*; import org.put.util.io.FileHelper; import org.apache.log4j.Logger; --- 16,19 ---- |
From: <daw...@us...> - 2004-02-08 20:45:47
|
Update of /cvsroot/carrot2/carrot2/components/inputs/treeSnippetMiner/src/com/paulodev/carrot/treeSnippetMiner/treeAnalyser/tokenFeature In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv7534/src/com/paulodev/carrot/treeSnippetMiner/treeAnalyser/tokenFeature Modified Files: IsURLCalc.java TfIdf.java Log Message: Small refactorings (eclipse) Index: IsURLCalc.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/inputs/treeSnippetMiner/src/com/paulodev/carrot/treeSnippetMiner/treeAnalyser/tokenFeature/IsURLCalc.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** IsURLCalc.java 25 Sep 2003 22:13:34 -0000 1.1 --- IsURLCalc.java 8 Feb 2004 20:42:36 -0000 1.2 *************** *** 11,18 **** import java.util.*; import java.net.URL; import com.paulodev.carrot.treeSnippetMiner.treeAnalyser.snippetTokenizer.Token; import com.paulodev.carrot.treeExtractor.extractors.TreeExtractor; - import java.net.*; import java.io.*; --- 11,18 ---- import java.util.*; + import java.net.HttpURLConnection; import java.net.URL; import com.paulodev.carrot.treeSnippetMiner.treeAnalyser.snippetTokenizer.Token; import com.paulodev.carrot.treeExtractor.extractors.TreeExtractor; import java.io.*; *************** *** 26,30 **** public double innerCalcValue(Token t, Vector strings) { int urlCount = 0; - System.out.println("Checking URLs: "); for (int i = 0; i < strings.size(); i++) { String toCheck = ((String)strings.get(i)).toLowerCase(); --- 26,29 ---- *************** *** 33,40 **** toCheck = TreeExtractor.clearURL(toCheck); URL temp = new URL(toCheck); - System.out.print("."); HttpURLConnection res = (HttpURLConnection) temp.openConnection(); ! res.setFollowRedirects(false); ! // res.getContent(); urlCount++; } --- 32,37 ---- toCheck = TreeExtractor.clearURL(toCheck); URL temp = new URL(toCheck); HttpURLConnection res = (HttpURLConnection) temp.openConnection(); ! 
HttpURLConnection.setFollowRedirects(false); urlCount++; } *************** *** 44,48 **** } } - System.out.println(" done"); return (double) urlCount / (double) strings.size(); } --- 41,44 ---- Index: TfIdf.java =================================================================== RCS file: /cvsroot/carrot2/carrot2/components/inputs/treeSnippetMiner/src/com/paulodev/carrot/treeSnippetMiner/treeAnalyser/tokenFeature/TfIdf.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** TfIdf.java 25 Sep 2003 22:13:34 -0000 1.1 --- TfIdf.java 8 Feb 2004 20:42:36 -0000 1.2 *************** *** 163,174 **** } - /* for (int i = 0; i < terms.length; i++) - { - System.out.print(terms[i].term + "(" + Math.round(terms[i].Entrophy * 100.0)/100.0 + ") \t"); - for (int j =0; j < documents.length; j++) - System.out.print( Math.round(DTMatrix[j][i]*100.0)/100.0 + "\t"); - System.out.println(); - }*/ - double res = 0; for (int i = 0; i < documents.length; i++) --- 163,166 ---- |
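In the IsURLCalc.java diff above the call changes from `res.setFollowRedirects(false)` to `HttpURLConnection.setFollowRedirects(false)`. Both forms invoke the same static method, which changes the default for every HTTP connection in the JVM. If the intent is to affect a single connection only, the JDK also offers `setInstanceFollowRedirects`. The sketch below shows that alternative. It is not what the commit does, and the class name and URL are placeholders.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

public final class RedirectSettingSketch {
    private RedirectSettingSketch() {}

    /**
     * Creates a connection object that will not follow redirects.
     * No network traffic happens here: openConnection() does not connect.
     */
    public static HttpURLConnection openWithoutRedirects(String url) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            // Per-connection setting; the JVM-wide default stays as it was.
            conn.setInstanceFollowRedirects(false);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn = openWithoutRedirects("http://example.com/");
        System.out.println("this connection follows redirects: " + conn.getInstanceFollowRedirects());
        System.out.println("JVM-wide default: " + HttpURLConnection.getFollowRedirects());
    }
}
```

The static setter remains the right call when every connection made by the component should ignore redirects.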