Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/lexerTests
In directory sc8-pr-cvs1:/tmp/cvs-serv3574/tests/lexerTests
Modified Files:
LexerTests.java
Log Message:
Fix bug #874175 StringBean doesn't handle charset change well
Add EncodingChangeException to distinguish a recoverable character set change
occurring after the lexer has already coughed up some characters using the wrong
encoding. Added testEncodingChange in LexerTests to exercise it.
Changed IteratorImpl to not wrap a ParserException with another ParserException.
Changed StringBean to retry the URL when an encoding change exception is caught.
Index: LexerTests.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/lexerTests/LexerTests.java,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** LexerTests.java 2 Jan 2004 16:24:55 -0000 1.15
--- LexerTests.java 10 Jan 2004 15:23:33 -0000 1.16
***************
*** 52,55 ****
--- 52,56 ----
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.NodeList;
+ import org.htmlparser.util.EncodingChangeException;
import org.htmlparser.util.ParserException;
***************
*** 620,628 ****
* causes spurious tags.
* The root cause is characters bracketed by [esc]$B and [esc](J (contrary
! * to what is indicated in then j_s_nightingale analysis of the problem) that
* sometimes have an angle bracket (< or 0x3c) embedded in them. These
* are taken to be tags by the parser, instead of being considered strings.
* <p>
! * The URL refrenced has an ISO-8859-1 encoding (the default), but
* Japanese characters intermixed on the page with English, using the JIS
* encoding. We detect failure by looking for weird tag names which were
--- 621,629 ----
* causes spurious tags.
* The root cause is characters bracketed by [esc]$B and [esc](J (contrary
! * to what is indicated in the j_s_nightingale analysis of the problem) that
* sometimes have an angle bracket (< or 0x3c) embedded in them. These
* are taken to be tags by the parser, instead of being considered strings.
* <p>
! * The URL http://www.009.com/ has an ISO-8859-1 encoding (the default), but
* Japanese characters intermixed on the page with English, using the JIS
* encoding. We detect failure by looking for weird tag names which were
***************
*** 666,670 ****
NodeIterator iterator;
! parser = new Parser ("http://www.009.com/");
iterator = parser.elements ();
while (iterator.hasMoreNodes ())
--- 667,671 ----
NodeIterator iterator;
! parser = new Parser ("http://htmlparser.sourceforge.net/test/www_009_com.html");
iterator = parser.elements ();
while (iterator.hasMoreNodes ())
***************
*** 745,748 ****
--- 746,784 ----
}
+ /**
+ * See bug #874175 StringBean doesn't handle charset change well
+ * Force an encoding change exception, reset and re-read.
+ */
+ public void testEncodingChange ()
+ throws
+ ParserException
+ {
+ NodeIterator iterator;
+ Node node;
+ boolean success;
+
+ parser = new Parser ("http://htmlparser.sourceforge.net/test/www_china-pub_com.html");
+ success = false;
+ try
+ {
+ for (iterator = parser.elements (); iterator.hasMoreNodes (); )
+ node = iterator.nextNode ();
+ }
+ catch (EncodingChangeException ece)
+ {
+ success = true;
+ try
+ {
+ parser.reset ();
+ for (iterator = parser.elements (); iterator.hasMoreNodes (); )
+ node = iterator.nextNode ();
+ }
+ catch (ParserException pe)
+ {
+ success = false;
+ }
+ }
+ assertTrue ("encoding change failed", success);
+ }
}
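
The retry-on-encoding-change pattern described in the log message (and exercised by testEncodingChange above) can be sketched without the library roughly as follows. This is a minimal, self-contained illustration, not the actual StringBean or Lexer code: the class EncodingRetrySketch, the nested EncodingChange exception, and the decode and extract helpers are all hypothetical stand-ins, and the "charset=" scan is a simplifying assumption about how the declared encoding is discovered.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingRetrySketch {

    /** Hypothetical stand-in for org.htmlparser.util.EncodingChangeException. */
    public static class EncodingChange extends Exception {
        public final Charset detected;

        public EncodingChange(Charset detected) {
            super("character set changed to " + detected);
            this.detected = detected;
        }
    }

    /**
     * Decode the page with the assumed charset. If the decoded text declares
     * a different charset, signal a recoverable encoding change instead of
     * returning characters produced with the wrong encoding.
     */
    public static String decode(byte[] page, Charset assumed) throws EncodingChange {
        String text = new String(page, assumed);
        String key = "charset=";
        int at = text.indexOf(key);
        if (at < 0) {
            return text; // no declaration, keep the assumed charset
        }
        int start = at + key.length();
        int end = start;
        while (end < text.length()) {
            char c = text.charAt(end);
            if (!Character.isLetterOrDigit(c) && c != '-' && c != '_') {
                break;
            }
            end++;
        }
        if (end == start) {
            return text; // empty declaration, nothing to switch to
        }
        Charset declared = Charset.forName(text.substring(start, end));
        if (!declared.equals(assumed)) {
            throw new EncodingChange(declared);
        }
        return text;
    }

    /**
     * Start with ISO-8859-1 (the default mentioned in the test comments) and
     * retry once from the beginning if an encoding change is reported.
     */
    public static String extract(byte[] page) {
        try {
            return decode(page, StandardCharsets.ISO_8859_1);
        } catch (EncodingChange first) {
            try {
                return decode(page, first.detected);
            } catch (EncodingChange second) {
                // A second change is treated as a real failure in this sketch.
                throw new IllegalStateException("encoding changed twice", second);
            }
        }
    }
}
```

As in the test, the first pass is allowed to fail with the distinguished exception, and only a failure on the second pass counts as an error. Retrying once rather than looping is a design choice of this sketch that keeps a page with contradictory declarations from causing endless re-reads.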