[Htmlparser-cvs] htmlparser/src/org/htmlparser/tests/lexerTests LexerTests.java,1.15,1.16
From: <der...@us...> - 2004-01-10 15:23:36
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/lexerTests
In directory sc8-pr-cvs1:/tmp/cvs-serv3574/tests/lexerTests

Modified Files:
    LexerTests.java
Log Message:
Fix bug #874175 StringBean doesn't handle charset change well
Add EncodingChangeException to distinguish a recoverable character set change
occurring after the lexer has already coughed up some characters using the
wrong encoding.
Added testEncodingChange in LexerTests to exercise it.
Changed IteratorImpl to not wrap a ParserException with another ParserException.
Changed StringBean to retry the URL when an encoding change exception is caught.

Index: LexerTests.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/lexerTests/LexerTests.java,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** LexerTests.java 2 Jan 2004 16:24:55 -0000 1.15
--- LexerTests.java 10 Jan 2004 15:23:33 -0000 1.16
***************
*** 52,55 ****
--- 52,56 ----
  import org.htmlparser.util.NodeIterator;
  import org.htmlparser.util.NodeList;
+ import org.htmlparser.util.EncodingChangeException;
  import org.htmlparser.util.ParserException;
  
***************
*** 620,628 ****
       * causes spurious tags.
       * The root cause is characters bracketed by [esc]$B and [esc](J (contrary
!      * to what is indicated in then j_s_nightingale analysis of the problem) that
       * sometimes have an angle bracket (< or 0x3c) embedded in them. These
       * are taken to be tags by the parser, instead of being considered strings.
       * <p>
!      * The URL refrenced has an ISO-8859-1 encoding (the default), but
       * Japanese characters intermixed on the page with English, using the JIS
       * encoding. We detect failure by looking for weird tag names which were
--- 621,629 ----
       * causes spurious tags.
       * The root cause is characters bracketed by [esc]$B and [esc](J (contrary
!      * to what is indicated in the j_s_nightingale analysis of the problem) that
       * sometimes have an angle bracket (< or 0x3c) embedded in them. These
       * are taken to be tags by the parser, instead of being considered strings.
       * <p>
!      * The URL http://www.009.com/ has an ISO-8859-1 encoding (the default), but
       * Japanese characters intermixed on the page with English, using the JIS
       * encoding. We detect failure by looking for weird tag names which were
***************
*** 666,670 ****
          NodeIterator iterator;
  
!         parser = new Parser ("http://www.009.com/");
          iterator = parser.elements ();
          while (iterator.hasMoreNodes ())
--- 667,671 ----
          NodeIterator iterator;
  
!         parser = new Parser ("http://htmlparser.sourceforge.net/test/www_009_com.html");
          iterator = parser.elements ();
          while (iterator.hasMoreNodes ())
***************
*** 745,748 ****
--- 746,784 ----
      }
  
+     /**
+      * See bug #874175 StringBean doesn't handle charset change well
+      * Force an encoding change exception, reset and re-read.
+      */
+     public void testEncodingChange ()
+         throws
+             ParserException
+     {
+         NodeIterator iterator;
+         Node node;
+         boolean success;
+ 
+         parser = new Parser ("http://htmlparser.sourceforge.net/test/www_china-pub_com.html");
+         success = false;
+         try
+         {
+             for (iterator = parser.elements (); iterator.hasMoreNodes (); )
+                 node = iterator.nextNode ();
+         }
+         catch (EncodingChangeException ece)
+         {
+             success = true;
+             try
+             {
+                 parser.reset ();
+                 for (iterator = parser.elements (); iterator.hasMoreNodes (); )
+                     node = iterator.nextNode ();
+             }
+             catch (ParserException pe)
+             {
+                 success = false;
+             }
+         }
+         assertTrue ("encoding change failed", success);
+     }
  }
  
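
For readers without the library on hand, the retry pattern the log message describes (catch the recoverable encoding change, reset, re-read from the start) might look roughly like the sketch below. Everything in it is a hypothetical stand-in: the nested ParserException and EncodingChangeException classes only mimic the subclass relationship implied by the test's nested catch blocks, and FakeParser, its constructor arguments and readAll are invented for illustration. It is not the actual StringBean or Parser code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EncodingRetrySketch {

    /** Hypothetical stand-in for the library's general parse failure. */
    static class ParserException extends Exception {
        ParserException(String message) { super(message); }
    }

    /** Hypothetical stand-in: a recoverable failure, assumed to be a subclass. */
    static class EncodingChangeException extends ParserException {
        EncodingChangeException(String message) { super(message); }
    }

    /** Invented source that learns its real charset only after emitting some nodes. */
    static class FakeParser {
        private final List<String> nodes;
        private final String declaredCharset;
        private final int declaredAt;   // index at which the charset declaration is seen
        private String encoding = "ISO-8859-1";
        private int position = 0;

        FakeParser(List<String> nodes, String declaredCharset, int declaredAt) {
            this.nodes = nodes;
            this.declaredCharset = declaredCharset;
            this.declaredAt = declaredAt;
        }

        String getEncoding() { return encoding; }

        /** Rewind to the start; the newly learned encoding is kept. */
        void reset() { position = 0; }

        boolean hasMoreNodes() { return position < nodes.size(); }

        String nextNode() throws ParserException {
            if (position == declaredAt && !encoding.equals(declaredCharset)) {
                // Nodes were already handed out under the wrong encoding.
                encoding = declaredCharset;
                throw new EncodingChangeException("character set changed to " + declaredCharset);
            }
            return nodes.get(position++);
        }
    }

    /** Read every node, retrying once from the start if the encoding changes midway. */
    static List<String> readAll(FakeParser parser) throws ParserException {
        List<String> out = new ArrayList<String>();
        try {
            while (parser.hasMoreNodes())
                out.add(parser.nextNode());
        } catch (EncodingChangeException ece) {
            out.clear();        // discard text decoded with the wrong encoding
            parser.reset();
            while (parser.hasMoreNodes())
                out.add(parser.nextNode());
        }
        return out;
    }

    public static void main(String[] args) throws ParserException {
        FakeParser parser = new FakeParser(
            Arrays.asList("<html>", "<meta>", "text"), "GB2312", 1);
        System.out.println(readAll(parser) + " read as " + parser.getEncoding());
    }
}
```

Catching only the subclass in the outer handler keeps other ParserExceptions fatal, which is presumably why the commit stops IteratorImpl from wrapping one ParserException in another: a wrapped exception would no longer match the narrower catch.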