Thread: [Htmlparser-user] fixed previous problem - (however, new problem)
From: Doyle, A. <Ann...@au...> - 2002-05-01 20:07:05
Fixed:

<td rowspan=3><img height=49

alt="Central Intelligence Agency, Director of Central Intelligence"

src="graphics/images_home2/cia_banners_template3_01.gif"

width=241></td>

by changing HTMLTag as follows:

    public static int incrementCounter(HTMLReader reader, int state, int i, HTMLTag tag)
    {
        String strLine = null;
        if ((state == TAG_BEGIN_PARSING_STATE || state == TAG_IGNORE_DATA_STATE) && i == tag.getTagLine().length() - 1)
        {
            // We need to continue parsing to the next line
            while ((strLine = reader.getNextLine()).length() == 0);
            //tag.setTagLine(reader.getNextLine());
            tag.setTagLine(strLine);
            // convert the end of line to a space
            // The following line masked by Somik Raha, 15 Apr 2002, to fix space bug in links
            tag.append('\n');
            i = -1;
        }
        return ++i;
    }

NEW PROBLEM in following:

<div align="center"><font face="Arial,"helvetica," sans-serif="sans-serif" size="2" color="#FFFFFF"><a href="/index.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Home</font></a>
| <a href="/cia/notices.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Notices</font></a>
| <a href="/cia/notices.html#priv" link="#000000" vlink="#000000"><font color="#FFFFFF">Privacy</font></a>
| <a href="/cia/notices.html#sec" link="#000000" vlink="#000000"><font color="#FFFFFF">Security</font></a>
| <a href="/cia/contact.htm" link="#000000" vlink="#000000"><font color="#FFFFFF">Contact Us</font></a>
| <a href="/cia/sitemap.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Site Map</font></a>
| <a href="/cia/siteindex.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Index</font></a>
| <a href="/search" link="#000000" vlink="#000000"><font color="#FFFFFF">Search</font></a>
</font></div>

Stops at

TAG LINE FOUND <div align="center"><font face="Arial,"helvetica," sans-serif="sans-serif" size="2" color="#FFFFFF"><a href="/index.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Home</font></a>
LINE is <div align="center"><font face="Arial,"helvetica," sans-serif="sans-serif" size="2" color="#FFFFFF"><a href="/index.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Home</font></a>
POSITION IS 26
TAGLINE 197
Process completed.

Annette Doyle
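The idea behind the fix above, skipping blank lines while a tag continues onto the next line, can be sketched in isolation. This is only an illustration and not the HTMLParser code: the class TagLineJoiner and the method joinTagLines are hypothetical names, the input is a plain list of strings rather than an HTMLReader, and whitespace-only lines are treated as blank (the posted fix tests length() == 0 only). It also stops cleanly at end of input, which matters if getNextLine() can return null there (an assumption; the thread does not say).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of HTMLParser.
final class TagLineJoiner {
    /**
     * Joins the physical lines of one tag into a single logical line.
     * Blank or whitespace-only lines are skipped, and reading stops at
     * the first line containing '>' or at the end of the input.
     * Simplification: a '>' inside a quoted attribute value would also
     * end the tag here.
     */
    static String joinTagLines(List<String> lines) {
        StringBuilder tag = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines inside the tag
            }
            if (tag.length() > 0) {
                tag.append(' '); // a line break inside a tag acts as whitespace
            }
            tag.append(trimmed);
            if (trimmed.indexOf('>') >= 0) {
                break; // the tag is complete
            }
        }
        return tag.toString();
    }

    public static void main(String[] args) {
        // Attribute lines separated by blank lines, as in the <img> tag above.
        List<String> lines = Arrays.asList(
            "<img height=49 ", " ", "alt=\"logo\" ", " ", "width=241>");
        System.out.println(joinTagLines(lines));
    }
}
```

Working on a list of lines keeps the sketch free of I/O; the same loop would apply to any line source that signals end of input.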
From: Somik R. <so...@ya...> - 2002-05-02 02:42:14
Hi Annette,

Regarding the first problem, I wrote a testcase, but was unable to reproduce the error. Can you check out the latest code from CVS (HTMLImageScanner) and take a look at the testcase testImageTagOnThreeLines()? This test case passes. It ought to fail if there is a problem in the parsing.

Meanwhile I am taking a look at the second issue.

Regards,
Somik
From: Somik R. <so...@ya...> - 2002-05-02 02:59:22
Hi Annette,

Regarding your second problem, the parsing error occurs because of this:

<div align="center"><font face="Arial,"helvetica," sans-serif="sans-serif" size="2" color="#FFFFFF"><a href="/index.html" link="#000000" vlink="#000000"><font

In the above - font face="Arial,"helvetica," - note the erroneous extra " in front of helvetica. Remove it and the parsing is fine. Now of course you can't remove it, because this site is not yours :). So we do have to support this kind of dirty html. Thank you so much for bringing it to our notice. I have written a test case to reproduce this bug, and am working to resolve this.

Regards,
Somik
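To see why one stray " is enough to derail the parse, here is a minimal sketch of a scanner that ignores everything between double quotes. It is not the HTMLParser implementation; the class QuoteToggleDemo and the method findTagEnd are hypothetical names, and the two input strings are shortened stand-ins for the tag above.

```java
// Hypothetical demo for illustration only; not part of HTMLParser.
final class QuoteToggleDemo {
    /**
     * Returns the index of the first '>' that lies outside double quotes,
     * or -1 if there is none. Every '"' simply flips the in-quotes flag.
     */
    static int findTagEnd(String line) {
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == '>' && !inQuotes) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String clean = "<font face=\"Arial,helvetica\" size=\"2\">Home</font>";
        String dirty = "<font face=\"Arial,\"helvetica,\" size=\"2\">Home</font>";
        System.out.println(findTagEnd(clean)); // 37: the '>' that closes <font ...>
        System.out.println(findTagEnd(dirty)); // -1: the quote count is odd, so every '>' looks quoted
    }
}
```

With an odd number of quotes the flag is still set when the closing '>' arrives, so the scanner runs off the end of the line without ever finding the end of the tag.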
From: Somik R. <so...@ya...> - 2002-05-02 03:11:27
Hi Folks,

If you've been following the latest exchange on htmlparser-user, Annette has shown us a crazy example of dirty html, which works in the browser, but crashes the parser. The site is http://www.cia.gov

Search for this string -

<font face="Arial,"helvetica,"

and you will find it in the html. Now this erroneous inverted comma in front of helvetica should not be there.

This has been captured in a test case in HTMLTagTest.java (you can get it from CVS), and this test fails (testParsing()).

The problem is that the core parsing mechanism ignores anything within inverted commas. This is critical so as to be able to accept angular brackets in inverted commas. If we remove this feature from the parser, other tests will break.

So I need some suggestions on how we might modify our parsing: how do we intelligently understand that this is an error (how easy it is for us humans to figure this out)? Looks like linear approaches wouldn't work anymore... Maybe we need to associate some intelligence: if it's a font tag, then this kind of stuff is most definitely an error, whereas if it's a jsp tag, we can get more strict with our parsing. This will probably cause a fundamental shift in our core parsing technique.

Regards,
Somik
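One possible direction, offered only as a sketch and not as what the project adopted: keep the linear scan, but in a lenient mode accept a " as opening a value only directly after '=', and as closing it only when followed by whitespace, '>', '/' or the end of the line. The class LenientQuoteScanner, the method findTagEnd and the lenient flag are hypothetical names; the flag stands in for the per-tag strictness suggested above (lenient for font, strict for jsp).

```java
// Hypothetical sketch for illustration only; not part of HTMLParser.
final class LenientQuoteScanner {
    /**
     * Returns the index of the first '>' outside a quoted value, or -1.
     * Strict mode: every '"' toggles the in-quotes flag.
     * Lenient mode: a '"' opens a value only directly after '=', and closes
     * it only when followed by whitespace, '>', '/' or the end of the line;
     * any other '"' is treated as stray and ignored.
     */
    static int findTagEnd(String line, boolean lenient) {
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (!lenient) {
                    inQuotes = !inQuotes;
                } else if (!inQuotes) {
                    if (i > 0 && line.charAt(i - 1) == '=') {
                        inQuotes = true;
                    }
                } else {
                    boolean atEnd = (i + 1 == line.length());
                    char next = atEnd ? ' ' : line.charAt(i + 1);
                    if (Character.isWhitespace(next) || next == '>' || next == '/') {
                        inQuotes = false;
                    }
                }
            } else if (c == '>' && !inQuotes) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String dirty = "<font face=\"Arial,\"helvetica,\" size=\"2\">Home</font>";
        String brackets = "<a title=\"x > y\" href=\"/\">";
        System.out.println(findTagEnd(dirty, false));   // -1: strict toggling never finds the end
        System.out.println(findTagEnd(dirty, true));    // 39: the stray quote is ignored
        System.out.println(findTagEnd(brackets, true)); // 25: '>' inside quotes is still skipped
    }
}
```

The heuristic keeps angle brackets inside quoted values working, but it can still misfire, for example on a value that legitimately contains a quote followed by a space, or on attributes written with spaces around '='.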
From: Somik R. <so...@ya...> - 2002-05-03 08:35:27
Hi Annette,

I went through the first problem you reported again, and I realized the mistake in my testcase: this tag has two newlines instead of one for each line. I could reproduce the bug after that. I have applied your fix and updated CVS. Thanks a lot.

Regards,
Somik