Thread: [Htmlparser-user] Parsing quoted strings seems to be broken in 1.6
|
From: sebb <se...@gm...> - 2007-01-08 00:40:53
|
The sample script:
<HTML>
<body>
<script>
fred = "<img src='a.gif'></img>"
</script>
</body>
</HTML>
generates the following output from parser.cmd:
Tag (0[0,0],6[0,6]): HTML
Txt (6[0,6],10[1,2]): \n
Tag (10[1,2],16[1,8]): body
Txt (16[1,8],20[2,2]): \n
Tag (20[2,2],28[2,10]): script
Txt (28[2,10],57[3,27]): \n fred = "<img src='a.gif'>
End (57[3,27],57[3,27]): /script
End (57[3,27],63[3,33]): /img
Txt (63[3,33],68[4,2]): "\n
End (68[4,2],77[4,11]): /script
Txt (77[4,11],81[5,2]): \n
End (81[5,2],88[5,9]): /body
Txt (88[5,9],90[6,0]): \n
End (90[6,0],97[6,7]): /HTML
Txt (97[6,7],101[8,0]): \n\n
It looks like the closing tag is being recognised - though the opening
tag is not.
Is this a bug, or have I misunderstood something?
|
|
From: Derrick O. <Der...@Ro...> - 2007-01-08 01:53:29
|
For parsing bad script like this you probably want to set the static boolean value org.htmlparser.scanners.ScriptScanner.STRICT to false. See the explanation in the ScriptScanner.java file.
|
From: sebb <se...@gm...> - 2007-01-08 11:21:09
|
Thanks for the quick reply. I'll give it a try. However, I'm not sure why the script example is bad. It is not enclosed in "<!--" and "// -->", but AIUI those are only needed as a work-round for older browsers that did not understand the <script> tag.
|
From: sebb <se...@gm...> - 2007-01-08 14:58:28
|
Sorry, my bad - I've now read the document referenced in the scanner source, and I see that "</" acts as the terminator unless suitably hidden. S.
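The rule sebb refers to can be sketched in a few lines of Java. This is only an illustration, not the ScriptScanner source: the names StrictScriptEnd and firstEtago are made up, and the sketch assumes the HTML 4.01 wording that script content ends at the first "</" followed by a letter. It also shows one way of hiding the sequence, by escaping the slash inside the JavaScript string.

```java
// Illustrative sketch only; not taken from the HTML Parser library.
final class StrictScriptEnd {

    /** Returns the index of the first "</" followed by an ASCII letter, or -1 if there is none. */
    static int firstEtago(String script) {
        for (int i = 0; i + 2 < script.length(); i++) {
            if (script.charAt(i) == '<'
                    && script.charAt(i + 1) == '/'
                    && isAsciiLetter(script.charAt(i + 2))) {
                return i;
            }
        }
        return -1;
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    public static void main(String[] args) {
        // The script line from the original post.
        String original = "fred = \"<img src='a.gif'></img>\"";
        // Same line with the slash escaped, so "</" never appears literally.
        String escaped = "fred = \"<img src='a.gif'><\\/img>\"";

        System.out.println(firstEtago(original)); // offset of "</img>"
        System.out.println(firstEtago(escaped));  // no terminator found
    }
}
```

On the sample line the match is at offset 25, which is where "</img>" begins and where the strict parse cut the script text short. With "<\/img>" nothing matches, so the script text would run on to the real </script>.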
|
From: sebb <se...@gm...> - 2007-01-10 18:13:01
|
FYI: I've now tested the parser using ScriptScanner.STRICT=false and that solved the "problem". Thanks again.
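As a rough picture of why turning STRICT off helps: as I understand the explanation in ScriptScanner.java, with STRICT set to false a "</" inside balanced quotes is shielded and does not end the script. The sketch below is a simplified stand-in under that assumption, not the library's code. LenientScriptEnd and firstUnquotedEtago are hypothetical names, and comments inside the script are not handled.

```java
// Illustrative sketch only; a simplified model of quote-aware scanning.
final class LenientScriptEnd {

    /** Returns the index of the first "</" + letter that lies outside a quoted string, or -1. */
    static int firstUnquotedEtago(String script) {
        char quote = 0; // 0 means we are not inside a string literal
        for (int i = 0; i < script.length(); i++) {
            char c = script.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++;            // skip the escaped character
                } else if (c == quote) {
                    quote = 0;      // closing quote of the current literal
                }
            } else if (c == '"' || c == '\'') {
                quote = c;          // opening quote
            } else if (c == '<'
                    && i + 2 < script.length()
                    && script.charAt(i + 1) == '/'
                    && isAsciiLetter(script.charAt(i + 2))) {
                return i;
            }
        }
        return -1;
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    public static void main(String[] args) {
        // Script content from the original post, up to and including the end tag.
        String body = "\n  fred = \"<img src='a.gif'></img>\"\n</script>";
        System.out.println(firstUnquotedEtago(body)); // offset of "</script>"
    }
}
```

Here the quoted "</img>" is skipped and the first match is the real "</script>" at offset 36, so the whole assignment stays in one text node.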