Thread: [Htmlparser-user] Parsing quoted strings seems to be broken in 1.6
From: sebb <se...@gm...> - 2007-01-08 00:40:53
The sample script:

<HTML>
  <body>
  <script>
  fred = "<img src='a.gif'></img>"
  </script>
  </body>
</HTML>

generates the following output from parser.cmd:

Tag (0[0,0],6[0,6]): HTML
  Txt (6[0,6],10[1,2]): \n
  Tag (10[1,2],16[1,8]): body
  Txt (16[1,8],20[2,2]): \n
  Tag (20[2,2],28[2,10]): script
  Txt (28[2,10],57[3,27]): \n fred = "<img src='a.gif'>
  End (57[3,27],57[3,27]): /script
  End (57[3,27],63[3,33]): /img
  Txt (63[3,33],68[4,2]): "\n
  End (68[4,2],77[4,11]): /script
  Txt (77[4,11],81[5,2]): \n
  End (81[5,2],88[5,9]): /body
  Txt (88[5,9],90[6,0]): \n
  End (90[6,0],97[6,7]): /HTML
Txt (97[6,7],101[8,0]): \n\n

It looks like the closing tag is being recognised - though the opening
tag is not.

Is this a bug, or have I misunderstood something?

From: Derrick O. <Der...@Ro...> - 2007-01-08 01:53:29
For parsing bad script like this you probably want to set the static
boolean value org.htmlparser.scanners.ScriptScanner.STRICT to false. See
the explanation in the ScriptScanner.java file.

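The difference the flag makes can be modelled in a few lines of plain Java. This is only a sketch of the two rules discussed in this thread, not the ScriptScanner code: the class and method names are hypothetical, the sample string assumes CRLF line endings and two-space indents (which is what the offsets in the parser output suggest), and the quote-aware rule is an assumption about roughly what the non-strict mode does.

```java
// Hypothetical model of the two ways a <script> body can be terminated.
final class ScriptEndSketch {

    // Strict rule: the first "</" followed by a letter ends the script,
    // whether or not it sits inside a quoted string.
    static int strictEnd(String html, int from) {
        for (int i = from; i + 2 < html.length(); i++) {
            if (html.charAt(i) == '<' && html.charAt(i + 1) == '/'
                    && Character.isLetter(html.charAt(i + 2))) {
                return i;
            }
        }
        return -1;
    }

    // Quote-aware rule (assumed): skip over quoted strings and stop only
    // at a "</script" found outside of them.
    static int quoteAwareEnd(String html, int from) {
        char quote = 0; // 0 means "not inside a string"
        for (int i = from; i < html.length(); i++) {
            char c = html.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++;            // skip the escaped character
                } else if (c == quote) {
                    quote = 0;      // closing quote
                }
            } else if (c == '"' || c == '\'') {
                quote = c;          // opening quote
            } else if (html.regionMatches(true, i, "</script", 0, 8)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<HTML>\r\n  <body>\r\n  <script>\r\n"
                + "  fred = \"<img src='a.gif'></img>\"\r\n"
                + "  </script>\r\n  </body>\r\n</HTML>\r\n\r\n";
        int start = html.indexOf("<script>") + "<script>".length();
        System.out.println(strictEnd(html, start));     // prints 57
        System.out.println(quoteAwareEnd(html, start)); // prints 68
    }
}
```

The strict result, 57, is the offset at which the parser output above shows the script text ending and the stray /img end tag beginning; the quote-aware result, 68, is the offset of the real /script end tag.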
From: sebb <se...@gm...> - 2007-01-08 11:21:09
Thanks for the quick reply. I'll give it a try.

However, I'm not sure why the script example is bad.

It is not enclosed in "<!--" and "// -->", but AIUI those are only
needed as a work-round for older browsers that did not understand the
<script> tag.

From: sebb <se...@gm...> - 2007-01-08 14:58:28
Sorry, my bad - I've now read the document referenced in the scanner
source, and I see that "</" acts as the terminator unless suitably
hidden.

S.

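One common way of hiding the "</" is to write it as "<\/" inside the JavaScript string; HTML 4.01 (appendix B.3.2) suggests this, and JavaScript reads "\/" in a string literal as a plain "/". The sketch below shows the effect on the sample line. The class and helper names are hypothetical, and the blanket replacement is only safe on the contents of string literals, not on arbitrary script text.

```java
// Hypothetical helper showing how escaping "</" hides it from a strict scan.
final class HideEtagoSketch {

    // Index of the first "</" followed by a letter, or -1 if there is none.
    static int firstEtago(String script) {
        for (int i = 0; i + 2 < script.length(); i++) {
            if (script.charAt(i) == '<' && script.charAt(i + 1) == '/'
                    && Character.isLetter(script.charAt(i + 2))) {
                return i;
            }
        }
        return -1;
    }

    // Rewrites "</" as "<\/"; intended for string-literal contents only.
    static String hideInStringLiteral(String script) {
        return script.replace("</", "<\\/");
    }

    public static void main(String[] args) {
        String script = "fred = \"<img src='a.gif'></img>\"";
        System.out.println(firstEtago(script));   // prints 25
        String hidden = hideInStringLiteral(script);
        System.out.println(hidden);               // prints fred = "<img src='a.gif'><\/img>"
        System.out.println(firstEtago(hidden));   // prints -1
    }
}
```

With the escaped form, a strict scan finds nothing to stop at inside the string, so the script body would run on to the real end tag.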
From: sebb <se...@gm...> - 2007-01-10 18:13:01
FYI: I've now tested the parser using ScriptScanner.STRICT=false and
that solved the "problem".

Thanks again.