Thread: [Htmlparser-developer] character encoding (for Derrick)
From: Somik R. <so...@ya...> - 2003-03-30 06:25:29
|
Hi Derrick,

Continuing our earlier discussion, I've had an idea: instead of re-establishing an input stream, suppose we assume that the parser can be initialized with a character set, and we use that. We could have both strategies in there.

By the way, quite a few steps are failing. I'm guessing that you're actively working on those. Let me know if there are any issues if I make an integration release this week (in case you don't finish).

Regards,
Somik |
From: Derrick O. <Der...@ro...> - 2003-03-30 21:19:53
|
Somik,

The capability to setEncoding() is already there, so API users who know the character set can set it before parsing (caveat: there is no test case for this, so it may not work). This can happen before or after the connection is opened, but in the latter case will cause an input stream reset. In the former case the setting will be overwritten by the incoming HTTP and HTML header values if they are there and differ from what's set.

One possible enhancement would be to not allow the headers to override the character set if it's been set via the API, which assumes the user knows what they are doing. It's most likely that users don't know the character set though, either from the HTTP or HTML header, so automatic handling by default is best.

I noted that four tests are failing after last week's integration:

testScriptCodeExtraction
testScriptCodeExtractionWithMultipleQuotes
testStringBeanListener
testThreadSafety

I'm not sure about the others, but the bean listener test is failing because the handling of tables has changed. The test is to show that extracted text contains the link URLs when the links property is set to true. It would now have to dig into the table tags to find the link. I'm looking at collectInto() but can't see how to collect string and link tags so the links can be inserted in context into the text (where they are found). I'm also wrestling with the issue of handling <pre></pre>, since collectInto() doesn't seem to be able to give that kind of information. I guess collectInto() is too blunt a tool.

Derrick

Somik Raha wrote:

> Hi Derrick,
> Continuing our earlier discussion, I've had an idea: instead of
> re-establishing an input stream, suppose we assume that the parser can
> be initialized with a character set, and we use that.
> We could have both strategies in there.
>
> By the way, quite a few steps are failing. I'm guessing that you're
> actively working on those. Let me know if there are any issues if I make
> an integration release this week (in case you don't finish).
>
> Regards,
> Somik |
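The precedence Derrick describes (an API-supplied character set that the HTTP and HTML headers may override, plus the proposed option to let the API value win) can be illustrated with a minimal Java sketch. Everything here is an assumption made for illustration: `CharsetChooser`, `choose()`, and the `apiWins` flag are hypothetical names and not part of the htmlparser API, and the ordering between the two headers (the HTML META value is preferred over the HTTP header) is a guess, since the thread does not say which of the two takes priority.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of the htmlparser API. */
final class CharsetChooser {
    // Matches the charset parameter of a Content-Type value,
    // e.g. "text/html; charset=UTF-8".
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or null if there is none. */
    static String charsetFrom(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /**
     * Picks the character set to decode with.
     *
     * @param apiCharset      value set by the caller, or null if none was set
     * @param apiWins         the proposed enhancement: when true, headers cannot override
     * @param httpContentType HTTP Content-Type header value, or null
     * @param metaContentType content of an HTML META http-equiv tag, or null
     * @param fallback        default used when nothing else is known
     */
    static String choose(String apiCharset, boolean apiWins,
                         String httpContentType, String metaContentType,
                         String fallback) {
        // Proposed enhancement: a charset set through the API is final.
        if (apiCharset != null && apiWins) {
            return apiCharset;
        }
        // Behaviour described in the thread: header values override the API setting.
        // Assumption: the META value is preferred over the HTTP header.
        String fromMeta = charsetFrom(metaContentType);
        if (fromMeta != null) {
            return fromMeta;
        }
        String fromHttp = charsetFrom(httpContentType);
        if (fromHttp != null) {
            return fromHttp;
        }
        return apiCharset != null ? apiCharset : fallback;
    }

    public static void main(String[] args) {
        String http = "text/html; charset=ISO-8859-1";
        System.out.println(choose("UTF-8", false, http, null, "ISO-8859-1")); // ISO-8859-1
        System.out.println(choose("UTF-8", true, http, null, "ISO-8859-1"));  // UTF-8
    }
}
```

With `apiWins` set to false the header value replaces the caller's choice, which matches the "automatic handling by default" position; setting it to true models the "user knows what they are doing" enhancement.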
From: Somik R. <so...@ya...> - 2003-03-30 23:25:06
|
> It's most likely that users
> don't know the character set though, either from the HTTP or HTML
> header, so automatic handling by default is best.

I agree.

> I noted that four tests are failing after last week's integration:
> testStringBeanListener

This one has been failing for a while.

> I'm not sure about the others, but the bean listener test is failing
> because the handling of tables has changed. The test is to show that
> extracted text contains the link URLs when the links property is set to
> true. It would now have to dig into the table tags to find the link.
> I'm looking at collectInto() but can't see how to collect string and
> link tags so the links can be inserted in context into the text (where
> they are found). I'm also wrestling with the issue of handling
> <pre></pre>, since collectInto() doesn't seem to be able to give that
> kind of information. I guess collectInto() is too blunt a tool.

If you're trying to collect strings AND links and keep them in context, your best bet is to write your own visitor.

> testThreadSafety

Thanks for reporting this. On my end this one's passing. I had left one last variable in TagParser, and I thought it would affect thread safety. So I rigged up that test, but surprisingly it passed every time on my end. Can you send me the failure message? I might need to rework TagParser again.

> testScriptCodeExtraction
> testScriptCodeExtractionWithMultipleQuotes

You can ignore these two. They actually demonstrate a bug which I have no clue about, and I think there's little we can do about it. From my earlier integration release mail (last week):

Thanks are also due to Huang-Chun Yu for uncovering a serious bug with the script scanning mechanism. The parser can currently handle script tags like:

    <script>
    <!--
    code here
    -->
    </script>

But when the tags are like:

    <script>
    code here
    </script>

the parser is unable to identify the code and treats it like regular tags. Such pages are quite widespread and ought to be supported. I was curious if anyone has ideas on solving this, given the existing design. Fresh ideas often lead to a better perspective.

Regards,
Somik |
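One possible direction for the script problem, offered only as a sketch and not as the parser's design, is to treat everything between the opening script tag and the next closing script tag as raw text, and to strip an HTML comment wrapper only if one happens to be present. The class `ScriptBodyExtractor` and its methods are hypothetical names for illustration. The sketch deliberately ignores harder cases, such as a `>` inside an attribute value of the script tag or the string `</script` appearing inside the code itself.

```java
/** Hypothetical illustration; not the htmlparser script scanner. */
final class ScriptBodyExtractor {

    /** Case-insensitive indexOf that keeps indices valid for the original string. */
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Returns the code inside the first script element, or null if no complete
     * script element is found. The body is taken verbatim, so characters such
     * as '<' inside the code are never interpreted as tags.
     */
    static String extract(String html) {
        int open = indexOfIgnoreCase(html, "<script", 0);
        if (open < 0) {
            return null;
        }
        int tagEnd = html.indexOf('>', open);
        if (tagEnd < 0) {
            return null;
        }
        int bodyStart = tagEnd + 1;
        int close = indexOfIgnoreCase(html, "</script", bodyStart);
        if (close < 0) {
            return null;
        }
        String body = html.substring(bodyStart, close).trim();
        // The comment wrapper is optional: remove it when present, otherwise
        // keep the body exactly as found.
        if (body.startsWith("<!--")) {
            body = body.substring("<!--".length());
        }
        if (body.endsWith("-->")) {
            body = body.substring(0, body.length() - "-->".length());
        }
        return body.trim();
    }

    public static void main(String[] args) {
        System.out.println(extract("<script> <!-- code here --> </script>")); // code here
        System.out.println(extract("<script> code here </script>"));          // code here
    }
}
```

Both forms from the release mail yield the same body here, because the decision to stop scanning depends only on the closing tag and not on whether the code is wrapped in a comment.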
From: Derrick O. <Der...@ro...> - 2003-03-31 03:24:06
|
Somik,

Running testThreadSafety() can give different errors or no errors at all depending on the run (typical of race conditions). The errors might only crop up on SMP systems (I've got a dual CPU).

When it fails, the actual link appears empty even though the link text is not (see the debug outputs below):

    id = 5
    link = ""
    linkText = " Server: sf-web2 \t SourceForge.net: Modify: 711073 - HTMLTagParser not threadsafe as a static variable in HTMLTag\t\t\tfunction help_window(helpurl) {\t\tHelpWin = window.open( \'http://sourceforge.net\' + helpurl,\'HelpWindow\',\'scrollbars=yes,resizable=yes,toolbar=no,height=400,width=400\');\t}\t// \t\t\t This is temp javascript for the jump button. If we could actually have a jump script on the server side that would be ideal \tfunction jump(targ,selObj,restore){ //v3.0\tif (selObj.options[selObj.selectedIndex].value) \t\teval(targ+\".location=\'\"+selObj.options[selObj.selectedIndex].value+\"\'\");\tif (restore) selObj.selectedIndex=0;\t}\t//"
    result = false

Here's some example JUnit dumps:

********************************************************************************************************

junit.framework.AssertionFailedError: Thread 55, link 1:

EXPECTED result has 35 extra characters at the end. They are :
Position : 0 , Code = 104
Position : 1 , Code = 116
Position : 2 , Code = 116
Position : 3 , Code = 112
<snip>
Position : 33 , Code = 109
Position : 34 , Code = 108
Mismatch of strings at char posn 0

String Expected upto mismatch =

String Actual upto mismatch =

String Expected MISMATCH CHARACTER = h, code = 104

**** COMPLETE STRING EXPECTED ****
http://normallink.com/sometext.html

**** COMPLETE STRING ACTUAL***

    at org.htmlparser.tests.ParserTestCase.assertStringEquals(ParserTestCase.java:125)
    at org.htmlparser.tests.parserHelperTests.TagParserTest.testThreadSafety(TagParserTest.java:186)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

********************************************************************************************************

junit.framework.AssertionFailedError: Thread 26, link 2:

EXPECTED result has 116 extra characters at the end. They are :
Position : 0 , Code = 47
Position : 1 , Code = 99
Position : 2 , Code = 103
Position : 3 , Code = 105
<snip>
Position : 114 , Code = 109
Position : 115 , Code = 108
Mismatch of strings at char posn 0

String Expected upto mismatch =

String Actual upto mismatch =

String Expected MISMATCH CHARACTER = /, code = 47

**** COMPLETE STRING EXPECTED ****

/cgi-bin/view_search?query_text=postdate>20020701&txt_clr=White&bg_clr=Red&url=http://localhost/Testing/Report1.html

**** COMPLETE STRING ACTUAL***

    at org.htmlparser.tests.ParserTestCase.assertStringEquals(ParserTestCase.java:125)
    at org.htmlparser.tests.parserHelperTests.TagParserTest.testThreadSafety(TagParserTest.java:191)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

Hope this helps,

Derrick

Somik Raha wrote:

> > testThreadSafety
>
> Thanks for reporting this - on my end this one's passing. I had left
> one last variable in TagParser- and I thought it would affect Thread
> safety. So I rigged up that test, but surprisingly it passed every
> time on my end. Can you send me the failure message ? I might need to
> rework TagParser again.
>
> |
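The symptom above (an empty link on some runs, mostly on multi-CPU machines) is what one would expect when parsing state is shared between threads. The thread does not show the TagParser code, so the following Java sketch is only an assumption about the general pattern: `HrefExtractor` is a hypothetical stand-in, not the real TagParser, and it shows the usual remedy of keeping all scratch state in local variables so that no thread can observe another thread's half-finished work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a tag parser; not the real TagParser. */
final class HrefExtractor {
    // Hazard being avoided: a static, mutable scratch field such as
    //     private static StringBuilder buffer;
    // would be shared by every thread, so one thread could reset it while
    // another is still reading it, producing an empty or mixed-up link.

    /** Extracts the href value using only local variables (thread-confined state). */
    static String href(String tag) {
        String marker = "href=\"";
        int start = tag.indexOf(marker);
        if (start < 0) {
            return "";
        }
        start += marker.length();
        int end = tag.indexOf('"', start);
        return end < 0 ? "" : tag.substring(start, end);
    }

    /** Runs href() from several threads at once and reports whether every result matched. */
    static boolean allThreadsAgree(int threads, int iterations) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                // Each thread parses its own tag and knows its own expected link.
                final String expected = "http://example.com/page" + t + ".html";
                final String tag = "<a href=\"" + expected + "\">";
                Callable<Boolean> task = () -> {
                    for (int i = 0; i < iterations; i++) {
                        if (!expected.equals(href(tag))) {
                            return false;
                        }
                    }
                    return true;
                };
                results.add(pool.submit(task));
            }
            for (Future<Boolean> result : results) {
                if (!result.get()) {
                    return false;
                }
            }
            return true;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(allThreadsAgree(8, 10_000));
    }
}
```

A passing run of a test like this does not prove thread safety, as the differing results on the two machines in this thread show; the stronger guarantee comes from the structure of `href()`, which has no shared mutable state to race on.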
From: Somik R. <so...@ya...> - 2003-03-31 04:20:30
|
Hi Derrick,

Thanks a lot.. I have removed the boolean, and made it totally thread-safe now (I think). Could you try again?

Regards,
Somik

----- Original Message -----
From: "Derrick Oswald" <Der...@ro...>
To: <htm...@li...>
Sent: Sunday, March 30, 2003 7:31 PM
Subject: [Htmlparser-developer] testThreadSafety error |
From: Derrick O. <Der...@ro...> - 2003-03-31 08:19:26
|
Somik,

Sorry, same error as before. Now testJspWithinAttributes is also complaining (in two places).

Derrick

Somik Raha wrote:

> Hi Derrick,
> Thanks a lot.. I have removed the boolean, and made it totally
> thread-safe now (I think).
> Could you try again?
>
> Regards,
> Somik
>
> |