Thread: [Htmlparser-developer] Charset tests failing
From: Somik R. <so...@ya...> - 2002-12-27 07:17:56
Hi Derrick,

I was working on the latest parser and just found that the charset tests are failing (I didn't modify anything in HTMLParser). Could you check what might have gone wrong? Also, I was wondering whether it might be better to host the charset pages on our own domain (http://htmlparser.sourceforge.net): we could put up encoded pages and they'd always be there.

Regards,
Somik
From: Derrick O. <Der...@ro...> - 2002-12-29 01:14:58
Somik,

Sorry, just got back from the pilgrimage to Bethlehem (Toronto).

All 268 tests are running OK against the version I took just now. Perhaps www.ibm.co.jp or www.sony.co.jp weren't available when you tried it. Try again, please.

The tests can't all be in our repository. Specifically, testHTTPCharset relies on the HTTP server setting the charset parameter of an HTTP header to something other than ISO-8859-1, which the SourceForge people are unlikely to set up for us. We could put testHTMLCharset under the SourceForge domain, though. Will get to it as time permits.

Derrick
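For readers following along: the charset parameter that testHTTPCharset depends on normally travels in the Content-Type response header, for example "text/html; charset=Shift_JIS". The sketch below shows one way such a value could be pulled out, falling back to ISO-8859-1 when the server sends no charset. The class and method names are hypothetical and are not taken from HTMLParser; it is a minimal illustration, not the project's implementation.

```java
import java.util.Locale;

// Hypothetical helper (not part of HTMLParser): extracts the charset
// parameter from an HTTP Content-Type header value.
final class ContentTypeCharset {

    // Fallback used when the header carries no charset parameter.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    static String charsetOf(String contentType) {
        if (contentType == null) {
            return DEFAULT_CHARSET;
        }
        // Parameters follow the media type, separated by semicolons.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? DEFAULT_CHARSET : value;
            }
        }
        return DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=Shift_JIS")); // Shift_JIS
        System.out.println(charsetOf("text/html"));                    // ISO-8859-1
    }
}
```

A server that omits the parameter yields the default, which is why a page hosted on a server we cannot configure would not exercise the HTTP-header path.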
From: Somik R. <so...@ya...> - 2002-12-29 01:21:20
Hi Derrick,

Are you sure? I have the latest version here too, but the two tests are consistently failing. Here's what I see:

Charset should be Shift_JIS but was ISO-8859-1

for both failures.

Regards,
Somik
From: Derrick O. <Der...@ro...> - 2002-12-29 15:50:41
Somik,

Finally reproduced it by backing down my JVM to 1.2. Code fix and test modifications dropped.

Sorry about that, Agent 99.

Derrick
From: Somik R. <so...@ya...> - 2002-12-30 05:19:57
Hi Folks,

Derrick Oswald just checked in a test case that fails. Here's a link tag:

<a href="http://cbc.ca/artsCanada/stories/greatnorth271202" class="lgblacku">Vancouver schools plan 'Great Northern Way'</a>

We are in a quandary now. When we have cases like:

<a href="something.html">Kaarle's Page</a>

we should accept the apostrophe without doing anything special.

When we get a script like:

<script>
 var code = '<sometag>';
</script>

we should not take the tag symbols after code seriously, as they are part of the string.

Handling the last two cases conflicts with the first case: the last case is handled by checking whether there's a < after ', and this causes the first case to go into an ignoring mode.

How do we handle this problem? Do we write smart code to handle this particular situation? From human experience, even if we've not encountered these cases, we know how to differentiate between a string node and a tag. Can AI help us here? Also, please feel free to suggest any straightforward solutions as well.

Regards,
Somik
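To make the conflict concrete, here is a small sketch of the kind of check described above. It assumes the heuristic amounts to "an apostrophe directly followed by < starts a quoted string to be ignored"; that is a simplified reading of the message, not the actual HTMLParser code, and the class and method names are made up for illustration.

```java
// Illustrative sketch only: a simplified stand-in for the check described
// above, not the actual HTMLParser code.
final class ApostropheHeuristicDemo {

    // Assumed heuristic: an apostrophe directly followed by '<' is taken as
    // the start of a quoted string whose contents should be ignored.
    static boolean entersIgnoreMode(String text) {
        return text.contains("'<");
    }

    public static void main(String[] args) {
        String quotedTitle = "Vancouver schools plan 'Great Northern Way'</a>";
        String possessive = "Kaarle's Page</a>";
        String script = " var code = '<sometag>'; </script>";

        // The closing quote of the title is followed by </a>, so the
        // heuristic fires even though this is ordinary link text.
        System.out.println(entersIgnoreMode(quotedTitle)); // true (unwanted)
        System.out.println(entersIgnoreMode(possessive));  // false
        System.out.println(entersIgnoreMode(script));      // true (intended)
    }
}
```

The same two characters appear in both the link text and the script, so a check that looks only at the characters cannot tell the cases apart.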
From: Sam J. <ga...@yh...> - 2002-12-30 05:54:14
Hi Somik,

I would have thought the solution to this would be to define under which circumstances a pair of apostrophes will indicate text. In the first two examples you are inside an ANCHOR tag, and in the third you are inside a SCRIPT tag. It seems to me that you should only be using apostrophes to indicate text strings inside a SCRIPT tag, no? If that's true, then set your parsing behaviour differently depending on the tag type.

CHEERS> SAM
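A rough sketch of what tag-dependent behaviour could look like: the caller says whether the text sits inside a SCRIPT tag, and apostrophes are only treated as string delimiters in that case. The names here are hypothetical, and the sketch deliberately ignores escaped quotes, double-quoted strings and script comments.

```java
// Hypothetical sketch of context-dependent text scanning; not HTMLParser code.
final class ContextAwareText {

    /**
     * Returns the index of the first '<' that should end the text run,
     * or -1 if there is none. Apostrophes delimit strings only when the
     * text is inside a SCRIPT tag.
     */
    static int endOfText(String s, boolean insideScript) {
        boolean inQuote = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (insideScript && c == '\'') {
                inQuote = !inQuote;          // toggle on each apostrophe
            } else if (c == '<' && !inQuote) {
                return i;                    // a real tag starts here
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String link = "Vancouver schools plan 'Great Northern Way'</a>";
        String script = " var code = '<sometag>'; </script>";

        // Anchor text: apostrophes are plain characters, so </a> ends the text.
        System.out.println(endOfText(link, false) == link.indexOf("</a>"));
        // Script text: the quoted <sometag> is skipped, </script> ends the text.
        System.out.println(endOfText(script, true) == script.indexOf("</script>"));
    }
}
```

Note that applying the script rule to "Kaarle's Page</a>" would leave the quote open and never find the closing tag, which is why the mode has to follow the enclosing tag.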
From: Somik R. <so...@ya...> - 2002-12-30 06:14:51
Sam Joseph wrote:
> I would have thought the solution to this would be to define under which
> circumstances a pair of apostrophes will indicate text. In the first
> two examples you are inside an ANCHOR tag, and in the third you are
> inside a SCRIPT tag. It seems to me that you should only be using
> apostrophes to indicate text strings inside a SCRIPT tag, no? If
> that's true then set your parsing behaviour differently depending on the
> tag type.

You are right. The script scanner could set some static variable in the HTMLStringNode, which tells it to move into the ignore states if it encounters an apostrophe, and flag it off when it's done. I'll probably do that for now.

Regards,
Somik
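The flag idea might be sketched roughly as follows. StringNodeSketch and ScriptScannerSketch are stand-ins invented for this example and do not reflect the real HTMLStringNode or script scanner classes; the point is only the lifecycle of the flag, which is switched on for the script body and switched off again in a finally block so that later link text is not affected.

```java
// Hypothetical stand-in for the script scanner; not HTMLParser code.
final class ScriptScannerSketch {

    // Turns the quote-aware mode on for the script body only, and always
    // turns it off again, even if scanning throws.
    static int scanScriptBody(String body) {
        StringNodeSketch.setQuoteAware(true);
        try {
            return StringNodeSketch.endOfText(body);
        } finally {
            StringNodeSketch.setQuoteAware(false);
        }
    }

    public static void main(String[] args) {
        String body = " var code = '<sometag>'; </script>";
        String link = "Vancouver schools plan 'Great Northern Way'</a>";
        System.out.println(scanScriptBody(body) == body.indexOf("</script>"));
        System.out.println(StringNodeSketch.endOfText(link) == link.indexOf("</a>"));
    }
}

// Hypothetical stand-in for the string node, holding the static flag.
final class StringNodeSketch {

    // Static state shared by all callers: simple, but not thread-safe.
    private static boolean quoteAware = false;

    static void setQuoteAware(boolean on) {
        quoteAware = on;
    }

    // Index of the first '<' that ends the text, or -1 if none is found.
    static int endOfText(String s) {
        boolean inQuote = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (quoteAware && c == '\'') {
                inQuote = !inQuote;
            } else if (c == '<' && !inQuote) {
                return i;
            }
        }
        return -1;
    }
}
```

A static flag is the smallest change, but it is shared state: two parsers running in different threads would interfere with each other, so passing the mode as a parameter may be worth considering later.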