From: Andy C. <an...@cy...> - 2007-12-09 22:16:15
|
> I'm making the appropriate changes now. And I'll add
> the appropriate documentation and test cases to make
> sure that it works as expected.

I've made the following modifications:

* Renamed the feature id

  I shortened it from .../normalize-attributes to just
  .../normalize-attrs. This keeps the naming in sync with other
  settings like .../names/attrs, etc.

* Set the default value for the feature to FALSE

* Re-worked the attribute normalization code

  It will now continue to normalize newlines and entity refs but will
  automatically trim and condense consecutive spaces when normalization
  is turned on. The raw, non-normalized value is still available via
  the XNI augmentations.

* Added documentation to the settings.html page

* Renamed and expanded the test case

  I renamed the test case to test-attr-normalize-none.html since it's
  testing all normalization, not just newline normalization for the
  code example. I also added some attributes in the test case to cover
  all the other normalization possibilities: leading/trailing spaces,
  tabs, and newlines. Then I used the same source content for a test
  case with the normalization turned on. So if you look at the
  build/data/output/test-attr-normalize*.html output files, you'll see
  the normalization in action and how it compares when the feature is
  turned on or off.

--
Andy Clark - an...@cy...
From: <an...@cy...> - 2007-12-09 20:21:36
|
----- Andy Clark <an...@cy...> wrote:
> The XML spec mandates that attribute values be
> normalized for attributes whose type is other than
> CDATA. For example: NMTOKEN, NMTOKENS, etc. This
> normalization trims whitespace from the ends and
> converts all other series of whitespace (including
> newlines) into a single space character.

I guess I've been out of the XML parsing world for a bit too long... I
went back and re-read the spec regarding attribute normalization [1].
The spec says that whitespace normalization is done on ALL attributes,
not just the ones whose type is other than CDATA.

So I looked at the changes that Marc put in for this new feature in
more detail, and I think that it does not have the intended effect.
This feature, as implemented, promotes the non-normalized value to be
the default value as accessed by the various XML APIs. That means
applications will still need to process entities in their own code,
because an element like this:

  <a title='M &amp; M'>

will communicate the attribute value to the application as "M &amp; M",
which is not as useful as just "M & M". This puts a lot of effort on
application writers who just want to process onclick and other
attributes that can contain JavaScript code correctly. In other words,
if the JS has newlines in it, the code cannot be processed correctly
unless the author explicitly uses semi-colons to terminate all
statements.

So I think the real solution is to NOT do whitespace normalization by
default (except for the newline normalization). Then, if the user
turns on this feature, it will obey the full normalization rules as
specified in section 3.3.3 of the XML spec.

I'm making the appropriate changes now. And I'll add the appropriate
documentation and test cases to make sure that it works as expected.
Let me know if anyone disagrees with my assessment.

[1] http://www.w3.org/TR/xml/#AVNormalize

--
Andy Clark - an...@cy...
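[Editor's note] The two-stage normalization discussed above (newline/tab replacement for all attributes; trimming and space-collapsing additionally for non-CDATA types, per XML spec section 3.3.3) can be sketched as follows. The class and method names are illustrative only, not NekoHTML's actual API:

```java
// Illustrative sketch of XML attribute-value normalization (section
// 3.3.3). Not NekoHTML's actual code.
public class AttrNormalizer {

    // Stage 1 (applies to ALL attributes): replace tab, CR, and LF
    // characters with a single space each.
    public static String normalizeCdata(String value) {
        return value.replace('\t', ' ').replace('\r', ' ').replace('\n', ' ');
    }

    // Stage 2 (non-CDATA types such as NMTOKEN, NMTOKENS): additionally
    // trim leading/trailing whitespace and collapse runs of spaces.
    public static String normalizeNonCdata(String value) {
        return normalizeCdata(value).trim().replaceAll(" +", " ");
    }

    public static void main(String[] args) {
        String raw = "  one\ntwo\t three  ";
        System.out.println('[' + normalizeCdata(raw) + ']');
        System.out.println('[' + normalizeNonCdata(raw) + ']'); // [one two three]
    }
}
```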
From: Andy C. <an...@cy...> - 2007-12-07 22:46:13
|
----- Marc Guillemot <mgu...@ya...> wrote:
>> You forgot to remove the import statement for
>> java.nio.charset.Charset from HTMLScanner.java. Without that
>> removed, the source cannot be compiled using JDK 1.3. That said,
>> when I compiled with JDK 1.6 and ran under JDK 1.3, it ran fine. I
>> don't understand why it didn't blow up with a
>> NoClassDefFoundException or something like that?
>
> Classes are first loaded when they are used. An unused import is not
> good coding style but it doesn't matter at all to the JVM.

We should keep it clean and remove that. And, just to be clear, I
think the goal is to allow people with a 1.3 run-time to use NekoHTML;
it's a lesser concern whether those people can actually compile the
code themselves.

--
Andy Clark - an...@cy...
From: Andy C. <an...@cy...> - 2007-12-07 22:43:45
|
----- Jacob Kjome <ho...@vi...> wrote:
> I've got a question about this. Is there a specification that says
> that attributes should be normalized as NekoHTML is currently doing
> (if not the HTML spec, how about the XML spec)? If so, we should
> comply with the spec and perform normalization. If not, then I guess
> it makes sense not to do it by default.

The XML spec mandates that attribute values be normalized for
attributes whose type is other than CDATA. For example: NMTOKEN,
NMTOKENS, etc. This normalization trims whitespace from the ends and
converts all other series of whitespace (including newlines) into a
single space character.

However, since we're parsing HTML, I think we should do as much as
possible to maintain the doc's intended use as HTML. That means we
shouldn't break content that would normally work in a browser just
because we're making the data available via XML programming
interfaces. (It's also for this reason that we don't bother purifying
the input by default -- there is a separate filter to do that.)

So that's why my vote is to change the default to false.

--
Andy Clark - an...@cy...
From: Marc G. <mgu...@ya...> - 2007-12-06 07:43:48
|
Jacob Kjome wrote:
> ...
> You forgot to remove the import statement for
> java.nio.charset.Charset from HTMLScanner.java. Without that
> removed, the source cannot be compiled using JDK 1.3. That said,
> when I compiled with JDK 1.6 and ran under JDK 1.3, it ran fine. I
> don't understand why it didn't blow up with a
> NoClassDefFoundException or something like that?

Classes are first loaded when they are used. An unused import is not
good coding style but it doesn't matter at all to the JVM.

Cheers,
Marc.
--
Blog: http://mguillem.wordpress.com
From: Marc G. <mgu...@ya...> - 2007-12-06 07:40:29
|
Andy Clark wrote:
> ----- Marc Guillemot <mgu...@ya...> wrote:
>> NekoHTML normalizes tag attributes (basically it changes \r\n\t to
>> space), which is incorrect as browsers don't do it. This leads to
>> wrong behaviors, for instance when a page has something like:
>
> Great catch! We should probably add a feature so that
> attribute normalization can be turned on.

Good idea. I've added the feature
http://cyberneko.org/html/features/normalize-attributes with a default
of true to avoid changing the previous default behavior (feel free to
change the feature name if you're not happy with this one).

>> Fixing special handling in HTMLScanner.ContentScanner.scanAttribute
>> fixes the problem but makes 2 tests fail:
>> - test061.html
>> - test081.html
>> For me both tests are incorrect as new lines should not be removed.
>
> After the change, we can just add a test061.html.settings
> file that enables attribute normalization. If the
> feature is done correctly, then the original test
> will continue to pass.

I've preferred to default to the previous behavior, therefore I
haven't changed these tests. But perhaps it makes more sense to have a
default behavior like the one of "real browsers"?

Btw: I've preferred to use something meaningful rather than a number
for the new test data (testAttributeMultiline). I hope that this is ok
for you.

Cheers,
Marc.
--
Blog: http://mguillem.wordpress.com
From: Marc G. <mgu...@ya...> - 2007-12-06 07:39:47
|
Hi Andy,

I've explained the 2 points just after my commit, but I now see that I
sent it directly to you rather than to the mailing list. I'm reposting
it.

Cheers,
Marc.
--
Blog: http://mguillem.wordpress.com

Andy Clark wrote:
> I was just looking at revision 108 that added the attribute
> normalization feature and I had a few comments.
>
> The first was the naming of the new feature. For consistency
> with existing feature identifiers, I would have called it:
>
>   http://cyberneko.org/html/features/scanner/normalize-attributes
>
> Also, I think that the default should be false for the
> reason that we don't want to produce unusable values for
> the various on* attributes. And we need to add documentation
> to the settings.html page.
>
> Second, I noticed that the files added to test this feature
> were named testAttributeMultiline.html. At first, I thought
> that it would be better if they were named in numerical
> sequence like the other test files. But, after thinking
> about it for a while, I decided that this is actually a
> good thing.
>
> In fact, I think we should rename all of the other test
> files so that you can tell what they are testing by the
> filename. This is work for a future release, but I would
> like to decide on a common naming system so that they
> are grouped together by category when the directory list
> is sorted. For example: text-*, element-*, attribute-*,
> doctype-*, namespaces-*, etc...
From: Jacob K. <ho...@vi...> - 2007-12-06 07:37:09
|
Jacob Kjome wrote:
> Andy Clark wrote:
>> ----- Jacob Kjome <ho...@vi...> wrote:
>>>> what are the requirements for NekoHTML?
>>>>
>>>> According to build.xml, we have:
>>>> - Java version: 1.4
>>> This needs to change to 1.3. Andy has a patch to make your
>>> 1.4-specific encoding patch optional by using reflection. I
>>> have yet to test this (not sure if Andy has checked it in yet?).
>> It's now checked in. Please test it and let me know if
>> there are any issues with the way that I added the
>> reflection.
>
> You forgot to remove the import statement for
> java.nio.charset.Charset from HTMLScanner.java. Without that
> removed, the source cannot be compiled using JDK 1.3. That said,
> when I compiled with JDK 1.6 and ran under JDK 1.3, it ran fine. I
> don't understand why it didn't blow up with a
> NoClassDefFoundException or something like that?
>
> Of course Bug1790414.java didn't compile under JDK 1.3, but that's
> to be expected. Might want to compile src/bugs along with src/test
> instead of along with the main source. Bug and test code can use
> JDK 1.4+ while the main source must use JDK 1.3 at the minimum.
> Users wanting to build the jar from scratch using JDK 1.3 can do
> that, skipping the tests that may contain JDK 1.4+ stuff, which
> would prevent their compile from succeeding.
>
> Once all this is done, it should be good to go.

Whoops, at least one other thing to do before release. Can you add
includeAntRuntime='false' to <javac> as we discussed previously? This
prevents the version of Xerces included in $ANT_HOME/lib from
overriding the version provided directly on the <javac> classpath, and
resolves any ambiguity over which version of Xerces we are compiling
against.

Jake
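[Editor's note] The build.xml change Jake asks for would look roughly like the fragment below. The target directory names, source/target levels, and jar location are illustrative assumptions, not NekoHTML's actual build file; the relevant part is the includeAntRuntime attribute, which keeps Ant's own classpath (including any Xerces jar in $ANT_HOME/lib) out of the compile:

```xml
<javac srcdir="src" destdir="bin"
       source="1.3" target="1.3"
       includeAntRuntime="false">
  <classpath>
    <!-- the Xerces version we intend to compile against -->
    <pathelement location="lib/xercesImpl.jar"/>
  </classpath>
</javac>
```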
From: Jacob K. <ho...@vi...> - 2007-12-06 07:28:57
|
Andy Clark wrote:
>> I agree with this as well. However, after speaking with Andy in a
>> private email a week ago or so, I think the concern is to make it
>> look like a major, mature release. 0.9.6 wouldn't do that. 9.6
>> would, but is confusing because of the huge version leap. 1.0
>> might be ok, but it could be viewed as less than mature. If we
>> view the previous releases as, essentially, 1.x releases, we could
>> just make the reasonable hop to 2.0. This would not be as jolting
>> a version jump as 9.6 and would be viewed as more mature than a
>> 1.0 release.
>>
>> So, how about we make the next release 2.0?
>
> How about just adding the 1 so the new release would be
> 1.9.6? Version 1.0.0 looks immature but as long as it has
> some minor releases after the 1, then it mitigates that
> feeling.
>
> Jumping from 0 to 2 is just as confusing as jumping from
> 0 to 9, if we are trying to avoid confusion, that is. How
> does everyone feel about that?

Sounds good to me.

Jake
From: Jacob K. <ho...@vi...> - 2007-12-06 07:28:02
|
Andy Clark wrote:
> ----- Jacob Kjome <ho...@vi...> wrote:
>>> what are the requirements for NekoHTML?
>>>
>>> According to build.xml, we have:
>>> - Java version: 1.4
>> This needs to change to 1.3. Andy has a patch to make your
>> 1.4-specific encoding patch optional by using reflection. I
>> have yet to test this (not sure if Andy has checked it in yet?).
>
> It's now checked in. Please test it and let me know if
> there are any issues with the way that I added the
> reflection.

You forgot to remove the import statement for java.nio.charset.Charset
from HTMLScanner.java. Without that removed, the source cannot be
compiled using JDK 1.3. That said, when I compiled with JDK 1.6 and
ran under JDK 1.3, it ran fine. I don't understand why it didn't blow
up with a NoClassDefFoundException or something like that?

Of course Bug1790414.java didn't compile under JDK 1.3, but that's to
be expected. Might want to compile src/bugs along with src/test
instead of along with the main source. Bug and test code can use JDK
1.4+ while the main source must use JDK 1.3 at the minimum. Users
wanting to build the jar from scratch using JDK 1.3 can do that,
skipping the tests that may contain JDK 1.4+ stuff, which would
prevent their compile from succeeding.

Once all this is done, it should be good to go.

Jake
From: Andy C. <an...@cy...> - 2007-12-06 05:27:43
|
I was just looking at revision 108 that added the attribute
normalization feature and I had a few comments.

The first was the naming of the new feature. For consistency with
existing feature identifiers, I would have called it:

  http://cyberneko.org/html/features/scanner/normalize-attributes

Also, I think that the default should be false for the reason that we
don't want to produce unusable values for the various on* attributes.
And we need to add documentation to the settings.html page.

Second, I noticed that the files added to test this feature were named
testAttributeMultiline.html. At first, I thought that it would be
better if they were named in numerical sequence like the other test
files. But, after thinking about it for a while, I decided that this
is actually a good thing.

In fact, I think we should rename all of the other test files so that
you can tell what they are testing by the filename. This is work for a
future release, but I would like to decide on a common naming system
so that they are grouped together by category when the directory list
is sorted. For example: text-*, element-*, attribute-*, doctype-*,
namespaces-*, etc...

--
Andy Clark - an...@cy...
From: Andy C. <an...@cy...> - 2007-12-06 05:05:12
|
----- Jacob Kjome <ho...@vi...> wrote:
>> what are the requirements for NekoHTML?
>>
>> According to build.xml, we have:
>> - Java version: 1.4
>
> This needs to change to 1.3. Andy has a patch to make your
> 1.4-specific encoding patch optional by using reflection. I
> have yet to test this (not sure if Andy has checked it in yet?).

It's now checked in. Please test it and let me know if there are any
issues with the way that I added the reflection.

>> - xerces version: 2.9.1

I agree that we should support Xerces2 versions as far back as
possible. And while we compile against the latest version of Xerces,
the reflection allows the NekoHTML parser to be used with older
versions.

--
Andy Clark - an...@cy...
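[Editor's note] The reflection approach discussed here — compiling against a class that may not exist at run time and only looking it up by name when needed — can be sketched as follows. This is an illustrative example, not the actual code Andy checked in; the class and method names here are assumptions:

```java
import java.lang.reflect.Method;

// Sketch of using reflection so that code can take advantage of JDK
// 1.4's java.nio.charset.Charset while still loading on a JDK 1.3
// runtime, where that class does not exist. Names are illustrative.
public class EncodingCheck {

    // Returns true if the runtime reports the encoding as supported via
    // Charset.isSupported, or if Charset is unavailable (JDK 1.3), in
    // which case we optimistically assume the encoding is usable and let
    // the decoder fail later if it is not.
    public static boolean isSupported(String encoding) {
        try {
            // Look the class up by name so no hard reference is compiled in.
            Class charsetClass = Class.forName("java.nio.charset.Charset");
            Method m = charsetClass.getMethod("isSupported",
                                              new Class[] { String.class });
            Object result = m.invoke(null, new Object[] { encoding });
            return ((Boolean) result).booleanValue();
        } catch (Throwable t) {
            // JDK 1.3 path: Charset is missing, so we cannot pre-validate.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(isSupported("UTF-8")); // true on any modern JVM
    }
}
```

This also explains why the leftover `import java.nio.charset.Charset;` broke the JDK 1.3 *compile* but not the JDK 1.3 *run*: the JVM only loads a class when it is first used, and an unused import generates no use.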
From: Andy C. <an...@cy...> - 2007-12-06 04:39:10
|
> I agree with this as well. However, after speaking with Andy in a
> private email a week ago or so, I think the concern is to make it
> look like a major, mature release. 0.9.6 wouldn't do that. 9.6
> would, but is confusing because of the huge version leap. 1.0 might
> be ok, but it could be viewed as less than mature. If we view the
> previous releases as, essentially, 1.x releases, we could just make
> the reasonable hop to 2.0. This would not be as jolting a version
> jump as 9.6 and would be viewed as more mature than a 1.0 release.
>
> So, how about we make the next release 2.0?

How about just adding the 1 so the new release would be 1.9.6? Version
1.0.0 looks immature, but as long as it has some minor releases after
the 1, then it mitigates that feeling.

Jumping from 0 to 2 is just as confusing as jumping from 0 to 9, if we
are trying to avoid confusion, that is. How does everyone feel about
that?

--
Andy Clark - an...@cy...
From: Jacob K. <ho...@vi...> - 2007-12-05 19:50:37
|
On Tue, 4 Dec 2007 21:59:07 -0500 (EST) Andy Clark <an...@cy...> wrote:
> ----- Marc Guillemot <mgu...@ya...> wrote:
>> NekoHTML normalizes tag attributes (basically it changes \r\n\t to
>> space), which is incorrect as browsers don't do it. This leads to
>> wrong behaviors, for instance when a page has something like:
>
> Great catch! We should probably add a feature so that
> attribute normalization can be turned on.

I've got a question about this. Is there a specification that says
that attributes should be normalized as NekoHTML is currently doing
(if not the HTML spec, how about the XML spec)? If so, we should
comply with the spec and perform normalization. If not, then I guess
it makes sense not to do it by default.

Personally, I think the premise of removing attribute normalization as
the default is incorrect. When writing scripts inside an attribute,
the other form of comment should always be used. Even if NekoHTML
doesn't strip newlines from attributes, there's no guarantee that the
markup serializer won't do this on its own. Instead of...

  <body onload="alert(1) // a comment
  alert(2)">
  ...
  </body>

...it should be...

  <body onload="alert(1) /* a comment */ alert(2)">
  ...
  </body>

The second example is robust because it is agnostic to how it is
parsed or serialized, while the first is entirely brittle: it depends
on external factors to run correctly, including parser behavior and
serializer behavior. Developers depending on external factors for
their scripts to work are likely to get burned one way or another,
with or without our help.

>> Fixing special handling in HTMLScanner.ContentScanner.scanAttribute
>> fixes the problem but makes 2 tests fail:
>> - test061.html
>> - test081.html
>> For me both tests are incorrect as new lines should not be removed.

I'm fine with this only if we are not violating some spec by no longer
performing attribute normalization by default.

Jake

> After the change, we can just add a test061.html.settings
> file that enables attribute normalization. If the
> feature is done correctly, then the original test
> will continue to pass.
>
> --
> Andy Clark - an...@cy...
>
> -------------------------------------------------------------------------
> SF.Net email is sponsored by: The Future of Linux Business White Paper
> from Novell. From the desktop to the data center, Linux is going
> mainstream. Let it simplify your IT future.
> http://altfarm.mediaplex.com/ad/ck/8857-50307-18918-4
> _______________________________________________
> nekohtml-developer mailing list
> nek...@li...
> https://lists.sourceforge.net/lists/listinfo/nekohtml-developer
From: Jacob K. <ho...@vi...> - 2007-12-05 19:33:48
|
On Wed, 05 Dec 2007 17:44:21 +0100 Marc Guillemot <mgu...@ya...> wrote:
> Hi,
>
> what are the requirements for NekoHTML?
>
> According to build.xml, we have:
> - Java version: 1.4

This needs to change to 1.3. Andy has a patch to make your
1.4-specific encoding patch optional by using reflection. I have yet
to test this (not sure if Andy has checked it in yet?).

I'd like NekoHTML to be 1.3 compatible for a couple of reasons:

1. My project, XMLC, is 1.3 compatible. As such, I need its libraries,
   including NekoHTML, to be 1.3 compatible.

2. Xerces is 1.3 compatible (probably even 1.2 or earlier) and, since
   NekoHTML is an extension of Xerces, it really ought to be able to
   play in all environments that Xerces supports. Had Andy
   successfully donated the project to Xerces (which he tried to do
   earlier this year), this compatibility would likely have been
   enforced anyway.

> - xerces version: 2.9.1

We should really be compatible with Xerces versions all the way back
to just before the point where the current XNI API changed. One bug I
reported (and Andy fixed) was that NekoHTML wouldn't compile against
the XNI API in the latest version of Xerces. I'm not sure exactly when
it changed, but that would be a good place to start for declaring
compatibility with Xerces. And then, if we find some egregious bugs
that we want to say we don't support, we could move closer to the
current version of Xerces, up to the version where they were fixed.

For instance, let's say the XNI interfaces changed in Xerces 2.7.0
(and continue to be the same up through the latest version of Xerces).
We would start by declaring that version the earliest one that
NekoHTML supports. Then, maybe we find some terrible bug in Xerces
2.7.0 that wasn't fixed until Xerces 2.8.0 and that is either
impossible or simply unacceptable to work around. In that case, we'd
declare Xerces 2.8.0 to be the earliest version that NekoHTML
supports, but encourage users to use the latest version of Xerces if
at all possible.

> this is fully ok for me but HTMLConfiguration (for instance)
> contains hacks for special (old) versions of Xerces. Due to the fact
> that the build doesn't test them, I think that their value is ~ 0
> and therefore that they should be removed.

As long as we keep the workarounds that apply to the earliest version
of Xerces that we support. See above.

> Any thoughts?

Does my proposal seem reasonable?

Jake

> Cheers,
> Marc.
> --
> Blog: http://mguillem.wordpress.com
From: Jacob K. <ho...@vi...> - 2007-12-05 19:16:31
|
On Wed, 5 Dec 2007 00:33:45 -0800 (PST) Ahmed Ashour <asa...@ya...> wrote:
> Dear Andy,
>
> I think moving from 0.9.x to 9.x creates confusion (where are the
> versions 2 through 8?).

I agree. It's confusing.

> As NekoHTML is mature now, why not upgrade to 1.0.0?

I agree with this as well. However, after speaking with Andy in a
private email a week ago or so, I think the concern is to make it look
like a major, mature release. 0.9.6 wouldn't do that. 9.6 would, but
is confusing because of the huge version leap. 1.0 might be ok, but it
could be viewed as less than mature. If we view the previous releases
as, essentially, 1.x releases, we could just make the reasonable hop
to 2.0. This would not be as jolting a version jump as 9.6 and would
be viewed as more mature than a 1.0 release.

So, how about we make the next release 2.0?

Jake

> ----- Original Message ----
> From: Andy Clark <an...@cy...>
> To: nek...@li...
> Sent: Wednesday, December 5, 2007 11:15:57 AM
> Subject: [nekohtml-dev] NekoHTML Releases (past and future)
>
> I just spent the past couple hours entering the past
> NekoHTML releases into the SourceForge project. Now all
> of the old releases will be available for posterity. :)
>
> Now we must release a new version. My current thinking
> is to bump the version from 0.9.x to 9.x to denote the
> change in project home and the maturity of the product.
> Some people like the idea and others don't. So it's
> still up in the air.
>
> If you have any last-minute arguments one way or the
> other, let it be known now.
>
> --
> Andy Clark - an...@cy...
From: Marc G. <mgu...@ya...> - 2007-12-05 16:44:35
|
Hi,

what are the requirements for NekoHTML?

According to build.xml, we have:
- Java version: 1.4
- Xerces version: 2.9.1

This is fully ok for me, but HTMLConfiguration (for instance) contains
hacks for special (old) versions of Xerces. Due to the fact that the
build doesn't test them, I think that their value is ~ 0 and therefore
that they should be removed.

Any thoughts?

Cheers,
Marc.
--
Blog: http://mguillem.wordpress.com
From: Ahmed A. <asa...@ya...> - 2007-12-05 08:33:48
|
Dear Andy,

I think moving from 0.9.x to 9.x creates confusion (where are the
versions 2 through 8?).

As NekoHTML is mature now, why not upgrade to 1.0.0?

Sincerely,
Ahmed

----- Original Message ----
From: Andy Clark <an...@cy...>
To: nek...@li...
Sent: Wednesday, December 5, 2007 11:15:57 AM
Subject: [nekohtml-dev] NekoHTML Releases (past and future)

I just spent the past couple hours entering the past NekoHTML releases
into the SourceForge project. Now all of the old releases will be
available for posterity. :)

Now we must release a new version. My current thinking is to bump the
version from 0.9.x to 9.x to denote the change in project home and the
maturity of the product. Some people like the idea and others don't.
So it's still up in the air.

If you have any last-minute arguments one way or the other, let it be
known now.

--
Andy Clark - an...@cy...
From: Andy C. <an...@cy...> - 2007-12-05 08:16:01
|
I just spent the past couple hours entering the past NekoHTML releases
into the SourceForge project. Now all of the old releases will be
available for posterity. :)

Now we must release a new version. My current thinking is to bump the
version from 0.9.x to 9.x to denote the change in project home and the
maturity of the product. Some people like the idea and others don't.
So it's still up in the air.

If you have any last-minute arguments one way or the other, let it be
known now.

--
Andy Clark - an...@cy...
From: Andy C. <an...@cy...> - 2007-12-05 02:59:14
|
----- Marc Guillemot <mgu...@ya...> wrote:
> NekoHTML normalizes tag attributes (basically it changes \r\n\t to
> space), which is incorrect as browsers don't do it. This leads to
> wrong behaviors, for instance when a page has something like:

Great catch! We should probably add a feature so that attribute
normalization can be turned on.

> Fixing special handling in HTMLScanner.ContentScanner.scanAttribute
> fixes the problem but makes 2 tests fail:
> - test061.html
> - test081.html
> For me both tests are incorrect as new lines should not be removed.

After the change, we can just add a test061.html.settings file that
enables attribute normalization. If the feature is done correctly,
then the original test will continue to pass.

--
Andy Clark - an...@cy...
From: Marc G. <mgu...@ya...> - 2007-12-04 09:39:51
|
Hi,

NekoHTML normalizes tag attributes (basically it changes \r\n\t to
space), which is incorrect as browsers don't do it. This leads to
wrong behaviors, for instance when a page has something like:

  <body onload="alert(1) // a comment
  alert(2)">
  ...
  </body>

When the onload attribute is normalized, alert(2) appears to be inside
the comment rather than on a new line, and therefore is not evaluated.

Fixing the special handling in
HTMLScanner.ContentScanner.scanAttribute fixes the problem but makes 2
tests fail:
- test061.html
- test081.html

For me both tests are incorrect as new lines should not be removed.
Unless someone complains, I will fix this and adapt the tests to
express the (new) correct expectations.

Cheers,
Marc.
--
Blog: http://mguillem.wordpress.com
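[Editor's note] The bug Marc describes is easy to reproduce with a one-line string transformation. If the scanner replaces the newline in the onload value with a space, the second statement is absorbed into the '//' line comment and never runs:

```java
// Demonstration of how newline normalization breaks '//' comments in
// attribute values: once the newline becomes a space, everything after
// "//" is on the same line and is treated as comment text by JavaScript.
public class NormalizeDemo {
    public static void main(String[] args) {
        String onload = "alert(1) // a comment\nalert(2)";
        String normalized = onload.replace('\n', ' ');
        // prints: alert(1) // a comment alert(2)
        // — a single line, so alert(2) is now inside the comment.
        System.out.println(normalized);
    }
}
```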
From: Andy C. <an...@cy...> - 2007-11-20 08:27:57
|
Sorry for the delay. I'm just now getting back to work on NekoHTML and
would like to have a release soon. Let me know if you think this issue
should be resolved before the release. If so, I'll take a look at it
this week.

P.S. Regarding issue 1796014, would it cause you problems if we made
the minimum source/target version 1.4 instead of 1.3 as specified in
your bug report?

----- Jacob Kjome <ho...@vi...> wrote:
> The issue seems to be the order of operations for a file with '\r\n'
> line endings -vs- a file with '\r' or '\n' line endings. For the
> former, we get (roughly)...
>
> 1. HTMLScanner.skipNewlines(maxLines) finds the '\r\n' sequence
> 2. SSIReader.read(buffer, offset, length) returns -1
> 3. ClosingInputSource.ClosingReader.read(buffer, offset, length)
>    detects EOF, calls close (SSIReader.close() gets called) and
>    returns -1
> 4. HTMLScanner no longer calls any read() methods on the provided
>    reader class
>
> With '\r' or '\n' we get, essentially, the same, except that step 4
> is not reached. Instead, after step 3, steps 1 and 2 are performed
> again, this time with the stream already closed (by step 3), which
> is why we get a NullPointerException the second time through at
> step 2.
>
> So, the question is, why is the read sequence different between
> Windows and Unix/Mac line endings? It's not that it doesn't
> generally work, but it's inconsistent, and it seems to give readers
> an indication that the end of the file has been reached before it's
> done calling the reader's read() methods. Most of the time no harm
> is done, but when using the special ClosingInputSource that wraps
> the real reader in a reader that automatically closes the stream
> when EOF is reached, this can be problematic.
>
> Is there any way that HTMLScanner can be modified so that it
> provides consistent behavior when reading files with different line
> endings (Xerces1, Xerces2, and jTidy seem to manage this)?
> Specifically, make it behave 100% of the time like it currently
> does with Windows line endings?
>
> thanks,
>
> Jake
>
> Jacob Kjome wrote:
>> It looks to me like it has to do with the behavior of
>> HTMLScanner.skipNewlines(). I printed the stack at the point where
>> SSIReader.close() gets called (see below). HTMLScanner calls
>> read(buffer, offset, length), which calls my
>> ClosingInputSource.ClosingReader, which wraps and delegates to the
>> real reader. It then checks the value returned and, if it's less
>> than zero, calls close() on the reader, which ends up setting the
>> stream object to null in the SSIReader.
>>
>> Trouble is, HTMLScanner.load() is called again, which then calls
>> read() again on the reader that already has a closed stream. It
>> seems to be ignoring the fact that read previously returned -1
>> (or, at least, something less than 0, which I'll look into further
>> to verify the exact value).
>>
>> It seems pretty clear to me now that this is an issue in
>> HTMLScanner. I hope you can look into it, track down the root
>> cause, and fix the issue. If you think that my code is doing
>> something wrong, please let me know.
>>
>> thanks,
>>
>> Jake
>>
>> [xmlc] Invoke XMLC on D:\myclasses\Repository\Enhydra\XMLC_FORGE_NEW_DEVELOPMENT\xmlc\examples\tomcat\res\pkg\xmlc\demo\mainPage.html
>> [xmlc] java.lang.Exception: closed stream
>> [xmlc]   at org.enhydra.xml.xmlc.misc.SSIReader.close(SSIReader.java:242)
>> [xmlc]   at java.io.FilterReader.close(FilterReader.java:104)
>> [xmlc]   at org.enhydra.xml.io.ClosingInputSource$ClosingReader.read(ClosingInputSource.java:119)
>> [xmlc]   at org.cyberneko.html.HTMLScanner.load(HTMLScanner.java:1097)
>> [xmlc]   at org.cyberneko.html.HTMLScanner.skipNewlines(HTMLScanner.java:1583)
>> [xmlc]   at org.cyberneko.html.HTMLScanner.skipNewlines(HTMLScanner.java:1538)
>> [xmlc]   at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(HTMLScanner.java:2024)
>> [xmlc]   at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1857)
>> [xmlc]   at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:824)
>> [xmlc]   at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:487)
>> [xmlc]   at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:440)
>> [xmlc]   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>> [xmlc]   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>> [xmlc]   at org.enhydra.xml.xmlc.parsers.xerces.XercesDOMParser.parse(XercesDOMParser.java:113)
>> [xmlc]   at org.enhydra.xml.xmlc.parsers.xerces.XercesHTMLDOMParser.parse(XercesHTMLDOMParser.java:65)
>> [xmlc]   at org.enhydra.xml.xmlc.compiler.Parse.parse(Parse.java:252)
>> [xmlc]   at org.enhydra.xml.xmlc.compiler.Compiler.parsePage(Compiler.java:112)
>> [xmlc]   at org.enhydra.xml.xmlc.compiler.Compiler.compile(Compiler.java:227)
>> [xmlc]   at org.enhydra.xml.xmlc.commands.xmlc.XMLC.compile(XMLC.java:135)
>> [xmlc]   at org.enhydra.xml.xmlc.commands.xmlc.XMLC.compileHandleErrors(XMLC.java:145)
>> [xmlc]   at org.enhydra.xml.xmlc.commands.xmlc.XMLC.main(XMLC.java:156)
> > > > Jacob Kjome wrote: > >> I'm suspicious that there is a file reading bug in HTMLScanner. I'm not > >> 100% sure about it, though, because the error is realized in my code, > >> not NekoHTML. But when the parser is Xerces1, Xerces2, or JTidy, I have > >> no issue. Only when using NekoHTML does the error occur. > >> > >> The issues seems to be in dealing with files with either Unix or Mac > >> line endings; '\n' or '\r'. There is no problem with Windows line > >> endings; '\r\n'. And it only happens when the last character in the > >> file is '\n' or '\r'. If the last character isn't one of the latter 2 > >> characters or the last character is '\r\n', there is no issue. > >> > >> The issue happens on line 220 in my Reader class where it is assumed > >> that the stream is not null. And this is always the case, except when > >> using NekoHTML for parsing and the file has Unix or Mac line endings. I > >> added a null check on line 217 to work around this issue, but I wonder > >> why it's needed only with NekoHTML as the parser? Here's the SSIReader > >> class (with annotations to see line numbers).... > >> > >> http://cvs.xmlc.forge.objectweb.org/cgi-bin/viewcvs.cgi/xmlc/xmlc/xmlc/modules/xmlc/src/org/enhydra/xml/xmlc/misc/SSIReader.java?annotate=1.10 > >> > >> I'm going to look into this a bit more to see if I can find the issue in > >> my code, but I wonder you could look at HTMLScanner and see if you > >> notice anything that could be dealing improperly with line endings? > >> > >> Here's the stack trace when I comment out my workaround at line 217 (and > >> the last character in the file is '\n' or '\r')... 
> >> > >> [xmlc] Invoke XMLC on > >> D:\myclasses\Repository\Enhydra\XMLC_FORGE_NEW_DEVELOPMENT\xmlc\examples\tomcat\res\pkg\xmlc\demo\mainPage.html > >> [xmlc] Error: java.lang.NullPointerException > >> [xmlc] java.lang.NullPointerException > >> [xmlc] at > >> org.enhydra.xml.xmlc.misc.SSIReader.read(SSIReader.java:220) > >> [xmlc] at java.io.FilterReader.read(FilterReader.java:57) > >> [xmlc] at > >> org.enhydra.xml.io.ClosingInputSource$ClosingReader.read(ClosingInputSource.java:117) > >> [xmlc] at > >> org.cyberneko.html.HTMLScanner.load(HTMLScanner.java:1097) > >> [xmlc] at > >> org.cyberneko.html.HTMLScanner.read(HTMLScanner.java:1058) > >> [xmlc] at > >> org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1843) > >> [xmlc] at > >> org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:824) > >> [xmlc] at > >> org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:487) > >> [xmlc] at > >> org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:440) > >> [xmlc] at org.apache.xerces.parsers.XMLParser.parse(UnknownSource) > >> [xmlc] at org.apache.xerces.parsers.DOMParser.parse(UnknownSource) > >> [xmlc] at > >> org.enhydra.xml.xmlc.parsers.xerces.XercesDOMParser.parse(XercesDOMParser.java:113) > >> [xmlc] at > >> org.enhydra.xml.xmlc.parsers.xerces.XercesHTMLDOMParser.parse(XercesHTMLDOMParser.java:65) > >> [xmlc] at > >> org.enhydra.xml.xmlc.compiler.Parse.parse(Parse.java:252) > >> [xmlc] at > >> org.enhydra.xml.xmlc.compiler.Compiler.parsePage(Compiler.java:112) > >> [xmlc] at > >> org.enhydra.xml.xmlc.compiler.Compiler.compile(Compiler.java:227) > >> [xmlc] at > >> org.enhydra.xml.xmlc.commands.xmlc.XMLC.compile(XMLC.java:135) > >> [xmlc] at > >> org.enhydra.xml.xmlc.commands.xmlc.XMLC.compileHandleErrors(XMLC.java:145) > >> [xmlc] at > >> org.enhydra.xml.xmlc.commands.xmlc.XMLC.main(XMLC.java:156) > >> > >> > >> > >> Jake > >> > >> ------------------------------------------------------------------------- > >> This 
SF.net email is sponsored by: Splunk Inc. > >> Still grepping through log files to find problems? Stop. > >> Now Search log events and configuration files using AJAX and a browser. > >> Download your FREE copy of Splunk now >> http://get.splunk.com/ > >> _______________________________________________ > >> nekohtml-developer mailing list > >> nek...@li... > >> https://lists.sourceforge.net/lists/listinfo/nekohtml-developer > >> > >> > >> > > > > ------------------------------------------------------------------------- > > This SF.net email is sponsored by: Splunk Inc. > > Still grepping through log files to find problems? Stop. > > Now Search log events and configuration files using AJAX and a browser. > > Download your FREE copy of Splunk now >> http://get.splunk.com/ > > _______________________________________________ > > nekohtml-developer mailing list > > nek...@li... > > https://lists.sourceforge.net/lists/listinfo/nekohtml-developer > > > > > > Andy Clark - an...@cy... |
From: Jacob K. <ho...@vi...> - 2007-10-29 23:45:13
|
The issue seems to be the order of operations for a file with '\r\n' line
endings -vs- a file with '\r' or '\n' line endings. For the former, we get
(roughly)...

1. HTMLScanner.skipNewlines(maxLines) finds the '\r\n' sequence
2. SSIReader.read(buffer, offset, length) returns -1
3. ClosingInputSource.ClosingReader.read(buffer, offset, length) detects EOF,
   calls close() (so SSIReader.close() gets called), and returns -1
4. HTMLScanner makes no further read() calls on the provided reader

With '\r' or '\n' we get essentially the same sequence, except that step #4 is
never reached. Instead, after step #3, steps #1 and #2 are performed again,
this time with the stream already closed (by step #3), which is why we get a
NullPointerException the second time through at step #2.

So, the question is: why does the read sequence differ between Windows and
Unix/Mac line endings? It's not that it doesn't generally work, but it's
inconsistent, and it signals that the end of the file has been reached before
the scanner is done calling the reader's read() methods. Most of the time no
harm is done, but when using the special ClosingInputSource, which wraps the
real reader in a reader that automatically closes the stream once EOF is
reached, this is problematic.

Is there any way HTMLScanner can be modified so that it provides consistent
behavior when reading files with different line endings (Xerces1, Xerces2, and
jTidy all seem to manage this)? Specifically, can it behave 100% of the time
the way it currently does with Windows line endings?

thanks,

Jake

Jacob Kjome wrote:
> [...]
|
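The interaction described above can be reproduced in miniature. The sketch below is illustrative only: `NullingReader` is a hypothetical stand-in for SSIReader (whose close() drops its stream field), and `ClosingReader` stands in for ClosingInputSource.ClosingReader (which auto-closes at EOF); neither is the real XMLC code. A caller that ignores an earlier -1 and calls read() again hits the NullPointerException:

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReadAfterCloseDemo {

    // Stand-in for SSIReader: close() drops the underlying stream, so any
    // later read() dereferences null and throws a NullPointerException.
    static class NullingReader extends FilterReader {
        private Reader stream;

        NullingReader(Reader in) {
            super(in);
            this.stream = in;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            return stream.read(buf, off, len); // NPE once stream is null
        }

        @Override
        public void close() throws IOException {
            stream.close();
            stream = null;
        }
    }

    // Stand-in for ClosingInputSource.ClosingReader: auto-close the wrapped
    // reader the moment EOF is detected.
    static class ClosingReader extends FilterReader {
        ClosingReader(Reader in) {
            super(in);
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            int n = in.read(buf, off, len);
            if (n < 0) {
                in.close();
            }
            return n;
        }
    }

    public static void main(String[] args) throws IOException {
        ClosingReader r =
            new ClosingReader(new NullingReader(new StringReader("x\n")));
        char[] buf = new char[16];
        while (r.read(buf, 0, buf.length) >= 0) {
            // drain the input; at EOF the wrapper closes the inner reader
        }
        // A scanner that ignores the earlier -1 and reads again now fails:
        try {
            r.read(buf, 0, buf.length);
            System.out.println("no exception");
        } catch (NullPointerException e) {
            System.out.println("NPE on read after close");
        }
    }
}
```

The NPE only appears when the caller issues that extra read() after EOF, which is why the bug surfaces with '\n'/'\r' line endings but not with '\r\n'.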
From: Jacob K. <ho...@vi...> - 2007-10-28 21:55:49
|
It looks to me like it has to do with the behavior of
HTMLScanner.skipNewlines(). I printed the stack at the point where
SSIReader.close() gets called (see below). HTMLScanner calls read(buffer,
offset, length), which calls my ClosingInputSource.ClosingReader, which wraps
and delegates to the real reader. It then checks the value returned and, if
it's less than zero, calls close() on the reader, which ends up setting the
stream object to null in the SSIReader.

Trouble is, HTMLScanner.load() is called again, which then calls read() again
on the reader that already has a closed stream. It seems to be ignoring the
fact that read() previously returned -1 (or, at least, something less than 0,
which I'll look into further to verify the exact value).

It seems pretty clear to me now that this is an issue in HTMLScanner. I hope
you can look into it, track down the root cause, and fix the issue. If you
think that my code is doing something wrong, please let me know.

thanks,

Jake


[xmlc] Invoke XMLC on D:\myclasses\Repository\Enhydra\XMLC_FORGE_NEW_DEVELOPMENT\xmlc\examples\tomcat\res\pkg\xmlc\demo\mainPage.html
[xmlc] java.lang.Exception: closed stream
[xmlc] at org.enhydra.xml.xmlc.misc.SSIReader.close(SSIReader.java:242)
[xmlc] at java.io.FilterReader.close(FilterReader.java:104)
[xmlc] at org.enhydra.xml.io.ClosingInputSource$ClosingReader.read(ClosingInputSource.java:119)
[xmlc] at org.cyberneko.html.HTMLScanner.load(HTMLScanner.java:1097)
[xmlc] at org.cyberneko.html.HTMLScanner.skipNewlines(HTMLScanner.java:1583)
[xmlc] at org.cyberneko.html.HTMLScanner.skipNewlines(HTMLScanner.java:1538)
[xmlc] at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(HTMLScanner.java:2024)
[xmlc] at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1857)
[xmlc] at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:824)
[xmlc] at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:487)
[xmlc] at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:440)
[xmlc] at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
[xmlc] at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
[xmlc] at org.enhydra.xml.xmlc.parsers.xerces.XercesDOMParser.parse(XercesDOMParser.java:113)
[xmlc] at org.enhydra.xml.xmlc.parsers.xerces.XercesHTMLDOMParser.parse(XercesHTMLDOMParser.java:65)
[xmlc] at org.enhydra.xml.xmlc.compiler.Parse.parse(Parse.java:252)
[xmlc] at org.enhydra.xml.xmlc.compiler.Compiler.parsePage(Compiler.java:112)
[xmlc] at org.enhydra.xml.xmlc.compiler.Compiler.compile(Compiler.java:227)
[xmlc] at org.enhydra.xml.xmlc.commands.xmlc.XMLC.compile(XMLC.java:135)
[xmlc] at org.enhydra.xml.xmlc.commands.xmlc.XMLC.compileHandleErrors(XMLC.java:145)
[xmlc] at org.enhydra.xml.xmlc.commands.xmlc.XMLC.main(XMLC.java:156)


Jacob Kjome wrote:
> [...]
|
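One way the scanner side could avoid re-reading after EOF, as the analysis above suggests, is to latch the first -1 it sees. The sketch below is hypothetical, not the actual HTMLScanner code; `GuardedSource` and its `load()` method are invented names, loosely analogous in spirit to HTMLScanner.load():

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical sketch, not the actual HTMLScanner code: remember the first
// -1 so the reader is never touched again, whatever the line endings.
class GuardedSource {
    private final Reader in;
    private boolean atEof = false;

    GuardedSource(Reader in) {
        this.in = in;
    }

    // Loosely analogous to HTMLScanner.load(): refill a buffer from the reader.
    int load(char[] buf) throws IOException {
        if (atEof) {
            return -1; // EOF already seen; don't re-read a possibly closed stream
        }
        int n = in.read(buf, 0, buf.length);
        if (n < 0) {
            atEof = true;
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        GuardedSource src = new GuardedSource(new StringReader("line\n"));
        char[] buf = new char[8];
        while (src.load(buf) >= 0) {
            // consume input
        }
        System.out.println(src.load(buf)); // still -1; no second real read()
    }
}
```

With a guard like this, the read sequence would look the same for '\r\n', '\r', and '\n' endings: after the first -1, no further read() ever reaches the wrapped reader.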
From: Jacob K. <ho...@vi...> - 2007-10-28 21:08:17
|
I'm suspicious that there is a file reading bug in HTMLScanner. I'm not 100%
sure about it, though, because the error is realized in my code, not NekoHTML.
But when the parser is Xerces1, Xerces2, or JTidy, I have no issue. Only when
using NekoHTML does the error occur.

The issue seems to be in dealing with files with either Unix or Mac line
endings: '\n' or '\r'. There is no problem with Windows line endings: '\r\n'.
And it only happens when the last character in the file is '\n' or '\r'. If
the last character isn't one of those two characters, or the file ends with
'\r\n', there is no issue.

The issue happens on line 220 of my Reader class, where it is assumed that the
stream is not null. And this is always the case, except when using NekoHTML
for parsing and the file has Unix or Mac line endings. I added a null check on
line 217 to work around this issue, but I wonder why it's needed only with
NekoHTML as the parser. Here's the SSIReader class (with annotations to see
line numbers)...

http://cvs.xmlc.forge.objectweb.org/cgi-bin/viewcvs.cgi/xmlc/xmlc/xmlc/modules/xmlc/src/org/enhydra/xml/xmlc/misc/SSIReader.java?annotate=1.10

I'm going to look into this a bit more to see if I can find the issue in my
code, but I wonder if you could look at HTMLScanner and see if you notice
anything that could be dealing improperly with line endings?

Here's the stack trace when I comment out my workaround at line 217 (and the
last character in the file is '\n' or '\r')...

[xmlc] Invoke XMLC on D:\myclasses\Repository\Enhydra\XMLC_FORGE_NEW_DEVELOPMENT\xmlc\examples\tomcat\res\pkg\xmlc\demo\mainPage.html
[xmlc] Error: java.lang.NullPointerException
[xmlc] java.lang.NullPointerException
[xmlc] at org.enhydra.xml.xmlc.misc.SSIReader.read(SSIReader.java:220)
[xmlc] at java.io.FilterReader.read(FilterReader.java:57)
[xmlc] at org.enhydra.xml.io.ClosingInputSource$ClosingReader.read(ClosingInputSource.java:117)
[xmlc] at org.cyberneko.html.HTMLScanner.load(HTMLScanner.java:1097)
[xmlc] at org.cyberneko.html.HTMLScanner.read(HTMLScanner.java:1058)
[xmlc] at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1843)
[xmlc] at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:824)
[xmlc] at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:487)
[xmlc] at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:440)
[xmlc] at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
[xmlc] at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
[xmlc] at org.enhydra.xml.xmlc.parsers.xerces.XercesDOMParser.parse(XercesDOMParser.java:113)
[xmlc] at org.enhydra.xml.xmlc.parsers.xerces.XercesHTMLDOMParser.parse(XercesHTMLDOMParser.java:65)
[xmlc] at org.enhydra.xml.xmlc.compiler.Parse.parse(Parse.java:252)
[xmlc] at org.enhydra.xml.xmlc.compiler.Compiler.parsePage(Compiler.java:112)
[xmlc] at org.enhydra.xml.xmlc.compiler.Compiler.compile(Compiler.java:227)
[xmlc] at org.enhydra.xml.xmlc.commands.xmlc.XMLC.compile(XMLC.java:135)
[xmlc] at org.enhydra.xml.xmlc.commands.xmlc.XMLC.compileHandleErrors(XMLC.java:145)
[xmlc] at org.enhydra.xml.xmlc.commands.xmlc.XMLC.main(XMLC.java:156)

Jake |
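The null-check workaround described in the message above might look roughly like the sketch below. This is a simplified, hypothetical reconstruction, not the actual SSIReader source (the real class also handles SSI directive processing); the class name and field name are invented for illustration:

```java
import java.io.IOException;
import java.io.Reader;

// Simplified, hypothetical reconstruction of the workaround described above;
// not the actual SSIReader source. The null check makes a read() after
// close() report EOF instead of throwing a NullPointerException.
class GuardedSsiReader {
    private Reader stream;

    GuardedSsiReader(Reader stream) {
        this.stream = stream;
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        if (stream == null) {
            return -1; // workaround: stream was already closed; signal EOF
        }
        return stream.read(cbuf, off, len);
    }

    public void close() throws IOException {
        if (stream != null) {
            stream.close();
            stream = null;
        }
    }
}
```

With this guard in place, a scanner that calls read() one extra time after close simply sees another -1, which masks the inconsistent read sequence rather than fixing it in HTMLScanner.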