RE: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTag
From: <dha...@po...> - 2003-05-29 04:29:13
Marc,

Your requirement is quite common. Mostly, code inside a <SCRIPT> tag should
be produced as it is. I think it's important that we have the test cases
and appropriate fixes in the main codebase.

Dhaval

> -----Original Message-----
> From: htm...@li...
> [mailto:htm...@li...] On
> Behalf Of Marc Novakowski
> Sent: Wednesday, May 28, 2003 8:30 PM
> To: htm...@li...;
> htm...@li...
> Subject: RE: [Htmlparser-developer] RE: [Htmlparser-cvs]
> htmlparser/src/org/htmlparser/scanners
> CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
>
> Derrick, if it's anybody's fault that my code is failing
> because of your change, it's mine. I should have checked in
> specific test cases that exercise my usage of the library.
> I apologise for not doing that earlier...
>
> Here are the main things that the new ScriptScanner does that
> break my code:
> 1) acts very strangely when it encounters "\" at a newline
>    (doesn't just get rid of the newline, but it starts repeating
>    the entire line about 6 times)
> 2) uppercases and auto-closes tags that aren't in quotes
>
> I have some specific test cases that demonstrate these. I'll
> check them in if you'd like. I have to admit that after
> playing with the internals of NodeReader, TagScanner, etc.,
> I'm not 100% clear on how some of this low level
> scanning code works. Nor is it always clear from reading the
> code. That's why I am not confident that I will be able to
> refactor the existing code to handle my specific problems.
>
> I realize my usage of the parser may be quite different than
> 95% of the people who use the library, so if there isn't a
> solution that fits into the existing architecture I'll be
> happy to just make some local changes to fix things.
> I can always make my own scanner and not check it into the codeline
> (or just copy the old version of ScriptScanner into my code).
> However, if I'm running into this now, chances are somebody
> in the future will, also.
>
> Marc
>
> -----Original Message-----
> From: Derrick Oswald [mailto:Der...@ro...]
> Sent: Tue 5/27/2003 6:26 PM
> To: htm...@li...
> Cc:
> Subject: Re: [Htmlparser-developer] RE:
> [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners
> CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
>
> You may need to back out the change, or at a minimum get the old
> code by going back a version and putting it in your ScriptScanner
> base class.
>
> I guess I screwed up. I saw your drop that allowed all the lines to
> be accumulated in a tag and I thought the two scanners were very
> close then (apart from the tags in quotes thing). My only excuse is
> it passed all the unit tests. Well, to be truthful, I changed two of
> the tests, but it was only extraneous newline stuff at the start and
> end of text.
>
> The script scanner is breaking your code because of uppercasing tags
> (not just within comments) and removing newlines after \, right?
>
> Marc Novakowski wrote:
>
> >I just realized that it's more complicated than that (for me, at
> >least). In my application that uses htmlparser, I am extending
> >certain scanners and tags (such as ScriptScanner but mostly
> >CompositeTagScanner) to allow for "custom" tags in an HTML page.
> >When the "HTML + custom tags" are run through my custom parser, the
> >custom tags are converted into an object model which is then turned
> >into dynamic javascript code.
> >
> >Long story short: some of these custom tags (i.e. the ones that
> >extend ScriptScanner) _absolutely_ need the inner contents of the
> >tag to remain unchanged.
> >Also, since it's not always Javascript that is inside of the tags,
> >adding extra rules to ignore tags in comments or strings won't
> >always work. For example, one tag allows for arbitrary XML innards.
> >Currently, the scanner will UPPERCASE all tags inside unless they're
> >in quotes (which messes up the XML).
> >
> >The old ScriptScanner did exactly what I needed -- that is, it
> >didn't scan for tags at all. It just looked for the exact
> >(case-insensitive) string match of the end tag. It didn't look for
> >"<" and it didn't defer to scanners. I took a look at the current
> >code and I can't see any easy way to do this.
> >
> >Marc
> >
> >-----Original Message-----
> >From: Derrick Oswald [mailto:Der...@ro...]
> >Sent: Tuesday, May 27, 2003 2:39 PM
> >To: htm...@li...
> >Subject: Re: [Htmlparser-developer] RE: [Htmlparser-cvs]
> >htmlparser/src/org/htmlparser/scanners
> >CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
> >
> >Marc,
> >
> >The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure
> >text or remarks.
> >I guess the text scanner goes until it sees a <x... and then stops
> >to defer to a tag scanner. I hadn't thought about those in comments,
> >or about the \ end of lines.
> >
> >Perhaps, rather than write a new scanner, fix the StringScanner (the
> >remark scanner should be OK), so that it does the correct behaviour
> >when balance_quotes is true. Then the 'balance_quotes' flag could be
> >called 'strict_script' or something.
> >
> >Derrick
> >
> >Marc Novakowski wrote:
> > >
>
> _______________________________________________
> Htmlparser-developer mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-developer
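
The behaviour Marc attributes to the old ScriptScanner (find the end tag by an exact, case-insensitive string match, without looking for "<" in general and without deferring to other scanners) can be sketched roughly as below. This is only an illustration under assumptions: the class RawTagBody and its extract method are hypothetical names and are not part of htmlparser, the input is assumed to be available as a single String, and the interaction with NodeReader and the scanner classes is not shown.

```java
/**
 * Hypothetical helper (not part of htmlparser) illustrating the
 * "don't scan the body, just match the end tag" approach from the thread.
 */
final class RawTagBody {
    private RawTagBody() {}

    /**
     * Returns the text between bodyStart and the first case-insensitive
     * occurrence of "</tagName>", completely unchanged, or null if the
     * end tag never appears.
     *
     * @param source    the full page text (assumed to fit in one String)
     * @param bodyStart index just past the ">" of the opening tag
     * @param tagName   tag name without angle brackets, e.g. "script"
     */
    static String extract(String source, int bodyStart, String tagName) {
        String endTag = "</" + tagName + ">";
        int n = endTag.length();
        for (int i = bodyStart; i + n <= source.length(); i++) {
            // regionMatches(true, ...) compares while ignoring case, so
            // </SCRIPT>, </script> and </Script> all terminate the body.
            if (source.regionMatches(true, i, endTag, 0, n)) {
                // Nothing inside the body is interpreted: nested tags,
                // quotes, comments and backslash-newlines are kept as is.
                return source.substring(bodyStart, i);
            }
        }
        return null; // unterminated tag; the caller decides how to recover
    }
}
```

Because the body is never tokenized, embedded markup such as XML or "<b>" inside a string literal is not uppercased or auto-closed, and a trailing "\" before a newline is left alone. The trade-off is that the literal end tag cannot appear inside the body, for example inside a string or comment.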