Re: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTag
From: Derrick O. <Der...@ro...> - 2003-05-28 01:34:33
You may need to back out the change, or at a minimum get the old code by going back a version and putting it in your ScriptScanner base class.

I guess I screwed up. I saw your drop that allowed all the lines to be accumulated in a tag, and I thought the two scanners were very close then (apart from the tags-in-quotes thing). My only excuse is that it passed all the unit tests. Well, to be truthful, I changed two of the tests, but it was only extraneous newline stuff at the start and end of text.

The script scanner is breaking your code because of uppercasing tags (not just within comments) and removing newlines after \, right?

Marc Novakowski wrote:

>I just realized that it's more complicated than that (for me, at least). In my application that uses htmlparser, I am extending certain scanners and tags (such as ScriptScanner but mostly CompositeTagScanner) to allow for "custom" tags in an HTML page. When the "HTML + custom tags" are run through my custom parser, the custom tags are converted into an object model which is then turned into dynamic javascript code.
>
>Long story short: some of these custom tags (i.e. the ones that extend ScriptScanner) _absolutely_ need the inner contents of the tag to remain unchanged. Also, since it's not always Javascript that is inside of the tags, adding extra rules to ignore tags in comments or strings won't always work. For example, one tag allows for arbitrary XML innards. Currently, the scanner will UPPERCASE all tags inside unless they're in quotes (which messes up the XML).
>
>The old ScriptScanner did exactly what I needed: it didn't scan for tags at all. It just looked for the exact (case-insensitive) string match of the end tag. It didn't look for "<" and it didn't defer to scanners. I took a look at the current code and I can't see any easy way to do this.
>
>Marc
>
>-----Original Message-----
>From: Derrick Oswald [mailto:Der...@ro...]
>Sent: Tuesday, May 27, 2003 2:39 PM
>To: htm...@li...
>Subject: Re: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
>
>Marc,
>
>The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure text or remarks. I guess the text scanner goes until it sees a <x... and then stops to defer to a tag scanner. I hadn't thought about those in comments, or about the \ end of lines.
>
>Perhaps, rather than write a new scanner, fix the StringScanner (the remark scanner should be OK), so that it does the correct behaviour when balance_quotes is true. Then the 'balance_quotes' flag could be called 'strict_script' or something.
>
>Derrick
>
>Marc Novakowski wrote:
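
For reference, the behaviour Marc describes for the old ScriptScanner (take everything up to the first case-insensitive match of the end tag, without looking for "<" or deferring to other scanners) could be sketched roughly as below. This is only an illustration under stated assumptions, not the actual htmlparser code: the class RawContentSketch and the methods indexOfIgnoreCase and rawContents are hypothetical names, and the sketch works on a plain String rather than on the parser's reader.

```java
/**
 * Hypothetical helper, NOT part of htmlparser: returns the text between a
 * start tag and its end tag without scanning the contents for tags.
 */
final class RawContentSketch {

    private RawContentSketch() {}

    /**
     * Finds the first case-insensitive occurrence of endTag in text at or
     * after index 'from'. Returns -1 if there is none.
     */
    public static int indexOfIgnoreCase(String text, String endTag, int from) {
        int last = text.length() - endTag.length();
        for (int i = Math.max(from, 0); i <= last; i++) {
            // regionMatches(true, ...) compares the region ignoring case
            if (text.regionMatches(true, i, endTag, 0, endTag.length())) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Returns the characters from contentStart up to (not including) the
     * end tag, exactly as they appear in the page: no uppercasing, no
     * newline removal, no nested tag scanning. Returns null if the end tag
     * is never found.
     */
    public static String rawContents(String page, int contentStart, String endTag) {
        int end = indexOfIgnoreCase(page, endTag, contentStart);
        return end < 0 ? null : page.substring(contentStart, end);
    }
}
```

With this approach, nested markup such as arbitrary XML or a trailing \ before a newline passes through untouched. The trade-off is that the match also stops at an end tag that happens to appear inside a string literal or comment, since nothing inside the tag is interpreted.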