From: vertigo <ve...@pa...> - 2002-07-09 16:58:02
Yes, but it is also one of the more complicated regions (a dark, shadowy corner) of the project. Regular expressions are not, as mentioned, the best way to parse HTML on a large scale. It can get way out of control. An actual parser is better for a number of reasons.

The one issue I have is the magnitude of writing an HTML parser. It isn't simple, especially when considering poorly written HTML. Now, from a programmer's perspective I think "dammit, why don't people write correct HTML?" From a customer's perspective, however, I think "We decided to write this site to be used only with Internet Explorer. 75% of the people out there use IE, most of the remaining 25% have IE available on their computer, and the rest we don't care about. The browser wars are over and Microsoft won. IE renders this code fine. Why can't your Filter handle it? I'm sure as hell not going to pay several thousand dollars to have that idiot coder come back and rewrite everything."

Remember, we have to catch everything that Explorer THINKS is valid HTML. Explorer thinks the following is valid code:

    <html head>
    <title> microsoft has a very robust parser.</title>
    </head>
    <script>
    function f() {
    var x = 10
    alert("I can't believe IE handles this." + x)
    }
    </script
    <body>
    IE is great when it comes to parsing HTML, much to the chagrin
    of many programmers.
    <br>
    <br>
    <input type="mutton" value="amazing" onClick="f()">
    </body
    </html>

Put into the project perspective, we have to write HTML parsers for each implementation, and this can be much more complicated than it first appears. We might not want to have limited support in the first release, and then improve it later. Cross-site scripting is a huge issue, and deserves to be handled in great detail.

nathan

On 9 Jul 2002, Gabriel Lawrence wrote:

> When I did a similar thing for a previous project we benchmarked writing
> our own specialized parser to find <> and manage what can be in a tag vs
> using regular expressions and found a dramatic improvement to using the
> non regular expression version. This was in Java code, so it could have
> been that the regular expression library we were using was not the best,
> but that may be something to consider also.
>
> As a side note I think this kind of functionality would be something
> great to put into the filters project....
>
> -gabe
>
> On Tue, 2002-07-09 at 05:41, Steven J. Sobol wrote:
> > On 8 Jul 2002, Gabriel Lawrence wrote:
> >
> > > Steve,
> > >
> > > You're going to find that there is a whole lot more that is evil than
> > > just script tags... What I'd suggest you do is instead parse for
> > > occurrences of <> and only allow things to appear in tags that you have
> > > a good list...
> >
> > Right. That's what I'm planning on doing. :) I have to figure out the
> > easiest way to do it using PHP and regular expressions.
> >
> > What I have done so far is just a stopgap for a few days until I can
> > continue working on the site.
> >
> > --
> > Steve Sobol, CTO JustThe.net LLC, Mentor On The Lake, OH 888.480.4NET
>
> _______________________________________________
> Owasp-input-api-developers mailing list
> Owa...@li...
> https://lists.sourceforge.net/lists/listinfo/owasp-input-api-developers
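
For illustration only, here is a rough sketch of the approach suggested in the quoted messages: scan by hand for < and >, let through only tags that are on a known-good list, and escape everything else. It is written in Python rather than the PHP or Java mentioned in the thread, and the names filter_markup and ALLOWED_TAGS, as well as the choice of allowed tags, are assumptions made up for this example. It drops all attributes from allowed tags and makes no attempt to mimic how IE recovers from broken markup, so treat it as a starting point, not a finished filter.

```python
from html import escape

# Hypothetical allow-list, chosen only for this example.
ALLOWED_TAGS = {"b", "i", "br", "p"}


def filter_markup(text, allowed=ALLOWED_TAGS):
    """Keep bare allow-listed tags; escape all other markup and text."""
    out = []
    i = 0
    n = len(text)
    while i < n:
        lt = text.find("<", i)
        if lt == -1:
            # No more tags: escape the remaining text and stop.
            out.append(escape(text[i:]))
            break
        # Escape the plain text that precedes the '<'.
        out.append(escape(text[i:lt]))
        gt = text.find(">", lt + 1)
        if gt == -1:
            # Unterminated tag: treat the rest as text, not markup.
            out.append(escape(text[lt:]))
            break
        inner = text[lt + 1:gt].strip()
        closing = inner.startswith("/")
        fields = inner.lstrip("/").split()
        # The first field is the tag name; tolerate forms like <br/>.
        name = fields[0].rstrip("/").lower() if fields else ""
        if name in allowed:
            # Re-emit a clean tag; attributes are deliberately discarded.
            out.append(f"</{name}>" if closing else f"<{name}>")
        else:
            # Not on the list: escape the whole <...> span.
            out.append(escape(text[lt:gt + 1]))
        i = gt + 1
    return "".join(out)


if __name__ == "__main__":
    print(filter_markup('<b>hi</b><script>alert(1)</script>'))
    print(filter_markup('</script <body>'))
```

Because allowed tags are rebuilt from the name alone, something like <BR onclick="x"> comes out as a plain <br>, and a malformed span such as </script <body> is escaped as a single unit. Which attributes to permit on which tags is the harder problem and is left out here.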