Thread: RE: [Htmlparser-developer] CompositeTagScanner - Some comments
From: <dha...@or...> - 2003-05-09 11:11:46
|
The concept of the MATCH_IDS and ENDERS arrays is great. A STARTERS array could also be useful in the correction procedure: if any tag from this array is encountered, automatic correction could be done to end the previous tag.

Regards,
Dhaval Udani
Senior Analyst, M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-28290019 Extn. 1457

-----Original Message-----
From: Udani, Dhaval H.
Sent: Friday, May 09, 2003 10:03 AM
To: htmlparser-developer
Cc: Udani, Dhaval H.
Subject: [Htmlparser-developer] CompositeTagScanner - Some comments

Hi,

A lot of thought has definitely gone into the design of the CompositeTagScanner. Some absolutely wonderful work has been done here. Somik had asked me to have a look at the code and review it. I just have one point for discussion.

The CompositeTagScanner has a provision to allow for nested children. However, I feel there are very few HTML tags which have children of the same type. By default the scanner allows nesting. I believe this behaviour should be disallowed by default.

my $0.02 ;)
dhaval

-------------------------------------------------------
Enterprise Linux Forum Conference & Expo, June 4-6, 2003, Santa Clara
The only event dedicated to issues related to Linux enterprise solutions
www.enterpriselinuxforum.com
_______________________________________________
Htmlparser-developer mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-developer
|
From: <dha...@or...> - 2003-05-09 11:24:58
|
Wow! I think this is a record number of mails I have sent to HTMLParser in a day. But some thoughts are evolving and I was wondering about them.

I wanted to know whether end tag correction takes place in the CompositeTagScanner when end of stream is encountered. If not, then I think that too should happen.

Regards,
Dhaval Udani
|
From: Somik R. <so...@ya...> - 2003-05-10 14:45:10
|
Dhaval Udani wrote:
> I wanted to know whether end tag correction takes place in the
> CompositeTagScanner when end of stream is encountered. If not then I think that
> too should happen.

Yes it does - I believe there are some tests for this.

Regards
Somik
|
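As an illustration of what end tag correction at end of stream means, here is a minimal sketch in Java. It is a toy model, not the HTMLParser implementation: the class name `EndOfStreamCorrection`, the method `closeOpenTags`, and the representation of tokens as plain strings are all assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical toy model, not the HTMLParser API: tokens are plain strings
// such as "<FORM>", "</FORM>" or text, and are assumed to be well formed.
final class EndOfStreamCorrection {

    /** Returns a copy of tokens with end tags appended for tags still open at end of stream. */
    static List<String> closeOpenTags(List<String> tokens) {
        Deque<String> open = new ArrayDeque<>();
        List<String> out = new ArrayList<>(tokens);
        for (String token : tokens) {
            if (token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1);
                // Pop only when the end tag matches the innermost open tag.
                if (name.equals(open.peek())) {
                    open.pop();
                }
            } else if (token.startsWith("<")) {
                open.push(token.substring(1, token.length() - 1));
            }
        }
        // End of stream reached: close whatever is still open, innermost first.
        while (!open.isEmpty()) {
            out.add("</" + open.pop() + ">");
        }
        return out;
    }
}
```

Calling `closeOpenTags` on the tokens `<FORM>`, `<SELECT>`, `text` appends `</SELECT>` and then `</FORM>`, so the innermost tag is closed first.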
From: Somik R. <so...@ya...> - 2003-05-10 16:04:37
|
> Wow!!! I think this is a record number of mails I must have sent to HTMLParser
> in a day.

This level of activity is also a record for the project. Thanks to you, Derrick, Marc...

Derrick, you're a terrific project lead, and I wish I had stepped down earlier. Thanks to you for the idea about the auto generation of the CVS log - I am going to use it in a project at work.

Dhaval --> keep your critiques (and thoughts) coming - they will go a long way in improving the parser.

Regards,
Somik
|
From: <dha...@or...> - 2003-05-12 08:49:46
|
Hi,

What follows is my understanding of the scanner. Do forgive me if I don't understand it correctly.

The MATCH_IDS array is used to "match" the tags that should be parsed by this scanner as tags of a particular type. For example, for an OPTION scanner the MATCH_IDS array would have OPTION as its member. This tells the scanning engine to create an instance of an OptionTag whenever OPTION is encountered. At the same time, the scanner has SELECT as a member of the ENDERS array. This means that whenever the end tag of SELECT, i.e. </SELECT>, is encountered, correction to close the OPTION tag should take place.

A STARTERS array would be useful to tell the scanner to also perform end tag correction when a particular start tag (say, another OPTION tag in this case) is encountered, as opposed to an end tag denoted by ENDERS. I hope I've been able to explain the need more clearly.

Regards,
Dhaval Udani

-----Original Message-----
From: somik [mailto:so...@ya...]
Sent: Saturday, May 10, 2003 8:13 PM
To: htmlparser-developer
Cc: somik
Subject: Re: [Htmlparser-developer] CompositeTagScanner - Some comments

STARTERS is actually what MATCH_IDS is for. Why do you want a separate STARTERS array?

Regards,
Somik
|
From: Somik R. <so...@ya...> - 2003-05-12 22:26:21
|
Dhaval Udani wrote:
> A STARTERS array would be useful to tell the scanner that when a particular
> start tag(say another OPTION tag in this case), as opposed to a end tag denoted
> by ENDERS, is encountered also perform end tag correction. I hope I've been
> able to explain the need more clearly.

I still don't follow why you would need a STARTERS array - if you encounter another OPTION tag, the behavior is determined by the boolean variable - to add or not to add children, and correction is done automatically. Or am I missing something?

Regards,
Somik
|
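The boolean switch described above can be illustrated with a small sketch. This is a toy model under stated assumptions, not the CompositeTagScanner code: the class `NestingFlagSketch`, the `allowNesting` parameter, and the string token format are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of the "allow nesting" switch; names and token
// format are illustrative and are not taken from the HTMLParser API.
final class NestingFlagSketch {

    /**
     * Walks a flat token list for one tag type. When allowNesting is false,
     * a second start tag of the same type first closes the one that is open.
     */
    static List<String> correct(List<String> tokens, String tagName, boolean allowNesting) {
        String start = "<" + tagName + ">";
        String end = "</" + tagName + ">";
        List<String> out = new ArrayList<>();
        int depth = 0; // number of currently open tags of this type
        for (String token : tokens) {
            if (token.equals(start)) {
                if (!allowNesting && depth > 0) {
                    out.add(end); // automatic correction: close the previous tag
                    depth--;
                }
                depth++;
            } else if (token.equals(end) && depth > 0) {
                depth--;
            }
            out.add(token);
        }
        return out;
    }
}
```

With `allowNesting` set to `false`, the tokens `<OPTION>`, `one`, `<OPTION>`, `two`, `</OPTION>` gain a `</OPTION>` before the second start tag. With `true`, the list is returned unchanged and the second OPTION would be treated as a child of the first, which is the capability Marc's custom XML tags rely on.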
From: Marc N. <ma...@ke...> - 2003-05-12 16:38:35
|
My $0.02: I don't mind if you make nesting disallowed by default, as long as you don't break the ability for the scanner to have nested tags of the same type. I extend CompositeTagScanner quite a bit in my own code to parse "custom" XML tags inside of an HTML page, and that code relies heavily on the current capability of CompositeTagScanner.

Marc
|
From: <dha...@or...> - 2003-05-13 03:41:54
|
Dhaval Udani wrote:
>> A STARTERS array would be useful to tell the scanner that when a particular
>> start tag(say another OPTION tag in this case), as opposed to a end tag denoted
>> by ENDERS, is encountered also perform end tag correction. I hope I've been
>> able to explain the need more clearly.

> I still don't follow why you would need a STARTERS array - if you encounter
> another OPTION tag, the behavior is determined by the boolean variable - to
> add or not to add children, and correction is done automatically. Or am I
> missing something ?

I used the OPTION tag as an example, and it was probably the wrong one. Say something like this is there:

    <P> blah blah blah
    <TABLE>

What I am saying is that in the P scanner, if TABLE is provided as a member of the STARTERS array, then a </P> will be inserted before the beginning of the <TABLE> tag. In essence, just as the ENDERS array looks for a tag of type EndTag, the STARTERS array would look for a start tag of the type defined.

I hope I've been clearer. Do let me know.

Dhaval
|
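The P/TABLE case above can be sketched as follows. The class `StartersSketch` and its fields are hypothetical: they model the proposed STARTERS array on a list of string tokens and do not exist in HTMLParser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the proposed STARTERS idea; the class and its fields are
// hypothetical and are not part of HTMLParser.
final class StartersSketch {
    private final String tagName;       // the tag this scanner handles, e.g. "P"
    private final Set<String> starters; // start tags that force it closed, e.g. "TABLE"

    StartersSketch(String tagName, String... starters) {
        this.tagName = tagName;
        this.starters = new HashSet<>(Arrays.asList(starters));
    }

    private static boolean isStartTag(String token) {
        return token.startsWith("<") && !token.startsWith("</") && token.endsWith(">");
    }

    /** Inserts a missing end tag before any start tag listed in starters. */
    List<String> correct(List<String> tokens) {
        List<String> out = new ArrayList<>();
        boolean open = false; // is a tag of our type currently open?
        for (String token : tokens) {
            if (token.equals("<" + tagName + ">")) {
                open = true;
            } else if (token.equals("</" + tagName + ">")) {
                open = false;
            } else if (open && isStartTag(token)
                    && starters.contains(token.substring(1, token.length() - 1))) {
                out.add("</" + tagName + ">"); // end tag correction
                open = false;
            }
            out.add(token);
        }
        return out;
    }
}
```

For example, `new StartersSketch("P", "TABLE").correct(...)` applied to `<P>`, `blah blah blah`, `<TABLE>` places `</P>` immediately before `<TABLE>`.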
From: Somik R. <so...@ya...> - 2003-05-13 11:46:56
|
> Say something like this is there:
>
> <P> blah blah blah
> <TABLE>
>
> Now what I am saying that in the P scanner, if TABLE is provided as a member of
> the STARTERS array then a </P> will be put up before the beginning of <TABLE>
> tag. In essence the way the ENDERS array looks for a tag of type EndTag,
> similarly STARTERS array would look for a start tag of the type defined.
>
> I hope I've been clearer. Do let me know.

I'm with you - initially I was checking for starters - changed that to enders. But if we must have both, then we must have both. Go for it. But also think deeply about the names - if it was confusing to you and me, it would be for others too...

Cheers,
Somik
|
From: Somik R. <so...@ya...> - 2003-05-13 12:42:09
|
One word of caution, ensure that you are not over-engineering. I didn't do it, because it wasn't needed. That has been the key to our approach - and enabled us to keep the parser really small.

Regards
Somik
|
From: <dha...@or...> - 2003-05-14 10:39:15
|
Yeah, I understand that. The problem is that no such situation can be envisaged at present. However, it may prove beneficial to other scanner writers if they ever come up with such scenarios.

What say, Derrick? Should we go ahead with something like this?

Regards,
Dhaval Udani
|
From: Somik R. <so...@ya...> - 2003-05-14 22:06:15
|
Dhaval Udani wrote:
> Yeah i understand that. The problem being that currently such a situation
> cannot be envisaged. However it may prove beneficial to other scanner writers
> if they ever come up with such scenarios.

That is the classic definition of over-engineering. Of course, you are not bound to not over-engineer, IMHO. :)

Regards,
Somik
|
From: <dha...@or...> - 2003-05-15 03:56:34
|
> That is the classic definition of over-engineering.. Of course, you are not
> bound to not over-engineer, IMHO. :)

ha..ha...ha :) Guess I've fallen in the trap :)
|
From: <dha...@or...> - 2003-05-15 05:44:59
|
> That is the classic definition of over-engineering.. Of course, you are not
> bound to not over-engineer, IMHO. :)

Well, the situation just came up.

Assume a <HEAD> tag which is not closed. It needs to be closed when a <BODY> tag is encountered. Hence BODY would be in the STARTERS array for HEAD.
|
From: Somik R. <so...@ya...> - 2003-05-16 02:17:17
|
Dhaval Udani wrote:
> Well the situation just came up.
>
> Assume a <HEAD> tag which is not closed. It needs to be closed when a <BODY>
> tag is encountered. Hence BODY would be in the STARTERS array for HEAD.

I don't see a HeadScanner. If <HEAD> is not closed, it should be no problem.

Regards,
Somik
|
From: <dha...@or...> - 2003-05-16 05:03:44
|
> I don't see a HeadScanner. If <HEAD> is not closed, it should be no problem.

I wrote a HEAD scanner and have sent it to Derrick for inclusion in the next version. In ENDERS I put BODY, and in END_TAG_ENDERS I put HTML. Works well.
|
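A rough sketch of how such a HEAD scanner configuration might behave is given below. It assumes, based on the names alone, that ENDERS lists start tags that close the open tag and END_TAG_ENDERS lists end tags that close it. The class `HeadScannerSketch` and the string token model are invented for illustration and are not the scanner that was sent to Derrick.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only. Assumption: ENDERS holds start tag names and
// END_TAG_ENDERS holds end tag names that force the open HEAD to close.
final class HeadScannerSketch {
    static final String MATCH_ID = "HEAD";
    static final Set<String> ENDERS = new HashSet<>(Arrays.asList("BODY"));
    static final Set<String> END_TAG_ENDERS = new HashSet<>(Arrays.asList("HTML"));

    /** Inserts </HEAD> before <BODY> or </HTML> when HEAD is still open. */
    static List<String> correct(List<String> tokens) {
        List<String> out = new ArrayList<>();
        boolean headOpen = false;
        for (String token : tokens) {
            boolean isEnd = token.startsWith("</");
            boolean isStart = !isEnd && token.startsWith("<");
            String name = isEnd ? token.substring(2, token.length() - 1)
                    : isStart ? token.substring(1, token.length() - 1)
                    : ""; // plain text has no tag name
            if (isStart && name.equals(MATCH_ID)) {
                headOpen = true;
            } else if (isEnd && name.equals(MATCH_ID)) {
                headOpen = false;
            } else if (headOpen && ((isStart && ENDERS.contains(name))
                    || (isEnd && END_TAG_ENDERS.contains(name)))) {
                out.add("</" + MATCH_ID + ">"); // end tag correction
                headOpen = false;
            }
            out.add(token);
        }
        return out;
    }
}
```

Under these assumptions, the two existing arrays cover both triggers: an unclosed `<HEAD>` followed by `<BODY>` is closed by ENDERS, and an unclosed `<HEAD>` followed directly by `</HTML>` is closed by END_TAG_ENDERS, so no separate STARTERS array is needed for this case.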
From: Somik R. <so...@ya...> - 2003-05-10 14:44:09
|
Dhaval Udani wrote:
> The concept of MATCH_IDS and ENDERS array is great. A STARTERS array could also
> be useful in the correction procedure. If any tag from this array is
> encountered automatic correction could be done to end the previous tag.

STARTERS is actually what MATCH_IDS is for. Why do you want a separate STARTERS array?

Regards,
Somik
|