Thread: [Htmlparser-developer] Extending HasAttributeFilter
Brought to you by:
derrickoswald
From: Ian M. <ian...@gm...> - 2006-05-08 13:45:47
|
I would like to add the following functionality to HasAttributeFilter: 1) A boolean flag to set if the matching should be case-insensitive. I think this could be done with a boolean, one new constructor (String attribute, String value, boolean attribValue) and get/set method pair. 2) A flag to mark that the attribValue should be parsed as a regular expression (I don't really see the benefit of doing this with the tag name). This should also obey the case-sensitivity rule in (1). For this, I imagine a further constructor and get/set method pair. (a sample use case of this is "post\d+" to match post1, post22, post343545, etc). I'm willing to go ahead and code these, but I thought I should run this past you other developers too in case you dislike either idea. I'm also open to either: a) putting the regexp stuff in a subclass of HasAttributeFilter (but it seems a small enough change to be suitable as part of the class size-wise). b) changing the one/two boolean constructors to be one constructor that takes an INT flag, and add flags for the different combinations (e.g. CASE_SENSITIVE =3D 1, USE_REGEX =3D 2, so both together would be 3). This seems unnecessarily complex, and doing it the way I suggested above still allows for this in the future if desired. Thanks for your feedback, Ian Macfarlane |
From: Ian M. <ian...@gm...> - 2006-05-08 14:17:51
|
I would also like to be able to set the attribute as null but the attribute value as not-null. In this case, it should attempt to match all attributes against the attribute value. Please email me if you have any objections to this (or anything else). Thanks Ian Macfarlane On 5/8/06, Ian Macfarlane <ian...@gm...> wrote: > I would like to add the following functionality to HasAttributeFilter: > > 1) A boolean flag to set if the matching should be case-insensitive. I > think this could be done with a boolean, one new constructor (String > attribute, String value, boolean attribValue) and get/set method pair. > > 2) A flag to mark that the attribValue should be parsed as a regular > expression (I don't really see the benefit of doing this with the tag > name). This should also obey the case-sensitivity rule in (1). For > this, I imagine a further constructor and get/set method pair. (a > sample use case of this is "post\d+" to match post1, post22, > post343545, etc). > > > I'm willing to go ahead and code these, but I thought I should run > this past you other developers too in case you dislike either idea. > I'm also open to either: > > a) putting the regexp stuff in a subclass of HasAttributeFilter (but > it seems a small enough change to be suitable as part of the class > size-wise). > > b) changing the one/two boolean constructors to be one constructor > that takes an INT flag, and add flags for the different combinations > (e.g. CASE_SENSITIVE =3D 1, USE_REGEX =3D 2, so both together would be 3)= . > This seems unnecessarily complex, and doing it the way I suggested > above still allows for this in the future if desired. > > > Thanks for your feedback, > > Ian Macfarlane > |
From: Derrick O. <Der...@Ro...> - 2006-05-09 03:14:10
|
Ian, The conversion of case requires either an assumption of encoding or an explicit one. See for example the additional Locale property on StringFilter. The regex library requires or assumes a strategy, either MATCH, LOOKINGAT or FIND. See for example the additional int property on RegexFilter. I'm not sure how much could be gained by subclassing the existing HasAttributeFilter. Another strategy would be to add boolean properties for 'InText' (on by default), 'InAttributeName', and 'InAttributeValue' to the StringFilter and RegexFilter. Then of course you would need to add an AttributeName property. The attribute name being allowed to be null is a good idea, and would be the default if it's just not set, no need for an extra boolean 'nameIsNull' property. By the way, searching the tag name would come for free if the attributes checking loop started at index zero. That would mean adding three boolean and a string property to the two classes. I think these are differences enough to warrant new classes. In fact, maybe this should be one really prickly class called a SearchFilter that combines what StringFilter and RegexFilter do, plus the above. I don't think something can be case-insensitive and a regex filter though, so these aren't completely orthogonal. So maybe a 'type' property: straight string match case insensitive match - needs or assumes a Locale regex match - needs or assumes a strategy I leave it up to you though. Sounds like a fair piece of work. The extra constructors on the AndFilter and OrFilter are also good ideas. The XorFilter seems like a good thing to round out the logical operations. Would it also take an array of filters and only return true if just one is matched? The FilterBuilder would need to be altered to handle these changes of course, assuming this was a goal. This would be easier if there were just new SearchFilter and XorFilter classes rather than changes to the existing HasAttributeFilter, StringFilter, and RegexFilter (because new classes could be ignored, like the CssSelectorFilter is currently being). Derrick Ian Macfarlane wrote: > I would also like to be able to set the attribute as null but the > attribute value as not-null. In this case, it should attempt to match > all attributes against the attribute value. > > Please email me if you have any objections to this (or anything else). > > Thanks > > Ian Macfarlane > > On 5/8/06, Ian Macfarlane <ian...@gm...> wrote: > >> I would like to add the following functionality to HasAttributeFilter: >> >> 1) A boolean flag to set if the matching should be case-insensitive. I >> think this could be done with a boolean, one new constructor (String >> attribute, String value, boolean attribValue) and get/set method pair. >> >> 2) A flag to mark that the attribValue should be parsed as a regular >> expression (I don't really see the benefit of doing this with the tag >> name). This should also obey the case-sensitivity rule in (1). For >> this, I imagine a further constructor and get/set method pair. (a >> sample use case of this is "post\d+" to match post1, post22, >> post343545, etc). >> >> >> I'm willing to go ahead and code these, but I thought I should run >> this past you other developers too in case you dislike either idea. >> I'm also open to either: >> >> a) putting the regexp stuff in a subclass of HasAttributeFilter (but >> it seems a small enough change to be suitable as part of the class >> size-wise). >> >> b) changing the one/two boolean constructors to be one constructor >> that takes an INT flag, and add flags for the different combinations >> (e.g. CASE_SENSITIVE = 1, USE_REGEX = 2, so both together would be 3). >> This seems unnecessarily complex, and doing it the way I suggested >> above still allows for this in the future if desired. >> >> >> Thanks for your feedback, >> >> Ian Macfarlane >> > > > ------------------------------------------------------- > Using Tomcat but need to do more? Need to support web services, security? > Get stuff done quickly with pre-integrated technology to make your job > easier > Download IBM WebSphere Application Server v.1.0.1 based on Apache > Geronimo > http://sel.as-us.falkag.net/sel?cmd=k&kid0709&bid&3057&dat1642 > _______________________________________________ > Htmlparser-developer mailing list > Htm...@li... > https://lists.sourceforge.net/lists/listinfo/htmlparser-developer > > |
From: Ian M. <ian...@gm...> - 2006-05-09 11:56:22
|
I think the existing StringFilter and RegexFilter, as they apply to Text nodes only at the moment, should probably be left alone. Whichever class we apply this to should handle tags only. One set for text, one for tags (although I also think that the two text-searching ones should possible be made one class). Now I look through the existing filters, it strikes me that we might already have some of this in place in the form of LinkStringFilter / LinkRegexFilter. These basically do the scanning based on a tag and attribute, but restricted to LinkTag tags. I think combining these with HasAttributeFilter pretty much gives us what we want (indeed we could in theory deprecate the two Link*Filter classes in favour of this - not sure if we'd want to do that or not). Also, you can indeed have both regex and case insensitive - it's built into Java's Pattern, and is actually used by LinkRegexFilter [sample Pattern.compile (regexPattern, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)]. So after a fair bit of consideration and changing my mind several times, I've settled back on extending HasAttributeFilter. This is the list of things we need to be able to tell the filter: Attribute Value Attrib value case sensitive? Locale (but not for regex as use Pattern.UNICODE_CASE) Regex on/off Regex type (MATCH, LOOKINGAT, FIND) You could certainly combine the regex on/off with regex type in the constructor, e.g.: HasAttributeFilter(String attribute) HasAttributeFilter(String attribute, String value) HasAttributeFilter(String attribute, String value, <stuff for case sensitive here>) HasAttributeFilter(String attribute, String value, <stuff for case sensitive here>, int regexType) Where regexType =3D 0 (OFF), 1 (MATCH), etc. Then by default for the others it would be 0 (OFF). I don't think that's too confusing. The issue is how to fold in the Locale into the constructor, as it's not used by the regex (the regex either uses US-ASCII or Unicode, depending if Pattern.UNICODE_CASE is set). Also regexes can include locale-specific sections in them (as well as, of course, the usual case-insensitive stuff). So I think we want to mutually exclude passing values for locale and regex: The non regex constructors: HasAttributeFilter(String attribute) HasAttributeFilter(String attribute, String value) HasAttributeFilter(String attribute, String value, boolean caseSensitive) HasAttributeFilter(String attribute, String value, boolean caseSensitive, String locale) or alternatively: HasAttributeFilter(String attribute) HasAttributeFilter(String attribute, String value) HasAttributeFilter(String attribute, String value, String caseInsensitiveLo= cale) [for the last one therefore we turn on case-insensitivity] and for the regex-accepting constructors either: HasAttributeFilter(String attribute, String value, int regexType) HasAttributeFilter(String attribute, String value, boolean caseSensitive, int regexType) or alternatively: HasAttributeFilter(String attribute, String value, int regexType) HasAttributeFilter(String attribute, String value, int regexType, int regexFlags) (e.g. let them pass Pattern.CASE_INSENSITIVE etc). I personally favour choice number 1 for the non-regex constructors and number 2 for the regex constructors, so I think it the list of constructors should look like: HasAttributeFilter(String attribute) HasAttributeFilter(String attribute, String value) HasAttributeFilter(String attribute, String value, boolean caseSensitive) HasAttributeFilter(String attribute, String value, boolean caseSensitive, String locale) HasAttributeFilter(String attribute, String value, int regexType) HasAttributeFilter(String attribute, String value, int regexType, int regexFlags) I'd love some feedback please :) ---------------------------------------------------------------------------= ------ AndFilter/OrFilter taking arrays - this seems like you'd like it, so if Sourceforge CVS will stop being broken I might try and add it. In the case of less than 2 filters being added, I'm in favour of throwing an IllegalArgumentException - does that sound reasonable? ---------------------------------------------------------------------------= ------ XorFilter - I'm not 100% sure how the XOR logic should work if it has more than two filters. According to the ever reliable (ahem!) Wikipedia http://en.wikipedia.org/wiki/XOR XOR over multiple entries "is true iff an odd number of the variables are true". So "true true false" is false, and "false true false" is true. Ian On 5/9/06, Derrick Oswald <Der...@ro...> wrote: > Ian, > > The conversion of case requires either an assumption of encoding or an > explicit one. > See for example the additional Locale property on StringFilter. > > The regex library requires or assumes a strategy, either MATCH, > LOOKINGAT or FIND. > See for example the additional int property on RegexFilter. > > I'm not sure how much could be gained by subclassing the existing > HasAttributeFilter. > > Another strategy would be to add boolean properties for 'InText' (on by > default), 'InAttributeName', and 'InAttributeValue' to the StringFilter > and RegexFilter. Then of course you would need to add an AttributeName > property. The attribute name being allowed to be null is a good idea, > and would be the default if it's just not set, no need for an extra > boolean 'nameIsNull' property. By the way, searching the tag name would > come for free if the attributes checking loop started at index zero. > That would mean adding three boolean and a string property to the two > classes. I think these are differences enough to warrant new classes. In > fact, maybe this should be one really prickly class called a > SearchFilter that combines what StringFilter and RegexFilter do, plus > the above. I don't think something can be case-insensitive and a regex > filter though, so these aren't completely orthogonal. So maybe a 'type' > property: > straight string match > case insensitive match - needs or assumes a Locale > regex match - needs or assumes a strategy > I leave it up to you though. Sounds like a fair piece of work. > > The extra constructors on the AndFilter and OrFilter are also good ideas. > > The XorFilter seems like a good thing to round out the logical operations= . > Would it also take an array of filters and only return true if just one > is matched? > > The FilterBuilder would need to be altered to handle these changes of > course, assuming this was a goal. > This would be easier if there were just new SearchFilter and XorFilter > classes rather than changes to the existing HasAttributeFilter, > StringFilter, and RegexFilter (because new classes could be ignored, > like the CssSelectorFilter is currently being). > > Derrick > > Ian Macfarlane wrote: > > > I would also like to be able to set the attribute as null but the > > attribute value as not-null. In this case, it should attempt to match > > all attributes against the attribute value. > > > > Please email me if you have any objections to this (or anything else). > > > > Thanks > > > > Ian Macfarlane > > > > On 5/8/06, Ian Macfarlane <ian...@gm...> wrote: > > > >> I would like to add the following functionality to HasAttributeFilter: > >> > >> 1) A boolean flag to set if the matching should be case-insensitive. I > >> think this could be done with a boolean, one new constructor (String > >> attribute, String value, boolean attribValue) and get/set method pair. > >> > >> 2) A flag to mark that the attribValue should be parsed as a regular > >> expression (I don't really see the benefit of doing this with the tag > >> name). This should also obey the case-sensitivity rule in (1). For > >> this, I imagine a further constructor and get/set method pair. (a > >> sample use case of this is "post\d+" to match post1, post22, > >> post343545, etc). > >> > >> > >> I'm willing to go ahead and code these, but I thought I should run > >> this past you other developers too in case you dislike either idea. > >> I'm also open to either: > >> > >> a) putting the regexp stuff in a subclass of HasAttributeFilter (but > >> it seems a small enough change to be suitable as part of the class > >> size-wise). > >> > >> b) changing the one/two boolean constructors to be one constructor > >> that takes an INT flag, and add flags for the different combinations > >> (e.g. CASE_SENSITIVE =3D 1, USE_REGEX =3D 2, so both together would be= 3). > >> This seems unnecessarily complex, and doing it the way I suggested > >> above still allows for this in the future if desired. > >> > >> > >> Thanks for your feedback, > >> > >> Ian Macfarlane > >> > > > > > > ------------------------------------------------------- > > Using Tomcat but need to do more? Need to support web services, securit= y? > > Get stuff done quickly with pre-integrated technology to make your job > > easier > > Download IBM WebSphere Application Server v.1.0.1 based on Apache > > Geronimo > > http://sel.as-us.falkag.net/sel?cmd=3Dk&kid=120709&bid&3057&dat=121642 > > _______________________________________________ > > Htmlparser-developer mailing list > > Htm...@li... > > https://lists.sourceforge.net/lists/listinfo/htmlparser-developer > > > > > > > > ------------------------------------------------------- > Using Tomcat but need to do more? Need to support web services, security? > Get stuff done quickly with pre-integrated technology to make your job ea= sier > Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronim= o > http://sel.as-us.falkag.net/sel?cmd=3Dlnk&kid=3D120709&bid=3D263057&dat= =3D121642 > _______________________________________________ > Htmlparser-developer mailing list > Htm...@li... > https://lists.sourceforge.net/lists/listinfo/htmlparser-developer > |
From: Ian M. <ian...@gm...> - 2006-05-16 13:34:20
|
I've just committed the the new constructors for OrFilter and a new class XorFilter, as these were simple useful additions and non-controversial. I would still love some feedback about the (revised) proposed changes to HasAttributeFilter (see below in the email), as I don't want to write it only for more senior devs to afterwards decide to change it back because they didn't look at it before. If any of it isn't clear, please ask. Ian On 09/05/06, Ian Macfarlane <ian...@gm...> wrote: > I think the existing StringFilter and RegexFilter, as they apply to > Text nodes only at the moment, should probably be left alone. > Whichever class we apply this to should handle tags only. One set for > text, one for tags (although I also think that the two text-searching > ones should possible be made one class). > > Now I look through the existing filters, it strikes me that we might > already have some of this in place in the form of LinkStringFilter / > LinkRegexFilter. These basically do the scanning based on a tag and > attribute, but restricted to LinkTag tags. I think combining these > with HasAttributeFilter pretty much gives us what we want (indeed we > could in theory deprecate the two Link*Filter classes in favour of > this - not sure if we'd want to do that or not). > > Also, you can indeed have both regex and case insensitive - it's built > into Java's Pattern, and is actually used by LinkRegexFilter [sample > Pattern.compile (regexPattern, Pattern.CASE_INSENSITIVE | > Pattern.UNICODE_CASE)]. > > So after a fair bit of consideration and changing my mind several > times, I've settled back on extending HasAttributeFilter. > > This is the list of things we need to be able to tell the filter: > > Attribute > Value > Attrib value case sensitive? > Locale (but not for regex as use Pattern.UNICODE_CASE) > Regex on/off > Regex type (MATCH, LOOKINGAT, FIND) > > You could certainly combine the regex on/off with regex type in the > constructor, e.g.: > HasAttributeFilter(String attribute) > HasAttributeFilter(String attribute, String value) > HasAttributeFilter(String attribute, String value, <stuff for case > sensitive here>) > HasAttributeFilter(String attribute, String value, <stuff for case > sensitive here>, int regexType) > > Where regexType =3D 0 (OFF), 1 (MATCH), etc. Then by default for the > others it would be 0 (OFF). I don't think that's too confusing. > > The issue is how to fold in the Locale into the constructor, as it's > not used by the regex (the regex either uses US-ASCII or Unicode, > depending if Pattern.UNICODE_CASE is set). Also regexes can include > locale-specific sections in them (as well as, of course, the usual > case-insensitive stuff). So I think we want to mutually exclude > passing values for locale and regex: > > The non regex constructors: > > HasAttributeFilter(String attribute) > HasAttributeFilter(String attribute, String value) > HasAttributeFilter(String attribute, String value, boolean caseSensitive) > HasAttributeFilter(String attribute, String value, boolean > caseSensitive, String locale) > > or alternatively: > > HasAttributeFilter(String attribute) > HasAttributeFilter(String attribute, String value) > HasAttributeFilter(String attribute, String value, String caseInsensitive= Locale) > [for the last one therefore we turn on case-insensitivity] > > and for the regex-accepting constructors either: > > HasAttributeFilter(String attribute, String value, int regexType) > HasAttributeFilter(String attribute, String value, boolean > caseSensitive, int regexType) > > or alternatively: > > HasAttributeFilter(String attribute, String value, int regexType) > HasAttributeFilter(String attribute, String value, int regexType, int > regexFlags) (e.g. let them pass Pattern.CASE_INSENSITIVE etc). > > > I personally favour choice number 1 for the non-regex constructors and > number 2 for the regex constructors, so I think it the list of > constructors should look like: > > HasAttributeFilter(String attribute) > HasAttributeFilter(String attribute, String value) > HasAttributeFilter(String attribute, String value, boolean caseSensitive) > HasAttributeFilter(String attribute, String value, boolean > caseSensitive, String locale) > HasAttributeFilter(String attribute, String value, int regexType) > HasAttributeFilter(String attribute, String value, int regexType, int > regexFlags) > > I'd love some feedback please :) > > -------------------------------------------------------------------------= -------- > > AndFilter/OrFilter taking arrays - this seems like you'd like it, so > if Sourceforge CVS will stop being broken I might try and add it. In > the case of less than 2 filters being added, I'm in favour of throwing > an IllegalArgumentException - does that sound reasonable? > > -------------------------------------------------------------------------= -------- > > XorFilter - I'm not 100% sure how the XOR logic should work if it has > more than two filters. According to the ever reliable (ahem!) > Wikipedia http://en.wikipedia.org/wiki/XOR XOR over multiple entries > "is true iff an odd number of the variables are true". So "true true > false" is false, and "false true false" is true. > > Ian > > On 5/9/06, Derrick Oswald <Der...@ro...> wrote: > > Ian, > > > > The conversion of case requires either an assumption of encoding or an > > explicit one. > > See for example the additional Locale property on StringFilter. > > > > The regex library requires or assumes a strategy, either MATCH, > > LOOKINGAT or FIND. > > See for example the additional int property on RegexFilter. > > > > I'm not sure how much could be gained by subclassing the existing > > HasAttributeFilter. > > > > Another strategy would be to add boolean properties for 'InText' (on by > > default), 'InAttributeName', and 'InAttributeValue' to the StringFilter > > and RegexFilter. Then of course you would need to add an AttributeName > > property. The attribute name being allowed to be null is a good idea, > > and would be the default if it's just not set, no need for an extra > > boolean 'nameIsNull' property. By the way, searching the tag name woul= d > > come for free if the attributes checking loop started at index zero. > > That would mean adding three boolean and a string property to the two > > classes. I think these are differences enough to warrant new classes. I= n > > fact, maybe this should be one really prickly class called a > > SearchFilter that combines what StringFilter and RegexFilter do, plus > > the above. I don't think something can be case-insensitive and a regex > > filter though, so these aren't completely orthogonal. So maybe a 'type= ' > > property: > > straight string match > > case insensitive match - needs or assumes a Locale > > regex match - needs or assumes a strategy > > I leave it up to you though. Sounds like a fair piece of work. > > > > The extra constructors on the AndFilter and OrFilter are also good idea= s. > > > > The XorFilter seems like a good thing to round out the logical operatio= ns. > > Would it also take an array of filters and only return true if just one > > is matched? > > > > The FilterBuilder would need to be altered to handle these changes of > > course, assuming this was a goal. > > This would be easier if there were just new SearchFilter and XorFilter > > classes rather than changes to the existing HasAttributeFilter, > > StringFilter, and RegexFilter (because new classes could be ignored, > > like the CssSelectorFilter is currently being). > > > > Derrick > > > > Ian Macfarlane wrote: > > > > > I would also like to be able to set the attribute as null but the > > > attribute value as not-null. In this case, it should attempt to match > > > all attributes against the attribute value. > > > > > > Please email me if you have any objections to this (or anything else)= . > > > > > > Thanks > > > > > > Ian Macfarlane > > > > > > On 5/8/06, Ian Macfarlane <ian...@gm...> wrote: > > > > > >> I would like to add the following functionality to HasAttributeFilte= r: > > >> > > >> 1) A boolean flag to set if the matching should be case-insensitive.= I > > >> think this could be done with a boolean, one new constructor (String > > >> attribute, String value, boolean attribValue) and get/set method pai= r. > > >> > > >> 2) A flag to mark that the attribValue should be parsed as a regular > > >> expression (I don't really see the benefit of doing this with the ta= g > > >> name). This should also obey the case-sensitivity rule in (1). For > > >> this, I imagine a further constructor and get/set method pair. (a > > >> sample use case of this is "post\d+" to match post1, post22, > > >> post343545, etc). > > >> > > >> > > >> I'm willing to go ahead and code these, but I thought I should run > > >> this past you other developers too in case you dislike either idea. > > >> I'm also open to either: > > >> > > >> a) putting the regexp stuff in a subclass of HasAttributeFilter (but > > >> it seems a small enough change to be suitable as part of the class > > >> size-wise). > > >> > > >> b) changing the one/two boolean constructors to be one constructor > > >> that takes an INT flag, and add flags for the different combinations > > >> (e.g. CASE_SENSITIVE =3D 1, USE_REGEX =3D 2, so both together would = be 3). > > >> This seems unnecessarily complex, and doing it the way I suggested > > >> above still allows for this in the future if desired. > > >> > > >> > > >> Thanks for your feedback, > > >> > > >> Ian Macfarlane |
From: Ian M. <ian...@gm...> - 2006-05-16 13:40:41
|
Oops! I missed mentioning that I added the new constructor AndFilter as well as OrFilter. Ian On 16/05/06, Ian Macfarlane <ian...@gm...> wrote: > I've just committed the the new constructors for OrFilter and a new > class XorFilter, as these were simple useful additions and > non-controversial. > > I would still love some feedback about the (revised) proposed changes > to HasAttributeFilter (see below in the email), as I don't want to > write it only for more senior devs to afterwards decide to change it > back because they didn't look at it before. If any of it isn't clear, > please ask. > > Ian > > On 09/05/06, Ian Macfarlane <ian...@gm...> wrote: > > I think the existing StringFilter and RegexFilter, as they apply to > > Text nodes only at the moment, should probably be left alone. > > Whichever class we apply this to should handle tags only. One set for > > text, one for tags (although I also think that the two text-searching > > ones should possible be made one class). > > > > Now I look through the existing filters, it strikes me that we might > > already have some of this in place in the form of LinkStringFilter / > > LinkRegexFilter. These basically do the scanning based on a tag and > > attribute, but restricted to LinkTag tags. I think combining these > > with HasAttributeFilter pretty much gives us what we want (indeed we > > could in theory deprecate the two Link*Filter classes in favour of > > this - not sure if we'd want to do that or not). > > > > Also, you can indeed have both regex and case insensitive - it's built > > into Java's Pattern, and is actually used by LinkRegexFilter [sample > > Pattern.compile (regexPattern, Pattern.CASE_INSENSITIVE | > > Pattern.UNICODE_CASE)]. > > > > So after a fair bit of consideration and changing my mind several > > times, I've settled back on extending HasAttributeFilter. > > > > This is the list of things we need to be able to tell the filter: > > > > Attribute > > Value > > Attrib value case sensitive? > > Locale (but not for regex as use Pattern.UNICODE_CASE) > > Regex on/off > > Regex type (MATCH, LOOKINGAT, FIND) > > > > You could certainly combine the regex on/off with regex type in the > > constructor, e.g.: > > HasAttributeFilter(String attribute) > > HasAttributeFilter(String attribute, String value) > > HasAttributeFilter(String attribute, String value, <stuff for case > > sensitive here>) > > HasAttributeFilter(String attribute, String value, <stuff for case > > sensitive here>, int regexType) > > > > Where regexType =3D 0 (OFF), 1 (MATCH), etc. Then by default for the > > others it would be 0 (OFF). I don't think that's too confusing. > > > > The issue is how to fold in the Locale into the constructor, as it's > > not used by the regex (the regex either uses US-ASCII or Unicode, > > depending if Pattern.UNICODE_CASE is set). Also regexes can include > > locale-specific sections in them (as well as, of course, the usual > > case-insensitive stuff). So I think we want to mutually exclude > > passing values for locale and regex: > > > > The non regex constructors: > > > > HasAttributeFilter(String attribute) > > HasAttributeFilter(String attribute, String value) > > HasAttributeFilter(String attribute, String value, boolean caseSensitiv= e) > > HasAttributeFilter(String attribute, String value, boolean > > caseSensitive, String locale) > > > > or alternatively: > > > > HasAttributeFilter(String attribute) > > HasAttributeFilter(String attribute, String value) > > HasAttributeFilter(String attribute, String value, String caseInsensiti= veLocale) > > [for the last one therefore we turn on case-insensitivity] > > > > and for the regex-accepting constructors either: > > > > HasAttributeFilter(String attribute, String value, int regexType) > > HasAttributeFilter(String attribute, String value, boolean > > caseSensitive, int regexType) > > > > or alternatively: > > > > HasAttributeFilter(String attribute, String value, int regexType) > > HasAttributeFilter(String attribute, String value, int regexType, int > > regexFlags) (e.g. let them pass Pattern.CASE_INSENSITIVE etc). > > > > > > I personally favour choice number 1 for the non-regex constructors and > > number 2 for the regex constructors, so I think it the list of > > constructors should look like: > > > > HasAttributeFilter(String attribute) > > HasAttributeFilter(String attribute, String value) > > HasAttributeFilter(String attribute, String value, boolean caseSensitiv= e) > > HasAttributeFilter(String attribute, String value, boolean > > caseSensitive, String locale) > > HasAttributeFilter(String attribute, String value, int regexType) > > HasAttributeFilter(String attribute, String value, int regexType, int > > regexFlags) > > > > I'd love some feedback please :) > > > > -----------------------------------------------------------------------= ---------- > > > > AndFilter/OrFilter taking arrays - this seems like you'd like it, so > > if Sourceforge CVS will stop being broken I might try and add it. In > > the case of less than 2 filters being added, I'm in favour of throwing > > an IllegalArgumentException - does that sound reasonable? > > > > -----------------------------------------------------------------------= ---------- > > > > XorFilter - I'm not 100% sure how the XOR logic should work if it has > > more than two filters. According to the ever reliable (ahem!) > > Wikipedia http://en.wikipedia.org/wiki/XOR XOR over multiple entries > > "is true iff an odd number of the variables are true". So "true true > > false" is false, and "false true false" is true. > > > > Ian > > > > On 5/9/06, Derrick Oswald <Der...@ro...> wrote: > > > Ian, > > > > > > The conversion of case requires either an assumption of encoding or a= n > > > explicit one. > > > See for example the additional Locale property on StringFilter. > > > > > > The regex library requires or assumes a strategy, either MATCH, > > > LOOKINGAT or FIND. > > > See for example the additional int property on RegexFilter. > > > > > > I'm not sure how much could be gained by subclassing the existing > > > HasAttributeFilter. > > > > > > Another strategy would be to add boolean properties for 'InText' (on = by > > > default), 'InAttributeName', and 'InAttributeValue' to the StringFilt= er > > > and RegexFilter. Then of course you would need to add an AttributeNa= me > > > property. The attribute name being allowed to be null is a good idea, > > > and would be the default if it's just not set, no need for an extra > > > boolean 'nameIsNull' property. By the way, searching the tag name wo= uld > > > come for free if the attributes checking loop started at index zero. > > > That would mean adding three boolean and a string property to the two > > > classes. I think these are differences enough to warrant new classes.= In > > > fact, maybe this should be one really prickly class called a > > > SearchFilter that combines what StringFilter and RegexFilter do, plus > > > the above. I don't think something can be case-insensitive and a rege= x > > > filter though, so these aren't completely orthogonal. So maybe a 'ty= pe' > > > property: > > > straight string match > > > case insensitive match - needs or assumes a Locale > > > regex match - needs or assumes a strategy > > > I leave it up to you though. Sounds like a fair piece of work. > > > > > > The extra constructors on the AndFilter and OrFilter are also good id= eas. > > > > > > The XorFilter seems like a good thing to round out the logical operat= ions. > > > Would it also take an array of filters and only return true if just o= ne > > > is matched? > > > > > > The FilterBuilder would need to be altered to handle these changes of > > > course, assuming this was a goal. > > > This would be easier if there were just new SearchFilter and XorFilte= r > > > classes rather than changes to the existing HasAttributeFilter, > > > StringFilter, and RegexFilter (because new classes could be ignored, > > > like the CssSelectorFilter is currently being). > > > > > > Derrick > > > > > > Ian Macfarlane wrote: > > > > > > > I would also like to be able to set the attribute as null but the > > > > attribute value as not-null. In this case, it should attempt to mat= ch > > > > all attributes against the attribute value. > > > > > > > > Please email me if you have any objections to this (or anything els= e). > > > > > > > > Thanks > > > > > > > > Ian Macfarlane > > > > > > > > On 5/8/06, Ian Macfarlane <ian...@gm...> wrote: > > > > > > > >> I would like to add the following functionality to HasAttributeFil= ter: > > > >> > > > >> 1) A boolean flag to set if the matching should be case-insensitiv= e. I > > > >> think this could be done with a boolean, one new constructor (Stri= ng > > > >> attribute, String value, boolean attribValue) and get/set method p= air. > > > >> > > > >> 2) A flag to mark that the attribValue should be parsed as a regul= ar > > > >> expression (I don't really see the benefit of doing this with the = tag > > > >> name). This should also obey the case-sensitivity rule in (1). For > > > >> this, I imagine a further constructor and get/set method pair. (a > > > >> sample use case of this is "post\d+" to match post1, post22, > > > >> post343545, etc). > > > >> > > > >> > > > >> I'm willing to go ahead and code these, but I thought I should run > > > >> this past you other developers too in case you dislike either idea= . > > > >> I'm also open to either: > > > >> > > > >> a) putting the regexp stuff in a subclass of HasAttributeFilter (b= ut > > > >> it seems a small enough change to be suitable as part of the class > > > >> size-wise). > > > >> > > > >> b) changing the one/two boolean constructors to be one constructor > > > >> that takes an INT flag, and add flags for the different combinatio= ns > > > >> (e.g. CASE_SENSITIVE =3D 1, USE_REGEX =3D 2, so both together woul= d be 3). > > > >> This seems unnecessarily complex, and doing it the way I suggested > > > >> above still allows for this in the future if desired. > > > >> > > > >> > > > >> Thanks for your feedback, > > > >> > > > >> Ian Macfarlane > |
From: Derrick O. <der...@ro...> - 2006-05-16 14:01:39
|
It looks good to me. I might be tempted to pass a Locale object rather than a String in the constructors, like the existing StringFilter constructors. Ian Macfarlane <ian...@gm...> wrote: I've just committed the the new constructors for OrFilter and a new class XorFilter, as these were simple useful additions and non-controversial. I would still love some feedback about the (revised) proposed changes to HasAttributeFilter (see below in the email), as I don't want to write it only for more senior devs to afterwards decide to change it back because they didn't look at it before. If any of it isn't clear, please ask. Ian On 09/05/06, Ian Macfarlane wrote: > I think the existing StringFilter and RegexFilter, as they apply to > Text nodes only at the moment, should probably be left alone. > Whichever class we apply this to should handle tags only. One set for > text, one for tags (although I also think that the two text-searching > ones should possible be made one class). > > Now I look through the existing filters, it strikes me that we might > already have some of this in place in the form of LinkStringFilter / > LinkRegexFilter. These basically do the scanning based on a tag and > attribute, but restricted to LinkTag tags. I think combining these > with HasAttributeFilter pretty much gives us what we want (indeed we > could in theory deprecate the two Link*Filter classes in favour of > this - not sure if we'd want to do that or not). > > Also, you can indeed have both regex and case insensitive - it's built > into Java's Pattern, and is actually used by LinkRegexFilter [sample > Pattern.compile (regexPattern, Pattern.CASE_INSENSITIVE | > Pattern.UNICODE_CASE)]. > > So after a fair bit of consideration and changing my mind several > times, I've settled back on extending HasAttributeFilter. > > This is the list of things we need to be able to tell the filter: > > Attribute > Value > Attrib value case sensitive? > Locale (but not for regex as use Pattern.UNICODE_CASE) > Regex on/off > Regex type (MATCH, LOOKINGAT, FIND) > > You could certainly combine the regex on/off with regex type in the > constructor, e.g.: > HasAttributeFilter(String attribute) > HasAttributeFilter(String attribute, String value) > HasAttributeFilter(String attribute, String value, > sensitive here>) > HasAttributeFilter(String attribute, String value, > sensitive here>, int regexType) > > Where regexType = 0 (OFF), 1 (MATCH), etc. Then by default for the > others it would be 0 (OFF). I don't think that's too confusing. > > The issue is how to fold in the Locale into the constructor, as it's > not used by the regex (the regex either uses US-ASCII or Unicode, > depending if Pattern.UNICODE_CASE is set). Also regexes can include > locale-specific sections in them (as well as, of course, the usual > case-insensitive stuff). So I think we want to mutually exclude > passing values for locale and regex: > > The non regex constructors: > > HasAttributeFilter(String attribute) > HasAttributeFilter(String attribute, String value) > HasAttributeFilter(String attribute, String value, boolean caseSensitive) > HasAttributeFilter(String attribute, String value, boolean > caseSensitive, String locale) > > or alternatively: > > HasAttributeFilter(String attribute) > HasAttributeFilter(String attribute, String value) > HasAttributeFilter(String attribute, String value, String caseInsensitiveLocale) > [for the last one therefore we turn on case-insensitivity] > > and for the regex-accepting constructors either: > > HasAttributeFilter(String attribute, String value, int regexType) > HasAttributeFilter(String attribute, String value, boolean > caseSensitive, int regexType) > > or alternatively: > > HasAttributeFilter(String attribute, String value, int regexType) > HasAttributeFilter(String attribute, String value, int regexType, int > regexFlags) (e.g. let them pass Pattern.CASE_INSENSITIVE etc). > > > I personally favour choice number 1 for the non-regex constructors and > number 2 for the regex constructors, so I think it the list of > constructors should look like: > > HasAttributeFilter(String attribute) > HasAttributeFilter(String attribute, String value) > HasAttributeFilter(String attribute, String value, boolean caseSensitive) > HasAttributeFilter(String attribute, String value, boolean > caseSensitive, String locale) > HasAttributeFilter(String attribute, String value, int regexType) > HasAttributeFilter(String attribute, String value, int regexType, int > regexFlags) > > I'd love some feedback please :) > > --------------------------------------------------------------------------------- > > AndFilter/OrFilter taking arrays - this seems like you'd like it, so > if Sourceforge CVS will stop being broken I might try and add it. In > the case of less than 2 filters being added, I'm in favour of throwing > an IllegalArgumentException - does that sound reasonable? > > --------------------------------------------------------------------------------- > > XorFilter - I'm not 100% sure how the XOR logic should work if it has > more than two filters. According to the ever reliable (ahem!) > Wikipedia http://en.wikipedia.org/wiki/XOR XOR over multiple entries > "is true iff an odd number of the variables are true". So "true true > false" is false, and "false true false" is true. > > Ian > > On 5/9/06, Derrick Oswald wrote: > > Ian, > > > > The conversion of case requires either an assumption of encoding or an > > explicit one. > > See for example the additional Locale property on StringFilter. > > > > The regex library requires or assumes a strategy, either MATCH, > > LOOKINGAT or FIND. > > See for example the additional int property on RegexFilter. > > > > I'm not sure how much could be gained by subclassing the existing > > HasAttributeFilter. > > > > Another strategy would be to add boolean properties for 'InText' (on by > > default), 'InAttributeName', and 'InAttributeValue' to the StringFilter > > and RegexFilter. Then of course you would need to add an AttributeName > > property. The attribute name being allowed to be null is a good idea, > > and would be the default if it's just not set, no need for an extra > > boolean 'nameIsNull' property. By the way, searching the tag name would > > come for free if the attributes checking loop started at index zero. > > That would mean adding three boolean and a string property to the two > > classes. I think these are differences enough to warrant new classes. In > > fact, maybe this should be one really prickly class called a > > SearchFilter that combines what StringFilter and RegexFilter do, plus > > the above. I don't think something can be case-insensitive and a regex > > filter though, so these aren't completely orthogonal. So maybe a 'type' > > property: > > straight string match > > case insensitive match - needs or assumes a Locale > > regex match - needs or assumes a strategy > > I leave it up to you though. Sounds like a fair piece of work. > > > > The extra constructors on the AndFilter and OrFilter are also good ideas. > > > > The XorFilter seems like a good thing to round out the logical operations. > > Would it also take an array of filters and only return true if just one > > is matched? > > > > The FilterBuilder would need to be altered to handle these changes of > > course, assuming this was a goal. > > This would be easier if there were just new SearchFilter and XorFilter > > classes rather than changes to the existing HasAttributeFilter, > > StringFilter, and RegexFilter (because new classes could be ignored, > > like the CssSelectorFilter is currently being). > > > > Derrick > > > > Ian Macfarlane wrote: > > > > > I would also like to be able to set the attribute as null but the > > > attribute value as not-null. In this case, it should attempt to match > > > all attributes against the attribute value. > > > > > > Please email me if you have any objections to this (or anything else). > > > > > > Thanks > > > > > > Ian Macfarlane > > > > > > On 5/8/06, Ian Macfarlane wrote: > > > > > >> I would like to add the following functionality to HasAttributeFilter: > > >> > > >> 1) A boolean flag to set if the matching should be case-insensitive. I > > >> think this could be done with a boolean, one new constructor (String > > >> attribute, String value, boolean attribValue) and get/set method pair. > > >> > > >> 2) A flag to mark that the attribValue should be parsed as a regular > > >> expression (I don't really see the benefit of doing this with the tag > > >> name). This should also obey the case-sensitivity rule in (1). For > > >> this, I imagine a further constructor and get/set method pair. (a > > >> sample use case of this is "post\d+" to match post1, post22, > > >> post343545, etc). > > >> > > >> > > >> I'm willing to go ahead and code these, but I thought I should run > > >> this past you other developers too in case you dislike either idea. > > >> I'm also open to either: > > >> > > >> a) putting the regexp stuff in a subclass of HasAttributeFilter (but > > >> it seems a small enough change to be suitable as part of the class > > >> size-wise). > > >> > > >> b) changing the one/two boolean constructors to be one constructor > > >> that takes an INT flag, and add flags for the different combinations > > >> (e.g. CASE_SENSITIVE = 1, USE_REGEX = 2, so both together would be 3). > > >> This seems unnecessarily complex, and doing it the way I suggested > > >> above still allows for this in the future if desired. > > >> > > >> > > >> Thanks for your feedback, > > >> > > >> Ian Macfarlane ------------------------------------------------------- Using Tomcat but need to do more? Need to support web services, security? Get stuff done quickly with pre-integrated technology to make your job easier Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo http://sel.as-us.falkag.net/sel?cmd=lnk&kid0709&bid&3057&dat1642 _______________________________________________ Htmlparser-developer mailing list Htm...@li... https://lists.sourceforge.net/lists/listinfo/htmlparser-developer |
From: Ian M. <ian...@gm...> - 2006-05-16 17:07:53
|
Derrick, Can I get a final confirmation you would be happy with the following list of constructors: HasAttributeFilter(String attribute) HasAttributeFilter(String attribute, String value) HasAttributeFilter(String attribute, String value, boolean caseSensitive) HasAttributeFilter(String attribute, String value, boolean caseSensitive, Locale locale) HasAttributeFilter(String attribute, String value, int regexType) HasAttributeFilter(String attribute, String value, int regexType, int regexFlags) Thanks Ian Macfarlane On 5/16/06, Derrick Oswald <der...@ro...> wrote: > It looks good to me. > I might be tempted to pass a Locale object rather than a String in the > constructors, like the existing StringFilter constructors. > > > Ian Macfarlane <ian...@gm...> wrote: > > I've just committed the the new constructors for OrFilter and a new > class XorFilter, as these were simple useful additions and > non-controversial. > > I would still love some feedback about the (revised) proposed changes > to HasAttributeFilter (see below in the email), as I don't want to > write it only for more senior devs to afterwards decide to change it > back because they didn't look at it before. If any of it isn't clear, > please ask. > > Ian > > On 09/05/06, Ian Macfarlane wrote: > > I think the existing StringFilter and RegexFilter, as they apply to > > Text nodes only at the moment, should probably be left alone. > > Whichever class we apply this to should handle tags only. One set for > > text, one for tags (although I also think that the two text-searching > > ones should possible be made one class). > > > > Now I look through the existing filters, it strikes me that we might > > already have some of this in place in the form of LinkStringFilter / > > LinkRegexFilter. These basically do the scanning based on a tag and > > attribute, but restricted to LinkTag tags. I think combining these > > with HasAttributeFilter pretty much gives us what we want (indeed we > > could in theory deprecate the two Link*Filter classes in favour of > > this - not sure if we'd want to do that or not). > > > > Also, you can indeed have both regex and case insensitive - it's built > > into Java's Pattern, and is actually used by LinkRegexFilter [sample > > Pattern.compile (regexPattern, Pattern.CASE_INSENSITIVE | > > Pattern.UNICODE_CASE)]. > > > > So after a fair bit of consideration and changing my mind several > > times, I've settled back on extending HasAttributeFilter. > > > > This is the list of things we need to be able to tell the filter: > > > > Attribute > > Value > > Attrib value case sensitive? > > Locale (but not for regex as use Pattern.UNICODE_CASE) > > Regex on/off > > Regex type (MATCH, LOOKINGAT, FIND) > > > > You could certainly combine the regex on/off with regex type in the > > constructor, e.g.: > > HasAttributeFilter(String attribute) > > HasAttributeFilter(String attribute, String value) > > HasAttributeFilter(String attribute, String value, > > sensitive here>) > > HasAttributeFilter(String attribute, String value, > > sensitive here>, int regexType) > > > > Where regexType =3D 0 (OFF), 1 (MATCH), etc. Then by default for the > > others it would be 0 (OFF). I don't think that's too confusing. > > > > The issue is how to fold in the Locale into the constructor, as it's > > not used by the regex (the regex either uses US-ASCII or Unicode, > > depending if Pattern.UNICODE_CASE is set). Also regexes can include > > locale-specific sections in them (as well as, of course, the usual > > case-insensitive stuff). So I think we want to mutually exclude > > passing values for locale and regex: > > > > The non regex constructors: > > > > HasAttributeFilter(String attribute) > > HasAttributeFilter(String attribute, String value) > > HasAttributeFilter(String attribute, String value, boolean caseSensitiv= e) > > HasAttributeFilter(String attribute, String value, boolean > > caseSensitive, String locale) > > > > or alternatively: > > > > HasAttributeFilter(String attribute) > > HasAttributeFilter(String attribute, String value) > > HasAttributeFilter(String attribute, String value, String > caseInsensitiveLocale) > > [for the last one therefore we turn on case-insensitivity] > > > > and for the regex-accepting constructors either: > > > > HasAttributeFilter(String attribute, String value, int regexType) > > HasAttributeFilter(String attribute, String value, boolean > > caseSensitive, int regexType) > > > > or alternatively: > > > > HasAttributeFilter(String attribute, String value, int regexType) > > HasAttributeFilter(String attribute, String value, int regexType, int > > regexFlags) (e.g. let them pass Pattern.CASE_INSENSITIVE etc). > > > > > > I personally favour choice number 1 for the non-regex constructors and > > number 2 for the regex constructors, so I think it the list of > > constructors should look like: > > > > HasAttributeFilter(String attribute) > > HasAttributeFilter(String attribute, String value) > > HasAttributeFilter(String attribute, String value, boolean caseSensitiv= e) > > HasAttributeFilter(String attribute, String value, boolean > > caseSensitive, String locale) > > HasAttributeFilter(String attribute, String value, int regexType) > > HasAttributeFilter(String attribute, String value, int regexType, int > > regexFlags) > > > > I'd love some feedback please :) > > > > > -------------------------------------------------------------------------= -------- > > > > AndFilter/OrFilter taking arrays - this seems like you'd like it, so > > if Sourceforge CVS will stop being broken I might try and add it. In > > the case of less than 2 filters being added, I'm in favour of throwing > > an IllegalArgumentException - does that sound reasonable? > > > > > -------------------------------------------------------------------------= -------- > > > > XorFilter - I'm not 100% sure how the XOR logic should work if it has > > more than two filters. According to the ever reliable (ahem!) > > Wikipedia http://en.wikipedia.org/wiki/XOR XOR over > multiple entries > > "is true iff an odd number of the variables are true". So "true true > > false" is false, and "false true false" is true. > > > > Ian > > > > On 5/9/06, Derrick Oswald wrote: > > > Ian, > > > > > > The conversion of case requires either an assumption of encoding or a= n > > > explicit one. > > > See for example the additional Locale property on StringFilter. > > > > > > The regex library requires or assumes a strategy, either MATCH, > > > LOOKINGAT or FIND. > > > See for example the additional int property on RegexFilter. > > > > > > I'm not sure how much could be gained by subclassing the existing > > > HasAttributeFilter. > > > > > > Another strategy would be to add boolean properties for 'InText' (on = by > > > default), 'InAttributeName', and 'InAttributeValue' to the StringFilt= er > > > and RegexFilter. Then of course you would need to add an AttributeNam= e > > > property. The attribute name being allowed to be null is a good idea, > > > and would be the default if it's just not set, no need for an extra > > > boolean 'nameIsNull' property. By the way, searching the tag name wou= ld > > > come for free if the attributes checking loop started at index zero. > > > That would mean adding three boolean and a string property to the two > > > classes. I think these are differences enough to warrant new classes.= In > > > fact, maybe this should be one really prickly class called a > > > SearchFilter that combines what StringFilter and RegexFilter do, plus > > > the above. I don't think something can be case-insensitive and a rege= x > > > filter though, so these aren't completely orthogonal. So maybe a 'typ= e' > > > property: > > > straight string match > > > case insensitive match - needs or assumes a Locale > > > regex match - needs or assumes a strategy > > > I leave it up to you though. Sounds like a fair piece of work. > > > > > > The extra constructors on the AndFilter and OrFilter are also good > ideas. > > > > > > The XorFilter seems like a good thing to round out the logical > operations. > > > Would it also take an array of filters and only return true if just o= ne > > > is matched? > > > > > > The FilterBuilder would need to be altered to handle these changes of > > > course, assuming this was a goal. > > > This would be easier if there were just new SearchFilter and XorFilte= r > > > classes rather than changes to the existing HasAttributeFilter, > > > StringFilter, and RegexFilter (because new classes could be ignored, > > > like the CssSelectorFilter is currently being). > > > > > > Derrick > > > > > > Ian Macfarlane wrote: > > > > > > > I would also like to be able to set the attribute as null but the > > > > attribute value as not-null. In this case, it should attempt to mat= ch > > > > all attributes against the attribute value. > > > > > > > > Please email me if you have any objections to this (or anything els= e). > > > > > > > > Thanks > > > > > > > > Ian Macfarlane > > > > > > > > On 5/8/06, Ian Macfarlane wrote: > > > > > > > >> I would like to add the following functionality to > HasAttributeFilter: > > > >> > > > >> 1) A boolean flag to set if the matching should be case-insensitiv= e. > I > > > >> think this could be done with a boolean, one new constructor (Stri= ng > > > >> attribute, String value, boolean attribValue) and get/set method > pair. > > > >> > > > >> 2) A flag to mark that the attribValue should be parsed as a regul= ar > > > >> expression (I don't really see the benefit of doing this with the = tag > > > >> name). This should also obey the case-sensitivity rule in (1). For > > > >> this, I imagine a further constructor and get/set method pair. (a > > > >> sample use case of this is "post\d+" to match post1, post22, > > > >> post343545, etc). > > > >> > > > >> > > > >> I'm willing to go ahead and code these, but I thought I should run > > > >> this past you other developers too in case you dislike either idea= . > > > >> I'm also open to either: > > > >> > > > >> a) putting the regexp stuff in a subclass of HasAttributeFilter (b= ut > > > >> it seems a small enough change to be suitable as part of the class > > > >> size-wise). > > > >> > > > >> b) changing the one/two boolean constructors to be one constructor > > > >> that takes an INT flag, and add flags for the different combinatio= ns > > > >> (e.g. CASE_SENSITIVE =3D 1, USE_REGEX =3D 2, so both together woul= d be > 3). > > > >> This seems unnecessarily complex, and doing it the way I suggested > > > >> above still allows for this in the future if desired. > > > >> > > > >> > > > >> Thanks for your feedback, > > > >> > > > >> Ian Macfarlane > > > ------------------------------------------------------- > Using Tomcat but need to do more? Need to support web services, security? > Get stuff done quickly with pre-integrated technology to make your job > easier > Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronim= o > http://sel.as-us.falkag.net/sel?cmd=3Dlnk&kid=120709&bid&3057&dat=121642 > > _______________________________________________ > Htmlparser-developer mailing list > Htm...@li... > https://lists.sourceforge.net/lists/listinfo/htmlparser-developer > > |
From: Derrick O. <der...@ro...> - 2006-05-16 23:33:41
|
Looks fine. Ian Macfarlane <ian...@gm...> wrote: Derrick, Can I get a final confirmation you would be happy with the following list of constructors: HasAttributeFilter(String attribute) HasAttributeFilter(String attribute, String value) HasAttributeFilter(String attribute, String value, boolean caseSensitive) HasAttributeFilter(String attribute, String value, boolean caseSensitive, Locale locale) HasAttributeFilter(String attribute, String value, int regexType) HasAttributeFilter(String attribute, String value, int regexType, int regexFlags) Thanks Ian Macfarlane On 5/16/06, Derrick Oswald wrote: > It looks good to me. > I might be tempted to pass a Locale object rather than a String in the > constructors, like the existing StringFilter constructors. > > > Ian Macfarlane wrote: > > I've just committed the the new constructors for OrFilter and a new > class XorFilter, as these were simple useful additions and > non-controversial. > > I would still love some feedback about the (revised) proposed changes > to HasAttributeFilter (see below in the email), as I don't want to > write it only for more senior devs to afterwards decide to change it > back because they didn't look at it before. If any of it isn't clear, > please ask. > > Ian > > On 09/05/06, Ian Macfarlane wrote: > > I think the existing StringFilter and RegexFilter, as they apply to > > Text nodes only at the moment, should probably be left alone. > > Whichever class we apply this to should handle tags only. One set for > > text, one for tags (although I also think that the two text-searching > > ones should possible be made one class). > > > > Now I look through the existing filters, it strikes me that we might > > already have some of this in place in the form of LinkStringFilter / > > LinkRegexFilter. These basically do the scanning based on a tag and > > attribute, but restricted to LinkTag tags. I think combining these > > with HasAttributeFilter pretty much gives us what we want (indeed we > > could in theory deprecate the two Link*Filter classes in favour of > > this - not sure if we'd want to do that or not). > > > > Also, you can indeed have both regex and case insensitive - it's built > > into Java's Pattern, and is actually used by LinkRegexFilter [sample > > Pattern.compile (regexPattern, Pattern.CASE_INSENSITIVE | > > Pattern.UNICODE_CASE)]. > > > > So after a fair bit of consideration and changing my mind several > > times, I've settled back on extending HasAttributeFilter. > > > > This is the list of things we need to be able to tell the filter: > > > > Attribute > > Value > > Attrib value case sensitive? > > Locale (but not for regex as use Pattern.UNICODE_CASE) > > Regex on/off > > Regex type (MATCH, LOOKINGAT, FIND) > > > > You could certainly combine the regex on/off with regex type in the > > constructor, e.g.: > > HasAttributeFilter(String attribute) > > HasAttributeFilter(String attribute, String value) > > HasAttributeFilter(String attribute, String value, > > sensitive here>) > > HasAttributeFilter(String attribute, String value, > > sensitive here>, int regexType) > > > > Where regexType = 0 (OFF), 1 (MATCH), etc. Then by default for the > > others it would be 0 (OFF). I don't think that's too confusing. > > > > The issue is how to fold in the Locale into the constructor, as it's > > not used by the regex (the regex either uses US-ASCII or Unicode, > > depending if Pattern.UNICODE_CASE is set). Also regexes can include > > locale-specific sections in them (as well as, of course, the usual > > case-insensitive stuff). So I think we want to mutually exclude > > passing values for locale and regex: > > > > The non regex constructors: > > > > HasAttributeFilter(String attribute) > > HasAttributeFilter(String attribute, String value) > > HasAttributeFilter(String attribute, String value, boolean caseSensitive) > > HasAttributeFilter(String attribute, String value, boolean > > caseSensitive, String locale) > > > > or alternatively: > > > > HasAttributeFilter(String attribute) > > HasAttributeFilter(String attribute, String value) > > HasAttributeFilter(String attribute, String value, String > caseInsensitiveLocale) > > [for the last one therefore we turn on case-insensitivity] > > > > and for the regex-accepting constructors either: > > > > HasAttributeFilter(String attribute, String value, int regexType) > > HasAttributeFilter(String attribute, String value, boolean > > caseSensitive, int regexType) > > > > or alternatively: > > > > HasAttributeFilter(String attribute, String value, int regexType) > > HasAttributeFilter(String attribute, String value, int regexType, int > > regexFlags) (e.g. let them pass Pattern.CASE_INSENSITIVE etc). > > > > > > I personally favour choice number 1 for the non-regex constructors and > > number 2 for the regex constructors, so I think it the list of > > constructors should look like: > > > > HasAttributeFilter(String attribute) > > HasAttributeFilter(String attribute, String value) > > HasAttributeFilter(String attribute, String value, boolean caseSensitive) > > HasAttributeFilter(String attribute, String value, boolean > > caseSensitive, String locale) > > HasAttributeFilter(String attribute, String value, int regexType) > > HasAttributeFilter(String attribute, String value, int regexType, int > > regexFlags) > > > > I'd love some feedback please :) > > > > > --------------------------------------------------------------------------------- > > > > AndFilter/OrFilter taking arrays - this seems like you'd like it, so > > if Sourceforge CVS will stop being broken I might try and add it. In > > the case of less than 2 filters being added, I'm in favour of throwing > > an IllegalArgumentException - does that sound reasonable? > > > > > --------------------------------------------------------------------------------- > > > > XorFilter - I'm not 100% sure how the XOR logic should work if it has > > more than two filters. According to the ever reliable (ahem!) > > Wikipedia http://en.wikipedia.org/wiki/XOR XOR over > multiple entries > > "is true iff an odd number of the variables are true". So "true true > > false" is false, and "false true false" is true. > > > > Ian > > > > On 5/9/06, Derrick Oswald wrote: > > > Ian, > > > > > > The conversion of case requires either an assumption of encoding or an > > > explicit one. > > > See for example the additional Locale property on StringFilter. > > > > > > The regex library requires or assumes a strategy, either MATCH, > > > LOOKINGAT or FIND. > > > See for example the additional int property on RegexFilter. > > > > > > I'm not sure how much could be gained by subclassing the existing > > > HasAttributeFilter. > > > > > > Another strategy would be to add boolean properties for 'InText' (on by > > > default), 'InAttributeName', and 'InAttributeValue' to the StringFilter > > > and RegexFilter. Then of course you would need to add an AttributeName > > > property. The attribute name being allowed to be null is a good idea, > > > and would be the default if it's just not set, no need for an extra > > > boolean 'nameIsNull' property. By the way, searching the tag name would > > > come for free if the attributes checking loop started at index zero. > > > That would mean adding three boolean and a string property to the two > > > classes. I think these are differences enough to warrant new classes. In > > > fact, maybe this should be one really prickly class called a > > > SearchFilter that combines what StringFilter and RegexFilter do, plus > > > the above. I don't think something can be case-insensitive and a regex > > > filter though, so these aren't completely orthogonal. So maybe a 'type' > > > property: > > > straight string match > > > case insensitive match - needs or assumes a Locale > > > regex match - needs or assumes a strategy > > > I leave it up to you though. Sounds like a fair piece of work. > > > > > > The extra constructors on the AndFilter and OrFilter are also good > ideas. > > > > > > The XorFilter seems like a good thing to round out the logical > operations. > > > Would it also take an array of filters and only return true if just one > > > is matched? > > > > > > The FilterBuilder would need to be altered to handle these changes of > > > course, assuming this was a goal. > > > This would be easier if there were just new SearchFilter and XorFilter > > > classes rather than changes to the existing HasAttributeFilter, > > > StringFilter, and RegexFilter (because new classes could be ignored, > > > like the CssSelectorFilter is currently being). > > > > > > Derrick > > > > > > Ian Macfarlane wrote: > > > > > > > I would also like to be able to set the attribute as null but the > > > > attribute value as not-null. In this case, it should attempt to match > > > > all attributes against the attribute value. > > > > > > > > Please email me if you have any objections to this (or anything else). > > > > > > > > Thanks > > > > > > > > Ian Macfarlane > > > > > > > > On 5/8/06, Ian Macfarlane wrote: > > > > > > > >> I would like to add the following functionality to > HasAttributeFilter: > > > >> > > > >> 1) A boolean flag to set if the matching should be case-insensitive. > I > > > >> think this could be done with a boolean, one new constructor (String > > > >> attribute, String value, boolean attribValue) and get/set method > pair. > > > >> > > > >> 2) A flag to mark that the attribValue should be parsed as a regular > > > >> expression (I don't really see the benefit of doing this with the tag > > > >> name). This should also obey the case-sensitivity rule in (1). For > > > >> this, I imagine a further constructor and get/set method pair. (a > > > >> sample use case of this is "post\d+" to match post1, post22, > > > >> post343545, etc). > > > >> > > > >> > > > >> I'm willing to go ahead and code these, but I thought I should run > > > >> this past you other developers too in case you dislike either idea. > > > >> I'm also open to either: > > > >> > > > >> a) putting the regexp stuff in a subclass of HasAttributeFilter (but > > > >> it seems a small enough change to be suitable as part of the class > > > >> size-wise). > > > >> > > > >> b) changing the one/two boolean constructors to be one constructor > > > >> that takes an INT flag, and add flags for the different combinations > > > >> (e.g. CASE_SENSITIVE = 1, USE_REGEX = 2, so both together would be > 3). > > > >> This seems unnecessarily complex, and doing it the way I suggested > > > >> above still allows for this in the future if desired. > > > >> > > > >> > > > >> Thanks for your feedback, > > > >> > > > >> Ian Macfarlane > > > ------------------------------------------------------- > Using Tomcat but need to do more? Need to support web services, security? > Get stuff done quickly with pre-integrated technology to make your job > easier > Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo > http://sel.as-us.falkag.net/sel?cmd=lnk&kid0709&bid&3057&dat1642 > > _______________________________________________ > Htmlparser-developer mailing list > Htm...@li... > https://lists.sourceforge.net/lists/listinfo/htmlparser-developer > > ------------------------------------------------------- Using Tomcat but need to do more? Need to support web services, security? Get stuff done quickly with pre-integrated technology to make your job easier Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo http://sel.as-us.falkag.net/sel?cmd=lnk&kid0709&bid&3057&dat1642 _______________________________________________ Htmlparser-developer mailing list Htm...@li... https://lists.sourceforge.net/lists/listinfo/htmlparser-developer |