Re: [Htmlparser-developer] SearchFilter
Brought to you by:
derrickoswald
From: Ian M. <ian...@gm...> - 2006-05-16 13:34:20
|
I've just committed the the new constructors for OrFilter and a new class XorFilter, as these were simple useful additions and non-controversial. I would still love some feedback about the (revised) proposed changes to HasAttributeFilter (see below in the email), as I don't want to write it only for more senior devs to afterwards decide to change it back because they didn't look at it before. If any of it isn't clear, please ask. Ian On 09/05/06, Ian Macfarlane <ian...@gm...> wrote: > I think the existing StringFilter and RegexFilter, as they apply to > Text nodes only at the moment, should probably be left alone. > Whichever class we apply this to should handle tags only. One set for > text, one for tags (although I also think that the two text-searching > ones should possible be made one class). > > Now I look through the existing filters, it strikes me that we might > already have some of this in place in the form of LinkStringFilter / > LinkRegexFilter. These basically do the scanning based on a tag and > attribute, but restricted to LinkTag tags. I think combining these > with HasAttributeFilter pretty much gives us what we want (indeed we > could in theory deprecate the two Link*Filter classes in favour of > this - not sure if we'd want to do that or not). > > Also, you can indeed have both regex and case insensitive - it's built > into Java's Pattern, and is actually used by LinkRegexFilter [sample > Pattern.compile (regexPattern, Pattern.CASE_INSENSITIVE | > Pattern.UNICODE_CASE)]. > > So after a fair bit of consideration and changing my mind several > times, I've settled back on extending HasAttributeFilter. > > This is the list of things we need to be able to tell the filter: > > Attribute > Value > Attrib value case sensitive? > Locale (but not for regex as use Pattern.UNICODE_CASE) > Regex on/off > Regex type (MATCH, LOOKINGAT, FIND) > > You could certainly combine the regex on/off with regex type in the > constructor, e.g.: > HasAttributeFilter(String attribute) > HasAttributeFilter(String attribute, String value) > HasAttributeFilter(String attribute, String value, <stuff for case > sensitive here>) > HasAttributeFilter(String attribute, String value, <stuff for case > sensitive here>, int regexType) > > Where regexType =3D 0 (OFF), 1 (MATCH), etc. Then by default for the > others it would be 0 (OFF). I don't think that's too confusing. > > The issue is how to fold in the Locale into the constructor, as it's > not used by the regex (the regex either uses US-ASCII or Unicode, > depending if Pattern.UNICODE_CASE is set). Also regexes can include > locale-specific sections in them (as well as, of course, the usual > case-insensitive stuff). So I think we want to mutually exclude > passing values for locale and regex: > > The non regex constructors: > > HasAttributeFilter(String attribute) > HasAttributeFilter(String attribute, String value) > HasAttributeFilter(String attribute, String value, boolean caseSensitive) > HasAttributeFilter(String attribute, String value, boolean > caseSensitive, String locale) > > or alternatively: > > HasAttributeFilter(String attribute) > HasAttributeFilter(String attribute, String value) > HasAttributeFilter(String attribute, String value, String caseInsensitive= Locale) > [for the last one therefore we turn on case-insensitivity] > > and for the regex-accepting constructors either: > > HasAttributeFilter(String attribute, String value, int regexType) > HasAttributeFilter(String attribute, String value, boolean > caseSensitive, int regexType) > > or alternatively: > > HasAttributeFilter(String attribute, String value, int regexType) > HasAttributeFilter(String attribute, String value, int regexType, int > regexFlags) (e.g. let them pass Pattern.CASE_INSENSITIVE etc). > > > I personally favour choice number 1 for the non-regex constructors and > number 2 for the regex constructors, so I think it the list of > constructors should look like: > > HasAttributeFilter(String attribute) > HasAttributeFilter(String attribute, String value) > HasAttributeFilter(String attribute, String value, boolean caseSensitive) > HasAttributeFilter(String attribute, String value, boolean > caseSensitive, String locale) > HasAttributeFilter(String attribute, String value, int regexType) > HasAttributeFilter(String attribute, String value, int regexType, int > regexFlags) > > I'd love some feedback please :) > > -------------------------------------------------------------------------= -------- > > AndFilter/OrFilter taking arrays - this seems like you'd like it, so > if Sourceforge CVS will stop being broken I might try and add it. In > the case of less than 2 filters being added, I'm in favour of throwing > an IllegalArgumentException - does that sound reasonable? > > -------------------------------------------------------------------------= -------- > > XorFilter - I'm not 100% sure how the XOR logic should work if it has > more than two filters. According to the ever reliable (ahem!) > Wikipedia http://en.wikipedia.org/wiki/XOR XOR over multiple entries > "is true iff an odd number of the variables are true". So "true true > false" is false, and "false true false" is true. > > Ian > > On 5/9/06, Derrick Oswald <Der...@ro...> wrote: > > Ian, > > > > The conversion of case requires either an assumption of encoding or an > > explicit one. > > See for example the additional Locale property on StringFilter. > > > > The regex library requires or assumes a strategy, either MATCH, > > LOOKINGAT or FIND. > > See for example the additional int property on RegexFilter. > > > > I'm not sure how much could be gained by subclassing the existing > > HasAttributeFilter. > > > > Another strategy would be to add boolean properties for 'InText' (on by > > default), 'InAttributeName', and 'InAttributeValue' to the StringFilter > > and RegexFilter. Then of course you would need to add an AttributeName > > property. The attribute name being allowed to be null is a good idea, > > and would be the default if it's just not set, no need for an extra > > boolean 'nameIsNull' property. By the way, searching the tag name woul= d > > come for free if the attributes checking loop started at index zero. > > That would mean adding three boolean and a string property to the two > > classes. I think these are differences enough to warrant new classes. I= n > > fact, maybe this should be one really prickly class called a > > SearchFilter that combines what StringFilter and RegexFilter do, plus > > the above. I don't think something can be case-insensitive and a regex > > filter though, so these aren't completely orthogonal. So maybe a 'type= ' > > property: > > straight string match > > case insensitive match - needs or assumes a Locale > > regex match - needs or assumes a strategy > > I leave it up to you though. Sounds like a fair piece of work. > > > > The extra constructors on the AndFilter and OrFilter are also good idea= s. > > > > The XorFilter seems like a good thing to round out the logical operatio= ns. > > Would it also take an array of filters and only return true if just one > > is matched? > > > > The FilterBuilder would need to be altered to handle these changes of > > course, assuming this was a goal. > > This would be easier if there were just new SearchFilter and XorFilter > > classes rather than changes to the existing HasAttributeFilter, > > StringFilter, and RegexFilter (because new classes could be ignored, > > like the CssSelectorFilter is currently being). > > > > Derrick > > > > Ian Macfarlane wrote: > > > > > I would also like to be able to set the attribute as null but the > > > attribute value as not-null. In this case, it should attempt to match > > > all attributes against the attribute value. > > > > > > Please email me if you have any objections to this (or anything else)= . > > > > > > Thanks > > > > > > Ian Macfarlane > > > > > > On 5/8/06, Ian Macfarlane <ian...@gm...> wrote: > > > > > >> I would like to add the following functionality to HasAttributeFilte= r: > > >> > > >> 1) A boolean flag to set if the matching should be case-insensitive.= I > > >> think this could be done with a boolean, one new constructor (String > > >> attribute, String value, boolean attribValue) and get/set method pai= r. > > >> > > >> 2) A flag to mark that the attribValue should be parsed as a regular > > >> expression (I don't really see the benefit of doing this with the ta= g > > >> name). This should also obey the case-sensitivity rule in (1). For > > >> this, I imagine a further constructor and get/set method pair. (a > > >> sample use case of this is "post\d+" to match post1, post22, > > >> post343545, etc). > > >> > > >> > > >> I'm willing to go ahead and code these, but I thought I should run > > >> this past you other developers too in case you dislike either idea. > > >> I'm also open to either: > > >> > > >> a) putting the regexp stuff in a subclass of HasAttributeFilter (but > > >> it seems a small enough change to be suitable as part of the class > > >> size-wise). > > >> > > >> b) changing the one/two boolean constructors to be one constructor > > >> that takes an INT flag, and add flags for the different combinations > > >> (e.g. CASE_SENSITIVE =3D 1, USE_REGEX =3D 2, so both together would = be 3). > > >> This seems unnecessarily complex, and doing it the way I suggested > > >> above still allows for this in the future if desired. > > >> > > >> > > >> Thanks for your feedback, > > >> > > >> Ian Macfarlane |