Re: [Htmlparser-developer] SearchFilter
Brought to you by:
derrickoswald
From: Ian M. <ian...@gm...> - 2006-05-16 13:40:41
|
Oops! I missed mentioning that I added the new constructor AndFilter as well as OrFilter. Ian On 16/05/06, Ian Macfarlane <ian...@gm...> wrote: > I've just committed the the new constructors for OrFilter and a new > class XorFilter, as these were simple useful additions and > non-controversial. > > I would still love some feedback about the (revised) proposed changes > to HasAttributeFilter (see below in the email), as I don't want to > write it only for more senior devs to afterwards decide to change it > back because they didn't look at it before. If any of it isn't clear, > please ask. > > Ian > > On 09/05/06, Ian Macfarlane <ian...@gm...> wrote: > > I think the existing StringFilter and RegexFilter, as they apply to > > Text nodes only at the moment, should probably be left alone. > > Whichever class we apply this to should handle tags only. One set for > > text, one for tags (although I also think that the two text-searching > > ones should possible be made one class). > > > > Now I look through the existing filters, it strikes me that we might > > already have some of this in place in the form of LinkStringFilter / > > LinkRegexFilter. These basically do the scanning based on a tag and > > attribute, but restricted to LinkTag tags. I think combining these > > with HasAttributeFilter pretty much gives us what we want (indeed we > > could in theory deprecate the two Link*Filter classes in favour of > > this - not sure if we'd want to do that or not). > > > > Also, you can indeed have both regex and case insensitive - it's built > > into Java's Pattern, and is actually used by LinkRegexFilter [sample > > Pattern.compile (regexPattern, Pattern.CASE_INSENSITIVE | > > Pattern.UNICODE_CASE)]. > > > > So after a fair bit of consideration and changing my mind several > > times, I've settled back on extending HasAttributeFilter. > > > > This is the list of things we need to be able to tell the filter: > > > > Attribute > > Value > > Attrib value case sensitive? > > Locale (but not for regex as use Pattern.UNICODE_CASE) > > Regex on/off > > Regex type (MATCH, LOOKINGAT, FIND) > > > > You could certainly combine the regex on/off with regex type in the > > constructor, e.g.: > > HasAttributeFilter(String attribute) > > HasAttributeFilter(String attribute, String value) > > HasAttributeFilter(String attribute, String value, <stuff for case > > sensitive here>) > > HasAttributeFilter(String attribute, String value, <stuff for case > > sensitive here>, int regexType) > > > > Where regexType =3D 0 (OFF), 1 (MATCH), etc. Then by default for the > > others it would be 0 (OFF). I don't think that's too confusing. > > > > The issue is how to fold in the Locale into the constructor, as it's > > not used by the regex (the regex either uses US-ASCII or Unicode, > > depending if Pattern.UNICODE_CASE is set). Also regexes can include > > locale-specific sections in them (as well as, of course, the usual > > case-insensitive stuff). So I think we want to mutually exclude > > passing values for locale and regex: > > > > The non regex constructors: > > > > HasAttributeFilter(String attribute) > > HasAttributeFilter(String attribute, String value) > > HasAttributeFilter(String attribute, String value, boolean caseSensitiv= e) > > HasAttributeFilter(String attribute, String value, boolean > > caseSensitive, String locale) > > > > or alternatively: > > > > HasAttributeFilter(String attribute) > > HasAttributeFilter(String attribute, String value) > > HasAttributeFilter(String attribute, String value, String caseInsensiti= veLocale) > > [for the last one therefore we turn on case-insensitivity] > > > > and for the regex-accepting constructors either: > > > > HasAttributeFilter(String attribute, String value, int regexType) > > HasAttributeFilter(String attribute, String value, boolean > > caseSensitive, int regexType) > > > > or alternatively: > > > > HasAttributeFilter(String attribute, String value, int regexType) > > HasAttributeFilter(String attribute, String value, int regexType, int > > regexFlags) (e.g. let them pass Pattern.CASE_INSENSITIVE etc). > > > > > > I personally favour choice number 1 for the non-regex constructors and > > number 2 for the regex constructors, so I think it the list of > > constructors should look like: > > > > HasAttributeFilter(String attribute) > > HasAttributeFilter(String attribute, String value) > > HasAttributeFilter(String attribute, String value, boolean caseSensitiv= e) > > HasAttributeFilter(String attribute, String value, boolean > > caseSensitive, String locale) > > HasAttributeFilter(String attribute, String value, int regexType) > > HasAttributeFilter(String attribute, String value, int regexType, int > > regexFlags) > > > > I'd love some feedback please :) > > > > -----------------------------------------------------------------------= ---------- > > > > AndFilter/OrFilter taking arrays - this seems like you'd like it, so > > if Sourceforge CVS will stop being broken I might try and add it. In > > the case of less than 2 filters being added, I'm in favour of throwing > > an IllegalArgumentException - does that sound reasonable? > > > > -----------------------------------------------------------------------= ---------- > > > > XorFilter - I'm not 100% sure how the XOR logic should work if it has > > more than two filters. According to the ever reliable (ahem!) > > Wikipedia http://en.wikipedia.org/wiki/XOR XOR over multiple entries > > "is true iff an odd number of the variables are true". So "true true > > false" is false, and "false true false" is true. > > > > Ian > > > > On 5/9/06, Derrick Oswald <Der...@ro...> wrote: > > > Ian, > > > > > > The conversion of case requires either an assumption of encoding or a= n > > > explicit one. > > > See for example the additional Locale property on StringFilter. > > > > > > The regex library requires or assumes a strategy, either MATCH, > > > LOOKINGAT or FIND. > > > See for example the additional int property on RegexFilter. > > > > > > I'm not sure how much could be gained by subclassing the existing > > > HasAttributeFilter. > > > > > > Another strategy would be to add boolean properties for 'InText' (on = by > > > default), 'InAttributeName', and 'InAttributeValue' to the StringFilt= er > > > and RegexFilter. Then of course you would need to add an AttributeNa= me > > > property. The attribute name being allowed to be null is a good idea, > > > and would be the default if it's just not set, no need for an extra > > > boolean 'nameIsNull' property. By the way, searching the tag name wo= uld > > > come for free if the attributes checking loop started at index zero. > > > That would mean adding three boolean and a string property to the two > > > classes. I think these are differences enough to warrant new classes.= In > > > fact, maybe this should be one really prickly class called a > > > SearchFilter that combines what StringFilter and RegexFilter do, plus > > > the above. I don't think something can be case-insensitive and a rege= x > > > filter though, so these aren't completely orthogonal. So maybe a 'ty= pe' > > > property: > > > straight string match > > > case insensitive match - needs or assumes a Locale > > > regex match - needs or assumes a strategy > > > I leave it up to you though. Sounds like a fair piece of work. > > > > > > The extra constructors on the AndFilter and OrFilter are also good id= eas. > > > > > > The XorFilter seems like a good thing to round out the logical operat= ions. > > > Would it also take an array of filters and only return true if just o= ne > > > is matched? > > > > > > The FilterBuilder would need to be altered to handle these changes of > > > course, assuming this was a goal. > > > This would be easier if there were just new SearchFilter and XorFilte= r > > > classes rather than changes to the existing HasAttributeFilter, > > > StringFilter, and RegexFilter (because new classes could be ignored, > > > like the CssSelectorFilter is currently being). > > > > > > Derrick > > > > > > Ian Macfarlane wrote: > > > > > > > I would also like to be able to set the attribute as null but the > > > > attribute value as not-null. In this case, it should attempt to mat= ch > > > > all attributes against the attribute value. > > > > > > > > Please email me if you have any objections to this (or anything els= e). > > > > > > > > Thanks > > > > > > > > Ian Macfarlane > > > > > > > > On 5/8/06, Ian Macfarlane <ian...@gm...> wrote: > > > > > > > >> I would like to add the following functionality to HasAttributeFil= ter: > > > >> > > > >> 1) A boolean flag to set if the matching should be case-insensitiv= e. I > > > >> think this could be done with a boolean, one new constructor (Stri= ng > > > >> attribute, String value, boolean attribValue) and get/set method p= air. > > > >> > > > >> 2) A flag to mark that the attribValue should be parsed as a regul= ar > > > >> expression (I don't really see the benefit of doing this with the = tag > > > >> name). This should also obey the case-sensitivity rule in (1). For > > > >> this, I imagine a further constructor and get/set method pair. (a > > > >> sample use case of this is "post\d+" to match post1, post22, > > > >> post343545, etc). > > > >> > > > >> > > > >> I'm willing to go ahead and code these, but I thought I should run > > > >> this past you other developers too in case you dislike either idea= . > > > >> I'm also open to either: > > > >> > > > >> a) putting the regexp stuff in a subclass of HasAttributeFilter (b= ut > > > >> it seems a small enough change to be suitable as part of the class > > > >> size-wise). > > > >> > > > >> b) changing the one/two boolean constructors to be one constructor > > > >> that takes an INT flag, and add flags for the different combinatio= ns > > > >> (e.g. CASE_SENSITIVE =3D 1, USE_REGEX =3D 2, so both together woul= d be 3). > > > >> This seems unnecessarily complex, and doing it the way I suggested > > > >> above still allows for this in the future if desired. > > > >> > > > >> > > > >> Thanks for your feedback, > > > >> > > > >> Ian Macfarlane > |