Re: [Htmlparser-developer] SearchFilter
Brought to you by:
derrickoswald
From: Ian M. <ian...@gm...> - 2006-05-09 11:56:22
|
I think the existing StringFilter and RegexFilter, as they apply to Text nodes only at the moment, should probably be left alone. Whichever class we apply this to should handle tags only. One set for text, one for tags (although I also think that the two text-searching ones should possible be made one class). Now I look through the existing filters, it strikes me that we might already have some of this in place in the form of LinkStringFilter / LinkRegexFilter. These basically do the scanning based on a tag and attribute, but restricted to LinkTag tags. I think combining these with HasAttributeFilter pretty much gives us what we want (indeed we could in theory deprecate the two Link*Filter classes in favour of this - not sure if we'd want to do that or not). Also, you can indeed have both regex and case insensitive - it's built into Java's Pattern, and is actually used by LinkRegexFilter [sample Pattern.compile (regexPattern, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)]. So after a fair bit of consideration and changing my mind several times, I've settled back on extending HasAttributeFilter. This is the list of things we need to be able to tell the filter: Attribute Value Attrib value case sensitive? Locale (but not for regex as use Pattern.UNICODE_CASE) Regex on/off Regex type (MATCH, LOOKINGAT, FIND) You could certainly combine the regex on/off with regex type in the constructor, e.g.: HasAttributeFilter(String attribute) HasAttributeFilter(String attribute, String value) HasAttributeFilter(String attribute, String value, <stuff for case sensitive here>) HasAttributeFilter(String attribute, String value, <stuff for case sensitive here>, int regexType) Where regexType =3D 0 (OFF), 1 (MATCH), etc. Then by default for the others it would be 0 (OFF). I don't think that's too confusing. The issue is how to fold in the Locale into the constructor, as it's not used by the regex (the regex either uses US-ASCII or Unicode, depending if Pattern.UNICODE_CASE is set). Also regexes can include locale-specific sections in them (as well as, of course, the usual case-insensitive stuff). So I think we want to mutually exclude passing values for locale and regex: The non regex constructors: HasAttributeFilter(String attribute) HasAttributeFilter(String attribute, String value) HasAttributeFilter(String attribute, String value, boolean caseSensitive) HasAttributeFilter(String attribute, String value, boolean caseSensitive, String locale) or alternatively: HasAttributeFilter(String attribute) HasAttributeFilter(String attribute, String value) HasAttributeFilter(String attribute, String value, String caseInsensitiveLo= cale) [for the last one therefore we turn on case-insensitivity] and for the regex-accepting constructors either: HasAttributeFilter(String attribute, String value, int regexType) HasAttributeFilter(String attribute, String value, boolean caseSensitive, int regexType) or alternatively: HasAttributeFilter(String attribute, String value, int regexType) HasAttributeFilter(String attribute, String value, int regexType, int regexFlags) (e.g. let them pass Pattern.CASE_INSENSITIVE etc). I personally favour choice number 1 for the non-regex constructors and number 2 for the regex constructors, so I think it the list of constructors should look like: HasAttributeFilter(String attribute) HasAttributeFilter(String attribute, String value) HasAttributeFilter(String attribute, String value, boolean caseSensitive) HasAttributeFilter(String attribute, String value, boolean caseSensitive, String locale) HasAttributeFilter(String attribute, String value, int regexType) HasAttributeFilter(String attribute, String value, int regexType, int regexFlags) I'd love some feedback please :) ---------------------------------------------------------------------------= ------ AndFilter/OrFilter taking arrays - this seems like you'd like it, so if Sourceforge CVS will stop being broken I might try and add it. In the case of less than 2 filters being added, I'm in favour of throwing an IllegalArgumentException - does that sound reasonable? ---------------------------------------------------------------------------= ------ XorFilter - I'm not 100% sure how the XOR logic should work if it has more than two filters. According to the ever reliable (ahem!) Wikipedia http://en.wikipedia.org/wiki/XOR XOR over multiple entries "is true iff an odd number of the variables are true". So "true true false" is false, and "false true false" is true. Ian On 5/9/06, Derrick Oswald <Der...@ro...> wrote: > Ian, > > The conversion of case requires either an assumption of encoding or an > explicit one. > See for example the additional Locale property on StringFilter. > > The regex library requires or assumes a strategy, either MATCH, > LOOKINGAT or FIND. > See for example the additional int property on RegexFilter. > > I'm not sure how much could be gained by subclassing the existing > HasAttributeFilter. > > Another strategy would be to add boolean properties for 'InText' (on by > default), 'InAttributeName', and 'InAttributeValue' to the StringFilter > and RegexFilter. Then of course you would need to add an AttributeName > property. The attribute name being allowed to be null is a good idea, > and would be the default if it's just not set, no need for an extra > boolean 'nameIsNull' property. By the way, searching the tag name would > come for free if the attributes checking loop started at index zero. > That would mean adding three boolean and a string property to the two > classes. I think these are differences enough to warrant new classes. In > fact, maybe this should be one really prickly class called a > SearchFilter that combines what StringFilter and RegexFilter do, plus > the above. I don't think something can be case-insensitive and a regex > filter though, so these aren't completely orthogonal. So maybe a 'type' > property: > straight string match > case insensitive match - needs or assumes a Locale > regex match - needs or assumes a strategy > I leave it up to you though. Sounds like a fair piece of work. > > The extra constructors on the AndFilter and OrFilter are also good ideas. > > The XorFilter seems like a good thing to round out the logical operations= . > Would it also take an array of filters and only return true if just one > is matched? > > The FilterBuilder would need to be altered to handle these changes of > course, assuming this was a goal. > This would be easier if there were just new SearchFilter and XorFilter > classes rather than changes to the existing HasAttributeFilter, > StringFilter, and RegexFilter (because new classes could be ignored, > like the CssSelectorFilter is currently being). > > Derrick > > Ian Macfarlane wrote: > > > I would also like to be able to set the attribute as null but the > > attribute value as not-null. In this case, it should attempt to match > > all attributes against the attribute value. > > > > Please email me if you have any objections to this (or anything else). > > > > Thanks > > > > Ian Macfarlane > > > > On 5/8/06, Ian Macfarlane <ian...@gm...> wrote: > > > >> I would like to add the following functionality to HasAttributeFilter: > >> > >> 1) A boolean flag to set if the matching should be case-insensitive. I > >> think this could be done with a boolean, one new constructor (String > >> attribute, String value, boolean attribValue) and get/set method pair. > >> > >> 2) A flag to mark that the attribValue should be parsed as a regular > >> expression (I don't really see the benefit of doing this with the tag > >> name). This should also obey the case-sensitivity rule in (1). For > >> this, I imagine a further constructor and get/set method pair. (a > >> sample use case of this is "post\d+" to match post1, post22, > >> post343545, etc). > >> > >> > >> I'm willing to go ahead and code these, but I thought I should run > >> this past you other developers too in case you dislike either idea. > >> I'm also open to either: > >> > >> a) putting the regexp stuff in a subclass of HasAttributeFilter (but > >> it seems a small enough change to be suitable as part of the class > >> size-wise). > >> > >> b) changing the one/two boolean constructors to be one constructor > >> that takes an INT flag, and add flags for the different combinations > >> (e.g. CASE_SENSITIVE =3D 1, USE_REGEX =3D 2, so both together would be= 3). > >> This seems unnecessarily complex, and doing it the way I suggested > >> above still allows for this in the future if desired. > >> > >> > >> Thanks for your feedback, > >> > >> Ian Macfarlane > >> > > > > > > ------------------------------------------------------- > > Using Tomcat but need to do more? Need to support web services, securit= y? > > Get stuff done quickly with pre-integrated technology to make your job > > easier > > Download IBM WebSphere Application Server v.1.0.1 based on Apache > > Geronimo > > http://sel.as-us.falkag.net/sel?cmd=3Dk&kid=120709&bid&3057&dat=121642 > > _______________________________________________ > > Htmlparser-developer mailing list > > Htm...@li... > > https://lists.sourceforge.net/lists/listinfo/htmlparser-developer > > > > > > > > ------------------------------------------------------- > Using Tomcat but need to do more? Need to support web services, security? > Get stuff done quickly with pre-integrated technology to make your job ea= sier > Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronim= o > http://sel.as-us.falkag.net/sel?cmd=3Dlnk&kid=3D120709&bid=3D263057&dat= =3D121642 > _______________________________________________ > Htmlparser-developer mailing list > Htm...@li... > https://lists.sourceforge.net/lists/listinfo/htmlparser-developer > |