htmlparser-developer Mailing List for HTML Parser (Page 3)
Brought to you by:
derrickoswald
From: Ian M. <ian...@gm...> - 2006-09-20 13:09:04
Derrick,

I have reservations regarding relicensing - the LGPL is convertible to the GPL for GPL-only projects, which together make up a very large number of OSS projects. The CPL, however, is incompatible with GPL version 2 (though it looks like it may be compatible with GPL version 3).

In your email you appear to be saying that you will be removing the current LGPL license from the project in favour of the CPL. However, the HTMLParser website states that the project is under both the CPL and LGPL.

I would be happy with adding the CPL license to the project but leaving the LGPL as an option, but I would be quite unhappy if the LGPL license were to be removed.

Please can you confirm the situation regarding this?

(everything else sounds great by the way)

Kind regards,

Ian Macfarlane
From: Derrick O. <der...@ro...> - 2006-09-17 20:48:20
The very popular HTML Parser project (http://sourceforge.net/projects/htmlparser) on SourceForge has been updated with a new license, new build environment, new repository and a new web site. To identify this radical change, the version has been revved to 2.0.

In response to requests from the Apache community, the htmlparser license has changed from the GNU Library or Lesser General Public License to the more Apache-friendly Common Public License 1.0 (http://opensource.org/licenses/cpl1.0.txt).

As most projects are doing, the htmlparser repository has been changed from CVS to Subversion (http://subversion.tigris.org/).

To support automatic integration in other projects, the build environment has changed from Ant to Maven 2 (http://maven.apache.org/). This has provided an opportunity to update the web site (http://htmlparser.org). Project SNAPSHOTS and releases should be available soon; bear with us as we work out the kinks.

HTML Parser is a Java library used to parse HTML in either a linear or nested fashion.
From: Derrick O. <Der...@Ro...> - 2006-05-29 11:13:05
Rongdong has joined HTML Parser as a developer. He's currently living in Shanghai, and has been doing some freelance work with PHP/MySQL. He's currently working on a project that requires a lot of HTML parsing and has some functions he would like to add.
From: Derrick O. <der...@ro...> - 2006-05-16 23:33:41
Looks fine.

Ian Macfarlane <ian...@gm...> wrote:

Derrick,

Can I get a final confirmation you would be happy with the following list of constructors:

HasAttributeFilter(String attribute)
HasAttributeFilter(String attribute, String value)
HasAttributeFilter(String attribute, String value, boolean caseSensitive)
HasAttributeFilter(String attribute, String value, boolean caseSensitive, Locale locale)
HasAttributeFilter(String attribute, String value, int regexType)
HasAttributeFilter(String attribute, String value, int regexType, int regexFlags)

Thanks

Ian Macfarlane

On 5/16/06, Derrick Oswald wrote:
> It looks good to me.
> I might be tempted to pass a Locale object rather than a String in the constructors, like the existing StringFilter constructors.
>
> Ian Macfarlane wrote:
>
> I've just committed the new constructors for OrFilter and a new class XorFilter, as these were simple useful additions and non-controversial.
>
> I would still love some feedback about the (revised) proposed changes to HasAttributeFilter (see below in the email), as I don't want to write it only for more senior devs to afterwards decide to change it back because they didn't look at it before. If any of it isn't clear, please ask.
>
> Ian
>
> On 09/05/06, Ian Macfarlane wrote:
> > I think the existing StringFilter and RegexFilter, as they apply to Text nodes only at the moment, should probably be left alone. Whichever class we apply this to should handle tags only. One set for text, one for tags (although I also think that the two text-searching ones should possibly be made one class).
> >
> > Now I look through the existing filters, it strikes me that we might already have some of this in place in the form of LinkStringFilter / LinkRegexFilter. These basically do the scanning based on a tag and attribute, but restricted to LinkTag tags. I think combining these with HasAttributeFilter pretty much gives us what we want (indeed we could in theory deprecate the two Link*Filter classes in favour of this - not sure if we'd want to do that or not).
> >
> > Also, you can indeed have both regex and case insensitive - it's built into Java's Pattern, and is actually used by LinkRegexFilter [sample Pattern.compile (regexPattern, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)].
> >
> > So after a fair bit of consideration and changing my mind several times, I've settled back on extending HasAttributeFilter.
> >
> > This is the list of things we need to be able to tell the filter:
> >
> > Attribute
> > Value
> > Attrib value case sensitive?
> > Locale (but not for regex, as that uses Pattern.UNICODE_CASE)
> > Regex on/off
> > Regex type (MATCH, LOOKINGAT, FIND)
> >
> > You could certainly combine the regex on/off with regex type in the constructor, e.g.:
> >
> > HasAttributeFilter(String attribute)
> > HasAttributeFilter(String attribute, String value)
> > HasAttributeFilter(String attribute, String value, <case sensitive here>)
> > HasAttributeFilter(String attribute, String value, <case sensitive here>, int regexType)
> >
> > Where regexType = 0 (OFF), 1 (MATCH), etc. Then by default for the others it would be 0 (OFF). I don't think that's too confusing.
> >
> > The issue is how to fold the Locale into the constructor, as it's not used by the regex (the regex either uses US-ASCII or Unicode, depending on whether Pattern.UNICODE_CASE is set). Also regexes can include locale-specific sections in them (as well as, of course, the usual case-insensitive stuff). So I think we want to mutually exclude passing values for locale and regex:
> >
> > The non-regex constructors:
> >
> > HasAttributeFilter(String attribute)
> > HasAttributeFilter(String attribute, String value)
> > HasAttributeFilter(String attribute, String value, boolean caseSensitive)
> > HasAttributeFilter(String attribute, String value, boolean caseSensitive, String locale)
> >
> > or alternatively:
> >
> > HasAttributeFilter(String attribute)
> > HasAttributeFilter(String attribute, String value)
> > HasAttributeFilter(String attribute, String value, String caseInsensitiveLocale)
> > [for the last one therefore we turn on case-insensitivity]
> >
> > and for the regex-accepting constructors either:
> >
> > HasAttributeFilter(String attribute, String value, int regexType)
> > HasAttributeFilter(String attribute, String value, boolean caseSensitive, int regexType)
> >
> > or alternatively:
> >
> > HasAttributeFilter(String attribute, String value, int regexType)
> > HasAttributeFilter(String attribute, String value, int regexType, int regexFlags) (e.g. let them pass Pattern.CASE_INSENSITIVE etc).
> >
> > I personally favour choice number 1 for the non-regex constructors and number 2 for the regex constructors, so I think the list of constructors should look like:
> >
> > HasAttributeFilter(String attribute)
> > HasAttributeFilter(String attribute, String value)
> > HasAttributeFilter(String attribute, String value, boolean caseSensitive)
> > HasAttributeFilter(String attribute, String value, boolean caseSensitive, String locale)
> > HasAttributeFilter(String attribute, String value, int regexType)
> > HasAttributeFilter(String attribute, String value, int regexType, int regexFlags)
> >
> > I'd love some feedback please :)
> >
> > ---------------------------------------------------------------------------------
> >
> > AndFilter/OrFilter taking arrays - this seems like you'd like it, so if Sourceforge CVS will stop being broken I might try and add it. In the case of fewer than 2 filters being added, I'm in favour of throwing an IllegalArgumentException - does that sound reasonable?
> >
> > ---------------------------------------------------------------------------------
> >
> > XorFilter - I'm not 100% sure how the XOR logic should work if it has more than two filters. According to the ever reliable (ahem!) Wikipedia http://en.wikipedia.org/wiki/XOR XOR over multiple entries "is true iff an odd number of the variables are true". So "true true false" is false, and "false true false" is true.
> >
> > Ian
> >
> > On 5/9/06, Derrick Oswald wrote:
> > > Ian,
> > >
> > > The conversion of case requires either an assumption of encoding or an explicit one. See for example the additional Locale property on StringFilter.
> > >
> > > The regex library requires or assumes a strategy, either MATCH, LOOKINGAT or FIND. See for example the additional int property on RegexFilter.
> > >
> > > I'm not sure how much could be gained by subclassing the existing HasAttributeFilter.
> > >
> > > Another strategy would be to add boolean properties for 'InText' (on by default), 'InAttributeName', and 'InAttributeValue' to the StringFilter and RegexFilter. Then of course you would need to add an AttributeName property. The attribute name being allowed to be null is a good idea, and would be the default if it's just not set, no need for an extra boolean 'nameIsNull' property. By the way, searching the tag name would come for free if the attributes checking loop started at index zero.
> > >
> > > That would mean adding three boolean and a string property to the two classes. I think these are differences enough to warrant new classes. In fact, maybe this should be one really prickly class called a SearchFilter that combines what StringFilter and RegexFilter do, plus the above. I don't think something can be case-insensitive and a regex filter though, so these aren't completely orthogonal. So maybe a 'type' property:
> > >
> > > straight string match
> > > case insensitive match - needs or assumes a Locale
> > > regex match - needs or assumes a strategy
> > >
> > > I leave it up to you though. Sounds like a fair piece of work.
> > >
> > > The extra constructors on the AndFilter and OrFilter are also good ideas.
> > >
> > > The XorFilter seems like a good thing to round out the logical operations. Would it also take an array of filters and only return true if just one is matched?
> > >
> > > The FilterBuilder would need to be altered to handle these changes of course, assuming this was a goal. This would be easier if there were just new SearchFilter and XorFilter classes rather than changes to the existing HasAttributeFilter, StringFilter, and RegexFilter (because new classes could be ignored, like the CssSelectorFilter is currently being).
> > >
> > > Derrick
> > >
> > > Ian Macfarlane wrote:
> > >
> > > > I would also like to be able to set the attribute as null but the attribute value as not-null. In this case, it should attempt to match all attributes against the attribute value.
> > > >
> > > > Please email me if you have any objections to this (or anything else).
> > > >
> > > > Thanks
> > > >
> > > > Ian Macfarlane
> > > >
> > > > On 5/8/06, Ian Macfarlane wrote:
> > > >
> > > >> I would like to add the following functionality to HasAttributeFilter:
> > > >>
> > > >> 1) A boolean flag to set if the matching should be case-insensitive. I think this could be done with a boolean, one new constructor (String attribute, String value, boolean attribValue) and get/set method pair.
> > > >>
> > > >> 2) A flag to mark that the attribValue should be parsed as a regular expression (I don't really see the benefit of doing this with the tag name). This should also obey the case-sensitivity rule in (1). For this, I imagine a further constructor and get/set method pair. (a sample use case of this is "post\d+" to match post1, post22, post343545, etc).
> > > >>
> > > >> I'm willing to go ahead and code these, but I thought I should run this past you other developers too in case you dislike either idea. I'm also open to either:
> > > >>
> > > >> a) putting the regexp stuff in a subclass of HasAttributeFilter (but it seems a small enough change to be suitable as part of the class size-wise).
> > > >>
> > > >> b) changing the one/two boolean constructors to be one constructor that takes an INT flag, and add flags for the different combinations (e.g. CASE_SENSITIVE = 1, USE_REGEX = 2, so both together would be 3). This seems unnecessarily complex, and doing it the way I suggested above still allows for this in the future if desired.
> > > >>
> > > >> Thanks for your feedback,
> > > >>
> > > >> Ian Macfarlane
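The thread above leaves two readings of a multi-input XOR on the table: the parity definition Ian quotes ("true iff an odd number of the variables are true") and the "only return true if just one is matched" reading Derrick asks about. A minimal Java sketch of the difference follows. It is not the XorFilter source; the class and method names are hypothetical, and plain booleans stand in for the results of the individual filters.

```java
/** Hypothetical stand-alone sketch; not taken from the HTML Parser code base. */
final class XorSemantics {

    /** Parity reading: true iff an odd number of the inputs are true. */
    static boolean parityXor(boolean... values) {
        boolean result = false;
        for (boolean v : values) {
            result ^= v; // each true input flips the running result
        }
        return result;
    }

    /** "Just one" reading: true iff exactly one input is true. */
    static boolean exactlyOne(boolean... values) {
        int count = 0;
        for (boolean v : values) {
            if (v) {
                count++;
            }
        }
        return count == 1;
    }

    public static void main(String[] args) {
        System.out.println(parityXor(true, true, false));  // false
        System.out.println(parityXor(false, true, false)); // true
        System.out.println(parityXor(true, true, true));   // true
        System.out.println(exactlyOne(true, true, true));  // false
    }
}
```

The two readings agree for two inputs and first diverge when three inputs are all true, which is the case worth pinning down before documenting the filter's behaviour.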
From: Ian M. <ian...@gm...> - 2006-05-16 17:07:53
Derrick,

Can I get a final confirmation you would be happy with the following list of constructors:

HasAttributeFilter(String attribute)
HasAttributeFilter(String attribute, String value)
HasAttributeFilter(String attribute, String value, boolean caseSensitive)
HasAttributeFilter(String attribute, String value, boolean caseSensitive, Locale locale)
HasAttributeFilter(String attribute, String value, int regexType)
HasAttributeFilter(String attribute, String value, int regexType, int regexFlags)

Thanks

Ian Macfarlane

On 5/16/06, Derrick Oswald <der...@ro...> wrote:
> It looks good to me.
> I might be tempted to pass a Locale object rather than a String in the constructors, like the existing StringFilter constructors.
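To make the two families of constructors concrete, here is a hedged sketch of the matching logic they imply, written against the standard java.util.regex and java.util.Locale classes only. The class name, method names, and the numeric values of the MATCH/LOOKINGAT/FIND constants are assumptions for illustration; they are not the committed HasAttributeFilter code, and the real constants live on RegexFilter.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of attribute-value matching; not the HasAttributeFilter source. */
final class AttributeValueMatching {

    // Assumed strategy constants, named after the ones discussed in the thread.
    static final int MATCH = 1;     // whole value must match the regex
    static final int LOOKINGAT = 2; // regex must match at the start of the value
    static final int FIND = 3;      // regex may match anywhere in the value

    /** Regex path: regexFlags is passed straight to Pattern.compile. */
    static boolean regexMatches(String value, String regex, int regexType, int regexFlags) {
        Matcher matcher = Pattern.compile(regex, regexFlags).matcher(value);
        switch (regexType) {
            case MATCH:
                return matcher.matches();
            case LOOKINGAT:
                return matcher.lookingAt();
            case FIND:
                return matcher.find();
            default:
                throw new IllegalArgumentException("unknown regexType: " + regexType);
        }
    }

    /** Non-regex path: case folding depends on the Locale that is passed in. */
    static boolean plainMatches(String value, String expected, boolean caseSensitive, Locale locale) {
        if (caseSensitive) {
            return value.equals(expected);
        }
        return value.toLowerCase(locale).equals(expected.toLowerCase(locale));
    }
}
```

With these definitions, `regexMatches("post22", "post\\d+", MATCH, 0)` accepts the value, while `"post22 extra"` passes only under LOOKINGAT or FIND. Passing `Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE` as the flags lets `"POST22"` match without any Locale, which is why the regex and Locale parameters can be kept in separate constructors. On the non-regex path the Locale matters: under a Turkish locale, `"TITLE"` lower-cases with a dotless i and so does not equal `"title"`.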
From: Derrick O. <der...@ro...> - 2006-05-16 14:01:39
It looks good to me.
I might be tempted to pass a Locale object rather than a String in the constructors, like the existing StringFilter constructors.

Ian Macfarlane <ian...@gm...> wrote:

I've just committed the new constructors for OrFilter and a new class XorFilter, as these were simple useful additions and non-controversial.

I would still love some feedback about the (revised) proposed changes to HasAttributeFilter (see below in the email), as I don't want to write it only for more senior devs to afterwards decide to change it back because they didn't look at it before. If any of it isn't clear, please ask.

Ian

On 09/05/06, Ian Macfarlane wrote:
> I think the existing StringFilter and RegexFilter, as they apply to Text nodes only at the moment, should probably be left alone. Whichever class we apply this to should handle tags only. One set for text, one for tags (although I also think that the two text-searching ones should possibly be made one class).
>
> Now I look through the existing filters, it strikes me that we might already have some of this in place in the form of LinkStringFilter / LinkRegexFilter. These basically do the scanning based on a tag and attribute, but restricted to LinkTag tags. I think combining these with HasAttributeFilter pretty much gives us what we want (indeed we could in theory deprecate the two Link*Filter classes in favour of this - not sure if we'd want to do that or not).
>
> Also, you can indeed have both regex and case insensitive - it's built into Java's Pattern, and is actually used by LinkRegexFilter [sample Pattern.compile (regexPattern, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)].
>
> So after a fair bit of consideration and changing my mind several times, I've settled back on extending HasAttributeFilter.
>
> This is the list of things we need to be able to tell the filter:
>
> Attribute
> Value
> Attrib value case sensitive?
> Locale (but not for regex, as that uses Pattern.UNICODE_CASE)
> Regex on/off
> Regex type (MATCH, LOOKINGAT, FIND)
>
> You could certainly combine the regex on/off with regex type in the constructor, e.g.:
>
> HasAttributeFilter(String attribute)
> HasAttributeFilter(String attribute, String value)
> HasAttributeFilter(String attribute, String value, <case sensitive here>)
> HasAttributeFilter(String attribute, String value, <case sensitive here>, int regexType)
>
> Where regexType = 0 (OFF), 1 (MATCH), etc. Then by default for the others it would be 0 (OFF). I don't think that's too confusing.
>
> The issue is how to fold the Locale into the constructor, as it's not used by the regex (the regex either uses US-ASCII or Unicode, depending on whether Pattern.UNICODE_CASE is set). Also regexes can include locale-specific sections in them (as well as, of course, the usual case-insensitive stuff). So I think we want to mutually exclude passing values for locale and regex:
>
> The non-regex constructors:
>
> HasAttributeFilter(String attribute)
> HasAttributeFilter(String attribute, String value)
> HasAttributeFilter(String attribute, String value, boolean caseSensitive)
> HasAttributeFilter(String attribute, String value, boolean caseSensitive, String locale)
>
> or alternatively:
>
> HasAttributeFilter(String attribute)
> HasAttributeFilter(String attribute, String value)
> HasAttributeFilter(String attribute, String value, String caseInsensitiveLocale)
> [for the last one therefore we turn on case-insensitivity]
>
> and for the regex-accepting constructors either:
>
> HasAttributeFilter(String attribute, String value, int regexType)
> HasAttributeFilter(String attribute, String value, boolean caseSensitive, int regexType)
>
> or alternatively:
>
> HasAttributeFilter(String attribute, String value, int regexType)
> HasAttributeFilter(String attribute, String value, int regexType, int regexFlags) (e.g. let them pass Pattern.CASE_INSENSITIVE etc).
>
> I personally favour choice number 1 for the non-regex constructors and number 2 for the regex constructors, so I think the list of constructors should look like:
>
> HasAttributeFilter(String attribute)
> HasAttributeFilter(String attribute, String value)
> HasAttributeFilter(String attribute, String value, boolean caseSensitive)
> HasAttributeFilter(String attribute, String value, boolean caseSensitive, String locale)
> HasAttributeFilter(String attribute, String value, int regexType)
> HasAttributeFilter(String attribute, String value, int regexType, int regexFlags)
>
> I'd love some feedback please :)
>
> ---------------------------------------------------------------------------------
>
> AndFilter/OrFilter taking arrays - this seems like you'd like it, so if Sourceforge CVS will stop being broken I might try and add it. In the case of fewer than 2 filters being added, I'm in favour of throwing an IllegalArgumentException - does that sound reasonable?
>
> ---------------------------------------------------------------------------------
>
> XorFilter - I'm not 100% sure how the XOR logic should work if it has more than two filters. According to the ever reliable (ahem!) Wikipedia http://en.wikipedia.org/wiki/XOR XOR over multiple entries "is true iff an odd number of the variables are true". So "true true false" is false, and "false true false" is true.
>
> Ian
>
> On 5/9/06, Derrick Oswald wrote:
> > Ian,
> >
> > The conversion of case requires either an assumption of encoding or an explicit one. See for example the additional Locale property on StringFilter.
> >
> > The regex library requires or assumes a strategy, either MATCH, LOOKINGAT or FIND. See for example the additional int property on RegexFilter.
> >
> > I'm not sure how much could be gained by subclassing the existing HasAttributeFilter.
> >
> > Another strategy would be to add boolean properties for 'InText' (on by default), 'InAttributeName', and 'InAttributeValue' to the StringFilter and RegexFilter. Then of course you would need to add an AttributeName property. The attribute name being allowed to be null is a good idea, and would be the default if it's just not set, no need for an extra boolean 'nameIsNull' property. By the way, searching the tag name would come for free if the attributes checking loop started at index zero.
> >
> > That would mean adding three boolean and a string property to the two classes. I think these are differences enough to warrant new classes. In fact, maybe this should be one really prickly class called a SearchFilter that combines what StringFilter and RegexFilter do, plus the above. I don't think something can be case-insensitive and a regex filter though, so these aren't completely orthogonal. So maybe a 'type' property:
> >
> > straight string match
> > case insensitive match - needs or assumes a Locale
> > regex match - needs or assumes a strategy
> >
> > I leave it up to you though. Sounds like a fair piece of work.
> >
> > The extra constructors on the AndFilter and OrFilter are also good ideas.
> >
> > The XorFilter seems like a good thing to round out the logical operations. Would it also take an array of filters and only return true if just one is matched?
> >
> > The FilterBuilder would need to be altered to handle these changes of course, assuming this was a goal. This would be easier if there were just new SearchFilter and XorFilter classes rather than changes to the existing HasAttributeFilter, StringFilter, and RegexFilter (because new classes could be ignored, like the CssSelectorFilter is currently being).
> >
> > Derrick
> >
> > Ian Macfarlane wrote:
> >
> > > I would also like to be able to set the attribute as null but the attribute value as not-null.
In this case, it should attempt to match > > > all attributes against the attribute value. > > > > > > Please email me if you have any objections to this (or anything else). > > > > > > Thanks > > > > > > Ian Macfarlane > > > > > > On 5/8/06, Ian Macfarlane wrote: > > > > > >> I would like to add the following functionality to HasAttributeFilter: > > >> > > >> 1) A boolean flag to set if the matching should be case-insensitive. I > > >> think this could be done with a boolean, one new constructor (String > > >> attribute, String value, boolean attribValue) and get/set method pair. > > >> > > >> 2) A flag to mark that the attribValue should be parsed as a regular > > >> expression (I don't really see the benefit of doing this with the tag > > >> name). This should also obey the case-sensitivity rule in (1). For > > >> this, I imagine a further constructor and get/set method pair. (a > > >> sample use case of this is "post\d+" to match post1, post22, > > >> post343545, etc). > > >> > > >> > > >> I'm willing to go ahead and code these, but I thought I should run > > >> this past you other developers too in case you dislike either idea. > > >> I'm also open to either: > > >> > > >> a) putting the regexp stuff in a subclass of HasAttributeFilter (but > > >> it seems a small enough change to be suitable as part of the class > > >> size-wise). > > >> > > >> b) changing the one/two boolean constructors to be one constructor > > >> that takes an INT flag, and add flags for the different combinations > > >> (e.g. CASE_SENSITIVE = 1, USE_REGEX = 2, so both together would be 3). > > >> This seems unnecessarily complex, and doing it the way I suggested > > >> above still allows for this in the future if desired. > > >> > > >> > > >> Thanks for your feedback, > > >> > > >> Ian Macfarlane ------------------------------------------------------- Using Tomcat but need to do more? Need to support web services, security? 
Get stuff done quickly with pre-integrated technology to make your job easier Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo http://sel.as-us.falkag.net/sel?cmd=lnk&kid0709&bid&3057&dat1642 _______________________________________________ Htmlparser-developer mailing list Htm...@li... https://lists.sourceforge.net/lists/listinfo/htmlparser-developer |
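On the Locale-versus-String point in the reply above, a minimal sketch of why the locale matters for case-insensitive attribute matching. The class and method names are hypothetical and this is not the HTML Parser implementation; it only assumes that case-insensitive comparison is done by upper-casing both sides with an explicit java.util.Locale.

```java
import java.util.Locale;

// Hypothetical helper, not part of HTML Parser: compares two attribute
// values case-insensitively by upper-casing both with an explicit Locale.
final class LocaleArgSketch {
    static boolean equalsIgnoreCase(String a, String b, Locale locale) {
        return a.toUpperCase(locale).equals(b.toUpperCase(locale));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // In English, "title" upper-cases to "TITLE".
        System.out.println(equalsIgnoreCase("title", "TITLE", Locale.ENGLISH));
        // In Turkish, 'i' upper-cases to a dotted capital I, so the strings differ.
        System.out.println(equalsIgnoreCase("title", "TITLE", turkish));
    }
}
```

Taking a Locale object in the constructor avoids parsing a locale name inside the filter and lets callers pass constants such as Locale.ENGLISH directly.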
From: Ian M. <ian...@gm...> - 2006-05-16 13:40:41
|
Oops! I missed mentioning that I added the new constructor to AndFilter as well as OrFilter. Ian |
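The array-taking constructors discussed in this thread, together with the proposed IllegalArgumentException when fewer than two filters are supplied, could look roughly like the sketch below. It uses java.util.function.Predicate as a stand-in for the library's filter interface, and the class name AndOfMany is hypothetical; the committed AndFilter and OrFilter code may differ.

```java
import java.util.function.Predicate;

// Hypothetical stand-in for an AND filter built from an array of filters.
final class AndOfMany<T> implements Predicate<T> {
    private final Predicate<T>[] filters;

    @SafeVarargs
    AndOfMany(Predicate<T>... filters) {
        // Proposed rule from the thread: reject fewer than two filters.
        if (filters == null || filters.length < 2) {
            throw new IllegalArgumentException("at least two filters are required");
        }
        this.filters = filters.clone(); // defensive copy of the caller's array
    }

    @Override
    public boolean test(T node) {
        // Accept only if every predicate accepts; stop at the first rejection.
        for (Predicate<T> filter : filters) {
            if (!filter.test(node)) {
                return false;
            }
        }
        return true;
    }
}
```

An OR variant would be the mirror image: return true at the first accepting predicate and false if none accept.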
From: Ian M. <ian...@gm...> - 2006-05-16 13:34:20
|
I've just committed the new constructors for OrFilter and a new class XorFilter, as these were simple useful additions and non-controversial. I would still love some feedback about the (revised) proposed changes to HasAttributeFilter (see below in the email), as I don't want to write it only for more senior devs to afterwards decide to change it back because they didn't look at it before. If any of it isn't clear, please ask. Ian |
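The thread leaves two readings of a multi-input XOR on the table: the parity rule quoted from Wikipedia (true iff an odd number of inputs are true) and the "just one is matched" reading raised as a question. A small sketch of both follows, using plain booleans rather than filters; the class and method names are hypothetical, and the sketch makes no claim about which rule the committed XorFilter uses.

```java
// Hypothetical illustration of the two XOR readings discussed in the thread.
final class XorSketch {
    // Parity reading: true iff an odd number of the inputs are true.
    static boolean parity(boolean... values) {
        boolean result = false;
        for (boolean v : values) {
            result ^= v;
        }
        return result;
    }

    // "Just one" reading: true iff exactly one input is true.
    static boolean exactlyOne(boolean... values) {
        int count = 0;
        for (boolean v : values) {
            if (v) {
                count++;
            }
        }
        return count == 1;
    }
}
```

The two agree for two inputs and diverge from three upward: with all three inputs true, the parity reading accepts while the "just one" reading rejects.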
From: Ian M. <ian...@gm...> - 2006-05-09 11:56:22
|
I think the existing StringFilter and RegexFilter, as they apply to Text nodes only at the moment, should probably be left alone. Whichever class we apply this to should handle tags only. One set for text, one for tags (although I also think that the two text-searching ones should possible be made one class). Now I look through the existing filters, it strikes me that we might already have some of this in place in the form of LinkStringFilter / LinkRegexFilter. These basically do the scanning based on a tag and attribute, but restricted to LinkTag tags. I think combining these with HasAttributeFilter pretty much gives us what we want (indeed we could in theory deprecate the two Link*Filter classes in favour of this - not sure if we'd want to do that or not). Also, you can indeed have both regex and case insensitive - it's built into Java's Pattern, and is actually used by LinkRegexFilter [sample Pattern.compile (regexPattern, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)]. So after a fair bit of consideration and changing my mind several times, I've settled back on extending HasAttributeFilter. This is the list of things we need to be able to tell the filter: Attribute Value Attrib value case sensitive? Locale (but not for regex as use Pattern.UNICODE_CASE) Regex on/off Regex type (MATCH, LOOKINGAT, FIND) You could certainly combine the regex on/off with regex type in the constructor, e.g.: HasAttributeFilter(String attribute) HasAttributeFilter(String attribute, String value) HasAttributeFilter(String attribute, String value, <stuff for case sensitive here>) HasAttributeFilter(String attribute, String value, <stuff for case sensitive here>, int regexType) Where regexType =3D 0 (OFF), 1 (MATCH), etc. Then by default for the others it would be 0 (OFF). I don't think that's too confusing. The issue is how to fold in the Locale into the constructor, as it's not used by the regex (the regex either uses US-ASCII or Unicode, depending if Pattern.UNICODE_CASE is set). 
Also regexes can include locale-specific sections in them (as well as, of course, the usual case-insensitive stuff). So I think we want to mutually exclude passing values for locale and regex: The non regex constructors: HasAttributeFilter(String attribute) HasAttributeFilter(String attribute, String value) HasAttributeFilter(String attribute, String value, boolean caseSensitive) HasAttributeFilter(String attribute, String value, boolean caseSensitive, String locale) or alternatively: HasAttributeFilter(String attribute) HasAttributeFilter(String attribute, String value) HasAttributeFilter(String attribute, String value, String caseInsensitiveLo= cale) [for the last one therefore we turn on case-insensitivity] and for the regex-accepting constructors either: HasAttributeFilter(String attribute, String value, int regexType) HasAttributeFilter(String attribute, String value, boolean caseSensitive, int regexType) or alternatively: HasAttributeFilter(String attribute, String value, int regexType) HasAttributeFilter(String attribute, String value, int regexType, int regexFlags) (e.g. let them pass Pattern.CASE_INSENSITIVE etc). I personally favour choice number 1 for the non-regex constructors and number 2 for the regex constructors, so I think it the list of constructors should look like: HasAttributeFilter(String attribute) HasAttributeFilter(String attribute, String value) HasAttributeFilter(String attribute, String value, boolean caseSensitive) HasAttributeFilter(String attribute, String value, boolean caseSensitive, String locale) HasAttributeFilter(String attribute, String value, int regexType) HasAttributeFilter(String attribute, String value, int regexType, int regexFlags) I'd love some feedback please :) ---------------------------------------------------------------------------= ------ AndFilter/OrFilter taking arrays - this seems like you'd like it, so if Sourceforge CVS will stop being broken I might try and add it. 
In the case of fewer than two filters being added, I'm in favour of throwing an IllegalArgumentException - does that sound reasonable?

---------------------------------------------------------------------------------

XorFilter - I'm not 100% sure how the XOR logic should work if it has more than two filters. According to the ever reliable (ahem!) Wikipedia http://en.wikipedia.org/wiki/XOR XOR over multiple entries "is true iff an odd number of the variables are true". So "true true false" is false, and "false true false" is true.

Ian |
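The two readings of a multi-filter XOR discussed in this thread (odd parity versus "only one matches") can be compared with a small sketch. It is only an illustration: `XorSketch`, `parityXor` and `exactlyOne` are hypothetical names, and `java.util.function.Predicate` stands in for the library's filter type, so nothing here is the actual HTML Parser API.

```java
import java.util.function.Predicate;

/** Hypothetical sketch; Predicate stands in for a node filter. */
final class XorSketch {

    /** Parity reading: true iff an odd number of the filters accept the input. */
    @SafeVarargs
    static <T> Predicate<T> parityXor(Predicate<T>... filters) {
        requireAtLeastTwo(filters);
        return input -> {
            boolean result = false;
            for (Predicate<T> filter : filters) {
                result ^= filter.test(input);
            }
            return result;
        };
    }

    /** "Just one" reading: true iff exactly one of the filters accepts the input. */
    @SafeVarargs
    static <T> Predicate<T> exactlyOne(Predicate<T>... filters) {
        requireAtLeastTwo(filters);
        return input -> {
            int matches = 0;
            for (Predicate<T> filter : filters) {
                if (filter.test(input)) {
                    matches++;
                }
            }
            return matches == 1;
        };
    }

    /** Rejects arrays with fewer than two filters, as proposed in the thread. */
    private static void requireAtLeastTwo(Object[] filters) {
        if (filters == null || filters.length < 2) {
            throw new IllegalArgumentException("at least two filters are required");
        }
    }
}
```

The two readings agree for two filters and only diverge when three or more filters match, for example when all of three filters accept the same node.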
From: Derrick O. <Der...@Ro...> - 2006-05-09 03:14:10
|
Ian,

The conversion of case requires either an assumption of encoding or an explicit one. See for example the additional Locale property on StringFilter.

The regex library requires or assumes a strategy, either MATCH, LOOKINGAT or FIND. See for example the additional int property on RegexFilter.

I'm not sure how much could be gained by subclassing the existing HasAttributeFilter.

Another strategy would be to add boolean properties for 'InText' (on by default), 'InAttributeName', and 'InAttributeValue' to the StringFilter and RegexFilter. Then of course you would need to add an AttributeName property. The attribute name being allowed to be null is a good idea, and would be the default if it's just not set, no need for an extra boolean 'nameIsNull' property. By the way, searching the tag name would come for free if the attributes checking loop started at index zero.

That would mean adding three boolean and a string property to the two classes. I think these are differences enough to warrant new classes. In fact, maybe this should be one really prickly class called a SearchFilter that combines what StringFilter and RegexFilter do, plus the above. I don't think something can be case-insensitive and a regex filter though, so these aren't completely orthogonal. So maybe a 'type' property:
    straight string match
    case insensitive match - needs or assumes a Locale
    regex match - needs or assumes a strategy
I leave it up to you though. Sounds like a fair piece of work.

The extra constructors on the AndFilter and OrFilter are also good ideas.

The XorFilter seems like a good thing to round out the logical operations. Would it also take an array of filters and only return true if just one is matched?

The FilterBuilder would need to be altered to handle these changes of course, assuming this was a goal. This would be easier if there were just new SearchFilter and XorFilter classes rather than changes to the existing HasAttributeFilter, StringFilter, and RegexFilter (because new classes could be ignored, like the CssSelectorFilter is currently being).

Derrick |
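The MATCH / LOOKINGAT / FIND strategies mentioned in this message map onto three methods of `java.util.regex.Matcher`. The sketch below shows how they differ on the "post\d+" example from the thread. `RegexStrategySketch`, its `Strategy` enum and `accept` are hypothetical names for illustration only, not the RegexFilter API. As a side note, with `java.util.regex` a pattern can also be compiled with `Pattern.CASE_INSENSITIVE`, so in that library at least the two options can be combined.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the three regex matching strategies. */
final class RegexStrategySketch {

    enum Strategy { MATCH, LOOKINGAT, FIND }

    static boolean accept(String value, String regex, Strategy strategy, boolean ignoreCase) {
        if (value == null) {
            return false;
        }
        int flags = ignoreCase ? Pattern.CASE_INSENSITIVE : 0;
        Matcher matcher = Pattern.compile(regex, flags).matcher(value);
        switch (strategy) {
            case MATCH:
                return matcher.matches();   // the whole value must match
            case LOOKINGAT:
                return matcher.lookingAt(); // the value must start with a match
            default:
                return matcher.find();      // a match anywhere in the value
        }
    }
}
```

Under MATCH, "post22" is accepted but "post22-extra" is not; LOOKINGAT accepts both; FIND additionally accepts values such as "my-post22".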
From: Ian M. <ian...@gm...> - 2006-05-08 14:17:51
|
I would also like to be able to set the attribute as null but the attribute value as not-null. In this case, it should attempt to match all attributes against the attribute value.

Please email me if you have any objections to this (or anything else).

Thanks

Ian Macfarlane |
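A rough sketch of the "null attribute name means check every attribute" behaviour proposed here follows. It works on a plain `Map` of attribute names to values rather than on HTML Parser tags, and `AttributeMatchSketch.matches` is a hypothetical helper, not an existing method of HasAttributeFilter. Comparing attribute names without regard to case is an assumption of the sketch.

```java
import java.util.Map;

/** Hypothetical sketch; a Map stands in for a tag's attributes. */
final class AttributeMatchSketch {

    /**
     * Returns true if an attribute holds the wanted value.
     * A null name means every attribute is checked against the value.
     */
    static boolean matches(Map<String, String> attributes, String name,
                           String value, boolean ignoreCase) {
        if (value == null) {
            throw new IllegalArgumentException("value must not be null");
        }
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            // Assumption: attribute names are compared case-insensitively.
            boolean nameOk = name == null || name.equalsIgnoreCase(entry.getKey());
            if (!nameOk || entry.getValue() == null) {
                continue;
            }
            boolean valueOk = ignoreCase
                    ? value.equalsIgnoreCase(entry.getValue())
                    : value.equals(entry.getValue());
            if (valueOk) {
                return true;
            }
        }
        return false;
    }
}
```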
From: Ian M. <ian...@gm...> - 2006-05-08 14:00:57
|
I would like to be able to do AndFilter and OrFilter with an array of existing filters (there is a method to do so but a constructor is far more handy, and stops you making an invalid one then making it valid in a second step). Having a quick peek at the source, it looks like this can be achieved easily with a simple new constructor that takes an array of filters. Does anyone object to this? Secondly, I thought it might be handy to add an XorFilter (eXclusive OR). Is this something people might be interested in? Thanks Ian Macfarlane |
From: Ian M. <ian...@gm...> - 2006-05-08 13:45:47
|
I would like to add the following functionality to HasAttributeFilter:

1) A boolean flag to set if the matching should be case-insensitive. I think this could be done with a boolean, one new constructor (String attribute, String value, boolean attribValue) and get/set method pair.

2) A flag to mark that the attribValue should be parsed as a regular expression (I don't really see the benefit of doing this with the tag name). This should also obey the case-sensitivity rule in (1). For this, I imagine a further constructor and get/set method pair. (a sample use case of this is "post\d+" to match post1, post22, post343545, etc).

I'm willing to go ahead and code these, but I thought I should run this past you other developers too in case you dislike either idea. I'm also open to either:

a) putting the regexp stuff in a subclass of HasAttributeFilter (but it seems a small enough change to be suitable as part of the class size-wise).

b) changing the one/two boolean constructors to be one constructor that takes an INT flag, and add flags for the different combinations (e.g. CASE_SENSITIVE = 1, USE_REGEX = 2, so both together would be 3). This seems unnecessarily complex, and doing it the way I suggested above still allows for this in the future if desired.

Thanks for your feedback,

Ian Macfarlane |
From: Yuta O. <ok...@ar...> - 2006-04-20 09:20:39
|
Thank you for your advice! I modified our code to reset the parser and call visitAllNodesWith() again, and parsing now completes successfully with the corrected encoding.

I also have a correction about the JIS handling. I made scanJIS() recognize "[ESC] ( I" as the end of a JIS-encoded string, but that was a mistake. According to ISO-2022-JP, it is necessary to return to the ASCII charset at the end of each line and of the text. The JIS X 0201-1976 "Kana" charset, although it is a single-byte charset, is not the ASCII charset.

Note that the code I modified only supports the Japanese charsets. There are many other charsets (e.g. Chinese, Korean, Latin, etc.) which use other escape sequences. If another problem occurs with escape sequence handling, the following URLs should help to resolve it.

Wikipedia - ISO/IEC 2022
http://en.wikipedia.org/wiki/ISO_2022

International Register of Coded Character Sets
http://www.itscj.ipsj.or.jp/ISO-IR/ |
From: Derrick O. <Der...@Ro...> - 2006-04-19 12:21:42
|
Yuta,

Thanks for the updated JIS handling. I will incorporate it into the Lexer.

As Matthew has indicated, the EncodingChangeException is thrown to let the user know that some nodes already handed out by the parser are incorrect according to the encoding. This is really the fault of the HTTP server, which should have sent the correct encoding as part of the Content-Type header string. But, given that you have no control over the server, the exception is the only solution. After the exception is thrown, the parser has set its encoding to the new value, so you should be able to just reset and reparse, see for example the handling in StringBean:

    catch (EncodingChangeException ece)
    {
        mIsPre = false;
        mIsScript = false;
        mIsStyle = false;
        try
        {   // try again with the encoding now in force
            mParser.reset ();
            mBuffer = new StringBuffer (4096);
            mParser.visitAllNodesWith (this);
            updateStrings (mBuffer.toString ());
        }
        catch (ParserException pe)
        {
            updateStrings (pe.toString ());
        }
        finally
        {
            mBuffer = new StringBuffer (4096);
        }
    }

You'll notice that it is up to the user code (StringBean for example) to reset its own state so that the reparse doesn't start from an arbitrary state.

Derrick |
From: Matthew B. <mat...@ou...> - 2006-04-19 10:03:37
|
Yuta Okamoto wrote:
> But it's one thing after another. When HTML parser find a "Content-Type"
> META tag, correct the current charset and read string before META tag once
> again to compare with the buffer already read by default encoding in
> org.htmlparser.lexer.InputStreamSource.setEncoding(). In this case, HTML
> parser throws ParserException(EncodingChangeException) because of comparing
> "[ESC]" from first character of old buffer with double byte character from
> that of new buffer.
>
> I'm overwhelmed by that. What should I do? In the meantime, I attach the
> revised code to this mail. please see the below.

Throwing an exception is the sensible thing to do, as otherwise you may have mishandled the content due to the incorrect encoding.

I changed EncodingChangeException so that you could find the original and replacement encodings. Then you can reset the parser and attempt to reparse the whole document using the new encoding. E.g.:

    try
    {
        parser.visitAllNodesWith(visitor);
    }
    catch (EncodingChangeException ece)
    {
        log.debug("Switch from " + ece.getOrginalEncoding() + " to "
            + ece.getReplacementEncoding());
        String encoding = ece.getReplacementEncoding();
        parser.reset();
        parser.setEncoding(encoding);
        visitor = getUserFilter(type);
        parser.visitAllNodesWith(visitor);
    }

I don't believe I ever sent the patch for EncodingChangeException back to the list. Unfortunately my hacked copy of HTMLParser is on my work computer at the moment, but I can dig it out when I'm back at work.

--
-- Matthew Buckett, VLE Developer
-- Learning Technologies Group, Oxford University Computing Services
-- Tel: +44 (0)1865 283660 http://www.oucs.ox.ac.uk/ltg/ |
From: Yuta O. <ok...@ar...> - 2006-04-19 09:49:18
|
Dear All,

I'm Yuta Okamoto, a part-time employee of Ariel Networks, Inc. I'm writing to ask about problems with HTML documents that include "JIS encoding" (ISO-2022-JP) strings.

In Japan, there are many types and versions of character sets. JIS encoding, one of the popular Japanese charsets, is defined as a subset of ISO-2022. We're developing an application using the HTML parser library, and have run into some problems. For example, consider an HTML document including JIS-encoded strings as below:

    <HTML>
    <HEAD>
    <TITLE>[JIS encoding strings]</TITLE>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-2022-jp">
    ...
    </HEAD>
    <BODY>
    ...
    </BODY>
    </HTML>

In this case, HTML parser can't recognize "</TITLE>" and treats the following tags and strings as the content of "TITLE". To find the reason, I got the source of HTML parser and traced its processing. As a result, I found the causes in org.htmlparser.lexer.Lexer.parseString() and scanJIS().

Within JIS-encoded strings, several kinds of "escape sequence" defined by ISO-2022 are used to switch character sets. For example,

    [ESC] $ B [double byte characters] [ESC] ( B

where "[ESC] $ B" means "switch to JIS X 0208-1983 (new JIS) charset" and "[ESC] ( B" means "switch to US-ASCII charset". For more detail, please see ISO-2022, RFC1468 or RFC1554.

HTML parser recognizes a string enclosed by ISO-2022 escape sequences. However, it recognizes only strings beginning with "[ESC] $ B" and ending with "[ESC] ( J", meaning "switch to JIS X 0201-1976 ("Roman" set)". In the above example, HTML parser can't find the end of the JIS-encoded string before the end of the document. In order to resolve this, I revised "org.htmlparser.lexer.Lexer.java" and this problem is improved.

But it's one thing after another. When HTML parser finds a "Content-Type" META tag, it corrects the current charset and rereads the text before the META tag, in org.htmlparser.lexer.InputStreamSource.setEncoding(), to compare it with the buffer already read using the default encoding. In this case, HTML parser throws a ParserException (EncodingChangeException) because it compares the "[ESC]" character in the old buffer with a double-byte character in the new buffer.

I'm overwhelmed by that. What should I do? In the meantime, I attach the revised code to this mail. Please see below.

Regards,
Okamoto

----------

    /**
     * Advance the cursor through a JIS escape sequence.<p>
     *
     * NOTE:<br>
     * A list of ISO-2022 escape sequences for charset switching.<br>
     * For more detail, see ISO-2022, RFC1468 or RFC1554.<p>
     *
     * [ double byte characters ]
     * <ul>
     * <li>(*) JIS X 0208-1978(old JIS): [ESC] $ @
     * <li>(*) JIS X 0208-1983(new JIS): [ESC] $ B
     * <li>JIS X 0208-1990: [ESC] & @ [ESC] $ B
     * <li>JIS X 0212-1990: [ESC] $ ( D
     * <li>1st plane of JIS X 0213:2000: [ESC] $ ( O
     * <li>1st plane of JIS X 0213:2004: [ESC] $ ( Q
     * <li>2nd plane of JIS X 0213:2000: [ESC] $ ( P
     * </ul>
     *
     * <p>[ single byte characters ]
     * <ul>
     * <li>(*) ISO/IEC 646 IRV(US-ASCII): [ESC] ( B
     * <li>(*) JIS X 0201-1976 ("Roman" set)
     * <ul>
     * <li>[ESC] ( J
     * <li>[ESC] ( H (NOT RECOMMENDED but rarely used)
     * </ul>
     * <li>JIS X 0201-1976 ("Kana" set): [ESC] ( I (NOT RECOMMENDED but rarely used)
     * </ul>
     *
     * <p>(*): commonly used
     *
     * @param cursor A cursor positioned within the escape sequence.
     * @exception ParserException If a problem occurs reading from the source.
     */
    protected void scanJIS (Cursor cursor)
        throws ParserException
    {
        boolean done;
        char ch;
        int state;

        done = false;
        state = 0;
        while (!done)
        {
            ch = mPage.getCharacter (cursor);
            if (Page.EOF == ch)
                done = true;
            else
                switch (state)
                {
                    case 0:
                        if (0x1b == ch) // escape
                            state = 1;
                        break;
                    case 1:
                        if ('(' == ch)
                            state = 2;
                        else
                            state = 0;
                        break;
                    case 2:
                        if ('B' == ch || 'J' == ch || 'H' == ch || 'I' == ch)
                            done = true;
                        else
                            state = 0;
                        break;
                    default:
                        throw new IllegalStateException ("state " + state);
                }
        }
    }

    /**
     * Parse a string node.
     * Scan characters until "</", "<%", "<!" or < followed by a
     * letter is encountered, or the input stream is exhausted, in which
     * case <code>null</code> is returned.
     * @param start The position at which to start scanning.
     * @param quotesmart If <code>true</code>, strings ignore quoted contents.
     * @return The parsed node.
     * @exception ParserException If a problem occurs reading from the source.
     */
    protected Node parseString (int start, boolean quotesmart)
        throws ParserException
    {
        boolean done;
        char ch;
        char quote;

        done = false;
        quote = 0;
        while (!done)
        {
            ch = mPage.getCharacter (mCursor);
            if (Page.EOF == ch)
                done = true;
            else if (0x1b == ch) // escape
            {
                ch = mPage.getCharacter (mCursor);
                if (Page.EOF == ch)
                    done = true;
                else if ('$' == ch)
                {
                    ch = mPage.getCharacter (mCursor);
                    if (Page.EOF == ch)
                        done = true;
                    // JIS X 0208-1978 and JIS X 0208-1983
                    else if ('@' == ch || 'B' == ch)
                        scanJIS (mCursor);
                    /*
                    // JIS X 0212-1990
                    else if ('(' == ch)
                    {
                        ch = mPage.getCharacter (mCursor);
                        if (Page.EOF == ch)
                            done = true;
                        else if ('D' == ch)
                            scanJIS (mCursor);
                        else
                        {
                            mCursor.retreat ();
                            mCursor.retreat ();
                            mCursor.retreat ();
                        }
                    }
                    */
                    else
                    {
                        mCursor.retreat ();
                        mCursor.retreat ();
                    }
                }
                else
                    mCursor.retreat ();
            }
            else if ( ...
        }
    } |
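To make the escape-sequence scan easier to follow outside the Lexer, here is a stand-alone sketch of the same state machine working on a plain `CharSequence`. `JisEscapeSketch` and `endOfJisRun` are hypothetical names, the Cursor and Page classes are not used, and the sketch treats only "[ESC] ( B", "[ESC] ( J" and "[ESC] ( H" as the end of a double-byte run, which is an assumption that follows the later correction about "[ESC] ( I" on this list.

```java
/** Hypothetical stand-alone sketch of scanning to the end of a JIS run. */
final class JisEscapeSketch {

    private static final char ESC = 0x1b;

    /**
     * Returns the index just past the first "[ESC] ( B", "[ESC] ( J" or
     * "[ESC] ( H" sequence at or after start, or text.length() if the text
     * ends before such a sequence is seen.
     */
    static int endOfJisRun(CharSequence text, int start) {
        int state = 0; // 0: looking for ESC, 1: seen ESC, 2: seen ESC and '('
        for (int i = start; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch == ESC) {
                state = 1; // an ESC always starts a new candidate sequence
            } else if (state == 1 && ch == '(') {
                state = 2;
            } else if (state == 2 && (ch == 'B' || ch == 'J' || ch == 'H')) {
                return i + 1;
            } else {
                state = 0;
            }
        }
        return text.length();
    }
}
```

Called with the index just after an opening "[ESC] $ B", the method returns the position where ordinary single-byte text (for example "</TITLE>") resumes.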
From: Ian M. <ian...@gm...> - 2006-02-13 15:18:18
|
I've just committed a new class called NodeTreeWalker to CVS. It is located in org.htmlparser.util.

This class allows you to iterate through the Nodes in a tree or sub-tree in either depth-first or breadth-first order (and you may switch the method being used during the search if desired). Think of it like a NodeIterator for a tree of Nodes instead of a linear sequence of Nodes. You may pick any Node in a document to be the root Node. It also supports limiting the depth of the search, which is particularly useful for iterative breadth-first searches where you wish to inspect one level of children at a time.

The breadth-first search code (along with previous code I've written e.g. the next/previousSibling methods) originally came from a tree-traversal program using HTMLParser that I wrote at the company I work for (NetRank Ltd), which has had quite a bit of testing. I'm hoping that there aren't too many horrible bugs in it :)

Limitations:

The only limitation is that the root of the tree must be a single Node, so perhaps at some point we may wish to create something along the lines of a DocumentHoldingNode within which the entire Document is stored. That way, it could support traversing documents with multiple root Nodes (e.g. docType and html).

Also, it might get stuck if someone deliberately mangles up a document tree such that a Node's parents/grandparents are also the same Node. This could probably be checked for in this class, but would probably be better suited to the actual get/set parent/children methods as this issue would affect more than just this class.

Comments/code review welcomed.

Best wishes

Ian Macfarlane
NetRank Ltd

ps: For future versions of the class, I would like to also see previousNode() functionality, but I do not expect to write this myself for the time being. If anyone wanted to write this, please go ahead, but make sure it's in the same style as the existing methods. |
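For readers who want to see the difference between the two traversal orders and the depth limit, here is a small self-contained sketch. It does not use NodeTreeWalker or the Node interface: `TreeWalkSketch`, `TreeNode` and `walk` are hypothetical names, and the sketch makes no attempt to detect the cyclic trees mentioned above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of depth-first and breadth-first walks with a depth limit. */
final class TreeWalkSketch {

    /** Minimal tree node used only for this illustration. */
    static final class TreeNode {
        final String name;
        final List<TreeNode> children = new ArrayList<>();

        TreeNode(String name) { this.name = name; }

        TreeNode add(TreeNode child) {
            children.add(child);
            return this;
        }
    }

    /** A node waiting to be visited, together with its depth below the root. */
    private static final class Entry {
        final TreeNode node;
        final int depth;

        Entry(TreeNode node, int depth) {
            this.node = node;
            this.depth = depth;
        }
    }

    /**
     * Lists the names of the root's descendants in visiting order.
     * A negative maxDepth means the depth is not limited.
     */
    static List<String> walk(TreeNode root, boolean depthFirst, int maxDepth) {
        List<String> order = new ArrayList<>();
        Deque<Entry> pending = new ArrayDeque<>();
        pending.addLast(new Entry(root, 0));
        while (!pending.isEmpty()) {
            // A stack gives depth-first order, a queue gives breadth-first order.
            Entry current = depthFirst ? pending.pollLast() : pending.pollFirst();
            if (current.depth > 0) {
                order.add(current.node.name);
            }
            if (maxDepth >= 0 && current.depth >= maxDepth) {
                continue; // do not descend below the requested depth
            }
            List<TreeNode> kids = current.node.children;
            if (depthFirst) {
                // Push in reverse so the first child is popped first.
                for (int i = kids.size() - 1; i >= 0; i--) {
                    pending.addLast(new Entry(kids.get(i), current.depth + 1));
                }
            } else {
                for (TreeNode kid : kids) {
                    pending.addLast(new Entry(kid, current.depth + 1));
                }
            }
        }
        return order;
    }
}
```

With a depth limit of 1 the breadth-first walk returns only the direct children of the root, which corresponds to the "inspect one level of children at a time" use case described in the message.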
From: Arjohn K. <arj...@ad...> - 2005-11-30 14:46:53
|
Dear all,

I'm mailing you to let you know that there's an excellent (open source) tool for analyzing and improving Java code: findbugs. I don't know if any of you know about this tool, but I just fed it the v1.5 code and it found a considerable number of issues, most of which are quite easy to fix. More info about findbugs can be found at:

http://findbugs.sourceforge.net/

It would be great if some of you could give this tool a try and use its findings so that we can have an even better html parser (yes, I'm writing this for my own benefit ;-) ). I guess I could have reported the most serious issues using the issue tracker, but that would have left out a lot of the other potential improvements.

Regards,

Arjohn Kampman |
From: Derrick O. <Der...@Ro...> - 2005-11-11 14:06:26
|
Please welcome Yuking Lie, a graduate student and Nguyen Huu Chon, a 26 year old student from Vietnam, as developers of HTML Parser. |
From: Derrick O. <Der...@Ro...> - 2005-11-11 13:49:53
|
Please welcome Ian Macfarlane. Ian has already submitted improvements to the htmlparser project related to tag hierarchy, new tag types and whitespace in text nodes. He works for Netrank in the UK and brings with him some real world experience in using HTML Parser in a production system. Welcome Ian. |
From: Derrick O. <Der...@Ro...> - 2005-11-04 12:48:31
|
IMHO, the text shouldn't ever be null, but if it is, toHtml() would (should?) return an empty string, so isWhitespace() should also return true. |
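The behaviour converging in this thread (null or empty text counts as white space, and non-breaking spaces are included) might look roughly like the sketch below. `WhitespaceSketch.isWhiteSpace` is a hypothetical helper working on a plain String, not the method that was eventually added to the library, and it assumes character references have already been converted to characters (for instance by the util.Translate class mentioned in the thread). The explicit check for U+00A0 is there because `Character.isWhitespace` returns false for non-breaking spaces.

```java
/** Hypothetical sketch of a white-space test for text node contents. */
final class WhitespaceSketch {

    /**
     * Returns true if the text is null, empty, or consists only of
     * white space characters, counting the non-breaking space (U+00A0).
     */
    static boolean isWhiteSpace(String text) {
        if (text == null) {
            return true; // treated like an empty string, as suggested above
        }
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            // Character.isWhitespace excludes non-breaking spaces, so test U+00A0 explicitly.
            if (!Character.isWhitespace(ch) && ch != '\u00A0') {
                return false;
            }
        }
        return true;
    }
}
```

An undecoded reference in the text is made of ordinary characters, so it is not treated as white space by this sketch; decoding has to happen first.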
From: Ian M. <ian...@gm...> - 2005-11-03 14:57:22
|
> Conversion of character references like &nbsp; is already performed by
> the util.Translate class.

Oh good! No need for me to write it then :)

> There is no &tab; character reference as far as I'm aware

You're right, I just guessed a whitespace entity name, typed it into
Google and found references to it. Sorry, I ought to have checked it
out a bit better first.

Derrick, for an isWhiteSpace() method, what do you think it ought to do
when the String is null?

Ian

On 11/3/05, Derrick Oswald <Der...@ro...> wrote:
> Conversion of character references like &nbsp; is already performed by
> the util.Translate class.
> There is no &tab; character reference as far as I'm aware (see
> http://www.w3.org/TR/REC-html40/sgml/entities.html).
>
> -------------------------------------------------------
> SF.Net email is sponsored by:
> Tame your development challenges with Apache's Geronimo App Server. Download
> it for free - -and be entered to win a 42" plasma tv or your very own
> Sony(tm)PSP. Click here to play: http://sourceforge.net/geronimo.php
> _______________________________________________
> Htmlparser-developer mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-developer
|
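The null question above comes down to three options: return true, return false, or throw. As a minimal sketch (not the library's code; the class name `NullPolicySketch` and the `resultForNull` parameter are hypothetical, introduced only for illustration), the first two options can be expressed by letting the caller state the answer for null explicitly:

```java
final class NullPolicySketch {
    /**
     * Hypothetical helper: reports whether text is empty or consists only of
     * characters that String.trim() strips. For a null argument it returns
     * the caller-supplied resultForNull instead of throwing.
     */
    static boolean isWhiteSpace(String text, boolean resultForNull) {
        if (text == null) {
            // No NullPointerException, so callers need no guard code.
            return resultForNull;
        }
        // trim() returns a new String; the original is left unchanged.
        return text.trim().length() == 0;
    }
}
```

Removing the null check would give the third option, since `text.trim()` throws a NullPointerException on a null reference.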
From: Derrick O. <Der...@Ro...> - 2005-11-03 14:28:06
|
Conversion of character references like &nbsp; is already performed by
the util.Translate class.
There is no &tab; character reference as far as I'm aware (see
http://www.w3.org/TR/REC-html40/sgml/entities.html).

Ian Macfarlane wrote:

>Thanks for your reply,
>
>I wasn't suggesting trimming the actual text of the text nodes
>permanently, merely wondering if using the trim() method to see if the
>resulting string was empty would be sufficient, or whether we should
>also look for various white-space HTML entities (e.g. &tab; also) for
>purposes of determining this.
>
>Now I think about it some more, white space alone is probably what we
>want to do. If we want to get things like &tab; we ought to write some
>sort of method that would replace those types of HTML character
>references with the actual characters, if that's feasible.
>
>The only other question I've got - what do you all think should happen
>if the contents of the text node is null? Should it return true
>(because there's no characters), false (because it's not actually a
>white space String) or throw a NullPointerException (which would
>negate the value of this method by forcing the end-user to write lots
>of code to use this method)? Can a text node ever be null without the
>user changing the text to be null?
>
>Ian
>
>String is immutable so String.trim().equals("") won't change the
>original String object.
|
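To illustrate what "conversion of character references" means here, the following is a rough, self-contained stand-in. It does not call util.Translate and makes no claim about how that class works; `EntityDecodeSketch` and its `decode` method are hypothetical names, and the sketch only handles the nbsp named reference and decimal numeric references, with the trailing semicolon optional.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityDecodeSketch {
    // Decimal numeric references such as &#160; (trailing semicolon optional).
    private static final Pattern NUMERIC = Pattern.compile("&#(\\d{1,7});?");

    /** Hypothetical stand-in decoder, for illustration only. */
    static String decode(String text) {
        // The only named reference handled is nbsp, mapped to U+00A0.
        String named = text.replaceAll("&nbsp;?", "\u00A0");

        Matcher m = NUMERIC.matcher(named);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            // Leave out-of-range references untouched rather than guessing.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Once references are decoded to real characters, a whitespace test only has to look at characters and never at entity syntax.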
From: Ian M. <ian...@gm...> - 2005-11-03 12:44:59
|
Thanks for your reply,

I wasn't suggesting trimming the actual text of the text nodes
permanently, merely wondering if using the trim() method to see if the
resulting string was empty would be sufficient, or whether we should
also look for various white-space HTML entities (e.g. &tab; also) for
purposes of determining this.

Now I think about it some more, white space alone is probably what we
want to do. If we want to get things like &tab; we ought to write some
sort of method that would replace those types of HTML character
references with the actual characters, if that's feasible.

The only other question I've got - what do you all think should happen
if the contents of the text node is null? Should it return true
(because there's no characters), false (because it's not actually a
white space String) or throw a NullPointerException (which would
negate the value of this method by forcing the end-user to write lots
of code to use this method)? Can a text node ever be null without the
user changing the text to be null?

Ian

String is immutable so String.trim().equals("") won't change the
original String object.

On 11/2/05, Axel <ax...@gm...> wrote:
> On 11/1/05, Ian Macfarlane <ian...@gm...> wrote:
> > I was thinking it might be worthwhile adding a method to Text/TextNode
> > along the lines of:
> >
> > boolean isWhiteSpace()
> >
> > Which would return if the TextNode consisted of solely white space
> > characters (or was the empty String).
> >
> > Now this could simply be done using String.trim().equals(""), however
> > that wouldn't account for:
> >
> > - the non-breaking space character (#160)
> > - The HTML code &nbsp; (also &nbsp as Firefox/IE do)
> > - The HTML code &#160; (also &#160 as Firefox/IE do)
> >
> > So my question is, do you think this method should treat those
> > as spaces and remove/ignore them also for purposes of determining if
> > the TextNode is white space? Or should it only trim normal whitespace
> > (space, tab, carriage returns, etc).
> I think, if every character (or entity converted to a
> unicode-character) in the TextNode is true for
> Character#isWhitespace() the boolean isWhiteSpace() should return
> true;
> IMO the TextNode shouldn't be trimmed automatically. Only a special
> function should be allowed to do this.
>
> --
> Axel Kramer
> http://www.plog4u.org - Wikipedia Eclipse Plugin
|
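A minimal sketch of the per-character check suggested above, assuming the text has already had its character references decoded and is not null. The class name `WhitespaceCheck` is hypothetical, and this is not the TextNode implementation. One detail worth noting: in Java, `Character.isWhitespace` explicitly excludes the non-breaking space U+00A0, and `String.trim()` does not strip it either, so the sketch tests for it separately.

```java
final class WhitespaceCheck {
    /**
     * Returns true if text is empty or every character is either Java
     * whitespace or the non-breaking space U+00A0. The argument itself
     * is never modified or trimmed.
     */
    static boolean isWhiteSpace(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Character.isWhitespace is false for U+00A0, so accept it explicitly.
            if (!Character.isWhitespace(c) && c != '\u00A0') {
                return false;
            }
        }
        return true;
    }
}
```

Dropping the explicit U+00A0 comparison gives the narrower behaviour, where only normal whitespace (space, tab, carriage return, and so on) counts.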