Thread: [Htmlparser-developer] Method to check if TextNode is just whitespace
From: Ian M. <ian...@gm...> - 2005-11-01 10:56:44
I was thinking it might be worthwhile adding a method to Text/TextNode along the lines of:

boolean isWhiteSpace()

which would return whether the TextNode consists solely of white space characters (or is the empty String).

Now this could simply be done using String.trim().equals(""), however that wouldn't account for:

- the non-breaking space character (#160)
- the HTML code &nbsp; (also &nbsp without the semicolon, as Firefox/IE accept)
- the HTML code &#160; (also &#160 without the semicolon, as Firefox/IE accept)

So my question is: do you think this method should treat those as spaces and ignore them too when determining whether the TextNode is white space? Or should it only trim normal whitespace (space, tab, carriage return, etc.)?

Thanks for your advice,
Ian Macfarlane
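A small illustration of the limitation described in this message, as a sketch only: the class and method names (TrimDemo, isBlankByTrim) are made up for the example and are not part of the library. String.trim() strips only characters up to U+0020, so the non-breaking space (U+00A0) and an undecoded entity both survive it.

```java
final class TrimDemo {
    /** The naive check from the message: trim, then compare with the empty string. */
    static boolean isBlankByTrim(String text) {
        return text.trim().equals("");
    }

    public static void main(String[] args) {
        System.out.println(isBlankByTrim(" \t\r\n")); // true: ordinary whitespace is trimmed away
        System.out.println(isBlankByTrim("\u00A0"));  // false: trim() leaves U+00A0 in place
        System.out.println(isBlankByTrim("&nbsp;"));  // false: an undecoded entity is ordinary text
    }
}
```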
From: Axel <ax...@gm...> - 2005-11-02 22:07:16
I think that if every character in the TextNode (or every entity, once converted to a Unicode character) returns true for Character#isWhitespace(), then boolean isWhiteSpace() should return true.

IMO the TextNode shouldn't be trimmed automatically. Only a dedicated function should be allowed to do that.

--
Axel Kramer
http://www.plog4u.org - Wikipedia Eclipse Plugin
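The per-character rule suggested here might look roughly like the sketch below. The class and method names are hypothetical, and it works on a plain String rather than on the library's TextNode. One point worth knowing: Character.isWhitespace() deliberately returns false for the non-breaking space U+00A0, so the strict version does not cover the #160 case from the first message. The lenient variant, which is an assumption about what one might want rather than anything proposed in the thread, also accepts Character.isSpaceChar().

```java
final class WhitespaceCheck {
    /** Strict rule: every character must satisfy Character.isWhitespace (vacuously true for ""). */
    static boolean isWhiteSpace(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    /** Lenient rule (hypothetical): also accept Unicode space characters such as U+00A0. */
    static boolean isWhiteSpaceLenient(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!Character.isWhitespace(c) && !Character.isSpaceChar(c)) {
                return false;
            }
        }
        return true;
    }
}
```

Neither method modifies its argument, which matches the point that the node's text should not be trimmed as a side effect.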
From: Ian M. <ian...@gm...> - 2005-11-03 12:44:59
Thanks for your reply,

I wasn't suggesting trimming the actual text of the text nodes permanently, merely wondering whether using the trim() method to see if the resulting string is empty would be sufficient, or whether we should also look for various white-space HTML entities (e.g. &tab;) when determining this.

Now that I think about it some more, checking for white space alone is probably what we want. If we want to catch things like &tab;, we ought to write some sort of method that replaces those HTML character references with the actual characters, if that's feasible.

The only other question I've got: what do you all think should happen if the contents of the text node are null? Should it return true (because there are no characters), false (because it's not actually a white space String), or throw a NullPointerException (which would negate the value of this method by forcing the end user to write lots of code to use it)? Can a text node's text ever be null without the user setting it to null?

Ian

String is immutable, so String.trim().equals("") won't change the original String object.
From: Derrick O. <Der...@Ro...> - 2005-11-03 14:28:06
Conversion of character references like &nbsp; is already performed by the util.Translate class.

There is no &tab; character reference as far as I'm aware (see http://www.w3.org/TR/REC-html40/sgml/entities.html).
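To make the "decode first, then test" idea concrete, here is a self-contained sketch. It does not call the library: decodeNbsp is a hypothetical stand-in that handles only the two references discussed in this thread, whereas real code would presumably delegate to util.Translate as the message says. All names below are assumptions made for the example.

```java
final class DecodeThenCheck {
    /** Hypothetical stand-in for a real decoder; handles only &nbsp; and &#160;. */
    static String decodeNbsp(String text) {
        return text.replace("&nbsp;", "\u00A0").replace("&#160;", "\u00A0");
    }

    /** Decodes the references, then accepts ordinary whitespace plus U+00A0. */
    static boolean isWhiteSpaceAfterDecoding(String text) {
        String decoded = decodeNbsp(text);
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            if (!Character.isWhitespace(c) && c != '\u00A0') {
                return false;
            }
        }
        return true;
    }
}
```

Because &tab; is not a defined reference, the stand-in leaves it untouched and the check treats it as ordinary text.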
From: Ian M. <ian...@gm...> - 2005-11-03 14:57:22
> Conversion of character references like &nbsp; is already performed by the util.Translate class.

Oh good! No need for me to write it then :)

> There is no &tab; character reference as far as I'm aware

You're right, I just guessed a whitespace entity name, typed it into Google and found references to it. Sorry, I ought to have checked it out a bit better first.

Derrick, for an isWhiteSpace() method, what do you think it ought to do when the String is null?

Ian
From: Derrick O. <Der...@Ro...> - 2005-11-04 12:48:31
IMHO, the text shouldn't ever be null, but if it is, toHtml() would (should?) return an empty string, so isWhiteSpace() should also return true.
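The consistency argument in this message can be sketched with a toy node class. SketchTextNode is hypothetical and is not the library's TextNode; in particular, the toHtml() behaviour for null text is the assumption stated in the message ("would (should?)"), not confirmed behaviour. Defining isWhiteSpace() in terms of toHtml() makes the null case fall out as true without a special branch.

```java
final class SketchTextNode {
    private final String text; // may be null in this sketch

    SketchTextNode(String text) {
        this.text = text;
    }

    /** Assumed behaviour: null text is rendered as the empty string. */
    String toHtml() {
        return text == null ? "" : text;
    }

    /** Defined on top of toHtml(), so null text behaves like "" and yields true. */
    boolean isWhiteSpace() {
        String html = toHtml();
        for (int i = 0; i < html.length(); i++) {
            if (!Character.isWhitespace(html.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

With this shape, callers never need a null check or a try/catch before asking whether a node is blank, which was the concern raised about throwing a NullPointerException.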