htmlparser-user Mailing List for HTML Parser (Page 36)
Brought to you by:
derrickoswald
From: Derrick O. <Der...@Ro...> - 2006-06-09 01:41:48
Ian,

Don't you just hate Windows Search? It's completely broken, and it's been that way for half a dozen years. But if you complain it doesn't get you anywhere...

Correct so far.

- The interpretation of the bytes from the input stream follows the META tag once that's encountered. A Java String doesn't really have a charset as far as I can tell; it's Unicode UTF-16 (I may be wrong here). So the answer is: it will have the 'correct' charset, whichever set it last, header or META. The regenerated toHtml() String will have the 'correct' charset because it comes from an array of char, which (as far as I can tell) covers most of the possible charsets. [I say most because there is a move afoot to make chars int32 in size to accommodate many more Chinese glyphs etc., and I'm not sure if that's in Mustang (Java 1.5) or not.] Now if you want to write an array of bytes to disk, or pass a string to another program with 8-bit chars, you need to choose an encoding that can accommodate your charset... whole new ballgame. Bottom line: the encoding only matters when the String is converted to bytes (I think). Check out the 'Save As Unicode' option in Notepad: it doesn't ask for a charset (but then again, it may *know* the charset from user settings), but it does set the encoding (Unicode UTF-8, I think) for the file on disk.

- In a number of places this is exactly the processing used: reset() followed by a reparse; see for example StringBean.setStrings(). The point is that the client *must* rehandle the nodes it was given, usually by starting from scratch (I don't know any other way), because what it was given was erroneous. With a String as input, a reparse won't yield any different characters (they just come from the String via charAt(), and that won't change because the String is immutable), so the reset is redundant, except that the StringSource will have its encoding (member variable) set correctly the second time, so the hiccup won't happen twice.

But the conversion from the byte stream to a String has to have been correct regardless of what the HTTP header says, otherwise you're pimped. So if the HttpClient gives you a String you have to ask:

- did it look at the META tag?
- is the META tag correct?

If it sounds confused, that's because it probably is, in my own mind.

Derrick
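A small illustration of the "encoding only matters when it's converted to bytes" point, using only the standard library. This is a sketch, not anything from HTML Parser itself; the class and method names (`CharsetDemo`, `encodedLength`, `misdecode`) are made up for the example. For what it's worth, `char` stayed 16 bits: since Java 5, characters outside the Basic Multilingual Plane are stored as surrogate pairs, which the last lines of `main` show.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {
    // The String carries no charset; one is chosen only when bytes are produced.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    // Simulates a client decoding UTF-8 bytes with the wrong (header) charset.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // "cafe" with an acute accent on the e

        System.out.println(encodedLength(s, StandardCharsets.UTF_8));      // 5
        System.out.println(encodedLength(s, StandardCharsets.ISO_8859_1)); // 4

        // The damage is done at decode time; the resulting String is
        // "valid" but holds the wrong characters.
        System.out.println(misdecode(s).equals(s)); // false

        // One supplementary character (U+1D11E) occupies two chars.
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());                          // 2
        System.out.println(clef.codePointCount(0, clef.length()));  // 1
    }
}
```

If the bytes were decoded with the wrong charset before the parser ever saw them, no amount of reparsing the String recovers the original characters, which is why the question of what HttpClient did matters.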
From: Ian M. <ian...@gm...> - 2006-06-08 12:28:25
That will teach me to rely on Windows Search. Bleh.

Ok, so if the headers kick the file out as one charset, and then the meta tag states that it is a different one, I assume (based on the W3C recommendations and a quick peek at InputStreamSource) that if the new encoding is compatible (characters parsed so far are the same) it will just reparse the rest of the page with the new charset; otherwise it will throw an EncodingChangeException. Am I right so far?

Now if I walk through these two potential paths:

- If the exception is not thrown, is the parsed document encoded with the charset specified in the headers or in the meta tag? I.e. if I convert it back to a String from a NodeList etc., will it still have the correct charset from the meta tag?

- If the exception is thrown, can I reparse the entire document from the original String, or would I have to go back to the original byte[] to do this?

Thanks,

Ian
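On the "go back to the original byte[]" option: a rough standard-library sketch of what that could look like outside the parser. It decodes with the header charset first, looks for a charset in a META tag, and decodes the raw bytes again if the two disagree. This is not how HTML Parser does it internally; `MetaCharsetRedecode`, `decode`, and the regular expression are assumptions made up for illustration, and the regex is far cruder than a real tag parser.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetRedecode {
    // Crude pattern: finds charset=NAME inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_.:-]*)",
            Pattern.CASE_INSENSITIVE);

    // Decode with the HTTP header charset, then redo the decode from the
    // raw bytes if a META tag names a different, supported charset.
    static String decode(byte[] raw, Charset headerCharset) {
        String firstPass = new String(raw, headerCharset);
        Matcher m = META_CHARSET.matcher(firstPass);
        if (m.find() && Charset.isSupported(m.group(1))) {
            Charset metaCharset = Charset.forName(m.group(1));
            if (!metaCharset.equals(headerCharset)) {
                // Start over from the bytes; the first String may be garbled.
                return new String(raw, metaCharset);
            }
        }
        return firstPass;
    }
}
```

The sketch keeps the bytes around precisely because a String produced with the wrong charset cannot in general be repaired afterwards.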
From: Ian M. <ian...@gm...> - 2006-06-07 22:13:24
The File class in Java has a method that gets you a list of all File objects in a directory. The rest should be easy.

Ian

On 6/7/06, Mark Stark <htm...@ey...> wrote:
> Have you any idea how to pass recursively a list of files in a directory
> to the string bean or any given visitor?
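The method in question is presumably `File.listFiles()`. A minimal recursive walk built on it might look like the sketch below; `HtmlFileWalker` and `collect` are names invented for the example, and what gets done with each file afterwards (StringBean, a visitor, anything else) is left to the caller.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class HtmlFileWalker {
    // Collects every file under dir (recursively) whose name ends with suffix,
    // compared case-insensitively.
    static List<File> collect(File dir, String suffix) {
        List<File> found = new ArrayList<>();
        File[] entries = dir.listFiles(); // null if dir is not a readable directory
        if (entries == null) {
            return found;
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                found.addAll(collect(entry, suffix));
            } else if (entry.getName().toLowerCase().endsWith(suffix)) {
                found.add(entry);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        for (File f : collect(root, ".html")) {
            System.out.println(f.getPath());
        }
    }
}
```

Note the null check: `listFiles()` returns null rather than an empty array when the path is not a directory or cannot be read.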
From: Derrick O. <der...@ro...> - 2006-06-07 21:40:39
It's thrown in org.htmlparser.lexer.InputStreamSource.setEncoding(String).

Ian Macfarlane <ian...@gm...> wrote:
> Derrick,
>
> I can't see anywhere EncodingChangeException is thrown in the code,
> perhaps this is not implemented yet?
>
> Ian

_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
From: Ian M. <ian...@gm...> - 2006-06-07 20:14:24
Derrick,

I can't see anywhere EncodingChangeException is thrown in the code, perhaps this is not implemented yet?

Ian

On 6/5/06, Derrick Oswald <Der...@ro...> wrote:
> Ian,
>
> If you have a String in Java, it's Unicode encoded in UTF-16 - no?
> (the trick of course, is in how it got to be a String, or how the String
> gets saved to a Stream)
> so I don't think you *need* to specify the encoding if you are passing
> in a String.
> Looking at the StringSource.java code, the encoding which may be passed
> in the constructor is just stored as a property.
> It doesn't appear to be used. But if set properly on the constructor it
> would avoid a retrace when the META tag is encountered.
> You would do something like this:
> new Parser (new Lexer (new Page (my_string, my_encoding)))
>
> There is code in MetaTag.doSemanticAction() to set the page encoding
> based on the META tag.
> This mechanism wouldn't do anything under the hood if the input is a
> String (based on the fact the StringSource just stores the encoding).
> But, if the HttpClient incorrectly converted the stream to a String
> based on the HTTP header content type and the META tag actually has the
> correct encoding you have a problem (this is the reason for the
> EncodingChangeException thrown by the parser).
>
> Conversion from the parse tree to a String actually just regurgitates
> the characters read in, so the charset and encoding don't enter into it
> here.
>
> Submitting the String to be parsed again brings up the same issues as
> the first time.
>
> Derrick
>
> Ian Macfarlane wrote:
>
> >I have a few questions regarding the best way to perform multiple
> >parsing to and from HTML stored as a String and HTMLParser parsed
> >(tree) format.
> >
> >1) Firstly, when first parsing (using Parser not Lexer, I need a
> >tree), is there a way to pass it the charset (e.g. UTF-8) that was
> >specified in the HTTP headers? Do I need to do this if it is already
> >encoded correctly? (I'm using Apache HTTPClient which can convert into
> >a Byte[] or a correctly encoded String using the headers found, and
> >I'm using the latter option).
> >
> >2) Once I have done this, I'd want it to be overridden if the Meta
> >http-equiv Content-Type gives me a different one. Can the parser
> >automatically do this? Or do I have to attempt to read it myself?
> >
> >3) Now I've got the body tag, and a charset specified either by the
> >headers or the meta tag (or if none, a sensible default), I want to
> >convert the document back into a String again. Do I need to be
> >concerned about the charset again here, or do the Node/NodeList
> >toString methods handle this?
> >
> >4) Finally, once I have a String that's a product of the above, and I
> >want to again convert it into an HTMLParser tree, do I need to specify
> >the charset again here?
> >
> >Thanks
> >
> >Ian
From: Mark S. <htm...@ey...> - 2006-06-07 12:53:15
Have you any idea how to pass recursively a list of files in a directory to the string bean or any given visitor?

Derrick Oswald wrote:
> If you don't care how many carriage returns are present in the output,
> just output one after processing each tag in visitTag() and visitEndTag().
From: Derrick O. <Der...@Ro...> - 2006-06-07 11:55:54
There was the concept of a suite in JUnit 3.8; maybe it's something like that.

Mark Stark wrote:
>Thank you, this works fine :)
>
>it is not a htmlparser question, but do you know how to run multiple
>TestClasses with JUnit4TestAdapter
>
>return new JUnit4TestAdapter(SegmentFindingVisitorTest.class);
From: Mark S. <htm...@ey...> - 2006-06-07 10:41:30
Thank you, this works fine :)

This is not an htmlparser question, but do you know how to run multiple test classes with JUnit4TestAdapter?

return new JUnit4TestAdapter(SegmentFindingVisitorTest.class);
|
From: Derrick O. <Der...@Ro...> - 2006-06-06 23:26:31
|
If you don't care how many carriage returns are present in the output, just output one after processing each tag in visitTag() and visitEndTag().
|
From: Mark S. <htm...@ey...> - 2006-06-06 12:25:15
|
Thanks Derrick,

I have to add that I've removed the breaksFlow() statement. I add a carriageReturn after every segment (the text between some brackets). I later save it in a file (key - value).

My intention is to extract all strings from a given HTML file, write them into a file, and replace these strings with other values (translation).

The problem is that if "Organisationseinheiten $[weblogEnabled$" is recognized as one connected segment, it is not possible to replace it with the translation in a second run. Is that understandable? :)

P.S.: It is important that the parser can parse the templates with these $$ substitutions.

Thanks a lot
|
From: Derrick O. <Der...@Ro...> - 2006-06-06 11:57:12
|
Mark,

A newline is only inserted in the output if the tag breaks the normal flow of text. The list of tags that do this is from the HTML specification and is encoded in the org.htmlparser.nodes.TagNode class as the breakTags list.

The StringBean processing is driven by the tags that are encountered. If it doesn't see a tag that causes a break, none is emitted.

Since the text $[weblogEnabled$ is outside of any TD tag in a table, an argument could be made that it shouldn't print at all, but if your browser prints something and it inserts a newline, an argument could also be made to change the operation of the StringBean to assume that a break is pending *after* tags that break the flow, and output newlines accordingly. I fear this would cause more problems than it solves though.

Presumably this 'dollar text' will be substituted by some server-side processing into a real <TD>xxxx</TD> section; perhaps the parser should be applied after this processing.

Derrick
|
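A toy model of the behaviour described above may help. It is only a sketch using the JDK, not the StringBean source: the token format, the abbreviated BREAK_TAGS set and the extract() helper are all made up for illustration, and it assumes a break becomes pending only when an opening flow-breaking tag is seen. The second mode mimics the suggestion of emitting a break after every start and end tag.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class FlowBreakSketch {
    // Hypothetical, abbreviated stand-in for the library's list of flow-breaking tags.
    static final Set<String> BREAK_TAGS =
            new HashSet<>(Arrays.asList("td", "th", "p", "div", "br", "li"));

    /**
     * Tokens starting with "<" are opening tags ("<td"), tokens starting
     * with "</" are end tags ("</td"); everything else is text.
     */
    static String extract(List<String> tokens, boolean breakAfterEveryTag) {
        StringBuilder out = new StringBuilder();
        boolean pendingBreak = false;
        for (String token : tokens) {
            if (token.startsWith("<")) {
                boolean endTag = token.startsWith("</");
                String name = token.substring(endTag ? 2 : 1).toLowerCase();
                // Default mode: only an opening flow-breaking tag requests a break.
                if (breakAfterEveryTag || (!endTag && BREAK_TAGS.contains(name))) {
                    pendingBreak = true;
                }
            } else {
                String text = token.trim();
                if (text.isEmpty()) {
                    continue;
                }
                if (out.length() > 0) {
                    out.append(pendingBreak ? '\n' : ' ');
                }
                out.append(text);
                pendingBreak = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "<td", "<strong", "Organisationseinheiten", "</strong", "</td",
                "$[weblogEnabled$", "<td");
        // Default mode: both strings end up on one line, as in the report.
        System.out.println(extract(tokens, false));
        // Break after every tag: the two strings end up on separate lines.
        System.out.println(extract(tokens, true));
    }
}
```

In this model no opening flow-breaking tag sits between </td> and the dollar text, so no break is pending when the text arrives and the two strings are joined.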
From: Mark S. <htm...@ey...> - 2006-06-06 10:59:57
|
I added a System.out before collapsing the string and got the following hint:

Txt (3664[96,78],3672[96,86]): Personen
Txt (3676[96,90],3683[97,2]): \t\t\t\n\t\t
Txt (3688[97,7],3697[98,7]): \n \t
Txt (3712[98,22],3720[100,2]): \n\t\t\n\t\t
Txt (3797[100,79],3805[100,87]): Projekte
Txt (3809[100,91],3816[101,2]): \t\t\t\n\t\t
Txt (3821[101,7],3830[102,7]): \n \t
Txt (3846[102,23],3850[103,2]): \n\t\t
Txt (3858[103,10],3880[103,32]): Organisationseinheiten
Txt (3889[103,41],3900[105,2]): \n\t\t\t\t\t\n\t\t
Txt (3905[105,7],3934[107,7]): \n\t\t$[weblogEnabled$\n \t

The output from these lines after collapse() is:

Personen
Projekte
Organisationseinheiten $[weblogEnabled$

The "failure" (I don't know if it's a failure at all) should be in collapse().
|
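For what it's worth, the last three Txt nodes in the dump can be fed through a simple whitespace collapser to see why they come out as one line. This is a JDK-only sketch, not the library's collapse(): the CollapseSketch class is hypothetical, and it assumes the \t and \n in the dump stand for real tab and newline characters and that no line break is emitted between the nodes.

```java
final class CollapseSketch {
    /**
     * Appends text to buffer, turning every run of whitespace into a single
     * space and emitting no space at the very start or right after a newline.
     */
    static void collapse(StringBuilder buffer, String text) {
        for (char c : text.toCharArray()) {
            if (Character.isWhitespace(c)) {
                int n = buffer.length();
                if (n > 0 && buffer.charAt(n - 1) != ' ' && buffer.charAt(n - 1) != '\n') {
                    buffer.append(' ');
                }
            } else {
                buffer.append(c);
            }
        }
    }

    public static void main(String[] args) {
        // The last three text nodes from the dump above.
        String[] nodes = {
            "Organisationseinheiten",
            "\n\t\t\t\t\t\n\t\t",
            "\n\t\t$[weblogEnabled$\n \t"
        };
        StringBuilder buffer = new StringBuilder();
        for (String node : nodes) {
            collapse(buffer, node);
        }
        // prints: Organisationseinheiten $[weblogEnabled$
        System.out.println(buffer.toString().trim());
    }
}
```

Under those assumptions the collapser behaves as intended: the whitespace-only node contributes a single space, so the joined line reflects a missing line break between the nodes and not a fault in the collapsing step.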
From: Mark S. <htm...@ey...> - 2006-06-06 10:30:03
|
Hi,

I'm using StringBean to extract strings from a given HTML source. This code causes htmlparser to recognize only one connected string:

<td class="yes">
    <strong>Organisationseinheiten</strong>
</td>
    $[weblogEnabled$
<td class="no">

returned: Organisationseinheiten $[weblogEnabled$

But it should be

Organisationseinheiten

$[weblogEnabled$

Can someone give me a hint as to which part of StringBean causes this?

Thanks a lot
|
From: Derrick O. <Der...@Ro...> - 2006-06-06 01:48:33
|
Ian,

If you have a String in Java, it's Unicode encoded in UTF-16 - no? (The trick, of course, is in how it got to be a String, or how the String gets saved to a Stream.) So I don't think you *need* to specify the encoding if you are passing in a String.

Looking at the StringSource.java code, the encoding which may be passed in the constructor is just stored as a property. It doesn't appear to be used. But if set properly on the constructor it would avoid a retrace when the META tag is encountered. You would do something like this:

new Parser (new Lexer (new Page (my_string, my_encoding)))

There is code in MetaTag.doSemanticAction() to set the page encoding based on the META tag. This mechanism wouldn't do anything under the hood if the input is a String (based on the fact that the StringSource just stores the encoding). But if the HttpClient incorrectly converted the stream to a String based on the HTTP header content type, and the META tag actually has the correct encoding, you have a problem (this is the reason for the EncodingChangeException thrown by the parser).

Conversion from the parse tree to a String actually just regurgitates the characters read in, so the charset and encoding don't enter into it here.

Submitting the String to be parsed again brings up the same issues as the first time.

Derrick
|
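The point that the charset only matters at the moment bytes become a String can be shown with the JDK alone. The sketch below does not touch htmlparser or HttpClient; the EncodingSketch class and the sample word are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSketch {
    /** Decodes raw bytes (e.g. an HTTP body) with the given charset. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "Zürich" as it would travel over the wire in UTF-8 (7 bytes).
        byte[] wire = "Z\u00fcrich".getBytes(StandardCharsets.UTF_8);

        // Correct charset: six chars, the u-umlaut survives.
        String right = decode(wire, StandardCharsets.UTF_8);
        // Wrong charset: seven chars, the two UTF-8 bytes become two characters.
        String wrong = decode(wire, StandardCharsets.ISO_8859_1);

        System.out.println(right.length() + " " + right);
        System.out.println(wrong.length() + " " + wrong);
    }
}
```

Once the String exists, the damage (if any) is already done. Handing either String to a parser again needs no charset, but it also cannot repair a wrong decode, which is the situation where the header and the META tag disagree.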
From: Ian M. <ian...@gm...> - 2006-06-06 01:46:32
|
(reposting as it doesn't seem to have gone through the first time)

I have a few questions regarding the best way to perform multiple parsing to and from HTML stored as a String and HTMLParser parsed (tree) format.

1) Firstly, when first parsing (using Parser not Lexer, I need a tree), is there a way to pass it the charset (e.g. UTF-8) that was specified in the HTTP headers? Do I need to do this if it is already encoded correctly? (I'm using Apache HTTPClient which can convert into a Byte[] or a correctly encoded String using the headers found, and I'm using the latter option).

2) Once I have done this, I'd want it to be overridden if the Meta http-equiv Content-Type gives me a different one. Can the parser automatically do this? Or do I have to attempt to read it myself?

3) Now I've got the body tag, and a charset specified either by the headers or the meta tag (or if none, a sensible default), I want to convert the document back into a String again. Do I need to be concerned about the charset again here, or do the Node/NodeList toString methods handle this?

4) Finally, once I have a String that's a product of the above, and I want to again convert it into an HTMLParser tree, do I need to specify the charset again here?

Thanks

Ian
|
From: Ian M. <ian...@gm...> - 2006-06-06 01:46:17
|
NodeTreeWalker lets you choose depth-first or breadth-first iteration, but looking at the code, off the top of my head, parsing that code should lead to row 1 being reached first in both situations.

Ian
|
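To illustrate the claim about traversal order, here is a small JDK-only model of the sample table. It is not NodeTreeWalker itself: the Node class, the rNcM labels and the helper methods are hypothetical, and the tree is simplified to table, rows and cells.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class TraversalOrderSketch {
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label, Node... kids) {
            this.label = label;
            children.addAll(Arrays.asList(kids));
        }
    }

    /** Pre-order depth-first walk: a node first, then its children left to right. */
    static List<String> depthFirst(Node root) {
        List<String> out = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            out.add(n.label);
            // Push in reverse so the leftmost child is popped first.
            for (int i = n.children.size() - 1; i >= 0; i--) {
                stack.push(n.children.get(i));
            }
        }
        return out;
    }

    /** Breadth-first walk: level by level, left to right. */
    static List<String> breadthFirst(Node root) {
        List<String> out = new ArrayList<>();
        Deque<Node> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Node n = queue.remove();
            out.add(n.label);
            queue.addAll(n.children);
        }
        return out;
    }

    static Node row(int r, String... cells) {
        Node tr = new Node("tr" + r);
        for (int c = 0; c < cells.length; c++) {
            tr.children.add(new Node("r" + r + "c" + (c + 1) + "=" + cells[c]));
        }
        return tr;
    }

    /** The three-row table from the question. */
    static Node sampleTable() {
        return new Node("table",
                row(1, "AAA", "BBB", "CCC"),
                row(2, "BBB", "CCC", "DDD"),
                row(3, "AAA", "BBB", "CCC"));
    }

    /** Returns the first visited cell label holding the given word, or null. */
    static String firstMatch(List<String> labels, String word) {
        for (String label : labels) {
            if (label.endsWith("=" + word)) {
                return label;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Node table = sampleTable();
        System.out.println(firstMatch(depthFirst(table), "BBB"));   // r1c2=BBB
        System.out.println(firstMatch(breadthFirst(table), "BBB")); // r1c2=BBB
    }
}
```

Because all cells sit at the same depth, both walks reach the cells in document order, so the 'BBB' in row 1 is found before the one in row 2. If a real parse reports otherwise, the tree it produced is probably not shaped like this one.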
From: Jay K. <jy...@eq...> - 2006-06-02 22:19:36
|
Derrick,

I ran into another issue while finding the location of a specific word. It happened when I tested with a table. For example, here is the source of the sample HTML:

<HTML>
<head>
<title>Test HTML </title>
</head>
<body>
<table border=1>
<tr>
<td>AAA</td>
<td>BBB</td>
<td>CCC</td>
</tr>
<tr>
<td>BBB</td>
<td>CCC</td>
<td>DDD</td>
</tr>
<tr>
<td>AAA</td>
<td>BBB</td>
<td>CCC</td>
</tr>
</table>
</body>
</HTML>

And if I load it in a browser, it'll look like this (with borders):

AAA BBB CCC
BBB CCC DDD
AAA BBB CCC

So, if I select 'BBB' in (row[2], col[1]) in IE and get the word count, it'll return 2 because it counts 'BBB' in (row[1], col[2]) first. But htmlparser traverses the nodes differently - it seems like it detects 'BBB' in (row[2], col[1]) first, before it detects the one in row[1].

Is there any way to configure the parser to look into the first row first (or top-down in the view)?

Please let me know if anything is not clear to you.

Thanks,

Jay
|
From: Ian M. <ian...@gm...> - 2006-06-02 19:11:19
|
I have a few questions regarding the best way to perform multiple parsing to and from HTML stored as a String and HTMLParser parsed (tree) format.

1) Firstly, when first parsing (using Parser not Lexer, I need a tree), is there a way to pass it the charset (e.g. UTF-8) that was specified in the HTTP headers? Do I need to do this if it is already encoded correctly? (I'm using Apache HTTPClient which can convert into a Byte[] or a correctly encoded String using the headers found, and I'm using the latter option).

2) Once I have done this, I'd want it to be overridden if the Meta http-equiv Content-Type gives me a different one. Can the parser automatically do this? Or do I have to attempt to read it myself?

3) Now I've got the body tag, and a charset specified either by the headers or the meta tag (or if none, a sensible default), I want to convert the document back into a String again. Do I need to be concerned about the charset again here, or do the Node/NodeList toString methods handle this?

4) Finally, once I have a String that's a product of the above, and I want to again convert it into an HTMLParser tree, do I need to specify the charset again here?

Thanks

Ian
|
From: Derrick O. <Der...@Ro...> - 2006-06-02 11:53:50
|
You'll need to manipulate the children NodeList of the parent of the node you want to tag:

    NodeList siblings = text_node_with_the_text.getParent().getChildren();

You'll need to change the text of the original node to have only the text up to the insertion, then add the <a> and </a> nodes and another text node with the rest of the text. |
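The split described above can be illustrated without the library. The sketch below models the sibling list as plain strings rather than HTMLParser nodes, so `AnchorSplitSketch` and `insertAnchor` are hypothetical names and nothing here is the library's API; it only shows the order of the four pieces that replace the original text node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AnchorSplitSketch {
    /**
     * Replaces the text item at textIndex with four items: the text before
     * the offset, an opening anchor, a closing anchor, and the remaining text.
     */
    static List<String> insertAnchor(List<String> siblings, int textIndex, int offset, String name) {
        String text = siblings.get(textIndex);
        List<String> result = new ArrayList<>(siblings.subList(0, textIndex));
        result.add(text.substring(0, offset));    // original node, truncated at the insertion point
        result.add("<a name=\"" + name + "\">");  // new opening tag
        result.add("</a>");                       // new closing tag
        result.add(text.substring(offset));       // new text node holding the rest
        result.addAll(siblings.subList(textIndex + 1, siblings.size()));
        return result;
    }

    public static void main(String[] args) {
        List<String> siblings = Arrays.asList("<p>", "EEE FFF GGG HHH");
        int offset = siblings.get(1).indexOf("GGG");
        System.out.println(String.join("", insertAnchor(siblings, 1, offset, "mytag")));
    }
}
```

With real nodes the same four steps would be applied to the parent's child list; which NodeList methods to use for that is left to the library documentation.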
From: Jay K. <jy...@eq...> - 2006-06-01 23:39:04
|
Derrick,

Thanks for your comments. I still have to experiment with different
files to see what's going on with the start position.
Assuming that I can get the correct position/offset for the specific
word and store that position information, the next step is to
create an HTML tag at that position. For example,

Original source:

<html>
<head>
<title>test</title>
</head>
<body>
<h1>this is test</h1>
<p>AAA BBB CCC DDD
<p>EEE FFF GGG HHH
...
</body>
</html>

Let's say the search word is "GGG" and its location has been identified;
I then need to create the following HTML:

<html>
<head>
<title>test</title>
</head>
<body>
<h1>this is test</h1>
<p>AAA BBB CCC DDD
<p>EEE FFF <a name="mytag"></a>GGG HHH
...
</body>
</html>

I've tried StringBean to achieve this by overriding visitTag,
visitStringNode, etc., but I don't know if it's the best way.
Once you know the word position, you shouldn't have to go through
each node using a Visitor, right?
Also, I want to preserve the original HTML format as much as possible.
Please let me know what would be the best way to generate modified HTML
by inserting some custom tags at the pre-selected locations.

As always, thank you very much for your kind help,

Jay |
From: Derrick O. <Der...@Ro...> - 2006-06-01 11:56:39
|
Jay,

Your count may be off because the parser may be fetching a different page from the one you counted.
HTTP servers may change the page based on the user agent.
It's only really reliable from a file, unless you save the contents of the page the parser is working with (see Page.getText()).
And, yes, \r\n is turned into a single \n in the Text node, but the node positions don't count this.
The Page class has getRow() and getColumn() so you can compare with the numbers reported by a text editor, which saves manual counting. Note that these are zero-based, not one-based like most editors.

Your second problem is really up to you, the programmer, to remember which nodes the strings came from.
The string offset is only relative to the node position, which is absolute on the page.
If I were you I would create an index of node position and string position as you form the text in visitStringNode.

Derrick |
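The "index of node position and string position" suggested above might look something like the following. It is a minimal sketch in plain Java with no HTMLParser classes: `OffsetIndex`, `append` and `toPageOffset` are hypothetical names, and the page offsets passed to `append` stand in for whatever each node's start position reports. It also assumes the appended text is character-for-character what appears on the page, which, per the note about \r\n above (and any entity decoding), is not always true.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetIndex {
    private final StringBuilder buffer = new StringBuilder();
    // Each entry is {offset in buffer, absolute offset on the page}.
    private final List<int[]> segments = new ArrayList<>();

    /** Records where this piece of text starts on the page, then appends it. */
    void append(int pageStart, String text) {
        segments.add(new int[] { buffer.length(), pageStart });
        buffer.append(text);
    }

    String text() {
        return buffer.toString();
    }

    /** Maps an offset in the accumulated text back to an absolute page offset. */
    int toPageOffset(int bufferOffset) {
        int[] hit = null;
        for (int[] s : segments) {
            if (s[0] <= bufferOffset) {
                hit = s;      // last segment starting at or before the offset
            } else {
                break;
            }
        }
        if (hit == null) {
            throw new IllegalArgumentException("offset precedes all recorded text");
        }
        return hit[1] + (bufferOffset - hit[0]);
    }

    public static void main(String[] args) {
        // Page: <p>AAA<b>AA</b> BBB</p>  (text pieces start at 3, 9 and 15)
        OffsetIndex index = new OffsetIndex();
        index.append(3, "AAA");
        index.append(9, "AA");
        index.append(15, " BBB");
        int found = index.text().indexOf("BBB");
        System.out.println(index.toPageOffset(found));
    }
}
```

Searching the accumulated text and mapping the hit back through the index avoids having to remember by hand which node each string came from.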
From: Jay K. <jy...@eq...> - 2006-06-01 02:16:40
|
Hi Derrick,

Thanks very much for your help. I've tried your sample code, and it
gives me the right text that I can compare with.
But I have a couple of issues getting the offset of the word I'm searching for.

1. When I try Text.getStartPosition(), it doesn't match the
character count that I get from the HTML source file - yeah, I counted
one by one myself. It's about 15 characters off. For example, the
character count that I got from the parser was 154, as opposed to 139
that I counted from the file.
The numbers are still off even if I include/exclude newline characters.
Are there some other factors that I'm not aware of?

2. After I find the node that contains the word (string) that I'm
searching for, I need to get the offset of that word. For example,
  Node text = AAA BBB CCC DDD BBB EEE
And, if the word that I'm searching for is the second 'BBB', is there
any reliable way to get the offset of that word? (I can't just get the
index from that string because the HTML string could be different.)
Please let me know.

Thanks,

Jay |
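For the second question, the offset of the n-th occurrence inside a single node's text can be found with repeated `String.indexOf` calls. This is a plain-Java sketch; `NthOccurrence` and `indexOfNth` are hypothetical names. The result is relative to the node's text, so it would still have to be added to the node's start position, and it is only as accurate as the match between the node text and the raw page source.

```java
final class NthOccurrence {
    /** Returns the index of the n-th (1-based) occurrence of word in text, or -1. */
    static int indexOfNth(String text, String word, int n) {
        int from = 0;
        int idx = -1;
        for (int i = 0; i < n; i++) {
            idx = text.indexOf(word, from);
            if (idx < 0) {
                return -1;                 // fewer than n occurrences
            }
            from = idx + word.length();    // continue after this match
        }
        return idx;
    }

    public static void main(String[] args) {
        String nodeText = "AAA BBB CCC DDD BBB EEE";
        System.out.println(indexOfNth(nodeText, "BBB", 2));
    }
}
```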
From: lu d. <dom...@gm...> - 2006-05-31 10:39:17
|
From: Derrick O. <Der...@Ro...> - 2006-05-30 22:16:00
|
You probably want to override visitStringNode (Text string) in the StringBean like you've done, but you'll need to be smarter about it. For example, keep track of where you are (in whitespace or not), perhaps by looking at the last character in the StringBuffer and the first character of the incoming text (the default behaviour is to just slap them together - see below). You'll also need to parse the incoming text to break it into words.

Each node has a getStartPosition () method that will tell you where you are in the HTML page in units of characters.

    /**
     * Appends the text to the output.
     * @param string The text node.
     */
    public void visitStringNode (Text string)
    {
        if (!mIsScript && !mIsStyle)
        {
            String text = string.getText ();
            if (!mIsPre)
            {
                text = Translate.decode (text);
                if (getReplaceNonBreakingSpaces ())
                    text = text.replace ('\u00a0', ' ');
                if (getCollapse ())
                    collapse (mBuffer, text);
                else
                    mBuffer.append (text);
            }
            else
                mBuffer.append (text);
        }
    } |
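The whitespace tracking suggested above can be sketched as a small accumulator that only ends a word when it sees whitespace, so a word split across two text nodes (for example by an inline font tag) is rejoined. This is plain Java, not a StringBean subclass; `WordCollector`, `visitText` and `finish` are hypothetical names. It deliberately ignores the fact that block-level tags such as `<p>` or `<br>` should also end a word, which a real visitor would need to handle.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class WordCollector {
    private final List<String> words = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();

    /** Feed one text fragment, in document order. */
    void visitText(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                flush();               // whitespace ends the current word
            } else {
                current.append(c);
            }
        }
        // No flush at the end: the word may continue in the next fragment.
    }

    /** Ends any pending word and returns all words seen so far. */
    List<String> finish() {
        flush();
        return words;
    }

    private void flush() {
        if (current.length() > 0) {
            words.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        WordCollector collector = new WordCollector();
        // The three text fragments of the sample paragraph, split by the font tag.
        collector.visitText("AAAAA BBBBB AAA");
        collector.visitText("AA");
        collector.visitText(" BBBBB AAAAA");
        System.out.println(Collections.frequency(collector.finish(), "AAAAA"));
    }
}
```

Keeping the pending word in its own buffer is one way to get the "last character" check for free: the word is only committed once whitespace, or the end of input, is actually seen.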
From: Jay K. <jy...@eq...> - 2006-05-30 20:15:49
|
Let me describe the problems with using StringBean as a NodeVisitor in more detail.
Here is my code snippet:

    private class TestVisitor extends StringBean {
        @Override
        public void visitStringNode(Text text) {
            System.out.println("text=" + text.getText());
        }
    }

    TestVisitor visitor = new TestVisitor();
    visitor.setCollapse(false);
    htmlParser.visitAllNodesWith(visitor);

If I feed it the sample HTML below, the visitStringNode() method does
not detect the second 'AAAAA' as one word; instead, it splits it into
two words ('AAA' and 'AA'), which is basically the same problem that I
described in the first email.
Please let me know.
Thanks,

Jay

-----Original Message-----
From: htm...@li...
[mailto:htm...@li...] On Behalf Of Jay Kim
Sent: Tuesday, May 30, 2006 10:45 AM
To: htm...@li...
Subject: [Htmlparser-user] RE: [Htmlparser-user] Finding a whole word

Derrick,

Thank you so much for your quick response, and for getting back to me with
the solution.
Now that I'm able to count the number of words that appear in an HTML file
correctly, my next task is to find the offset (start position) of
each word. I'm guessing that I probably have to use NodeVisitor with
StringBean, but I'd like to get some guidelines before I dig into the
APIs.
So, for the following sample HTML:

<HTML>
<head>
<title>Test HTML</title>
</head>
<body>
<p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
</body>
</HTML>

If I search for 'AAAAA', I want to get three matches with their starting
positions (offsets), such as:
  Match 1 offset = 58
  Match 2 offset = 70
  Match 3 offset = 108

Could you show me how to achieve this?
Thanks a lot,

Jay

-----Original Message-----
From: htm...@li...
[mailto:htm...@li...] On Behalf Of Derrick Oswald
Sent: Monday, May 29, 2006 4:45 AM
To: htm...@li...
Subject: Re: [Htmlparser-user] Finding a whole word

Jay
The text you want can be obtained with the StringBean if Collapse is
false.

When collapse is true, there is a bug in the StringBean.
I've logged this as bug #1496863 StringBean collapse() adds extra whitespace
<http://sourceforge.net/tracker/index.php?func=detail&aid=1496863&group_id=24399&atid=381399>
so you can track it.
Derrick

Jay Kim wrote:

> Hi,
>
> I'm trying to get the word count using htmlparser, but it doesn't seem
> to be able to handle the following example.
>
> Let's say the source html looks like this:
>
> <HTML>
> <head>
> <title>Test HTML</title>
> </head>
> <body>
> <p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
> </body>
> </HTML>
>
> And, if you load it in a browser, you'll see the word 'AAAAA' three
> times.
>
> But, if you parse this html, it returns the following nodes:
>
> AAAAA BBBBB AAA AA BBBBB AAAAA
>
> So, it breaks down the second 'AAAAA' into two words because of the
> font tag in the middle. And, the word count from the parsed text would
> be "2".
>
> Is there any way that I can get the same text/string/word that I see
> in the browser?
>
> Thanks,
>
> Jay |