htmlparser-user Mailing List for HTML Parser (Page 41)
Brought to you by:
derrickoswald
From: Riaz u. <ru...@ya...> - 2006-03-16 13:18:00
The error is occurring at this statement:

    nodelistOfLinks.add(listOfSpanTags.elementAt(i).getParent());

Ian Macfarlane <ian...@gm...> wrote:
> The stack trace will tell you which line the NullPointerException was
> thrown on. Why don't you tell us which line it's occurring on? That
> will help pin it down.
>
> Ian
>
> On 3/15/06, Riaz uddin wrote:
> > Hi,
> > I have attached the procedure below, now when I call this procedure it
> > returns a null pointer exception in the add method. It was working fine
> > when I had it in the main function, but it does not run when I created
> > this procedure, I think I need some java help on this, can someone
> > suggest what I can do?
> >
> > <snip>
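A possible explanation, offered as a sketch rather than a confirmed diagnosis: in the posted procedure `nodelistOfLinks` is declared as `null` and never assigned, so the first `add` call has nothing to operate on. The counter `i` is also reused by the second loop without being reset, so that loop would likely print nothing even once the list exists. The stand-in below uses `java.util.List<String>` in place of `NodeList` so that it runs without the library; the class name, method name, and the `parent-of-span-N` strings are hypothetical. In the real procedure the corresponding change would presumably be to start from `new NodeList()`.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in sketch: List<String> plays the role of NodeList and the strings
// play the role of span tags. All names here are hypothetical.
class SpanLinkSketch {

    static List<String> extractMatching(List<String> spanTexts, String wanted) {
        // Create the result list up front. Leaving it null is what makes
        // the first add() call throw a NullPointerException.
        List<String> links = new ArrayList<>();

        for (int i = 0; i < spanTexts.size(); i++) {
            if (spanTexts.get(i).equals(wanted)) {
                links.add("parent-of-span-" + i);
            }
        }

        // A fresh counter for the second loop. Reusing the first loop's
        // counter would leave it already at or past the end of this list.
        for (int j = 0; j < links.size(); j++) {
            System.out.println(links.get(j));
        }
        return links;
    }

    public static void main(String[] args) {
        List<String> spans = List.of(
                "span class=recenttimedate",
                "span class=other",
                "span class=recenttimedate");
        extractMatching(spans, "span class=recenttimedate");
    }
}
```

Using `for` loops with their own counters keeps the two passes independent, which is the part most easily lost when code is moved out of `main` into a helper.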
From: Ian M. <ian...@gm...> - 2006-03-15 21:07:29
The stack trace will tell you which line the NullPointerException was thrown on. Why don't you tell us which line it's occurring on? That will help pin it down.

Ian

On 3/15/06, Riaz uddin <ru...@ya...> wrote:
> Hi,
> I have attached the procedure below, now when I call this procedure it
> returns a null pointer exception in the add method. It was working fine
> when I had it in the main function, but it does not run when I created
> this procedure, I think I need some java help on this, can someone
> suggest what I can do?
>
> <snip>
From: Riaz u. <ru...@ya...> - 2006-03-15 17:26:03
Hi,

I have attached the procedure below. When I call this procedure it throws a null pointer exception in the add method. It was working fine when I had it in the main function, but it does not run now that I have moved it into this procedure. I think I need some Java help on this; can someone suggest what I can do?

    public static NodeList extractLinkFromSpanTag(String url) throws ParserException
    {
        int i = 0;

        NodeList nodelistOfLinks = null;
        Parser parser = new Parser(url);
        // Step 2. Collecting Tags in a list.
        NodeList list = parser.parse (null);

        // news links are at the span tag (time), spanList stores the span tags
        // Step 3. Keep only the SPAN tags in spanList.
        NodeList listOfSpanTags = list.extractAllNodesThatMatch(new TagNameFilter ("SPAN"), true);

        while(i < listOfSpanTags.size())
        { // Beginning While loop to extract links
            Span spanTag = (Span)listOfSpanTags.elementAt(i);
            // System.out.println(listOfSpanTags.size());
            // We only need SPAN tags with attribute "class = 'recenttimedate'"
            // Move to the link in the span tag
            if(spanTag.getText().equals("span class=recenttimedate"))
                nodelistOfLinks.add(listOfSpanTags.elementAt(i).getParent());
            i++;
        }// End of while loop to extract links
        while(i < nodelistOfLinks.size())
        {
            System.out.println(nodelistOfLinks.elementAt(i));
            i++;
        }

        return nodelistOfLinks;
    }
From: <abh...@hs...> - 2006-03-07 13:29:20
Hi,

I am going through the HtmlParser classes to develop a utility which reads HTML from a Java program.

My HTML doc has info like this:

    <H2>My Name</H2><H3>Address</H3>
    <P>It is not useful</P><H3>Age</H3>
    <P>It is important</P>

I have to read the content between <H1></H1>, <H2></H2> and the corresponding <P></P> tags. (I was not able to make much headway reading the HTML Parser code.)

How do I do this, or how do I get started?

Thanks in advance
Abhijeet

HSBC Software Development (India) Pvt Ltd
HSBC Center Riverside, West Avenue,
25 B Kalyani Nagar Pune 411 006 INDIA
Telephone: +91 20 26683000
Fax: +91 20 26681030
From: Derrick O. <Der...@Ro...> - 2006-03-07 12:36:36
Filter the full node list:

    NodeList nl = parser.parse (null);
    NodeList list = nl.extractAllNodesThatMatch (filter);
    NodeList list2 = nl.extractAllNodesThatMatch (filter2);

Antony Sequeira wrote:
> Hi
>
> My first task was to extract links from pages.
> I looked at the example and tried the following:
>
>     NodeFilter filter = new NodeClassFilter (LinkTag.class);
>     NodeList list = parser.extractAllNodesThatMatch (filter);
>     log("links follow:");
>     for (int i = 0; i < list.size (); i++)
>         log (list.elementAt (i).toHtml ());
>
> This works fine if the parser was just constructed before running this code.
>
> On the other hand, if I put the following code before the code above
>
>     NodeList nl = parser.parse (null);
>     log(nl.asString());
>
> I get nothing for the links.
>
> How do I structure my code when I want to do multiple things while
> parsing a page?
> For example, I want to extract links, I want to extract forms and form
> fields, I want to extract text.
>
> -Antony Sequeira
From: Antony S. <ant...@gm...> - 2006-03-07 04:23:24
Hi

My first task was to extract links from pages.
I looked at the example and tried the following:

    NodeFilter filter = new NodeClassFilter (LinkTag.class);
    NodeList list = parser.extractAllNodesThatMatch (filter);
    log("links follow:");
    for (int i = 0; i < list.size (); i++)
        log (list.elementAt (i).toHtml ());

This works fine if the parser was just constructed before running this code.

On the other hand, if I put the following code before the code above

    NodeList nl = parser.parse (null);
    log(nl.asString());

I get nothing for the links.

How do I structure my code when I want to do multiple things while parsing a page? For example, I want to extract links, I want to extract forms and form fields, I want to extract text.

-Antony Sequeira
From: Antony S. <ant...@gm...> - 2006-03-07 04:08:14
Thank you.
I will use your suggested approach if my current approach does not work out.

Currently I have come up with a means of providing a URLConnection backed by a byte array (instead of a TCP connection) and using that connection to construct the parser object. I have attached the code file. It is ugly and very specific to my current experimentation.

I use it like this:

    URL urlob = ByteBufferURL.fromByteArray(new URL("http://original url string so relative links get resolved right"),byetarray,bytecontentlenght);
    Parser parser = new Parser(urlob.openConnection());

This does not result in any network activity such as resolving or connecting (at least in my limited testing), as desired. The advantage IMO is that it keeps the rest of the code simple (hopefully).

I am responding since this may be useful to Luís Gomes.

I have other unrelated questions that I'll ask in a separate thread.

Thanks for the pointers.

-Antony

On 3/4/06, Derrick Oswald <Der...@ro...> wrote:
> Luís,
>
> I believe what you want to do is possible with the current API.
>
>     Page page = new Page (new InputStreamSource (input, charset));
>     page.setUrl (url);
>     Parser parser = new Parser (new Lexer (page));
>
> You would use the HTTP headers to figure out if it's gzipped (and use a
> GZIPInputStream) and determine the charset yourself.
>
> Derrick
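The attached file is not part of the archive, so the following is only a guess at the general shape of such a class, not Antony's code. It sketches a `URLConnection` subclass that serves previously captured bytes and never opens a socket, while still reporting the original URL so that relative links can be resolved against it. The class name `ByteArrayURLConnection`, the `forUrl` factory, and the idea of passing the content type in by hand are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical sketch: a URLConnection that serves stored bytes instead of
// talking to the network. Not the class attached to the original message.
class ByteArrayURLConnection extends URLConnection {
    private final byte[] content;
    private final String contentType;

    ByteArrayURLConnection(URL originalUrl, byte[] content, String contentType) {
        super(originalUrl); // getURL() will report the original page URL
        this.content = content.clone();
        this.contentType = contentType;
    }

    // Convenience factory that turns a bad URL string into an unchecked error.
    static ByteArrayURLConnection forUrl(String originalUrl, byte[] content, String contentType) {
        try {
            return new ByteArrayURLConnection(URI.create(originalUrl).toURL(), content, contentType);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    @Override
    public void connect() {
        connected = true; // there is nothing to connect to
    }

    @Override
    public ByteArrayInputStream getInputStream() {
        return new ByteArrayInputStream(content); // a fresh stream on every call
    }

    @Override
    public String getContentType() {
        return contentType; // e.g. "text/html; charset=UTF-8"
    }

    @Override
    public int getContentLength() {
        return content.length;
    }
}
```

An instance built this way could then be handed to the `Parser` constructor that accepts a `URLConnection`, as in the usage shown in the message. Whether the parser consults further header fields on the connection is not something this sketch accounts for.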
From: Ian M. <ian...@gm...> - 2006-03-06 12:47:20
May I also suggest you have a look at the NodeTreeWalker class in CVS? It lets you navigate a Node tree iteratively in breadth-first or depth-first fashion.

Ian

On 05/03/06, Konstantine <lis...@gm...> wrote:
> On 3/3/06, Derrick Oswald <Der...@ro...> wrote:
> > I would suggest trying the FilterBuilder utility.
> > You'll want things like TagNameFilter to get the <H3> and
> > HasParent/HasChild/HasSibling filters to navigate around the node tree.
> <snip>
>
> thanks for the reply and your time. I was going through API, it's
> pretty cool, although documentation is somewhat lacking.
From: Konstantine <lis...@gm...> - 2006-03-05 08:52:24
On 3/3/06, Derrick Oswald <Der...@ro...> wrote:
> I would suggest trying the FilterBuilder utility.
> You'll want things like TagNameFilter to get the <H3> and
> HasParent/HasChild/HasSibling filters to navigate around the node tree.
<snip>

Thanks for the reply and your time. I was going through the API; it's pretty cool, although the documentation is somewhat lacking.
From: Derrick O. <Der...@Ro...> - 2006-03-04 12:12:32
Luís,

I believe what you want to do is possible with the current API.

    Page page = new Page (new InputStreamSource (input, charset));
    page.setUrl (url);
    Parser parser = new Parser (new Lexer (page));

You would use the HTTP headers to figure out if it's gzipped (and use a GZIPInputStream) and determine the charset yourself.

Derrick

Luís Manuel dos Santos Gomes wrote:
> Hi,
> <snip>
>
> By the way, I too have a related question for the developers:
>
> I want to decouple the HTMLParser from the URLConnection where the
> network IO is done.
> I still want the parser to resolve links against the original URL of
> the page and to use the HTTP headers to parse the data (gunzipping
> data and charset decoding).
>
> I think that the available constructors for Parser don't allow this
> decoupling in a straightforward fashion and without losing some of
> these features.
>
> My current solution is to extend URLConnection and then use that
> object to feed the parser.
>
> A, perhaps cleaner, solution would be to have a constructor taking
> three args:
>     URL (for link resolving)
>     InputStream for the data
>     HTTP headers
>
> The HTTP headers could be as returned from
> URLConnection.getHeaderFields() for interoperability:
>     public Map<String,List<String>> getHeaderFields();
> Returns an unmodifiable Map of the header fields. The Map keys are
> Strings that represent the response-header field names. Each Map
> value is an unmodifiable List of Strings that represents the
> corresponding field values.
>
> The signature of the constructor I'm proposing is:
>     public Parser(String url, InputStream input, Map<String,List<String>> httpHeaders);
>
> I will proceed with extending URLConnection and feeding it into the
> Parser with the setter setConnection() (I reuse the Parser to parse
> several documents)
> while no better solution is in my knowledge.
>
> Best Regards
>
> Luís Gomes
>
> On Mar 4, 2006, at 1:51 AM, Antony Sequeira wrote:
> > <snip>
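The "figure it out from the HTTP headers yourself" step might look roughly like the sketch below, which uses only the JDK. It takes headers in the `Map<String,List<String>>` shape returned by `URLConnection.getHeaderFields()`, unwraps gzip when `Content-Encoding` says so, and picks the charset from `Content-Type`. The class and method names are hypothetical, and falling back to ISO-8859-1 when no charset is declared is an assumption of this sketch, not something stated in the thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: decode a response body using its HTTP headers.
class HeaderAwareDecoding {

    // Header names are case-insensitive; the map may contain a null key
    // (the status line) when it comes from URLConnection.getHeaderFields().
    static String firstHeader(Map<String, List<String>> headers, String name) {
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            if (e.getKey() != null && e.getKey().equalsIgnoreCase(name) && !e.getValue().isEmpty()) {
                return e.getValue().get(0);
            }
        }
        return null;
    }

    // Extracts "charset=..." from a Content-Type value, or returns the fallback.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) return fallback;
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                try {
                    return Charset.forName(p.substring(8).trim().replace("\"", ""));
                } catch (IllegalArgumentException unknownOrIllegalName) {
                    return fallback;
                }
            }
        }
        return fallback;
    }

    static String decodeBody(InputStream raw, Map<String, List<String>> headers) {
        try {
            String encoding = firstHeader(headers, "Content-Encoding");
            boolean gzipped = encoding != null && encoding.trim().equalsIgnoreCase("gzip");
            InputStream in = gzipped ? new GZIPInputStream(raw) : raw;
            // Assumed fallback when the header names no charset.
            Charset charset = charsetOf(firstHeader(headers, "Content-Type"), StandardCharsets.ISO_8859_1);
            StringBuilder text = new StringBuilder();
            try (Reader reader = new InputStreamReader(in, charset)) {
                char[] buffer = new char[4096];
                int n;
                while ((n = reader.read(buffer)) != -1) text.append(buffer, 0, n);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper only: gzip-compresses bytes so the sketch has input to decode.
    static byte[] gzip(byte[] plain) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(plain);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The same two decisions (gzip or not, which charset) would supply the `input` and `charset` arguments in the three-line snippet above; the sketch decodes straight to a `String` only to keep it self-contained.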
From: <lui...@gm...> - 2006-03-04 04:13:54
Hi,

The charset parameter of the constructor Parser(String, String) will be returned when you call getEncoding(). No other effect beside this, I believe.

To read text from an InputStream (accessing a file, socket, etc.) a Reader should be used. To create a Reader, an explicit charset should be given (letting the Reader use the system's default is asking for problems...).

Because the creation of the Reader precedes the reading, the text encoding must be known prior to reading it. This is why the HTTP "Content-Type/charset-encoding" header is useful. However, this header is not always correct (consider it a hint), and sometimes is not even available (!) and we should consult an oracle then...

If the charset used is not the proper charset, then the String can be FIXED by converting it into bytes (with the same charset used for decoding) and then back to a String using the correct charset.

How to tell if THE correct charset was used?

Well, for now you can look for an http-equiv meta tag that specifies the charset. If you find such a tag and the charset is the same one you've used before, then you may trust your conversion. Otherwise you should choose to believe one of them (the HTTP header or the HTTP-EQUIV tag) and discard the other.

Otherwise, when can someone detect THE correct charset? The short answer: it's not easy and not always possible.

I hope this helps you, Antony.

By the way, I too have a related question for the developers:

I want to decouple the HTMLParser from the URLConnection where the network IO is done. I still want the parser to resolve links against the original URL of the page and to use the HTTP headers to parse the data (gunzipping data and charset decoding).

I think that the available constructors for Parser don't allow this decoupling in a straightforward fashion and without losing some of these features.

My current solution is to extend URLConnection and then use that object to feed the parser.

A, perhaps cleaner, solution would be to have a constructor taking three args:
    URL (for link resolving)
    InputStream for the data
    HTTP headers

The HTTP headers could be as returned from URLConnection.getHeaderFields() for interoperability:

    public Map<String,List<String>> getHeaderFields();

Returns an unmodifiable Map of the header fields. The Map keys are Strings that represent the response-header field names. Each Map value is an unmodifiable List of Strings that represents the corresponding field values.

The signature of the constructor I'm proposing is:

    public Parser(String url, InputStream input, Map<String,List<String>> httpHeaders);

I will proceed with extending URLConnection and feeding it into the Parser with the setter setConnection() (I reuse the Parser to parse several documents) while no better solution is in my knowledge.

Best Regards

Luís Gomes

On Mar 4, 2006, at 1:51 AM, Antony Sequeira wrote:
> <snip>
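The "fix the String" step and the meta-tag check described in this message can be sketched with the JDK alone. The class and method names below are hypothetical, and the regular expression is a deliberately loose assumption about what a charset declaration inside a `<meta>` tag looks like, not a real HTML parse. One caveat worth stating: the round trip only recovers the original bytes when the charset first used for decoding maps every byte to a character, as ISO-8859-1 does. If the text was wrongly decoded as UTF-8, invalid sequences have already been replaced and cannot be recovered this way.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the re-decoding idea from the message.
class CharsetRepair {

    // Loose pattern for a charset declared inside a <meta ...> tag, e.g.
    // <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the charset name declared in a meta tag, or null if none is found.
    static String metaCharset(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Turns the text back into bytes with the charset that was used to decode
    // it, then decodes those bytes again with the charset believed correct.
    static String redecode(String text, Charset usedForDecoding, Charset correct) {
        return new String(text.getBytes(usedForDecoding), correct);
    }

    public static void main(String[] args) {
        // UTF-8 bytes of "é" read as ISO-8859-1 show up as two characters.
        String garbled = new String("\u00e9".getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        String repaired = redecode(garbled, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(garbled.length() + " -> " + repaired.length());
    }
}
```

Comparing `metaCharset(html)` with the charset taken from the HTTP header gives the agreement check described above; when they disagree, the sketch leaves the choice between them to the caller.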
From: Antony S. <ant...@gm...> - 2006-03-04 01:51:48
Hi

I am thinking of using htmlparser for a project.
I have the content of URLs available in a file on disk. The file contains the headers, followed by the rest of the content as received from the webserver (so it's just a series of bytes).
I'll need something that can read and parse the headers, figure out the encoding for the rest of the content and then parse the rest of the content.

I have seen the javadocs and done some digging.
Here is what I think I need to do:
Write my own code to read through the headers to figure out the encoding.
Then call the following:
http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#createParser(java.lang.String,%20java.lang.String)

The questions I have on this approach are:

1. The 'html' parameter is of type 'String'. I'd think it would automatically imply that the string's content is already in Java format (utf-16 ?). So what is the point of having the charset argument?
I know utf-16 is an encoding and not a charset, but I don't understand the relevance of charset once something is in a 'java String', which can only be Unicode AFAIK.
It would have made sense to me if the html parameter was a byte array or some such thing.

2. I guess I could convert to String myself from the byte buffer once I have the code for encoding detection. But then what would I pass for the charset? It makes no sense to me in Java to say I have some data sitting in a 'java String' with charset iso-8859-1. I guess I am just confused about the need for charset specification when something is already in 'String'.

Thanks in advance for any ideas and help.

-Antony Sequeira
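For the first step described here, reading the stored headers before touching the content, a minimal sketch might split the raw bytes at the blank line that ends the header block. It assumes the file holds the response exactly as it came off the wire, with CRLF line endings and no chunked transfer encoding; the class `StoredResponse` and its fields are hypothetical names, and header names are lower-cased for easy lookup.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical holder for a response saved to disk as "headers + body" bytes.
class StoredResponse {
    final String statusLine;
    final Map<String, String> headers; // keys lower-cased
    final byte[] body;                 // still undecoded bytes

    private StoredResponse(String statusLine, Map<String, String> headers, byte[] body) {
        this.statusLine = statusLine;
        this.headers = headers;
        this.body = body;
    }

    static StoredResponse parse(byte[] raw) {
        int split = indexOfBlankLine(raw);
        if (split < 0) {
            throw new IllegalArgumentException("no blank line after headers");
        }
        // Decode only the header block; ISO-8859-1 maps every byte to a char.
        String head = new String(raw, 0, split, StandardCharsets.ISO_8859_1);
        String[] lines = head.split("\r\n");
        Map<String, String> headers = new LinkedHashMap<>();
        for (int i = 1; i < lines.length; i++) { // line 0 is the status line
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                headers.put(lines[i].substring(0, colon).trim().toLowerCase(Locale.ROOT),
                        lines[i].substring(colon + 1).trim());
            }
        }
        // Everything after the CRLF CRLF separator is the body.
        byte[] body = Arrays.copyOfRange(raw, split + 4, raw.length);
        return new StoredResponse(lines[0], headers, body);
    }

    // Index of the first CRLF CRLF sequence, or -1 if there is none.
    private static int indexOfBlankLine(byte[] raw) {
        for (int i = 0; i + 3 < raw.length; i++) {
            if (raw[i] == '\r' && raw[i + 1] == '\n' && raw[i + 2] == '\r' && raw[i + 3] == '\n') {
                return i;
            }
        }
        return -1;
    }
}
```

Keeping the body as bytes until the `content-type` header has been inspected avoids the situation raised in question 2, where the data is already a `String` before the right charset is known.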
From: Derrick O. <Der...@Ro...> - 2006-03-03 12:35:20
I would suggest trying the FilterBuilder utility.
You'll want things like TagNameFilter to get the <H3> and HasParent/HasChild/HasSibling filters to navigate around the node tree.

abh...@hs... wrote:
> Hi,
> I am going through the HtmlParser classes to develop a utility which reads
> HTML from a java program.
>
> My HTML doc has the info like this
>
>     <H2>My Name</H2><H3>Address</H3>
>     <P>It is not useful</P><H3>Age</H3>
>     <P>It is important</P>
>
> I have to read the content between <H1><H2> and the corresponding <P> tags.
> How to do this or how to get started.
>
> Thanks in advance
> Abhijeet
From: Derrick O. <Der...@Ro...> - 2006-03-03 12:32:04
If you just want to scan for strings, you can do that with pure Java.
If you want to extract specific tagged pieces, then HTML Parser is for you.
Use parser.setInputHTML(String), and then all the API of the parser becomes available.

Konstantine wrote:
> Greetings
> I have beginner level knowledge of Java so please be gentle with me
> :-) I am trying to build a small application where the program makes
> a number of POST requests and processes the results of the requests.
>
> I came as far as creating a separate thread for each request and
> storing the responses (full HTML document) in a StringBuffer belonging
> to thread standard packages. Now I want to scan the buffered document
> for various strings.
>
> Is HTMLParser the right tool to use to do this, is there a standard
> package I can use to achieve this?
>
> many thanks in advance
> K.
>
> FYI, the link Wiki[1] in left frame of home page and the link
> frequently asked questions[2] in the request support page seem to have
> problems.
>
> [1] http://htmlparser.sourceforge.net/wiki/index.php
> [2] http://htmlparser.sourceforge.net/faq.html
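As a small illustration of the "pure Java" route, the sketch below scans a `StringBuffer` for every occurrence of a search string using `StringBuffer.indexOf`. The class and method names are hypothetical, and matches are counted without overlap, which is an arbitrary choice of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: plain string scanning, no HTML parsing involved.
class BufferScan {

    // Returns the start index of every non-overlapping occurrence of needle.
    static List<Integer> findAll(StringBuffer doc, String needle) {
        List<Integer> hits = new ArrayList<>();
        if (needle.isEmpty()) {
            return hits; // an empty needle would match everywhere
        }
        int from = 0;
        while (true) {
            int at = doc.indexOf(needle, from);
            if (at < 0) {
                break;
            }
            hits.add(at);
            from = at + needle.length(); // continue after this match
        }
        return hits;
    }

    public static void main(String[] args) {
        StringBuffer response = new StringBuffer("<html><body>status: ok, retry: ok</body></html>");
        System.out.println(findAll(response, "ok").size() + " occurrence(s) found");
    }
}
```

This kind of scan knows nothing about tags, so text inside attributes or comments matches too; that is the point at which the parser route described in the reply becomes the better fit.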
From: <abh...@hs...> - 2006-03-03 11:03:27
Hi,

I am going through the HtmlParser classes to develop a utility which reads HTML from a Java program.

My HTML doc has info like this:

    <H2>My Name</H2><H3>Address</H3>
    <P>It is not useful</P><H3>Age</H3>
    <P>It is important</P>

I have to read the content between <H1><H2> and the corresponding <P> tags.
How do I do this, or how do I get started?

Thanks in advance
Abhijeet
From: Konstantine <lis...@gm...> - 2006-03-02 18:49:21
Greetings

I have beginner level knowledge of Java so please be gentle with me :-) I am trying to build a small application where the program makes a number of POST requests and processes the results of the requests.

I came as far as creating a separate thread for each request and storing the responses (full HTML document) in a StringBuffer belonging to thread standard packages. Now I want to scan the buffered document for various strings.

Is HTMLParser the right tool to use to do this, is there a standard package I can use to achieve this?

many thanks in advance
K.

FYI, the link Wiki[1] in left frame of home page and the link frequently asked questions[2] in the request support page seem to have problems.

[1] http://htmlparser.sourceforge.net/wiki/index.php
[2] http://htmlparser.sourceforge.net/faq.html
From: Derrick O. <Der...@Ro...> - 2006-03-02 02:35:59
|
Those are the primary resources. Beyond them it's mostly the Javadocs; for example, there's a good summary of the biggest difference (the underlying lexer) in the lexer package:
http://htmlparser.sourceforge.net/javadoc/org/htmlparser/lexer/package-summary.html

Vincent Mallet wrote:
> Thanks Derrick.
>
> Are "changes.txt" and "release.txt" all the documents about the
> evolution of htmlparser between 1.4 and 1.5/1.6, or is there something
> else that would talk about changes in concepts and design between the
> different releases?
>
> Thanks,
>
> Vince. |
From: Vincent M. <vm...@gm...> - 2006-03-01 17:34:51
|
Thanks Derrick.

Are "changes.txt" and "release.txt" all the documents about the evolution of htmlparser between 1.4 and 1.5/1.6, or is there something else that would talk about changes in concepts and design between the different releases?

Thanks,

Vince.

On 2/28/06, Derrick Oswald <Der...@ro...> wrote:
> No, sorry, there is no 'backwards compatibility' switch. |
From: Derrick O. <Der...@Ro...> - 2006-03-01 01:58:44
|
No, sorry, there is no 'backwards compatibility' switch.

Vincent Mallet wrote:
>Hello,
>
>I have some code that uses htmlparser 1.4 and I am looking at
>upgrading it to the latest 1.6 integration build. However, I am seeing
>differences in the way the input is processed that make the work more
>difficult.
>
>Given the input (note it's missing a quote):
>Hello <a href="http://www.foo.com>World</a>
>
>With htmlparser 1.4, I get the following nodes:
>Text: Hello
>Begin tag: a href="http://www.foo.com"
>Text: World
>End tag: a
>
>With htmlparser 1.6, I get these:
>Text: Hello
>LinkTag: link to http://www.foo.com>link</a>
>
>The 1.6 behavior makes error recovery a lot more difficult. Is there a
>way to have 1.6 behave like 1.4 in this case?
>
>Thanks for your help,
>
> Vince. |
From: Vincent M. <vm...@gm...> - 2006-03-01 00:47:06
|
Hello,

I have some code that uses htmlparser 1.4 and I am looking at upgrading it to the latest 1.6 integration build. However, I am seeing differences in the way the input is processed that make the work more difficult.

Given the input (note it's missing a quote):
Hello <a href="http://www.foo.com>World</a>

With htmlparser 1.4, I get the following nodes:
Text: Hello
Begin tag: a href="http://www.foo.com"
Text: World
End tag: a

With htmlparser 1.6, I get these:
Text: Hello
LinkTag: link to http://www.foo.com>link</a>

The 1.6 behavior makes error recovery a lot more difficult. Is there a way to have 1.6 behave like 1.4 in this case?

Thanks for your help,

Vince. |
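The lenient recovery Vince reports for 1.4 can be pictured with a small stand-alone sketch. This is not HTML Parser's code and makes no claim about how either release works internally: `LenientTagSplitter`, `split` and `balanceQuotes` are hypothetical names, and the rule used here (a tag always ends at the first '>', and an unbalanced double quote is closed at the end of the tag) is an assumption chosen only because it reproduces the node list quoted in the message.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a "tolerant" tokenizer; NOT the library's lexer.
final class LenientTagSplitter {

    // Splits the input into "Text:", "Begin tag:" and "End tag:" descriptions.
    static List<String> split(String html) {
        List<String> nodes = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            int lt = html.indexOf('<', pos);
            if (lt < 0) {                           // no more tags: the rest is text
                nodes.add("Text: " + html.substring(pos));
                break;
            }
            if (lt > pos) {                         // text before the next tag
                nodes.add("Text: " + html.substring(pos, lt));
            }
            int gt = html.indexOf('>', lt);         // assumption: first '>' always ends the tag
            if (gt < 0) {                           // unterminated tag: treat as text
                nodes.add("Text: " + html.substring(lt));
                break;
            }
            String body = html.substring(lt + 1, gt);
            if (body.startsWith("/")) {
                nodes.add("End tag: " + body.substring(1));
            } else {
                nodes.add("Begin tag: " + balanceQuotes(body));
            }
            pos = gt + 1;
        }
        return nodes;
    }

    // Appends a closing double quote when the tag contains an odd number of them.
    static String balanceQuotes(String body) {
        int quotes = 0;
        for (int i = 0; i < body.length(); i++) {
            if (body.charAt(i) == '"') {
                quotes++;
            }
        }
        return (quotes % 2 == 1) ? body + '"' : body;
    }

    public static void main(String[] args) {
        for (String node : split("Hello <a href=\"http://www.foo.com>World</a>")) {
            System.out.println(node);
        }
    }
}
```

The trade-off is the obvious one: a rule this simple would also cut a tag short at a legitimate '>' inside a correctly quoted attribute value, so it only illustrates why the two recovery strategies produce such different node lists.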
From: Ian M. <ian...@gm...> - 2006-02-24 21:46:21
|
Actually, if you read the W3C specs, it really does look like the two would sit quite happily in a single class. In reality, they are semantically the same thing, except one of them is visible and one of them is not. Please have a look over the specs at http://www.w3.org/TR/REC-html40/struct/links.html again and let me know what you think.

data: and view-source: are just protocols like http:, ftp:, javascript: etc. data: is used to store the file data in the HTML source (so an image can be encoded in a web page and only one file gets served); view-source: just means to open up the source code viewer for the URL rather than the HTML renderer.

Ian

On 24/02/06, Derrick Oswald <Der...@ro...> wrote:
> If you want to reuse the LinkTag name it should wait for 1.7 (or 2.0,
> whatever).
> That would mean an ATag class?
> The boolean seems like overkill.... simplify, simplify, simplify... <A
> for links, <LINK for anchors.
>
> Sorry, I don't know what you mean by data: and view-source: protocols.
>
> setLink sounds right. The others are legacy stuff that should probably
> be cleaned out.
>
> rel and rev, yes, adding tag specific methods is exactly what a class
> for each tag is all about. |
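Ian's point that data: and view-source: are "just protocols" can be made concrete by looking at the scheme prefix of a link value. The sketch below is an illustration only: `LinkScheme` and `schemeOf` are hypothetical names, not HTML Parser API, and the scheme test follows the generic URI rule (a letter followed by letters, digits, '+', '-' or '.', then a colon) rather than anything the library is known to do.

```java
import java.util.Locale;

// Hypothetical helper, not part of HTML Parser: extracts the scheme
// ("protocol") prefix from a link value.
final class LinkScheme {

    // Returns the lower-cased scheme of the link, or "" for relative links.
    static String schemeOf(String link) {
        int colon = link.indexOf(':');
        if (colon <= 0 || !isAsciiLetter(link.charAt(0))) {
            return "";                              // no scheme, or it does not start with a letter
        }
        for (int i = 1; i < colon; i++) {
            char c = link.charAt(i);
            boolean allowed = isAsciiLetter(c) || (c >= '0' && c <= '9')
                    || c == '+' || c == '-' || c == '.';
            if (!allowed) {
                return "";                          // e.g. "images/a:b.png" is a relative path
            }
        }
        return link.substring(0, colon).toLowerCase(Locale.ROOT);
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.w3.org/TR/REC-html40/struct/links.html",
            "data:image/png;base64,iVBORw0KGgo=",
            "view-source:http://www.w3.org/",
            "JavaScript:void(0)",
            "/faq.html"
        };
        for (String link : samples) {
            System.out.println("[" + schemeOf(link) + "] " + link);
        }
    }
}
```

Seen this way, mailto:, javascript:, data: and view-source: links differ only in the scheme string, which is the argument for a single setLink over per-protocol setters.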
From: Derrick O. <Der...@Ro...> - 2006-02-24 02:46:22
|
Yes, it is alive. The NetOMatix port is a C# clone and shouldn't be construed as the new direction the project is taking. (It's my personal opinion that C# will only last as long as a Microsoft VP thinks it can spread FUD; Microsoft's path is still Visual Basic and always will be... hork, spit.)

Luís Manuel dos Santos Gomes wrote:
> Hello,
>
> I cannot migrate all my work to the C#/.NET platform, although HTML
> parsing is a core functionality of my project.
> I'm coding a crawler to feed our natural language research group with
> corpus from the web. Currently I'm still evaluating options for the
> HTML parsing module. I have developed my own HTML scanner based on
> Java regexps, but it is too difficult to maintain and extend
> (after all, it can be a project by itself).
>
> My needs are far beyond simple link extraction/modification. I
> must handle every single tag that may reference an external resource
> (and that includes IFrame). This includes parsing embedded CSS
> imports. Embedded Javascript is still a problem...
>
> Anyway, the BIG question is: is this project alive?
> I know it is an open source project that is supported by people's free
> will, and I find that _very_ _meritorious_.
> I'm asking this question because I will make a decision now.
>
> I still would appreciate some feedback on the subject of this thread (the
> original post follows)
>
> Luís
>
> On Feb 15, 2006, at 4:30 PM, Third Eye wrote:
>
>> Hi!
>> We did implement IFrameTag and named the class as IFrameTag. Our
>> implementation is a .Net port of this library and we have added some of
>> our own enhancements.
>> If you are interested, you can download it from
>>
>> http://www.netomatix.com
>>
>> Naveen
>>
>> On 2/15/06, Luís Manuel dos Santos Gomes <lui...@gm...> wrote:
>>
>>> Hi everybody.
>>>
>>> This is my first post to this list.
>>> I'm replacing my own html processing code (regex based) with
>>> HTMLParser.
>>> The examples have been a great help!
>>>
>>> I need to handle IFRAME and LINK tags. The link tag is often used to
>>> include external CSS.
>>> The name "LinkTag" has already been taken for the anchor tags! How
>>> should I name the class to handle the LINK tags?
>>> Has anybody implemented the IframeTag and the "TrueLinkTag" classes?
>>> I could do this and would be glad to contribute it to the project.
>>> I'm using the version 20051112. I've not checked out from CVS because
>>> I need a stable package.
>>>
>>> Cheers!
>>>
>>> Luís Gomes
>>> (from Portugal)
>>
>> --
>> Naveen K Kohli
>> http://www.netomatix.com |
From: Derrick O. <Der...@Ro...> - 2006-02-24 02:25:50
|
If you want to reuse the LinkTag name it should wait for 1.7 (or 2.0, whatever).
That would mean an ATag class?
The boolean seems like overkill.... simplify, simplify, simplify... <A for links, <LINK for anchors.

Sorry, I don't know what you mean by data: and view-source: protocols.

setLink sounds right. The others are legacy stuff that should probably be cleaned out.

rel and rev, yes, adding tag specific methods is exactly what a class for each tag is all about.

Ian Macfarlane wrote:
>This project is still alive, if under slow development. There are
>still a number of check-ins being made fairly often, and we are
>possibly going to branch for a 1.6 release.
>
>The name LinkTag has indeed been taken for the anchor tag, but we can't
>change it now for backwards compatibility reasons.
>
>I think we might want to make LinkTag support <link> tags, and have a
>boolean method that says if it's an anchor or not. In fact, reading
>the W3C spec on this
>(http://www.w3.org/TR/REC-html40/struct/links.html) this seems like it
>might be the right thing to do.
>
>Can I get some feedback from some of the other devs on this? Does it
>seem like a good idea to do it this way? It looks to me like it
>probably is the best way to do it semantically and practically.
>
>Other things that look like they should be done (devs: please shout if
>you don't want any of this done):
>
>- add support for the data: and view-source: protocols
>- deprecate setMailLink and setJavascriptLink in favour of setLink
>- add get/set for rel and rev attributes
>
>Ian |
From: vraja s. <vra...@ya...> - 2006-02-23 20:03:39
|
Hi guys,

Thank you for the previous help. I have been stuck on this for quite a long time.

Tag (26431[419,17],26434[419,20]): b
test 0
Link to : http://www.epinions.com/Sony_DHandycam_CR_HC42_Camcorder; titled : Sony Handycam DCR-HC42 Mini DV Digital Camcorder; begins at : 26434; ends at : 26478, AccessKey=null
LinkData
--------
0 Txt (26478[419,64],26526[419,112]): Sony Handycam DCR-HC42 Mini DV Digital Ca...
*** END of LinkData ***
test 1
End (26530[419,116],26534[419,120]): /b
test 2
Tag (26534[419,120],26538[419,124]): br
test 3
Tag (26538[419,124],26556[419,142]): span class="rgr"
Txt (26556[419,142],26670[419,256]): Digital, Mini DV, 1 x CCD, Up to 1 Megap...
End (26670[419,256],26677[419,263]): /span
test 4
Tag (26677[419,263],26681[419,267]): br
test 5
Txt (26681[419,267],26682[420,0]): \n
test 6
Tag (26682[420,0],26700[420,18]): span class="rkr"
Tag (26700[420,18],26834[420,152]): img src="http://img.epinions.com/images/e...
Tag (26834[420,152],26838[420,156]): br
Tag (26838[420,156],26899[420,217]): a href="/Sony_DHandycam_CR_HC42_Camcorde...
Txt (26899[420,217],26917[420,235]): 7 consumer reviews
End (26917[420,235],26921[420,239]): /a
End (26921[420,239],26928[420,246]): /span
test 7
Txt (26928[420,246],26929[421,0]): \n
test 8
Tag (28022[428,17],28025[428,20]): b
test 9
Link to : http://www.epinions.com/Sony_HDR_HC1_Camcorder; titled : Sony Handycam HDR-HC1 HDV Digital Camcorder; begins at : 28025; ends at : 28059, AccessKey=null
LinkData
--------
0 Txt (28059[428,54],28102[428,97]): Sony Handycam HDR-HC1 HDV Digital Camcorde...
*** END of LinkData ***
test 10

The above are the children of a node. I have to check for "Sony Handycam DCR-HC42 Mini DV Digital" at node test1, and if it matches I have to extract the link for "7 reviews" at node test7. Since the link at node test7 is a child of another span tag, I find it difficult to check for the required camera and then extract the required link. Please help me out in solving this.

Thanks,
Raj |
From: Ian M. <ian...@gm...> - 2006-02-23 15:12:19
|
This project is still alive, if under slow development. There are still a number of check-ins being made fairly often, and we are possibly going to branch for a 1.6 release.

The name LinkTag has indeed been taken for the anchor tag, but we can't change it now for backwards compatibility reasons.

I think we might want to make LinkTag support <link> tags, and have a boolean method that says if it's an anchor or not. In fact, reading the W3C spec on this (http://www.w3.org/TR/REC-html40/struct/links.html) this seems like it might be the right thing to do.

Can I get some feedback from some of the other devs on this? Does it seem like a good idea to do it this way? It looks to me like it probably is the best way to do it semantically and practically.

Other things that look like they should be done (devs: please shout if you don't want any of this done):

- add support for the data: and view-source: protocols
- deprecate setMailLink and setJavascriptLink in favour of setLink
- add get/set for rel and rev attributes

Ian

On 23/02/06, Luís Manuel dos Santos Gomes <lui...@gm...> wrote:
> Hello,
>
> I cannot migrate all my work to the C#/.NET platform, although HTML
> parsing is a core functionality of my project.
> I'm coding a crawler to feed our natural language research group with
> corpus from the web. Currently I'm still evaluating options for the
> HTML parsing module. I have developed my own HTML scanner based on
> Java regexps, but it is too difficult to maintain and extend
> (after all, it can be a project by itself).
>
> My needs are far beyond simple link extraction/modification. I
> must handle every single tag that may reference an external resource
> (and that includes IFrame). This includes parsing embedded CSS
> imports. Embedded Javascript is still a problem...
>
> Anyway, the BIG question is: is this project alive?
> I know it is an open source project that is supported by people's free
> will, and I find that _very_ _meritorious_.
> I'm asking this question because I will make a decision now.
>
> I still would appreciate some feedback on the subject of this thread (the
> original post follows)
>
> Luís |
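The "one class plus a boolean" proposal in this message can be sketched as follows. Everything here is hypothetical: `LinkLikeElement` is a made-up name (deliberately not `LinkTag`), and the constructor and accessors are assumptions for illustration, not the library's existing or planned API. The sketch only shows the shape of the idea: one type covering both <a> and <link>, a boolean that tells them apart, a single setLink, and get/set pairs for rel and rev.

```java
// Hypothetical design sketch; not HTML Parser code.
final class LinkLikeElement {
    private final String tagName;   // "A" or "LINK"
    private String link;            // value of the href attribute
    private String rel;             // forward link type, e.g. "stylesheet"
    private String rev;             // reverse link type

    LinkLikeElement(String tagName, String link) {
        this.tagName = tagName;
        this.link = link;
    }

    // True for the visible <a> element, false for <link> in the document head.
    boolean isAnchor() {
        return "a".equalsIgnoreCase(tagName);
    }

    String getLink() { return link; }
    void setLink(String link) { this.link = link; }

    String getRel() { return rel; }
    void setRel(String rel) { this.rel = rel; }

    String getRev() { return rev; }
    void setRev(String rev) { this.rev = rev; }

    public static void main(String[] args) {
        LinkLikeElement css = new LinkLikeElement("LINK", "style.css");
        css.setRel("stylesheet");
        LinkLikeElement anchor = new LinkLikeElement("a", "http://www.w3.org/TR/REC-html40/struct/links.html");
        System.out.println(css.getLink() + " anchor? " + css.isAnchor());
        System.out.println(anchor.getLink() + " anchor? " + anchor.isAnchor());
    }
}
```

The alternative raised elsewhere in the thread, one class per tag, would drop the boolean and put the rel/rev accessors on whichever class needs them; the sketch takes no side on which is better.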