Thread: [Htmlparser-user] (no subject)
From: <sun...@ya...> - 2003-09-16 12:56:23
|
I don't understand how to get all the attributes of the <STYLE> tag, or how to parse an HTML document so that I can navigate through its style sheets and get all the font-related information from them. I want to perform some font-related searches on the HTML document.

I would be grateful if anyone could suggest how to do this using this HTML parser.

Regards,
Sunil |
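One possible approach to the font search, sketched with plain JDK classes rather than the HTMLParser API: once the text of a STYLE element (or an external style sheet) has been obtained as a string, a regular expression can pull out the font-related declarations. The class and method names below are hypothetical, and a regular expression is only a rough substitute for a real CSS parser (it ignores comments and vendor-prefixed properties, for example).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser.
final class FontDeclarations {
    // Matches a property named "font" or "font-xxx" and captures its value
    // up to the next ';' or '}'. The lookbehind skips names such as
    // "-webkit-font-smoothing" where "font" is only part of a longer name.
    private static final Pattern FONT_DECL = Pattern.compile(
            "(?<![\\w-])(font(?:-[a-z]+)*)\\s*:\\s*([^;}]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns "property=value" strings in the order they appear in the CSS text.
    static List<String> extract(String css) {
        List<String> found = new ArrayList<>();
        Matcher m = FONT_DECL.matcher(css);
        while (m.find()) {
            String property = m.group(1).toLowerCase(Locale.ROOT);
            String value = m.group(2).trim();
            found.add(property + "=" + value);
        }
        return found;
    }

    public static void main(String[] args) {
        String css = "p { font-family: Arial; color: red; font-size: 12px }";
        System.out.println(extract(css));
    }
}
```

How the style text is obtained from the parsed document is left open here; the JavaDocs are the place to check which node types expose it.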
From: Varley, R. <Rog...@at...> - 2004-03-31 12:39:27
|
Hi Everyone

I've just come across HTMLParser and I would like to know if it can help me with my particular case and how I go about using it.

I am writing a servlet that receives as a parameter a URL to another site. The servlet needs to construct a request to this URL and retrieve the HTML data. I need to parse the retrieved data and rewrite any URLs that it contains before passing the amended HTML page back to the original client.

I'd be grateful for pointers to any sample code.

Regards
Roger |
From: Derrick O. <Der...@Ro...> - 2004-03-31 23:15:43
|
Roger,

The HTML parser is ideal for your application. The URL can be passed to the Parser constructor, and a URL-rewriting example is provided in the org.htmlparser.parserapplications.SiteCapturer class. To send the modified page, build the servlet response from the string returned by the toHtml() call rather than writing it to a file. It shouldn't take more than a couple of hours before you can deploy it.

Derrick

Varley, Roger wrote:
> Hi Everyone
>
> I've just come across HTMLParser and I would like to know if it can help me with my particular case and how I go about using it.
>
> I am writing a servlet that receives as a parameter a url to another site. The servlet needs to construct a request to this URL and retrieve the HTML data. I need to parse the retrieved data and re-write any URL's that it contains before passing the amended HTML page back to the original client.
>
> I'd be grateful for pointers to any sample code.
>
> Regards
> Roger |
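One piece of the rewriting step can be sketched independently of the parser: deciding what each link should become. The following is a minimal sketch using only JDK classes (Java 10 or later for the URLEncoder overload), not the SiteCapturer code. The class name, the method names, and the "/fetch?url=" servlet path are all hypothetical; the idea is to resolve each link against the page it came from and then point it back at the servlet.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HTMLParser.
final class LinkRewriter {
    // Resolves a possibly relative link against the URL of the page it was found on.
    static String absolutize(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    // Rewrites a link so that it is fetched through the servlet again.
    // proxyPrefix is an assumed servlet path such as "/fetch?url=".
    static String viaProxy(String proxyPrefix, String pageUrl, String link) {
        String absolute = absolutize(pageUrl, link);
        return proxyPrefix + URLEncoder.encode(absolute, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String page = "http://example.com/news/index.html";
        System.out.println(absolutize(page, "../img/logo.png"));
        System.out.println(viaProxy("/fetch?url=", page, "story.html"));
    }
}
```

The rewritten value would then be written back into each link tag before calling toHtml(); which setter to use for that is best confirmed in the JavaDocs or the SiteCapturer source.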
From: zac c. <zac...@ya...> - 2005-07-23 23:22:19
|
Thanks for your help Derrick.

What I am trying to do is extract sentences from websites, so that I can analyse the grammar, use of jargon and so on in those sentences. Ideally I would like to automatically extract the text and create an ArrayList of sentences without any headings, menu links and so on.

I played around with MyStringBean and I have set it so that it will only extract text from the <p> tag. However, I notice that if the <p> is not closed with a </p> then the parser will continue to get text from any tags until a </p> is found. So I usually end up with all the text I want, followed by some unwanted headings, links and so on.

For example, if I parse http://news.bbc.co.uk/1/hi/entertainment/music/4710441.stm it will get all text from 'The LA court order...' to the end of the page, including unwanted links such as 'SEE ALSO: Doors manager Sugerman dies at 50 07 Jan 05 | Music'.

I cannot work out how to amend MyStringBean to solve this problem. My code is just this:

    public void visitStringNode (Text string)
    {
        if (mIsText)
            super.visitStringNode (string);
    }

    public void visitTag (Tag tag)
    {
        String name;

        super.visitTag (tag);
        name = tag.getTagName ();
        if (name.equalsIgnoreCase ("p")) {
            System.out.println("found 'p' tag");
            mIsText = true;
        }
    }

Another problem is that not all text is contained within <p> tags. In that BBC news article, for example, the text in bold ('Two remaining members of The Doors...') is not extracted because it is not inside a <p> tag (it's in a <b> tag before any <p>).

So I'm wondering whether I should just use htmlparser to indiscriminately get all text into a string, and then use standard Java classes to analyse this text and try to spot 'proper sentences' within this string. Or do you think there is a better way to do this using htmlparser? If you were doing this, what would be your approach?

Appreciating the support,
Zaccary

---------------------
Date: Thu, 21 Jul 2005 20:42:23 -0400
From: Derrick Oswald <Der...@Ro...>
To: htm...@li...
Subject: Re: [Htmlparser-user] getting all text from a html page
Reply-To: htm...@li...

That looks like it would work. Did you try it?

You shouldn't need to change the StringBean class, that's what all the "super." calls are for -- to get the original functionality plus some.

There are general instructions on Java programming nearly everywhere on the web. The specifics for the parser are in the JavaDocs.

Zac Craven wrote:
> OK - then how do I use this MyStringBean? I need to do something like
> this in my main program?
>
>     MyStringBean sb = new MyStringBean();
>     sb.setLinks(false);
>     sb.setURL(url);
>     String alltext = sb.getStrings();
>     return alltext;
>
> Also, do I need to change the StringBean class at all?
>
> If there is some instruction on this anywhere pls let me know the URL
> because I cannot find any info on this.
>
> Thanks,
> Zac
>
> http://www.dur.ac.uk/z.a.craven/breadcrumbs/ |
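The "standard Java classes" route mentioned above could look roughly like this. It is a minimal sketch that assumes the page text has already been collected into one string (for instance from getStrings()), and it uses java.text.BreakIterator from the JDK, not anything in htmlparser. The class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of htmlparser.
final class SentenceSplitter {
    // Splits plain text into trimmed sentences using the JDK's
    // locale-aware sentence boundary rules.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("The court ruled today. The band will appeal.")) {
            System.out.println(s);
        }
    }
}
```

Headings and menu links usually lack a closing full stop, so a further filter on the resulting list (for example, keeping only entries that end in sentence punctuation) may help separate them from body text; that filter is a guess at a heuristic, not something tested against the BBC page.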
From: lu d. <dom...@gm...> - 2006-05-31 10:39:17
|
From: lu d. <dom...@gm...> - 2006-08-09 02:38:54
|
From: Gee R. <ge...@ya...> - 2007-08-06 11:06:45
|
Hello!

I know Java but not HTML. I want to extract text files of some selected news from an Arabic newspaper. Can someone help me with how I can do this?

Best Regards,
G. Raza |
From: Daniel W. <wic...@un...> - 2007-10-29 11:58:05
|
Hi,

I would like to connect to a website that requires authentication (basic authentication, for example) and then process the HTML code. Unfortunately I haven't found a way to authenticate as a user via the htmlparser API. Did I overlook something, or what do I need to do?

Any help is greatly appreciated!

Cheers,
Daniel |
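For reference, HTTP basic authentication itself is just a request header, so it can be prepared with the JDK alone. The sketch below builds the header value as described in RFC 7617; it is not the htmlparser API, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of htmlparser.
final class BasicAuth {
    // Builds the value of the "Authorization" header for HTTP basic
    // authentication: "Basic " + base64("user:password").
    static String headerValue(String user, String password) {
        String credentials = user + ":" + password;
        byte[] bytes = credentials.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        System.out.println(headerValue("Aladdin", "open sesame"));
    }
}
```

The value can be attached to a java.net.URLConnection with setRequestProperty("Authorization", ...) before the connection is opened. Whether the parser can be handed such a prepared connection, or offers its own user and password settings, is worth checking in the JavaDocs.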
From: Akihiko M <ams...@gm...> - 2010-05-21 05:02:51
|