[Htmlparser-user] (no subject)
From: zac c. <zac...@ya...> - 2005-07-23 23:22:19
Thanks for your help, Derrick.

What I am trying to do is extract sentences from websites, so that I can analyse the grammar, use of jargon and so on in those sentences. Ideally I would like to automatically extract the text and create an ArrayList of sentences without any headings, menu links and so on.

I played around with MyStringBean and I have set it so that it will only extract text from the <p> tag. However, I notice that if the <p> is not closed with a </p>, the parser will continue to get text from any tags until a </p> is found. So I usually end up with all the text I want, followed by some unwanted headings, links and so on.

For example, if I parse http://news.bbc.co.uk/1/hi/entertainment/music/4710441.stm it will get all text from 'The LA court order...' to the end of the page, including unwanted links such as 'SEE ALSO: Doors manager Sugerman dies at 50 07 Jan 05 | Music'.

I cannot work out how to amend MyStringBean to solve this problem. My code is just this:

    public void visitStringNode (Text string)
    {
        if (mIsText)
            super.visitStringNode (string);
    }

    public void visitTag (Tag tag)
    {
        String name;

        super.visitTag (tag);
        name = tag.getTagName ();
        if (name.equalsIgnoreCase ("p"))
        {
            System.out.println("found 'p' tag");
            mIsText = true;
        }
    }

Another problem is that not all text is contained within <p> tags. In that BBC news article, for example, the text in bold ('Two remaining members of The Doors...') is not extracted because it is not inside a <p> tag (it's in a <b> tag before any <p>).

So I'm wondering whether I should just use htmlparser to indiscriminately get all text into a string, and then use standard Java classes to analyse this text and try to spot 'proper sentences' within this string. Or do you think there is a better way to do this using htmlparser? If you were doing this, what would be your approach?

Appreciating the support,
Zaccary

---------------------

Date: Thu, 21 Jul 2005 20:42:23 -0400
From: Derrick Oswald <Der...@Ro...>
To: htm...@li...
Subject: Re: [Htmlparser-user] getting all text from a html page
Reply-To: htm...@li...

That looks like it would work. Did you try it?

You shouldn't need to change the StringBean class, that's what all the "super." calls are for -- to get the original functionality plus some.

There are general instructions on Java programming nearly everywhere on the web. The specifics for the parser are in the JavaDocs.

Zac Craven wrote:
> OK - then how do I use this MyStringBean? I need to do something like
> this in my main program?
>
>     MyStringBean sb = new MyStringBean();
>     sb.setLinks(false);
>     sb.setURL(url);
>     String alltext = sb.getStrings();
>     return alltext;
>
> Also, do I need to change the StringBean class at all?
>
> If there is some instruction on this anywhere pls let me know the URL
> because I cannot find any info on this.
>
> Thanks,
> Zac
>
> http://www.dur.ac.uk/z.a.craven/breadcrumbs/
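
On the idea raised in the thread of pulling all the text into one string and then spotting 'proper sentences' with standard Java classes, here is a rough sketch of how that might look using java.text.BreakIterator. It assumes the page text has already been extracted into a String (for example with getStrings() as in the snippet quoted above). The class name SentenceExtractor, the methods extractSentences and looksLikeSentence, and the MIN_WORDS threshold are all hypothetical names invented for this illustration, and the filter is only a crude heuristic, not a definitive way to separate sentences from headings and menu links.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceExtractor {

    // Hypothetical threshold: fragments shorter than this are assumed
    // to be headings or menu links rather than sentences.
    private static final int MIN_WORDS = 4;

    /** Splits text at sentence boundaries and keeps only sentence-like pieces. */
    public static List<String> extractSentences(String text) {
        List<String> sentences = new ArrayList<String>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.UK);
        it.setText(text);

        int start = it.first();
        int end = it.next();
        while (end != BreakIterator.DONE) {
            String candidate = text.substring(start, end).trim();
            if (looksLikeSentence(candidate)) {
                sentences.add(candidate);
            }
            start = end;
            end = it.next();
        }
        return sentences;
    }

    /** Crude heuristic: ends with '.', '!' or '?' and has at least MIN_WORDS words. */
    static boolean looksLikeSentence(String s) {
        if (s.isEmpty()) {
            return false;
        }
        char last = s.charAt(s.length() - 1);
        boolean terminated = last == '.' || last == '!' || last == '?';
        int words = s.split("\\s+").length;
        return terminated && words >= MIN_WORDS;
    }

    public static void main(String[] args) {
        // Made-up sample text: two sentences followed by a menu-like fragment.
        String sample = "This is the first full sentence. "
                + "Here is another complete sentence! Home News Sport";
        for (String sentence : extractSentences(sample)) {
            System.out.println(sentence);
        }
    }
}
```

The trade-off with this approach is that it ignores the HTML structure entirely, so text in <b> tags outside any <p> is kept, but link text that happens to end in a full stop would slip through the filter as well.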