[Htmlparser-user] (no subject)

Thanks for your help Derrick.

What I am trying to do is extract sentences from
websites, so that I can analyse the grammar, use of
jargon and so on in those sentences.  Ideally I would
like to automatically extract the text and create an
ArrayList of sentences without any headings/menu links
and so on.

I played around with MyStringBean and I have set it so
that it will only extract text from the <p> tag.
However, I notice that if the <p> is not closed with a
</p> then the parser will continue to get text from
any tags until a </p> is found. So I usually end up
with all the text I want, followed by some unwanted
headings/links and so on. For example, if I parse
http://news.bbc.co.uk/1/hi/entertainment/music/4710441.stm
it will get all text from 'The LA court order...' to
the end of the page, including unwanted links such as
'SEE ALSO: 
Doors manager Sugerman dies at 50 
07 Jan 05 | Music'

I cannot work out how to amend the MyStringBean to
solve this problem.  

My code is just this:

    public void visitStringNode (Text string)
    {
        if (mIsText)
            super.visitStringNode (string);
    }

    public void visitTag (Tag tag)
    {
        String name;

        super.visitTag (tag);
        name = tag.getTagName ();
        if (name.equalsIgnoreCase ("p"))
        {
            System.out.println ("found 'p' tag");
            mIsText = true;
        }
    }
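
One possible way to stop the text running on past the paragraph is to clear the flag again, both on a closing </p> and on any block-level start tag that implicitly ends an unclosed <p>. The sketch below shows only that flag logic in plain Java, independent of htmlparser. The class ParagraphTextFilter, its method names and the list of block-level tags are all hypothetical, and how it would be wired into the StringBean visitor callbacks is an assumption left open here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: collects text only while "inside" a paragraph.
class ParagraphTextFilter {
    // Assumed (incomplete) list of tags that implicitly end an open <p>.
    private static final Set<String> BLOCK_TAGS = new HashSet<>(Arrays.asList(
            "div", "table", "tr", "td", "ul", "ol", "li",
            "h1", "h2", "h3", "h4", "h5", "h6", "form", "hr"));

    private boolean inParagraph = false;
    private final List<String> collected = new ArrayList<>();

    // Call for every start tag: <p> switches collection on,
    // a block-level tag switches it off, inline tags leave it unchanged.
    void startTag(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.equals("p")) {
            inParagraph = true;
        } else if (BLOCK_TAGS.contains(lower)) {
            inParagraph = false;
        }
    }

    // Call for every end tag: only </p> switches collection off.
    void endTag(String name) {
        if (name.equalsIgnoreCase("p")) {
            inParagraph = false;
        }
    }

    // Call for every text node: keep it only while inside a paragraph.
    void text(String s) {
        String trimmed = s.trim();
        if (inParagraph && !trimmed.isEmpty()) {
            collected.add(trimmed);
        }
    }

    List<String> getCollected() {
        return collected;
    }
}
```

With this logic, inline tags such as bold or links inside a paragraph do not interrupt collection, while a following div or table ends it even when the closing </p> is missing.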

Another problem is that not all text is contained
within <p> tags. In that BBC news article, for
example, the text in bold ('Two remaining members of
The Doors...') is not extracted because it is not
inside a <p> tag (it's in a bold tag before any <p>).
So I'm wondering whether I should just use htmlparser
to indiscriminately get all text into a string, and
then use standard Java classes to analyse this text
and try to spot 'proper sentences' within this string.
Or do you think there is a better way to do this
using htmlparser? If you were doing this, what would
be your approach?
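
For the "standard Java classes" route, a minimal sketch might use java.text.BreakIterator to cut the extracted string into sentences and then keep only pieces that look like proper sentences. The class name SentenceSplitter, the three-word minimum and the terminal-punctuation test are assumptions made for illustration, not a tested rule for real pages.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: splits plain text into sentences and filters out
// fragments such as headings or menu links.
class SentenceSplitter {
    private static final int MIN_WORDS = 3; // assumed threshold

    // Split the text at the sentence boundaries reported by BreakIterator.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.UK);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    // Heuristic: a "proper" sentence ends in . ! or ? and has enough words.
    static boolean looksLikeSentence(String s) {
        char last = s.charAt(s.length() - 1);
        boolean terminated = last == '.' || last == '!' || last == '?';
        return terminated && s.split("\\s+").length >= MIN_WORDS;
    }

    // Split, then keep only the pieces that pass the heuristic.
    static List<String> properSentences(String text) {
        List<String> kept = new ArrayList<>();
        for (String s : split(text)) {
            if (looksLikeSentence(s)) {
                kept.add(s);
            }
        }
        return kept;
    }
}
```

Calling SentenceSplitter.properSentences(alltext) on the string returned by the bean would give the list of candidate sentences. The heuristic is crude (it drops sentences ending in a closing quote, for instance), so it would probably need tuning against real pages.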

Appreciating the support,
Zaccary

---------------------
Date: Thu, 21 Jul 2005 20:42:23 -0400
From: Derrick Oswald <Der...@Ro...>
To:  htm...@li...
Subject: Re: [Htmlparser-user] getting all text from a html page
Reply-To: htm...@li...

That looks like it would work.  Did you try it?
You shouldn't need to change the StringBean class; that's what all the
"super." calls are for -- to get the original functionality plus some.
There are general instructions on Java programming nearly everywhere on
the web. The specifics for the parser are in the JavaDocs.

Zac Craven wrote:

> OK - then how do I use this MyStringBean?  I need to do something like
> this in my main program?
>
> MyStringBean sb = new MyStringBean();
> sb.setLinks(false);
> sb.setURL(url);
> String alltext = sb.getStrings();
> return alltext;
>
> Also, do I need to change the StringBean class at all?
>
> If there is some instruction on this anywhere pls let me know the URL
> because I cannot find any info on this.
>
> Thanks,
> Zac
>
>

http://www.dur.ac.uk/z.a.craven/breadcrumbs/
