htmlparser-user Mailing List for HTML Parser (Page 13)

Thanks Derrick...
I switched to HTMLParser in Python in the meantime.
So now I invoke the Python code and then redirect its output
back to my Java file. :)

Sandeep.

On Mon, Jun 14, 2010 at 10:16 PM, Derrick Oswald
<der...@gm...> wrote:

> Also add htmllexer.jar to your classpath.
>
> On Mon, Jun 14, 2010 at 8:12 AM, Sandeep Kumar Gupta <
> san...@gm...> wrote:
>
>> Hello Everyone,
>> I need to parse JSP files, so I am starting with an HTML parser, but it
>> seems the binary distribution that I downloaded from SourceForge does not
>> have all the classes mentioned in the JavaDoc (I put htmlparser.jar in the
>> classpath of my Eclipse project). What am I doing wrong?
>>
>> Also, are there introductory snippets that would get me started with the
>> HTML parser?
>>
>> --
>> Sandeep
>>
>>
>> ------------------------------------------------------------------------------
>> ThinkGeek and WIRED's GeekDad team up for the Ultimate
>> GeekDad Father's Day Giveaway. ONE MASSIVE PRIZE to the
>> lucky parental unit.  See the prize list and enter to win:
>> http://p.sf.net/sfu/thinkgeek-promo
>> _______________________________________________
>> Htmlparser-user mailing list
>> Htm...@li...
>> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
>>
>>

-- 
Sandeep

Also add htmllexer.jar to your classpath.

On Mon, Jun 14, 2010 at 8:12 AM, Sandeep Kumar Gupta <san...@gm...> wrote:

> Hello Everyone,
> I need to parse JSP files, so I am starting with an HTML parser, but it seems
> the binary distribution that I downloaded from SourceForge does not have all
> the classes mentioned in the JavaDoc (I put htmlparser.jar in the classpath
> of my Eclipse project). What am I doing wrong?
>
> Also, are there introductory snippets that would get me started with the
> HTML parser?
>
> --
> Sandeep

Hello Everyone,
I need to parse JSP files, so I am starting with an HTML parser, but it seems
the binary distribution that I downloaded from SourceForge does not have all
the classes mentioned in the JavaDoc (I put htmlparser.jar in the classpath
of my Eclipse project). What am I doing wrong?

Also, are there introductory snippets that would get me started with the
HTML parser?

-- 
Sandeep

Is it null or empty?
If it's null, it may be because the textInPage variable is local to that
block.

On Mon, May 31, 2010 at 12:08 PM, karanjit cheema
<kar...@gm...> wrote:

> Hi,
> I tried extracting text using the following code:
>
>
>            try {
>                Parser parser = new Parser (urlConnection);
>                TextExtractingVisitor visitor = new TextExtractingVisitor();
>                parser.visitAllNodesWith(visitor);
>                String textInPage = visitor.getExtractedText();
>            }
>            catch (ParserException pe)
>            {
>                pe.printStackTrace ();
>            }
>
>
> The variable textInPage always comes back empty. Can anyone tell me
> what the problem is?
>
>
> warm regards
> Karanjit Cheema

Hi,
I tried extracting text using the following code:

            try {
                Parser parser = new Parser (urlConnection);
                TextExtractingVisitor visitor = new TextExtractingVisitor();
                parser.visitAllNodesWith(visitor);
                String textInPage = visitor.getExtractedText();
            }
            catch (ParserException pe)
            {
                pe.printStackTrace ();
            }

The variable textInPage always comes back empty. Can anyone tell me
what the problem is?

warm regards
Karanjit Cheema
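
The scoping point raised in the reply above can be sketched as follows. This is a minimal illustration, not the poster's actual fix: the class name ScopeSketch, the helper extractOrEmpty, and the Callable stand-in for the parser call are all assumptions introduced here so that the example runs without the htmlparser jars. The idea is only that a variable declared before the try block is still visible after it.

```java
import java.util.concurrent.Callable;

class ScopeSketch {
    /**
     * Runs an extraction step and returns its result, or "" if the step throws.
     * The variable is declared before the try block so that it is still in
     * scope once the block has finished.
     */
    static String extractOrEmpty(Callable<String> extraction) {
        String textInPage = "";              // declared outside the try block
        try {
            textInPage = extraction.call();  // hypothetical stand-in for the parser call
        } catch (Exception e) {
            e.printStackTrace();
        }
        return textInPage;                   // a declaration inside try would not be visible here
    }

    public static void main(String[] args) {
        // Stand-in extraction step that simply returns a fixed string.
        System.out.println(extractOrEmpty(() -> "page text"));
    }
}
```

With the library on the classpath, the stand-in would be replaced by the calls from the question (visitAllNodesWith followed by getExtractedText). Whether that alone explains the empty result is not settled in the thread.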

It should be found under HTMLParser-2.0-SNAPSHOT.

On Fri, May 21, 2010 at 7:15 AM, Akihiko M <ams...@gm...> wrote:
> Hi
>
> Htmlparser is a wonderful library. I always use it.
> By the way, I want to use htmlparser 2.0 with Maven 2,
> but it is not registered in the Maven 2 central
> repository (http://repo2.maven.org/maven2/).
>
> I hope that htmlparser 2.0 or 2.x will be registered soon.
>
> When will it be registered?
>
> Thanks,
>
> Akihiko
>
> --
> --------------------------------
>
> Akihiko
> ams...@gm...

Hi

Htmlparser is a wonderful library. I always use it.
By the way, I want to use htmlparser 2.0 with Maven 2,
but it is not registered in the Maven 2 central
repository (http://repo2.maven.org/maven2/).

I hope that htmlparser 2.0 or 2.x will be registered soon.

When will it be registered?

Thanks,

Akihiko

-- 
--------------------------------

Akihiko
ams...@gm...

Dear All:

I have successfully gotten the SAX parser of HTML Parser to work with dom4j so that I can use Xpath expressions with HTMLParser.

I like HTMLParser as it is quite a bit faster than other frameworks.

However, I have two issues:
1) Is it possible, in any way, to change the SAX parser to report _lowercase_ tag names?
2) For some reason, although I get a valid tree in dom4j, I cannot find elements with XPath that I can see exist when browsing the tree
(for example, a TD element with a certain classpath, whether or not I capitalize TD). Has anyone used HTMLParser with dom4j
successfully?

Thank you
Misha

     [java] nu.xom.XMLException: org.htmlparser.sax.XMLReader does not support the entity resolution features XOM requires.

Any ideas?

Has anyone tried with dom4j?

I would love to have XPath support with HTMLParser, as it is quite fast.

Thank you!
Misha

Dear All:

I am currently using TagSoup with XOM to get XPath support as described here:
http://nicklothian.com/blog/2006/09/11/using-xpath-on-real-world-html-documents/
It seems to work well except for the following namespace problem:
http://www.supermind.org/blog/613/dom4j-xpath-tagsoup-namespaces-sweet

I noticed HTMLParser is, in my test, the fastest available, and has SAX parser support:
http://htmlparser.sourceforge.net/javadoc/org/htmlparser/sax/package-summary.html

Has anyone used this with XOM? Any luck? Is it better/worse (i.e., faster/slower) than TagSoup or other alternatives?

Thank you
Misha

Hi all,

  I have an XML file which we parse using htmlParser in conjunction with
CSRF. It seems that the following line in the XML file

&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;I.<span class="math">x^2
+ x - 2 </span><br />

is being displayed as junk characters and not translated into spaces.

This is happening only on my Linux box and not on my Windows box.

I am using FF3.6 / IE7 / IE8 etc.,

Let me know if I need to do anything special to ensure that the Unicode
characters are displayed correctly.

Regards |  Ramesh Kesavanarayanan  |    319-354-9200 ext 215785 / 215972
(O) |  /  319-621-7641 (M)   | ram...@pe...
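
No answer to this question is recorded in the thread. One possible cause, offered purely as an assumption, is a character-set mismatch: &#160; denotes U+00A0 (non-breaking space), and if that character is written as UTF-8 but read back as ISO-8859-1, it turns into two characters. The class and method names below are hypothetical, and the sketch uses only the JDK, not htmlparser.

```java
import java.nio.charset.StandardCharsets;

class CharsetMismatch {
    /** Encodes text as UTF-8, then decodes the bytes as ISO-8859-1 (the mismatch). */
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Encodes and decodes with the same explicit charset, so the text survives. */
    static String roundTrip(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // the character that &#160; denotes
        System.out.println(misdecode(nbsp).length());      // 2: one character became two
        System.out.println(roundTrip(nbsp).equals(nbsp));  // true
    }
}
```

If this is what is happening, naming the charset explicitly wherever bytes are read or written, rather than relying on the platform default (which commonly differs between Linux and Windows), would be the thing to try first.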

Also include the lexer.jar.

On 3/23/10, Gazihan Işıldak <gaz...@gm...> wrote:
> Hi,
>
> I'm developing a REST web service in Java.
>
> I'm using the htmlparser library in it.
>
> But when I try to run the service I get this exception. I can build it
> successfully, and the org.htmlparser.beans.StringBean class exists in the project.
>
>> exception
>>
>> javax.servlet.ServletException: java.lang.RuntimeException: WEB9033: Unable to load class with name [org.htmlparser.beans.StringBean], reason: java.lang.NoClassDefFoundError: org/htmlparser/visitors/NodeVisitor
>>
>> root cause
>>
>> java.lang.RuntimeException: WEB9033: Unable to load class with name [org.htmlparser.beans.StringBean], reason: java.lang.NoClassDefFoundError: org/htmlparser/visitors/NodeVisitor
>>
>> root cause
>>
>> java.lang.NoClassDefFoundError: org/htmlparser/visitors/NodeVisitor
>>
>> root cause
>>
>> java.lang.ClassNotFoundException: org.htmlparser.visitors.NodeVisitor
>>
>
> I checked that htmlparser.jar exists on the server.
>
> What should I do to fix this?
>

Hi,

I'm developing a REST web service in Java.

I'm using the htmlparser library in it.

But when I try to run the service I get this exception. I can build it
successfully, and the org.htmlparser.beans.StringBean class exists in the project.

> exception
>
> javax.servlet.ServletException: java.lang.RuntimeException: WEB9033: Unable to load class with name [org.htmlparser.beans.StringBean], reason: java.lang.NoClassDefFoundError: org/htmlparser/visitors/NodeVisitor
>
> root cause
>
> java.lang.RuntimeException: WEB9033: Unable to load class with name [org.htmlparser.beans.StringBean], reason: java.lang.NoClassDefFoundError: org/htmlparser/visitors/NodeVisitor
>
> root cause
>
> java.lang.NoClassDefFoundError: org/htmlparser/visitors/NodeVisitor
>
> root cause
>
> java.lang.ClassNotFoundException: org.htmlparser.visitors.NodeVisitor
>

I checked that htmlparser.jar exists on the server.

What should I do to fix this?
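
A small diagnostic along these lines can show which classes the runtime classpath actually provides. It is only a sketch: ClasspathCheck and isPresent are names made up for this example, and the two class names are simply the ones that appear in the stack trace above. It uses nothing beyond the JDK.

```java
class ClasspathCheck {
    /** Returns true if the named class can be located by this class's loader. */
    static boolean isPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.htmlparser.beans.StringBean",
            "org.htmlparser.visitors.NodeVisitor"
        };
        for (String name : names) {
            System.out.println(name + " -> " + (isPresent(name) ? "found" : "MISSING"));
        }
    }
}
```

Run inside the deployed application (not just the build environment), a MISSING line for NodeVisitor would be consistent with the advice in the reply to add the lexer jar alongside htmlparser.jar.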

Not in the node list.
But there is a
    Tag getEndTag ();
method to get it.

On Thu, Feb 18, 2010 at 11:08 AM, Rajorshi Biswas <raj...@in...> wrote:

> Hi,
> For 'known' tags, it seems that HTMLParser does not visit the 'end' tags
> (e.g. "P" tag). But for tags that aren't directly supported, such as the
> "STRONG" tag, the parser does return the end tag in the nodelist.
>
> Is there a way to ask the parser to retrieve the "end" tags for known
> classes as well?
>
> Thanks,
> Raj

Hi,
For 'known' tags, it seems that HTMLParser does not visit the 'end' tags
(e.g. "P" tag). But for tags that aren't directly supported, such as the
"STRONG" tag, the parser does return the end tag in the nodelist.

Is there a way to ask the parser to retrieve the "end" tags for known
classes as well?

Thanks,
Raj

Thanks much! TextNode is something I missed out on. (I didn't realize text
inside a node was modeled as a Node - silly me.)

Original message
From: Derrick Oswald <der...@gm...>
Date: 18 Feb 10 11:04:51
Subject: Re: [Htmlparser-user] query on how to read "data" for a particular TagNode
To: Rajorshi Biswas, htmlparser user list

> If you have the div tag, then since it is a composite node, "\nfoo\n" will
> be the first child:
>
> divtag.getChildren ().elementAt (0)
>
> On Thu, Feb 18, 2010 at 5:16 AM, Rajorshi Biswas wrote:
>
>> Hello,
>> I am new to htmlparser, so please forgive me if this is a naive question. I
>> have an HTML fragment for which I need to determine if the first visible
>> text is in bold or not.
>>
>> For this, I am trying to get the 'first' text content of the fragment.
>> Suppose the fragment is of the following form:
>>
>> <div>
>> foo
>> <p>something</p>
>> <span>something else</span>
>> </div>
>>
>> My question is: how do I get the "data" portion of the 'div'. That is, when
>> I arrive at the "div" node (Div object), I wish to retrieve the content of
>> the div WITHOUT its children elements - I wish to retrieve "foo" in this
>> case.
>>
>> I could not find an API in the Node/TagNode classes for this. Could anyone
>> please help me out here?
>>
>> Thanks in advance!
>> Raj

If you have the div tag, then since it is a composite node, "\nfoo\n" will
be the first child:

divtag.getChildren ().elementAt (0)

On Thu, Feb 18, 2010 at 5:16 AM, Rajorshi Biswas <raj...@in...> wrote:

> Hello,
> I am new to htmlparser, so please forgive me if this is a naive question. I
> have an HTML fragment for which I need to determine if the first visible
> text is in bold or not.
>
> For this, I am trying to get the 'first' text content of the fragment.
> Suppose the fragment is of the following form:
>
> <div>
> foo
> <p>something</p>
> <span>something else</span>
> </div>
>
> My question is: how do I get the "data" portion of the 'div'. That is, when
> I arrive at the "div" node (Div object), I wish to retrieve the content of
> the div WITHOUT its children elements - I wish to retrieve "foo" in this
> case.
>
> I could not find an API in the Node/TagNode classes for this. Could anyone
> please help me out here?
>
>
> Thanks in advance!
> Raj

Hello,
I am new to htmlparser, so please forgive me if this is a naive question. I
have an HTML fragment for which I need to determine if the first visible
text is in bold or not.

For this, I am trying to get the 'first' text content of the fragment.
Suppose the fragment is of the following form:

<div>
foo
<p>something</p>
<span>something else</span>
</div>

My question is: how do I get the "data" portion of the 'div'. That is, when
I arrive at the "div" node (Div object), I wish to retrieve the content of
the div WITHOUT its children elements - I wish to retrieve "foo" in this
case.

I could not find an API in the Node/TagNode classes for this. Could anyone
please help me out here?

Thanks in advance!
Raj

Have a look at the site capturer code.
It basically does what you want, I think.

2010/2/17 Wagner Montalvão Camarão <wag...@gm...>

> Hello everyone,
>
> I'm new here and I'd like some direction on using htmlparser. I need to
> go through HTML code and replace each link (href content) with a new one.
> For example, <a href="www.google.com"> will become
> <a href="www.mysite.com/click?link=www.google.com">
>
> First I tried using javax.xml.parsers and org.w3c.dom, but the problem is
> that I get an exception if I try to parse XHTML that is not valid according
> to the W3C. I can't work like this because some users may post HTML
> generated by designer tools, whose formatting may not be W3C-valid.
>
> I could write some regex to do this, but I heard about htmlparser and I
> would like to know if it can help me with this.
>
> Any suggestion will be appreciated.
>
> Thank you
>
> Wagner Montalvão Camarão

Hello everyone,

I'm new here and I'd like some direction on using htmlparser.
I need to go through HTML code and replace each link (href content) with a
new one. For example, <a href="www.google.com"> will become
<a href="www.mysite.com/click?link=www.google.com">

First I tried using javax.xml.parsers and org.w3c.dom, but the problem is
that I get an exception if I try to parse XHTML that is not valid according
to the W3C. I can't work like this because some users may post HTML
generated by designer tools, whose formatting may not be W3C-valid.

I could write some regex to do this, but I heard about htmlparser and I would
like to know if it can help me with this.

Any suggestion will be appreciated.

Thank you

Wagner Montalvão Camarão
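
For comparison, the regex fallback mentioned in the question might look roughly like the sketch below. It is not the site capturer approach suggested in the reply and does not use htmlparser at all. LinkRewriter, wrap and rewrite are hypothetical names, the pattern only handles double-quoted href values inside <a> tags, and the code assumes Java 10 or later. The redirect prefix is the one given in the question.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkRewriter {
    // Matches <a ... href="..."> with a double-quoted value; group 2 is the URL.
    private static final Pattern HREF = Pattern.compile(
            "(<a\\b[^>]*?\\bhref\\s*=\\s*\")([^\"]*)(\")", Pattern.CASE_INSENSITIVE);

    /** Wraps one target URL in the redirect URL from the question. */
    static String wrap(String target) {
        return "www.mysite.com/click?link="
                + URLEncoder.encode(target.trim(), StandardCharsets.UTF_8);
    }

    /** Rewrites every double-quoted href found in anchor tags; other text is untouched. */
    static String rewrite(String html) {
        Matcher m = HREF.matcher(html);
        return m.replaceAll(r -> Matcher.quoteReplacement(
                r.group(1) + wrap(r.group(2)) + r.group(3)));
    }

    public static void main(String[] args) {
        System.out.println(rewrite("<a href=\"www.google.com\">Google</a>"));
    }
}
```

The target is URL-encoded so that links containing characters such as & or ? survive as a single query parameter. Single-quoted or unquoted attributes, and markup as irregular as the question describes, are exactly where a tolerant parser is likely to do better than a pattern like this.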

On Sat, Dec 12, 2009 at 10:02 AM, Derrick Oswald
<der...@gm...> wrote:

> This has been replaced by the main program in
> org.htmlparser.beans.StringBean.
>

I never did get that name - StringBean.  It extracts strings but it isn't
called a StringExtractor.  Hmmmm...

best
jk

Great, thanks for the answer! (I just saw it now)

The library seems great!
One question: it seems that it does not handle div elements correctly.
A div element is a block element
<http://www.webdesignfromscratch.com/html-css/css-block-and-inline.php>
(by default), and thus it should render a new line.

For example, with this html file:
++++++++++++++++++
<html>
<body>
test1
test2
<div>test3</div>
test4
<span>test5</span>
<span>test6</span>
</body>
</html>
++++++++++++++++++

it should produce:
++++++++++++++++++
test1 test2
test3
test4 test5 test6
++++++++++++++++++

note the new line between test3 and test4.

However, StringBean produces the following:
++++++++++++++++++
test1 test2
test3 test4 test5 test6
++++++++++++++++++

It correctly handles new lines for text and span nodes, but not for
divs.

Is that the intended behavior? If so, is it possible to override this (add a
new line for block elements)?

Regards,
David Portabella

On Sat, Dec 12, 2009 at 10:02 AM, Derrick Oswald
<der...@gm...> wrote:

> This has been replaced by the main program in
> org.htmlparser.beans.StringBean.
>
> Sorry for the misdirection
>
> On Wed, Dec 9, 2009 at 11:18 PM, David Portabella Clotet <
> dav...@gm...> wrote:
>
>> Hello,
>>
>> In the website: http://htmlparser.sourceforge.net/samples.html
>> there is info about the "StringExtractor" example:
>> ++++++++++++++++++
>> String Extractor
>> Extract text from a web page.
>> org.htmlparser.parserapplications.StringExtractor
>> bin/stringextractor http://website_url
>> ++++++++++++++++++
>>
>> However, I did not find this example in either of these two downloads:
>> HTMLParser-2.0-SNAPSHOT-src.zip
>> HTMLParser-2.0-SNAPSHOT-bin.zip
>>
>> Can you please tell me where to find the StringExtractor example?
>>
>>
>> Best regards,
>> David Portabella
>>
>>

htmlparser-user — The user mailing list for users of the htmlparser library