htmlparser-user Mailing List for HTML Parser (Page 42)
Brought to you by:
derrickoswald
|
From: Ian M. <ian...@gm...> - 2006-02-23 15:12:19
|
This project is still alive, if under slow development. There are still a number of check-ins being made fairly often, and we are possibly going to branch for a 1.6 release.

The name LinkTag has indeed been taken for the anchor tag, but we can't change it now for backwards-compatibility reasons. I think we might want to make LinkTag support <link> tags, and have a boolean method that says whether it's an anchor or not. In fact, reading the W3C spec on this (http://www.w3.org/TR/REC-html40/struct/links.html), this seems like it might be the right thing to do.

Can I get some feedback from some of the other devs on this? Does it seem like a good idea to do it this way? It looks to me like it probably is the best way to do it, semantically and practically.

Other things that look like they should be done (devs: please shout if you don't want any of this done):
- add support for the data: and view-source: protocols
- deprecate setMailLink and setJavascriptLink in favour of setLink
- add get/set for rel and rev attributes

Ian

On 23/02/06, Luís Manuel dos Santos Gomes <lui...@gm...> wrote:
> Hello,
>
> I cannot migrate all my work to the C#/.NET platform, although HTML
> parsing is a core functionality of my project.
> I'm coding a crawler to feed our natural language research group with
> corpus from the web. Currently I'm still evaluating options for the
> HTML parsing module. I have developed my own HTML scanner based on
> Java regexps, but it is too difficult to maintain and extend
> (after all, it can be a project by itself).
>
> My needs are far beyond the simple link extraction/modification. I
> must handle every single tag that may reference an external resource
> (and that includes IFrame). This includes parsing embedded CSS
> imports. Embedded Javascript is still a problem...
>
> Anyway, the BIG question is: is this project alive?
> I know it is an open source project that is supported by people's free
> will, and I find that _very_ _meritorious_.
> I'm putting this question because I will make a decision now.
>
> I still would appreciate some feedback on the subject of this thread (the
> original post follows)
>
> Luís
>
> On Feb 15, 2006, at 4:30 PM, Third Eye wrote:
>
> > Hi!
> > We did implement IFrameTag and named the class as IFrameTag. Our
> > implementation is .Net port of this library and we have added some of
> > our own enhancements.
> > If you are interested, you can download it from
> >
> > http://www.netomatix.com
> >
> > Naveen
> >
> > On 2/15/06, Luís Manuel dos Santos Gomes <lui...@gm...> wrote:
> >> Hi everybody.
> >>
> >> This is my first post to this list.
> >> I'm replacing my own html processing code (regex based) with
> >> HTMLParser.
> >> The examples have been a great help!
> >>
> >> I need to handle IFRAME and LINK tags. The link tag is often used to
> >> include external CSS.
> >> The name "LinkTag" has already been taken for the anchor tags! How
> >> should I name the class to handle the LINK tags?
> >> Has anybody implemented the IframeTag and the "TrueLinkTag" classes?
> >> I could do this and would be glad to contribute it to the project.
> >> I'm using the version 20051112. I've not checked out from CVS because
> >> I need a stable package.
> >>
> >> Cheers!
> >>
> >> Luís Gomes
> >> (from Portugal)
> >
> > --
> > Naveen K Kohli
> > http://www.netomatix.com
|
From: <lui...@gm...> - 2006-02-23 03:03:32
|
Hello,

I cannot migrate all my work to the C#/.NET platform, although HTML
parsing is a core functionality of my project.
I'm coding a crawler to feed our natural language research group with
corpus from the web. Currently I'm still evaluating options for the
HTML parsing module. I have developed my own HTML scanner based on
Java regexps, but it is too difficult to maintain and extend
(after all, it can be a project by itself).

My needs are far beyond the simple link extraction/modification. I
must handle every single tag that may reference an external resource
(and that includes IFrame). This includes parsing embedded CSS
imports. Embedded Javascript is still a problem...

Anyway, the BIG question is: is this project alive?
I know it is an open source project that is supported by people's free
will, and I find that _very_ _meritorious_.
I'm putting this question because I will make a decision now.

I still would appreciate some feedback on the subject of this thread (the
original post follows)

Luís

On Feb 15, 2006, at 4:30 PM, Third Eye wrote:
> Hi!
> We did implement IFrameTag and named the class as IFrameTag. Our
> implementation is .Net port of this library and we have added some of
> our own enhancements.
> If you are interested, you can download it from
>
> http://www.netomatix.com
>
> Naveen
>
> On 2/15/06, Luís Manuel dos Santos Gomes <lui...@gm...> wrote:
>> Hi everybody.
>>
>> This is my first post to this list.
>> I'm replacing my own html processing code (regex based) with
>> HTMLParser.
>> The examples have been a great help!
>>
>> I need to handle IFRAME and LINK tags. The link tag is often used to
>> include external CSS.
>> The name "LinkTag" has already been taken for the anchor tags! How
>> should I name the class to handle the LINK tags?
>> Has anybody implemented the IframeTag and the "TrueLinkTag" classes?
>> I could do this and would be glad to contribute it to the project.
>> I'm using the version 20051112. I've not checked out from CVS because
>> I need a stable package.
>>
>> Cheers!
>>
>> Luís Gomes
>> (from Portugal)
>
> --
> Naveen K Kohli
> http://www.netomatix.com
|
From: Derrick O. <Der...@Ro...> - 2006-02-18 20:06:24
|
Ryan,

There is no setter for the ids. You would instead subclass it as you did and override getIds() and getEnders() to return both names (it only considers uppercase, BTW). Then when you register your tag with the PrototypicalNodeFactory, objects of your class will be returned for both <IMG and <IMAGE tags, since the ImageTag registration will be overwritten.

Derrick

Ryan Smith wrote:
> I have noticed that some HTML authors/HTML editors are using <image
> src=".... tags instead of the standard <IMG src=".... tags.
> I extended the ImageTag class to create a MyImageTag class, but that is
> ugly.
> Is there a convenient way to add strings to the ImageTag's mIds array?
> I would like to add "IMAGE", "Image", and "image" to the list of valid
> ImageTag identifiers.
> Is there a setter or some other way I can add tag ids? Thanks a
> lot.
>
> -Ryan J. Smith
> Live Data Group
> Software Developer
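The registration behaviour described in the reply above can be pictured with a small stand-alone model. This is only a sketch under stated assumptions: `TagRegistrySketch` and `TagPrototype` are hypothetical names built on the Java standard library, not HTMLParser's `PrototypicalNodeFactory` or `ImageTag`, and the only behaviour modelled is what the message states (lookup by uppercase name, later registrations replacing earlier ones).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy model of a prototype registry keyed by uppercase tag name (not HTMLParser code). */
public class TagRegistrySketch {

    /** Hypothetical stand-in for a tag prototype: a label plus the names it answers to. */
    public static class TagPrototype {
        final String label;
        final String[] ids;

        public TagPrototype(String label, String... ids) {
            this.label = label;
            this.ids = ids;
        }
    }

    private final Map<String, TagPrototype> prototypes = new HashMap<>();

    /** Registers the prototype under each of its ids, replacing any earlier entry. */
    public void register(TagPrototype prototype) {
        for (String id : prototype.ids) {
            prototypes.put(id, prototype);
        }
    }

    /** Looks a tag name up after uppercasing it; returns null for unknown names. */
    public String lookup(String tagName) {
        TagPrototype prototype = prototypes.get(tagName.toUpperCase(Locale.ROOT));
        return prototype == null ? null : prototype.label;
    }

    public static void main(String[] args) {
        TagRegistrySketch registry = new TagRegistrySketch();
        registry.register(new TagPrototype("ImageTag", "IMG"));
        // The subclass answers to both names, so it takes over the "IMG" slot too.
        registry.register(new TagPrototype("MyImageTag", "IMG", "IMAGE"));
        System.out.println(registry.lookup("img"));
        System.out.println(registry.lookup("Image"));
    }
}
```

In this model both lookups resolve to `MyImageTag`, which is why listing only the uppercase ids `IMG` and `IMAGE` is enough to cover `<image`, `<Image` and `<IMAGE` in the input.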
|
From: Ryan S. <rs...@li...> - 2006-02-17 19:18:41
|
I have noticed that some HTML authors/HTML editors are using <image src=".... tags instead of the standard <IMG src=".... tags.
I extended the ImageTag class to create a MyImageTag class, but that is ugly.
Is there a convenient way to add strings to the ImageTag's mIds array?
I would like to add "IMAGE", "Image", and "image" to the list of valid ImageTag identifiers.
Is there a setter or some other way I can add tag ids? Thanks a lot.

-Ryan J. Smith
Live Data Group
Software Developer
|
From: Third E. <nav...@gm...> - 2006-02-15 16:31:33
|
Here is a sample for testing it out. This sample is in C#/.NET, but you
should be able to adapt it to Java code quickly.

static void TestLinkRegExFilterForPage(String strUrl)
{
    Parser obParser = new Parser(new System.Uri(strUrl));
    String strPatterns = "services*";
    NodeFilter obLinkRegExFilter = new LinkRegexFilter(strPatterns);
    NodeList nodes = obParser.ExtractAllNodesThatMatch(obLinkRegExFilter);
    if (nodes != null)
    {
        for (Int32 i = 0; i < nodes.Count; i++)
        {
            INode obNode = nodes[i];
            Console.WriteLine(obNode.GetText());
        }
    }
}

On 2/15/06, Luís Manuel dos Santos Gomes <lui...@dq...> wrote:
> Hi Raj
>
> Check out the example applications bundled with the parser. They
> really help one to get acquainted with HTMLParser.
> In particular, for your problem check out this example:
>
> org.htmlparser.parserapplications.LinkExtractor
>
> It tells you how to extract links. Then you should use the class
>
> org.htmlparser.filters.LinkStringFilter
>
> or
>
> org.htmlparser.filters.LinkRegexFilter
>
> to get only links containing the string "sony" or "jvc".
>
> Hope this helps you.
>
> On Feb 15, 2006, at 5:29 AM, vraja sekaran wrote:
>
> > Hi guys
> > I am new to HTML parser. I am trying to extract the
> > links that corresponds to a search string.
> > For example
> > <dd><a
> > href="/Camcorders--reviews--brand_sony">Sony</a> <tt>(415)</tt><dd><a
> > href="/Camcorders--reviews--jvc">JVC</a> <tt>(385)</tt><dd><a
> >
> > .....
> >
> > In the above source code I want to extract the link
> > corresponding to Sony or JVC according to the
> > requirement.
> >
> > Thank you guys
> > Raj
>
>
--
Naveen K Kohli
http://www.netomatix.com
|
|
From: Third E. <nav...@gm...> - 2006-02-15 16:30:23
|
Hi!
We did implement IFrameTag and named the class as IFrameTag. Our
implementation is .Net port of this library and we have added some of
our own enhancements.
If you are interested, you can download it from

http://www.netomatix.com

Naveen

On 2/15/06, Luís Manuel dos Santos Gomes <lui...@gm...> wrote:
> Hi everybody.
>
> This is my first post to this list.
> I'm replacing my own html processing code (regex based) with HTMLParser.
> The examples have been a great help!
>
> I need to handle IFRAME and LINK tags. The link tag is often used to
> include external CSS.
> The name "LinkTag" has already been taken for the anchor tags! How
> should I name the class to handle the LINK tags?
> Has anybody implemented the IframeTag and the "TrueLinkTag" classes?
> I could do this and would be glad to contribute it to the project.
> I'm using the version 20051112. I've not checked out from CVS because
> I need a stable package.
>
> Cheers!
>
> Luís Gomes
> (from Portugal)

--
Naveen K Kohli
http://www.netomatix.com
|
From: <lui...@dq...> - 2006-02-15 14:23:51
|
Hi Raj

Check out the example applications bundled with the parser. They
really help one to get acquainted with HTMLParser.
In particular, for your problem check out this example:

org.htmlparser.parserapplications.LinkExtractor

It tells you how to extract links. Then you should use the class

org.htmlparser.filters.LinkStringFilter

or

org.htmlparser.filters.LinkRegexFilter

to get only links containing the string "sony" or "jvc".

Hope this helps you.

On Feb 15, 2006, at 5:29 AM, vraja sekaran wrote:
> Hi guys
> I am new to HTML parser. I am trying to extract the
> links that corresponds to a search string.
> For example
> <dd><a
> href="/Camcorders--reviews--brand_sony">Sony</a> <tt>(415)</tt><dd><a
> href="/Camcorders--reviews--jvc">JVC</a> <tt>(385)</tt><dd><a
>
> .....
>
> In the above source code I want to extract the link
> corresponding to Sony or JVC according to the
> requirement.
>
> Thank you guys
> Raj
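The two filtering ideas named in the reply above (substring match and regular-expression match on a link) can be sketched without the library. This is an illustration only: `LinkFilterSketch` and its methods are hypothetical helpers over plain strings, not the `LinkStringFilter` or `LinkRegexFilter` classes, and the third link in the sample data is invented so that something gets filtered out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Stand-alone sketch of link filtering on plain href strings (not HTMLParser code). */
public class LinkFilterSketch {

    /** Keeps the links that contain the given text (case-sensitive). */
    public static List<String> filterBySubstring(List<String> links, String text) {
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            if (link.contains(text)) {
                kept.add(link);
            }
        }
        return kept;
    }

    /** Keeps the links in which the regular expression matches anywhere. */
    public static List<String> filterByRegex(List<String> links, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            if (pattern.matcher(link).find()) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "/Camcorders--reviews--brand_sony",
                "/Camcorders--reviews--jvc",
                "/Camcorders--reviews--canon"); // hypothetical extra link
        System.out.println(filterBySubstring(links, "sony"));
        System.out.println(filterByRegex(links, "sony|jvc"));
    }
}
```

A single alternation such as `sony|jvc` covers both brands in one pass, which is the main reason to prefer the regex variant when more than one search string is involved.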
|
From: <lui...@gm...> - 2006-02-15 14:23:44
|
Hi everybody.

This is my first post to this list.
I'm replacing my own html processing code (regex based) with HTMLParser.
The examples have been a great help!

I need to handle IFRAME and LINK tags. The link tag is often used to
include external CSS.
The name "LinkTag" has already been taken for the anchor tags! How
should I name the class to handle the LINK tags?
Has anybody implemented the IframeTag and the "TrueLinkTag" classes?
I could do this and would be glad to contribute it to the project.
I'm using the version 20051112. I've not checked out from CVS because
I need a stable package.

Cheers!

Luís Gomes
(from Portugal)
|
From: vraja s. <vra...@ya...> - 2006-02-15 05:29:50
|
Hi guys
I am new to HTML parser. I am trying to extract the
links that corresponds to a search string.
For example
<dd><a href="/Camcorders--reviews--brand_sony">Sony</a> <tt>(415)</tt><dd><a
href="/Camcorders--reviews--jvc">JVC</a> <tt>(385)</tt><dd><a

.....

In the above source code I want to extract the link
corresponding to Sony or JVC according to the
requirement.

Thank you guys
Raj

Rajasekaran Venkatachalam
3602 Spottswood Ave, Apt # 2
Memphis, TN 38111, USA
Mobile # 901-246-4031
Work # 901-678-5323
|
From: Derrick O. <Der...@Ro...> - 2006-02-08 12:46:39
|
01) Looking at the source code, the SiteCapturer code goes through a
NodeIterator, but the Parser.parse (NodeFilter) method with a null
filter would do the same thing.
    // fetch the page and gather the list of nodes
    mParser.setURL (url);
    try
    {
        list = new NodeList ();
        for (NodeIterator e = mParser.elements (); e.hasMoreNodes (); )
            list.add (e.nextNode ()); // URL conversion occurs in the tags
    }
    catch (EncodingChangeException ece)
    {
        // fix bug #998195 SiteCapturer just crashed
        // try again with the encoding now set correctly
        // hopefully mPages, mImages, mCopied and mFinished won't be corrupted
        mParser.reset ();
        list = new NodeList ();
        for (NodeIterator e = mParser.elements (); e.hasMoreNodes (); )
            list.add (e.nextNode ());
    }
02) No validation is done on the page. However, the heuristics built in
to the tag parsing will insert terminating nodes (identified by 0 ==
(tag.getEndPosition () - tag.getStartPosition ())) where end tags are
required.
03) After parsing the entire page, the source is available (as
characters or String) from the Page/Source, which is exposed
on the parser as getPage(). Strings in Java are UTF-16 encoded Unicode.
Any errors in conversion (using the encoding specified by the HTTP
header or HTML meta tags) will already have been committed by then.
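Point 03 can be illustrated with the Java standard library alone. The sketch below assumes a page served as ISO-8859-1 and uses a hypothetical class name, `EncodingRoundTripSketch`; it does not touch HTMLParser. It shows that text decoded with the right charset can be stored as UTF-8 and converted back to the original bytes, while text decoded with the wrong charset is damaged at decode time.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Stand-alone sketch: decoding bytes into a Java String and re-encoding them. */
public class EncodingRoundTripSketch {

    /** Decodes raw page bytes with the charset the page declares. */
    public static String decode(byte[] rawBytes, Charset declared) {
        return new String(rawBytes, declared);
    }

    public static void main(String[] args) {
        // "Luis" with an accented i, as an ISO-8859-1 page would send it (single byte 0xED).
        byte[] raw = {'L', 'u', (byte) 0xED, 's'};

        String correct = decode(raw, StandardCharsets.ISO_8859_1);
        String wrong = decode(raw, StandardCharsets.UTF_8);

        // Correctly decoded text survives a trip through UTF-8 storage.
        byte[] stored = correct.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(stored, StandardCharsets.UTF_8).equals(correct));

        // Re-encoding with the original charset restores the original bytes.
        System.out.println(Arrays.equals(correct.getBytes(StandardCharsets.ISO_8859_1), raw));

        // With the wrong charset the byte became U+FFFD; the original value is gone.
        System.out.println(wrong.indexOf('\uFFFD') >= 0);
    }
}
```

All three checks print `true`. For storage in a database this suggests keeping either the decoded text together with the name of the original charset, or the raw bytes themselves if a byte-exact copy is required.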
myer wrote:
>Hello dear users and developers,
>
> currently I write my bachelor's thesis, where I use the
> functionality of HTML Parser. In my program I need almost the same
> result as SiteCapturer does. So I've started to learn how it works
> and change it for my project. But some moments are not fully clear
> to me.
>
> 01) How does HTML Parser obtain source code of a web page before
> parsing? In the following I will speak about the SiteCapturer
> example. Does it start with a 'null' filter to get all the nodes of a
> web page for the very first time? And only then applies other
> filters indicated by user. Or it parses 'on the fly': gets the first
> node of a source, compares with node filter, and only if it
> successfully passes the filter check saves it into a data structure.
> Say, node list. What I need to do, is to get the whole 'untouched'
> source code of a web page before parsing. Should I go the way
> mentioned in this thread
> http://sourceforge.net/forum/message.php?msg_id=3005740
> or there are any other more intelligent solutions? Perhaps there
> exist any already implemented method? Something like
> page.getSource()? How does the SiteCapturer solve this problem?
>
> 02) Is the source code of a web page normalized anyhow before the
> actual parsing? Are there any attempts made to supply a parser with
> a validated HTML source? Or is it better to use products of other
> developers, e.g. JTidy?
>
> 03) Also I would like to save the source code of a web page in its
> original encoding or in Unicode. I do not want to lose any
> international character of the source. I need to save the source of a
> page into the database and be able to obtain it in its original form
> if necessary. Does HTML Parser support source code conversion into
> Unicode?
>
>
>
|
|
From: myer <my...@o2...> - 2006-02-08 11:12:49
|
Hello dear users and developers,
I am currently writing my bachelor's thesis, where I use the
functionality of HTML Parser. In my program I need almost the same
result as SiteCapturer produces, so I've started to learn how it works
and to change it for my project. But some points are not fully clear
to me.
01) How does HTML Parser obtain source code of a web page before
parsing? In the following I will speak about the SiteCapturer
example. Does it start with a 'null' filter to get all the nodes of a
web page for the very first time? And only then applies other
filters indicated by user. Or it parses 'on the fly': gets the first
node of a source, compares with node filter, and only if it
successfully passes the filter check saves it into a data structure.
Say, node list. What I need to do, is to get the whole 'untouched'
source code of a web page before parsing. Should I go the way
mentioned in this thread
http://sourceforge.net/forum/message.php?msg_id=3005740
or there are any other more intelligent solutions? Perhaps there
exist any already implemented method? Something like
page.getSource()? How does the SiteCapturer solve this problem?
02) Is the source code of a web page normalized anyhow before the
actual parsing? Are there any attempts made to supply a parser with
a validated HTML source? Or is it better to use products of other
developers, e.g. JTidy?
03) Also I would like to save the source code of a web page in its
original encoding or in Unicode. I do not want to lose any
international character of the source. I need to save the source of a
page into the database and be able to obtain it in its original form
if necessary. Does HTML Parser support source code conversion into
Unicode?
--
Best regards, Myer
|
|
From: Derrick O. <Der...@Ro...> - 2006-02-07 18:34:13
|
Tags are omitted because heuristically the tighter rule that assumes all
tags are composite tags fails to parse correctly because of bad HTML out
in the wild.
You are welcome to try replacing the default tag (see
PrototypicalNodeFactory.setTagPrototype()) with a composite tag that
ends with a matching slash name,
but my guess is it will parse very poorly.
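The difference between a tag the parser treats as composite and one it does not can be pictured with a toy outline builder. Everything here is an assumption made for illustration: `CompositeNestingSketch` is a hypothetical class working on pre-split tokens without attributes, and it does not reproduce HTMLParser's scanners or heuristics. It only shows why an unregistered `ADDRESS` appears as flat siblings while a registered one owns its text and end tag.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy outline builder: only tags named in compositeNames take children (not HTMLParser code). */
public class CompositeNestingSketch {

    /** Renders a flat token stream as an outline, indenting the children of composite tags. */
    public static String outline(List<String> tokens, Set<String> compositeNames) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();
        for (String token : tokens) {
            // Every node is printed at the depth of the innermost open composite.
            for (int i = 0; i < open.size(); i++) {
                out.append("  ");
            }
            out.append(token).append('\n');

            if (token.startsWith("</")) {
                // A matching end tag is the last child of its composite and closes it.
                if (nameOf(token, 2).equals(open.peek())) {
                    open.pop();
                }
            } else if (token.startsWith("<")) {
                String name = nameOf(token, 1);
                if (compositeNames.contains(name)) {
                    open.push(name);
                }
            }
        }
        return out.toString();
    }

    /** Extracts the uppercase name from "<name>" or "</name>"; attributes are not handled. */
    private static String nameOf(String token, int prefixLength) {
        return token.substring(prefixLength, token.length() - 1).toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("<address>", "My name", "</address>");
        System.out.print(outline(tokens, Collections.<String>emptySet()));
        System.out.print(outline(tokens, new HashSet<>(Arrays.asList("ADDRESS"))));
    }
}
```

With an empty set the three tokens come out at the same level; once `ADDRESS` is in the set, the text and the end tag are indented under the start tag. The toy has no recovery for missing or misplaced end tags, which is the kind of bad HTML the message above warns about.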
加藤 千典 wrote:
>Hi, all.
>
>I notice that correct way.
>
>I created a AddressTag.java that is almost copy of ParagraphTag.java
>
>And add same code like this.
>
> PrototypicalNodeFactory factory = new PrototypicalNodeFactory();
> factory.registerTag (new AddressTag());
> parser.setNodeFactory (factory);
>
>It's ok, but Should I create another alot of HTML tag classes ?
>
>I think that there are almost Html Tag classes already.
>How can I get ?
>
>Thank you, all.
>
>
>
>>Hi, all.
>>
>>I parsed a html, and create a dom , using
>>HTMLParser Version 1.6 (Integration Build Nov 12, 2005)
>>
>>The "P" tag has "P" END TAG as child.
>>(It's is same at "HEAD", "TITLE", "BODY", etc...)
>>
>>The othe hand, there are 2 "ADDRESS" Tag ("ADDRESS" and "/ADDRESS")
>>on the same level in dom.
>>(It's the same thing at "CENTER" tag.)
>>
>>I expected that ADDRESS tag become like "P" tag, but not.
>>
>>Why the reason ?
>>
>>How can I that the paser recognize ADDRESS tag as a single
>>CompositeTag.
>>
>>Thank you, all. Sorry my poor english.
>>
>>
>>
>
>
>
>
>
>
|
|
From: <ka...@ex...> - 2006-02-07 07:14:24
|
Hi, all.
I found the correct way.
I created an AddressTag.java that is almost a copy of ParagraphTag.java,
and added code like this:
PrototypicalNodeFactory factory = new PrototypicalNodeFactory();
factory.registerTag (new AddressTag());
parser.setNodeFactory (factory);
That works, but should I create a lot of other HTML tag classes the same way?
I think classes already exist for almost all HTML tags.
How can I get them?
Thank you, all.
> Hi, all.
>
> I parsed a html, and create a dom , using
> HTMLParser Version 1.6 (Integration Build Nov 12, 2005)
>
> The "P" tag has "P" END TAG as child.
> (It's is same at "HEAD", "TITLE", "BODY", etc...)
>
> The othe hand, there are 2 "ADDRESS" Tag ("ADDRESS" and "/ADDRESS")
> on the same level in dom.
> (It's the same thing at "CENTER" tag.)
>
> I expected that ADDRESS tag become like "P" tag, but not.
>
> Why the reason ?
>
> How can I that the paser recognize ADDRESS tag as a single
> CompositeTag.
>
> Thank you, all. Sorry my poor english.
>
|
|
From: <ka...@ex...> - 2006-02-07 06:46:36
|
Hi, all.
I parsed an HTML file and created a DOM, using
HTMLParser Version 1.6 (Integration Build Nov 12, 2005).
The "P" tag has the "P" end tag as a child.
(It is the same for "HEAD", "TITLE", "BODY", etc.)
On the other hand, there are two "ADDRESS" tags ("ADDRESS" and "/ADDRESS")
on the same level in the DOM.
(It is the same for the "CENTER" tag.)
I expected the ADDRESS tag to behave like the "P" tag, but it does not.
What is the reason?
How can I make the parser recognize the ADDRESS tag as a single
CompositeTag?
Thank you, all. Sorry for my poor English.
---------code-----------
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class SampleHTMLParserJ {
    /**
     * HTMLParser sample
     *
     * @param args
     */
    public static void main(String[] args) {
        try {
            Parser parser = new Parser(
                    "file:///D:/data/test03.html");
            NodeList list = parser.parse(null);
            Node node = list.elementAt(0);
            System.out.println(node);
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}
---------stdout-----------
Tag (0[0,0],57[0,57]): Html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ja"
Txt (57[0,57],60[1,1]): \n
Tag (60[1,1],66[1,7]): head
Txt (66[1,7],70[2,2]): \n
Tag (70[2,2],77[2,9]): title
Txt (77[2,9],88[2,20]): title title
End (88[2,20],96[2,28]): /title
Txt (96[2,28],99[3,1]): \n
End (99[3,1],106[3,8]): /head
Txt (106[3,8],109[4,1]): \n
Tag (109[4,1],115[4,7]): body
Txt (115[4,7],121[6,2]): \n\n
Tag (121[6,2],130[6,11]): address
Txt (130[6,11],137[6,18]): My name
End (137[6,18],147[6,28]): /address
Txt (147[6,28],151[7,2]): \n
Tag (151[7,2],159[7,10]): CENTER
Txt (159[7,10],165[7,16]): CENTER
End (165[7,16],174[7,25]): /CENTER
Txt (174[7,25],178[8,2]): \n
Tag (178[8,2],181[8,5]): p
Tag (181[8,5],220[8,44]): img src="welcome.gif" alt="welcome" /
End (220[8,44],224[8,48]): /p
Txt (224[8,48],230[10,2]): \n\n
Tag (230[10,2],234[10,6]): h1
Txt (234[10,6],238[10,10]): main
End (238[10,10],243[10,15]): /h1
Txt (243[10,15],247[11,2]): \n
Tag (247[11,2],253[11,8]): hr /
Txt (253[11,8],256[12,1]): \n
End (256[12,1],263[12,8]): /body
Txt (263[12,8],265[13,0]): \n
End (265[13,0],272[13,7]): /html
---------html-----------
<Html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ja">
<head>
<title>title title</title>
</head>
<body>
<address>My name</address>
<CENTER>CENTER</CENTER>
<p><img src="welcome.gif" alt="welcome" /></p>
<h1>main</h1>
<hr />
</body>
</html>
------------------
|
|
From: Derrick O. <Der...@Ro...> - 2006-02-05 13:09:28
|
The parser doesn't really deal in lines of text, since most HTML disregards line breaks (the <pre> tag is the only exception I can think of).

What you probably want is subsequent nodes. For this, use the children of the parent of the node you have. Some methods were recently added on AbstractNode (which TextNode inherits from) to handle this:

getPreviousSibling() and getNextSibling()

These are only available in the latest Integration Build.

If you really want lines of text, the Page object, available from the parser, can be asked to fetch a line with getLine(). This method has two overloads: one takes a cursor argument, the other an integer position. The position is available from the node you have with getStartPosition() or getEndPosition(). That gets you the contents of the line in the HTML stream for the node you have.

Subsequent lines are a little tougher to get hold of. The line information is held in a PageIndex object, which the Page doesn't expose. But it could if you added a method. If you had one of those, you could step through the lines of the file.

Derrick

quanta veloce wrote:
> Hi,
>
> Can HTMLParser allow one to extract into an array lines before or
> after a search string?
>
> For instance:
>
> <CENTER>
> <TABLE ALIGN="CENTER" BORDER=5>
> <TR>
> <TD width=150 align=center><B>Area</B></TD>
> <TD width=120 align=center><B>Instantaneous Load</B></TD>
> </TR>
> <TR>
> <TD>PJM MID ATLANTIC REGION</TD>
> <TD align=right>33929</TD>
> </TR>
> <TR>
> <TD>PJM WESTERN REGION</TD>
> <TD align=right>39400</TD>
> </TR>
> <TR>
> <TD>PJM SOUTHERN REGION</TD>
> <TD align=right>9857</TD>
> </TR>
> <TR>
> <TD>PJM RTO</TD>
> <TD align=right>83186</TD>
> </TR>
> </TABLE>
> </CENTER>
> <P><CENTER>Loads are calculated from raw telemetry data and are
> approximate.</CENTER>
> <CENTER>The displayed values are NOT official PJM Loads.</CENTER>
> <BR><BR><BR>
> <P><CENTER><H2>Current PJM Transmission Limits</H2></CENTER>
> <P align=center>None
>
> </BODY>
> </HTML>
>
> In the following URL I matched the string "Current PJM Transmission
> Limits" and I want to obtain any and all lines after this match...or
> even the next 3 lines, etc.,
>
> Any help would be appreciated!
> Thanks,
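The idea of stepping through lines from a character position can be sketched with a small index of line-start offsets. This is a stand-alone illustration under stated assumptions: `LineIndexSketch` is a hypothetical class over a plain `String`, it is not the library's `Page` or `PageIndex`, and it only treats `\n` as a line break.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stand-alone sketch of a line index over page text (not HTMLParser code). */
public class LineIndexSketch {
    private final String text;
    private final int[] lineStarts; // offset of the first character of each line

    public LineIndexSketch(String text) {
        this.text = text;
        List<Integer> starts = new ArrayList<>();
        starts.add(0);
        // A newline that is the very last character does not start another line.
        for (int i = 0; i + 1 < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                starts.add(i + 1);
            }
        }
        lineStarts = new int[starts.size()];
        for (int i = 0; i < lineStarts.length; i++) {
            lineStarts[i] = starts.get(i);
        }
    }

    /** Zero-based number of the line that contains the character offset. */
    public int lineOf(int position) {
        int found = Arrays.binarySearch(lineStarts, position);
        // A miss returns -(insertionPoint) - 1; the containing line is insertionPoint - 1.
        return found >= 0 ? found : -found - 2;
    }

    /** Text of the given line without its trailing newline. */
    public String line(int number) {
        int start = lineStarts[number];
        int end = number + 1 < lineStarts.length ? lineStarts[number + 1] : text.length();
        if (end > start && text.charAt(end - 1) == '\n') {
            end--;
        }
        return text.substring(start, end);
    }

    /** Up to count lines following the line that contains position. */
    public List<String> linesAfter(int position, int count) {
        List<String> result = new ArrayList<>();
        int first = lineOf(position) + 1;
        for (int n = first; n < lineStarts.length && n < first + count; n++) {
            result.add(line(n));
        }
        return result;
    }

    public static void main(String[] args) {
        String page = "<P><CENTER><H2>Current PJM Transmission Limits</H2></CENTER>\n"
                + "<P align=center>None\n"
                + "\n"
                + "</BODY>\n"
                + "</HTML>\n";
        LineIndexSketch index = new LineIndexSketch(page);
        int match = page.indexOf("Current PJM Transmission Limits");
        System.out.println(index.linesAfter(match, 3));
    }
}
```

For the sample text, the three lines after the match are `<P align=center>None`, an empty line, and `</BODY>`. In the library itself the starting offset would come from the matched node's start or end position, as the reply above describes.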
|
From: quanta v. <qua...@ya...> - 2006-02-04 00:39:45
|
Hi,

Can HTMLParser allow one to extract into an array lines before or after a search string?

For instance:

<CENTER>
<TABLE ALIGN="CENTER" BORDER=5>
<TR>
<TD width=150 align=center><B>Area</B></TD>
<TD width=120 align=center><B>Instantaneous Load</B></TD>
</TR>
<TR>
<TD>PJM MID ATLANTIC REGION</TD>
<TD align=right>33929</TD>
</TR>
<TR>
<TD>PJM WESTERN REGION</TD>
<TD align=right>39400</TD>
</TR>
<TR>
<TD>PJM SOUTHERN REGION</TD>
<TD align=right>9857</TD>
</TR>
<TR>
<TD>PJM RTO</TD>
<TD align=right>83186</TD>
</TR>
</TABLE>
</CENTER>
<P><CENTER>Loads are calculated from raw telemetry data and are approximate.</CENTER>
<CENTER>The displayed values are NOT official PJM Loads.</CENTER>
<BR><BR><BR>
<P><CENTER><H2>Current PJM Transmission Limits</H2></CENTER>
<P align=center>None

</BODY>
</HTML>

In the following URL I matched the string "Current PJM Transmission Limits" and I want to obtain any and all lines after this match...or even the next 3 lines, etc.,

Any help would be appreciated!
Thanks,
|
|
From: Derrick O. <Der...@Ro...> - 2006-02-03 13:13:01
|
Java uses Unicode. It stores characters in UTF-16 internally, i.e. a char is 16 bits, and a String is a sequence of 16-bit values encoding Unicode in UTF-16.

Character entity conversion is a way for HTML documents to contain Unicode characters outside their current encoding, and also to avoid the reserved characters that HTML syntax is based on, like the left angle bracket (<) and the ampersand (&). These must be converted to Unicode to extract the semantic meaning of the page.

So your question is, "Is there a Java program that uses something besides the String type to store Unicode when parsing HTML?" I don't think so.

You might want to look at the Translate class in the util package to see if it does what you want.

Jan wrote:
> Dear Experts and Users,
>
> Could anyone say for sure whether htmlparser is capable of html tag
> stripping and html entity conversion, but without Unicode-conversion,
> or not?
>
> If not, what Java-tool could I use?
>
> Thanks,
>
> Jan
|
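The point about UTF-16 storage can be checked with a few lines of plain Java. This is a small sketch, not HTMLParser code; the class name Utf16Demo and its helper methods are hypothetical. It shows that a String counts 16-bit units rather than characters, and that a byte encoding only comes into play when the String is written out.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: inspects how a Java String stores Unicode text.
class Utf16Demo {

    /** Number of 16-bit char units the string occupies. */
    static int charUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Number of bytes produced when the string is encoded as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // '<', one CJK character (U+94FE), and one character outside the
        // Basic Multilingual Plane (U+1F600), which needs a surrogate pair.
        String s = "<\u94FE\uD83D\uDE00";
        System.out.println("char units:  " + charUnits(s));
        System.out.println("code points: " + codePoints(s));
        System.out.println("UTF-8 bytes: " + utf8Bytes(s));
    }
}
```

Whatever encoding the page arrived in, the text is already Unicode once it is in a String; the original byte sequence can only be reproduced by encoding the String again with the same charset.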
|
From: Jan <jan...@gm...> - 2006-02-03 05:48:12
|
Dear Experts and Users,

Could anyone say for sure whether htmlparser is capable of HTML tag stripping and HTML entity conversion, but without Unicode conversion?

If not, what Java tool could I use?

Thanks,

Jan
|
|
From: Riaz u. <ru...@ya...> - 2006-02-03 02:04:26
|
Hi,

I was away for a long time. Anyway, here is the program I wrote. I know very little programming, so please bear with me. Is there a better way to read news from Yahoo? I am sure there is. One more thing: the program prints the HTML character entity for the dollar sign instead of the $ symbol itself. How do I overcome this? Any help is appreciated. The program follows:

import java.io.*;
import java.net.*;
import java.net.URL;
import org.htmlparser.*;
import org.htmlparser.util.*;
import org.htmlparser.Parser;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.tags.Span;
import org.htmlparser.tags.FormTag;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.tags.StyleTag;
import org.htmlparser.tags.ScriptTag;
import org.htmlparser.tags.ParagraphTag;
import org.htmlparser.tags.CompositeTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.nodes.TagNode;
import org.htmlparser.nodes.TextNode;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.filters.LinkStringFilter;

/**
 * Extract plaintext strings from a web page.
 * Illustrative program to gather the textual contents of a web page.
 * Uses a {@link org.htmlparser.beans.StringBean StringBean} to accumulate
 * the user visible text (what a browser would display) into a single string.
 * Step 1. Parse the page.
 * Step 2. Collect the HTML tags in the page as nodes in a list.
 * Step 3. Keep only the SPAN tags in the list.
 *         Links are continuously updated at yahoo page, they are in between the SPAN tag with
 *         'recenttimedate' attribute.
 * Step 4.
 */
public class StringExtract
{
    public static void main(String args[])
    {
        try
        {
            int i=0,j=0,k=0;
            boolean endOfnewsinthisPage = false;
            String sourceURL = args[0]; //sourceURL is the argument to read news from

            // Step 1. Parsing the input page.
            Parser parser = new Parser (sourceURL);//parser will hold the tree of the url
            NodeList li_tags = new NodeList();

            // Step 2. Collecting Tags in a list.
            NodeList list = parser.parse (null);

            //news links are at the span tag (time), spanList stores the span tags
            // Step 3. Keep only the SPAN tags in spanList.
            NodeList spanList = list.extractAllNodesThatMatch(new TagNameFilter ("SPAN"),true);

            // Step 4. Extract link from each span tag.
            while(i < spanList.size())
            {
                Span spanTag = (Span)spanList.elementAt(i);
                // System.out.println(spanTag.getText());
                // We only need SPAN tags with attribute "class = 'recenttimedate'"
                // Move to the link in the span tag
                if(spanTag.getText().equals("span class=recenttimedate"))
                {
                    li_tags.add(spanList.elementAt(i).getParent());
                }
                i++;
            }
            i=0;
            NodeList a_tags = new NodeList();
            NodeFilter filter = new TagNameFilter ("P");
            LinkTag validLink = new LinkTag();
            CompositeTag comptag = new CompositeTag();
            String linkTag = "http";
            LinkStringFilter linkTagFilter= new LinkStringFilter(linkTag);
            // There are http links and also other links, a_tags will contain only http links
            for( NodeIterator e = li_tags.elements(); e.hasMoreNodes();)
            {
                e.nextNode().collectInto(a_tags, linkTagFilter);
            }
            // BufferedWriter out = new BufferedWriter(new FileWriter("output.txt", true));
            // while( i < a_tags.size())
            // {
            LinkTag linkAtag = (LinkTag)a_tags.elementAt(0);
            //Extract link from each a_tags element
            String interestingLink = linkAtag.extractLink();
            boolean exists = false;
            j=0;
            // In Yahoo, there are few http links which lead to images, we dont need them, the following loop
            // filters out those links.
            while(j < interestingLink.length() && (!exists))
            {
                exists = interestingLink.substring(j).startsWith("photos");
                j++;
            }

            // Step 5. Parse Each link that was collected.
            if((linkAtag.isHTTPLink()) && (!linkAtag.getLinkText().equals("")) && (!exists))
            {
                Parser parseIndividualURLs = new Parser(interestingLink);
                NodeList nodesLink = parseIndividualURLs.parse(null);
                String newString = new String();
                TextNode textNode = new TextNode(newString);
                for(NodeIterator x = nodesLink.elements(); x.hasMoreNodes();)
                {
                    Node cNode = x.nextNode();
                    if((cNode.getChildren() != null) && (!cNode.getText().equals("div")))
                        nodesLink.add(cNode.getChildren());
                }
                // One link is one HTML document. nodesLink is the list of all nodes under one document.
                for(k = 0; k < nodesLink.size(); k++)
                {
                    Node cNode = nodesLink.elementAt(k);
                    Node prevNode = null;
                    Node nextNode = null;
                    TagNode nextTagNode = null;
                    TagNode prevTagNode = null;
                    TagNode dNode = null;
                    // if(cNode instanceof LinkTag)
                    // {
                    //     LinkTag lnkTag = (LinkTag)cNode;
                    //     System.out.println(lnkTag.getLinkText());
                    // }
                    if(!((k-1) < 0))
                    {
                        prevNode = nodesLink.elementAt(k-1);
                        if(prevNode instanceof TagNode)
                            prevTagNode = (TagNode)nodesLink.elementAt(k-1);
                    }
                    // k+1 must be a valid index (the original test, "> size()", let k+1 == size() through)
                    if((k+1) < nodesLink.size())
                    {
                        nextNode = nodesLink.elementAt(k+1);
                        if(nextNode instanceof TagNode)
                            nextTagNode = (TagNode)nodesLink.elementAt(k+1);
                    }
                    TagNode tNode = (TagNode)cNode.getParent();
                    NodeList newList = new NodeList();
                    //Printing the title of the news
                    if(cNode.getText().equals("title"))
                    {
                        // out.write(cNode.toPlainTextString());
                        System.out.println(cNode.toPlainTextString());
                        System.out.println();
                    }
                    if(cNode.getText().startsWith("div"))
                    {
                        dNode = (TagNode)cNode;
                        if(dNode.getAttribute("class") != null)
                            if(dNode.getAttribute("class").equals("clearfix"))
                                k = nodesLink.size() + 1;
                    }
                    // prevNode is null for the very first node, so guard against it
                    if((cNode instanceof TextNode) && (prevNode != null))
                    {
                        // This 'if block' prints each paragraph of the news
                        if(prevNode.getText().equals("p"))
                        {
                            // nextNode is null for the very last node
                            if((nextNode == null) || !(nextNode.getText().startsWith("span")))
                            {
                                // out.write(cNode.toHtml().trim());
                                System.out.println(cNode.toHtml().trim()); // here
                            }
                            else if(nextNode instanceof TagNode)
                            {
                                if(nextTagNode.getAttribute("class") != null)
                                {
                                    if(!(nextTagNode.getAttribute("class").equals("clearfix")))
                                    {
                                        // out.write(cNode.toHtml().trim());
                                        System.out.println(cNode.toHtml().trim());
                                    }
                                    else
                                        k = nodesLink.size();
                                }
                            }
                        }
                        // This 'else if' block prints the first paragraph of the news (because the first paragraph
                        // is at a different place in the document).
                        else if(prevNode.getText().equals("p/"))
                        {
                            // out.write(cNode.toHtml().trim());
                            System.out.println(cNode.toHtml().trim());
                        }
                        // There are some words in the document where Yahoo provides search facility (for example, a person,
                        // a country etc.) and it is in the form of link. This block extracts text from those links.
                        else if(prevNode.getText().startsWith("span"))
                        {
                            newList.add(prevNode.getChildren());
                            for(NodeIterator x=newList.elements();x.hasMoreNodes();)
                            {
                                Node aNode = x.nextNode();
                                if(aNode instanceof TagNode)
                                {
                                    prevTagNode = (TagNode)aNode;
                                    if(prevTagNode.getAttribute("href") != null)
                                    {
                                        // out.write(aNode.toPlainTextString()+" "+cNode.toHtml().trim());
                                        System.out.println(aNode.toPlainTextString()+" "+cNode.toHtml().trim());
                                    }
                                }
                            }
                        }
                    }
                }
                // System.out.println("Link:"+linkAtag.extractLink()+":Text:" + linkAtag.getLinkText());
                // System.out.println();
                // System.out.println();
            }
            // i++;
            // }
            // out.close();
        }
        catch (Exception ex)
        {
            System.out.println("Printing Exceptional Error");
            ex.printStackTrace();
        }
    }
}
|
|
From: Derrick O. <Der...@Ro...> - 2006-02-01 23:18:59
|
The StringBean does a decode on the text; perhaps that is what you need:

    public void visitStringNode (Text string)
    {
        if (!mIsScript && !mIsStyle)
        {
            String text = string.getText ();
            if (!mIsPre)
            {
                text = Translate.decode (text);

HuangGehua wrote:
> I parse an HTML resource file that contains some Chinese words. When I use
> the TextExtractingVisitor.getExtractedText() method to get the text, the
> Chinese words display well. But if I get a TextNode and use the
> TextNode.getText() method to get the Chinese words, they are not displayed
> correctly.
>
> How can I make the TextNode.getText() method work correctly?
> Thank you!
>
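To illustrate what a decode step of this kind does, here is a minimal stand-in written in plain Java. It is not the library's Translate.decode implementation, only a sketch of the same idea: character references such as numeric entities are replaced by the characters they denote. The class name EntityDecodeSketch is hypothetical, and the table of named entities is deliberately tiny.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for an entity decoder; NOT HTMLParser's Translate class.
class EntityDecodeSketch {

    // Matches decimal (&#36;), hexadecimal (&#x24;) and named (&lt;) references.
    private static final Pattern ENTITY =
        Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z]+);");

    // Only a handful of named entities, for illustration.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("amp", "&");
        NAMED.put("quot", "\"");
    }

    /** Replaces recognised character references; leaves everything else untouched. */
    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement = m.group(); // default: keep the reference as written
            if (body.charAt(0) == '#') {
                boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
                try {
                    int cp = hex ? Integer.parseInt(body.substring(2), 16)
                                 : Integer.parseInt(body.substring(1));
                    if (Character.isValidCodePoint(cp)) {
                        replacement = new String(Character.toChars(cp));
                    }
                } catch (NumberFormatException e) {
                    // Number too large to be a code point: keep the original text.
                }
            } else if (NAMED.containsKey(body)) {
                replacement = NAMED.get(body);
            }
            // quoteReplacement matters: '$' and '\' are special in replacement strings.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("Price: &#36;5 &lt; &#36;10"));
    }
}
```

Text obtained directly from a node is the raw page text, with any character references still in place, which is why a decode pass is needed before display.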
|
|
From: HuangGehua <bo...@gm...> - 2006-01-31 11:42:56
|
I parse an HTML resource file that contains some Chinese words. When I use the TextExtractingVisitor.getExtractedText() method to get the text, the Chinese words display well. But if I get a TextNode and use the TextNode.getText() method to get the Chinese words, they are not displayed correctly. How can I make the TextNode.getText() method work correctly? Thank you! |
|
From: HuangGehua <bo...@gm...> - 2006-01-30 16:31:56
|
I want to parse an HTML file encoded in GB2312 or GBK and then write
an XML file encoded in UTF-8. I use JDOM to write the XML file. The
source HTML file has no <meta> tag to identify the charset, for
example:
========================
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
It will be read and overwritten.
Do Not Edit! -->
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
<DT><H3 FOLDED ADD_DATE="1120124714">链接</H3>
<DL><p>
<DT><A
HREF="http://www.microsoft.com/isapi/redir.dll?prd=ie&ar=windowsmedia">Windows
Media</A>
<DT><A
HREF="http://www.microsoft.com/isapi/redir.dll?prd=ie&ar=windows">Windows</A>
<DT><A
HREF="http://www.microsoft.com/isapi/redir.dll?prd=ie&ar=hotmail"> 免费
Hotmail</A>
<DT><A
HREF="http://www.microsoft.com/isapi/redir.dll?prd=ie&pver=6&ar=CLinks">
自定义链接</A>
</DL><p>
<DT><A HREF="http://www.yxcard.com/download.htm">..远兴科技..</A>
<DT><A
HREF="http://www.microsoft.com/isapi/redir.dll?prd=ie&pver=6&ar=IStart">MSN</A>
<DT><A HREF="http://www.yesure.com/storm/sort.php/1">暴风影音</A>
<DT><A
HREF="http://www.yesky.com/SoftChannel/72348977504190464/20050411/1934159.shtml">Eclipse
Yesky</A>
</DL><p>
=======================
the java source code is:
=============================================
package html;

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import java.io.Writer;
import java.util.List;
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.nodes.TextNode;
import org.htmlparser.tags.DefinitionList;
import org.htmlparser.tags.DefinitionListBullet;
import org.htmlparser.tags.HeadingTag;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.util.SimpleNodeIterator;
import org.htmlparser.visitors.TagFindingVisitor;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.output.Format;
import org.jdom.output.XMLOutputter;

public class ChangeHtml2XML {
    private String htmlPath="d:/bookmark.htm";
    private String xmlPath="d:/toXML.xml";

    public Document getFirstMark() throws ParserException{
        Parser parser=new Parser(htmlPath);
        parser.setEncoding("GB2312");
        String [] tagsToBeFound = {"DL"};
        TagFindingVisitor visitor = new TagFindingVisitor(tagsToBeFound);
        parser.visitAllNodesWith(visitor);
        Node [] nodes=visitor.getTags(0);
        DefinitionList dl=(DefinitionList)nodes[0];
        Element rootElement=new Element("favorite");
        Document userDocument=new Document(rootElement);
        visitEachAndBuild(userDocument,rootElement,dl);
        System.out.println(parser.getEncoding());
        return userDocument;
    }

    public void visitEachAndBuild(Document document,Element parentElement,DefinitionList parentDL){
        SimpleNodeIterator iteratorParentDlChildren=parentDL.children();
        while(iteratorParentDlChildren.hasMoreNodes()){
            Node node=iteratorParentDlChildren.nextNode();
            if (node.getClass().getName().equals(DefinitionListBullet.class.getName())){
                DefinitionListBullet dt=(DefinitionListBullet)node;
                Node justNode=dt.getChild(0);
                if (justNode.getClass().getName().equals(HeadingTag.class.getName())){
                    TextNode tn=(TextNode)dt.getChild(1);
                    Element newElement=new Element("folder");
                    newElement.setAttribute("label",tn.getText());
                    System.out.println(tn.getText());
                    parentElement.addContent(newElement);
                    DefinitionList findTheDL=null;
                    SimpleNodeIterator forChildDefinitionList=dt.getChildren().elements();
                    while(forChildDefinitionList.hasMoreNodes()){
                        Node n=forChildDefinitionList.nextNode();
                        if (n.getClass().getName().equals(DefinitionList.class.getName())){
                            findTheDL=(DefinitionList)n;
                            break;
                        }
                    }
                    if (findTheDL!=null)
                        visitEachAndBuild(document,newElement,findTheDL);
                }else{
                    TextNode tn=(TextNode)dt.getChild(1);
                    LinkTag link=(LinkTag)dt.getChild(0);
                    Element newElement=new Element("address");
                    newElement.setAttribute("lable",tn.getText());
                    System.out.println(tn.getText());
                    newElement.setAttribute("url",link.getLink());
                    newElement.setAttribute("target","blank");
                    parentElement.addContent(newElement);
                }
            }
        }
    }

    public void saveDocument(Document doc){
        StringBuffer buff = new StringBuffer();
        buff.append(xmlPath);
        try {
            XMLOutputter outputter = new XMLOutputter(Format.getPrettyFormat());
            Format format=outputter.getFormat();
            format.setEncoding("UTF-8");
            format.setExpandEmptyElements(true);
            outputter.setFormat(format);
            FileOutputStream fos=new FileOutputStream(buff.toString());
            Writer output=new OutputStreamWriter(fos,"UTF-8");
            outputter.output(doc, output);
            output.close();
            //return true;
        } catch (java.io.IOException e) {
            System.out.println("cant write to file system");
            //throw new Exception(e);
        }
    }
}
===========================
The resulting XML file does not display the Chinese words correctly. It
looks like this: "
"
What am I doing wrong? By the way, how can a file's charset be detected
without a meta tag?
Any suggestion is welcome.
Thank you!
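On the side question of guessing a charset when no meta tag is present, one common heuristic is to test whether the bytes are well-formed UTF-8 and otherwise fall back to a charset the caller expects. The sketch below is plain Java and makes no claim about how HTMLParser itself detects encodings; the class name CharsetGuess and its methods are hypothetical. The heuristic can be wrong: pure ASCII input is valid under both interpretations, and short legacy-encoded input can occasionally happen to be valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: guesses between UTF-8 and a caller-supplied fallback.
class CharsetGuess {

    /** Returns true if data is well-formed UTF-8. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Decodes as UTF-8 when the bytes are valid UTF-8, else with fallback. */
    static String decode(byte[] data, Charset fallback) {
        Charset chosen = isValidUtf8(data) ? StandardCharsets.UTF_8 : fallback;
        return new String(data, chosen);
    }

    /** Re-encodes the guessed text as UTF-8 bytes, ready to be written out. */
    static byte[] toUtf8(byte[] data, Charset fallback) {
        return decode(data, fallback).getBytes(StandardCharsets.UTF_8);
    }
}
```

For a bookmark file like the one above, the fallback would presumably be obtained with Charset.forName("GBK"), assuming the running JVM ships that charset; once the text is decoded into a String, writing it through a UTF-8 writer performs the conversion.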
|
|
From: Jan <jan...@gm...> - 2006-01-28 15:26:39
|
Dear Derrick,

Thank you very much for the quick reply.

I wrote the string into a file, and the file contains the question marks.

I would like to have the original HTML text, but without the HTML tags and HTML entities. Any conversion to Unicode is undesirable for my problem. (I would like to use the plain text for language/encoding identification.)

If htmlparser does not fit my problem, could you recommend something?

Thank you!

Jan

On 1/28/06, Derrick Oswald <Der...@ro...> wrote:
>
> Jan,
>
> In general, a lot of care has been taken to ensure that the correct
> character set (according to the web page meta data) is being used.
> The appearance of question marks may be just a function of the
> System.out.println() that it's doing.
> Have you tried examining the errant characters in a debugger or writing
> the strings returned from the StringBean (used by the stringextractor
> command) to a PrintWriter with an encoding that can handle those
> characters?
>
> Derrick
>
> Jan wrote:
>
> > Dear Members!
> >
> > Is it possible using htmlparser to extract plain text in original
> > encoding/charset?
> >
> > I tried the sample stringextractor.cmd.
> > It worked nicely, but non-common characters are replaced with question
> > marks (?). I would like to keep the original byte sequence.
> >
> > Thanks,
> >
> > Jan
|
|
From: Derrick O. <Der...@Ro...> - 2006-01-28 14:42:27
|
Jan,

In general, a lot of care has been taken to ensure that the correct character set (according to the web page meta data) is being used. The appearance of question marks may be just a function of the System.out.println() that it's doing.

Have you tried examining the errant characters in a debugger or writing the strings returned from the StringBean (used by the stringextractor command) to a PrintWriter with an encoding that can handle those characters?

Derrick

Jan wrote:
> Dear Members!
>
> Is it possible using htmlparser to extract plain text in original
> encoding/charset?
>
> I tried the sample stringextractor.cmd.
> It worked nicely, but non-common characters are replaced with question
> marks (?). I would like to keep the original byte sequence.
>
> Thanks,
>
> Jan
|
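The effect described here can be reproduced without the parser. The following sketch, in plain Java with a hypothetical class name QuestionMarks, writes the same String through a PrintWriter twice: once with an encoding that cannot represent one of the characters, and once with UTF-8. It assumes nothing about what stringextractor does internally.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: shows where '?' substitution happens on output.
class QuestionMarks {

    /** Writes text through a PrintWriter using the given charset and returns the bytes. */
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintWriter out = new PrintWriter(new OutputStreamWriter(sink, charset));
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u20AC5"; // euro sign followed by the digit 5
        byte[] ascii = write(text, StandardCharsets.US_ASCII);
        byte[] utf8 = write(text, StandardCharsets.UTF_8);
        // The String was intact in both cases; only the ASCII output lost the euro sign.
        System.out.println(new String(ascii, StandardCharsets.US_ASCII));
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(text));
    }
}
```

If the question marks disappear when the output writer uses a charset that covers the page's characters, the extraction itself was correct and only the output encoding was at fault.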
|
From: Jan <jan...@gm...> - 2006-01-26 06:43:38
|
Dear Members!

Is it possible using htmlparser to extract plain text in original encoding/charset?

I tried the sample stringextractor.cmd. It worked nicely, but non-common characters are replaced with question marks (?). I would like to keep the original byte sequence.

Thanks,

Jan
|