htmlparser-user Mailing List for HTML Parser (Page 42)
Brought to you by:
derrickoswald
From: <lui...@gm...> - 2006-02-23 03:03:32
Hello,

I cannot migrate all my work to the C#/.NET platform, although HTML parsing is a core functionality of my project.

I'm coding a crawler to feed our natural language research group with corpus from the web. Currently I'm still evaluating options for the HTML parsing module. I have developed my own HTML scanner based on Java regexps, but it is too difficult to maintain and extend (after all, it could be a project by itself).

My needs are far beyond simple link extraction/modification. I must handle every single tag that may reference an external resource (and that includes IFrame). This includes parsing embedded CSS imports. Embedded Javascript is still a problem...

Anyway, the BIG question is: is this project alive?

I know it is an open source project that is supported by people's free will, and I find that _very_ _meritorious_. I'm asking this question because I will make a decision now.

I would still appreciate some feedback on the subject of this thread (the original post follows).

Luís

On Feb 15, 2006, at 4:30 PM, Third Eye wrote:
> Hi!
> We did implement IFrameTag and named the class IFrameTag. Our
> implementation is a .Net port of this library and we have added some of
> our own enhancements.
> If you are interested, you can download it from
>
> http://www.netomatix.com
>
> Naveen
>
> On 2/15/06, Luís Manuel dos Santos Gomes <lui...@gm...> wrote:
>> Hi everybody.
>>
>> This is my first post to this list.
>> I'm replacing my own html processing code (regex based) with HTMLParser.
>> The examples have been a great help!
>>
>> I need to handle IFRAME and LINK tags. The link tag is often used to
>> include external CSS.
>> The name "LinkTag" has already been taken for the anchor tags! How
>> should I name the class to handle the LINK tags?
>> Has anybody implemented the IframeTag and the "TrueLinkTag" classes?
>> I could do this and would be glad to contribute it to the project.
>> I'm using version 20051112. I've not checked out from CVS because
>> I need a stable package.
>>
>> Cheers!
>>
>> Luís Gomes
>> (from Portugal)
>>
>> -------------------------------------------------------
>> This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
>> for problems? Stop! Download the new AJAX search engine that makes
>> searching your log files as easy as surfing the web. DOWNLOAD SPLUNK!
>> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=103432&bid=230486&dat=121642
>> _______________________________________________
>> Htmlparser-user mailing list
>> Htm...@li...
>> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
>
> --
> Naveen K Kohli
> http://www.netomatix.com
From: Derrick O. <Der...@Ro...> - 2006-02-18 20:06:24
Ryan,

There is no setter for the ids. You would instead subclass it as you did and override getIds() and getEnders() to return both names -- it only considers uppercase, BTW. Then when you register your tag with the PrototypicalNodeFactory, objects of your class will be returned for both <IMG and <IMAGE tags, since the ImageTag registration will be overwritten.

Derrick

Ryan Smith wrote:
> I have noticed that some HTML authors/HTML editors are using <image
> src="... tags instead of the standard <IMG src="... tags.
> I extended the ImageTag class to create a MyImageTag class, but that is
> ugly. Is there a convenient way to add strings to the ImageTag's mIds array?
> I would like to add "IMAGE", "Image", and "image" to the list of valid
> ImageTag identifiers. Is there a setter or some other way I can add tag ids?
> Thanks a lot.
>
> -Ryan J. Smith
> Live Data Group
> Software Developer
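Derrick's answer describes how PrototypicalNodeFactory keys tag prototypes by their uppercase id strings. The following self-contained sketch mimics that mechanism; the Factory and Tag classes here are stand-ins invented for illustration, not the real org.htmlparser API.

```java
import java.util.HashMap;
import java.util.Map;

public class TagRegistryDemo {
    // Stand-in for a tag prototype: just carries the id strings it answers to,
    // as an overridden getIds() would.
    static class Tag {
        final String[] ids;
        Tag(String... ids) { this.ids = ids; }
    }

    // What registerTag() does, in miniature: file the same prototype under
    // every id it reports, uppercased (the parser looks tags up by their
    // uppercase name only).
    static class Factory {
        final Map<String, Tag> prototypes = new HashMap<>();
        void register(Tag tag) {
            for (String id : tag.ids)
                prototypes.put(id.toUpperCase(), tag);
        }
        Tag lookup(String name) {
            return prototypes.get(name.toUpperCase());
        }
    }

    public static void main(String[] args) {
        Factory factory = new Factory();
        // An ImageTag subclass whose getIds() returned {"IMG", "IMAGE"}
        // would register like this, replacing the stock IMG registration:
        Tag image = new Tag("IMG", "IMAGE");
        factory.register(image);
        assert factory.lookup("img") == factory.lookup("image");
        System.out.println("both <img> and <image> map to the same prototype");
    }
}
```

This is also why Ryan does not need "IMAGE", "Image", and "image" separately: the uppercase lookup makes registration case-insensitive.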
From: Ryan S. <rs...@li...> - 2006-02-17 19:18:41
I have noticed that some HTML authors/HTML editors are using <image src="... tags instead of the standard <IMG src="... tags. I extended the ImageTag class to create a MyImageTag class, but that is ugly. Is there a convenient way to add strings to the ImageTag's mIds array? I would like to add "IMAGE", "Image", and "image" to the list of valid ImageTag identifiers. Is there a setter or some other way I can add tag ids? Thanks a lot.

-Ryan J. Smith
Live Data Group
Software Developer
From: Third E. <nav...@gm...> - 2006-02-15 16:31:33
Here is a sample for testing it out. This sample is in C#/.Net but you should be able to adapt it to Java code quickly.

    static void TestLinkRegExFilterForPage(String strUrl)
    {
        Parser obParser = new Parser(new System.Uri(strUrl));
        String strPatterns = "services*";
        NodeFilter obLinkRegExFilter = new LinkRegexFilter(strPatterns);
        NodeList nodes = obParser.ExtractAllNodesThatMatch(obLinkRegExFilter);
        if (nodes != null)
        {
            for (Int32 i = 0; i < nodes.Count; i++)
            {
                INode obNode = nodes[i];
                Console.WriteLine(obNode.GetText());
            }
        }
    }

On 2/15/06, Luís Manuel dos Santos Gomes <lui...@dq...> wrote:
> Hi Raj
>
> Check out the example applications bundled with the parser. They
> really help one to get acquainted with HTMLParser.
> In particular, for your problem check out this example:
>
> org.htmlparser.parserapplications.LinkExtractor
>
> It tells you how to extract links. Then you should use the class
>
> org.htmlparser.filters.LinkStringFilter
>
> or
>
> org.htmlparser.filters.LinkRegexFilter
>
> to get only links containing the string "sony" or "jvc".
>
> Hope this helps you.
>
> On Feb 15, 2006, at 5:29 AM, vraja sekaran wrote:
>
> > Hi guys
> > I am new to HTML parser. I am trying to extract the
> > links that correspond to a search string.
> > For example
> >
> > <dd><a href="/Camcorders--reviews--brand_sony">Sony</a> <tt>(415)</tt><dd><a
> > href="/Camcorders--reviews--jvc">JVC</a> <tt>(385)</tt><dd><a
> > .....
> >
> > In the above source code I want to extract the link
> > corresponding to Sony or JVC according to the
> > requirement.
> >
> > Thank you guys
> > Raj

--
Naveen K Kohli
http://www.netomatix.com
From: Third E. <nav...@gm...> - 2006-02-15 16:30:23
Hi!

We did implement IFrameTag and named the class IFrameTag. Our implementation is a .Net port of this library and we have added some of our own enhancements. If you are interested, you can download it from

http://www.netomatix.com

Naveen

On 2/15/06, Luís Manuel dos Santos Gomes <lui...@gm...> wrote:
> Hi everybody.
>
> This is my first post to this list.
> I'm replacing my own html processing code (regex based) with HTMLParser.
> The examples have been a great help!
>
> I need to handle IFRAME and LINK tags. The link tag is often used to
> include external CSS.
> The name "LinkTag" has already been taken for the anchor tags! How
> should I name the class to handle the LINK tags?
> Has anybody implemented the IframeTag and the "TrueLinkTag" classes?
> I could do this and would be glad to contribute it to the project.
> I'm using version 20051112. I've not checked out from CVS because
> I need a stable package.
>
> Cheers!
>
> Luís Gomes
> (from Portugal)

--
Naveen K Kohli
http://www.netomatix.com
From: <lui...@dq...> - 2006-02-15 14:23:51
Hi Raj

Check out the example applications bundled with the parser. They really help one to get acquainted with HTMLParser. In particular, for your problem check out this example:

org.htmlparser.parserapplications.LinkExtractor

It tells you how to extract links. Then you should use the class

org.htmlparser.filters.LinkStringFilter

or

org.htmlparser.filters.LinkRegexFilter

to get only links containing the string "sony" or "jvc".

Hope this helps you.

On Feb 15, 2006, at 5:29 AM, vraja sekaran wrote:

> Hi guys
> I am new to HTML parser. I am trying to extract the
> links that correspond to a search string.
> For example
>
> <dd><a href="/Camcorders--reviews--brand_sony">Sony</a> <tt>(415)</tt><dd><a
> href="/Camcorders--reviews--jvc">JVC</a> <tt>(385)</tt><dd><a
> .....
>
> In the above source code I want to extract the link
> corresponding to Sony or JVC according to the
> requirement.
>
> Thank you guys
> Raj
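The LinkRegexFilter approach recommended above boils down to keeping only the links whose URL matches a regular expression. Here is a minimal self-contained sketch of that idea using plain java.util.regex; the class and method names are invented for illustration and do not come from the htmlparser API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class LinkFilterDemo {
    // Keep only the hrefs whose URL contains a match for the given regex
    // (matching by "contains", as a substring-style link filter would).
    static List<String> filterLinks(List<String> hrefs, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String href : hrefs)
            if (p.matcher(href).find())
                kept.add(href);
        return kept;
    }

    public static void main(String[] args) {
        // The camcorder-review links from Raj's example page.
        List<String> hrefs = List.of(
            "/Camcorders--reviews--brand_sony",
            "/Camcorders--reviews--jvc",
            "/Camcorders--reviews--brand_canon");
        System.out.println(filterLinks(hrefs, "sony|jvc"));
        // prints: [/Camcorders--reviews--brand_sony, /Camcorders--reviews--jvc]
    }
}
```

With the real parser, LinkExtractor collects the LinkTag nodes and the filter is applied during extraction rather than afterwards, but the matching logic is the same in spirit.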
From: <lui...@gm...> - 2006-02-15 14:23:44
Hi everybody.

This is my first post to this list. I'm replacing my own html processing code (regex based) with HTMLParser. The examples have been a great help!

I need to handle IFRAME and LINK tags. The link tag is often used to include external CSS. The name "LinkTag" has already been taken for the anchor tags! How should I name the class to handle the LINK tags? Has anybody implemented the IframeTag and the "TrueLinkTag" classes? I could do this and would be glad to contribute it to the project. I'm using version 20051112. I've not checked out from CVS because I need a stable package.

Cheers!

Luís Gomes
(from Portugal)
From: vraja s. <vra...@ya...> - 2006-02-15 05:29:50
Hi guys

I am new to HTML parser. I am trying to extract the links that correspond to a search string. For example:

<dd><a href="/Camcorders--reviews--brand_sony">Sony</a> <tt>(415)</tt><dd><a
href="/Camcorders--reviews--jvc">JVC</a> <tt>(385)</tt><dd><a
.....

In the above source code I want to extract the link corresponding to Sony or JVC according to the requirement.

Thank you guys
Raj

Rajasekaran Venkatachalam
3602 Spottswood Ave, Apt # 2
Memphis, TN 38111, USA
Mobile # 901-246-4031
Work # 901-678-5323
From: Derrick O. <Der...@Ro...> - 2006-02-08 12:46:39
01) Looking at the source code, the SiteCapturer code goes through a NodeIterator, but the Parser.parse(NodeFilter) method with a null filter would do the same thing.

    // fetch the page and gather the list of nodes
    mParser.setURL (url);
    try
    {
        list = new NodeList ();
        for (NodeIterator e = mParser.elements (); e.hasMoreNodes (); )
            list.add (e.nextNode ()); // URL conversion occurs in the tags
    }
    catch (EncodingChangeException ece)
    {
        // fix bug #998195 SiteCapturer just crashed
        // try again with the encoding now set correctly
        // hopefully mPages, mImages, mCopied and mFinished won't be corrupted
        mParser.reset ();
        list = new NodeList ();
        for (NodeIterator e = mParser.elements (); e.hasMoreNodes (); )
            list.add (e.nextNode ());
    }

02) No validation is done on the page. However, the heuristics built in to the tag parsing will insert terminating nodes (identified by 0 == (tag.getEndPosition () - tag.getStartPosition ())) where end tags are required.

03) After parsing the entire page, the source is available (as characters or as a String) from the Page/Source, which is exposed on the parser as getPage(). Strings in Java are UTF-16 encoded Unicode. Any errors in conversion (using the encoding specified by the HTTP header or HTML meta tags) will already have been committed by then.

myer wrote:

> Hello dear users and developers,
>
> Currently I am writing my bachelor's thesis, where I use the
> functionality of HTML Parser. In my program I need almost the same
> result as SiteCapturer produces, so I've started to learn how it works
> and to change it for my project. But some moments are not fully clear to me.
>
> 01) How does HTML Parser obtain the source code of a web page before
> parsing? In the following I will speak about the SiteCapturer example.
> Does it start with a 'null' filter to get all the nodes of a web page
> for the very first time, and only then apply other filters indicated by
> the user? Or does it parse 'on the fly': get the first node of a source,
> compare it with the node filter, and only save it into a data structure
> (say, a node list) if it passes the filter check? What I need to do is
> to get the whole 'untouched' source code of a web page before parsing.
> Should I go the way mentioned in this thread
> http://sourceforge.net/forum/message.php?msg_id=3005740
> or are there any other more intelligent solutions? Perhaps there exists
> an already implemented method, something like page.getSource()? How does
> SiteCapturer solve this problem?
>
> 02) Is the source code of a web page normalized in any way before the
> actual parsing? Are there any attempts made to supply the parser with
> validated HTML source? Or is it better to use products of other
> developers, e.g. JTidy?
>
> 03) Also I would like to save the source code of a web page in its
> original encoding or in Unicode. I do not want to lose any international
> character of the source. I need to save the source of a page into the
> database and be able to obtain it in its original form if necessary.
> Does HTML Parser support source code conversion into Unicode?
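The try/catch in Derrick's snippet is a retry-on-encoding-change pattern: drain all nodes once, and if a meta tag switches the charset mid-parse, reset and drain again. The sketch below models that pattern with stand-in types; EncodingChange and NodeSource are invented for illustration, standing in for the real Parser and EncodingChangeException.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CollectWithRetry {
    // Stand-in for the exception raised when a <meta> tag changes the
    // charset mid-parse, forcing the page to be re-read.
    static class EncodingChange extends RuntimeException {}

    // Minimal shape of the parser as SiteCapturer uses it.
    interface NodeSource {
        Iterator<String> elements();  // like Parser.elements()
        void reset();                 // re-open with the corrected encoding
    }

    // Drain everything once; if the encoding changes underfoot,
    // reset and drain again from the top.
    static List<String> collect(NodeSource parser) {
        try {
            return drain(parser.elements());
        } catch (EncodingChange e) {
            parser.reset();
            return drain(parser.elements());
        }
    }

    private static List<String> drain(Iterator<String> it) {
        List<String> list = new ArrayList<>();
        while (it.hasNext())
            list.add(it.next());
        return list;
    }

    // Demo source: fails with EncodingChange on the first pass,
    // succeeds after reset() is called.
    static NodeSource flakySource(List<String> nodes) {
        return new NodeSource() {
            boolean healthy = false;
            public Iterator<String> elements() {
                if (!healthy) throw new EncodingChange();
                return nodes.iterator();
            }
            public void reset() { healthy = true; }
        };
    }

    public static void main(String[] args) {
        List<String> got = collect(flakySource(List.of("html", "head", "body")));
        System.out.println(got);  // prints: [html, head, body]
    }
}
```

Note the caveat in the original comment: any state accumulated during the first, aborted pass may need to be discarded before retrying.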
From: myer <my...@o2...> - 2006-02-08 11:12:49
Hello dear users and developers,

Currently I am writing my bachelor's thesis, where I use the functionality of HTML Parser. In my program I need almost the same result as SiteCapturer produces, so I've started to learn how it works and to change it for my project. But some moments are not fully clear to me.

01) How does HTML Parser obtain the source code of a web page before parsing? In the following I will speak about the SiteCapturer example. Does it start with a 'null' filter to get all the nodes of a web page for the very first time, and only then apply other filters indicated by the user? Or does it parse 'on the fly': get the first node of a source, compare it with the node filter, and only save it into a data structure (say, a node list) if it passes the filter check? What I need to do is to get the whole 'untouched' source code of a web page before parsing. Should I go the way mentioned in this thread http://sourceforge.net/forum/message.php?msg_id=3005740 or are there any other more intelligent solutions? Perhaps there exists an already implemented method, something like page.getSource()? How does SiteCapturer solve this problem?

02) Is the source code of a web page normalized in any way before the actual parsing? Are there any attempts made to supply the parser with validated HTML source? Or is it better to use products of other developers, e.g. JTidy?

03) Also I would like to save the source code of a web page in its original encoding or in Unicode. I do not want to lose any international character of the source. I need to save the source of a page into the database and be able to obtain it in its original form if necessary. Does HTML Parser support source code conversion into Unicode?

--
Best regards,
Myer
From: Derrick O. <Der...@Ro...> - 2006-02-07 18:34:13
Tags are omitted because, heuristically, the tighter rule that assumes all tags are composite tags fails to parse correctly because of bad HTML out in the wild. You are welcome to try replacing the default tag (see PrototypicalNodeFactory.setTagPrototype()) with a composite tag that ends with a matching slash name, but my guess is it will parse very poorly.

加藤 千典 wrote:

> Hi, all.
>
> I noticed the correct way.
>
> I created an AddressTag.java that is almost a copy of ParagraphTag.java
> and added code like this:
>
>     PrototypicalNodeFactory factory = new PrototypicalNodeFactory();
>     factory.registerTag (new AddressTag());
>     parser.setNodeFactory (factory);
>
> It's OK, but should I create a lot of other HTML tag classes?
>
> I think the HTML tag classes mostly exist already.
> How can I get them?
>
> Thank you, all.
>
>> Hi, all.
>>
>> I parsed an html and created a dom, using
>> HTMLParser Version 1.6 (Integration Build Nov 12, 2005).
>>
>> The "P" tag has the "P" END TAG as a child.
>> (It's the same for "HEAD", "TITLE", "BODY", etc...)
>>
>> On the other hand, there are two "ADDRESS" tags ("ADDRESS" and "/ADDRESS")
>> on the same level in the dom.
>> (It's the same for the "CENTER" tag.)
>>
>> I expected the ADDRESS tag to behave like the "P" tag, but it does not.
>>
>> Why is that?
>>
>> How can I make the parser recognize the ADDRESS tag as a single
>> CompositeTag?
>>
>> Thank you, all. Sorry for my poor English.
From: <ka...@ex...> - 2006-02-07 07:14:24
Hi, all.

I noticed the correct way.

I created an AddressTag.java that is almost a copy of ParagraphTag.java and added code like this:

    PrototypicalNodeFactory factory = new PrototypicalNodeFactory();
    factory.registerTag (new AddressTag());
    parser.setNodeFactory (factory);

It's OK, but should I create a lot of other HTML tag classes?

I think the HTML tag classes mostly exist already. How can I get them?

Thank you, all.

> Hi, all.
>
> I parsed an html and created a dom, using
> HTMLParser Version 1.6 (Integration Build Nov 12, 2005).
>
> The "P" tag has the "P" END TAG as a child.
> (It's the same for "HEAD", "TITLE", "BODY", etc...)
>
> On the other hand, there are two "ADDRESS" tags ("ADDRESS" and "/ADDRESS")
> on the same level in the dom.
> (It's the same for the "CENTER" tag.)
>
> I expected the ADDRESS tag to behave like the "P" tag, but it does not.
>
> Why is that?
>
> How can I make the parser recognize the ADDRESS tag as a single
> CompositeTag?
>
> Thank you, all. Sorry for my poor English.
From: <ka...@ex...> - 2006-02-07 06:46:36
Hi, all.

I parsed an html and created a dom, using HTMLParser Version 1.6 (Integration Build Nov 12, 2005).

The "P" tag has the "P" END TAG as a child. (It's the same for "HEAD", "TITLE", "BODY", etc...)

On the other hand, there are two "ADDRESS" tags ("ADDRESS" and "/ADDRESS") on the same level in the dom. (It's the same for the "CENTER" tag.)

I expected the ADDRESS tag to behave like the "P" tag, but it does not.

Why is that?

How can I make the parser recognize the ADDRESS tag as a single CompositeTag?

Thank you, all. Sorry for my poor English.

---------code-----------

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class SampleHTMLParserJ {

    /**
     * HTMLParser sample
     *
     * @param args
     */
    public static void main(String[] args) {
        try {
            Parser parser = new Parser("file:///D:/data/test03.html");
            NodeList list = parser.parse(null);
            Node node = list.elementAt(0);
            System.out.println(node);
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}

---------stdout-----------

Tag (0[0,0],57[0,57]): Html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ja"
Txt (57[0,57],60[1,1]): \n
Tag (60[1,1],66[1,7]): head
Txt (66[1,7],70[2,2]): \n
Tag (70[2,2],77[2,9]): title
Txt (77[2,9],88[2,20]): title title
End (88[2,20],96[2,28]): /title
Txt (96[2,28],99[3,1]): \n
End (99[3,1],106[3,8]): /head
Txt (106[3,8],109[4,1]): \n
Tag (109[4,1],115[4,7]): body
Txt (115[4,7],121[6,2]): \n\n
Tag (121[6,2],130[6,11]): address
Txt (130[6,11],137[6,18]): My name
End (137[6,18],147[6,28]): /address
Txt (147[6,28],151[7,2]): \n
Tag (151[7,2],159[7,10]): CENTER
Txt (159[7,10],165[7,16]): CENTER
End (165[7,16],174[7,25]): /CENTER
Txt (174[7,25],178[8,2]): \n
Tag (178[8,2],181[8,5]): p
Tag (181[8,5],220[8,44]): img src="welcome.gif" alt="welcome" /
End (220[8,44],224[8,48]): /p
Txt (224[8,48],230[10,2]): \n\n
Tag (230[10,2],234[10,6]): h1
Txt (234[10,6],238[10,10]): main
End (238[10,10],243[10,15]): /h1
Txt (243[10,15],247[11,2]): \n
Tag (247[11,2],253[11,8]): hr /
Txt (253[11,8],256[12,1]): \n
End (256[12,1],263[12,8]): /body
Txt (263[12,8],265[13,0]): \n
End (265[13,0],272[13,7]): /html

---------html-----------

<Html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ja">
<head>
<title>title title</title>
</head>
<body>

<address>My name</address>
<CENTER>CENTER</CENTER>
<p><img src="welcome.gif" alt="welcome" /></p>

<h1>main</h1>
<hr />
</body>
</html>

------------------
From: Derrick O. <Der...@Ro...> - 2006-02-05 13:09:28
The parser doesn't really deal in lines of text, since most HTML disregards linebreaks (the <pre> tag is the only exception I can think of). What you probably want is subsequent nodes. For this use the children of the parent of the node you have. Some methods were recently added on AbstractNode (which TextNode inherits from) to handle this:

    getPreviousSibling() and getNextSibling()

These are only available in the latest Integration Build.

If you really want lines of text, the Page object available from the parser can be asked to fetch a line with getLine(). This method has two overloads: one takes a cursor argument, the other an integer position. The position is available from the node you have with getStartPosition() or getEndPosition(). That gets you the contents of the line in the HTML stream for the node you have. Subsequent lines are a little tougher to get hold of. The line information is held in a PageIndex object, which the Page doesn't expose. But it could if you added a method. If you had one of those you could step through the lines of the file.

Derrick

quanta veloce wrote:

> Hi,
>
> Can HTMLParser allow one to extract into an array lines before or
> after a search string?
>
> For instance:
>
> <CENTER>
> <TABLE ALIGN="CENTER" BORDER=5>
> <TR>
> <TD width=150 align=center><B>Area</B></TD>
> <TD width=120 align=center><B>Instantaneous Load</B></TD>
> </TR>
> <TR>
> <TD>PJM MID ATLANTIC REGION</TD>
> <TD align=right>33929</TD>
> </TR>
> <TR>
> <TD>PJM WESTERN REGION</TD>
> <TD align=right>39400</TD>
> </TR>
> <TR>
> <TD>PJM SOUTHERN REGION</TD>
> <TD align=right>9857</TD>
> </TR>
> <TR>
> <TD>PJM RTO</TD>
> <TD align=right>83186</TD>
> </TR>
> </TABLE>
> </CENTER>
> <P><CENTER>Loads are calculated from raw telemetry data and are
> approximate.</CENTER>
> <CENTER>The displayed values are NOT official PJM Loads.</CENTER>
> <BR><BR><BR>
> <P><CENTER><H2>Current PJM Transmission Limits</H2></CENTER>
> <P align=center>None
>
> </BODY>
> </HTML>
>
> In the above HTML I matched the string "Current PJM Transmission
> Limits" and I want to obtain any and all lines after this match... or
> even the next 3 lines, etc.
>
> Any help would be appreciated!
> Thanks,
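Derrick's "children of the parent" advice is exactly what a next-sibling lookup does under the hood. The toy node class below sketches that mechanism; it is a self-contained stand-in invented for illustration, not the htmlparser AbstractNode.

```java
import java.util.ArrayList;
import java.util.List;

public class SiblingDemo {
    static class Node {
        final String name;
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String name) { this.name = name; }

        Node add(Node child) {
            child.parent = this;
            children.add(child);
            return this;  // allow chained adds
        }

        // What a getNextSibling() boils down to: find yourself in the
        // parent's child list and return the following entry, if any.
        Node nextSibling() {
            if (parent == null) return null;
            int i = parent.children.indexOf(this);
            return (i >= 0 && i + 1 < parent.children.size())
                    ? parent.children.get(i + 1)
                    : null;
        }
    }

    public static void main(String[] args) {
        // A shape like quanta's page: the H2 heading is followed by a P node.
        Node center = new Node("CENTER");
        Node h2 = new Node("H2");
        Node p = new Node("P");
        center.add(h2).add(p);
        assert h2.nextSibling() == p;
        assert p.nextSibling() == null;
        System.out.println("next of H2: " + h2.nextSibling().name);  // prints: next of H2: P
    }
}
```

So for "the next 3 lines after the match": find the node containing the matched text, then walk forward through siblings rather than through physical lines.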
From: quanta v. <qua...@ya...> - 2006-02-04 00:39:45
Hi,

Can HTMLParser allow one to extract into an array lines before or after a search string?

For instance:

<CENTER>
<TABLE ALIGN="CENTER" BORDER=5>
<TR>
<TD width=150 align=center><B>Area</B></TD>
<TD width=120 align=center><B>Instantaneous Load</B></TD>
</TR>
<TR>
<TD>PJM MID ATLANTIC REGION</TD>
<TD align=right>33929</TD>
</TR>
<TR>
<TD>PJM WESTERN REGION</TD>
<TD align=right>39400</TD>
</TR>
<TR>
<TD>PJM SOUTHERN REGION</TD>
<TD align=right>9857</TD>
</TR>
<TR>
<TD>PJM RTO</TD>
<TD align=right>83186</TD>
</TR>
</TABLE>
</CENTER>
<P><CENTER>Loads are calculated from raw telemetry data and are approximate.</CENTER>
<CENTER>The displayed values are NOT official PJM Loads.</CENTER>
<BR><BR><BR>
<P><CENTER><H2>Current PJM Transmission Limits</H2></CENTER>
<P align=center>None

</BODY>
</HTML>

In the above HTML I matched the string "Current PJM Transmission Limits" and I want to obtain any and all lines after this match... or even the next 3 lines, etc.

Any help would be appreciated!
Thanks,
From: Derrick O. <Der...@Ro...> - 2006-02-03 13:13:01
Java uses Unicode. It stores characters in UTF-16 internally, i.e. char is 16 bits, and String is an array of 16-bit values encoding Unicode in UTF-16.

Character entity conversion is a way for HTML documents to contain Unicode characters outside their current encoding, and also to avoid the reserved characters HTML is based on, like the left angle bracket < and the ampersand &. These must be converted to Unicode to extract the semantic meaning of the page.

So your question is: "Is there a Java program that uses something besides the String type to store Unicode when parsing HTML?" I don't think so.

You might want to look at the Translate class in the util package to see if it does what you want.

Jan wrote:

> Dear Experts and Users,
>
> Could anyone say for sure whether htmlparser is capable of html tag
> stripping and html entity conversion, but without Unicode conversion,
> or not?
>
> If not, what Java tool could I use?
>
> Thanks,
>
> Jan
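Jan's use case, tag stripping plus entity conversion, can be approximated in a few lines of plain Java. This is a naive illustrative sketch (regex tag stripping and a hand-rolled four-entry entity table, not the real org.htmlparser Translate class); its one subtlety is the order of operations.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StripAndDecode {
    // A handful of named entities for illustration; a real implementation
    // needs the full HTML entity table.
    static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&quot;", "\"");
        ENTITIES.put("&amp;", "&");  // decode &amp; LAST, or "&amp;lt;" double-decodes
    }

    // Naive tag stripper: fine for simple markup, fails on '>' inside
    // attribute values or comments.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    static String decodeEntities(String text) {
        for (Map.Entry<String, String> e : ENTITIES.entrySet())
            text = text.replace(e.getKey(), e.getValue());
        return text;
    }

    public static void main(String[] args) {
        // Strip tags BEFORE decoding entities: a decoded &lt; would
        // otherwise look like a tag opener and get eaten.
        String html = "<p>Fish &amp; Chips &lt; 5</p>";
        System.out.println(decodeEntities(stripTags(html)));  // prints: Fish & Chips < 5
    }
}
```

Note this yields ordinary Java Strings, which are UTF-16 Unicode per Derrick's point above; "entity conversion without Unicode conversion" is not really achievable with String-based tools.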
From: Jan <jan...@gm...> - 2006-02-03 05:48:12
Dear Experts and Users,

Could anyone say for sure whether htmlparser is capable of html tag stripping and html entity conversion, but without Unicode conversion, or not?

If not, what Java tool could I use?

Thanks,

Jan
From: Riaz u. <ru...@ya...> - 2006-02-03 02:04:26
|
Hi, I was away for a long time... anyway, here is the program that I had written. I know very little programming, so don't get bored. Is there a better way to read from Yahoo? I am sure there is. One more thing: the program displays the character reference &#36; instead of the $ symbol; how do I overcome this? Please help, anyone. The program follows:

import java.io.*;
import java.net.*;
import java.net.URL;
import org.htmlparser.*;
import org.htmlparser.util.*;
import org.htmlparser.Parser;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.tags.Span;
import org.htmlparser.tags.FormTag;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.tags.StyleTag;
import org.htmlparser.tags.ScriptTag;
import org.htmlparser.tags.ParagraphTag;
import org.htmlparser.tags.CompositeTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.nodes.TagNode;
import org.htmlparser.nodes.TextNode;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.filters.LinkStringFilter;

/**
 * Extract plaintext strings from a web page.
 * Illustrative program to gather the textual contents of a web page.
 * Uses a {@link org.htmlparser.beans.StringBean StringBean} to accumulate
 * the user visible text (what a browser would display) into a single string.
 * Step 1. Parse the page.
 * Step 2. Collect the HTML tags in the page as nodes in a list.
 * Step 3. Keep only the SPAN tags in the list.
 * Links are continuously updated at the Yahoo page; they are inside the SPAN tag with
 * the 'recenttimedate' attribute.
 * Step 4.
 */
public class StringExtract {
    public static void main(String args[]) {
        try {
            int i = 0, j = 0, k = 0;
            boolean endOfnewsinthisPage = false;
            String sourceURL = args[0]; // sourceURL is the argument to read news from

            // Step 1. Parsing the input page.
            Parser parser = new Parser(sourceURL); // parser will hold the tree of the url
            NodeList li_tags = new NodeList();

            // Step 2. Collecting tags in a list.
            NodeList list = parser.parse(null);

            // News links are at the span tag (time); spanList stores the span tags.
            // Step 3. Keep only the SPAN tags in spanList.
            NodeList spanList = list.extractAllNodesThatMatch(new TagNameFilter("SPAN"), true);

            // Step 4. Extract the link from each span tag.
            while (i < spanList.size()) {
                Span spanTag = (Span) spanList.elementAt(i);
                // System.out.println(spanTag.getText());
                // We only need SPAN tags with attribute class='recenttimedate'.
                // Move to the link in the span tag.
                if (spanTag.getText().equals("span class=recenttimedate")) {
                    li_tags.add(spanList.elementAt(i).getParent());
                }
                i++;
            }
            i = 0;

            NodeList a_tags = new NodeList();
            NodeFilter filter = new TagNameFilter("P");
            LinkTag validLink = new LinkTag();
            CompositeTag comptag = new CompositeTag();
            String linkTag = "http";
            LinkStringFilter linkTagFilter = new LinkStringFilter(linkTag);

            // There are http links and also other links; a_tags will contain only http links.
            for (NodeIterator e = li_tags.elements(); e.hasMoreNodes();) {
                e.nextNode().collectInto(a_tags, linkTagFilter);
            }

            // BufferedWriter out = new BufferedWriter(new FileWriter("output.txt", true));
            // while (i < a_tags.size())
            // {
            LinkTag linkAtag = (LinkTag) a_tags.elementAt(0);
            // Extract the link from each a_tags element.
            String interestingLink = linkAtag.extractLink();
            boolean exists = false;
            j = 0;
            // In Yahoo, there are a few http links which lead to images; we don't need them.
            // The following loop filters out those links.
            while (j < interestingLink.length() && (!exists)) {
                exists = interestingLink.substring(j).startsWith("photos");
                j++;
            }

            // Step 5. Parse each link that was collected.
            if ((linkAtag.isHTTPLink()) && (!linkAtag.getLinkText().equals("")) && (!exists)) {
                Parser parseIndividualURLs = new Parser(interestingLink);
                NodeList nodesLink = parseIndividualURLs.parse(null);
                String newString = new String();
                TextNode textNode = new TextNode(newString);
                for (NodeIterator x = nodesLink.elements(); x.hasMoreNodes();) {
                    Node cNode = x.nextNode();
                    if ((cNode.getChildren() != null) && (!cNode.getText().equals("div")))
                        nodesLink.add(cNode.getChildren());
                }

                // One link is one HTML document. nodesLink is the list of all nodes under one document.
                for (k = 0; k < nodesLink.size(); k++) {
                    Node cNode = nodesLink.elementAt(k);
                    Node prevNode = null;
                    Node nextNode = null;
                    TagNode nextTagNode = null;
                    TagNode prevTagNode = null;
                    TagNode dNode = null;
                    // if (cNode instanceof LinkTag)
                    // {
                    //     LinkTag lnkTag = (LinkTag) cNode;
                    //     System.out.println(lnkTag.getLinkText());
                    // }
                    if (!((k - 1) < 0)) {
                        prevNode = nodesLink.elementAt(k - 1);
                        if (prevNode instanceof TagNode)
                            prevTagNode = (TagNode) nodesLink.elementAt(k - 1);
                    }
                    if (!((k + 1) > nodesLink.size())) {
                        nextNode = nodesLink.elementAt(k + 1);
                        if (nextNode instanceof TagNode)
                            nextTagNode = (TagNode) nodesLink.elementAt(k + 1);
                    }
                    TagNode tNode = (TagNode) cNode.getParent();
                    NodeList newList = new NodeList();

                    // Printing the title of the news.
                    if (cNode.getText().equals("title")) {
                        // out.write(cNode.toPlainTextString());
                        System.out.println(cNode.toPlainTextString());
                        System.out.println();
                    }
                    if (cNode.getText().startsWith("div")) {
                        dNode = (TagNode) cNode;
                        if (dNode.getAttribute("class") != null)
                            if (dNode.getAttribute("class").equals("clearfix"))
                                k = nodesLink.size() + 1;
                    }
                    if (cNode instanceof TextNode) {
                        // This 'if' block prints each paragraph of the news.
                        if (prevNode.getText().equals("p")) {
                            if (!(nextNode.getText().startsWith("span"))) {
                                // out.write(cNode.toHtml().trim());
                                System.out.println(cNode.toHtml().trim()); // here
                            } else if (nextNode instanceof TagNode) {
                                if (nextTagNode.getAttribute("class") != null)
                                    if (!(nextTagNode.getAttribute("class").equals("clearfix"))) {
                                        // out.write(cNode.toHtml().trim());
                                        System.out.println(cNode.toHtml().trim());
                                    } else
                                        k = nodesLink.size();
                            }
                        }
                        // This 'else if' block prints the first paragraph of the news (because the
                        // first paragraph is at a different place in the document).
                        else if (prevNode.getText().equals("p/")) {
                            // out.write(cNode.toHtml().trim());
                            System.out.println(cNode.toHtml().trim());
                        }
                        // There are some words in the document where Yahoo provides a search facility
                        // (for example, a person, a country, etc.) and it is in the form of a link.
                        // This block extracts text from those links.
                        else if (prevNode.getText().startsWith("span")) {
                            newList.add(prevNode.getChildren());
                            for (NodeIterator x = newList.elements(); x.hasMoreNodes();) {
                                Node aNode = x.nextNode();
                                if (aNode instanceof TagNode) {
                                    prevTagNode = (TagNode) aNode;
                                    if (prevTagNode.getAttribute("href") != null) {
                                        // out.write(aNode.toPlainTextString() + " " + cNode.toHtml().trim());
                                        System.out.println(aNode.toPlainTextString() + " " + cNode.toHtml().trim());
                                    }
                                }
                            }
                        }
                    }
                }
                // System.out.println("Link:" + linkAtag.extractLink() + ":Text:" + linkAtag.getLinkText());
                // System.out.println();
                // System.out.println();
            }
            // i++;
            // }
            // out.close();
        } catch (Exception ex) {
            System.out.println("Printing Exceptional Error");
            ex.printStackTrace();
        }
    }
}
From: Derrick O. <Der...@Ro...> - 2006-02-01 23:18:59
The StringBean does a decode on the text; perhaps that is what you need:

    public void visitStringNode (Text string)
    {
        if (!mIsScript && !mIsStyle)
        {
            String text = string.getText ();
            if (!mIsPre)
            {
                text = Translate.decode (text);

HuangGehua wrote:
> I am parsing an HTML resource file which has some Chinese words. When I use
> the TextExtractingVisitor.getExtractedText() method to get the text, the
> Chinese words display well. But if I get a TextNode and use the
> TextNode.getText() method to get the Chinese words, they are not displayed
> correctly.
>
> How can I make the TextNode.getText() method work correctly?
> Thank you!
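[Editor's note] Translate.decode is the htmlparser call that turns character references back into characters (which would also fix the &#36;-for-$ symptom reported earlier in this thread). As a rough stdlib-only illustration of what such a decode does, here is a minimal sketch; it handles only decimal numeric references, unlike the real Translate class, which also handles named and hexadecimal references:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericEntityDecode {
    private static final Pattern NUMERIC = Pattern.compile("&#(\\d+);");

    // Minimal sketch: replace decimal character references like &#36; with
    // the character they denote. Not a substitute for Translate.decode.
    static String decode(String text) {
        Matcher m = NUMERIC.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(new String(Character.toChars(codePoint))));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("Price: &#36;100")); // Price: $100
    }
}
```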
From: HuangGehua <bo...@gm...> - 2006-01-31 11:42:56
I am parsing an HTML resource file which has some Chinese words. When I use the TextExtractingVisitor.getExtractedText() method to get the text, the Chinese words display well. But if I get a TextNode and use the TextNode.getText() method to get the Chinese words, they are not displayed correctly.

How can I make the TextNode.getText() method work correctly?
Thank you!
From: HuangGehua <bo...@gm...> - 2006-01-30 16:31:56
I want to parse an HTML file with encoding GB2312 or GBK and then write an XML file with encoding UTF-8. I use JDOM to write the XML file. The source HTML file doesn't have a <meta> tag to identify the charset, for example:

========================
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
It will be read and overwritten.
Do Not Edit! -->
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
<DT><H3 FOLDED ADD_DATE="1120124714">链接</H3>
<DL><p>
<DT><A HREF="http://www.microsoft.com/isapi/redir.dll?prd=ie&ar=windowsmedia">Windows Media</A>
<DT><A HREF="http://www.microsoft.com/isapi/redir.dll?prd=ie&ar=windows">Windows</A>
<DT><A HREF="http://www.microsoft.com/isapi/redir.dll?prd=ie&ar=hotmail">免费 Hotmail</A>
<DT><A HREF="http://www.microsoft.com/isapi/redir.dll?prd=ie&pver=6&ar=CLinks">自定义链接</A>
</DL><p>
<DT><A HREF="http://www.yxcard.com/download.htm">..远兴科技..</A>
<DT><A HREF="http://www.microsoft.com/isapi/redir.dll?prd=ie&pver=6&ar=IStart">MSN</A>
<DT><A HREF="http://www.yesure.com/storm/sort.php/1">暴风影音</A>
<DT><A HREF="http://www.yesky.com/SoftChannel/72348977504190464/20050411/1934159.shtml">Eclipse Yesky</A>
</DL><p>
=======================

The Java source code is:

=============================================
package html;

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import java.io.Writer;
import java.util.List;
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.nodes.TextNode;
import org.htmlparser.tags.DefinitionList;
import org.htmlparser.tags.DefinitionListBullet;
import org.htmlparser.tags.HeadingTag;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.util.SimpleNodeIterator;
import org.htmlparser.visitors.TagFindingVisitor;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.output.Format;
import org.jdom.output.XMLOutputter;

public class ChangeHtml2XML {
    private String htmlPath = "d:/bookmark.htm";
    private String xmlPath = "d:/toXML.xml";

    public Document getFirstMark() throws ParserException {
        Parser parser = new Parser(htmlPath);
        parser.setEncoding("GB2312");
        String[] tagsToBeFound = {"DL"};
        TagFindingVisitor visitor = new TagFindingVisitor(tagsToBeFound);
        parser.visitAllNodesWith(visitor);
        Node[] nodes = visitor.getTags(0);
        DefinitionList dl = (DefinitionList) nodes[0];
        Element rootElement = new Element("favorite");
        Document userDocument = new Document(rootElement);
        visitEachAndBuild(userDocument, rootElement, dl);
        System.out.println(parser.getEncoding());
        return userDocument;
    }

    public void visitEachAndBuild(Document document, Element parentElement, DefinitionList parentDL) {
        SimpleNodeIterator iteratorParentDlChildren = parentDL.children();
        while (iteratorParentDlChildren.hasMoreNodes()) {
            Node node = iteratorParentDlChildren.nextNode();
            if (node.getClass().getName().equals(DefinitionListBullet.class.getName())) {
                DefinitionListBullet dt = (DefinitionListBullet) node;
                Node justNode = dt.getChild(0);
                if (justNode.getClass().getName().equals(HeadingTag.class.getName())) {
                    TextNode tn = (TextNode) dt.getChild(1);
                    Element newElement = new Element("folder");
                    newElement.setAttribute("label", tn.getText());
                    System.out.println(tn.getText());
                    parentElement.addContent(newElement);
                    DefinitionList findTheDL = null;
                    SimpleNodeIterator forChildDefinitionList = dt.getChildren().elements();
                    while (forChildDefinitionList.hasMoreNodes()) {
                        Node n = forChildDefinitionList.nextNode();
                        if (n.getClass().getName().equals(DefinitionList.class.getName())) {
                            findTheDL = (DefinitionList) n;
                            break;
                        }
                    }
                    if (findTheDL != null)
                        visitEachAndBuild(document, newElement, findTheDL);
                } else {
                    TextNode tn = (TextNode) dt.getChild(1);
                    LinkTag link = (LinkTag) dt.getChild(0);
                    Element newElement = new Element("address");
                    newElement.setAttribute("lable", tn.getText());
                    System.out.println(tn.getText());
                    newElement.setAttribute("url", link.getLink());
                    newElement.setAttribute("target", "blank");
                    parentElement.addContent(newElement);
                }
            }
        }
    }

    public void saveDocument(Document doc) {
        StringBuffer buff = new StringBuffer();
        buff.append(xmlPath);
        try {
            XMLOutputter outputter = new XMLOutputter(Format.getPrettyFormat());
            Format format = outputter.getFormat();
            format.setEncoding("UTF-8");
            format.setExpandEmptyElements(true);
            outputter.setFormat(format);
            FileOutputStream fos = new FileOutputStream(buff.toString());
            Writer output = new OutputStreamWriter(fos, "UTF-8");
            outputter.output(doc, output);
            output.close();
            // return true;
        } catch (java.io.IOException e) {
            System.out.println("cant write to file system");
            // throw new Exception(e);
        }
    }
}
===========================

The resulting XML file can't display the Chinese words correctly; it looks like this:
""
What am I doing wrong? By the way, how can I detect a file's charset without a meta tag? Any positive suggestion is welcome. Thank you!
From: Jan <jan...@gm...> - 2006-01-28 15:26:39
Dear Derrick,

Thank you for the quick reply. I wrote the string into a file, and the file contains the question marks. I would like to have the original HTML text but without the HTML tags and HTML entities. Any conversion toward Unicode is undesired for my problem (I would like to use the plain text for language/encoding identification). If htmlparser does not fit my problem, could you recommend something?

Thank you!

Jan

On 1/28/06, Derrick Oswald <Der...@ro...> wrote:
>
> Jan,
>
> In general, a lot of care has been taken to ensure that the correct
> character set (according to the web page meta data) is being used.
> The appearance of question marks may be just a function of the
> System.out.println() that it's doing.
> Have you tried examining the errant characters in a debugger or writing
> the strings returned from the StringBean (used by the stringextractor
> command) to a PrintWriter with an encoding that can handle those
> characters?
>
> Derrick
>
> Jan wrote:
>
>> Dear Members!
>>
>> Is it possible using htmlparser to extract plain text in original
>> encoding/charset?
>>
>> I tried the sample stringextractor.cmd.
>> It worked nicely, but non-common characters are replaced with question
>> marks (?). I would like to keep the original byte sequence.
>>
>> Thanks,
>>
>> Jan
>
> -------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
> for problems? Stop! Download the new AJAX search engine that makes
> searching your log files as easy as surfing the web. DOWNLOAD SPLUNK!
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=103432&bid=230486&dat=121642
> _______________________________________________
> Htmlparser-user mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
From: Derrick O. <Der...@Ro...> - 2006-01-28 14:42:27
Jan,

In general, a lot of care has been taken to ensure that the correct character set (according to the web page meta data) is being used. The appearance of question marks may be just a function of the System.out.println() that it's doing. Have you tried examining the errant characters in a debugger, or writing the strings returned from the StringBean (used by the stringextractor command) to a PrintWriter with an encoding that can handle those characters?

Derrick

Jan wrote:
> Dear Members!
>
> Is it possible using htmlparser to extract plain text in original
> encoding/charset?
>
> I tried the sample stringextractor.cmd.
> It worked nicely, but non-common characters are replaced with question
> marks (?). I would like to keep the original byte sequence.
>
> Thanks,
>
> Jan
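[Editor's note] The question-mark substitution Derrick describes happens whenever a string is pushed through an encoder that cannot represent one of its characters (for example, a platform default such as US-ASCII); his suggestion of a PrintWriter over an OutputStreamWriter with an explicit charset avoids it. A minimal stdlib-only illustration:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

public class EncodedWriterDemo {
    public static void main(String[] args) throws Exception {
        String s = "caf\u00e9"; // "café", a string with one non-ASCII character

        // Encoding through ASCII: the unrepresentable character is replaced
        // with '?', which is exactly the symptom reported in this thread.
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // caf?

        // Derrick's suggestion: a PrintWriter over an OutputStreamWriter
        // with an explicit charset keeps the character intact.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        PrintWriter out = new PrintWriter(new OutputStreamWriter(buf, StandardCharsets.UTF_8));
        out.print(s);
        out.flush();
        System.out.println(buf.toByteArray().length); // 5 ('c','a','f' + 2-byte é)
    }
}
```

Note this only preserves the characters once they are correctly decoded; it does not recover the original byte sequence Jan asked for, which would require working below the parser at the stream level.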
From: Jan <jan...@gm...> - 2006-01-26 06:43:38
Dear Members!

Is it possible using htmlparser to extract plain text in original encoding/charset?

I tried the sample stringextractor.cmd. It worked nicely, but non-common characters are replaced with question marks (?). I would like to keep the original byte sequence.

Thanks,

Jan
From: Derrick O. <Der...@Ro...> - 2006-01-24 12:45:29
You have only addressed the top level nodes in this code (the nodes in list). The toHtml() calls are recursive, so you need to put this logic in the definition of toHtml(), probably only in TagNode.java and maybe CompositeTag.java.

Marc Candle wrote:
> Hi,
>
> Thanks for your response. I tried this code below in an attempt to see
> if it would work given your comment:
>
> StringBuffer finalContents = new StringBuffer();
>
> // Generate final output
> for (NodeIterator e = list.elements (); e.hasMoreNodes (); ) {
>     Node node = e.nextNode ();
>     if ( node.getEndPosition() == node.getStartPosition() ) {
>         log.debug ( " IGNORED node : " + node.toHtml());
>         continue;
>     }
>     if (node instanceof TagNode) {
>         if ( ((TagNode)node).getTagEnd() == ((TagNode)node).getTagBegin() ) {
>             log.debug ( " IGNORED node : " + node.toHtml());
>             continue;
>         }
>     }
>     finalContents.append(node.toHtml());
> }
>
> This didn't seem to make any difference. The positions of the virtual
> tags must've been corrected at an earlier stage in htmlparser. I have
> started looking at the htmlparser source to see where this occurs.
>
> Kind Regards,
>
> Mark
>
> -----Original Message-----
> From: htm...@li...
> [mailto:htm...@li...] On Behalf Of
> Derrick Oswald
> Sent: 23 January 2006 12:37
> To: htm...@li...
> Subject: Re: [Htmlparser-user] Parsing malformed HTML whilst still
> leaving it intact
>
> This has been a requested task for two years now:
> http://sourceforge.net/pm/task.php?group_project_id=21601&group_id=24399&func=browse
>
> The virtual tags that are added have the start position the same as the
> end position, so a smarter toHtml() could recognize them that way and
> avoid outputting them.
>
> Marc Candle wrote:
>
>> Hi,
>>
>> I'm parsing snippets of HTML pages at a time, making some changes and then
>> outputting back to HTML. The problem with HTML snippets is that they will be
>> malformed since some closing tags, for example, will be missing.
>>
>> The Parser seems to automatically correct the malformed HTML by adding
>> closing tags. Is it possible to prevent it from doing so? Or at least can it
>> notify me when it does so, so that before reconstructing the modified HTML
>> output I can simply delete them?
>>
>> An alternative would be to use the Lexer, but then I lose all the
>> hierarchical features of the Parser, which is not an option.
>>
>> This is similar to the general problem brought up in
>> http://sourceforge.net/mailarchive/message.php?msg_id=12635550 .
>>
>> Kind Regards,
>>
>> Mark
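[Editor's note] The rule Derrick describes (a parser-inserted "virtual" tag occupies zero characters of the source, so its start position equals its end position) can be modeled without htmlparser. This hypothetical sketch, using simple position-carrying records in place of real parse nodes, shows the filtering idea Marc attempted, which, as Derrick points out, must ultimately live inside the recursive toHtml() itself:

```java
import java.util.List;

public class VirtualTagFilter {
    // Hypothetical stand-in for a parsed node: just its source positions and html.
    static final class Node {
        final int start, end;
        final String html;
        Node(int start, int end, String html) { this.start = start; this.end = end; this.html = html; }
    }

    // Keep only nodes that actually occupy characters in the source;
    // a virtual (parser-inserted) tag has start == end and is skipped.
    static String toHtmlSkippingVirtual(List<Node> nodes) {
        StringBuilder sb = new StringBuilder();
        for (Node n : nodes)
            if (n.end > n.start)
                sb.append(n.html);
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
            new Node(0, 3, "<p>"),
            new Node(3, 8, "hello"),
            new Node(8, 8, "</p>")); // virtual end tag inserted by the parser
        System.out.println(toHtmlSkippingVirtual(nodes)); // <p>hello
    }
}
```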