htmlparser-user Mailing List for HTML Parser (Page 82)
Brought to you by:
derrickoswald
From: Devin G. <obi...@ho...> - 2003-03-10 09:42:03
|
Hi,

I am also trying to parse a string containing HTML tags. I just want to pull the text out of the string, but I have been unsuccessful so far. I've tried creating a URL from the string and using an HTMLReader or a Reader to get at the information. I suppose I could write it to a file, but I would prefer not to go through all of that for a short, simple string. Nothing has worked for me yet. I am sure there is a simple way, but I can't seem to find it. Any help would be appreciated.

Thanks ahead of time,
Devin Gillman |
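A minimal sketch of one way to pull the text out of an in-memory HTML string without writing it to a file. It does not use the HTMLParser API discussed on this list; it swaps in the HTML parser that ships with the JDK (`javax.swing.text.html.parser.ParserDelegator`), fed from a `StringReader`. The class and method names (`HtmlTextExtractor`, `extractText`) are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text nodes of an HTML string.
final class HtmlTextExtractor {

    /** Returns the text found in the given HTML, with the tags dropped. */
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate consecutive text runs with a single space.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        try {
            // The third argument tells the JDK parser to ignore charset
            // declarations, which is fine for a String already in memory.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

Calling `HtmlTextExtractor.extractText("<p>Hello <b>world</b></p>")` yields the two text runs joined by whitespace; the exact spacing depends on how the JDK parser splits the runs.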
From: Joe L. <gu...@ya...> - 2003-03-08 09:32:43
|
Hi,

It seems that the parser has a problem handling Chinese characters. I experimented with a simple web page as follows (saved as "test.html"):

    <HTML>
    <HEAD>
    <TITLE>Hello</TITLE>
    <META http-equiv=Content-Type content="text/html; charset=gb2312">
    </HEAD>
    <BODY bgColor=#ffffff>
    <h1>Hello</h1><br>
    </body>
    </html>

I then ran the parser as:

    java -jar htmlparser.jar file:test.html

The parser output nothing but:

    HTMLParser v1.3 (Integration Build Mar 02, 2003)
    Parsing file:test.html
    INFO: detected charset "gb2312", using "EUC-CN"

Thanks for any help.

Joe |
From: Somik R. <so...@ya...> - 2003-03-08 05:12:12
|
Hi Richard,

> Could someone clarify the licensing situation / fulfilment requirements
> of HTMLParser with regard to its inclusion as part of an otherwise
> closed-source commercial app.

Thanks for bringing up this question. The parser is licensed under the LGPL. This means that applications that USE it don't have to be open-source. But here are two restrictions that apply:

[1] Any modifications made to the library itself must be kept open-source or made available.
[2] Your app source code does not live with the parser source code, but the object code does. That means people should either be able to reverse engineer your product so as to be able to remove the parser library and put a newer version in (gasp!), or you simply provide an external linkage to the parser, whereby folks can swap out the current version for a later version (the idea is to let them have the benefit of the open-source library).

That reverse engineering stuff is actually a cryptic interpretation of the clause, applicable only if you want to provide a single executable in your application (it can be bypassed, but I don't want to further complicate the interpretation for you - let me know if this is the case and I can advise you accordingly).

By the way, if you are not distributing your application and are only using it internally, none of the above applies.

Let me know if that answers your question.

Regards,
Somik

********************************************
Somik Raha
Extreme Programmer and Coach
Industrial Logic, Inc.
so...@in...
http://industriallogic.com
Voice : 510-540-8336
Fax : 510-540-8936
********************************************
Periodic reassessment means looking at things which are taken for granted, things which seem beyond doubt. Periodic reassessment means challenging all assumptions. It is not a matter of reassessing something because there is a need to reassess it; there may be no need at all. It is a matter of reassessing something simply because it is there and has not been assessed for a long time. It is a deliberate and quite unjustified attempt to look at things in a new way.
--- Edward De Bono in Lateral Thinking, Chapter 5, The Use of Lateral Thinking |
From: Joe L. <gu...@ya...> - 2003-03-05 05:42:10
|
Hi,

I need to change links embedded inside the code of a script tag, such as:

    <script language="Javascript">
    window.open("http://mysite/index.html");
    </script>

ScriptTag only has getScriptCode(); there is no setScriptCode() available. Has anyone changed links inside JavaScript? Can you please suggest a good way to do this?

Also, how about inline JavaScript such as:

    <form ....>
    <input type="button" onClick="<script window.open..../>">
    </form>

Thanks so much for the help!

Joe |
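One possible approach, sketched under assumptions: take the string returned by getScriptCode(), rewrite the quoted URLs with a regular expression, and write the modified script out yourself. The class below is plain JDK code and not part of HTMLParser; the names `ScriptLinkRewriter`, `rewrite`, `oldPrefix` and `newPrefix` are hypothetical. It only handles absolute http/https URLs that appear as quoted string literals, so URLs assembled at runtime by the script are out of reach.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: rewrites quoted absolute URLs in a script string.
final class ScriptLinkRewriter {

    // A quote character, an http(s) URL without quotes in it, then the same quote.
    private static final Pattern QUOTED_URL =
            Pattern.compile("([\"'])(https?://[^\"']+)\\1");

    /** Replaces oldPrefix with newPrefix at the start of every quoted URL. */
    static String rewrite(String scriptCode, String oldPrefix, String newPrefix) {
        Matcher matcher = QUOTED_URL.matcher(scriptCode);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String quote = matcher.group(1);
            String url = matcher.group(2);
            if (url.startsWith(oldPrefix)) {
                url = newPrefix + url.substring(oldPrefix.length());
            }
            // quoteReplacement keeps '$' and '\' in the URL from being
            // interpreted as regex replacement syntax.
            matcher.appendReplacement(out, Matcher.quoteReplacement(quote + url + quote));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

The same helper could be applied to the value of an onClick attribute, since that is also just a string of script code.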
From: Somik R. <so...@ya...> - 2003-03-04 14:52:23
|
You can register the HtmlScanner and BodyScanner along with the other scanners:

    parser.registerScanners();
    parser.addScanner(new HtmlScanner());
    parser.addScanner(new BodyScanner());

Then, go ahead and parse.

    Html html = (Html)parser.extractAllNodesThatAre(Html.class)[0];

This is your DOM object - you can do searches, get the children (children()), etc. on this. To make it more powerful, register other scanners like Table, Div, Span, etc. (currently we don't do this automatically, but you can expect a method - registerDomScanners() - in the near future).

Regards,
Somik

----- Original Message -----
From: deva
To: htm...@li...
Sent: Monday, March 03, 2003 2:20 PM
Subject: [Htmlparser-user] parse from a string

> Hi,
> I'm trying to parse a string containing html tags. Can u please guide me how can I do this using htmlparser. My output will b a DOM document or DOM DocumentFragment. The String of html tags is coming from a database.
> Thanks
> Deva kamatham |
From: deva <de...@ed...> - 2003-03-03 22:16:53
|
Hi,

I'm trying to parse a string containing HTML tags. Can you please guide me on how to do this using htmlparser? My output will be a DOM Document or DOM DocumentFragment. The string of HTML tags comes from a database.

Thanks,
Deva kamatham |
From: Somik R. <so...@ya...> - 2003-03-03 03:52:39
|
Hi Folks,

In this week's release, the change log is:

Integration build 1.3 - 20030302
--------------------------------
[1] Fixed bug in LinkScanner
[2] Cleaned up StringNode interface
[3] Cleaned up RemarkNode interface
[4] Refactored Parser, created ParserHelper

Regards,
Somik |
From: Rich W. <ri...@wi...> - 2003-03-03 01:30:38
|
Hi Somik,

Many websites require that you be able to send and receive cookies in order to navigate the site properly. The cookie information tells the website things like who you are and where you are. It is often used as a security measure as well, to prevent spidering or data parsing.

A cookie jar allows for total browser emulation. Here is a similar Perl module that accomplishes this:
http://search.cpan.org/author/GAAS/libwww-perl-5.69/lib/HTTP/Cookies.pm

Thanks for the reply!
rw

----- Original Message -----
From: Somik Raha
To: htm...@li...
Sent: Sunday, March 02, 2003 7:05 PM
Subject: Re: [Htmlparser-user] Cookies

> > Do these libraries have anything that will handle cookies? CookieJar functionality?
>
> Not yet. Can you give us a use case - we could look into adding it.
>
> Regards,
> Somik
>
> ----- Original Message -----
> From: Rich Williams
> To: htm...@li...
> Sent: Sunday, March 02, 2003 1:35 AM
> Subject: [Htmlparser-user] Cookies
>
> > Hi all,
> > Do these libraries have anything that will handle cookies? CookieJar functionality?
> > thanks
> > rw |
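For readers looking for cookie-jar behaviour today, here is a small sketch using `java.net.CookieManager`. Note the assumptions: this class belongs to the JDK (it arrived in Java 6, well after this thread) and is not an HTMLParser feature; the wrapper class `CookieJarSketch` and its `roundTrip` method are hypothetical names. The sketch stores the Set-Cookie values from one response and asks the jar which Cookie values it would send on the next request to the same URI.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical wrapper around the JDK's in-memory cookie jar.
final class CookieJarSketch {

    /**
     * Records the given Set-Cookie header values as if they came back from
     * uri, then returns the Cookie header values the jar would send to uri.
     */
    static List<String> roundTrip(CookieManager jar, URI uri, List<String> setCookieValues) {
        try {
            Map<String, List<String>> responseHeaders = new HashMap<>();
            responseHeaders.put("Set-Cookie", setCookieValues);
            jar.put(uri, responseHeaders);

            Map<String, List<String>> requestHeaders =
                    jar.get(uri, Collections.<String, List<String>>emptyMap());
            List<String> cookies = requestHeaders.get("Cookie");
            return cookies == null ? Collections.<String>emptyList() : cookies;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In an application you would normally not call `put` and `get` by hand: installing the jar once with `CookieHandler.setDefault(new CookieManager())` lets `HttpURLConnection` store and replay cookies on its own.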
From: Somik R. <so...@ya...> - 2003-03-03 00:09:26
|
Joe Lin wrote:
> Another question regarding the collectInto(NodeList collectionList, java.lang.String filter) method: I could not seem to find the filter constants for the different Node types. Can anyone point me to where these are?

After moving to the class parameters, this method has become redundant. We're planning to take it out. You're better off using the other techniques (the other collectInto or TagFindingVisitor).

> BTW, I think HTMLParser is great software. I have been looking for a Java HTML parser high and low. HTMLParser represents the best architecture and user API to me. I especially like that it is in a sense a streaming parser. This means performance and optimal memory usage for me.

Thanks for the kind words. We've got a diverse and talented set of people who've been making contributions over a period of time. Kind words always help inspire us to serve the community better.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2003-03-03 00:06:02
|
Joe Lin wrote:
> Are operations on NodeList thread-safe? I would prefer they are not, for performance reasons. That way we can synchronize on the operations if needed.

No, NodeList is not thread-safe. It was written as an alternative to Vector, and is better than using the existing collections because there is no downcasting. So I doubt you can get better performance than this.

Regards,
Somik |
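Since the list is not thread-safe, callers that share one across threads have to synchronize themselves. The sketch below shows that pattern with a plain `ArrayList<String>` standing in for NodeList (an assumption made so the example compiles without the library); `GuardedNodeList` and `fillConcurrently` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper: every access to the unsynchronized list goes
// through one private lock.
final class GuardedNodeList {

    private final List<String> nodes = new ArrayList<>(); // stand-in for NodeList
    private final Object lock = new Object();

    void add(String node) {
        synchronized (lock) {
            nodes.add(node);
        }
    }

    int size() {
        synchronized (lock) {
            return nodes.size();
        }
    }

    /** Adds perThread items from each of several threads and returns the final size. */
    static int fillConcurrently(int threads, final int perThread) {
        final GuardedNodeList list = new GuardedNodeList();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    list.add("node");
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return list.size();
    }
}
```

Single-threaded code pays nothing for this, which matches the preference stated in the question: the lock is only taken by callers that opt into the wrapper.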
From: Somik R. <so...@ya...> - 2003-03-03 00:04:11
|
> Do these libraries have anything that will handle cookies? CookieJar functionality?

Not yet. Can you give us a use case - we could look into adding it.

Regards,
Somik

----- Original Message -----
From: Rich Williams
To: htm...@li...
Sent: Sunday, March 02, 2003 1:35 AM
Subject: [Htmlparser-user] Cookies

> Hi all,
> Do these libraries have anything that will handle cookies? CookieJar functionality?
> thanks
> rw |
From: Somik R. <so...@ya...> - 2003-03-03 00:03:22
|
Avi Bentov wrote:
> Since I was unable to find the answer to the question I am looking for, in all the places you suggested.
> I hope you will be able to answer me and I thank you in advance: I am trying to figure out how to use the code to write help system for an Application.
> My first problem is how to incorporate the htmlparser's packages, for instance org.htmlparser, with the jdk (j2sdk1.4.1_01), so that I will be able to compile the classes I write.

Put htmlparser.jar in your classpath. Look at http://htmlparser.sourceforge.net/docs/ to get started with the parser.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2003-03-03 00:02:11
|
> How to get to the Htmlparser-user Archives
> <http://sourceforge.net/mailarchive/forum.php?forum=htmlparser-user>.
> I receive an Error instead of the page ????

Strange - I clicked the link and could view the archives.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2003-03-03 00:01:22
|
> I was looking into the example code provided with the 1.3 version distribution regarding extracting embedded links/images. I notice that I would have to specify a class filter in the collectInto method of the Node class. Suppose that I want to collect several types of embedded tags in the same call; this method won't allow me to do so. Is it possible to pass a Class[] as the filter to this method?

You could do it a different way - look at
http://htmlparser.sourceforge.net/docs/index.php/CustomTagExtraction

Regards,
Somik |
From: Joe L. <gu...@ya...> - 2003-03-02 23:43:41
|
Hi,

I was looking into the example code provided with the 1.3 version distribution regarding extracting embedded links/images. I notice that I would have to specify a class filter in the collectInto method of the Node class. Suppose that I want to collect several types of embedded tags in the same call; this method won't allow me to do so. Is it possible to pass a Class[] as the filter to this method?

Another question regarding the collectInto(NodeList collectionList, java.lang.String filter) method: I could not seem to find the filter constants for the different Node types. Can anyone point me to where these are?

BTW, I think HTMLParser is great software. I have been looking for a Java HTML parser high and low. HTMLParser represents the best architecture and user API to me. I especially like that it is in a sense a streaming parser. This means performance and optimal memory usage for me.

Thanks.

Joe |
From: Joe L. <gu...@ya...> - 2003-03-02 23:27:12
|
Hi,

Are operations on NodeList thread-safe? I would prefer they are not, for performance reasons. That way we can synchronize on the operations if needed.

Thanks.

Joe |
From: Rich W. <ri...@wi...> - 2003-03-02 11:46:57
|
Hi all,

Do these libraries have anything that will handle cookies? CookieJar functionality?

thanks
rw |
From: Avi B. <to...@ya...> - 2003-03-02 09:50:53
|
I was unable to find the answer to my question in any of the places you suggested, so I hope you will be able to answer me, and I thank you in advance.

I am trying to figure out how to use the code to write a help system for an application. My first problem is how to incorporate the htmlparser packages, for instance org.htmlparser, with the JDK (j2sdk1.4.1_01), so that I will be able to compile the classes I write.

Thank you again,
A. B. |
From: Avi B. <to...@ya...> - 2003-03-02 09:08:32
|
How do I get to the Htmlparser-user Archives <http://sourceforge.net/mailarchive/forum.php?forum=htmlparser-user>? I receive an error instead of the page. |
From: Derrick O. <Der...@ro...> - 2003-02-27 02:54:12
|
If I recall correctly, implementing this feature would require deferring not only the connect but also the determination of the character set (from the header returned by the connect) and the creation of the reader (because it needs the character set and an input stream) until elements() is called. elements() would need to check for a null reader and do the work. Then getReader() and getEncoding() would also have to handle a null reader or a null character_set. Are there other subtleties? Maybe tricky, but probably do-able. I think all the constructors have test cases.

But then, all that's really being saved is the user coding:

    Parser parser = new Parser ("http://yadda");
    URL url = parser.getConnection ();
    ...process the url as appropriate
    ...
    parser.elements ()

instead of:

    URL url = new URL ("http://yadda");
    url.openConnection ();
    ...process the url as appropriate
    Parser parser = new Parser (url);
    ...
    parser.elements ()

So it's probably not really worth the convoluted coding, unless I'm missing something in the use-case.

Derrick

htm...@li... wrote:
>
>Also, on another note, if I try to initialize the parser directly, I am unable to work with the URLConnection. For example:
>
> HttpURLConnection urlConn = null;
> HTMLParser parser = new HTMLParser("http://somedomain/somepath");
> urlConn = (HttpURLConnection)parser.getConnection();
> urlConn.setDoInput(true);
> // ...
>
>This code throws an exception because the HTTP request has already been made.
>
>Exception in thread "main" java.lang.IllegalAccessError: Already connected
> at java.net.URLConnection.setDoInput(URLConnection.java:677)
>
>--- Bob Lewis <bob...@ya...> wrote:
>
> <snip>
>
>--__--__--
>
>Message: 3
>From: "Somik Raha" <so...@ya...>
>To: <htm...@li...>
>Subject: Re: [Htmlparser-user] Malformed Input Exception
>Date: Tue, 25 Feb 2003 22:46:16 -0800
>Reply-To: htm...@li...
>
>That sounds like a good feature request. Derrick ->what do you think ?
>
>Regards,
>Somik |
From: Mohd-Taqiyuddin Z. <mt...@ec...> - 2003-02-27 00:21:15
|
Hi there,

I think a form tag should end when the parser sees another form tag, even though that is not an end tag. Another way of determining where a form tag ends is to check whether the end of the HTML page has been reached, by looking for the end tag of the html tag.

The reason is that a form tag consists of input tags, and the important information about a form is its method, its action, and its input tags. So when the parser first sees a form tag, it would parse nodes until it sees the form's end tag, another form tag, or the end of the HTML document. That way we can logically group the Vector of input tags and the other attributes with the appropriate form tag (if there is more than one form tag).

I hope my explanation can help us improve htmlparser. Thank you.

Quoting Somik Raha <so...@ya...>:
> This is a known limitation. The problem is in guessing when a form tag really should have ended. Can you suggest something looking at the page that failed ?
>
> Regards,
> Somik
> --- Mohd-Taqiyuddin Zalfan <mt...@ec...> wrote:
> > Hi,
> >
> > I'm doing my harvester to harvest information in the formtag. It works fine when I parse to any html pages that I need to parse except for this URL
> > http://developer.java.sun.com/developer/Quizzes/misc/earlyadopterjxta.html.
> > It seems that the page that gives the error does not have an endtag for the formtag and the parser loopback to find the endtag for the formtag. Is this a bug? Do you know a solution that I can still parse the page and still get the Vector FormInput for further processing. Hope you can help me on this.
> > below is the generated error.
> > "
> > ERROR: HTMLReader.readElement() : Error occurred while trying to decipher the tag using scanners
> > Tag being processed : FORM
> > Current Tag Line : <form action="earlyadopterjxtaanswers.jsp" method="POST">
> > at Line 690 : null
> > Previous Line 689 : </HTML>
> > ERROR: HTMLReader.readElement() : Error occurred while trying to read the next element,
> > at Line 690 : null
> > Previous Line 689 : </HTML>
> > ERROR: Unexpected Exception occurred while reading http://developer.java.sun.com/developer/Quizzes/misc/earlyadopterjxta.html, in nextHTMLNode
> > at Line 690 : null
> > Previous Line 689 : </HTML>
> > org.htmlparser.util.ParserException: Unexpected Exception occurred while reading http://developer.java.sun.com/developer/Quizzes/misc/earlyadopterjxta.html, in nextHTMLNode
> > at Line 690 : null
> > Previous Line 689 : </HTML>" |
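The heuristic proposed in this message can be sketched as follows. This is an illustration only, not HTMLParser's FormScanner: it tokenizes with a simple regular expression instead of the library's scanners, and the names `FormGrouper`, `Form` and `group` are hypothetical. An open form is closed by `</form>`, by the next `<form>`, by `</html>`, or by the end of the input, and each `<input>` is attached to whichever form is open at that point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of "a form ends at </form>, the next <form>, or </html>".
final class FormGrouper {

    /** One form: its opening tag text and the input tags grouped under it. */
    static final class Form {
        final String openTag;
        final List<String> inputs = new ArrayList<>();

        Form(String openTag) {
            this.openTag = openTag;
        }
    }

    // Only the three tag names the heuristic cares about.
    private static final Pattern TAG =
            Pattern.compile("<(/?)(form|input|html)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    static List<Form> group(String html) {
        List<Form> forms = new ArrayList<>();
        Form current = null; // the form that is open right now, if any
        Matcher matcher = TAG.matcher(html);
        while (matcher.find()) {
            boolean closing = !matcher.group(1).isEmpty();
            String name = matcher.group(2).toLowerCase(Locale.ROOT);
            if (name.equals("form") && !closing) {
                // A new <form> implicitly ends any form that is still open.
                current = new Form(matcher.group());
                forms.add(current);
            } else if (closing && (name.equals("form") || name.equals("html"))) {
                current = null;
            } else if (name.equals("input") && !closing && current != null) {
                current.inputs.add(matcher.group());
            }
        }
        // Reaching the end of the input also ends the open form; nothing more to do.
        return forms;
    }
}
```

On a page with an unclosed first form followed by a second form, this yields two `Form` objects with their own inputs, rather than failing at the end of the document.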
From: Somik R. <so...@ya...> - 2003-02-26 18:05:06
|
This is a known limitation. The problem is in guessing when a form tag really should have ended. Can you suggest something, looking at the page that failed?

Regards,
Somik

--- Mohd-Taqiyuddin Zalfan <mt...@ec...> wrote:
> Hi,
>
> I'm doing my harvester to harvest information in the formtag. It works fine when I parse to any html pages that I need to parse except for this URL
> http://developer.java.sun.com/developer/Quizzes/misc/earlyadopterjxta.html.
> It seems that the page that gives the error does not have an endtag for the formtag and the parser loopback to find the endtag for the formtag. Is this a bug? Do you know a solution that I can still parse the page and still get the Vector FormInput for further processing. Hope you can help me on this.
> below is the generated error.
> "
> ERROR: HTMLReader.readElement() : Error occurred while trying to decipher the tag using scanners
> Tag being processed : FORM
> Current Tag Line : <form action="earlyadopterjxtaanswers.jsp" method="POST">
> at Line 690 : null
> Previous Line 689 : </HTML>
> ERROR: HTMLReader.readElement() : Error occurred while trying to read the next element,
> at Line 690 : null
> Previous Line 689 : </HTML>
> ERROR: Unexpected Exception occurred while reading http://developer.java.sun.com/developer/Quizzes/misc/earlyadopterjxta.html, in nextHTMLNode
> at Line 690 : null
> Previous Line 689 : </HTML>
> org.htmlparser.util.ParserException: Unexpected Exception occurred while reading http://developer.java.sun.com/developer/Quizzes/misc/earlyadopterjxta.html, in nextHTMLNode
> at Line 690 : null
> Previous Line 689 : </HTML>" |
From: Mohd-Taqiyuddin Z. <mt...@ec...> - 2003-02-26 16:34:06
|
Hi,

I'm writing a harvester to collect information from form tags. It works fine on every HTML page I need to parse except for this URL:
http://developer.java.sun.com/developer/Quizzes/misc/earlyadopterjxta.html

It seems that the failing page does not have an end tag for the form tag, and the parser loops back to find the end tag for the form. Is this a bug? Do you know of a way I can still parse the page and get the FormInput Vector for further processing? I hope you can help me with this. Below is the generated error.

"
ERROR: HTMLReader.readElement() : Error occurred while trying to decipher the tag using scanners
Tag being processed : FORM
Current Tag Line : <form action="earlyadopterjxtaanswers.jsp" method="POST">
at Line 690 : null
Previous Line 689 : </HTML>
ERROR: HTMLReader.readElement() : Error occurred while trying to read the next element,
at Line 690 : null
Previous Line 689 : </HTML>
ERROR: Unexpected Exception occurred while reading http://developer.java.sun.com/developer/Quizzes/misc/earlyadopterjxta.html, in nextHTMLNode
at Line 690 : null
Previous Line 689 : </HTML>
org.htmlparser.util.ParserException: Unexpected Exception occurred while reading http://developer.java.sun.com/developer/Quizzes/misc/earlyadopterjxta.html, in nextHTMLNode
at Line 690 : null
Previous Line 689 : </HTML>" |
From: Bob L. <bob...@ya...> - 2003-02-26 16:16:41
|
Hi,

I tried this, as you suggested, and received the same exception while reading the InputStream. That led me to discover that I was setting the wrong character set in the InputStreamReader. My app was erroneously using the system default character set (UTF8 in this case), but the actual stream was using ISO-8859-1.

The getCharset and getCharacterSet methods in Parser are very useful here. You may want to consider making them static and public, or moving them to a utility class. That way they can be used by applications which construct their own Readers.

Thanks for the help,
Bob Lewis

--- Somik Raha <so...@ya...> wrote:
> Hi Bob,
> Can you try this - get the data from the url in question into a file (using a post request). Then try to parse the file. If it breaks, we would know why.
>
> Regards,
> Somik
> ----- Original Message -----
> From: "Bob Lewis" <bob...@ya...>
> To: <htm...@li...>
> Sent: Tuesday, February 25, 2003 12:07 PM
> Subject: Re: [Htmlparser-user] Malformed Input Exception
>
> > I tried using the parser directly, as you suggested, and it seems to work. However, I need to be able work with the URLConnection to set headers, cookies and send POST data.
> >
> > Typically, this is what I'm doing:
> >
> > //create and initialize the URL Connection
> > HttpURLConnection urlConn = null;
> > URL url = new URL("http://somedomain/somepath");
> > urlConn = (HttpURLConnection)url.openConnection();
> > urlConn.setDoInput(true);
> > urlConn.setDoOutput(true);
> > urlConn.setUseCaches(false);
> > urlConn.setAllowUserInteraction(false);
> > urlConn.setRequestMethod("POST");
> >
> > // ... usually many HTTP Headers and cookie values set
> > urlConn.setRequestProperty("someHeader", "someValue");
> > urlConn.setRequestProperty("anotherHeader", "anotherValue");
> >
> > StringBuffer postData = new StringBuffer();
> > // ... generate post data in buffer
> >
> > //Send the post data
> > PrintWriter printWriter = new PrintWriter(urlConn.getOutputStream());
> > printWriter.println(postData.toString());
> > printWriter.close();
> >
> > //parse the response
> > HTMLEnumeration tags = parser.elements();
> >
> > while (parser.hasMoreNodes())
> > {
> > // ... Do Something
> > }
> >
> > This works fine on most URLs. I am normally able to execute the server-side web application, obtain and parse the HTML response. However, in the case of these two URLs, I get the MalformedInputException.
> >
> > Is there something I'm missing?
> >
> > Thanks,
> >
> > Bob Lewis
> >
> > --- Somik Raha <so...@ya...> wrote:
> >
> > >Date: 2003-02-24 21:33
> > >Sender: somik
> > >Logged In: YES
> > >user_id=187944
> > >
> > >I ran the parser on these pages and it worked fine. Try
> > >runParser.bat http://www.flytango.com/en/index.html.
> > >
> > >It could be that you have initialized your urlconnection incorrectly. Have you tried using the parser directly, like :
> > >
> > >HTMLParser parser = new HTMLParser("http://www.flytango.com/en/index.html");
> > >for (NodeIterator i=parser.elements();i.hasMoreNodes();) {
> > > System.out.println(i.nextNode().toHtml());
> > >}
> >
> > --- Somik Raha <so...@ya...> wrote:
> > > Hi Bob,
> > > Sounds like a bug.
> > > Can you file a bug report at http://htmlparser.sourceforge.net?
> > >
> > > Regards,
> > > Somik
> > > --- Bob Lewis <bob...@ya...> wrote:
> > > > Hi,
> > > >
> > > > I am trying to use htmlparser 1.3 to parse the HTML at
> > > > http://www.flytango.com/en/taschedule.html and
> > > > http://www.flytango.com/en/index.html. When I attempt to parse these pages, I get com.sun.io.MalformedInputException:
> > > >
> > > > sun.io.MalformedInputException
> > > > at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:105)
> > > > at java.io.InputStreamReader.convertInto(InputStreamReader.java:132)
> > > > at java.io.InputStreamReader.fill(InputStreamReader.java:181)
> > > > at java.io.InputStreamReader.read(InputStreamReader.java:244)
> > > > at java.io.BufferedReader.fill(BufferedReader.java:134)
> > > > at java.io.BufferedReader.readLine(BufferedReader.java:294)
> > > > at java.io.BufferedReader.readLine(BufferedReader.java:357)
> > > > at org.htmlparser.HTMLReader.getNextLine(HTMLReader.java:139)
> > > > at org.htmlparser.HTMLReader.readElement(HTMLReader.java:176)
> > > > at org.htmlparser.util.HTMLEnumerationImpl.peek(HTMLEnumerationImpl.java:60)
> > > > at org.htmlparser.util.HTMLEnumerationImpl.hasMoreNodes(HTMLEnumerationImpl.java:91)
> > > >
> > > > Now, if I copy the source of these pages from a browser into a file and put them on my own webserver, I can parse them without any errors.
> > > >
> > > > It's my guess that there is some strange control character in the source that is causing the exception, but I'm not entirely sure. Any suggestions? If it is a bad character, would it be possible to add code to HTMLReader that strips offending characters from the input stream?
> > > >
=== message truncated === |
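The root cause described in this message (decoding an ISO-8859-1 stream with the platform default charset) can be avoided by choosing the charset from the response's Content-Type header when you build your own Reader. The sketch below is plain JDK code, not the Parser.getCharset method mentioned above; `CharsetSniffer`, `charsetOf` and `decode` are hypothetical names, and the ISO-8859-1 fallback is an assumption based on the historical HTTP/1.1 default for text types.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper: picks a charset from a Content-Type header value.
final class CharsetSniffer {

    /** Returns the charset named in contentType, or fallback if none is usable. */
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // Some servers quote the value: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    if (!name.isEmpty() && Charset.isSupported(name)) {
                        return name;
                    }
                } catch (IllegalArgumentException e) {
                    // Illegal charset name: fall through to the fallback.
                }
            }
        }
        return fallback;
    }

    /** Decodes a response body using the charset from its Content-Type header. */
    static String decode(byte[] body, String contentType) {
        // Assumption: ISO-8859-1 when the header names no charset.
        return new String(body, Charset.forName(charsetOf(contentType, "ISO-8859-1")));
    }
}
```

With a `URLConnection`, the header value is available from `urlConn.getContentType()`, so a Reader could be built as `new InputStreamReader(urlConn.getInputStream(), CharsetSniffer.charsetOf(urlConn.getContentType(), "ISO-8859-1"))`.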
From: Somik R. <so...@ya...> - 2003-02-26 06:44:52
|
That sounds like a good feature request. Derrick -> what do you think?

Regards,
Somik

----- Original Message -----
From: "Bob Lewis" <bob...@ya...>
To: <htm...@li...>
Sent: Tuesday, February 25, 2003 12:20 PM
Subject: Re: [Htmlparser-user] Malformed Input Exception

> Sorry, there was a typo in my last message:
>
> > while (parser.hasMoreNodes())
> > {
> > // ... Do Something
> > }
>
> should be
>
> while (tags.hasMoreNodes())
> {
> // ... Do Something
> }
>
> Also, on another note, if I try to initialize the parser directly, I am unable to work with the URLConnection. For example:
>
> HttpURLConnection urlConn = null;
> HTMLParser parser = new HTMLParser("http://somedomain/somepath");
> urlConn = (HttpURLConnection)parser.getConnection();
> urlConn.setDoInput(true);
> // ...
>
> This code throws an exception because the HTTP request has already been made.
>
> Exception in thread "main" java.lang.IllegalAccessError: Already connected
> at java.net.URLConnection.setDoInput(URLConnection.java:677)
>
> --- Bob Lewis <bob...@ya...> wrote:
> >
> > I tried using the parser directly, as you suggested, and it seems to work. However, I need to be able work with the URLConnection to set headers, cookies and send POST data.
> >
> > Typically, this is what I'm doing:
> >
> > //create and initialize the URL Connection
> > HttpURLConnection urlConn = null;
> > URL url = new URL("http://somedomain/somepath");
> > urlConn = (HttpURLConnection)url.openConnection();
> > urlConn.setDoInput(true);
> > urlConn.setDoOutput(true);
> > urlConn.setUseCaches(false);
> > urlConn.setAllowUserInteraction(false);
> > urlConn.setRequestMethod("POST");
> >
> > // ... usually many HTTP Headers and cookie values set
> > urlConn.setRequestProperty("someHeader", "someValue");
> > urlConn.setRequestProperty("anotherHeader", "anotherValue");
> >
> > StringBuffer postData = new StringBuffer();
> > // ... generate post data in buffer
> >
> > //Send the post data
> > PrintWriter printWriter = new PrintWriter(urlConn.getOutputStream());
> > printWriter.println(postData.toString());
> > printWriter.close();
> >
> > //parse the response
> > HTMLEnumeration tags = parser.elements();
> >
> > while (parser.hasMoreNodes())
> > {
> > // ... Do Something
> > }
> >
> > This works fine on most URLs. I am normally able to execute the server-side web application, obtain and parse the HTML response. However, in the case of these two URLs, I get the MalformedInputException.
> >
> > Is there something I'm missing?
> >
> > Thanks,
> >
> > Bob Lewis
> >
> > --- Somik Raha <so...@ya...> wrote:
> >
> > >Date: 2003-02-24 21:33
> > >Sender: somik
> > >Logged In: YES
> > >user_id=187944
> > >
> > >I ran the parser on these pages and it worked fine. Try
> > >runParser.bat http://www.flytango.com/en/index.html.
> > >
> > >It could be that you have initialized your urlconnection incorrectly. Have you tried using the parser directly, like :
> > >
> > >HTMLParser parser = new HTMLParser("http://www.flytango.com/en/index.html");
> > >for (NodeIterator i=parser.elements();i.hasMoreNodes();) {
> > > System.out.println(i.nextNode().toHtml());
> > >}
> >
> > --- Somik Raha <so...@ya...> wrote:
> > > Hi Bob,
> > > Sounds like a bug.
> > > Can you file a bug report at http://htmlparser.sourceforge.net?
> > >
> > > Regards,
> > > Somik
> > > --- Bob Lewis <bob...@ya...> wrote:
> > > > Hi,
> > > >
> > > > I am trying to use htmlparser 1.3 to parse the HTML at
> > > > http://www.flytango.com/en/taschedule.html and
> > > > http://www.flytango.com/en/index.html. When I attempt to parse these pages, I get com.sun.io.MalformedInputException:
> > > >
> > > > sun.io.MalformedInputException
> > > > at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:105)
> > > > at java.io.InputStreamReader.convertInto(InputStreamReader.java:132)
> > > > at java.io.InputStreamReader.fill(InputStreamReader.java:181)
> > > > at java.io.InputStreamReader.read(InputStreamReader.java:244)
> > > > at java.io.BufferedReader.fill(BufferedReader.java:134)
> > > > at java.io.BufferedReader.readLine(BufferedReader.java:294)
> > > > at java.io.BufferedReader.readLine(BufferedReader.java:357)
> > > > at org.htmlparser.HTMLReader.getNextLine(HTMLReader.java:139)
> > > > at org.htmlparser.HTMLReader.readElement(HTMLReader.java:176)
> > > > at org.htmlparser.util.HTMLEnumerationImpl.peek(HTMLEnumerationImpl.java:60)
> > > > at org.htmlparser.util.HTMLEnumerationImpl.hasMoreNodes(HTMLEnumerationImpl.java:91)
> > > >
> > > > Now, if I copy the source of these pages from a browser into a file and put them on my own webserver, I can parse them without any errors.
> > > >
> > > > It's my guess that there is some strange control character in the source that is causing the exception, but I'm not entirely sure. Any suggestions? If it is a bad character, would it be possible to add code to HTMLReader that strips offending characters from the input stream?
> > > > > > > > Here is the code I am using to parse: > > > > > > > > DefaultHTMLParserFeedback feedback > > > > = new > > > > > > > > > > DefaultHTMLParserFeedback(DefaultHTMLParserFeedback.DEBUG); > > > > > > > > HTMLReader reader = null; > > > > HTMLParser parser = null; > > > > InputStreamReader isr > > > > = new > > > > InputStreamReader(urlConn.getInputStream()); > > > > reader = new HTMLReader(isr, 8192); > > > > parser = new HTMLParser(reader, > > feedback); > > > > boolean inForm = false; > > > > > > > > parser.addScanner(new > > > > HTMLInputTagScanner()); > > > > > > > > HTMLEnumeration tags = > > parser.elements(); > > > > > > > > RequestParameters params = new > > > > RequestParameters(); > > > > > > > > while (tags.hasMoreNodes()) > > > > { > > > > ... > > > > } > > > > > > > > > > > > Thanks, > > > > > > > > Bob Lewis > > > > > > > === message truncated === > > > __________________________________________________ > Do you Yahoo!? > Yahoo! Tax Center - forms, calculators, tips, more > http://taxes.yahoo.com/ > > > ------------------------------------------------------- > This sf.net email is sponsored by:ThinkGeek > Welcome to geek heaven. > http://thinkgeek.com/sf > _______________________________________________ > Htmlparser-user mailing list > Htm...@li... > https://lists.sourceforge.net/lists/listinfo/htmlparser-user |
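For anyone running into the same MalformedInputException: the following is a minimal sketch of the "strip offending characters" idea raised in the thread, done at the decoding layer rather than inside HTMLReader. It assumes the failure comes from bytes that are not valid UTF-8, which the thread does not confirm. The class name LenientUtf8 and its methods are hypothetical and not part of htmlparser; only the java.io and java.nio.charset calls are standard JDK API. Each malformed byte sequence is replaced with U+FFFD instead of aborting the read.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of htmlparser.
public class LenientUtf8 {

    // Wraps a byte stream in a Reader whose UTF-8 decoder substitutes
    // U+FFFD for malformed input instead of throwing.
    public static Reader lenientReader(InputStream in) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return new BufferedReader(new InputStreamReader(in, decoder));
    }

    // Drains a Reader into a String (helper for the demo below).
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 0xE9 is "é" in ISO-8859-1 but an incomplete sequence in UTF-8.
        byte[] page = {'<', 'p', '>', (byte) 0xE9, '<', '/', 'p', '>'};
        String text = readAll(lenientReader(new ByteArrayInputStream(page)));
        System.out.println(text);
    }
}
```

In the code quoted above, the Reader returned by lenientReader(urlConn.getInputStream()) would take the place of the plain InputStreamReader, provided the HTMLReader constructor accepts any java.io.Reader (an assumption; the thread only shows it being passed an InputStreamReader). If the page is actually served in another encoding such as ISO-8859-1, decoding with that charset is the better fix, since replacement loses the original characters.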