htmlparser-user Mailing List for HTML Parser (Page 85)
Brought to you by:
derrickoswald
From: Elodie T. <et...@in...> - 2003-02-06 08:04:40
|
Hi,

I noticed that some of the Tag classes have methods that (I guess) let you modify the "source" attribute, such as href or src. These methods are, for example: setBaseURL, setImageURL, setLink... That seems perfect to me, as I have to modify all the relative paths in an HTML document. But I can't find a method that sets the source location in a frame tag, nor in an input tag (when type=image).

What can I do? Would it be too complex for me to try to add such a method to the HTMLFrameTag class?

Regards,
Elodie |
From: Aminudin K. <ami...@mi...> - 2003-02-06 07:13:42
|
Hi,

I am currently testing HTMLParser for my HTML translation engine. FYI, I am using the latest integration release, version 1.3, dated 3 February 2003.

I had a problem when using htmlparser.jar: the compiler couldn't find HTMLVisitor (I mean org.htmlparser.visitors), although it could find HTMLParser. Does this mean that HTMLVisitor is not included in the pre-compiled binary that comes with the integration release? If recompiling is the answer, then I have to learn Ant.

Thanks for the support :)

--------------------Error---------------------------
htmlTrans.java:10: cannot resolve symbol
symbol  : class visitors
location: package htmlparser
import org.htmlparser.visitors;
                     ^
htmlTrans.java:17: cannot resolve symbol
symbol  : class TextExtractingVisitor
location: class htmlTrans
TextExtractingVisitor visitor = new TextExtractingVisitor();
^
htmlTrans.java:17: cannot resolve symbol
symbol  : class TextExtractingVisitor
location: class htmlTrans
TextExtractingVisitor visitor = new TextExtractingVisitor();
                                    ^
3 errors
----------------------------------------------------

Below is the code:

import java.util.*;
import java.io.*;
import org.htmlparser.HTMLParser;
import org.htmlparser.HTMLRemarkNode;
import org.htmlparser.HTMLStringNode;
import org.htmlparser.tags.HTMLEndTag;
import org.htmlparser.tags.HTMLTag;
import org.htmlparser.util.HTMLParserException;
import org.htmlparser.visitors;

public class htmlTrans {
    public static void main(String args[]){
        try {
            HTMLParser parser = new HTMLParser("http://www.yahoo.com");
            TextExtractingVisitor visitor = new TextExtractingVisitor();
            parser.visitAllNodesWith(visitor);
        } catch (HTMLParserException e){
            System.out.println("Error");
        }
    }
} |
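A note on the first error above: the message "symbol : class visitors, location: package htmlparser" suggests the compiler read `import org.htmlparser.visitors;` as a request for a class named `visitors`, not for a package. In Java a whole package is imported with `.*`, and a single class by its full name. The sketch below shows both forms using `java.util` as a stand-in; the class name `ImportForms` is made up for illustration, and whether the jar actually contains the visitors package is a separate question this sketch does not answer.

```java
// Single-type import: names exactly one class.
import java.util.ArrayList;
// Type-import-on-demand: makes every public type of the package available.
import java.util.*;
// "import java.util;" would NOT compile: the compiler would look for
// a class called "util" inside package "java".

class ImportForms {
    // Uses List (from the on-demand import) and ArrayList (single-type import).
    static List<String> names() {
        List<String> list = new ArrayList<>();
        list.add("TextExtractingVisitor");
        return list;
    }

    public static void main(String[] args) {
        System.out.println(names());
    }
}
```

By analogy, the import in the posted program would probably need to read `import org.htmlparser.visitors.TextExtractingVisitor;` or `import org.htmlparser.visitors.*;`.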
From: Somik R. <so...@ya...> - 2003-02-05 18:37:07
|
--- Elodie Tasia <et...@in...> wrote:
> I answer to myself ;o) I think I've found the source
> of my problem.
> I've added :
>
> parser.addScanner( new HTMLFrameSetScanner() );
> parser.addScanner( new HTMLFrameScanner() );
>
> to my code, and it seems to work now.
> But I've discovered another problem : if there is one
> or many <frameset> tags included in a <frameset>,
> they are not detected (only the <frame> tags).
>
> Could someone confirm me that, or I'm totally wrong
> ?

There are many ways to get to the child tags. Here are some.

Assuming you have got the first frameset tag:

for (SimpleEnumeration e = firstFrameSetTag.children(); e.hasMoreNodes(); ) {
    HTMLNode node = e.nextNode();
    if (node instanceof HTMLFrameSetTag) {
        HTMLFrameSetTag frameSetTag = (HTMLFrameSetTag)node;
    }
    if (node instanceof HTMLFrameTag) {
        HTMLFrameTag frameTag = (HTMLFrameTag)node;
    }
}

ALTERNATIVELY: if you are only interested in frameset tags,

parser = new HTMLParser(..);
parser.registerScanners();
HTMLNode [] frameSetTags = parser.extractAllNodesThatAre(HTMLFrameSetTag.class);

If you are interested in both frameset and frame, a cleaner approach is:

public class MyParserVisitor extends HTMLVisitor {
    public void visitTag(HTMLTag tag) {
        if (tag.getTagName().equals("FRAMESET") || tag.getTagName().equals("FRAME")) {
            // Do what you want to do.
        }
    }
    public void get..() {
    }
}

parser = new HTMLParser(..);
MyParserVisitor myParserVisitor = new MyParserVisitor();
parser.visitAllNodesWith(myParserVisitor);
myParserVisitor.get..();

Regards,
Somik |
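To make the visitor idea above concrete without depending on any parser version, here is a minimal, self-contained sketch of a depth-first visit that reaches framesets nested inside framesets. `TagNode`, `TagVisitor`, `FrameCollector` and `FrameDemo` are hypothetical stand-ins written for this illustration; they are not HTMLParser classes, and the real library's traversal may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parsed tag; NOT the HTMLParser API.
class TagNode {
    final String name;
    final List<TagNode> children = new ArrayList<>();

    TagNode(String name, TagNode... kids) {
        this.name = name;
        for (TagNode k : kids) children.add(k);
    }

    // Depth-first: visit this tag first, then every descendant.
    void accept(TagVisitor v) {
        v.visitTag(this);
        for (TagNode c : children) c.accept(v);
    }
}

interface TagVisitor {
    void visitTag(TagNode tag);
}

// Records every FRAMESET and FRAME tag, at any nesting depth.
class FrameCollector implements TagVisitor {
    final List<String> found = new ArrayList<>();

    public void visitTag(TagNode tag) {
        if (tag.name.equals("FRAMESET") || tag.name.equals("FRAME")) {
            found.add(tag.name);
        }
    }
}

class FrameDemo {
    static List<String> collect() {
        // A frameset holding one frame and one nested frameset with two frames.
        TagNode root = new TagNode("FRAMESET",
                new TagNode("FRAME"),
                new TagNode("FRAMESET", new TagNode("FRAME"), new TagNode("FRAME")));
        FrameCollector collector = new FrameCollector();
        root.accept(collector);
        return collector.found;
    }

    public static void main(String[] args) {
        System.out.println(collect());
    }
}
```

Because `accept` recurses into children before returning, the nested frameset and its frames are reported in document order along with the outer ones.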
From: Elodie T. <et...@in...> - 2003-02-05 14:36:23
|
I answer to myself ;o) I think I've found the source of my problem. I've added:

parser.addScanner( new HTMLFrameSetScanner() );
parser.addScanner( new HTMLFrameScanner() );

to my code, and it seems to work now.

But I've discovered another problem: if there are one or more <frameset> tags included in a <frameset>, they are not detected (only the <frame> tags are).

Could someone confirm that, or am I totally wrong?

Thanx.

> Hi,
>
> I want to modify attributes of some HTMLTags that I get. I'm using the following code just to see how the parser works. The problem is that it doesn't detect the <frame> tags, although it gets the other tags (a, img, frameset). I tested it with a HTML document that have 2 or 3 frame tags, and it sees none of them !
>
> What can I do ?
>
> Thanx in advance for your help.
>
> HTMLReader htmlReader = new HTMLReader ( buffer, len );
> HTMLParser parser = new HTMLParser(htmlReader);
> parser.registerScanners();
>
> HTMLEnumeration e = parser.elements();
> while ( e.hasMoreNodes() ) {
>     HTMLNode node = e.nextHTMLNode();
>     if ( node instanceof HTMLLinkTag ) {
>         logger.debug ( " href = " + ((HTMLLinkTag) node).getLink() );
>     } else if ( node instanceof HTMLImageTag ) {
>         logger.debug ( " srcImg = " + ((HTMLImageTag) node).getImageURL() );
>     } else if ( node instanceof HTMLFrameSetTag ) {
>         logger.debug ( " srcFrameSet = " + ((HTMLFrameSetTag) node).getFrameLocation() );
>     } else if ( node instanceof HTMLFrameTag ) {
>         // cast to HTMLFrameTag; a cast to HTMLFrameSetTag here would throw ClassCastException
>         logger.debug ( " srcFrame = " + ((HTMLFrameTag) node).getFrameLocation() );
>     } else if ( node instanceof HTMLTag ) {
>         logger.debug ( " HTMLTag = " + ((HTMLTag) node).toHTML() );
>     }
> } |
From: Stan P. <alt...@wa...> - 2003-02-05 09:12:42
|
On Tue, 4 Feb 2003 11:18:12 -0800 (PST) Somik Raha <so...@ya...> wrote:
> Pls file a bug report from the parser website.

You have already done it, obviously...

Thanks a lot,
Stan. |
From: Elodie T. <et...@in...> - 2003-02-05 08:35:08
|
Hi Somik,

>Yes, thats normal. This is bcos when you rip pages
>off, changing the relative image and web links makes
>sense, but the form link usually refers to a server
>capable of processing http posts. Making that relative
>to your local machine makes little sense (at least
>till now) for a ripping application.
>However, do you have a scenario where you think it
>might be useful to change its contents ? We'd be glad
>to consider it.

Imagine you have a portal into which you import any kind of file. The filesystem is such that the "logical paths" are different from the "physical paths". So, when you want to view HTML files from this portal, the 'href' and 'src' paths aren't valid anymore, and I must take them all, "translate" them and replace them.

My method was maybe not the best one, but I listed all the HTML tags that could have such an attribute (src, href, ...) and I want to modify them all. Maybe, as you say, I don't need to translate the form attributes or others, but I wanted to take care of all possibilities: you never know who made the HTML file! ;o)

Is that clear? I hope my English is not too bad ;o)

Regards,
Elodie |
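The path "translation" described above can be illustrated with the standard `java.net.URI` class, which resolves a relative reference against a base location. This is only a sketch under assumptions: the class `PathTranslator`, the method `translate` and the `portal.example` addresses are invented for the example, and a real portal would likely need its own mapping from logical to physical paths on top of this.

```java
import java.net.URI;

class PathTranslator {
    // Resolve a possibly relative href/src value against the location
    // of the document it appears in. Absolute references are returned unchanged.
    static String translate(String base, String ref) {
        return URI.create(base).resolve(ref).toString();
    }

    public static void main(String[] args) {
        String base = "http://portal.example/store/42/index.html"; // made-up base
        System.out.println(translate(base, "img/logo.gif"));
        System.out.println(translate(base, "http://other.example/page.html"));
    }
}
```

The same call could be applied to every attribute that carries a location (href, src, and so on) once the tags holding them have been found.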
From: Elodie T. <et...@in...> - 2003-02-05 08:24:01
|
Hi,

I want to modify attributes of some HTMLTags that I get. I'm using the following code just to see how the parser works. The problem is that it doesn't detect the <frame> tags, although it gets the other tags (a, img, frameset). I tested it with an HTML document that has 2 or 3 frame tags, and it sees none of them!

What can I do?

Thanx in advance for your help.

HTMLReader htmlReader = new HTMLReader ( buffer, len );
HTMLParser parser = new HTMLParser(htmlReader);
parser.registerScanners();

HTMLEnumeration e = parser.elements();
while ( e.hasMoreNodes() ) {
    HTMLNode node = e.nextHTMLNode();
    if ( node instanceof HTMLLinkTag ) {
        logger.debug ( " href = " + ((HTMLLinkTag) node).getLink() );
    } else if ( node instanceof HTMLImageTag ) {
        logger.debug ( " srcImg = " + ((HTMLImageTag) node).getImageURL() );
    } else if ( node instanceof HTMLFrameSetTag ) {
        logger.debug ( " srcFrameSet = " + ((HTMLFrameSetTag) node).getFrameLocation() );
    } else if ( node instanceof HTMLFrameTag ) {
        // cast to HTMLFrameTag; a cast to HTMLFrameSetTag here would throw ClassCastException
        logger.debug ( " srcFrame = " + ((HTMLFrameTag) node).getFrameLocation() );
    } else if ( node instanceof HTMLTag ) {
        logger.debug ( " HTMLTag = " + ((HTMLTag) node).toHTML() );
    }
} |
From: James M. <jmo...@uc...> - 2003-02-05 02:33:41
|
Greetings,

I checked out the latest CVS version (on 02/04/2003) and saw that the fix was applied for HTML that does not have double quotes in it. Unfortunately, I found a prior bug showing itself again.

The HTML below throws an index out of bounds on the word 'hello': it dies on the last single quote.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Untitled Document</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<script language="JavaScript" type="text/JavaScript">
// if this fails, output a 'hello'
if (true)
{
//something good...
}
</script>
</body>
</html>

Thanks!
James Moliere
jmo...@uc... |
From: Somik R. <so...@ya...> - 2003-02-05 00:16:41
|
Hi Folks,

We were examining the design of the HTMLStringNode, and were curious to know about actual use cases. We would be grateful if you could write to us about how you have used the parser (if you are doing a lot of string processing from HTML). Here are some sample uses we think are most common:

[1] Ripping text content from a web page
[2] Transforming string node text

We are looking for real-world examples of these scenarios (and any other example which doesn't come under the above two categories). Your feedback will help us improve the design of the parser.

Thanks for taking the time to reply.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2003-02-04 19:18:13
|
Pls file a bug report from the parser website.

Regards,
Somik

--- Stan Pinte <alt...@wa...> wrote:
> Begin forwarded message:
>
> Date: Tue, 4 Feb 2003 20:01:26 +0100
> From: Stan Pinte <alt...@wa...>
> To: htm...@li...urceforge
> Subject: bug in the StringExtractor.java, or in one
> of the base classes..
>
> hello,
>
> I used the version 1.2 of htmlparser without
> problem, but after having recompiled version
> htmlparser1_3_20030202.zip, I have the following
> problem, when doing
>
> java -classpath htmlparser.jar
> org.htmlparser.parserapplications.StringExtractor
> http://www.lemonde.fr
>
> Exception in thread "main"
> java.lang.OutOfMemoryError
> <<no stack trace available>>
>
> here enclosed the output of the StringExtractor,
> before crashing.
>
> any idea?
>
> The problem doesn't occur with version1_2
>
> thanks a lot,
>
> Stan
>
> --
> Stanislas Pinte
> Computer Consultant
> 98, rue Bois l'Evêque
> B-4000 Liège
> web: http://www.altosw.be
> email: alt...@wa...
>
> ATTACHMENT part 2 application/x-gzip name=crash.txt.gz |
From: Stan P. <alt...@wa...> - 2003-02-04 18:44:51
|
Begin forwarded message:

Date: Tue, 4 Feb 2003 20:01:26 +0100
From: Stan Pinte <alt...@wa...>
To: htm...@li...urceforge
Subject: bug in the StringExtractor.java, or in one of the base classes..

hello,

I used the version 1.2 of htmlparser without problem, but after having recompiled version htmlparser1_3_20030202.zip, I have the following problem, when doing

java -classpath htmlparser.jar org.htmlparser.parserapplications.StringExtractor http://www.lemonde.fr

Exception in thread "main" java.lang.OutOfMemoryError
<<no stack trace available>>

here enclosed the output of the StringExtractor, before crashing.

any idea?

The problem doesn't occur with version1_2

thanks a lot,

Stan

--
Stanislas Pinte
Computer Consultant
98, rue Bois l'Evêque
B-4000 Liège
web: http://www.altosw.be
email: alt...@wa... |
From: Somik R. <so...@ya...> - 2003-02-04 16:38:15
|
Hi Elodie,

> I've just discovered the HTMLParser librairy and I
> wonder why there are no methods in HTMLFrameTag and
> HTMLFormTag that permit to get and modify their
> "src" attribute.
> Is that normal ? How can I do if I want to modify
> the source location, like in HTMLLinkTag and
> HTMLImageTag ?

Yes, that's normal. This is because when you rip pages off, changing the relative image and web links makes sense, but the form link usually refers to a server capable of processing HTTP posts. Making that relative to your local machine makes little sense (at least till now) for a ripping application.

However, do you have a scenario where you think it might be useful to change its contents? We'd be glad to consider it.

Regards,
Somik |
From: Elodie T. <et...@in...> - 2003-02-04 10:10:29
|
Hi,

I've just discovered the HTMLParser library and I wonder why there are no methods in HTMLFrameTag and HTMLFormTag that allow getting and modifying their "src" attribute. Is that normal? What can I do if I want to modify the source location, as in HTMLLinkTag and HTMLImageTag?

Thanx in advance. |
From: Somik R. <so...@ya...> - 2003-02-03 07:21:35
|
Hi Folks,

Integration release 1.3-20030202 is out. From the change log:

Integration build 1.3 - 20030202
--------------------------------
[1] Renamed HTMLCompositeTagScanner to CompositeTagScanner
[2] Renamed HTMLTag.getParameter() to HTMLTag.getAttribute()
[3] Added TableScanner
[4] Added HtmlPage
[5] Added SpanScanner
[6] Added assertType in HTMLParserTestCase
[7] Added TextExtractingVisitor
[8] Added non-recursive visiting (flag in HTMLVisitor)
[9] Added DivScanner
[10] Modified collectInto to use NodeList
[11] Added collectInto(NodeList, Class)
[12] CompositeTagScanner can handle single xml-like tags e.g. <div/>
[13] Fixed bug 678969 - StringParser was not going into ignore mode on encountering double quotes
[14] Added LabelScanner

Dhaval Udani has contributed LabelScanner. (He has also contributed a BodyScanner, which will make it into next week's release.)

We've shipped this time with two tests failing. Both tests replicate the same bug, 677874, "mishandling of double quotes". I made this release for three reasons:

[1] This bug is not a new addition but was always there. It's a deep bug in AttributeParser (previously known as ParameterParser), and it might take a little time to fix.
[2] There are a lot of new additions which we'd like to get out there. We finally have a table scanner!
[3] Important bug fixes have been made which further stabilize the parser's performance (and at least one user was desperately waiting for the fix).

Notable addition: HTMLNode.collectInto() has a new mode of operation, using the class type. Suppose you need to get to a node (e.g. images) that is within a composite (like a table), you can do:

NodeList imageList = new NodeList();
tableTag.collectInto(imageList, HTMLImageTag.class);

You can also do this directly from the parser, like so:

HTMLNode node [] = parser.extractAllNodesThatAre(HTMLLinkTag.class);

And here's some more news: we now have our own wiki (finally!). Go to http://htmlparser.sourceforge.net/docs/

This is a free-for-all wiki. It is a little too much for me to write the entire documentation on my own, so I'd highly appreciate it if the user/developer community pitched in. That would be a great benefit for the community. The current documentation on the site is already obsolete, and I am going to take it down soon (hopefully by the next release).

Regards,
Somik |
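As a rough illustration of what "collecting by class type" inside a composite means, the sketch below walks a small tree and gathers every descendant of a requested class. All types here (`Node`, `TableTag`, `ImageTag`, `LinkTag`, `CollectDemo`) are hypothetical stand-ins invented for the example, and a plain `java.util.List` is used in place of the parser's NodeList; the real collectInto may behave differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node types standing in for parser tags; NOT the HTMLParser API.
class Node {
    final List<Node> children = new ArrayList<>();

    Node add(Node child) {
        children.add(child);
        return this;
    }

    // Append this node and every descendant that is an instance of 'type' to 'out'.
    <T extends Node> void collectInto(List<T> out, Class<T> type) {
        if (type.isInstance(this)) {
            out.add(type.cast(this));
        }
        for (Node c : children) {
            c.collectInto(out, type);
        }
    }
}

class TableTag extends Node {}
class ImageTag extends Node {}
class LinkTag extends Node {}

class CollectDemo {
    static int countImages() {
        // A table containing one image directly and one image nested inside a link.
        Node table = new TableTag()
                .add(new ImageTag())
                .add(new LinkTag().add(new ImageTag()));
        List<ImageTag> images = new ArrayList<>();
        table.collectInto(images, ImageTag.class);
        return images.size();
    }

    public static void main(String[] args) {
        System.out.println(countImages());
    }
}
```

Passing the class as a parameter keeps the traversal generic: the same method finds links, images or tables without a separate search routine per tag type.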
From: Somik R. <so...@ya...> - 2003-01-31 17:59:20
|
I couldn't see either of the URLs you sent. But what you want to do is possible with the parser. Sending POST requests is a new feature in 1.3 (get the latest integration release). From the testcases, here's a sample (showing creation of the parser):

<code>
url = new URL ("http://www.canadapost.ca/tools/pcl/bin/cp_search_response-e.asp");
connection = (HttpURLConnection)url.openConnection ();
connection.setRequestMethod ("POST");
connection.setRequestProperty ("Referer", "http://www.canadapost.ca/tools/pcl/bin/default-e.asp");
connection.setDoOutput (true);
connection.setDoInput (true);
connection.setUseCaches (false);

// build the form body as name=value pairs joined with '&'
buffer = new StringBuffer (1024);
buffer.append ("app_language="); buffer.append ("english"); buffer.append ("&");
buffer.append ("app_response_start_row_number="); buffer.append ("1"); buffer.append ("&");
buffer.append ("app_response_rows_max="); buffer.append ("9"); buffer.append ("&");
buffer.append ("app_source="); buffer.append ("quick"); buffer.append ("&");
buffer.append ("query_source="); buffer.append ("q"); buffer.append ("&");
buffer.append ("name="); buffer.append ("&");
buffer.append ("postal_code="); buffer.append ("&");
buffer.append ("directory_area_name="); buffer.append ("&");
buffer.append ("delivery_mode="); buffer.append ("&");
buffer.append ("Suffix="); buffer.append ("&");
buffer.append ("street_direction="); buffer.append ("&");
buffer.append ("installation_type="); buffer.append ("&");
buffer.append ("delivery_number="); buffer.append ("&");
buffer.append ("installation_name="); buffer.append ("&");
buffer.append ("unit_numbere="); buffer.append ("&");
buffer.append ("app_state="); buffer.append ("production"); buffer.append ("&");
buffer.append ("street_number="); buffer.append (number); buffer.append ("&");
buffer.append ("street_name="); buffer.append (street); buffer.append ("&");
buffer.append ("street_type="); buffer.append (type); buffer.append ("&");
buffer.append ("test="); buffer.append ("&");
buffer.append ("city="); buffer.append (city); buffer.append ("&");
buffer.append ("prov="); buffer.append (province); buffer.append ("&");
buffer.append ("Search=");

// write the body to the connection, then hand the connection to the parser
out = new PrintWriter (connection.getOutputStream ());
out.print (buffer);
out.close ();
parser = new HTMLParser (connection);
</code>

Regards,
Somik |
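The long run of `buffer.append` calls above builds a form body by hand. As a hedged alternative sketch, the same kind of body can be assembled from a map of fields, with each name and value passed through the standard `java.net.URLEncoder` so that spaces and special characters survive the POST. The class `FormBody` and the sample field values are assumptions made for this example; they do not come from the test case quoted above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormBody {
    // Join the fields as name=value pairs separated by '&',
    // URL-encoding both names and values (space becomes '+').
    static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : fields.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(e.getKey(), "UTF-8"));
                body.append('=');
                body.append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(ex);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("street_name", "Main St"); // made-up sample values
        fields.put("city", "Ottawa");
        fields.put("Search", "");
        System.out.println(encode(fields));
    }
}
```

The resulting string would then be written to the connection's output stream in the same way as the hand-built buffer.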
From: Mohd-Taqiyuddin Z. <mt...@ec...> - 2003-01-31 12:26:54
|
Hi there,

I might want to use this for my project. My project is to extract Java quizzes, which are either in a <form> or just in plain HTML. Can this package be used to extract the related content? And how can I do that, given that Java quizzes in a <form> require the user to POST answers for the quiz before the relevant answers can be obtained? I want to harvest both the answers and the questions.

Here are some samples, if you don't mind taking a look at them :)

http://developer.java.sun.com/developer/quizzes/jbasic1-1
http://wwww.angelfire.com/or/abhilash/main.html

I really hope you can help me with this matter. Thank you :) |
From: Somik R. <so...@ya...> - 2003-01-30 17:44:46
|
You might have an older version of the parser. Make sure you have the latest integration release, 1.3-20030125.

Regards,
Somik

--- Aminudin Khalid <ami...@mi...> wrote:
> Hi, thank you for giving a sample program.
>
> I've tried to compiled the program but JAVAC
> couldn't find HTMLVisitor
> class . There are some other errors too. Below are
> the codes and errors . |
From: Aminudin K. <ami...@mi...> - 2003-01-30 08:00:51
|
Hi, thank you for giving a sample program.

I've tried to compile the program, but javac couldn't find the HTMLVisitor class. There are some other errors too. Below are the errors and the code.

*Errors :
StringTranslatingVisitor.java:1: cannot resolve symbol
symbol  : class visitors
location: package htmlparser
import org.htmlparser.visitors;
                     ^
StringTranslatingVisitor.java:9: cannot resolve symbol
symbol  : class HTMLVisitor
location: class StringTranslatingVisitor
public class StringTranslatingVisitor extends HTMLVisitor{
                                              ^
StringTranslatingVisitor.java:39: cannot resolve symbol
symbol  : method visitAllNodesWith (StringTranslatingVisitor)
location: class org.htmlparser.HTMLParser
parser.visitAllNodesWith(visitor);
      ^
StringTranslatingVisitor.java:40: cannot resolve symbol
symbol  : method getHTML ()
location: class StringTranslatingVisitor
System.out.println(visitor.getHTML());
*

import org.htmlparser.HTMLParser;
import org.htmlparser.HTMLRemarkNode;
import org.htmlparser.HTMLStringNode;
import org.htmlparser.tags.HTMLEndTag;
import org.htmlparser.tags.HTMLTag;

public class StringTranslatingVisitor extends HTMLVisitor{
    StringBuffer htmlData = new StringBuffer();

    public void visitStringNode(HTMLStringNode stringNode) {
        String yourStuff="TextToBeTranslated";
        // Perform modifications here.
        // finally, add to htmlData
        htmlData.append(yourStuff);
    }

    public void visitEndTag(HTMLEndTag endTag) {
        htmlData.append(endTag.toHTML());
    }

    public void visitTag(HTMLTag tag) {
        htmlData.append(tag.toHTML());
    }

    public String getHtml() {
        return htmlData.toString();
    }

    public void visitRemarkNode(HTMLRemarkNode remarkNode) {
        htmlData.append(remarkNode.toHTML());
    }

    public static void main(String args[]){
        HTMLParser parser = new HTMLParser("http://www.yahoo.com");
        parser.registerScanners();
        StringTranslatingVisitor visitor = new StringTranslatingVisitor();
        parser.visitAllNodesWith(visitor);
        System.out.println(visitor.getHTML());
    }
}

Somik Raha wrote:

>>What I want to do is to parse HTML code and
>>translate the content and
>>the put the translated text/content back into the
>>original HTML structure.
>>
>>Does this HTML parser suitable of doing this kind of
>>task ?
>
>By translating content, I guess you mean translation
>of meaningful text data (not tags). That is easily
>possible. You can look at the StringExtractor example
>(org.htmlparser.parserapplications) or the
>StringFindingVisitor (org.htmlparser.visitors).
>
>The simplest approach is to write your own visitor -
>StringTranslatingVisitor, that runs through the entire
>html, and wherever it finds strings, these are
>translated as per your wishes.
>
>Here is a sample program :
>import org.htmlparser.HTMLRemarkNode;
>import org.htmlparser.HTMLStringNode;
>import org.htmlparser.tags.HTMLEndTag;
>import org.htmlparser.tags.HTMLTag;
>
>public class StringTranslatingVisitor extends HTMLVisitor {
>    StringBuffer htmlData = new StringBuffer();
>
>    public void visitStringNode(HTMLStringNode stringNode) {
>        String yourStuff="";
>        // Perform modifications here.
>        // finally, add to htmlData
>        htmlData.append(yourStuff);
>    }
>
>    public void visitEndTag(HTMLEndTag endTag) {
>        htmlData.append(endTag.toHTML());
>    }
>
>    public void visitTag(HTMLTag tag) {
>        htmlData.append(tag.toHTML());
>    }
>
>    public String getHtml() {
>        return htmlData.toString();
>    }
>
>    public void visitRemarkNode(HTMLRemarkNode remarkNode) {
>        htmlData.append(remarkNode.toHTML());
>    }
>}
>
>To use this, create your parser -
>HTMLParser parser = new HTMLParser("http://someurl.com");
>parser.registerScanners();
>StringTranslatingVisitor visitor = new StringTranslatingVisitor();
>parser.visitAllNodesWith(visitor);
>System.out.println(visitor.getHTML());
>
>Regards,
>Somik |
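One detail in the errors above may be independent of the jar version: the sample declares `getHtml()` but calls `getHTML()`, and Java method names are case-sensitive, so that last "cannot resolve symbol" is probably a plain name mismatch. The sketch below shows the rebuild-while-translating idea with consistent naming. `RebuildingVisitor` and `RebuildDemo` are hypothetical classes written for this note, the nodes are fed in by hand instead of coming from a parser, and upper-casing stands in for a real translation step.

```java
import java.util.function.UnaryOperator;

// Hypothetical visitor: tags are copied verbatim, text is passed through
// a translation function. NOT the HTMLParser API.
class RebuildingVisitor {
    private final StringBuilder html = new StringBuilder();
    private final UnaryOperator<String> translate;

    RebuildingVisitor(UnaryOperator<String> translate) {
        this.translate = translate;
    }

    void visitTag(String rawTag) {
        html.append(rawTag);
    }

    void visitText(String text) {
        html.append(translate.apply(text));
    }

    // Declared and called with the same spelling: getHtml, not getHTML.
    String getHtml() {
        return html.toString();
    }
}

class RebuildDemo {
    static String run() {
        // Upper-casing is a placeholder for the real translation.
        RebuildingVisitor visitor = new RebuildingVisitor(String::toUpperCase);
        visitor.visitTag("<p>");
        visitor.visitText("hello");
        visitor.visitTag("</p>");
        return visitor.getHtml();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because the tags pass through untouched and only the text changes, the output keeps the original structure, which is the property the translation use case depends on.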
From: Somik R. <so...@ya...> - 2003-01-27 20:25:55
|
Do you have v1.2? Search capabilities have been in the production release (though improved in the subsequent integration releases).

Regards,
Somik

--- ope tomori <op...@ho...> wrote:
> I must be using an earlier version of the parser,
> because i dont have the
> searchFor and searchByName methods in my formTag.
> Did that come out in the
> latest release? Should i download the integration
> build or the htmlparser?
> What is the integration build?
>
> Thanks for all your help
> sincerely,
> ope |
From: ope t. <op...@ho...> - 2003-01-27 19:34:06
|
I must be using an earlier version of the parser, because I don't have the searchFor and searchByName methods in my formTag. Did that come out in the latest release? Should I download the integration build or the htmlparser? What is the integration build?

Thanks for all your help
sincerely,
ope

>Subject: Htmlparser-user digest, Vol 1 #174 - 1 msg
>Date: Thu, 23 Jan 2003 12:11:53 -0800
>
>Message: 1
>Date: Wed, 22 Jan 2003 15:47:54 -0800 (PST)
>From: Somik Raha <so...@ya...>
>Subject: Re: [Htmlparser-user] parsing form elements
>
> > Can someone give me some direction in using the formscanner and formTag to
> > parse form elements like the buttons (submit, cancel, etc) on a html page.
>
>Just rig up the parser as usual (parser.registerScanners()) - and check your
>node to see if it is a form tag. If it is, cast it and use the api.
>
>Use searchFor, or searchByName (in HTMLFormTag).
>The former gets anything that contains the given text,
>while the latter gives named elements within the form
>(as subclasses of HTMLTag).
>
>Regards,
>Somik
|
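The difference between the two lookups described in the quoted advice (one matches on contained text, the other on the element's name) can be illustrated with a small stand-alone sketch. Everything here is a hypothetical stand-in: `FormSearchSketch`, `Input`, `byName` and `containing` are invented for illustration and are not the HTMLFormTag API, whose exact signatures the thread does not show.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stand-in model of a form's child tags; NOT the HTMLParser classes. */
class FormSearchSketch {

    /** One form element: its name attribute and its raw HTML. */
    static final class Input {
        final String name;
        final String html;
        Input(String name, String html) {
            this.name = name;
            this.html = html;
        }
    }

    /** Name-based lookup: returns the first element with that name, or null. */
    static Input byName(List<Input> form, String name) {
        for (Input in : form) {
            if (name.equals(in.name)) {
                return in;
            }
        }
        return null;
    }

    /** Text-based lookup: returns every element whose HTML contains the text. */
    static List<Input> containing(List<Input> form, String text) {
        List<Input> hits = new ArrayList<>();
        for (Input in : form) {
            if (in.html.contains(text)) {
                hits.add(in);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Input> form = Arrays.asList(
            new Input("user", "<input type=\"text\" name=\"user\">"),
            new Input("ok", "<input type=\"submit\" name=\"ok\" value=\"Submit\">"),
            new Input("cancel", "<input type=\"submit\" name=\"cancel\" value=\"Cancel\">"));

        System.out.println(byName(form, "cancel").html);
        System.out.println(containing(form, "submit").size());
    }
}
```

With the real library, the equivalent step would be to cast the node to the form tag and call its own search methods, as the quoted message suggests.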
From: Somik R. <so...@ya...> - 2003-01-25 23:41:45
|
Hi Folks,

The next integration release is out. From the change log:

Integration build 1.3 - 20030125
--------------------------------
[1] HTMLCompositeTagScanner now takes an array of match strings
[2] toHTML(HTMLRenderer ...) was replaced by UrlModifyingVisitor
[3] Fixed NullPointerException in HTMLScriptTag.toString()
[4] Fixed bug in HTMLStringNode (breaking up empty lines into separate string nodes)
[5] Fixed thread safety issue and introduced parser helpers
[6] Fixed bug 664404 - spewing incorrect line breaks in HTMLRemarkNode.toHTML()
[7] Added assertXmlEquals() in HTMLParserTestCase
[8] Added better option tag support
[9] Replaced instanceof with getType() mechanism - much faster
[10] Incorporated NodeList instead of Vector in HTMLCompositeTag
[11] Added HTMLRemarkNode support in Visitor
[12] Fixed bug 673379 (infinite loop on encountering links like ".someurl.html")

Among the notable additions is assertXmlEquals() - this is present to enable us to perform xml testing. This method actually creates the parser and performs a node for node comparison.

Reconstruction has improved a lot - you will find that the parser no longer adds unnecessary line breaks, and preserves the html as it came in.

One significant addition is the use of NodeList instead of Vector. The integration has been performed, so there should be a significant performance increase - check
http://htmlparser.sourceforge.net/performance/simpleEnumerationPerformance.html

In the coming week, we will be setting up a wiki on sourceforge, where we can collaboratively create documentation - hopefully that will finally take the burden out of the documentation process.

Regards,
Somik
|
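To make the "node for node comparison" idea concrete, here is a minimal stand-alone sketch. It is not the assertXmlEquals() implementation (the announcement does not show it); the class `NodeByNodeCompareSketch`, the regex tokenizer and the `firstDifference` helper are assumptions made purely for illustration. It splits each input into tags and text runs, ignores whitespace-only text, and reports the index of the first node that differs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a toy node-by-node comparison, not HTMLParserTestCase. */
class NodeByNodeCompareSketch {

    // A node is either a tag (<...>) or a run of text between tags.
    private static final Pattern NODE = Pattern.compile("<[^>]*>|[^<]+");

    /** Splits markup into trimmed tag/text nodes, skipping whitespace-only text. */
    static List<String> nodes(String markup) {
        List<String> out = new ArrayList<>();
        Matcher m = NODE.matcher(markup);
        while (m.find()) {
            String node = m.group().trim();
            if (!node.isEmpty()) {
                out.add(node);
            }
        }
        return out;
    }

    /** Returns the index of the first differing node, or -1 if all nodes match. */
    static int firstDifference(String expected, String actual) {
        List<String> e = nodes(expected);
        List<String> a = nodes(actual);
        int shared = Math.min(e.size(), a.size());
        for (int i = 0; i < shared; i++) {
            if (!e.get(i).equals(a.get(i))) {
                return i;
            }
        }
        // Same prefix: equal only if neither side has extra nodes.
        return e.size() == a.size() ? -1 : shared;
    }

    public static void main(String[] args) {
        // Layout differences are ignored; content differences are located.
        System.out.println(firstDifference("<a>\n  <b>x</b>\n</a>", "<a><b>x</b></a>")); // -1
        System.out.println(firstDifference("<a><b>x</b></a>", "<a><b>y</b></a>"));       // 2
    }
}
```

Comparing nodes rather than raw strings means a test does not fail just because the two documents are indented differently.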
From: Somik R. <so...@ya...> - 2003-01-25 18:48:21
|
> I came across this interesting tool for parsing HTML
> files using HTMLparser and have a couple of questions.
>
> Q1. How is it better than XML parsers, SAX parsers etc?

There are some good SAX parsers out there - but they are for parsing XML, not HTML. HTMLParser is primarily a tool for parsing HTML - and HTML is usually dirty, with no end tags, etc. Of late, we have started using HTMLParser to parse XML, and it is useful as it is so compact and very fast. However, keep in mind that HTMLParser is a tolerant parser. It cannot tell you if your xml file has errors (at least not yet), which most SAX parsers do.

> Q2. Does it focus primarily on HTML files or is it generic to
> other files as well?

HTML files primarily. Lots of folks use it in their search engines, crawlers.. I use it for unit testing html (actually unit testing xsl stylesheets) for web applications at my workplace.

> Q3. Where can I download the package for HTMLparser?

http://htmlparser.sourceforge.net - there is a download link. You are advised to go with an integration release, as HTMLParser is a 100% tested project. We do not add new bugs with every new release (we try not to and succeed to a good extent).

> I get the following message while compiling a sample
> program:
> LinkExtractor.java:3: package org.htmlparser does not
> exist import org.htmlparser.HTMLParser;

Remove the package name if you are compiling it as your application. By the way, this is already present in the parser, so when you download it, you shouldn't face any problems.

Regards,
Somik
|
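The point about strict XML parsers versus "dirty" HTML can be checked with the SAX parser that ships with the JDK. The sketch below uses only standard `javax.xml.parsers` and `org.xml.sax` classes; the class name `StrictVersusDirtyHtml` and the helper `isWellFormedXml` are made up for this illustration. It shows a strict parser accepting well-formed markup and rejecting the same markup with an unclosed `<br>`, which is exactly the kind of input a tolerant HTML parser is meant to cope with.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Shows that a strict XML (SAX) parser refuses typical tag-soup HTML. */
class StrictVersusDirtyHtml {

    /** Returns true if the JDK's SAX parser accepts the markup as well-formed XML. */
    static boolean isWellFormedXml(String markup) {
        try {
            SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Raised for well-formedness errors such as a missing end tag.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O trouble, not a markup problem.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedXml("<p>one<br/>two</p>")); // true
        System.out.println(isWellFormedXml("<p>one<br>two</p>"));  // false
    }
}
```

This is also why a tolerant parser cannot double as a validator: accepting the second input silently is the feature, so it has nothing to report.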
From: SN+ <sn...@ya...> - 2003-01-25 18:04:07
|
Hi there,

I came across this interesting tool for parsing HTML files using HTMLparser and have a couple of questions.

Q1. How is it better than XML parsers, SAX parsers etc?
Q2. Does it focus primarily on HTML files or is it generic to other files as well?
Q3. Where can I download the package for HTMLparser?

I get the following message while compiling a sample program:
LinkExtractor.java:3: package org.htmlparser does not exist
import org.htmlparser.HTMLParser;

Appreciate your response.

Thanks in advance,
Sunny.
|
From: Somik R. <so...@ya...> - 2003-01-24 18:03:37
|
> What I want to do is to parse HTML code and
> translate the content and
> then put the translated text/content back into the
> original HTML structure.
>
> Is this HTML parser suitable for doing this kind of
> task?

By translating content, I guess you mean translation of meaningful text data (not tags). That is easily possible. You can look at the StringExtractor example (org.htmlparser.parserapplications) or the StringFindingVisitor (org.htmlparser.visitors).

The simplest approach is to write your own visitor - StringTranslatingVisitor - that runs through the entire html, and wherever it finds strings, these are translated as per your wishes. Here is a sample program:

    import org.htmlparser.HTMLRemarkNode;
    import org.htmlparser.HTMLStringNode;
    import org.htmlparser.tags.HTMLEndTag;
    import org.htmlparser.tags.HTMLTag;
    // Base class; package assumed to be the visitors package mentioned above.
    import org.htmlparser.visitors.HTMLVisitor;

    public class StringTranslatingVisitor extends HTMLVisitor {
        // Accumulates the rebuilt page: tags verbatim, text translated.
        StringBuffer htmlData = new StringBuffer();

        public void visitStringNode(HTMLStringNode stringNode) {
            String yourStuff = "";
            // Perform modifications here, i.e. set yourStuff to the
            // translated text of stringNode.
            // finally, add to htmlData
            htmlData.append(yourStuff);
        }

        public void visitEndTag(HTMLEndTag endTag) {
            htmlData.append(endTag.toHTML());
        }

        public void visitTag(HTMLTag tag) {
            htmlData.append(tag.toHTML());
        }

        public String getHtml() {
            return htmlData.toString();
        }

        public void visitRemarkNode(HTMLRemarkNode remarkNode) {
            htmlData.append(remarkNode.toHTML());
        }
    }

To use this, create your parser -

    HTMLParser parser = new HTMLParser("http://someurl.com");
    parser.registerScanners();
    StringTranslatingVisitor visitor = new StringTranslatingVisitor();
    parser.visitAllNodesWith(visitor);
    System.out.println(visitor.getHtml());

Regards,
Somik
|
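For readers without the library at hand, the visitor approach described in this reply can be tried out with a self-contained sketch. The types below (`Node`, `Tag`, `Text`, `Visitor`, `Translator`) are hypothetical stand-ins written for this example, not the HTMLParser classes, and upper-casing is only a placeholder for a real translation step. The sketch shows the essential idea: tags are appended unchanged, text nodes are transformed, and the output keeps the original structure.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Stand-in sketch of the visitor idea; these types are NOT the HTMLParser classes. */
class TranslatingVisitorSketch {

    /** A node is either markup (kept verbatim) or text (translated). */
    interface Node {
        void accept(Visitor v);
    }

    interface Visitor {
        void visitTag(String rawTag);
        void visitText(String text);
    }

    static final class Tag implements Node {
        final String raw;
        Tag(String raw) { this.raw = raw; }
        public void accept(Visitor v) { v.visitTag(raw); }
    }

    static final class Text implements Node {
        final String text;
        Text(String text) { this.text = text; }
        public void accept(Visitor v) { v.visitText(text); }
    }

    /** Rebuilds the page, changing only the text nodes. */
    static final class Translator implements Visitor {
        private final StringBuilder html = new StringBuilder();

        public void visitTag(String rawTag) {
            html.append(rawTag);
        }

        public void visitText(String text) {
            // Placeholder "translation": upper-casing stands in for real work.
            html.append(text.toUpperCase(Locale.ROOT));
        }

        String getHtml() {
            return html.toString();
        }
    }

    /** Walks the nodes in document order and returns the rebuilt markup. */
    static String translate(List<Node> nodes) {
        Translator t = new Translator();
        for (Node n : nodes) {
            n.accept(t);
        }
        return t.getHtml();
    }

    public static void main(String[] args) {
        List<Node> page = Arrays.asList(new Tag("<p>"), new Text("hello"), new Tag("</p>"));
        System.out.println(translate(page)); // <p>HELLO</p>
    }
}
```

Keeping the translation inside the text callback is what guarantees that markup is never touched, whatever the translation step does.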
From: Aminudin K. <ami...@mi...> - 2003-01-24 12:14:35
|
Hi guys,

I'm very new to this forum. Hello everybody .... :)

I'm looking for tools/libraries that can be used as an HTML parser. I found this HTMLParser on sourceforge and I hope it can help me develop an HTML translation module.

What I want to do is to parse HTML code, translate the content, and then put the translated text/content back into the original HTML structure.

Is this HTML parser suitable for this kind of task?
|