htmlparser-developer Mailing List for HTML Parser (Page 32)
Brought to you by:
derrickoswald
From: Somik R. <so...@ya...> - 2002-04-05 07:14:43
|
Hi Folks, The dynamic page parsing bug is fixed, and as far as I've tested, I am able to correctly parse pages like http://search.yahoo.com/bin/search?p=dogs which Mats had posted earlier. We are now ready for release 1.1. I'd be grateful for some help in testing the parser, to see if there are any showstopper bugs for this release. (Get the latest code from CVS.) Regards, Somik |
From: Somik R. <so...@ya...> - 2002-04-05 03:11:21
|
> I have used parser available in JDK. > If u say I can send u example. Yes Asgher, pls go ahead. Regards, Somik _________________________________________________________ Do You Yahoo!? Get your free @yahoo.com address at http://mail.yahoo.com |
From: Somik R. <so...@ya...> - 2002-04-05 03:04:56
|
Hi Folks, An important bug has been pointed out by Raj Sharma, which would halt the parser if a page contained a link spread over two lines. This was a bug in HTMLTag, and I was able to find it quickly, thanks to the refactoring done earlier with the help of Arnaud. Also, HTMLLinkScanner and HTMLImageScanner have some small changes in connection with the fix. Please get the latest code from CVS. Regards, Somik |
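The HTMLTag fix itself is not shown in this thread. As a rough illustration of the problem, a minimal sketch of one way a line-oriented reader can cope with a tag spread over two lines is to keep appending lines until the open '<' has found its '>'. The class and method names below are hypothetical and are not part of HTMLParser, and the sketch deliberately ignores '>' characters inside attribute values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not HTMLParser code: joins physical lines so that
// no tag is left open at the end of a logical line.
class TagJoinSketch {

    static List<String> joinSplitTags(List<String> lines) {
        List<String> out = new ArrayList<>();
        StringBuilder pending = null;
        for (String line : lines) {
            if (pending == null) {
                pending = new StringBuilder(line);
            } else {
                // Continuation of a tag that was still open on the previous line.
                pending.append(' ').append(line.trim());
            }
            // A '<' after the last '>' means a tag is still open: keep reading.
            if (pending.lastIndexOf("<") > pending.lastIndexOf(">")) {
                continue;
            }
            out.add(pending.toString());
            pending = null;
        }
        if (pending != null) {
            // Input ended inside a tag; emit what we have rather than lose it.
            out.add(pending.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList(
                "<a href=\"x\"",
                "   target=\"y\">Text</a>",
                "plain");
        for (String logicalLine : joinSplitTags(page)) {
            System.out.println(logicalLine);
        }
    }
}
```

A scanner that receives the joined line sees the whole link tag at once, so it never has to decide what to do with half a tag.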
From: Somik R. <so...@ya...> - 2002-04-04 15:51:32
|
> How come when you use the parser on most sites to extract links it works
> fine but when you use it on search engine i.e.
> http://search.yahoo.com/bin/search?p=dogs which is a page with search
> results for dogs, it does not work?

Ah - this is a known bug. It doesn't work because the parser is not capable of handling dynamic pages. This is actually not a difficult bug to fix. Version 1.10 of HTMLParser (the next release - coming soon) will contain this and other fixes. So you will have to wait till this weekend, or make the fix yourself - the bug probably lies in HTMLParser.java itself, in the way a page extension is handled.

Regards
Somik |
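The thread does not show how HTMLParser.java was eventually changed. One common way to stop depending on the URL's extension, sketched here purely as an assumption, is to look at the Content-Type header the server returns. The class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper, not part of HTMLParser: decides whether a response
// should be handed to an HTML parser by its Content-Type header instead of
// the file extension in the URL.
class ContentTypeCheck {

    // Returns true for "text/html" and "application/xhtml+xml", ignoring
    // case and any parameters such as "; charset=ISO-8859-1".
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        int semicolon = contentType.indexOf(';');
        String mime = (semicolon >= 0) ? contentType.substring(0, semicolon) : contentType;
        mime = mime.trim().toLowerCase(Locale.ROOT);
        return mime.equals("text/html") || mime.equals("application/xhtml+xml");
    }

    public static void main(String[] args) {
        System.out.println(isHtml("text/html; charset=ISO-8859-1"));
        System.out.println(isHtml("image/gif"));
    }
}
```

With a check like this, a URL such as http://search.yahoo.com/bin/search?p=dogs would be accepted as long as the server labels the response as HTML, whatever the path looks like. In real code the header value would typically come from `URLConnection.getContentType()`.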
From: Somik R. <so...@ya...> - 2002-04-04 02:22:15
|
Hi Asgher,

> I have used parser available in JDK.
> If u say I can send u example.

Yes, pls go ahead. I don't have much time till the weekend, and some help would really get me up to speed.

Regards,
Somik |
From: Asgher A. <as...@lw...> - 2002-04-02 05:02:54
|
I have used parser available in JDK. If u say I can send u example.

On Monday, April 01, 2002 at 12:47:41 PM, htm...@li... wrote:
> Subject: [Htmlparser-developer] Re: [Htmlparser-user] Swing integration
>
> I have never used Sun's html parser (If I had, I might not have started
> this project).
> I will need to study Sun's parser before I can answer your question..

Asgher Ali e-mail: as...@lw...
---------------------------------------------
Lahore Wide Web "The Intranet Company" http://www.lww.org/ |
From: Somik R. <so...@ya...> - 2002-04-01 15:22:00
|
Hi Craig,

Wow! That's a great question.

Actually, I doubt if I could replace Sun Microsystems' code with mine. I don't think Java is that open (or is it?). However, we could think of writing our own adapter for the HTML parser that might plug in in some way... I have never used Sun's HTML parser (if I had, I might not have started this project). I will need to study Sun's parser before I can answer your question. But there do seem to be some interesting possibilities.

Regards
Somik

----- Original Message -----
From: "Craig Raw" <cr...@qu...>
To: <htm...@li...>
Sent: Monday, April 01, 2002 10:20 PM
Subject: [Htmlparser-user] Swing integration

> Has the HTML Parser been integrated into Swing's HTMLEditorKit to
> provide a better implementation of JEditorPane's HTML viewing
> capabilities? HTML Parser would need to replace
> javax.swing.text.html.parser.Parser, which is currently somewhat buggy.
> Anyone tried this?
>
> -craig |
From: Somik R. <so...@ya...> - 2002-03-31 09:19:42
|
Hi Folks,

A major bug fix has been done. I had previously reported that the parser crashes when encountering very dirty html of the form:

<A HREF="http://www.somelink.com">SomeText<A>

Instead of the end tag, we put in a begin tag by mistake, and the parser promptly crashes. This called for a modification in the evaluate() method, as the current scanners don't have more than existing local info about the parsing process. But now, I've introduced a parameter which takes in the scanner. So, if a tag was being parsed, and in the process of the parsing, another tag starts being parsed, then the second tag will now know that a scanner process is already running.

This enables the HTMLLinkScanner to come to the conclusion that its current parsing activity is of a dirty html tag, and hence take the appropriate action (flag the scanner into a dirty mode, and return an HTMLEndTag, which is expected by the previous scanner).

This solves this bug, and finally we can handle some really crazy pages...

This fix and some others, along with some additions (META and TITLE), will make it to release 1.1 (coming soon). Currently, the latest code is available through CVS.

In case any of you have written your own scanners, you will need to modify the evaluate method signature to be compatible with the new HTMLTagScanner.

Regards,
Somik |
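The recovery rule described above can be illustrated outside HTMLParser. The following is only a sketch of the idea (a bare `<A>` seen while a link is already open is treated as if it were `</A>`); it does not use or reproduce the real evaluate() signature or the HTMLLinkScanner and HTMLEndTag classes, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not HTMLParser code: collects link texts and treats a
// stray <A> found inside an open link as the missing end tag.
class DirtyLinkSketch {

    // Matches <a ...>, <a> and </a>, case-insensitively.
    // Group 1 is "/" for an end tag; group 2 holds the attributes, if any.
    private static final Pattern A_TAG =
            Pattern.compile("<(/?)a(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    static List<String> linkTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = A_TAG.matcher(html);
        boolean insideLink = false;
        int textStart = 0;
        while (m.find()) {
            boolean isEndTag = !m.group(1).isEmpty();
            boolean hasHref = m.group(2) != null
                    && m.group(2).toLowerCase(Locale.ROOT).contains("href");
            if (insideLink) {
                // A real </A> or a stray <A>: either way the current link ends here.
                texts.add(html.substring(textStart, m.start()));
                insideLink = false;
                if (isEndTag || !hasHref) {
                    continue; // nothing new is opened by this tag
                }
            } else if (isEndTag) {
                continue; // unmatched </A> outside any link: ignore it
            }
            // An opening tag: start collecting the text of a new link.
            insideLink = true;
            textStart = m.end();
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(linkTexts("<A HREF=\"http://www.somelink.com\">SomeText<A>"));
    }
}
```

The key point matches the message: the decision depends on knowing that a link is already being scanned, which is the information the extra scanner parameter makes available.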
From: Somik R. <so...@ya...> - 2002-03-24 05:51:01
|
Dear Users,

Thanks for using HTMLParser. HTMLParser is getting some new features, namely:

[1] HTMLMetaTag scanner
[2] Support for non-".html" pages - I am planning to bring dynamic pages under the purview of the parser as well, though I might need a bit of help for this.

I wanted to have some feedback from the user community - what are the features that you would really like to see added to the parser (or are you quite happy with the parser as is)?

Regards,
Somik |
From: Somik R. <so...@ya...> - 2002-03-24 05:48:24
|
Hi Folks,

I am encountering a really strange scenario - try to create a link like this in a web page:

<A HREF="...">something<A>

i.e. instead of putting a close tag </A>, put an open tag. I find that Internet Explorer renders it just fine. Now if IE renders it, then perhaps we ought to support it in HTML Parser. However, it's not so easy - check out the latest source from CVS. I have put in a test case for this situation which is failing (in HTMLLinkScannerTest - com.kizna.html.scannersTests).

The problem is in HTMLReader.find(), which goes into a sort of recursion - when it finds <A ...> the first time, the scanner asks it to find the remaining tags. Now if the second A is encountered, it will try to keep parsing till the end tag is encountered, which won't happen. Now, I need a clean, elegant way of telling the reader not to expand in exceptional situations like this one.

I can of course do it with some flags - but before I do it, I was wondering if anyone has insights on this problem, and if anyone thinks we should not support this dirty html even if IE does.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2002-03-22 16:40:58
|
Hi Folks,

Release 1.04 is out. It has the following bug fixes:

[1] Parsing JSP tags which had tags within inverted commas was causing problems.
[2] A link with no link url would cause the parser to crash with a null pointer exception.

The above bugs were reported by Gordon Deudney and Robert Kausch. More test cases added.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2002-03-12 08:52:14
|
Hi Don,

It would be appreciated if you could post usage questions to the htmlparser-user mailing list (link is at http://htmlparser.sourceforge.net).

To your query - the code you posted seems rather complex for a not so complex task :) Here's how you would do it in HTML Parser (in the attached code). The code I have given is the shortcut way. There is a way to get much shorter code than what I am providing you, but that requires getting into the design docs of the parser and writing a Table Scanner. Then your code could become something like this:

    HTMLParser parser = new HTMLParser("http://www.nba.com");
    HTMLNode node;
    int tableCount = 0;
    for (Enumeration e = parser.elements(); e.hasMoreElements();) {
        node = (HTMLNode) e.nextElement();
        if (node instanceof HTMLTableNode) {
            tableCount++;
            if (tableCount == 4) {
                HTMLTableNode tableNode = (HTMLTableNode) node;
                tableNode.print();
            }
        }
    }

Regards,
Somik

----- Original Message -----
From: "Don Taggart" <dta...@e-...>
To: <Htm...@li...>
Sent: Tuesday, March 12, 2002 1:33 AM
Subject: [Htmlparser-developer] HTMLParser Sample App

> Hi,
> I am attempting to grab the content of a certain table on any website. For
> instance I'd like to get all of the text, tags, comments, etc contained in
> the 4rth table I run across. I've been able to do this successfully using
> the htmleditorkit in swing, but it has a few bugs.
>
> Would your HTML Parser be useful for this scenario, and If so, could you
> give me some guidance on how to start.
>
> Thanks,
> Don |
From: Don T. <dta...@e-...> - 2002-03-11 16:37:08
|
Hi,

I am attempting to grab the content of a certain table on any website. For instance, I'd like to get all of the text, tags, comments, etc. contained in the fourth table I run across. I've been able to do this successfully using the HTMLEditorKit in Swing, but it has a few bugs.

Would your HTML Parser be useful for this scenario, and if so, could you give me some guidance on how to start?

Thanks,
Don

Here's my code that goes and gets the contents of the fourth table at nba.com:

    import java.io.*;
    import java.net.*;
    import java.util.*;
    import javax.swing.text.*;
    import javax.swing.text.html.*;
    import javax.swing.text.html.parser.*;

    /**
     * This small demo program shows how to use the
     * HTMLEditorKit.Parser and its implementing class
     * ParserDelegator in the Swing system.
     */
    public class HtmlParseDemo2 {
        public static void main(String [] args) {
            Reader r;
            String host = "";
            String spec = "http://www.nba.com";
            long endTime;
            long endTime2;
            long startTime = System.currentTimeMillis();
            String snippet = "";

            try {
                if (spec.indexOf("://") > 0) {
                    URL u = new URL(spec);
                    host = u.getHost();
                    Object content = u.getContent();

                    if (content instanceof InputStream) {
                        r = new InputStreamReader((InputStream)content);
                    }
                    else if (content instanceof Reader) {
                        r = (Reader)content;
                    }
                    else {
                        throw new Exception("Bad URL content type.");
                    }
                }
                else {
                    r = new FileReader(spec);
                }

                endTime = System.currentTimeMillis();
                System.out.println("Time to complete connection: " + (endTime - startTime));

                HTMLEditorKit.Parser parser;
                System.out.println("About to parse " + spec);
                parser = new ParserDelegator();

                HTMLParseLister2 snippetCallback = new HTMLParseLister2(host);

                //Parse Away!
                parser.parse(r, snippetCallback, true);
                r.close();

                endTime2 = System.currentTimeMillis();
                System.out.println("Time to complete: " + (endTime2 - startTime));
            }
            catch (Exception e) {
                System.err.println("Error: " + e);
                e.printStackTrace(System.err);
            }
        }
    }

    /**
     * HTML parsing proceeds by calling a callback for
     * each and every piece of the HTML document. This
     * simple callback class simply prints an indented
     * structural listing of the HTML data.
     */
    class HTMLParseLister2 extends HTMLEditorKit.ParserCallback
    {
        int indentSize = 0;
        int tableNum = 0;
        String atts;
        String tabNum;
        String endTable;
        String tableLevel;
        Stack tableStack = new Stack();
        boolean finished = false;
        HTML.Tag selectedTag = HTML.Tag.TABLE;
        String selectedTable = Integer.toString(4);
        boolean inImportantTag = false;
        StringBuffer snippetString = new StringBuffer();

        private String host;

        public HTMLParseLister2(String host) {
            this.host = host;
        }

        public String getSnippet() {
            return snippetString.toString();
        }

        protected void indent() {
            indentSize += 4;
        }

        protected void unIndent() {
            indentSize -= 4; if (indentSize < 0) indentSize = 0;
        }

        protected void pIndent() {
            for(int i = 0; i < indentSize; i++) System.out.print(" ");
        }

        public void handleText(char[] data, int pos) {
            if (!tableStack.empty() && !finished)
            {
                tableLevel = (String)tableStack.peek();
                if (Integer.parseInt(tableLevel) >= (Integer.parseInt(selectedTable)))
                {
                    //pIndent();
                    String str = new String(data);
                    System.out.println(str);
                }
            }

            if (inImportantTag)
            {
                String str = new String(data);
                System.out.println(str);
            }
        }

        // ********************************************************
        public void handleComment(char[] data, int pos) {

            if (!tableStack.empty() && !finished)
            {
                tableLevel = (String)tableStack.peek();
                if (Integer.parseInt(tableLevel) >= (Integer.parseInt(selectedTable)))
                {
                    //pIndent();
                    String str = new String(data);
                    //System.out.println("<!--" + str + "-->");
                    //indent();
                    //pIndent();
                }
            }

            if (inImportantTag)
            {
                String str = new String(data);
                System.out.println("<!--" + str + "-->");
            }
        }
        // ********************************************************

        // ********************************************************
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            // Is this Tag One of the few that we want to list outside the chosen component
            if (t == HTML.Tag.STYLE || t == HTML.Tag.LINK)
            {
                atts = listAttributes(a);
                inImportantTag = true;
                System.out.print("<" + t.toString() + " " + atts + ">");
                return;
            }

            if (t == selectedTag && !finished)
            {
                //pIndent();
                tableNum++;
                tabNum = Integer.toString(tableNum);
                tableStack.push(tabNum);
                atts = listAttributes(a);
                tableLevel = (String)tableStack.peek();
                if (Integer.parseInt(tableLevel) >= (Integer.parseInt(selectedTable)))
                {
                    //System.out.println("<Table#" + tableLevel + ">");
                }
            }

            if (!tableStack.empty() && !finished) {
                tableLevel = (String)tableStack.peek();
                if (Integer.parseInt(tableLevel) >= (Integer.parseInt(selectedTable)))
                {
                    atts = listAttributes(a);
                    System.out.println("<" + t.toString() + " " + atts + ">");
                }
            }
        }
        // ********************************************************

        // ********************************************************
        public void handleEndTag(HTML.Tag t, int pos) {
            if (inImportantTag)
            {
                inImportantTag = false;
                System.out.println("</" + t.toString() + ">");
            }

            if (!tableStack.empty() && !finished)
            {
                if (t == selectedTag)
                {
                    //unIndent();
                    //pIndent();
                    tableLevel = (String)tableStack.peek();
                    if (Integer.parseInt(tableLevel) >= (Integer.parseInt(selectedTable))){
                        System.out.println("</" + t.toString() + ">");
                    }
                    if (tableStack.peek().equals(selectedTable))
                        finished = true;
                    endTable = (String) tableStack.pop();
                }
            }
            if (!tableStack.empty() && !finished) {
                tableLevel = (String)tableStack.peek();
                if (Integer.parseInt(tableLevel) >= (Integer.parseInt(selectedTable)) && t != selectedTag) {
                    //pIndent();
                    System.out.println("</" + t.toString() + ">");
                    //pIndent();
                }
            }
        }
        // ********************************************************

        // ********************************************************
        public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos)
        {
            if (t == HTML.Tag.LINK && !finished)
            {
                atts = listAttributes(a);
                System.out.println("<" + t.toString() + " " + atts + ">");
            }

            if (!tableStack.empty() && !finished)
            {
                atts = listAttributes(a);
                if(a.getAttribute(HTML.Attribute.ENDTAG) != null)
                {
                    handleEndTag(t, pos);
                    return;
                }
                //if (tableStack.peek() == selectedTable)
                //pIndent();

                tableLevel = (String)tableStack.peek();
                if (Integer.parseInt(tableLevel) >= (Integer.parseInt(selectedTable)))
                    System.out.println("<" + t.toString() + " " + atts + ">");
            }
        }
        // ********************************************************

        // ********************************************************
        private String listAttributes(AttributeSet attributes) {
            Enumeration e = attributes.getAttributeNames();
            String attString = "";

            while (e.hasMoreElements()) {
                Object name = e.nextElement();
                Object value = attributes.getAttribute(name);

                if (name.toString().equals("href") || name.toString().equals("src")
                        || name.toString().equals("action"))
                {
                    if (value.toString().charAt(0) == '/')
                        value = host + value;
                }
                attString = attString + name + "=\"" + value + "\" ";
            }
            return attString;
        }
        // ********************************************************

        // ********************************************************
        public void handleError(String errorMsg, int pos){
            //System.out.println("Parsing error: " + errorMsg + " at " + pos);
        }
    } |
From: Somik R. <so...@ya...> - 2002-03-04 14:28:27
|
HTMLParser 1.03 has been released. It contains a bug fix in HTMLRemarkNode which was causing the parser to crash on pages with remarks going over one line. A test case for the bug has been added in HTMLRemarkNodeTest. The release also contains the design documentation in the zip. Thanks to Serge Kruppa for pointing out the bug. Regards Somik |
From: Somik R. <so...@ya...> - 2002-01-18 23:55:06
|
> What is the Parse.jar file in htmlparser.jar?

Ah, I was wondering why the size was so much. Thanks for pointing it out.

> I would like if htmlparser.jar would be named to HTMLParser.jar
> according to the name of the application.
>
> I happened to call it with capital letters in my application
> and it's easy for me to make this change but perhaps
> if someone else does it he does not notice the difference.

Well, class naming conventions are different from jar naming conventions. I thought keeping it all in small letters was simpler.

> I today replaced my modified version 0.98
> with the official version 1.02 and after I solved some
> incompatibilities (mainly the BufferedReader thing)
> it seemed to go as it should.

Great! Any suggestions on where we go from here? It really bothers me that the parser does not show up on Google when I type "html parser java" in the search. How do we go about giving it more visibility?

Cheers,
Somik |
From: Kaarle K. <kaa...@ik...> - 2002-01-18 20:24:23
|
hi, What is the Parse.jar file in htmlparser.jar? I would like if htmlparser.jar would be named to HTMLParser.jar according to the name of the application. I happened to call it with capital letters in my application and it's easy for me to make this change but perhaps if someone else does it he does not notice the difference. I today replaced my modified version 0.98 with the official version 1.02 and after I solved some incompatibilities (mainly the BufferedReader thing) it seemed to go as it should. Kaarle --------------------------------------------- Kaarle Kaila http://www.iki.fi/kaila mailto:kaa...@ik... tel: +358 50 3725844 |
From: Somik R. <so...@ya...> - 2002-01-16 14:09:40
|
Hi Folks, Check http://htmlparser.sourceforge.net for a totally new look. Design documentation with sample programs has been added. Feedback is welcome. Regards, Somik |
From: Somik R. <so...@ya...> - 2002-01-09 16:36:33
|
Hi Folks, Another bug was detected in HTMLStyleScanner, and has been immediately fixed. v1.02 has been released with this fix, and another one which allows scanning of Finnish pages to proceed properly. Regards, Somik |
From: Somik R. <so...@ya...> - 2002-01-09 11:50:17
|
Dear Kaarle,

Thank you very much! You are quite right. I forgot that I was using Shift-JIS for Japanese encoding support; SJIS is a Microsoft-specific standard, not Unicode, but if I use a Unicode encoding it should be fine. I will try UTF-8 and will need your help to coordinate some more tests.

Meanwhile, this style thing is proving to be a headache: I just got a report that it's crashing on Google. I need to add more test cases.

Regards,
Somik

----- Original Message -----
From: Kaarle Kaila
To: Somik Raha
Sent: Wednesday, January 09, 2002 2:40 AM
Subject: Re: [Htmlparser-developer] htmlparser 1.0 (Issue with mtv3 is that of internationalization)

At 22:37 8.1.2002 +0530, Somik Raha wrote:
> Hi Kaarle,
>
> I found the reason for the last problem: the site http://www.mtv3.fi has a link in Finnish. That link is not being interpreted correctly by the parser. The link is:
>
> <a href="/ks/ks_20020701b.shtml">Palveluun pääset tästä</a>

Hi Somik,

HTMLParser reads lines from the net. It opens the connection with the statement

    reader = new HTMLReader(new BufferedReader(new InputStreamReader(uc.getInputStream(),"SJIS")),resourceLocn);

I don't know what SJIS stands for. The Java API does not list it, but it does list ISO-8859-1 among others. Check the InputStreamReader constructor. With ISO-8859-1 it does not hang like it did with SJIS! SJIS seems to make everything 7-bit ASCII.

    reader = new HTMLReader(new BufferedReader(new InputStreamReader(uc.getInputStream(),"ISO-8859-1")),resourceLocn);

With this setting, at least the Finnish characters come through correctly.

I also downloaded from CVS the two files you had changed, and I could read www.mtv3.fi. It even reads my web page (rather strange output, though).

In Japan I would expect internationalization to be an issue? Wouldn't Unicode be required there?

regards
Kaarle

> What's happening is that the last < is being corrupted. I haven't faced a problem with internationalization until now, and I am kind of stuck with this one. Maybe you'd be in a better position to solve it than me. I will make the release with the other bug fixed, and I'd be grateful if you can proceed from there.
>
> Regards,
> Somik

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844
|
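The encoding point in this exchange can be tried without a network connection. The following is a minimal sketch, not htmlparser code: the class name `CharsetDemo` and the helper `firstLine` are made up for illustration, and the bytes are assumed to arrive in ISO-8859-1, as Kaarle's test suggests. It decodes the same bytes through `InputStreamReader` with two different charsets. Shift-JIS is left out because not every Java runtime ships that charset; in Shift-JIS a byte such as 0xE4 starts a two-byte character, which would at least be consistent with the `<` after the final `ä` being corrupted.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

    /** Reads the first line of the raw bytes, decoding them with the given charset. */
    static String firstLine(byte[] raw, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(raw), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The Finnish link from the thread; \u00e4 is the letter "a" with diaeresis.
        String link = "<a href=\"/ks/ks_20020701b.shtml\">Palveluun p\u00e4\u00e4set t\u00e4st\u00e4</a>";

        // Assumption for this sketch: the server sends the page as ISO-8859-1 bytes.
        byte[] raw = link.getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the matching charset restores the original text.
        System.out.println(link.equals(firstLine(raw, StandardCharsets.ISO_8859_1))); // true

        // Decoding the same bytes as UTF-8 does not: each 0xE4 byte is malformed
        // input there and is replaced with U+FFFD.
        System.out.println(link.equals(firstLine(raw, StandardCharsets.UTF_8))); // false
    }
}
```

The sketch suggests that switching to a Unicode encoding alone may not be enough: the charset passed to `InputStreamReader` has to match what the server actually sends.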
From: Somik R. <so...@ya...> - 2002-01-08 17:35:21
|
Hi Folks,

An important bug fix has been done. The parser was crashing on style tags - this has been fixed.

Regards,
Somik
|
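The actual change to HTMLStyleScanner is not shown anywhere in this thread. As a purely hypothetical illustration of the failure mode reported in it (a `StringIndexOutOfBoundsException` raised from `String.substring`), the sketch below extracts the body of a style tag that shares a line with other markup, and checks every index before calling `substring`. The class `StyleSpan` and both method names are assumptions made for this example, not part of htmlparser.

```java
public class StyleSpan {

    /** Case-insensitive search on the original string, so indexes stay valid for substring. */
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Returns the text between an opening style tag and its closing tag when both
     * are on this line, or null when the pair is incomplete (for example, when the
     * closing tag is on a later line).
     */
    static String styleBody(String line) {
        int open = indexOfIgnoreCase(line, "<style", 0);
        if (open < 0) {
            return null; // no style tag on this line
        }
        int tagEnd = line.indexOf('>', open);
        if (tagEnd < 0) {
            return null; // opening tag is not closed on this line
        }
        int close = indexOfIgnoreCase(line, "</style", tagEnd + 1);
        if (close < 0) {
            return null; // an unguarded substring(tagEnd + 1, close) would throw here
        }
        return line.substring(tagEnd + 1, close);
    }

    public static void main(String[] args) {
        System.out.println(styleBody("<p>x</p><STYLE type=\"text/css\">a{b:c}</style><p>y</p>")); // a{b:c}
        System.out.println(styleBody("<style>a{b:c}")); // null
    }
}
```

Returning null for an incomplete pair leaves it to the caller to decide whether to read further lines, which is one possible design; a real scanner might instead accumulate lines until the closing tag appears.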
From: Somik R. <so...@ya...> - 2002-01-08 15:46:16
|
Hi Kaarle,

To answer your basic question: the crawler will crawl through a URL (like websnake and similar robot crawlers). It will pick up links and visit those links, and so on recursively, depending on the depth you define.

The bugs you see are not because of the crawler code, but because of some parser bugs. The scanner bugs came in when I tried to fix the case where the style tags are in one big line with other stuff. Obviously, there were not enough test cases. Thankfully, you are htmlparser's best tester :)

Your site and http://www.yle.fi are working fine now. mtv3 is giving the weird out-of-memory exception, and I am now fixing that. As soon as that's done, maintenance release 1.01 will be out.

Cheers,
Somik

----- Original Message -----
From: "Kaarle Kaila" <kaa...@ik...>
To: <htm...@li...>
Sent: Tuesday, January 08, 2002 3:34 AM
Subject: [Htmlparser-developer] htmlparser 1.0

> I tried the example applications using the bat-files
> with htmlparser 1.0 with not very good success.
>
> 1)
> runCrawler http://www.google.com 1
> This gives a list of links on the abovementioned page I assume
>
> 2) (Finnish broadcasting company)
> runCrawler http://www.yle.fi 1
> This throws
> Exception in thread "main" java.lang.StringIndexOutOfBoundsException:
> String index out of range: 27
>
> 3) (Finnish commercial TV station)
> runCrawler http://www.mtv3.fi 1
> this throws
> Exception in thread "main" java.lang.OutOfMemoryError
> <<no stack trace available>>
>
> 4) my own simple homepage
>
> After a rather long time throws:
> Crawling to
> http://www.microsoft.com/ContentRedirect.asp?prd=iis&sbp=&pver=5.0&pid=&ID=404&cat=web&os=&over=&hrd=&Opt1=&Opt2=&Opt3= crawlDepth = 0
> Exception in thread "main" java.lang.StringIndexOutOfBoundsException:
> String index out of range: 23
> at java.lang.String.substring(Unknown Source)
> ........
> I don't think I have such microsoft links on my page. Probably something to
> do with the activeisp.com that provides me with diskspace??
>
> Similar result from my software page at www.kk-software.fi
> --------------------
> As a result of these experiments I did not understand what the Robot tries
> to do??
>
> Any explanations to this?
> regards
> Kaarle
>
> ---------------------------------------------
> Kaarle Kaila
> http://www.iki.fi/kaila
> mailto:kaa...@ik...
> tel: +358 50 3725844
>
> _______________________________________________
> Htmlparser-developer mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-developer
|
From: Somik R. <so...@ya...> - 2002-01-08 15:16:32
|
Hi Kaarle,

Thanks for pointing this out. It's not a bug in the crawler, but in the parser itself - in HTMLStyleScanner. I am trying to fix it ASAP.

Regards,
Somik
|
From: Kaarle K. <kaa...@ik...> - 2002-01-07 22:06:18
|
I tried the example applications using the bat-files with htmlparser 1.0 with not very good success.

1)
runCrawler http://www.google.com 1
This gives a list of links on the abovementioned page I assume

2) (Finnish broadcasting company)
runCrawler http://www.yle.fi 1
This throws
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 27

3) (Finnish commercial TV station)
runCrawler http://www.mtv3.fi 1
this throws
Exception in thread "main" java.lang.OutOfMemoryError
<<no stack trace available>>

4) my own simple homepage

After a rather long time throws:
Crawling to http://www.microsoft.com/ContentRedirect.asp?prd=iis&sbp=&pver=5.0&pid=&ID=404&cat=web&os=&over=&hrd=&Opt1=&Opt2=&Opt3= crawlDepth = 0
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 23
at java.lang.String.substring(Unknown Source)
........
I don't think I have such microsoft links on my page. Probably something to do with the activeisp.com that provides me with diskspace??

Similar result from my software page at www.kk-software.fi
--------------------
As a result of these experiments I did not understand what the Robot tries to do??

Any explanations to this?
regards
Kaarle

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844
|
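The question of what the robot tries to do is answered in the thread as: pick up the links on a page, visit them, and repeat until the given depth is used up. A minimal sketch of that idea follows. It is not the htmlparser crawler: the class `DepthLimitedCrawl`, the method `crawl`, and the `linksOf` parameter are hypothetical names, and link extraction is supplied by the caller so the example runs without network access. It also assumes that the depth argument counts the remaining link hops, which seems consistent with `runCrawler <url> 1` printing `crawlDepth = 0` for followed links.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class DepthLimitedCrawl {

    /** Returns every URL reached from start within the given number of link hops, in visit order. */
    static List<String> crawl(String start, int depth, Function<String, List<String>> linksOf) {
        Set<String> visited = new LinkedHashSet<>();
        visit(start, depth, linksOf, visited);
        return new ArrayList<>(visited);
    }

    private static void visit(String url, int depth,
                              Function<String, List<String>> linksOf, Set<String> visited) {
        // A URL is recorded and expanded only the first time it is seen,
        // which also keeps link cycles from recursing forever.
        if (!visited.add(url)) {
            return;
        }
        if (depth <= 0) {
            return; // depth used up: record the page but do not follow its links
        }
        for (String link : linksOf.apply(url)) {
            visit(link, depth - 1, linksOf, visited);
        }
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" standing in for real pages.
        Map<String, List<String>> web = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("d"),
                "c", List.of("a"));
        Function<String, List<String>> linksOf = url -> web.getOrDefault(url, List.of());

        System.out.println(crawl("a", 1, linksOf)); // [a, b, c]
        System.out.println(crawl("a", 2, linksOf)); // [a, b, d, c]
    }
}
```

With depth 1 the start page's links are visited but not expanded, which would match the first example in the report: a list of the links found on the given page.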
From: Somik R. <so...@ya...> - 2002-01-05 17:11:17
|
Hi Folks,

Sorry about that; the zip file that was uploaded seems to have been corrupted. It's fixed, and you should be able to download it now.

Regards,
Somik
|