htmlparser-developer Mailing List for HTML Parser (Page 30)
Brought to you by:
derrickoswald
From: Claude D. <CD...@ar...> - 2002-07-10 15:58:21
The latest version of the HTMLParser (20020707) appears to deliver good performance compared with the Swing parser and previous HTMLParser versions. These tests were done in context (using our application, which converts HTML documents, among others, into a normalized form and transmits the result as XML to a server over TCP/IP). We have subtracted the transmission time from these numbers, but a small amount of imprecision is probable given preprocessing and file I/O that gets done up front. Given the size of the tests (more than half a million documents), these elements should be negligible. Note that this set includes a large number of small documents, and we know from earlier tests that the Swing parser slows down dramatically as documents get larger, while the HTMLParser does not.

Total Documents processed: 642,077
Average Document Size: 4,043

Average Number of Documents Per Second for:

Swing Parser (Java 1.3.1): 2.797185195
HTMLParser 1.1 Production Version: 2.558727723
HTMLParser 1.2 Early integration build: 2.585632061
HTMLParser 1.2 (build 20020707): 3.224910367

Conclusions: The HTMLParser 1.2 is now about 15% faster than the Swing parser on Swing's home turf (Swing does best with smaller HTML files). With larger files, we have seen improvements as high as 35 times the speed of the Swing parser.
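The "about 15%" conclusion can be checked directly from the throughput figures in the message. The following is only a minimal sketch of that arithmetic; the class and method names are illustrative and are not part of HTMLParser.

```java
class SpeedupCheck {
    /** Relative speedup of candidate over baseline, expressed as a percentage. */
    static double speedupPercent(double baseline, double candidate) {
        return (candidate / baseline - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        double swing = 2.797185195;     // documents/second, Swing parser (from the message)
        double parser12 = 3.224910367;  // documents/second, HTMLParser 1.2 build 20020707
        System.out.println("Speedup over Swing (%): " + speedupPercent(swing, parser12));
    }
}
```

The same helper can be applied to any other pair of rates in the list, for example build 20020707 against the 1.1 production version.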
From: Claude D. <CD...@ar...> - 2002-07-08 22:08:34
Here are the latest numbers from runs we've done using the Trek data set. These are mostly small HTML documents used in IR (Information Retrieval) as baselines (slightly cleaned up for HTML processing):

Total number of documents: 642,077
Total original document size (in bytes): 2,596,104,858

Comparison (times include local socket transmission of output documents - possibly as much as 20-25% of the total time spent):

Swing parser - 4715 minutes total time, average number of documents per second: 2.269625309
HTMLParser 1.1 - 5065 minutes total time, average number of documents per second: 2.112790392
HTMLParser 1.2 (pre optimizations) - 5026 minutes total time, average number of documents per second: 2.129184905

Previous reports that the 1.2 version was slower changed as more data was processed. It was, in fact, only slightly slower than 1.1. If Somik's recent changes improve performance as much as we expect, subsequent numbers should be even better. I thought it would be nice to share these numbers. I will post numbers from a run with the latest optimizations within a few days.

Note that the HTMLTitleScanner, HTMLMetaTagScanner and HTMLScriptScanner are being used in this set of tests, and each element is being tested with "instanceof" to catch key tag information of relevance to our application. The HTMLScriptScanner is there only to make sure we skip over any scripts.
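The documents-per-second figures follow from the document count and the total run time in minutes. A small sketch of that derivation, with illustrative names that do not come from the original application:

```java
class Throughput {
    /** Average documents processed per second, given a total run time in minutes. */
    static double docsPerSecond(long documents, long totalMinutes) {
        return documents / (totalMinutes * 60.0);
    }

    public static void main(String[] args) {
        long documents = 642_077;  // total number of documents, from the message
        System.out.println("Swing parser:   " + docsPerSecond(documents, 4715));
        System.out.println("HTMLParser 1.1: " + docsPerSecond(documents, 5065));
        System.out.println("HTMLParser 1.2: " + docsPerSecond(documents, 5026));
    }
}
```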
From: Somik R. <so...@ya...> - 2002-07-08 01:07:24
Hi Folks,

The latest integration release (2002-07-07) is out, and has major improvements:

[1] 50% speed improvement over v1.1. The previous 1.2 versions had a slowdown bug that made them 20% slower than v1.1.
[2] Fixed a bug in HTMLScriptScanner, which would break on incorrect HTML inside the script code.
[3] Removed HTMLFormScanner from the standard registered scanners, as it has a bug - it cannot parse non-ended forms (it goes into an infinite loop).

Thanks to Claude Duguay for the scalability reports. It is recommended that all v1.2 users upgrade to the latest release for these fixes.

Regards,
Somik

**********************************
Somik Raha
System Architect
Kizna Corporation
Hiroo ON Bldg. 2F, 5-19-9 Hiroo, Shibuya-ku, Tokyo, 150-0012, JAPAN
Phone : +81-3-5475-2646
Fax : +81-3-3445-9089
Web : http://www.kizna.com
Mail : so...@ki...
**********************************
From: Somik R. <so...@ya...> - 2002-07-04 10:07:18
Hi Developers,

Some big changes.

[1] Performance fix in HTMLStringNode. The next release of the parser will be twice as fast as ver 1.1. Actually, until the previous release, 1.2 was 20% slower than 1.1 - thanks to Claude Duguay for pointing this out. I was able to fix this after some profiling with JProbe - it seems toString() is very bad - it gives a big hit.

[2] Bug in HTMLScriptScanner - if the HTML in the script code is bad, it would crash. The script scanner is not supposed to care about the JavaScript code inside it. This has been fixed by removing all other scanners during the scan, and putting them back in after the parsing is done.

[3] Bug in HTMLFormScanner - if the form code is broken (no </form>), there is no way to tell where to put the end tag in. I thought I could look for </table>, but if you have nested tables that won't work. So, for the moment, HTMLFormScanner is no longer registered in the standard set of scanners - until I can find some elegant fix for this.

I'd be grateful if anyone has suggestions for [3]. Watch out for the release this week.

Regards,
Somik
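The approach in [2], removing the other scanners while a script is scanned and restoring them afterwards, can be sketched as follows. This is not the HTMLParser code: `ScannerRegistry` and `withoutScanners` are hypothetical names, and the sketch uses modern Java rather than the Java 1.3 of the time. The try/finally is the point of the sketch: the scanners come back even if the scan throws.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for the parser's table of registered scanners.
class ScannerRegistry {
    private Map<String, Object> scanners = new HashMap<>();

    void register(String tagName, Object scanner) {
        scanners.put(tagName, scanner);
    }

    int size() {
        return scanners.size();
    }

    /** Runs body with no scanners registered, then restores the previous set. */
    <T> T withoutScanners(Supplier<T> body) {
        Map<String, Object> saved = scanners;
        scanners = new HashMap<>();   // script content is scanned with nothing registered
        try {
            return body.get();
        } finally {
            scanners = saved;         // restored even if the scan fails
        }
    }
}
```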
From: Somik R. <so...@ya...> - 2002-07-01 11:44:58
Hi Cheng,

Thanks for the kind words. Regarding the bug, I would call it a feature :)

When you parse a link within a URL, if the link is relative, it gets processed appropriately. If you want to get the absolute link, you should call linkTag.getLink().

The toHTML() method, however, tries to reconstruct the HTML as it appeared (so relative links show up as relative, and absolute links show up as absolute). There might be a controversy regarding the purpose of toHTML() itself - do you think toHTML() should not do an accurate rendition in the case of the HTMLTag? I am open to opinions from everyone on this.

For your purposes, you will need to modify the code of toHTML() in HTMLLinkTag.java.

Original Code :

    public String toHTML() {
        StringBuffer sb = new StringBuffer();
        sb.append("<");
        sb.append(tagContents.toString());
        sb.append(">");
        HTMLNode node;
        for (Enumeration e = linkData(); e.hasMoreElements();) {
            node = (HTMLNode) e.nextElement();
            sb.append(node.toHTML());
        }
        sb.append("</A>");
        return sb.toString();
    }

Modified Code :

    public String toHTML() {
        StringBuffer sb = new StringBuffer();
        // Modification occurs here: emit the absolute link from getLink()
        // instead of the original tag contents. Other attributes of the
        // original tag are not reproduced.
        sb.append("<a href=\"");
        sb.append(getLink());
        sb.append("\">");
        HTMLNode node;
        for (Enumeration e = linkData(); e.hasMoreElements();) {
            node = (HTMLNode) e.nextElement();
            sb.append(node.toHTML());
        }
        sb.append("</A>");
        return sb.toString();
    }

Let me know if I have misunderstood the problem, or if this does not fix it.

Cheers,
Somik

(Note : If you check out the code from CVS, you will get the ant build script - this will make it really simple for you to just build htmlparser.jar and use it in your app.)

----- Original Message -----
From: Cheng Jun
To: htm...@li... ; htm...@li...
Sent: Monday, July 01, 2002 3:51 AM
Subject: [Htmlparser-user] Bug found

Firstly, I have to say thank you to Somik Raha. You really did a good job giving us a new integration release.

I am writing a program to parse web pages and retrieve the links in the pages. I have tried the latest version (6/30) and found there may be a bug. The following is part of the code and output.

    System.out.println("Starting parsing...... ");
    com.kizna.html.HTMLParser Parser = new com.kizna.html.HTMLParser("E://My paper/EdCrawler/page.htm");
    Parser.registerScanners();
    //Parser.parse(null);
    // Parse the HTML file by Tag types
    Enumeration e = Parser.elements();
    while (HasMore) {
        try {
            HasMore = e.hasMoreElements(); // HasMore is a boolean var
        } catch (Exception e2) {
            System.out.println(e2.toString());
            HasMore = false; // have to stop parsing this HTML file
        }
        if (HasMore) {
            com.kizna.html.HTMLNode node = (com.kizna.html.HTMLNode) e.nextElement();
            // HTML DoctypeTag
            if (node instanceof com.kizna.html.tags.HTMLDoctypeTag) {
                com.kizna.html.tags.HTMLDoctypeTag DoctypeNode = (com.kizna.html.tags.HTMLDoctypeTag) node;
                System.out.println("Doctype: " + DoctypeNode.toPlainTextString());
            } //if
            // title
            if (node instanceof com.kizna.html.tags.HTMLTitleTag) {
                com.kizna.html.tags.HTMLTitleTag TitleNode = (com.kizna.html.tags.HTMLTitleTag) node;
                System.out.println("Title: " + TitleNode.toPlainTextString());
            }
            // MATA
            if (node instanceof com.kizna.html.tags.HTMLMetaTag) {
                com.kizna.html.tags.HTMLMetaTag MataNode = (com.kizna.html.tags.HTMLMetaTag) node;
                System.out.println("MATA HTTP-EQUIV: " + MataNode.getHttpEquiv() + " MATA name: " + MataNode.getMetaTagName() + " CONTENT :" + MataNode.getMetaTagContents());
            } //if
            // Links
            if (node instanceof HTMLLinkTag) {
                HTMLLinkTag LinkNode = (HTMLLinkTag) node;
                // Retrieve the data from the object and print it
                System.out.println("LINK: " + LinkNode.toPlainTextString() + " " + " toHTML " + LinkNode.toHTML());
            } //if
            // Parser end
        } // if (HasMore)
    } //while
    System.out.println("Parsing END.");

================================================================
part of the output

    Doctype:
    Title: The University of Edinburgh
    MATA HTTP-EQUIV: null MATA name: Description CONTENT :The University of Edinburgh, promoting excellence in teaching and research.
    MATA HTTP-EQUIV: null MATA name: keywords CONTENT :edinburgh ,university ,degree ,study, studying, research, Scotland, uk, alumni, graduate, postgraduate, PhD, masters, grad ,post ,edinboro, college, school
    MATA HTTP-EQUIV: null MATA name: publisher CONTENT :The University of Edinburgh
    MATA HTTP-EQUIV: null MATA name: author CONTENT :University Web Editor
    LINK: Prospective Students toHTML <a href="studying/">Prospective Students</A>
    LINK: News & Events toHTML <a href="news/">News & Events</A>
    LINK: Faculties & Departments toHTML <a href="/misc/depts.html">Faculties & Departments</A>
    LINK: Present Students toHTML <a href="/presentstudents/">Present Students</A>
    LINK: Research toHTML <a href="research/">Research</A>
    LINK: Support Services toHTML <a href="/misc/support.html">Support Services</A>
    LINK: Staff toHTML <a href="staff/">Staff</A>
    LINK: Lifelong Learning toHTML <a href="http://www.lifelong.ed.ac.uk/">Lifelong Learning</A>
    LINK: The Library toHTML <a href="http://www.lib.ed.ac.uk/">The Library</A>

Now we can see that links within the same domain are displayed with only the relative part of the link itself. So please check the toHTML() method.

Cheng Jun
c....@sm...
2002-07-01 02:38:51
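The distinction in this thread is between a link as written in the page (relative) and the same link resolved against the page's address (absolute). How getLink() performs that resolution internally is not shown in the thread; the sketch below only illustrates the idea using the standard java.net.URI class. The base address http://www.ed.ac.uk/index.html is an assumption made for the example (the original test read a local file), and `LinkResolver` is a hypothetical helper name.

```java
import java.net.URI;

class LinkResolver {
    /** Resolves an href as written in a page against the address of that page. */
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.ed.ac.uk/index.html";  // assumed base address
        System.out.println(toAbsolute(page, "studying/"));
        System.out.println(toAbsolute(page, "/misc/depts.html"));
        System.out.println(toAbsolute(page, "http://www.lib.ed.ac.uk/"));
    }
}
```

Relative links pick up the scheme and host of the page, while links that are already absolute pass through unchanged, which matches the behaviour described for the links in the output above.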
From: Somik R. <so...@ya...> - 2002-06-30 12:26:02
Hi Folks,

This week's integration release is out - you can get it from http://htmlparser.sourceforge.net.

All test cases are passing. A couple of bugs were fixed - some interesting bugs reported by Cedric Rosa (thanks Cedric).

A major refactoring has come in, which makes writing a scanner easier. You don't have to worry about associating scanners with the tags they create. It's done through a template method internally. Also, you don't need to worry about whether your tag has made a call to parseParameters. It is done automatically.

(Kaarle - as of now, almost all the scanners are using parseParameters, thanks a lot for contributing that.)

For the next release, Claude Duguay will be collaborating closely with us - he has given some great suggestions and will be providing some code - he is going to help us get this parser into the professional league, and is currently doing scalability analysis on the parser (those on the user list will have already seen the analysis comparing Swing and v1.1). We should have results on v1.2 soon from him. We should probably get 1.2 out in stable form after this collaboration. (Claude -- Thanks a ton)

You can look forward to some exciting improvements in the coming weeks.

Cheers,
Somik

**********************************
Somik Raha
System Architect
Kizna Corporation
Hiroo ON Bldg. 2F, 5-19-9 Hiroo, Shibuya-ku, Tokyo, 150-0012, JAPAN
Phone : +81-3-5475-2646
Fax : +81-3-3445-9089
Web : http://www.kizna.com
Mail : so...@ki...
**********************************
From: R. <ced...@fr...> - 2002-06-27 11:36:09
Hello,

A little fix: HTMLStyleScanner and HTMLTitleScanner were linked with the same character, so it was impossible to extract only the title or only the style. I've just replaced "t" by "T" for title.

    addScanner(new HTMLStyleScanner("-t"));
    addScanner(new HTMLTitleScanner("-T"));

Cedric.
From: Somik R. <so...@ya...> - 2002-06-27 03:12:03
Hi Folks,

Just finished a big refactoring - HTMLEndTag now derives from HTMLTag. HTMLTagScanner.scan() returns an HTMLTag instead of an HTMLNode. A template method has been created, which connects a tag with its scanner, and also ensures that parsing of attributes has been done. So, scanner designers don't have to bother about these issues anymore.

From the designer's perspective, nothing much changes; it's the same evaluate() and scan() API that you have to keep using.

All tests updated and passing.

Cheers,
Somik
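For readers unfamiliar with the pattern, a template method fixes the overall sequence in the base class and leaves one step to subclasses. The sketch below shows the general idea only. The message does not show how HTMLTagScanner actually divides the work, so `Tag`, `TagScanner`, `LinkScanner` and the `createTag` hook are hypothetical names, and the attribute parsing is deliberately naive.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical tag type: remembers its scanner and its parsed attributes.
class Tag {
    private final String text;
    private TagScanner scanner;
    private final Map<String, String> attributes = new HashMap<>();

    Tag(String text) { this.text = text; }

    void setScanner(TagScanner scanner) { this.scanner = scanner; }
    TagScanner getScanner() { return scanner; }
    String getAttribute(String name) { return attributes.get(name.toUpperCase()); }

    /** Naive attribute parsing: whitespace-separated name="value" pairs, no spaces in values. */
    void parseParameters() {
        String[] parts = text.trim().split("\\s+");
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq > 0) {
                attributes.put(parts[i].substring(0, eq).toUpperCase(),
                               parts[i].substring(eq + 1).replace("\"", ""));
            }
        }
    }
}

abstract class TagScanner {
    /** Template method: the bookkeeping happens here, once, for every scanner. */
    final Tag scan(String tagText) {
        Tag tag = createTag(tagText);  // step supplied by the scanner writer
        tag.setScanner(this);          // associate the tag with the scanner that made it
        tag.parseParameters();         // make sure attributes are always parsed
        return tag;
    }

    /** Hook for scanner writers (hypothetical name). */
    protected abstract Tag createTag(String tagText);
}

class LinkScanner extends TagScanner {
    @Override
    protected Tag createTag(String tagText) {
        return new Tag(tagText);
    }
}
```

A scanner writer in this sketch only implements `createTag`; forgetting to register the tag or to parse its attributes is no longer possible, which is the benefit the message describes.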
From: Somik R. <so...@ya...> - 2002-06-27 02:17:08
Hi Cedric,

Thanks for the bug report. This has been reproduced in HTMLTagTest.testBrokenTag(), and has been fixed. The parser now runs without failing on the same HTML file provided. This fix will make it into the next integration release.

Regarding your earlier bug report, although the bug has been fixed, I am thinking I should introduce a template method, so that new scanner writers don't have to bother about registering the tags with their respective scanners.

Hopefully this refactoring will be in soon, enabling scanners to be written safely. Also need to get cracking on Claude's refactoring suggestions.

Regards,
Somik

----- Original Message -----
From: Cédric Rosa
To: htm...@li...
Sent: Thursday, June 27, 2002 12:48 AM
Subject: [Htmlparser-user] Bad formed web page

Re Somik,

First, thanks for your patch. I'll download it as soon as possible.

I've just tested your program with a web page which contains errors. I'm programming a search engine and some pages may contain errors. I attached a copy of a bad page example: the problem is that the page is truncated before its end (a download error, for example). It is missing a ">" ("<br"), which causes the program to crash with a null pointer exception ...

Can you fix this problem or tell me where (in the sources) I can search for patching?

Thanks in advance for your good support.

Cedric.

At 20:28 26/06/2002 +0900, you wrote:
>Hi Cedric,
>This has been fixed. These two scanners (meta and title tag scanners)
>were not being associated with their tags. Reproduced with a test case
>and fixed. Code on CVS has been updated. This bug fix will make it in the
>next integration release (hopefully this weekend).
>Thanks for the bug report.
>Cheers,
>Somik
>>----- Original Message -----
>>From: Somik Raha
>>To: htm...@li...
>>Sent: Wednesday, June 26, 2002 8:13 PM
>>Subject: Re: [Htmlparser-user] -m option doesn't work ?
>>
>>It does look like a bug - you could probably open a BugZilla report (from
>>http://htmlparser.sourceforge.net), and describe your fix. I will also
>>try to take a deeper look as soon as I find some time.
>>
>>Regards,
>>Somik
>>>----- Original Message -----
>>>From: Cédric Rosa
>>>To: htm...@li...
>>>Sent: Wednesday, June 26, 2002 8:14 PM
>>>Subject: Re: [Htmlparser-user] -m option doesn't work ?
>>>
>>>I've tried with many urls, it's the same problem, but you can check with :
>>>"http://www.cybergeo.presse.fr/actualit/nouvparu/crendus/irstcr3.htm"
>>>
>>>I've just modified the source code to make it work (and now it works fine)
>>>... so maybe it's a bug ?
>>>
>>>Thanks for your help.
>>>
>>>Cedric.
>>>
>>>At 20:02 26/06/2002 +0900, you wrote:
>>> >Hi Cedric,
>>> > Can you give us the url, or send the page over?
>>> >
>>> >Regards
>>> >Somik
>>> >>----- Original Message -----
>>> >>From: Cédric Rosa
>>> >>To: htm...@li...
>>> >>Sent: Wednesday, June 26, 2002 5:40 PM
>>> >>Subject: [Htmlparser-user] -m option doesn't work ?
>>> >>
>>> >>Hello,
>>> >>
>>> >>When I'm trying to parse a web page with htmlparser with this code:
>>> >>
>>> >>HTMLParser parser = new HTMLParser("foo.html");
>>> >>parser.registerScanners();
>>> >>parser.parse(null);
>>> >>
>>> >>everything is OK but when I tried to parse the page with :
>>> >>
>>> >>parser.parse("-m");
>>> >>or
>>> >>parser.parse("-t");
>>> >>
>>> >>I received no answer from the software even if page contains meta tag or
>>> >>title.
>>> >>
>>> >>What's wrong ?
>>> >>
>>> >>thanks by advance for your answers.
>>> >>
>>> >>Cedric.
From: Somik R. <so...@ya...> - 2002-06-26 02:05:56
> 1) Point me to something that will tell me how to setup CVS to get an
> update and I try to get set up to check things in.

From your other mail, it seems you got CVS to work. You definitely need SSH to check in code - http://cdx.sourceforge.net/win-HOWTO.htm

I was using Tortoise CVS earlier - it's important that you make a checkout once using SSH from your DOS shell. Then you can continue to update and commit using Tortoise CVS.

The better and more elegant option is to use Eclipse - the great free Open Source IDE supported by IBM - it interfaces very cleanly with CVS and SSH (extssh), and you don't need to set up anything.

Let's continue our tech discussions on the developer list.

Cheers,
Somik
From: Somik R. <so...@ya...> - 2002-06-26 02:01:17
> WRT exception handling vs. feedback, only fatal exceptions should be
> thrown and feedback, where you are currently using System.out or
> System.err should go through an interface that users can reroute as they
> might prefer (to logs, console or ignore them). I have written up the
> classes and packaged them under the com.kizna.html.util package. I can
> send these to you in any form you like.

I agree. The existing System.err.println() statements - I think they all indicate fatal errors - hence should be converted to an exception throwing system.

The Callback mechanism should also come in so we can start using it in the rest of the library.

Also - another issue I have been thinking of is SAX compliance. I don't think it will be hard to make callbacks from the parse() method... What do you think?

> The files are:
>
> HTMLFeedback
> DefaultHTMLFeedback
> FeedbackManager
> HTMLParserException (a chained exception class).

You put them in CVS. Do you think it'd be better to have a com.kizna.html.exceptions package instead of util, for better naming conventions?

> I am debating whether to keep the ChainedException class as a base class
> for more general use and use an HTMLParserException subclass. Any
> thoughts?

Hmm.. I'd need to see the code before I can comment.

Since you are now going to be a developer - here are two important guidelines (which you might be already following) :

[1] All the code that is checked in must come with test cases and should not break existing tests. As of now the parser is almost 100% covered by tests.
[2] The bug fixing strategy is - write a test case to simulate the bug, make the test case fail, then fix the bug.

Cheers,
Somik
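The message names the feedback classes but does not show their contents. The following is therefore only a guess at the shape such a design could take: the type names HTMLFeedback, DefaultHTMLFeedback and FeedbackManager come from the message, while every method signature, the message prefixes and the `CollectingFeedback` class are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Assumed interface: library code reports through this instead of System.out/System.err.
interface HTMLFeedback {
    void info(String message);
    void warning(String message);
}

// Assumed default: keeps the old console behaviour.
class DefaultHTMLFeedback implements HTMLFeedback {
    public void info(String message) { System.out.println("INFO: " + message); }
    public void warning(String message) { System.err.println("WARNING: " + message); }
}

// Hypothetical alternative a caller might install, e.g. to write to a log later.
class CollectingFeedback implements HTMLFeedback {
    final List<String> messages = new ArrayList<>();
    public void info(String message) { messages.add("INFO: " + message); }
    public void warning(String message) { messages.add("WARNING: " + message); }
}

// Assumed manager: a single place where the active feedback target is swapped.
class FeedbackManager {
    private static HTMLFeedback current = new DefaultHTMLFeedback();

    static void setFeedback(HTMLFeedback feedback) { current = feedback; }
    static void info(String message) { current.info(message); }
    static void warning(String message) { current.warning(message); }
}
```

With this arrangement the library would call `FeedbackManager.warning(...)` wherever it currently prints, and an application that prefers silence or a log file installs its own implementation once at start-up.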
From: Somik R. <so...@ya...> - 2002-06-26 01:39:24
----- Original Message -----
From: Somik Raha
To: htm...@li...
Sent: Wednesday, June 26, 2002 10:13 AM
Subject: Re: [Htmlparser-user] Testing/feedback, question

Dear Claude,

Great mail to read. By the way, as I understand it, you've used v1.1 for these tests. However, I have made some special optimizations in v1.2, particularly to improve scalability. The string node parser now creates only one HTMLStringNode object for continuous text. So if you had 10,000 lines, v1.1 would create 10,000 objects, while v1.2 would create only one. The other scanners have also been optimized. I think this would result in a substantial improvement in your test results.

By the way, do you think you can write an article about your tests? We could put it up on the HTMLParser page. Also, send me your sourceforge id; I'd like to add you as a developer to this project, so that you can check in improvements directly to CVS.

Regards,
Somik

----- Original Message -----
From: "Claude Duguay" <CD...@ar...>
To: <htm...@li...>
Sent: Wednesday, June 26, 2002 2:57 AM
Subject: RE: [Htmlparser-user] Testing/feedback, question

Here are some test results I thought you may be interested in:

We ran about 58k files through our conversion process using both the old, Swing-based HTML parser and the new HTMLParser solution yesterday. Some of these files are not HTML and are routed to other parsers, but this particular set of files was especially problematic with the Swing parser.

The exact nature of the Swing parser problem is a reallocation of buffer space with too small an increment deep down inside the parser code. In effect, some ungodly low number (4-8) of bytes is allocated each time the string grows, causing an array copy each time with a growing string. This is problematic when handling files with large text content between a specific set of tags, such as large log listings between <PRE> tags.

Using the old (Swing) parser, we processed 57952 documents, encountered 67 errors, ran in 10305 minutes (several days), with an original aggregate file size of 6,252,739,014 bytes and a converted document collection size around 761,653,928 bytes.

Using the new (HTMLParser) parser, we processed 58113 documents, encountered 69 errors, ran in 294 minutes, with an original aggregate file size of 6,256,488,243 bytes and a converted document collection size around 431,198,296 bytes.

While this is not a conclusive test - there are clearly discrepancies between the two conversion runs that need to be resolved, such as different output size counts, which are attributable to changes we have made - the timing difference is impressive: going from 10305 minutes to 294 minutes is just over 35 times faster. This is mostly attributable to the problematic files in this test set, which took on the order of hours to process each. Yet clearly the HTMLParser solution overcomes a serious bug in the Swing parser (which cannot be patched by anyone but Sun or its Java license holders - given the way the Java license agreement is written).

Note that the same low-level reallocation of string resources in the Swing parser is less problematic in cases where less text is found between each tag, but the performance differences should still be significant taken over a large set of files.

I will share what I can as we learn more.
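The cost of growing a buffer by a small fixed increment, as Claude describes, can be illustrated by counting how many characters get copied during reallocation. This is a model of the general effect, not the Swing parser's code; the class and method names are made up for the example, and a doubling strategy is shown only as a common point of comparison.

```java
class BufferGrowth {
    /** Characters copied while appending total chars one at a time, growing by a fixed increment. */
    static long copiesWithFixedIncrement(int total, int increment) {
        long copied = 0;
        int capacity = increment;
        for (int size = 0; size < total; size++) {
            if (size == capacity) {   // buffer full: allocate a slightly larger one and copy
                copied += size;
                capacity += increment;
            }
        }
        return copied;
    }

    /** Same count when the capacity doubles each time the buffer fills up. */
    static long copiesWithDoubling(int total, int initialCapacity) {
        long copied = 0;
        int capacity = initialCapacity;
        for (int size = 0; size < total; size++) {
            if (size == capacity) {
                copied += size;
                capacity *= 2;
            }
        }
        return copied;
    }

    public static void main(String[] args) {
        int total = 1_000_000;  // e.g. a large log listing between <PRE> tags
        System.out.println("fixed +8 : " + copiesWithFixedIncrement(total, 8));
        System.out.println("doubling : " + copiesWithDoubling(total, 8));
    }
}
```

With a fixed increment the copied volume grows with the square of the text length, while doubling keeps it proportional to the length, which is consistent with the parser degrading sharply only on files with very long text runs.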
From: Somik R. <so...@ya...> - 2002-06-26 01:38:54
----- Original Message -----
From: Claude Duguay
To: htm...@li...
Sent: Tuesday, June 25, 2002 6:38 AM
Subject: [Htmlparser-user] Testing/feedback, question

I've just started using the HTMLParser and hope to be able to provide improved throughput and reliability over the Swing HTML parser by applying this open source solution, hopefully offering bug fixes/enhancements back to the community.

We have (my company) processed about 11 million HTML documents successfully (with the Swing parser), some of which we'll see tested again with the HTMLParser code in the next few weeks. To date, we have only run a few simple tests with the HTMLParser code, but it appears now that the library is writing to standard err. I would expect all errors to result in parser-specific exceptions that the calling application would be free to handle as it may see fit.

Some of the data we are processing is not publicly available. The errors we have seen are issues with very large HTML files that were generated from log files. These are surprisingly common but offer a special challenge to HTML parsers in that they tend to contain large strings of log file information between <pre></pre> tags.

We'll probably be running about 1 or 2 million files through the parser this week. I will try to report problems and get set up to build the library so that I can offer more specific class/line-based feedback/fixes.

Thanks.
From: Somik R. <so...@ya...> - 2002-06-26 01:38:07
> 1) There is command line handling and connection-oriented code in
> HTMLParser. This code should be uncoupled. Perhaps an HTMLParserMain
> class to handle the command line wrapper, keeping the HTMLParser code
> dedicated to parsing?

Good suggestion. This refactoring should be done.

> 2) Thanks for filling in the toString methods in 1.2. I had noticed most
> missing in 1.1 and was concerned. While there's room for minor
> improvement (the use of StringBuffer to build strings and consistent
> naming conventions), these are minor quips. I've found it useful to
> have a void toString(StringBuffer buffer); method variant in container
> classes, for building up strings from contained classes more efficiently.

We need to go through a phase of optimization looking at the strings used. The toString(StringBuffer) method also sounds useful.

> 3) I love the existence of the toHTML(); methods.

This was the suggestion of Sam Joseph (it used to be toRawString() in older integration releases). Thanks Sam!

> 4) I see it's now possible to get something by calling getTag. This was
> missing in 1.1. Thanks.

Hmm.. This method should actually read getTagName().

> 5) I noticed a lot of code in the HTMLTag class which is 'private
> static'. This suggests the need for an external class to handle this
> type of work. At a peripheral glance, I'm presuming you're functioning as
> a Finite State Machine (thus the 'automata' prefix)?

Ah yes, I have been thinking of doing this refactoring for a while, and also of refactoring the other finite state machines for strings and remarks.

> Thanks for the big investment. I'd be happy to spend a little time
> helping with some of the grunt work. If you think the use of the
> Callback mechanism is good, for example, I could replace all the
> System.out and System.err for you and send you the code.

You are most welcome to join us - as I mentioned, I'd be happy to add you as a developer.

> 6) I noticed that you don't have a custom exception class. I have code
> kicking around that implements chained exceptions (as in Java 1.4) but is
> compatible with earlier Java versions. Chained exceptions are incredibly
> useful for wrapping underlying exceptions into higher-level exceptions
> while retaining the stack trace. This results in highly usable libraries
> because it provides suitable high-level explanations of a problem, while
> retaining lower level context.

Sounds like a great idea. Please go ahead and add it to the CVS version.

> 7) I also have a very simple but versatile command line handler class
> that you can use if you like. It lets you retrieve arguments as either
> flags or parameter-followed options, single or multiple letter commands,
> order-dependent, etc. While simple, this is one of those classes that
> nobody should live without ;-).

It would be good to have this in the parser. Great to have you on board!

Cheers,
Somik
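Point 6 describes chained exceptions that work on Java versions before 1.4. Claude's actual classes are not shown in the thread, so the following is only a sketch of how such a class is commonly written: the wrapper stores the underlying exception itself and appends its stack trace when printed. The names ChainedException and HTMLParserException appear elsewhere in the thread; the fields, constructors and `getRootCause` accessor here are assumptions.

```java
import java.io.PrintWriter;

// Sketch of a pre-1.4 style chained exception: the cause is stored by hand
// rather than through the Throwable cause support added in Java 1.4.
class ChainedException extends Exception {
    private final Throwable rootCause;

    ChainedException(String message, Throwable rootCause) {
        super(message);
        this.rootCause = rootCause;
    }

    Throwable getRootCause() {
        return rootCause;
    }

    /** Prints this exception's trace followed by the trace of the wrapped exception. */
    @Override
    public void printStackTrace(PrintWriter out) {
        super.printStackTrace(out);
        if (rootCause != null) {
            out.println("Caused by:");
            rootCause.printStackTrace(out);
        }
    }
}

// Parser-specific subclass, following the naming discussed in the thread.
class HTMLParserException extends ChainedException {
    HTMLParserException(String message, Throwable rootCause) {
        super(message, rootCause);
    }
}
```

A caller would catch a low-level IOException, wrap it in an HTMLParserException with a message about which page failed, and still see the original I/O trace when the exception is printed.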
From: Somik R. <so...@ya...> - 2002-06-22 07:04:00
Hi Folks,

Integration Release 2002-06-23 is out. You can get it from http://htmlparser.sourceforge.net.

Apart from a couple of bug fixes, some major changes :

[1] HTMLFormScanner, HTMLFrameSetScanner included and registered in HTMLParser. These have been redesigned and optimized. Earlier scanners were also refactored and optimized.
[2] API change - toRawString() renamed to toHTML() - an intention-revealing name.

117 test cases now, all passing. Check the release notes for more details.

Hopefully, this should be the last release before stable release 1.2. I'd be grateful if the community can check this out quickly - see if there are any bugs remaining that need fixing.

To do :
[1] Nice docs for writing your own scanner.

Regards,
Somik

**********************************
Somik Raha
System Architect
Kizna Corporation
Hiroo ON Bldg. 2F, 5-19-9 Hiroo, Shibuya-ku, Tokyo, 150-0012, JAPAN
Phone : +81-3-5475-2646
Fax : +81-3-3445-9089
Web : http://www.kizna.com
Mail : so...@ki...
**********************************
From: Somik R. <so...@ya...> - 2002-06-16 09:15:40
|
Hi Folks,

A new integration build is out.

Major change :
[1] HTMLStringNode now gives string blocks, all in one string node object, instead of several string node objects for continuous lines. This is based on a bug report by Gordon Deudney. This will improve the scalability of the parser.
[2] HTMLScriptScanner's scan method has been refactored. For folks writing new scanners, take a look at this method - to see how simple it is to make your own scanners. There's a substantial reduction in the code size and complexity.

To do :
[1] Integrate Raghavendra Srimantula's scanners (Form and Frame) as soon as the test cases are available.
[2] Write a guide for writing your own scanners.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2002-06-07 05:22:25
|
Hi Folks,

An integration build is out, incorporating 2 bug fixes in the script scanner, contributed by Wolfgang Germund.

Wolfgang --> Thanks a lot for the nice test cases that you submitted, and of course, the fixes. Incorporated and released in the latest package.

Release 1.2 is still some way off... need test cases for the form and frame scanners...

Regards,
Somik |
From: Craig R. <cr...@qu...> - 2002-05-13 10:36:30
|
Wrong mail address, again :>
--------
Hi Somik,

I thought I'd brief you on how my investigation in the SwingParser was going. I took your CVS module and managed with some changes to integrate it into Swing's JEditorPane HTML renderer to make a simple HTML browser. It soon became apparent however that the renderer requires perfectly formed HTML. After playing with the idea of trying to fix bad HTML myself, I realised the enormity of this task and looked for an existing implementation.

JTidy (http://www.sourceforge.net/projects/jtidy), a port of a C library (HTML Tidy), is another SourceForge project which performs HTML validation and pretty-printing. It produces a DOM of the HTML page from an InputStream from which I performed the relevant callbacks. The result is a good replacement for Sun's DocumentParser, and it produces a nice output of what was wrong/fixed during parsing. I am still trying to determine whether the 174kb it adds on to any project is worth it tho (and if there are any performance implications).

I haven't checked my code back in since it no longer depends on htmlparser in any way, but I can send it to you if you're interested.

-craig |
From: Somik R. <so...@ya...> - 2002-05-12 09:07:49
|
Hi Raghav

I went thru the yahoo.txt, and just like your previous one, this one too had very dirty html. The reason you got the OutOfMemoryError was that this kind of html sent the parser into an infinite loop (in HTMLLinkScanner). The tag which did this was :

<a href=s/8741><img src="http://us.i1.yimg.com/us.yimg.com/i/i16/mov_popc.gif" height=16 width=16 border=0></img></td><td nowrap> <a href=s/7509><b>Yahoo! Movies</b></a>

As you can see, the first link tag does not have an end tag. I verified with the actual yahoo page, and this link occurs quite decently, with the correct end tag. After looking closely at your supplied file, I also notice the </img> tag, which is highly unusual in normal html.

So - I am guessing that this file is generated by a program and not by a human. You would definitely want to check the program that's doing it - it's surely buggy.

However, my yardstick for the robustness of this parser is Internet Explorer. If the stuff works in IE, then it's got to work here. And as I tried this particularly bad piece of html, I found IE does not crash. Hence, I had to go about empowering the parser to parse these erroneous tags <sigh> Took hours!! </sigh>

The good news is, it's done. We can parse these tags, and the correct end tag is inserted just before td. Of course, I have done a minimal adjustment for your purpose. As time goes on, robustness ought to increase further. All test cases passing. The framework for handling dirty html is also slightly modified.

An integration release has been made (2002-05-12), and is under the integration builds package. You can download from http://htmlparser.sourceforge.net.

The parser should not crash on your html now.
Regards, Somik ----- Original Message -----=20 From: Raghavender Srimantula=20 To: htm...@li...=20 Sent: Saturday, May 11, 2002 4:32 AM Subject: Re: [Htmlparser-user] Hints on how to change image tag = locations andwriteoutdocument Hi Somik, I have mentioned about the out of memory error problem earlier. last = time=20 for every iteration of for loop I was adding the whole page to my = string=20 buffer. so it was giving me the out of memory error. I removed that = now. it=20 was working fine till yesterday. now I find that error again. this = time=20 nothing to do with string buffer...and it looks like a real problem. I = can=20 send you the main class and the yahoo.txt I have. try running it. Thanks, Raghav >From: "Somik Raha" <so...@ya...> >Reply-To: htm...@li... >To: <htm...@li...> >Subject: Re: [Htmlparser-user] Hints on how to change image tag = locations=20 >andwriteoutdocument >Date: Fri, 10 May 2002 00:43:19 +0900 > >Hi Raghav, > On analyzing yahoo.txt, I found that you have incorrect html. = There is=20 >a script tag that has not been closed. So naturally the script = scanner goes=20 >bonkers. Rename the extension to .html, and open this file in IE, and = you=20 >will find that IE also cant handle this. > I verified from www.yahoo.com, and found that they do have the = correct=20 ></script> tag provided. So I guess your yahoo.txt file is faulty. > >Regards, >Somik > ----- Original Message ----- > From: Raghavender Srimantula > To: htm...@li... > Sent: Thursday, May 09, 2002 4:53 AM > Subject: Re: [Htmlparser-user] Hints on how to change image tag=20 >locations andwriteoutdocument > > > Hi Somik, > I was using the 1.1 version of htmlparser. I save the = www.yahoo.com=20 >content > in a flat file yahoo.txt. and I run the parser against this. = throws a > nullpointerexception in HTMLScriptScanner. this seems to be a new=20 >addition > for 1.1. I will send the stacktrace, the main program and the = yahoo.txt. > actually I cannot send the stacktrace. 
I made some changes and the = line > numbers dont match. but if you run this program you would see the > nullpointerexception. > Thanks, > Raghav > > > >From: "Somik Raha" <so...@ya...> > >Reply-To: htm...@li... > >To: <htm...@li...> > >Subject: Re: [Htmlparser-user] Hints on how to change image tag=20 >locations > >and writeoutdocument > >Date: Mon, 6 May 2002 13:59:11 +0900 > > > >Hi Raghav, > > I sent another mail sometime back to you - > > > >"HTMLLinkTag.linkData() - this gives you an enumeration - and in = the > >enumeration will be your HTMLImageTag." > >HTMLNode node; > >HTMLImageTag imageTag; > >for (Enumeration e =3D linkTag.linkData();e.hasMoreElements();) { > > node =3D (HTMLNode)e.nextElement(); > > if (node instanceof HTMLImageTag) { > > imageTag =3D (HTMLImageTag)node; > > // your code here > > } > >} > > > >Regards, > >Somik > >----- Original Message ----- > >From: "Raghavender Srimantula" <kin...@ho...> > >To: <htm...@li...> > >Sent: Monday, May 06, 2002 10:43 AM > >Subject: Re: [Htmlparser-user] Hints on how to change image tag=20 >locations > >and writeoutdocument > > > > > > > Hi Somik, > > > this question is regarding "not all images are being = retrieved". I=20 >mean > >the > > > images under <a tag. I did try to open the attachment you sent = me. I > >could > > > not find anything. but seeing the previous mails I could read = that=20 >it is > >not > > > a bug. but still if I do want to retrieve all the images how = do I do=20 >it. > > > Thanks, > > > Raghav > > > > > > > > > >From: "Somik Raha" <so...@ya...> > > > >Reply-To: htm...@li... > > > >To: <htm...@li...> > > > >Subject: Re: [Htmlparser-user] Hints on how to change image = tag > >locations > > > >and write outdocument > > > >Date: Tue, 30 Apr 2002 11:37:26 +0900 > > > > > > > >Hi Raghav, > > > > Ah - this was a question by Annette Doyle (titled "Not = all=20 >image > >tags > > > >are returned"). I am attaching my reply. 
> > > > > > > >Regards > > > >Somik > > > > > > > >----- Original Message ----- > > > >From: "Raghavender Srimantula" <kin...@ho...> > > > >To: <htm...@li...> > > > >Sent: Tuesday, April 30, 2002 11:16 AM > > > >Subject: Re: [Htmlparser-user] Hints on how to change image = tag > >locations > > > >and write outdocument > > > > > > > > > > > > > hi Somik, > > > > > I found one more interesting thing here. when I am trying = to get=20 >all > >the > > > > > images the image scanner would give me images > > > > > <img > = >src=3D"http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif" > > > > > width=3D296 height=3D27 border=3D0 usemap=3D#tm> > > > > > so if I do a imagetag.getImageLocation(), I would get > > > > > = http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif > > > > > > > > > > but is the html content is like this > > > > > <a href=3Ds/6006><img > > > >src=3Dhttp://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif > > > > > border=3D0 width=3D70 height=3D22></a> > > > > > which starts with <a and ends with </a>, then the image = scanner=20 >will > >not > > > > > give me http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif = when=20 >I do > >a > > > > > imagetag.getImageLocation(). this is not even classified = as an > >ImageTag. > > > > > this is classified as LinkTag. how to get this image. > > > > > > > > > > the above content is from www.yahoo.com. on the netscape = browser=20 >if > >you > > > >goto > > > > > view-->pageinfo, you will see a bunch of images. > > > > > but when you run the htmlparser you can get only one = image. > > > > > > > > > > Thanks, > > > > > Raghav > > > > > > > > > > > > > > > >From: "Somik Raha" <so...@ya...> > > > > > >Reply-To: htm...@li... > > > > > >To: <htm...@li...> > > > > > >Subject: Re: [Htmlparser-user] Hints on how to change = image tag > > > >locations > > > > > >and write outdocument > > > > > >Date: Tue, 30 Apr 2002 09:15:38 +0900 > > > > > > > > > > > >Can you describe your application ? 
Was it parsing a = single=20 >page > >when > > > >the > > > > > >problem occurred ? > > > > > > > > > > > >Regards, > > > > > >Somik > > > > > >----- Original Message ----- > > > > > >From: "Raghavender Srimantula" <kin...@ho...> > > > > > >To: <htm...@li...> > > > > > >Cc: <htm...@li...> > > > > > >Sent: Tuesday, April 30, 2002 8:36 AM > > > > > >Subject: Re: [Htmlparser-user] Hints on how to change = image tag > > > >locations > > > > > >and write outdocument > > > > > > > > > > > > > > > > > > > Hi Somik, > > > > > > > I encountered a strange problem today. while I was = running > > > > > >htmlparser...I > > > > > > > got a java.lang.OutOfMemoryError. seems that lot of = objects=20 >are > > > >being > > > > > > > allocated. where exactly is this happening. I mean = could you > >give > >me > > > >an > > > > > >idea > > > > > > > where or in which file the potential problem could be. > > > > > > > Raghav > > > > > > > > > > > > > > > > > > > > > >From: "Somik Raha" <so...@ya...> > > > > > > > >Reply-To: htm...@li... > > > > > > > >To: <htm...@li...> > > > > > > > >CC: <htm...@li...> > > > > > > > >Subject: Re: [Htmlparser-user] Hints on how to change = image=20 >tag > > > > > >locations > > > > > > > >and write out document > > > > > > > >Date: Sat, 27 Apr 2002 18:22:34 +0900 > > > > > > > > > > > > > > > >Hi Annette, > > > > > > > > Pls find attached a program to get you started. = This > >program > > > >will > > > > > >do > > > > > > > >what you want - you will need to modify the construct = that > >checks > > > >for > > > > > >the > > > > > > > >image tag - and replace it with the location of your=20 >choice. > > > > > > > > Also - I found one bug thanks to this = requirement -=20 >image > >tags > > > > > >params > > > > > > > >were not being correctly put in. 
Though it needs a = deeper=20 >look, > >I > > > >have > > > > > >done > > > > > > > >a quick fix for now, and all test cases are passing = (with=20 >one > >test > > > >case > > > > > >in > > > > > > > >HTMLImageScannerTest trapping this bug). > > > > > > > > Please check out the latest html parser source = code=20 >from > >CVS. > > > > > > > > > > > > > > > >Regards, > > > > > > > >Somik > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > From: Doyle, Annette > > > > > > > > To: htm...@li... > > > > > > > > Sent: Friday, April 26, 2002 10:08 PM > > > > > > > > Subject: [Htmlparser-user] Hints on how to change = image=20 >tag > > > > > >locations > > > > > > > >and write out document > > > > > > > > > > > > > > > > > > > > > > > > Could you please give me some hints as how to = change=20 >only > >image > > > >tag > > > > > > > >locations and then, (or at the same time) write out = the=20 >html > > > >document > > > > > >to > > > > > > > >file (with new image tag locations)? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks- > > > > > > > > > > > > > > > > Annette Doyle > > > > > > > > > > > > > > > ><< ImageTagRetriever.java >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >_________________________________________________________________ > > > > > > > Join the world's largest e-mail service with MSN = Hotmail. > > > > > > > http://www.hotmail.com > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > Htmlparser-user mailing list > > > > > > > Htm...@li... > > > > > > > = https://lists.sourceforge.net/lists/listinfo/htmlparser-user > > > > > > > > > > > > > > > > > >_______________________________________________ > > > > > >Htmlparser-user mailing list > > > > > >Htm...@li... 
> > > > > = >https://lists.sourceforge.net/lists/listinfo/htmlparser-user > > > > > > > > > > > > > > > > > > > > > > > > >=20 >_________________________________________________________________ > > > > > Send and receive Hotmail on your mobile device: > >http://mobile.msn.com > > > > > > > > > > > > > > > _______________________________________________ > > > > > Htmlparser-user mailing list > > > > > Htm...@li... > > > > > = https://lists.sourceforge.net/lists/listinfo/htmlparser-user > > > ><< > > > > >=20 > = >[Htmlparser-developer]Re_[Htmlparser-user]Notallimagetagsarereturned[Not= aBu > >g].eml > > > > >> > > > > > > > > > > > > > > > = _________________________________________________________________ > > > MSN Photos is the easiest way to share and print your photos: > > > http://photos.msn.com/support/worldwide.aspx > > > > > > > > > = _______________________________________________________________ > > > > > > Have big pipes? SourceForge.net is looking for download = mirrors. We > >supply > > > the hardware. You get the recognition. Email Us: > >ban...@so... > > > _______________________________________________ > > > Htmlparser-user mailing list > > > Htm...@li... > > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user > > > > > > > >_______________________________________________ > >Htmlparser-user mailing list > >Htm...@li... > >https://lists.sourceforge.net/lists/listinfo/htmlparser-user > > > > > _________________________________________________________________ > Get your FREE download of MSN Explorer at=20 >http://explorer.msn.com/intl.asp. > _________________________________________________________________ Join the world's largest e-mail service with MSN Hotmail.=20 http://www.hotmail.com |
From: Somik R. <so...@ya...> - 2002-05-08 11:17:08
|
Hi Craig, You are now on the developer team. > For the handleSimpleTag, I'm thinking the only way to do this is to > maintain an internal tag buffer and callback only once the entire > document has been parsed and the end tags have been found. Its not > ideal, but you have to be able to deal with <p> and <p> </p>. Hmm - I am using a very similar approach - check the code and the explanations that I sent earlier. I dont know what the parser is doing with <p>, it will be interesting to find out. Cheers, Somik ----- Original Message ----- From: "Craig Raw" <cr...@qu...> To: <htm...@li...> Cc: <so...@ya...> Sent: Wednesday, May 08, 2002 7:38 PM Subject: [Htmlparser-developer] RE: [Htmlparser-user] Swing integration > Thanks, Somik, I'm on the user and dev lists now, and its coming through > fine. My SourceForge ID is 538740, username 'craigra'. > > For the handleSimpleTag, I'm thinking the only way to do this is to > maintain an internal tag buffer and callback only once the entire > document has been parsed and the end tags have been found. Its not > ideal, but you have to be able to deal with <p> and <p> </p>. > > > > -----Original Message----- > From: htm...@li... > [mailto:htm...@li...] On Behalf Of Somik > Raha > Sent: 08 May 2002 12:13 PM > To: htm...@li... > Cc: Craig Raw > Subject: Re: [Htmlparser-user] Swing integration > > Hi Craig, > I actually replied to you on htmlparser-developer, your earlier > mails > went there. Are you on that list ? > Am attaching the relevant mails to this mail - hope it goes thru. > Regards > Somik > ----- Original Message ----- > From: "Craig Raw" <cr...@qu...> > To: <htm...@li...> > Cc: <so...@ya...> > Sent: Wednesday, May 08, 2002 6:49 PM > Subject: [Htmlparser-user] Swing integration > > > > Posted this earlier, seems to have got lost.... > > ---- > > > > > > Hi Somik, > > > > I'm looking into the HTMLParser-Swing integration again, and I have > two > > questions: > > > > 1. 
The HTMLEditorKit.ParserCallback takes a position with most of its > > callback functions. Can this position be extracted from the HTMLTag's > > elementBegin()? > > > > 2. There is a need to differentiate between a callback to > > handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) and > > handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) when > > iterating through the HTMLTag elements Enumeration. How? > > > > You mentioned you have started an implementation - if you have a > > framework going, I'd be happy to continue with the donkey work. I > really > > think this could make Swing's HTML rendering a lot more stable. > > > > Regards, > > Craig > > > > > > > > > > > > -----Original Message----- > > From: Somik Raha [mailto:so...@ya...] > > Sent: 16 April 2002 04:57 AM > > To: htm...@li... > > Cc: Craig Raw > > Subject: Re: [Htmlparser-user] Swing integration > > > > Hi Craig, Asgher > > I finally had the time to check Swing integration. Boy - the > parser > > design in Swing sucks!! Theoretically its possible to do it - and I > got > > started, but just realized that in order to be compatible with swing > > objects > > that do compile time type checking with a particular tag, I have to > > actually > > have 73 if statements to give the right tag to the callback. > > I have more important things to do at the moment, but probably > will > > get > > back to this donkey work. *sigh* > > > > I am thinking we should make release 1.1 and then try this. Any > > suggestions ? > > > > Regards, > > Somik > > ----- Original Message ----- > > From: "Somik Raha" <so...@ya...> > > To: <htm...@li...> > > Sent: Thursday, April 04, 2002 11:20 AM > > Subject: Re: [Htmlparser-user] Swing integration > > > > > > > Hi Craig, > > > Thanks a lot for the post. Pls go ahead with your analysis. I > will > > try > > > to catch up this weekend. 
> > > Regards, > > > Somik > > > ----- Original Message ----- > > > From: "Craig Raw" <cr...@qu...> > > > To: "'Somik Raha'" <so...@ya...> > > > Sent: Tuesday, April 02, 2002 3:32 PM > > > Subject: RE: [Htmlparser-user] Swing integration > > > > > > > > > > Hi Somik, > > > > > > > > A quick excerpt from javax.swing.text.html.HTMLEditorKit javadoc - > > which > > > > is the driver behind JEditorPane's reading and writing HTML > > > > capabilities. > > > > > > > > --- > > > > Extendable/Scalable > > > > > > > > To maximize the usefulness of this kit, a great deal of effort has > > gone > > > > into making it extendable. These are some of the features. > > > > The parser is replaceable. The default parser is the Hot Java > parser > > > > which is DTD based. A different DTD can be used, or an entirely > > > > different parser can be used. To change the parser, reimplement > the > > > > getParser method. The default parser is dynamically loaded when > > first > > > > asked for, so the class files will never be loaded if an > alternative > > > > parser is used. The default parser is in a separate package called > > > > parser below this package. > > > > > > > > The parser drives the ParserCallback, which is provided by > > HTMLDocument. > > > > To change the callback, subclass HTMLDocument and reimplement the > > > > createDefaultDocument method to return document that produces a > > > > different reader. The reader controls how the document is > > structured. > > > > Although the Document provides HTML support by default, there is > > nothing > > > > preventing support of non-HTML tags that result in alternative > > element > > > > structures. > > > > --- > > > > > > > > I may find some time to look into this as well, although I am not > > sure > > > > how much it would fix JEditorPane's somewhat buggy HTML rendering > > > > capabilities.... > > > > > > > > -craig > > > > > > > > > > > > -----Original Message----- > > > > From: htm...@li... > > > > [mailto:htm...@li...] 
On Behalf Of > > Somik > > > > Raha > > > > Sent: 01 April 2002 05:28 PM > > > > To: HTMLParser User List > > > > Cc: HTMLParser Developer List > > > > Subject: Re: [Htmlparser-user] Swing integration > > > > > > > > Hi Craig > > > > Wow! Thats a great question. > > > > Actually, I doubt if I could replace Sun Microsystems' code > with > > > > mine. I > > > > dont think Java is that open (or is it ?) > > > > However, we could think of writing our own adapter for the html > > parser > > > > that > > > > might plugin in some way... > > > > I have never used Sun's html parser (If I had, I might not > have > > > > started > > > > this project). > > > > I will need to study Sun's parser before I can answer your > > > > question.. > > > > But there does seem to be some interesting possibilities. > > > > > > > > Regards > > > > Somik > > > > ----- Original Message ----- > > > > From: "Craig Raw" <cr...@qu...> > > > > To: <htm...@li...> > > > > Sent: Monday, April 01, 2002 10:20 PM > > > > Subject: [Htmlparser-user] Swing integration > > > > > > > > > > > > > Has the HTML Parser been integrated into Swing's HTMLEditorKit > to > > > > > provide a better implementation of JEditorPane's HTML viewing > > > > > capabilities? HTML Parser would need to replace > > > > > javax.swing.text.html.parser.Parser, which is currently somewhat > > > > buggy. > > > > > Anyone tried this? > > > > > > > > > > -craig > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > Htmlparser-user mailing list > > > > > Htm...@li... > > > > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user > > > > > > > > > > > > _________________________________________________________ > > > > Do You Yahoo!? > > > > Get your free @yahoo.com address at http://mail.yahoo.com > > > > > > > > > > > > _______________________________________________ > > > > Htmlparser-user mailing list > > > > Htm...@li... 
> > > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user > > > > > > > > > _________________________________________________________ > > > Do You Yahoo!? > > > Get your free @yahoo.com address at http://mail.yahoo.com > > > > > > > > > _______________________________________________ > > > Htmlparser-user mailing list > > > Htm...@li... > > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user > > > > > > _______________________________________________________________ > > > > Have big pipes? SourceForge.net is looking for download mirrors. We > supply > > the hardware. You get the recognition. Email Us: > ban...@so... > > _______________________________________________ > > Htmlparser-user mailing list > > Htm...@li... > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user > > > _______________________________________________________________ > > Have big pipes? SourceForge.net is looking for download mirrors. We supply > the hardware. You get the recognition. Email Us: ban...@so... > _______________________________________________ > Htmlparser-developer mailing list > Htm...@li... > https://lists.sourceforge.net/lists/listinfo/htmlparser-developer |
From: Craig R. <cr...@qu...> - 2002-05-08 10:38:45
|
Thanks, Somik, I'm on the user and dev lists now, and its coming through fine. My SourceForge ID is 538740, username 'craigra'. For the handleSimpleTag, I'm thinking the only way to do this is to maintain an internal tag buffer and callback only once the entire document has been parsed and the end tags have been found. Its not ideal, but you have to be able to deal with <p> and <p> </p>. -----Original Message----- From: htm...@li... [mailto:htm...@li...] On Behalf Of Somik Raha Sent: 08 May 2002 12:13 PM To: htm...@li... Cc: Craig Raw Subject: Re: [Htmlparser-user] Swing integration Hi Craig, I actually replied to you on htmlparser-developer, your earlier mails went there. Are you on that list ? Am attaching the relevant mails to this mail - hope it goes thru. Regards Somik ----- Original Message ----- From: "Craig Raw" <cr...@qu...> To: <htm...@li...> Cc: <so...@ya...> Sent: Wednesday, May 08, 2002 6:49 PM Subject: [Htmlparser-user] Swing integration > Posted this earlier, seems to have got lost.... > ---- > > > Hi Somik, > > I'm looking into the HTMLParser-Swing integration again, and I have two > questions: > > 1. The HTMLEditorKit.ParserCallback takes a position with most of its > callback functions. Can this position be extracted from the HTMLTag's > elementBegin()? > > 2. There is a need to differentiate between a callback to > handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) and > handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) when > iterating through the HTMLTag elements Enumeration. How? > > You mentioned you have started an implementation - if you have a > framework going, I'd be happy to continue with the donkey work. I really > think this could make Swing's HTML rendering a lot more stable. > > Regards, > Craig > > > > > > -----Original Message----- > From: Somik Raha [mailto:so...@ya...] > Sent: 16 April 2002 04:57 AM > To: htm...@li... 
> Cc: Craig Raw > Subject: Re: [Htmlparser-user] Swing integration > > Hi Craig, Asgher > I finally had the time to check Swing integration. Boy - the parser > design in Swing sucks!! Theoretically its possible to do it - and I got > started, but just realized that in order to be compatible with swing > objects > that do compile time type checking with a particular tag, I have to > actually > have 73 if statements to give the right tag to the callback. > I have more important things to do at the moment, but probably will > get > back to this donkey work. *sigh* > > I am thinking we should make release 1.1 and then try this. Any > suggestions ? > > Regards, > Somik > ----- Original Message ----- > From: "Somik Raha" <so...@ya...> > To: <htm...@li...> > Sent: Thursday, April 04, 2002 11:20 AM > Subject: Re: [Htmlparser-user] Swing integration > > > > Hi Craig, > > Thanks a lot for the post. Pls go ahead with your analysis. I will > try > > to catch up this weekend. > > Regards, > > Somik > > ----- Original Message ----- > > From: "Craig Raw" <cr...@qu...> > > To: "'Somik Raha'" <so...@ya...> > > Sent: Tuesday, April 02, 2002 3:32 PM > > Subject: RE: [Htmlparser-user] Swing integration > > > > > > > Hi Somik, > > > > > > A quick excerpt from javax.swing.text.html.HTMLEditorKit javadoc - > which > > > is the driver behind JEditorPane's reading and writing HTML > > > capabilities. > > > > > > --- > > > Extendable/Scalable > > > > > > To maximize the usefulness of this kit, a great deal of effort has > gone > > > into making it extendable. These are some of the features. > > > The parser is replaceable. The default parser is the Hot Java parser > > > which is DTD based. A different DTD can be used, or an entirely > > > different parser can be used. To change the parser, reimplement the > > > getParser method. The default parser is dynamically loaded when > first > > > asked for, so the class files will never be loaded if an alternative > > > parser is used. 
The default parser is in a separate package called > > > parser below this package. > > > > > > The parser drives the ParserCallback, which is provided by > HTMLDocument. > > > To change the callback, subclass HTMLDocument and reimplement the > > > createDefaultDocument method to return document that produces a > > > different reader. The reader controls how the document is > structured. > > > Although the Document provides HTML support by default, there is > nothing > > > preventing support of non-HTML tags that result in alternative > element > > > structures. > > > --- > > > > > > I may find some time to look into this as well, although I am not > sure > > > how much it would fix JEditorPane's somewhat buggy HTML rendering > > > capabilities.... > > > > > > -craig > > > > > > > > > -----Original Message----- > > > From: htm...@li... > > > [mailto:htm...@li...] On Behalf Of > Somik > > > Raha > > > Sent: 01 April 2002 05:28 PM > > > To: HTMLParser User List > > > Cc: HTMLParser Developer List > > > Subject: Re: [Htmlparser-user] Swing integration > > > > > > Hi Craig > > > Wow! Thats a great question. > > > Actually, I doubt if I could replace Sun Microsystems' code with > > > mine. I > > > dont think Java is that open (or is it ?) > > > However, we could think of writing our own adapter for the html > parser > > > that > > > might plugin in some way... > > > I have never used Sun's html parser (If I had, I might not have > > > started > > > this project). > > > I will need to study Sun's parser before I can answer your > > > question.. > > > But there does seem to be some interesting possibilities. 
> > > > > > Regards > > > Somik > > > ----- Original Message ----- > > > From: "Craig Raw" <cr...@qu...> > > > To: <htm...@li...> > > > Sent: Monday, April 01, 2002 10:20 PM > > > Subject: [Htmlparser-user] Swing integration > > > > > > > > > > Has the HTML Parser been integrated into Swing's HTMLEditorKit to > > > > provide a better implementation of JEditorPane's HTML viewing > > > > capabilities? HTML Parser would need to replace > > > > javax.swing.text.html.parser.Parser, which is currently somewhat > > > buggy. > > > > Anyone tried this? > > > > > > > > -craig > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > Htmlparser-user mailing list > > > > Htm...@li... > > > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user > > > > > > > > > _________________________________________________________ > > > Do You Yahoo!? > > > Get your free @yahoo.com address at http://mail.yahoo.com > > > > > > > > > _______________________________________________ > > > Htmlparser-user mailing list > > > Htm...@li... > > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user > > > > > > _________________________________________________________ > > Do You Yahoo!? > > Get your free @yahoo.com address at http://mail.yahoo.com > > > > > > _______________________________________________ > > Htmlparser-user mailing list > > Htm...@li... > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user > > > _______________________________________________________________ > > Have big pipes? SourceForge.net is looking for download mirrors. We supply > the hardware. You get the recognition. Email Us: ban...@so... > _______________________________________________ > Htmlparser-user mailing list > Htm...@li... > https://lists.sourceforge.net/lists/listinfo/htmlparser-user |
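The tag-buffer idea discussed in this thread (hold tags back, then decide between handleSimpleTag and handleStartTag once the end tags are known) could be sketched roughly as follows. This is only an illustration under assumptions: the Tag class and the classify method are hypothetical stand-ins, not HTMLParser or Swing types, and a start tag is treated as "simple" whenever no later end tag of the same name closes it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TagBufferSketch {

    /** Hypothetical buffered token: a tag name plus whether it is an end tag. */
    static final class Tag {
        final String name;
        final boolean end;

        Tag(String name, boolean end) {
            this.name = name;
            this.end = end;
        }
    }

    /** Returns one label per buffered tag: "start", "end" or "simple". */
    static List<String> classify(List<Tag> buffered) {
        String[] labels = new String[buffered.size()];
        List<Integer> open = new ArrayList<Integer>(); // indexes of start tags not yet matched
        for (int i = 0; i < buffered.size(); i++) {
            Tag tag = buffered.get(i);
            if (!tag.end) {
                labels[i] = "simple"; // provisional until a matching end tag turns up
                open.add(Integer.valueOf(i));
                continue;
            }
            labels[i] = "end";
            // Look for the most recent unmatched start tag with the same name.
            for (int k = open.size() - 1; k >= 0; k--) {
                int startIndex = open.get(k).intValue();
                if (buffered.get(startIndex).name.equalsIgnoreCase(tag.name)) {
                    labels[startIndex] = "start";
                    // Tags opened after it were never closed, so they stay "simple".
                    open.subList(k, open.size()).clear();
                    break;
                }
            }
        }
        return new ArrayList<String>(Arrays.asList(labels));
    }

    public static void main(String[] args) {
        // <p> <br> <p> </p> : only the second <p> has an end tag.
        List<Tag> tags = new ArrayList<Tag>();
        tags.add(new Tag("p", false));
        tags.add(new Tag("br", false));
        tags.add(new Tag("p", false));
        tags.add(new Tag("p", true));
        System.out.println(classify(tags));
    }
}
```

The cost is the one noted in the message: nothing can be reported to the callback until the whole buffer has been seen, because a start tag's label may change when its end tag arrives.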
From: Somik R. <so...@ya...> - 2002-05-07 11:57:54
|
Hi Craig,

A brief description - you will probably want to use ParserTester. The code is quite dirty at the moment, but a basic idea is :

[1] HTMLParserAdapter adapts our HTMLParser into Parser (Swing's parser) - The donkey work comes here with all those if-then statements.
[2] HTMLParserProvider gives me the parser (makes the method public), so I can control it.
[3] TrialParser - the parser class that allows you to configure which parser you want. ParserTester uses this to create two different parsers, by using the c'tor params. Based on the params, at the point of invoking the parser (in MyParserDelegator), the decision is made as to which parser is to be used.
[4] MyParserCallBack - the same class is used for both parsers. For every call back method, an object of a certain type is created, which is collected in a vector, and is used later for comparison in the testcase. So, handleSimpleTag() will create a SimpleTagCallBack object. If this method is correctly called by our parser, then the two objects ought to match. (The equals method accomplishes this).
[5] testTypes package contains the various types like SimpleTagCallBack, which aid us in testing these call back objects returned by the two parsers.
[6] ParserTester - the main testing mechanism - where you get to create the two parsers, choose what html they have to parse, and then compare their respective callback objects. This one's a nightmare - because the swing parser puts in tags that weren't there.

You can ignore the other classes safely (I ought to delete them). If you have any doubts, pls let me know.

Regards,
Somik |
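The record-and-compare approach in [4] to [6] can be illustrated with a small sketch against Swing's own API. It is not the SwingParser module's code: CallbackRecorderSketch, RecordingCallback and recordWithSwing are hypothetical names, and each callback is logged as a plain string where the real test types such as SimpleTagCallBack would be separate classes. Only HTMLEditorKit.ParserCallback, HTML.Tag and ParserDelegator are the real Swing classes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class CallbackRecorderSketch {

    /** Hypothetical stand-in for MyParserCallBack: logs every callback it receives. */
    static class RecordingCallback extends HTMLEditorKit.ParserCallback {
        final List<String> events = new ArrayList<String>();

        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            events.add("start:" + t);
        }

        public void handleEndTag(HTML.Tag t, int pos) {
            events.add("end:" + t);
        }

        public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            events.add("simple:" + t);
        }

        public void handleText(char[] data, int pos) {
            events.add("text:" + new String(data));
        }
    }

    /** Runs Swing's stock parser over the HTML and returns the recorded callbacks. */
    static List<String> recordWithSwing(String html) throws IOException {
        RecordingCallback callback = new RecordingCallback();
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return callback.events;
    }

    public static void main(String[] args) throws IOException {
        // The reference recording; expect extra implied tags that were not in the input.
        List<String> reference = recordWithSwing("<p>Hello<br>world</p>");
        System.out.println(reference);
    }
}
```

An adapter around another parser would feed a second RecordingCallback, and the test would then reduce to comparing the two event lists with equals(). The tags that Swing implies on its own show up in the reference list, which is the difficulty mentioned in [6].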
From: Somik R. <so...@ya...> - 2002-05-07 10:34:24
|
Hi Craig,

You can get the latest code of the SwingParser from CVS. The module name is SwingParser.

cvs -z3 -d:ext:dev...@cv...:/cvsroot/htmlparser co SwingParser

By the way, if you give me your developer ID, I can add you to the developer list. Then you can check in your work directly.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2002-05-07 10:06:12
|
Hi Craig,

> 1. The HTMLEditorKit.ParserCallback takes a position with most of its
> callback functions. Can this position be extracted from the HTMLTag's
> elementBegin()?

Yes, that's exactly what I'm doing at the moment.

> 2. There is a need to differentiate between a callback to
> handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) and
> handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) when
> iterating through the HTMLTag elements Enumeration. How?

Simple tags are those that don't come in XML-like pairs. For example, <BR> and <META> will be simple tags, while <title> would be a start tag, as it has to have an end tag. Sadly, we will need to handle every different case manually.

> You mentioned you have started an implementation - if you have a
> framework going, I'd be happy to continue with the donkey work. I really
> think this could make Swing's HTML rendering a lot more stable.

OK, I can put out the code, maybe as a new module. I will let you know as soon as it's done.

Regards,
Somik |
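The distinction between simple tags and start tags discussed in this message could be expressed along these lines. It is a rough sketch under stated assumptions, not what the SwingParser module does: TagDispatcher is a hypothetical name, the set of "simple" tag names is hand-picked for illustration, and the beginPos argument stands in for the value the thread takes from HTMLTag's elementBegin(). The lookup uses Swing's HTML.getTag, which maps a lowercase tag name to its HTML.Tag constant, with HTML.UnknownTag as the fallback.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

class TagDispatcher {
    // Assumption: a hand-picked list of tags that never take an end tag.
    private static final Set<String> SIMPLE = new HashSet<>(Arrays.asList(
            "br", "meta", "img", "hr", "input", "link", "base", "area", "param", "isindex"));

    /** Maps a tag name to Swing's HTML.Tag constant, or an UnknownTag if Swing has none. */
    static HTML.Tag toSwingTag(String name) {
        String key = name.toLowerCase(Locale.ROOT);
        HTML.Tag tag = HTML.getTag(key);
        return tag != null ? tag : new HTML.UnknownTag(key);
    }

    /** True if the tag should be reported through handleSimpleTag. */
    static boolean isSimple(String name) {
        return SIMPLE.contains(name.toLowerCase(Locale.ROOT));
    }

    /** Forwards one opening tag to the matching ParserCallback method. */
    static void dispatch(String name, int beginPos, HTMLEditorKit.ParserCallback cb) {
        HTML.Tag tag = toSwingTag(name);
        MutableAttributeSet attrs = new SimpleAttributeSet(); // attributes omitted in this sketch
        if (isSimple(name)) {
            cb.handleSimpleTag(tag, attrs, beginPos);
        } else {
            cb.handleStartTag(tag, attrs, beginPos);
        }
    }

    public static void main(String[] args) {
        HTMLEditorKit.ParserCallback printer = new HTMLEditorKit.ParserCallback() {
            @Override public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                System.out.println("simple " + t + " at " + pos);
            }
            @Override public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                System.out.println("start " + t + " at " + pos);
            }
        };
        dispatch("BR", 12, printer);
        dispatch("title", 30, printer);
    }
}
```

A name-based lookup like this is one possible way to avoid a long chain of if statements, but whether it covers every case the Swing objects expect would need to be checked against the real parser's output.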
From: Craig R. <cr...@qu...> - 2002-05-07 09:54:23
|
Hi Somik,

I'm looking into the HTMLParser-Swing integration again, and I have two questions:

1. The HTMLEditorKit.ParserCallback takes a position with most of its callback functions. Can this position be extracted from the HTMLTag's elementBegin()?

2. There is a need to differentiate between a callback to handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) and handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) when iterating through the HTMLTag elements Enumeration. How?

You mentioned you have started an implementation - if you have a framework going, I'd be happy to continue with the donkey work. I really think this could make Swing's HTML rendering a lot more stable.

Regards,
Craig

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: 16 April 2002 04:57 AM
To: htm...@li...
Cc: Craig Raw
Subject: Re: [Htmlparser-user] Swing integration

Hi Craig, Asgher

I finally had the time to check Swing integration. Boy - the parser design in Swing sucks!! Theoretically it's possible to do it - and I got started, but just realized that in order to be compatible with Swing objects that do compile-time type checking with a particular tag, I have to actually have 73 if statements to give the right tag to the callback.

I have more important things to do at the moment, but probably will get back to this donkey work. *sigh*

I am thinking we should make release 1.1 and then try this. Any suggestions?

Regards,
Somik

----- Original Message -----
From: "Somik Raha" <so...@ya...>
To: <htm...@li...>
Sent: Thursday, April 04, 2002 11:20 AM
Subject: Re: [Htmlparser-user] Swing integration

> Hi Craig,
> Thanks a lot for the post. Please go ahead with your analysis. I will try
> to catch up this weekend.
> Regards,
> Somik
> ----- Original Message -----
> From: "Craig Raw" <cr...@qu...>
> To: "'Somik Raha'" <so...@ya...>
> Sent: Tuesday, April 02, 2002 3:32 PM
> Subject: RE: [Htmlparser-user] Swing integration
>
> > Hi Somik,
> >
> > A quick excerpt from javax.swing.text.html.HTMLEditorKit javadoc - which
> > is the driver behind JEditorPane's reading and writing HTML
> > capabilities.
> >
> > ---
> > Extendable/Scalable
> >
> > To maximize the usefulness of this kit, a great deal of effort has gone
> > into making it extendable. These are some of the features.
> > The parser is replaceable. The default parser is the Hot Java parser
> > which is DTD based. A different DTD can be used, or an entirely
> > different parser can be used. To change the parser, reimplement the
> > getParser method. The default parser is dynamically loaded when first
> > asked for, so the class files will never be loaded if an alternative
> > parser is used. The default parser is in a separate package called
> > parser below this package.
> >
> > The parser drives the ParserCallback, which is provided by HTMLDocument.
> > To change the callback, subclass HTMLDocument and reimplement the
> > createDefaultDocument method to return document that produces a
> > different reader. The reader controls how the document is structured.
> > Although the Document provides HTML support by default, there is nothing
> > preventing support of non-HTML tags that result in alternative element
> > structures.
> > ---
> >
> > I may find some time to look into this as well, although I am not sure
> > how much it would fix JEditorPane's somewhat buggy HTML rendering
> > capabilities....
> >
> > -craig
> >
> > -----Original Message-----
> > From: htm...@li...
> > [mailto:htm...@li...] On Behalf Of Somik Raha
> > Sent: 01 April 2002 05:28 PM
> > To: HTMLParser User List
> > Cc: HTMLParser Developer List
> > Subject: Re: [Htmlparser-user] Swing integration
> >
> > Hi Craig
> > Wow! That's a great question.
> > Actually, I doubt if I could replace Sun Microsystems' code with mine. I
> > don't think Java is that open (or is it?)
> > However, we could think of writing our own adapter for the html parser
> > that might plug in in some way...
> > I have never used Sun's html parser (If I had, I might not have started
> > this project).
> > I will need to study Sun's parser before I can answer your question..
> > But there do seem to be some interesting possibilities.
> >
> > Regards
> > Somik
> > ----- Original Message -----
> > From: "Craig Raw" <cr...@qu...>
> > To: <htm...@li...>
> > Sent: Monday, April 01, 2002 10:20 PM
> > Subject: [Htmlparser-user] Swing integration
> >
> > > Has the HTML Parser been integrated into Swing's HTMLEditorKit to
> > > provide a better implementation of JEditorPane's HTML viewing
> > > capabilities? HTML Parser would need to replace
> > > javax.swing.text.html.parser.Parser, which is currently somewhat
> > > buggy. Anyone tried this?
> > >
> > > -craig
|
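The HTMLEditorKit javadoc quoted in this thread says the parser can be changed by reimplementing getParser. A minimal sketch of that hook is given below, assuming nothing about the actual HTMLParser adapter: PlainTextParser and PluggableEditorKit are hypothetical names, and the stand-in parser just reports its whole input as one text chunk so the wiring can be seen in isolation.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;

/** Hypothetical stand-in parser: reports the whole input as a single text chunk. */
class PlainTextParser extends HTMLEditorKit.Parser {
    @Override
    public void parse(Reader r, HTMLEditorKit.ParserCallback cb, boolean ignoreCharSet)
            throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = r.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        cb.handleText(sb.toString().toCharArray(), 0);
    }
}

/** Editor kit that hands out a caller-supplied parser instead of Swing's default. */
class PluggableEditorKit extends HTMLEditorKit {
    private final HTMLEditorKit.Parser parser;

    PluggableEditorKit(HTMLEditorKit.Parser parser) {
        this.parser = parser;
    }

    // getParser is protected in HTMLEditorKit; widening it to public also lets
    // test code fetch and drive the parser directly.
    @Override
    public HTMLEditorKit.Parser getParser() {
        return parser;
    }
}

class PluggableKitDemo {
    public static void main(String[] args) throws IOException {
        HTMLEditorKit.Parser parser = new PluggableEditorKit(new PlainTextParser()).getParser();
        parser.parse(new StringReader("<b>not parsed</b>"), new HTMLEditorKit.ParserCallback() {
            @Override public void handleText(char[] data, int pos) {
                System.out.println("text: " + new String(data));
            }
        }, true);
    }
}
```

In an application the kit would presumably be installed with JEditorPane.setEditorKit, with a real adapter in place of PlainTextParser; a document built from a parser that emits only text will of course have no HTML structure.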