htmlparser-developer Mailing List for HTML Parser (Page 33)
Brought to you by:
derrickoswald
From: Somik R. <so...@ya...> - 2002-01-03 20:04:50
Hi Folks,

A New Year's present: HTMLParser 1.0 is released. We've finally made the transition from the alpha to the beta stage. Modifications from here on will be maintenance only, and the API should remain constant.

There are huge changes in the architecture, and lots of bug fixes. Thanks a lot to Kaarle Kaila for some great support and ideas. Thanks also to Rodney Foley for some nice ideas for improvement. And thanks to everyone else who's been supporting this project.

Looking forward to your continuing support, and wishing you a very happy new year.

Cheers,
Somik
From: Somik R. <so...@ya...> - 2001-12-26 04:48:30
Merry Christmas to all (I do have it on my schedule :)

> I don't know if I have much to give here, but I would
> remind you to check that the binary file is OK and that its name is
> OK. I once suggested calling it HTMLParser.jar and not
> Parse.jar, as that is very close to the common XML parser filename (Parser.jar)
> and says nothing about what it contains.

Thanks for the tips - I am planning to do a thorough job this time. And I agree with you - changing the name is a good idea.

Regards,
Somik
From: Kaarle K. <kaa...@ik...> - 2001-12-25 23:28:53
At 16:27 25.12.2001 +0900, Somik Raha wrote:
>Hi Folks,

hi!

And Merry Christmas to those of you who have it in your schedule!

I don't know if I have much to give here, but I would remind you to check that the binary file is OK and that its name is OK. I once suggested calling it HTMLParser.jar and not Parse.jar, as that is very close to the common XML parser filename (Parser.jar) and says nothing about what it contains. In 0.98, Parse.jar also had files in the wrong classes.

regards
Kaarle

> The two bugs that I mentioned in my last mail are fixed. The Robot
> crawler now crawls thru Google very comfortably. The problem was the
> inclusion of placeholder images (which dont use any real image files).
> Also, a big bug in HTMLStyleScanner has been fixed - yahoo is getting
> parsed fine.
> And a big internal change - I have incorporated parseParameters()
> (written by Kaarle Kaaila) finally, and it works great with the Image and
> Link scanners. It has made the code of both the scanners much simpler to read.
> Thanks Kaarle!
>
> This is it for the release version. I need some help to make a decent
> release - I want to create proper docs this time. If you folks can pitch
> in, I'd be very grateful. Also, pls go thru the code, and see if u can
> find any glaring bugs or changes. CVS is updated - more testcases are
> added and all are passing.
>
>Cheers,
>Somik

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844
From: Somik R. <so...@ya...> - 2001-12-25 07:32:33
Hi Folks,

The two bugs that I mentioned in my last mail are fixed. The Robot crawler now crawls through Google very comfortably. The problem was the inclusion of placeholder images (which don't use any real image files). Also, a big bug in HTMLStyleScanner has been fixed - Yahoo is getting parsed fine.

And a big internal change - I have finally incorporated parseParameters() (written by Kaarle Kaila), and it works great with the Image and Link scanners. It has made the code of both scanners much simpler to read. Thanks Kaarle!

This is it for the release version. I need some help to make a decent release - I want to create proper docs this time. If you folks can pitch in, I'd be very grateful. Also, please go through the code and see if you can find any glaring bugs or needed changes. CVS is updated - more test cases have been added and all are passing.

Cheers,
Somik
From: Somik R. <so...@ya...> - 2001-12-24 09:35:04
Hi Folks,

I have finally fulfilled my promise - a major overhaul of the design is done, and all test cases are passing. I have put the latest code in CVS. I've tried to keep the interface consistent, so user applications won't break. The changes are mainly internal. However, one big change is that you need to call registerScanners() on the parser object.

No more confusing anonymous scanner registration. You can register a scanner by calling parser.addScanner(some scanner object), and also remove the same scanner. I was able to do all this within an hour (thanks to the test cases).

Bad news though - I discovered two bugs (which I verified existed earlier):
[1] When scanning yahoo.com, the parser goes into an infinite loop.
[2] In extractImageLocn(), there seems to be a problem in parsing dynamic links when constructing relative paths.
Also, extractImageLocn is badly in need of refactoring.

I think we can look forward to a release of HTMLParser 1.0 pretty soon with these two bugs fixed, and with parseParameters incorporated into the scanners' logic. Looking forward to your comments (bug findings) and help.

Cheers,
Somik
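The explicit add-and-remove registration described in this message can be pictured with a minimal, self-contained Java sketch. Everything in it is an assumption made for illustration: NamedScanner, RegistryParser, removeScanner and hasScannerFor are hypothetical names, and the real HTMLParser classes and signatures may well differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical scanner type for illustration; not the library's scanner class.
class NamedScanner {
    final String tagName;

    NamedScanner(String tagName) {
        this.tagName = tagName;
    }
}

// Toy parser that keeps explicitly registered scanners, keyed by tag name.
class RegistryParser {
    private final Map<String, NamedScanner> scanners = new LinkedHashMap<>();

    // Register a concrete scanner object, so the caller keeps a handle to it.
    void addScanner(NamedScanner scanner) {
        scanners.put(scanner.tagName, scanner);
    }

    // Remove exactly the object that was registered; false if it was not registered.
    boolean removeScanner(NamedScanner scanner) {
        return scanners.remove(scanner.tagName, scanner);
    }

    boolean hasScannerFor(String tagName) {
        return scanners.containsKey(tagName);
    }
}

class RegistryParserDemo {
    public static void main(String[] args) {
        RegistryParser parser = new RegistryParser();
        NamedScanner linkScanner = new NamedScanner("A");
        parser.addScanner(linkScanner);
        System.out.println(parser.hasScannerFor("A")); // prints "true"
        parser.removeScanner(linkScanner);
        System.out.println(parser.hasScannerFor("A")); // prints "false"
    }
}
```

Because the caller holds on to the scanner object, it can later remove that same object, which is what anonymous registration makes awkward.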
From: Somik R. <so...@ya...> - 2001-11-13 16:56:19
Hi folks,

I have modified the architecture to include the change I spoke of last time. Now, the parser throws an exception if no scanners have been registered. This feature can be turned off by setting a boolean flag, but by default it is set to true.

Also, a static method called registerScanners is now available in HTMLParser, which registers some of the common scanners.

Hopefully, this will alleviate much of the confusion caused by the scanner registration process.

Regards,
Somik
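To make the behaviour above concrete, here is a small self-contained Java sketch of a parser that refuses to run when no scanners are registered. It is only a model of the idea: TagScanner, GuardedParser, setFailOnMissingScanners and the exception type are assumptions, and the signature chosen for the static registerScanners helper is a guess, not the actual HTMLParser API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tag scanner; not the real HTMLParser class.
interface TagScanner {
    String tagName();
}

// Toy parser that models only the registration check described above.
class GuardedParser {
    private final List<TagScanner> scanners = new ArrayList<>();
    // Assumed name for the boolean flag; the check is on by default.
    private boolean failOnMissingScanners = true;

    void setFailOnMissingScanners(boolean value) {
        failOnMissingScanners = value;
    }

    void addScanner(TagScanner scanner) {
        scanners.add(scanner);
    }

    // Convenience method that registers a couple of common scanners in one call.
    static void registerScanners(GuardedParser parser) {
        parser.addScanner(() -> "A");
        parser.addScanner(() -> "IMG");
    }

    // Returns the number of scanners that would take part in parsing.
    int parse() {
        if (failOnMissingScanners && scanners.isEmpty()) {
            throw new IllegalStateException("No scanners registered");
        }
        return scanners.size();
    }
}

class GuardedParserDemo {
    public static void main(String[] args) {
        GuardedParser parser = new GuardedParser();
        try {
            parser.parse();
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
        GuardedParser.registerScanners(parser);
        System.out.println("Scanners active: " + parser.parse());
    }
}
```

Failing early like this turns a silent "nothing was parsed" result into an immediate, explainable error, while the flag leaves room for callers who really do want a parser without scanners.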
From: Kaarle K. <kaa...@ik...> - 2001-10-24 19:24:00
I looked at the different classes in HTMLParser to see how to use parseParameters in parsing the tags. I created another evaluate method in HTMLTagScanner that uses the parsed parameters, and it seemed to work OK, at least in some cases.

In HTMLTag you can see, in the scan method, where I thought the tag should be parsed. At the end of the code you can see the methods for retrieving values from it, showing how I thought it could be used. They would be usable after the scan method has been called.

Some of the problems I had during my tests were, I think, with e.g. JspTags. I don't know how similar that one is to the rest?

I have not put these changes into CVS, as the test cases gave some errors that I did not have time to check.

Should we make changes in this direction?

Kaarle

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844
From: Kaarle K. <kaa...@ik...> - 2001-10-24 18:50:16
hi!

I have made modifications to htmlParser in the following modules:

com\kizna\html\tags\HTMLLinkNode.java
com\kizna\html\tags\HTMLTag.java
com\kizna\html\scanners\HTMLLinkScanner.java
com\kizna\htmlTests\HTMLTagTest.java

I have modified the classes so that the getText() and parseParameters() methods in HTMLTag work even if LinkScanner is active. I added some test cases too.

I hope it went OK! It is now in CVS.

regards
Kaarle

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844
From: Kaarle K. <kaa...@ik...> - 2001-10-24 15:22:52
hi!

I have made modifications to htmlParser in the following modules:

com\kizna\html\tags\HTMLLinkNode.java
com\kizna\html\tags\HTMLTag.java
com\kizna\html\scanners\HTMLLinkScanner.java
com\kizna\htmlTests\HTMLTagTest.java

I have modified the classes so that the getText() and parseParameters() methods in HTMLTag work even if LinkScanner is active. I added some test cases too.

I hope it went OK! It is now in CVS.

regards
Kaarle

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844
From: Kaarle K. <kaa...@ik...> - 2001-10-13 18:17:18
Hi!

I have added the parseParameters method to parse the parameters of the tag. parseParameters does not work yet if any of the listener classes have been registered, but without any listeners you can extract the data you need from your HTML file.

This example is also in the class comments. It prints the HREF parameter of every A tag. I use it myself for some more specialized extractions. You can find this version of HTMLTag.java in the CVS repository.

regards
Kaarle Kaila

    // Needs java.io.FileReader, java.io.IOException, java.util.Enumeration
    // and java.util.Hashtable, plus the HTMLParser classes.
    // "path" is the name of the HTML file to read.
    try {
        HTMLReader in = new HTMLReader(new FileReader(path), 2048);
        HTMLParser p = new HTMLParser(in);
        Enumeration en = p.elements();
        while (en.hasMoreElements()) {
            Object node = en.nextElement();
            if (!(node instanceof HTMLTag)) {
                continue; // skip nodes that are not tags
            }
            HTMLTag tag = (HTMLTag) node;
            // Parameter names mapped to values; the tag name is stored under TAGNAME.
            Hashtable h = tag.parseParameters();
            String tmp = (String) h.get(tag.TAGNAME);
            if (tmp != null && tmp.equalsIgnoreCase("A")) {
                System.out.println("URL is :" + h.get("HREF"));
            }
        }
    } catch (IOException ie) {
        ie.printStackTrace();
    }

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
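For readers without the CVS sources at hand, the following self-contained Java sketch shows one way a method like parseParameters could turn a tag into a table of upper-cased parameter names and values. It is not the code from HTMLTag.java: the class name TagParameterSketch, the "$TAGNAME$" key value and the regular expression are all assumptions, and it deliberately ignores edge cases such as closing tags, comments, and ">" characters inside quoted values.

```java
import java.util.Hashtable;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagParameterSketch {
    // Hypothetical key under which the tag name itself is stored.
    static final String TAGNAME = "$TAGNAME$";

    // A name, optionally followed by '=' and a double-quoted,
    // single-quoted, or unquoted value.
    private static final Pattern ATTRIBUTE = Pattern.compile(
        "([A-Za-z][A-Za-z0-9_:-]*)(?:\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+)))?");

    static Hashtable<String, String> parseParameters(String tag) {
        Hashtable<String, String> table = new Hashtable<>();
        // Strip the surrounding angle brackets, if present.
        String body = tag.trim();
        if (body.startsWith("<")) body = body.substring(1);
        if (body.endsWith(">")) body = body.substring(0, body.length() - 1);

        Matcher m = ATTRIBUTE.matcher(body);
        boolean first = true;
        while (m.find()) {
            String name = m.group(1).toUpperCase(Locale.ROOT);
            if (first) {
                // The first name in the tag is the tag name, not a parameter.
                table.put(TAGNAME, name);
                first = false;
                continue;
            }
            String value = "";
            if (m.group(2) != null) value = m.group(2);
            else if (m.group(3) != null) value = m.group(3);
            else if (m.group(4) != null) value = m.group(4);
            table.put(name, value); // valueless parameters map to ""
        }
        return table;
    }

    public static void main(String[] args) {
        Hashtable<String, String> h =
            parseParameters("<a href=\"http://example.com/\" target=_blank>");
        System.out.println("URL is :" + h.get("HREF"));
    }
}
```

Upper-casing the names is what lets a lookup such as h.get("HREF") succeed regardless of how the page author capitalised the attribute.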