webharvest-develop Mailing List for Harvest Web Indexing
Brought to you by:
sxw
Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2000 | | | 2 | 2 | | | 10 | 1 | 2 | | | 2 |
2001 | 3 | | | | | 1 | 4 | 1 | | | | 1 |
2002 | 1 | | | | 1 | | | | 1 | | | 1 |
2003 | | | | | | 1 | 1 | | | | | |
2010 | | | | | | 1 | | | | | | |
2014 | | | | | | 1 | | | | | | |
From: D M. <mrd...@gm...> - 2014-06-30 22:34:57
|
If I only wanted to crawl sites with an .info domain extension, could I do that with Harvest-NG? |
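The question above amounts to filtering candidate URLs by the top-level domain of their host. The sketch below shows that check in plain Python; it is not Harvest-NG configuration, and the names `is_info_host` and `filter_info_urls` are made up for illustration. Whether Harvest-NG offers an equivalent host filter is not answered in this thread.

```python
from urllib.parse import urlsplit


def is_info_host(url: str) -> bool:
    """Return True when the URL's hostname ends in the .info top-level domain."""
    host = urlsplit(url).hostname  # lowercased by urlsplit; None if there is no host
    if host is None:
        return False
    # Strip a trailing dot so "example.info." is treated like "example.info".
    return host.rstrip(".").endswith(".info")


def filter_info_urls(urls):
    """Keep only the URLs whose host is under .info (hypothetical helper)."""
    return [u for u in urls if is_info_host(u)]
```

The check looks only at the hostname, so a path such as `/a.info` on a `.com` site is not mistaken for an .info site.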
From: martin <mar...@16...> - 2010-06-15 09:55:35
|
Hi everyone,

I'm not able to restart the application. I run it from the command line with "java -jar e:\webharvest_all_2.jar", using Java(TM) SE Runtime Environment (build 1.6.0_20-b02) and Webharvest 2.0.beta. It runs fine the first time, but after that it fails with this error:

Exception in thread "AWT-EventQueue-0" java.lang.NegativeArraySizeException
    at org.webharvest.gui.Settings.readString(Unknown Source)
    at org.webharvest.gui.Settings.readObject(Unknown Source)
    at org.webharvest.gui.Settings.readFromFile(Unknown Source)
    at org.webharvest.gui.Settings.<init>(Unknown Source)
    at org.webharvest.gui.Ide.<init>(Unknown Source)
    at CommandLine$1.run(Unknown Source)
    at java.awt.event.InvocationEvent.dispatch(InvocationEvent.java:209)
    at java.awt.EventQueue.dispatchEvent(EventQueue.java:597)
    at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269)
    at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:174)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:169)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:161)
    at java.awt.EventDispatchThread.run(EventDispatchThread.java:122)

I tried version 1.0 and it can't start either. If I start it through CommandLine.java, it works. Why doesn't the GUI work?

martin |
From: <he...@ko...> - 2003-07-27 09:21:11
|
[Base64-encoded spam advertisement; body omitted.] |
From: <do...@tr...> - 2003-06-11 18:44:45
|
[Base64-encoded spam advertisement; body omitted.] |
From: <pr...@tx...> - 2002-12-30 21:39:15
|
[Base64-encoded spam advertisement; body omitted.] |
From: <pre...@ha...> - 2002-09-29 17:54:49
|
[Base64-encoded spam advertisement; body omitted.] |
From: <pre...@lo...> - 2002-05-15 16:16:50
|
[Base64-encoded spam advertisement; body omitted.] |
From: <gol...@wo...> - 2002-01-21 10:45:22
|
[Base64-encoded spam advertisement; body omitted.] |
From: Xavi D. F. <xd...@is...> - 2001-12-14 16:39:35
|
Hi.

Is Harvest-NG still hosted on SourceForge.net? I haven't seen much activity, though I'm not very used to sf.net.

We've used Harvest-NG at iSOCO to implement part of a solution for a customer, and would like to contribute our customizations so that they may be added to the project. We worked from the harvest-ng-1.0.2 snapshot, but we can download the CVS version and try to apply our changes there if you want. Then we could submit them as patches through sf.net. Would that do?

Here is the list of changes, so that you can see whether they are interesting at all:

- MaxRunTime. Both a command line option for reaper and a configuration file directive:
    --maxruntime <t>
  Exit orderly after a time t. t is the concatenation of an integer and an optional character that can be "s" for seconds (default), "m" for minutes, "h" for hours or "d" for days. For example, --maxruntime 7d means run no more than one week. It will exit before that if all work is done.
- Removed a warning in ClientServer.pm and Object.pm.
- Added some debugging keys and messages (possibly too verbose now?).
- Added a database module that does the same as DB.pm but stores each document's full text and URLs in an Oracle table. It also does some checking to associate a site id with each page. It may be too dependent on the database schema, and so of little use for the general public.
- Simplistic extraction of URL candidates from JavaScript files or JavaScript code in HTML pages. It simply looks for string literals that may be URLs and tries a couple of base URLs if they're relative. This produces quite a lot of wrong URLs that are tried and give a 404 error, so it isn't very efficient, and it shouldn't be used without the consent of the visited sites (it will clutter their logs with errors and they may think they have broken links or misbehaving JavaScript when that isn't the case). And of course there are many ways in which a site may be navigable for a JavaScript-enabled user and not for Harvest-NG, even with this hack.
- Added a time and process id stamp to debug messages. In fact we've done it the easy but ugly way. The Debug::ok function, which should simply return whether the message is to be printed, now prints a timestamp every time it returns true. This side effect will cause some misplaced timestamps in the log once in a while, but allows us to know when things were happening without major code changes.
- A dump-like utility that outputs all pages in the Harvest database to a Comma Separated Values file suitable for sqlldr (an Oracle import utility). This is also of little general interest because it is dependent on the particular schema.
- Some small scripts to analyze the debug log:
    - extract fetched and stored URLs
    - count pages fetched (tried) and stored (got) at given time intervals for each site or file extension (.cgi, .asp, .html, etc.)
    - generate a file for each domain for gnuplot to plot the number of pages fetched or stored over time (little tested).

In fact I'm now realizing some more work would be needed to clean it up a little and, for instance, make the JavaScript URL extraction optional. Currently you can't turn it off. Also, everything has been done in a hurry and could always be done more cleanly, but I don't know how much time I'm going to be allowed to spend on this.

Another thing I'd like to ask is why there are no copyright notices in the file headers, and no reference to the license in the source itself. I guess I should add iSOCO copyright notices for the changes and a reference to the GPL, but ideally I should only add them to the existing copyright notice from the original authors. But that existing copyright notice isn't there.

--
Xavi Drudis Ferran
xd...@is... |
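The --maxruntime value format described in this message (an integer followed by an optional s, m, h or d, with seconds as the default) can be parsed in a few lines. The following is a Python sketch of that parsing rule only; it is not the iSOCO patch, which is not shown in the thread, and the name `parse_max_runtime` is hypothetical.

```python
import re

# Seconds per unit character; "s" is the default when no unit is given.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_max_runtime(value: str) -> int:
    """Convert a string such as "90", "15m" or "7d" into a number of seconds."""
    match = re.fullmatch(r"([0-9]+)([smhd]?)", value.strip())
    if match is None:
        raise ValueError(f"invalid runtime: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_SECONDS[unit or "s"]
```

A crawler loop could compare a monotonic clock against a deadline computed from this value and exit in an orderly way once it has passed, or earlier if all work is done.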
From: Adams, C. M ERDC-ITL-N. <Cha...@er...> - 2001-08-10 13:41:09
|
Howdie all!

I am trying to add the size of the downloaded URL to the Harvest database. There are a couple of ways to do this: I can get the bytes field from the HTTP header info, or check the file size before it is deleted. I can get this information easily enough; what I can't figure out is how to add a field to the database to contain it. I think I would want this in the metadata object, and I have added some code I thought should work, but I can't get any output when I try to dump it back out (I have modified the dump program accordingly).

Is there an easy way to do this?

Cheers!
Chad Adams

Harvest/Summarise/Reaper/HTTP.pm, added this after line 25:

    $obj->manage->lmt($obj->headers->get('Last-Modified'));
    # Ask ls for the size column (field 5) of the downloaded file.
    open(FILESIZE, "ls -l $file | nawk '{print \$5;}' |");
    @filesize = <FILESIZE>;
    close FILESIZE;
    # Use element access ($filesize[0]) rather than a one-element slice.
    chomp($filesize[0]);
    $obj->metadata->set("filesize", $filesize[0]); |
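The two ways of obtaining the size mentioned in this message (the byte count from the HTTP headers, or the size of the file on disk before it is deleted) can be combined, with the header preferred and the file as a fallback. The sketch below is Python rather than the Perl used in Harvest-NG; `document_size` and its arguments are hypothetical and do not correspond to any Harvest-NG API. It does not address the open question of how to store the value in the metadata object.

```python
import os


def document_size(headers: dict, path: str) -> int:
    """Return the size in bytes, preferring Content-Length over the file on disk."""
    raw = headers.get("Content-Length")
    try:
        size = int(raw)
    except (TypeError, ValueError):
        size = -1  # header missing or not a number
    if size >= 0:
        return size
    # Fall back to the downloaded file; it must still exist at this point.
    return os.path.getsize(path)
```

Reading the size directly from the file system also avoids spawning `ls` and `nawk` for every document.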
From: Pradeep K. <p_...@ya...> - 2001-07-04 05:21:50
|
Hi Simon,

Being new to Harvest NG, I'm not sure whether I have the right version of the utility. The configuration I have here is as follows:

RedHat Linux Ver: 7
Perl: 5.6.1
Harvest: 1.6.1

As I had mentioned, I'm trying to run gatherer on a site, e.g. "www.qut.edu.au", and want to extract any content on the site bearing the word "queensland". I'm not sure if this is even possible. As of now, all that I have been able to retrieve from "www.qut.edu.au" is everything on that site. I have not been able to establish a filter in order to get only the relevant information. So the perl script that is running against PRODUCTION.gdbm is parsing everything it retrieved.

Since I haven't been able to configure gatherer to get more specific information, I'm presently doing the matching in my perl script that parses PRODUCTION.gdbm. That is, I read one record of the GDBM file and look for the word "queensland" in the content of that record. If it finds one, I write the record into the database after making suitable modifications. The match-finding process in perl is hardly generic, and I'll need to change the script quite often to get a match as close to the desired one as possible. This is definitely not the right way.

So I would greatly appreciate clarification on these points:

1. Do I have everything needed in the Harvest NG version that I have downloaded? I don't seem to find the Controller directory under my harvest installation directory.
2. In example-5 that comes with the harvest v-1.6.1 download, filters are being used. The example says: "'regex' can be a pattern for a domainname, or IP addresses." Can we specify an "Allow string" in place of the domain name or IP address in order to enable a search on the string alone?

I'm thankful for the prompt responses I have received and would greatly appreciate any thoughts on the queries I have listed.

Regards,
Pradeep.

--- Simon Wilkinson <sx...@dc...> wrote:
> On Tuesday 03 July 2001 12:57, you wrote:
> > I would greatly appreciate if i could get some info.
> > on how to configure Gatherer to search on certain
> > keywords on the specified sites.
>
> Use the ContentRegex SOIF filter. This allows you to apply a regular
> expression, and keep / drop objects based on their content. ContentRegex is
> currently only available in the CVS tree, not in the 1.0.2 release.
>
> Cheers,
>
> Simon. |
From: Simon W. <sx...@dc...> - 2001-07-03 15:28:33
|
On Tuesday 03 July 2001 12:57, you wrote:
> I would greatly appreciate if i could get some info.
> on how to configure Gatherer to search on certain
> keywords on the specified sites.

Use the ContentRegex SOIF filter. This allows you to apply a regular expression, and keep / drop objects based on their content. ContentRegex is currently only available in the CVS tree, not in the 1.0.2 release.

Cheers,

Simon.

--
Simon Wilkinson <si...@sx...> http://www.sxw.org.uk
Buying an operating system without source is like buying
a self-assembly Space Shuttle with no instructions. |
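The idea behind a content filter of this kind, applying a regular expression to each object's content and keeping or dropping the object depending on whether it matches, can be sketched as follows. This is plain Python for illustration only; it is not the ContentRegex filter, whose configuration syntax is not shown in this thread, and `content_filter` and the `(url, text)` record shape are assumptions.

```python
import re


def content_filter(records, pattern, keep=True):
    """Keep (keep=True) or drop (keep=False) records whose text matches pattern.

    records is an iterable of (url, text) pairs; matching is case-insensitive.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    result = []
    for url, text in records:
        matched = regex.search(text) is not None
        if matched == keep:
            result.append((url, text))
    return result
```

For the "queensland" case discussed in the thread, a pattern such as `\bqueensland\b` would keep only the pages that mention the word, so the later database-loading step would no longer need its own matching logic.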
From: Pradeep K. <p_...@ya...> - 2001-07-03 11:57:26
|
Hi,

Thanks, Stefan, for your response.

We are stuck with a more basic problem after we got through the process of installing Harvest NG. We have been able to run the examples. The result was a set of files as documented. PRODUCTION.GDBM looks to be the most important file, so we parsed it using a perl script and dumped the data into a table on MySQL.

We now need to refine our search to ensure that only information on certain keywords is retrieved. Frankly, I have not done an extensive study of the docs, but I do need to get the refined search done in order to get the project to the next stage.

I would greatly appreciate some info on how to configure Gatherer to search on certain keywords on the specified sites.

Thanks in advance,
Pradeep.

--- Stefan Kokkelink <sko...@ma...> wrote:
> Hi Pradeep,
>
> perhaps you should consider writing a MySQL
> implementation of the Harvest::Database::Generic
> module.
>
> what you would have to think of:
> -- what data do you want to store?
> -- some pre- and postfilter depend on the Metadata::SOIF
>    module, do you want to use them?
>
> regards,
> stefan |
From: Stefan K. <sko...@ma...> - 2001-07-02 14:29:43
|
Hi Pradeep,

perhaps you should consider writing a MySQL implementation of the Harvest::Database::Generic module.

What you would have to think about:
-- what data do you want to store?
-- some pre- and postfilters depend on the Metadata::SOIF module; do you want to use them?

regards,
stefan

Pradeep Kumar wrote:
> Hi,
>
> We are designing a system which uses Harvest NG to
> do a search on the web to get information on key
> words.
> We,at this stage of awareness,plan to read the GDBM
> files and write the data out to MySQL using a perl
> script.
> A few Questions we have:
> 1. Is it essential to keep the process of getting
> info. from Net separate from the process of writing
> data to MySQL? Or is it possible to integrate the
> whole process by plugging another application to
> Harvest NG ? Or even still,does Harvest NG provide for
> a way to do the write to the Database?
>
> 2. Since the output of the search are going to be
> pretty big files,typically,how should we organise for
> the conversion to DataBase part of the application in
> order to ensure there are no file locking issues
> between Harvest NG writing in to the GDBM files and
> our Perl script reading from it?
>
> 3. Is there a cleaner design than the one we can come
> up with? The design being,keep the Harvest NG process
> separate from the perl script which reads the GDBM
> files and writes it to the Databse.
>
> Any help is greatly appreciated.
>
> Pradeep. |
From: Pradeep K. <p_...@ya...> - 2001-06-30 12:40:52
|
Hi,

We are designing a system which uses Harvest NG to search the web for information on key words. At this stage of awareness, we plan to read the GDBM files and write the data out to MySQL using a perl script.

A few questions we have:

1. Is it essential to keep the process of getting info from the Net separate from the process of writing data to MySQL? Or is it possible to integrate the whole process by plugging another application into Harvest NG? Or does Harvest NG even provide a way to do the write to the database?
2. Since the output of the search is typically going to be pretty big files, how should we organise the conversion-to-database part of the application to ensure there are no file locking issues between Harvest NG writing to the GDBM files and our perl script reading from them?
3. Is there a cleaner design than the one we have come up with? The design being: keep the Harvest NG process separate from the perl script which reads the GDBM files and writes them to the database.

Any help is greatly appreciated.

Pradeep. |
From: Simon W. <sx...@dc...> - 2001-01-04 00:27:59
|
> On Tue, Jan 02, 2001 at 10:26:57AM +0100, Stefan Kokkelink wrote:
> > Good question. Since I intend to provide an RDF storage
> > module for Harvest-NG, I would like to know if there are
> > any plans for Harvest-NG in the near future.
>
> Same over here. I had to do some things which might be useful for others,
> but since I had to tweak a couple of lines in Harvest::Controller.pm,
> I'd like to discuss things here to find out whether that is the Right Way
> (TM) -- and contribute or solve my problem in another way.

Folks - I've been kinda busy with other things over the last few months. Harvest-NG has scratched my itch (so to speak), and I've got no pressing plans to make major changes to it. There doesn't seem to be that much pressure for those changes to be made - it seems to work in most situations I, and others, have deployed it in.

That said, I'm delighted to accept code and suggestions for improvements and modifications. I'm wading through a backlog of email at the moment - I'll hopefully get to your suggestions shortly.

So, please do use Harvest-NG as a platform - I'm committed to continuing to support it and coordinating its continued development.

Cheers,

Simon. |
From: <to...@ma...> - 2001-01-02 10:12:36
|
On Tue, Jan 02, 2001 at 10:26:57AM +0100, Stefan Kokkelink wrote:
> to...@ma... wrote:
> >
> > Hi,
> >
> > sorry for the noise.

[...]

> Good question. Since I intend to provide an RDF storage
> module for Harvest-NG, I would like to know if there are
> any plans for Harvest-NG in the near future.

Same over here. I had to do some things which might be useful for others, but since I had to tweak a couple of lines in Harvest::Controller.pm, I'd like to discuss things here to find out whether that is the Right Way (TM) -- and contribute or solve my problem in another way.

A happy new year
-- tomas |
From: Stefan K. <sko...@ma...> - 2001-01-02 09:28:05
|
to...@ma... wrote:
>
> Hi,
>
> sorry for the noise. I'm planning to use Harvest-NG in a project
> (I'm actually hacking happily at it, as I posted a couple of days
> ago), and I'd like to contribute my changes. Is this list the
> right place for it? Is it active?

Good question. Since I intend to provide an RDF storage module for Harvest-NG, I would like to know if there are any plans for Harvest-NG in the near future.

Simon?

All the best,
Stefan |
From: <to...@ma...> - 2000-12-30 18:03:28
|
Hi, sorry for the noise. I'm planning to use Harvest-NG in a project (I'm actually hacking happily at it, as I posted a couple of days ago), and I'd like to contribute my changes. Is this list the right place for it? Is it active? Regards, and a good start of 2001 -- tomas |
From: <to...@ma...> - 2000-12-27 15:29:16
|
Hi, I am new to this list. I'm using Harvest NG for one of my customers. One of the requirements was to exclude some URLs from storage (but to follow the links included). For this I had to develop a module and make a few changes to Harvest/Controller.pm. I'd like to discuss my changes and make them available. How? Greetings -- tomas |
From: Simon W. <sx...@dc...> - 2000-09-07 23:50:48
|
> I'm trying to crawl a site that puts a session ID in its URLs using
> harvest-ng. These session IDs are, however, removable and the URLs are
> still ok. What I would really like to be able to do is modify a URL
> before visiting it. I haven't been able to work out from the
> documentation how this can be done in a config file.

I'm doing this for sites which use minivend (which has a removable session ID). However, the code in the CVS repository has a slight fault (ooops), which I've just checked in the fix for (you want version 1.2 of URLFix.pm).

Once you've got that, either from the repository directly or via a nightly snapshot, the magic that you need in the configuration file is:

<Postfilters>
...
URLFix /(.*)\?.*;[0-9]*;.*/$1/
...
</Postfilters>

The argument to URLFix is a perl regular expression. In this case it removes stuff of the form ?wibble;100;wobble

Hope that helps! There should be a new version of harvest-ng appearing which will include this and other additional functionality in about a month, once I get back from tour.

Cheers,

Simon |
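To see what the substitution given to URLFix does to a URL, the same pattern can be tried outside Harvest-NG. The sketch below reproduces the regular expression from the message in Python; `strip_session` and the example URLs are made up for illustration, and how URLFix applies the expression internally is not shown in this thread.

```python
import re

# Same pattern as in the URLFix line: everything before "?" is captured,
# and a "?name;digits;name" session suffix is discarded.
SESSION_SUFFIX = re.compile(r"(.*)\?.*;[0-9]*;.*")


def strip_session(url: str) -> str:
    """Apply the substitution once, like Perl's s/// without the /g flag."""
    return SESSION_SUFFIX.sub(r"\1", url, count=1)
```

URLs without a suffix of that shape do not match the pattern and pass through unchanged, so the rule only touches URLs that carry the session ID.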
From: Matthew B. <ma...@ar...> - 2000-09-07 18:22:34
|
I'm trying to crawl a site that puts a session ID in its URLs using harvest-ng. These session IDs are, however, removable and the URLs are still ok. What I would really like to be able to do is modify a URL before visiting it. I haven't been able to work out from the documentation how this can be done in a config file. I'm quite prepared to hack perl if that's necessary, but if somebody could point me to the specific bit where URLs are grabbed from the HTML that would simplify my life quite a bit :) Thanks, Matthew Booth |
From: Jason D. P. <jdp...@jd...> - 2000-08-21 08:07:13
|
Jason D. Piercy President & CEO JDP GROUP Inc. Website: www.jdp-associates.com Email: jdp...@jd... Unified Messaging (Telephone/Fax/Mobile) 416.652.6028 ext. 22 |
From: Simon W. <sx...@dc...> - 2000-07-11 15:00:08
|
On Tue, 11 Jul 2000, Chris Brown wrote:
> Any other ideas?

I used yaz (another great tool from indexdata, and totally free this time) to talk to zebra, and a perl script around it to display the results.

I've attached our current live perl script below - let me know if this works for you - you may need to make some tweaks to your configuration in order to get exactly the data you want returned by Zebra. As I mentioned earlier, time permitting I would eventually like to include this in the next version of Harvest-NG (in a much tidied state, of course).

A quick disclaimer here - I know next to nothing about Z39.50 - I just blundered through by trial and error to get something that works. If there are any glaring uglies that can be noticed by those more experienced, let me know!

You may well be more interested in the modules embedded within the script than with the "printing" glue. The Harvest::Query::Yaz object is an attempt at creating a perl interface to Yaz to encapsulate the state of a given query.

Cheers,

Simon. |
From: Chris B. <cp...@fe...> - 2000-07-11 14:21:36
|
Simon Wilkinson wrote:
>
> On Tue, 11 Jul 2000, Chris Brown wrote:
> > I've been toying with Harvest NG and wondered what indexing tools people
> > are using. I don't have Harvest installed. Before I go ahead I'd like to
> > know what alternatives people are using.
>
> I'm currently using Zebra for the noncommercial stuff. Zebra is, IMHO, one
> of the best freetext searching engines currently available - however it's only
> free for non-commercial use. If you're developing a site which can use Zebra
> (available from www.indexdata.dk) I'd strongly recommend it. I'm going to
> bundle the set of tools that I use to interface Harvest-NG to it in the next
> release.
>
> If you're looking for something for commercial use, or to use something that
> is truly free (speech, not beer), I'm in the process of evaluating a number
> of alternatives for a client at present. I'll let the list know how I get on.

Thanks for this, Simon.

I've installed and tested zebra. I have also managed to index the SOIF files created when I tested my Harvest NG installation. I am now looking for a method to build a web interface, preferably using Perl. Are there any resources/tools available? I could use the Socket module and talk raw Z39.50 with the server.

Any other ideas?

Cheers

--
Chris Brown
Institute of Physics Publishing http://www.iop.org/
Dirac House
Temple Back
Bristol BS1 6BE
email: cp...@fe...
Tel: +44 (0)117 930 1220
Fax: +44 (0)117 930 1181 |