From: Jose G. L. <jg...@gs...> - 2008-05-19 16:15:25

Hi all,

I have just downloaded the new beta version and I'm testing the mbox
crawler. Everything seems to work, but I have a question: if I run the
crawler several times over the same mbox file, it always reports one
modified object, even though I haven't made any changes to the file.

The first time:

    Saved RDF model to /tmp/mboxcrawl.rdf
    Crawl report
    Crawl started: Mon May 19 18:01:38 CEST 2008
    Crawl stopped: Mon May 19 18:05:46 CEST 2008
    Crawl time: 247647ms
    Exit code: completed
    New objects: 3231
    Modified objects: 0
    Unmodified objects: 1
    Deleted objects: 0

The next time:

    Saved RDF model to /tmp/mboxcrawl.rdf
    Crawl report
    Crawl started: Mon May 19 17:58:17 CEST 2008
    Crawl stopped: Mon May 19 17:58:32 CEST 2008
    Crawl time: 14884ms
    Exit code: completed
    New objects: 0
    Modified objects: 1
    Unmodified objects: 3231
    Deleted objects: 0

Another thing: I remember reading something about mbox files that write
email addresses in a different format to avoid spam, for example:

    From: glgxg at sbcglobal.net (NoOp)

Is this case supported?

Congratulations on this new version, I'm very happy with the new
crawler. I think I could use it as an information extractor for the
Qualipso project with very few modifications (ontology conversion, a
Jena repository...). Are there any licensing problems with these example
crawlers?

Regards,
-- 
José Gato Luis            | Libre Software Engineering Lab (GSyC)
Tel: (+34)-914 888 105    | Universidad Rey Juan Carlos
jg...@gs...               | Edif. Departamental II - Despacho 121
http://libresoft.urjc.es/ | c/Tulipán s/n 28933 Móstoles (Madrid)
From: Jose G. L. <jg...@gs...> - 2008-05-20 09:46:41

Hi all,

I have another implementation question. Is there any way to use the
example handler together with my own functions? For example, I want to
use the SimpleCrawlerHandler, but I want the crawler to call my own
newObject function before the newObject function implemented in
SimpleCrawlerHandler. I want to use all the functionality in the
mboxCrawlerExample (abstractHandler, SimpleHandler,
ValidationHandler...) and implement only a few functions myself.

Thanks all,
From: Antoni M. <ant...@gm...> - 2008-05-21 00:32:50

Jose Gato Luis wrote:
> I have another implementation question. Is there any way to use the
> example handler with my own functions? For example, I want to use the
> SimpleCrawlerHandler, but I want the crawler to call my own newObject
> function before the newObject function implemented in
> SimpleCrawlerHandler. I want to use all the functions in the
> mboxCrawlerExample (abstractHandler, SimpleHandler,
> ValidationHandler...) with only a few functions implemented by me.

I think I don't understand the question. You can always subclass
SimpleCrawlerHandler and override a single objectNew method, do whatever
you please with the DataObject, and call super.objectNew at the end.
Does this answer your question?

The examples were never really meant for production use; they are more a
tool to easily test the Aperture components and showcase their
abilities. If you have specific issues with them, post SourceForge
tickets, preferably with patches if possible :)

Antoni Mylka
ant...@gm...
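
The subclass-and-override approach described above can be sketched roughly as follows. This is a minimal illustration, not Aperture code: `DataObject` and `SimpleCrawlerHandler` are tiny hypothetical stand-ins so the snippet compiles on its own, and `MyCrawlerHandler` is an invented name. The real `objectNew` signature may differ (as far as I recall it also receives the crawler), so check the Aperture Javadoc before copying this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Aperture's DataObject (assumption, not the real class).
class DataObject {
    private final String id;
    DataObject(String id) { this.id = id; }
    String getID() { return id; }
}

// Hypothetical stand-in for the SimpleCrawlerHandler from the examples folder.
class SimpleCrawlerHandler {
    final List<String> log = new ArrayList<>();

    // Represents whatever the example handler does by default with a new object.
    void objectNew(DataObject object) {
        log.add("default:" + object.getID());
    }
}

// Only objectNew is overridden; every other method is inherited unchanged.
class MyCrawlerHandler extends SimpleCrawlerHandler {
    @Override
    void objectNew(DataObject object) {
        // Custom processing first, e.g. pushing the RDF somewhere.
        log.add("custom:" + object.getID());
        // Then hand over to the inherited behaviour.
        super.objectNew(object);
    }
}

class HandlerOverrideSketch {
    public static void main(String[] args) {
        MyCrawlerHandler handler = new MyCrawlerHandler();
        handler.objectNew(new DataObject("mbox://example/1"));
        System.out.println(handler.log);
    }
}
```

Calling `super.objectNew` last means the custom code runs before the example handler's own processing, which is the ordering asked for in the question.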
From: Jose G. L. <jg...@gs...> - 2008-05-22 16:23:14

Antoni Myłka wrote:
> I think I don't understand the question. You can always subclass
> SimpleCrawlerHandler and override a single objectNew method, do
> whatever you please with the DataObject and call super.objectNew at
> the end. Does this answer your question?

Sure, thank you.

> The examples were never really meant for production use, more a tool
> to easily test the Aperture components and showcase their abilities.
> If you have specific issues with them, post SourceForge tickets,
> preferably with patches if possible :)

At the moment we only need to do very simple tasks. I can use the
example crawler and use the objectNew method to send the RDF
information to a semantic repository to be stored.
From: Antoni M. <ant...@gm...> - 2008-05-20 23:55:54

Jose Gato Luis wrote:
> I have just downloaded the new beta version and I'm testing the mbox
> crawler. Everything seems to work, but if I run the crawler several
> times over the same mbox file I always get one modified object, even
> though I didn't modify the file.
> [...]
> New objects: 0
> Modified objects: 1
> Unmodified objects: 3231
> Deleted objects: 0

Please file a bug on SourceForge. If the file you're crawling is
publicly available, please include a link, or attach it if it's
reasonably small. And feel free to submit a patch if you find a
solution :)

> Another thing: I remember reading something about mbox files with
> email addresses in different formats to avoid spam, for example:
>
> From: glgxg at sbcglobal.net (NoOp)
>
> Is this case supported?

I remember stumbling over headers that looked much worse than that. The
crawler should do as follows:

    <uri:emailUri> nmo:from <uri:contactUri> .
    <uri:contactUri> a nco:Contact .
    <uri:contactUri> nco:hasEmailAddress <mailto:glgxg%20at%20sbcglobal.net> .
    <mailto:glgxg%20at%20sbcglobal.net> a nco:EmailAddress .
    <mailto:glgxg%20at%20sbcglobal.net> nco:emailAddress "glgxg at sbcglobal.net" .

I.e. it produces an instance of the nco:EmailAddress class whose URI is
composed of mailto: and a URL-encoded version of whatever the value of
the From header is. The instance gets an nco:emailAddress property with
the string literal version of the email address (with all the quirks
the spam-avoiding systems might invent). Clearly the mailto URI will be
wrong and will probably not work if you try to pass it to a mail
application, but it's there. The crawler doesn't try to 'guess' the
correct address from the obfuscated one. This behavior seemed to work
on the test data set, i.e. my mailbox. If it doesn't, file a bug report.

> Congratulations on this new version, I'm very happy with this new
> crawler. I suppose that I could use this crawler as an information
> extractor for the Qualipso project with very few modifications
> (ontology conversion, Jena repository...). Are there any license
> problems with these example crawlers?

The entire content of the examples folder is under the AFL (Academic
Free License); at least that's what the headers in the files say. I
don't see any issues with your using or modifying them. Are there?

Antoni Mylka
ant...@gm...
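
To make the URI construction described above concrete, here is a minimal sketch that turns a raw From value into a percent-encoded mailto: URI using only `java.net.URI`. It illustrates the idea and is not the crawler's actual code; `MailtoUriSketch` and `toMailtoUri` are hypothetical names, and how Aperture performs the encoding internally is an assumption.

```java
import java.net.URI;
import java.net.URISyntaxException;

class MailtoUriSketch {

    // Builds "mailto:" + the percent-encoded raw address.
    // The three-argument URI constructor quotes characters that are illegal
    // in a scheme-specific part, so a space becomes %20.
    static String toMailtoUri(String rawAddress) {
        try {
            return new URI("mailto", rawAddress, null).toString();
        } catch (URISyntaxException e) {
            // E.g. an empty address; reported as an unchecked exception here.
            throw new IllegalArgumentException("Cannot build mailto URI for: " + rawAddress, e);
        }
    }

    public static void main(String[] args) {
        // prints mailto:glgxg%20at%20sbcglobal.net
        System.out.println(toMailtoUri("glgxg at sbcglobal.net"));
    }
}
```

As in the description, the obfuscated text is kept verbatim and only encoded, so the resulting URI identifies the address literal rather than a deliverable mailbox.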
From: Jose G. L. <jg...@gs...> - 2008-05-22 16:23:20

I'm getting the following exception with these headers:

    5972 [main] WARN org.semanticdesktop.aperture.crawler.mbox.MboxCrawler - MessagingException while processing mbox://home/jgato/proyectos/Qualipso/A4/svn/qualipso/private/work/A4/src/../tools/mbox-test/UbuntuUsers-2008-February.mbox/47A880D6.7030807%40rug.nl-886012653
    javax.mail.internet.AddressException: Illegal whitespace in address in string ``a.l.w.kuijper at rug.nl''
        at javax.mail.internet.InternetAddress.checkAddress(InternetAddress.java:926)
        at javax.mail.internet.InternetAddress.parse(InternetAddress.java:819)
        at javax.mail.internet.InternetAddress.parseHeader(InternetAddress.java:580)
        at javax.mail.internet.MimeMessage.getAddressHeader(MimeMessage.java:680)
        at javax.mail.internet.MimeMessage.getFrom(MimeMessage.java:340)
        at org.semanticdesktop.aperture.crawler.mail.DataObjectFactory.handleSinglePart(DataObjectFactory.java:289)
        at org.semanticdesktop.aperture.crawler.mail.DataObjectFactory.handleMailPart(DataObjectFactory.java:197)
        at org.semanticdesktop.aperture.crawler.mail.DataObjectFactory.createDataObjects(DataObjectFactory.java:149)
        at org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler.getObject(AbstractJavaMailCrawler.java:464)
        at org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler.crawlMessage(AbstractJavaMailCrawler.java:366)
        at org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler.crawlMessages(AbstractJavaMailCrawler.java:332)
        at org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler.crawlSingleFolder(AbstractJavaMailCrawler.java:281)
        at org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler.crawlFolder(AbstractJavaMailCrawler.java:212)
        at org.semanticdesktop.aperture.crawler.mbox.MboxCrawler.crawlObjects(MboxCrawler.java:90)
        at org.semanticdesktop.aperture.crawler.base.CrawlerBase.crawl(CrawlerBase.java:216)
        at org.qualipso.a4.informationsource.tool.ExampleMboxCrawler.crawl(ExampleMboxCrawler.java:90)

Maybe I'm using an older jar version...

On 21/05/08 Antoni Myłka wrote:
> I remember stumbling over headers that looked much worse than that.
> The crawler should do as follows:
>
>     <uri:emailUri> nmo:from <uri:contactUri> .
>     <uri:contactUri> a nco:Contact .
>     <uri:contactUri> nco:hasEmailAddress <mailto:glgxg%20at%20sbcglobal.net> .
>     <mailto:glgxg%20at%20sbcglobal.net> a nco:EmailAddress .
>     <mailto:glgxg%20at%20sbcglobal.net> nco:emailAddress "glgxg at sbcglobal.net" .
>
> I.e. it produces an instance of the nco:EmailAddress class whose URI
> is composed of mailto: and a URL-encoded version of whatever the value
> of the From header is. The instance gets an nco:emailAddress property
> with the string literal version of the email address (with all the
> quirks the spam-avoiding systems might invent). Clearly the mailto URI
> will be wrong and will probably not work if you try to pass it to a
> mail application, but it's there. The crawler doesn't try to 'guess'
> the correct address from the obfuscated one. This behavior seemed to
> work on the test data set, i.e. my mailbox. If it doesn't, file a bug
> report.
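
The trace above shows JavaMail's address parsing rejecting the obfuscated From value because of the whitespace. One possible workaround, which is not something the crawler itself does (it deliberately avoids guessing), is to rewrite such headers before crawling. The following is a minimal sketch under the assumption that the headers use the `local at domain (Name)` form seen in this thread; `ObfuscatedAddressSketch` and `deobfuscate` are hypothetical names and not part of Aperture or JavaMail.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ObfuscatedAddressSketch {

    // Matches "local at domain", optionally followed by "(Display Name)".
    private static final Pattern AT_FORM =
            Pattern.compile("^\\s*(\\S+) at (\\S+)(?:\\s+\\((.*)\\))?\\s*$");

    // Returns "local@domain" if the value has the obfuscated form,
    // or an empty Optional if it does not (the value is then left alone).
    static Optional<String> deobfuscate(String headerValue) {
        Matcher m = AT_FORM.matcher(headerValue);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1) + "@" + m.group(2));
    }

    public static void main(String[] args) {
        System.out.println(deobfuscate("a.l.w.kuijper at rug.nl"));
        System.out.println(deobfuscate("glgxg at sbcglobal.net (NoOp)"));
        System.out.println(deobfuscate("someone@example.org"));
    }
}
```

This is a guess about the sender's intent, so it would only be appropriate as a pre-processing step on archives known to use this obfuscation; for anything else the raw literal is the safer thing to store.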