From: jm <jmu...@gm...> - 2007-11-21 17:54:46
|
Hello,

I have upgraded from an older alpha (2) to the new beta, so here is my feedback. I only use Aperture for text extraction.

1. Text files such as Hello.java get no extractor assigned. I am not sure what the old behaviour was, but I thought PlainExtractor was always used when the stream was text and no other extractor was found. Is that correct?

2. Regarding MimeExtractor and HtmlExtractor: I previously used my own versions. I compared them with the new default versions in Aperture and I still get different results (for my own needs my versions are better). Do you have a set of HTML and email files that you use for benchmarking text extraction, so that I can compare my versions against the default ones on those files?

thanks,
javi
|
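The fallback asked about in point 1 can be sketched in plain Java without touching Aperture's API. This is only an illustration of the idea: `ExtractorChooser`, its methods, the extractor labels and the MIME type used for `Hello.java` are all hypothetical, and whether the beta behaves this way is exactly the open question.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: choose an extractor label by MIME type, with a plain-text fallback. */
public class ExtractorChooser {
    // Illustrative lookup table; this is not Aperture's extractor registry.
    private final Map<String, String> byMimeType = new LinkedHashMap<>();

    public ExtractorChooser register(String mimeType, String extractorLabel) {
        byMimeType.put(mimeType, extractorLabel);
        return this;
    }

    /**
     * Returns the label registered for the exact MIME type. If there is none,
     * any other text/* type falls back to "plain"; everything else yields null.
     */
    public String choose(String mimeType) {
        if (mimeType == null) {
            return null;
        }
        String exact = byMimeType.get(mimeType);
        if (exact != null) {
            return exact;
        }
        return mimeType.startsWith("text/") ? "plain" : null;
    }
}
```

With only `text/html` registered, a source file detected as some `text/...` type would still get the plain-text label, while a binary type would get nothing.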
From: jm <jmu...@gm...> - 2007-12-04 13:00:07
|
More feedback...

I was adding gzip text extraction to my code using Aperture and, as it is closely related, tar extraction too. I had to add

    <description>
      <mimeType>application/x-tar</mimeType>
      <extensions>tar</extensions>
    </description>

to the mimetypes.xml inside the jar, because it contained no reference to tar. Would it be possible to add this to the next version of Aperture?

Also, can somebody just ack that I am sending the emails properly to the list?

thanks
|
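For readers wondering what the gzip part involves, here is a minimal sketch that uses only `java.util.zip` from the JDK. It is not the poster's extractor: the class name `GzipText`, the UTF-8 assumption and the in-memory `compress` helper are illustrative choices, and a real extractor would hand the decompressed stream on to MIME detection rather than decode it directly.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch: decompress gzip data and decode it as UTF-8 text. */
public class GzipText {

    /** Decompresses the given gzip bytes and returns the content as a string. */
    public static String extract(byte[] gzipBytes) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(gzipBytes))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for trying the sketch out: gzip-compresses a string in memory. */
    public static byte[] compress(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Calling `GzipText.extract(GzipText.compress("some text"))` round-trips the string.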
From: Leo S. <leo...@df...> - 2007-12-04 13:56:16
|
It was jm who said at the right time 04.12.2007 14:00 the following words:
> More feedback...
>
> I was adding gzip text extraction to my code using aperture, and as it
> is mostly related, tar extraction too. I had to add
>
> <description>
> <mimeType>application/x-tar</mimeType>
> <extensions>tar</extensions>
> </description>
>
> in the mimetypes.xml that is inside the jar, as no reference to tar
> was found. Would it be possible to add this for next version of
> aperture?

This is possible.

The core Aperture developers had a conference call two weeks ago about the zip file problem, and we sketched a solution based on "microcrawlers". Your code and approach are the kind of input we need. Based on your thorough experience, I would also humbly ask you to review our idea and give feedback via the list:
http://aperture.wiki.sourceforge.net/CrawlersThatCrawlDataObjects

> Also, can somebody just ack that I am sending the emails properly to the list?

Ack! It's good input.

We listen :-)

best
Leo

> _______________________________________________
> Aperture-devel mailing list
> Ape...@li...
> https://lists.sourceforge.net/lists/listinfo/aperture-devel

--
____________________________________________________
DI Leo Sauermann        http://www.dfki.de/~sauermann

Deutsches Forschungszentrum fuer
Kuenstliche Intelligenz DFKI GmbH
Trippstadter Strasse 122
P.O. Box 2080           Phone: +49 631 20575-116
D-67663 Kaiserslautern  Fax:   +49 631 20575-102
Germany                 Mail:  leo...@df...

Management:
Prof.Dr.Dr.h.c.mult. Wolfgang Wahlster (Chairman)
Dr. Walter Olthoff
Chairman of the Supervisory Board:
Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
____________________________________________________
|
From: jm <jmu...@gm...> - 2007-12-05 10:47:10
|
Leo, thanks for the reply. I can certainly give my input in case it helps even a little bit.

I had already seen that wiki page (though probably not the latest version), but I wrote my code some time ago against an alpha of Aperture, when no such wiki page existed.

In my code I use only Extractors, no Crawlers or DataObjects. I work only with InputStreams, no files etc. I took that decision when I started implementing, after looking at some examples; I don't remember the details anymore. Maybe it was not the best decision, but using only extractors has worked fine for me so far.

My approach for zip, gzip and tar has been to create extractors for these types, plus the associated factories. I then use a custom ExtractorRegistryImpl and add my extractor factories there (not sure if that is the intended way, but it works). In my custom ExtractorRegistryImpl I also add a couple of my own extractors that replace the ones in Aperture (mime and html).

Each extractor's code is mostly trivial. Here is the zip one without exception management etc.; gzip and tar are pretty similar:

    import java.io.InputStream;
    import java.nio.charset.Charset;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public void extract(org.ontoware.rdf2go.model.node.URI id, InputStream is,
            Charset charset, String mimeType, RDFContainer result)
            throws ExtractorException {
        ZipInputStream zis = new ZipInputStream(is);
        ZipEntry entry;
        // getNextEntry() returns null after the last entry
        while ((entry = zis.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            // convert the stream to a markable one (for mime finding etc)
            InputStream cis = ApertureExt.convertToMarkableStream(zis);
            String zipenmime = ApertureExt.findMime(entry.getName(), cis);
            Extractor extractor = ApertureExt.findExtractor(zipenmime);
            if (extractor == null) {
                continue;
            }
            RDFContainer zentryres = ApertureExt.doApertureExtraction(
                    entry.getName(), extractor, zipenmime,
                    ContentTypeInfo.getCharsetFromContentType(zipenmime), cis);
            ApertureExt.addAll(result, zentryres);
        }
    }

hope this helps
javi
|
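The entry loop in the message above depends on the poster's own `ApertureExt` helper, so it cannot be run as posted. The following self-contained sketch shows the same walking pattern with only `java.util.zip`: every non-directory entry is read to the end and collected as UTF-8 text. `ZipWalk`, `readEntries` and the `zipOf` helper are hypothetical names, and plain text decoding stands in for the per-entry extractor lookup.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical sketch: walk a zip archive and read each file entry as text. */
public class ZipWalk {

    /** Returns entry name -> UTF-8 text for every non-directory entry, in archive order. */
    public static Map<String, String> readEntries(byte[] zipBytes) {
        Map<String, String> result = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                // read() returns -1 at the end of the current entry, not of the archive
                while ((n = zis.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                result.put(entry.getName(), new String(out.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    /** Helper for trying the sketch out: builds an in-memory zip from name/text pairs. */
    public static byte[] zipOf(String... nameThenText) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            for (int i = 0; i + 1 < nameThenText.length; i += 2) {
                zos.putNextEntry(new ZipEntry(nameThenText[i]));
                zos.write(nameThenText[i + 1].getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

One thing to watch when delegating to per-entry extractors: closing a stream that wraps the `ZipInputStream` also closes the archive, so the delegate must leave the stream open if later entries are still to be read.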
From: Leo S. <leo...@df...> - 2007-12-05 16:49:12
|
It was jm who said at the right time 05.12.2007 11:47 the following words:
> In my code I am only using Extractors, no Crawlers or DataObjects. I
> only work with inputstreams, no files etc, and at the time I started
> implementing I took that decision after looking a bit at some
> examples, don't remember anymore the details. And maybe its was not
> the best decision, but using only extractors works fine so far for me.

If your application works on a single file as input, extractors are fine.

> My approach for zip, gzip and tar has been to create extractors for
> these types, and the associated factories. Then I use a custom
> ExtractorRegistryImpl and add my extractor factories there (not sure
> if it is the intended way, but works).

You can just change the XML file and create the registry with the modified file.

> In my custom
> ExtractorRegistryImpl I also add a couple of my own extractors I use
> to replace the ones in aperture (mime and html).
>
> Each extractors code is mostly trivial, here is the zip one without
> exception management etc, gzip and tar are pretty similar:
> public void extract(org.ontoware.rdf2go.model.node.URI id,
> InputStream is, Charset charset, String mimeType, RDFContainer result)
> throws ExtractorException {
> ZipInputStream zis = new ZipInputStream(is);
> while (true) {
> ZipEntry entry = zis.getNextEntry();
> if (entry == null) { break;}
> if (entry.isDirectory()) {continue;}
> // convert the stream to a markable one (for mime finding etc)
> InputStream cis = ApertureExt.convertToMarkableStream(zis);
> String zipenmime = ApertureExt.findMime(entry.getName(), cis);
> Extractor extractor = ApertureExt.findExtractor(zipenmime);
> if (extractor == null) {continue;}
> RDFContainer zentryres =
> ApertureExt.doApertureExtraction(entry.getName(), extractor,
> zipenmime, ContentTypeInfo.getCharsetFromContentType(zipenmime), cis);
> ApertureExt.addAll(result, zentryres);
> }
> }
>
> hope this helps

Yep, nice inspiration.

I notice you made a big helper object "ApertureExt" that gathers all the "useful bits" for you as static methods. This is a good idea; we may add something like it to Aperture sooner or later (either like openrdf's "Rio" or like RDF2Go's "RDF2Go" class).

best
Leo
|
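To make the "change the XML file" suggestion concrete, here is a sketch that reads `<description>` entries of the shape quoted earlier in the thread into an extension-to-MIME-type table, using only the JDK's DOM parser. It does not use Aperture's registry classes. The class name `MimeTypeTable`, the `<root>` wrapper element in the usage example and the comma-separated handling of `<extensions>` are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch: build an extension -> MIME type table from <description> elements. */
public class MimeTypeTable {

    /** Parses the XML text and maps each listed file extension to its MIME type. */
    public static Map<String, String> load(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> byExtension = new HashMap<>();
            NodeList descriptions = doc.getElementsByTagName("description");
            for (int i = 0; i < descriptions.getLength(); i++) {
                Element description = (Element) descriptions.item(i);
                String mimeType = childText(description, "mimeType");
                String extensions = childText(description, "extensions");
                if (mimeType == null || extensions == null) {
                    continue;
                }
                // Assumption: several extensions may be listed, separated by commas.
                for (String ext : extensions.split(",")) {
                    String key = ext.trim().toLowerCase(Locale.ROOT);
                    if (!key.isEmpty()) {
                        byExtension.put(key, mimeType);
                    }
                }
            }
            return byExtension;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse MIME type XML", e);
        }
    }

    /** Returns the trimmed text of the first descendant with the given tag, or null. */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }
}
```

For example, `MimeTypeTable.load("<root><description><mimeType>application/x-tar</mimeType><extensions>tar</extensions></description></root>")` yields a table in which `tar` maps to `application/x-tar`.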
From: jm <jmu...@gm...> - 2007-12-10 12:23:21
|
One thing I forgot. It's a detail, but anyway: for PDF extraction I am seeing some errors when extracting with PDFBox. I now use a second library, jPod, as a fallback in case PDFBox returns nothing.
|
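The try-one-library-then-the-other idea can be sketched independently of any PDF library. In this illustration each extractor is just a function from bytes to text; `FallbackExtraction` and `firstNonEmpty` are hypothetical names, and the real PDFBox and jPod calls are deliberately left out because their APIs are not shown in this thread.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch: try several text extractors in order until one yields text. */
public class FallbackExtraction {

    /**
     * Applies each extractor to the data in turn and returns the first result that
     * contains non-whitespace text. An extractor that throws a RuntimeException is
     * treated like one that returned nothing.
     */
    public static Optional<String> firstNonEmpty(byte[] data,
            List<Function<byte[], String>> extractors) {
        for (Function<byte[], String> extractor : extractors) {
            try {
                String text = extractor.apply(data);
                if (text != null && !text.trim().isEmpty()) {
                    return Optional.of(text);
                }
            } catch (RuntimeException e) {
                // fall through to the next extractor in the list
            }
        }
        return Optional.empty();
    }
}
```

The order of the list encodes the preference: the primary library goes first and the fallback second. Logging the swallowed exception would be a sensible addition in real code.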