From: jm <jmu...@gm...> - 2007-12-10 12:23:21
One thing I forgot; it's a detail, but anyway: for PDF extraction I am seeing some errors when extracting with pdfbox. I now use a second library, jpod, as a fallback in case pdfbox returns nothing.

On Dec 5, 2007 11:47 AM, jm <jmu...@gm...> wrote:
> Leo, thanks for the reply. I can certainly give my input in case it
> helps even a little bit.
>
> I had already seen that wiki page (though probably not the latest
> version), but I wrote my code some time ago, with an alpha of aperture,
> and no such wiki page existed then.
>
> In my code I am only using Extractors, no Crawlers or DataObjects. I
> only work with input streams, no files etc. I made that decision when I
> started implementing, after looking a bit at some examples; I don't
> remember the details anymore. Maybe it was not the best decision, but
> using only extractors has worked fine for me so far.
>
> My approach for zip, gzip and tar has been to create extractors for
> these types, and the associated factories. Then I use a custom
> ExtractorRegistryImpl and add my extractor factories there (not sure
> if it is the intended way, but it works). In my custom
> ExtractorRegistryImpl I also add a couple of my own extractors that I
> use to replace the ones in aperture (mime and html).
>
> Each extractor's code is mostly trivial. Here is the zip one, without
> exception management etc.; gzip and tar are pretty similar:
>
>   public void extract(org.ontoware.rdf2go.model.node.URI id,
>       InputStream is, Charset charset, String mimeType, RDFContainer result)
>       throws ExtractorException {
>     ZipInputStream zis = new ZipInputStream(is);
>     while (true) {
>       ZipEntry entry = zis.getNextEntry();
>       if (entry == null) { break; }
>       if (entry.isDirectory()) { continue; }
>       // convert the stream to a markable one (for mime finding etc)
>       InputStream cis = ApertureExt.convertToMarkableStream(zis);
>       String zipenmime = ApertureExt.findMime(entry.getName(), cis);
>       Extractor extractor = ApertureExt.findExtractor(zipenmime);
>       if (extractor == null) { continue; }
>       RDFContainer zentryres =
>           ApertureExt.doApertureExtraction(entry.getName(), extractor,
>           zipenmime, ContentTypeInfo.getCharsetFromContentType(zipenmime), cis);
>       ApertureExt.addAll(result, zentryres);
>     }
>   }
>
> hope this helps
> javi
>
>
> On Dec 4, 2007 2:55 PM, Leo Sauermann <leo...@df...> wrote:
> >
> > It was jm who said at the right time 04.12.2007 14:00 the following words:
> > More feedback...
> >
> > I was adding gzip text extraction to my code using aperture, and as it
> > is mostly related, tar extraction too. I had to add
> >
> > <description>
> > <mimeType>application/x-tar</mimeType>
> > <extensions>tar</extensions>
> > </description>
> >
> > in the mimetypes.xml that is inside the jar, as no reference to tar
> > was found. Would it be possible to add this for the next version of
> > aperture?
> >
> > this is possible.
> >
> > the core aperture developers had a telco two weeks ago about the zipfile
> > problem and we sketched a solution based on "microcrawlers".
> > Your code and approach are the kind of input we need. Based on your thorough
> > experience, I would also humbly ask you to review our idea and give feedback
> > via the list:
> > http://aperture.wiki.sourceforge.net/CrawlersThatCrawlDataObjects
> >
> >
> > Also, can somebody just ack that I am sending the emails properly to the
> > list?
> >
> > ack! it's good input.
> >
> > we listen :-)
> >
> > best
> > Leo
> >
> > thanks
> >
> > On Nov 21, 2007 6:54 PM, jm <jmu...@gm...> wrote:
> >
> >
> > Hello,
> >
> > As I have upgraded from an older alpha(2) to the new beta, here is my
> > feedback. I only use aperture for text extraction.
> >
> > 1. Text files like a Hello.java get no extractor assigned. I am not
> > sure what the old behaviour was here, but I thought it always used
> > PlainExtractor when the stream was text and no other extractor was
> > found. Is that correct?
> >
> > 2. Regarding MimeExtractor and HtmlExtractor: before, I used my own
> > versions. I compared them to the new default versions in aperture and
> > I still get different results (for my own needs my versions are
> > better). I just wanted to know whether you guys have some sort of html
> > files and email files that you use for benchmarking the text
> > extraction, so I can further compare my versions against the default
> > versions with those files.
> >
> > thanks,
> > javi
> >
> >
> > _______________________________________________
> > Aperture-devel mailing list
> > Ape...@li...
> > https://lists.sourceforge.net/lists/listinfo/aperture-devel
> >
> >
> > --
> > ____________________________________________________
> > DI Leo Sauermann       http://www.dfki.de/~sauermann
> >
> > Deutsches Forschungszentrum fuer
> > Kuenstliche Intelligenz DFKI GmbH
> > Trippstadter Strasse 122
> > P.O. Box 2080          Fon:  +49 631 20575-116
> > D-67663 Kaiserslautern Fax:  +49 631 20575-102
> > Germany                Mail: leo...@df...
> >
> > Management:
> > Prof.Dr.Dr.h.c.mult. Wolfgang Wahlster (Chairman)
> > Dr. Walter Olthoff
> > Chairman of the Supervisory Board:
> > Prof. Dr. h.c. Hans A. Aukes
> > Amtsgericht Kaiserslautern, HRB 2313
> > ____________________________________________________
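
For reference, the "try pdfbox first, then jpod" fallback mentioned at the top of this message could look roughly like the sketch below. This is only an illustration under assumptions, not the actual code: the class name FallbackExtraction, the TextBackend interface and the extractWithFallback method are hypothetical names, and the real pdfbox and jpod calls would have to be wrapped behind TextBackend by hand. The one point it tries to show is that the input has to be buffered first, because a plain InputStream cannot be read again once the first library has consumed it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.List;

final class FallbackExtraction {

    /** Hypothetical wrapper around one PDF text library (e.g. pdfbox or jpod). */
    interface TextBackend {
        String extractText(InputStream in) throws Exception;
    }

    /**
     * Tries each backend in order and returns the first non-blank result,
     * or an empty string if every backend fails or returns nothing.
     */
    static String extractWithFallback(InputStream in, List<TextBackend> backends) {
        final byte[] data;
        try {
            // Buffer the input once: a plain InputStream cannot be re-read
            // after the first backend has consumed it. Requires Java 9+.
            data = in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (TextBackend backend : backends) {
            try {
                // Each backend gets its own fresh stream over the same bytes.
                String text = backend.extractText(new ByteArrayInputStream(data));
                if (text != null && !text.trim().isEmpty()) {
                    return text;
                }
            } catch (Exception e) {
                // This backend failed; fall through to the next one.
            }
        }
        return "";
    }
}
```

The backends would be passed in priority order, for example a pdfbox wrapper followed by a jpod wrapper. Buffering the whole document in memory is a simplification that may not suit very large PDFs.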