Re: [Aperture-devel] aperture 1.0.1 beta for text extraction

One thing I forgot. It's a detail, but anyway: with PDF extraction I am
seeing some errors when extracting PDFs with pdfbox. I now use a second
library, jpod, as a fallback in case pdfbox returns nothing.
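
The fallback idea can be sketched roughly as below. This is only an illustration, not my actual code: TextExtractor, readFully and extractWithFallback are hypothetical names, and the real pdfbox and jpod calls would live inside two TextExtractor implementations that are not shown. Since an InputStream can only be read once, the sketch buffers the bytes first so that each library gets a fresh stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.List;

public class FallbackTextExtraction {

    /** Hypothetical adapter: one implementation would wrap pdfbox, another jpod. */
    public interface TextExtractor {
        String extract(InputStream in) throws Exception;
    }

    /** Reads the whole stream into memory so it can be replayed for each extractor. */
    public static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /**
     * Tries each extractor in order and returns the first non-blank text.
     * Returns an empty string if every extractor fails or finds nothing.
     */
    public static String extractWithFallback(InputStream in, List<TextExtractor> extractors) {
        byte[] data;
        try {
            data = readFully(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (TextExtractor extractor : extractors) {
            try {
                String text = extractor.extract(new ByteArrayInputStream(data));
                if (text != null && !text.trim().isEmpty()) {
                    return text;
                }
            } catch (Exception e) {
                // This library failed on the document; try the next one.
            }
        }
        return "";
    }
}
```

The primary extractor goes first in the list and the fallback second. Buffering the whole document costs memory, so for very large PDFs a temporary file might be a better choice.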

On Dec 5, 2007 11:47 AM, jm <jmu...@gm...> wrote:
> Leo, thanks for the reply. I can certainly give my input in case it
> helps even a little bit.
>
> I had already seen that wiki page (though I think not the latest
> version), but I wrote my code some time ago against an alpha of
> aperture, and no such wiki page existed then.
>
> In my code I am only using Extractors, no Crawlers or DataObjects. I
> only work with input streams, no files etc. I took that decision when I
> started implementing, after looking at some examples; I don't remember
> the details anymore. Maybe it was not the best decision, but using
> only extractors has worked fine for me so far.
>
> My approach for zip, gzip and tar has been to create extractors for
> these types, and the associated factories. Then I use a custom
> ExtractorRegistryImpl and add my extractor factories there (not sure
> if it is the intended way, but it works). In my custom
> ExtractorRegistryImpl I also add a couple of my own extractors that I
> use to replace the ones in aperture (mime and html).
>
> Each extractor's code is mostly trivial. Here is the zip one, without
> exception management etc.; gzip and tar are pretty similar:
>     public void extract(org.ontoware.rdf2go.model.node.URI id, InputStream is,
>             Charset charset, String mimeType, RDFContainer result)
>             throws ExtractorException {
>         ZipInputStream zis = new ZipInputStream(is);
>         while (true) {
>             ZipEntry entry = zis.getNextEntry();
>             if (entry == null) { break; }
>             if (entry.isDirectory()) { continue; }
>             // convert the stream to a markable one (for mime finding etc)
>             InputStream cis = ApertureExt.convertToMarkableStream(zis);
>             String zipenmime = ApertureExt.findMime(entry.getName(), cis);
>             Extractor extractor = ApertureExt.findExtractor(zipenmime);
>             if (extractor == null) { continue; }
>             RDFContainer zentryres = ApertureExt.doApertureExtraction(
>                     entry.getName(), extractor, zipenmime,
>                     ContentTypeInfo.getCharsetFromContentType(zipenmime), cis);
>             ApertureExt.addAll(result, zentryres);
>         }
>     }
>
> hope this helps
> javi
>
>
> On Dec 4, 2007 2:55 PM, Leo Sauermann <leo...@df...> wrote:
> >
> >  It was jm who said at the right time 04.12.2007 14:00 the following words:
> >  More feedback...
> >
> > I was adding gzip text extraction to my code using aperture, and as it
> > is mostly related, tar extraction too. I had to add
> >
> > <description>
> >  <mimeType>application/x-tar</mimeType>
> >  <extensions>tar</extensions>
> > </description>
> >
> >  in the mimetypes.xml that is inside the jar, as no reference to tar
> > was found. Would it be possible to add this for the next version of
> > aperture?
> >
> >  this is possible.
> >
> >  the core aperture developers had a telco two weeks ago about the zipfile
> > problem and we sketched a solution based on "microcrawlers".
> >  Your code and approach are the kind of input we need. Based on your thorough
> > experience, I would also humbly ask you to review our idea and give feedback
> > via the list:
> >  http://aperture.wiki.sourceforge.net/CrawlersThatCrawlDataObjects
> >
> >
> >  Also, can somebody just ack that I am sending the emails properly to the
> > list?
> >
> >  ack! it's good input.
> >
> >  we listen :-)
> >
> >  best
> >  Leo
> >
> >  thanks
> >
> > On Nov 21, 2007 6:54 PM, jm <jmu...@gm...> wrote:
> >
> >
> >  Hello,
> >
> > As I have upgraded from an older alpha(2) to the new beta, here is my
> > feedback. I only use aperture for text extraction.
> >
> > 1. Text files like Hello.java get no extractor assigned. I am not
> > sure what the old behaviour was here, but I thought it always used
> > PlainExtractor when the stream was text and no other extractor was
> > found. Is that correct?
> >
> > 2. Regarding MimeExtractor and HtmlExtractor, I used my own versions
> > before. I compared them to the new default versions in aperture and
> > I still get different results (for my own needs my versions are
> > better). I just wanted to know whether you guys have some set of html
> > files and email files that you use for benchmarking the text
> > extraction, so I can further compare my versions against the default
> > versions with those files.
> >
> > thanks,
> > javi
> >
> >
> >
> >
> >
> >  --
> > ____________________________________________________
> > DI Leo Sauermann http://www.dfki.de/~sauermann
> >
> > Deutsches Forschungszentrum fuer
> > Kuenstliche Intelligenz DFKI GmbH
> > Trippstadter Strasse 122
> > P.O. Box 2080 Fon: +49 631 20575-116
> > D-67663 Kaiserslautern Fax: +49 631 20575-102
> > Germany Mail: leo...@df...
> >
> > Geschaeftsfuehrung:
> > Prof.Dr.Dr.h.c.mult. Wolfgang Wahlster (Vorsitzender)
> > Dr. Walter Olthoff
> > Vorsitzender des Aufsichtsrats:
> > Prof. Dr. h.c. Hans A. Aukes
> > Amtsgericht Kaiserslautern, HRB 2313
> > ____________________________________________________
> >
> >
>