Thread: [Aperture-devel] aperture 1.0.1 beta for text extraction

Brought to you by: cfmfluit, leo_sauermann, mylka, reuschling

[Aperture-devel] aperture 1.0.1 beta for text extraction

From: jm <jmu...@gm...> - 2007-11-21 17:54:46

Hello,

As I have upgraded from an older alpha(2) to the new beta, here is my
feedback. I only use aperture for text extraction.

1. Text files like a Hello.java get no extractor assigned. I am not
sure what the old behaviour was, but I thought it always used
PlainExtractor when the stream was text and no other extractor was
found. Is that correct?

2. Regarding MimeExtractor and HtmlExtractor: before, I used my own
versions. I compared them to the new default versions in aperture and
I still get different results (for my own needs my versions are
better). I just wanted to know whether you guys have some set of HTML
files and email files that you use for benchmarking the text
extraction, so I can further compare my versions against the default
versions with those files.
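
A minimal sketch of the kind of fallback check item 1 describes: deciding that a stream "is text" before handing it to a plain-text extractor. This is not Aperture's behaviour or API. The class name `TextSniffer`, the method `looksLikeText`, the 512-byte sample size and the 5% control-character threshold are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of Aperture.
public class TextSniffer {
    private static final int SAMPLE_SIZE = 512; // assumed sample size

    /**
     * Peeks at up to SAMPLE_SIZE bytes of a mark-supporting stream, guesses
     * whether the content is text, and resets the stream afterwards.
     */
    public static boolean looksLikeText(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(SAMPLE_SIZE);
            byte[] buf = new byte[SAMPLE_SIZE];
            int total = 0;
            int n;
            while (total < SAMPLE_SIZE
                    && (n = in.read(buf, total, SAMPLE_SIZE - total)) > 0) {
                total += n;
            }
            in.reset();
            if (total == 0) {
                return false; // empty stream: nothing to extract
            }
            int control = 0;
            for (int i = 0; i < total; i++) {
                int b = buf[i] & 0xFF;
                if (b == 0) {
                    return false; // a NUL byte is treated as a sign of binary data
                }
                if (b < 0x20 && b != '\t' && b != '\n' && b != '\r' && b != '\f') {
                    control++;
                }
            }
            // Assumed threshold: at most 5% other control characters.
            return control * 20 <= total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience overload for in-memory data. */
    public static boolean looksLikeText(byte[] data) {
        return looksLikeText(new ByteArrayInputStream(data));
    }
}
```

Because of the NUL-byte rule, this heuristic would classify UTF-16 text as binary, so it only suits single-byte encodings and UTF-8.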

thanks,
javi

Re: [Aperture-devel] aperture 1.0.1 beta for text extraction

From: jm <jmu...@gm...> - 2007-12-04 13:00:07

More feedback...

I was adding gzip text extraction to my code using aperture, and as it
is mostly related, tar extraction too. I had to add

<description>
	<mimeType>application/x-tar</mimeType>
	<extensions>tar</extensions>
</description>

to the mimetypes.xml that is inside the jar, as it contained no
reference to tar. Would it be possible to add this in the next version
of aperture?
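
The entry above maps the `tar` extension to a MIME type. As a rough complement, the sketch below guesses the type of an archive from its leading bytes and only falls back to the file extension. It does not use Aperture's MIME type identifier. The class `ArchiveSniffer` and its method are hypothetical, and the gzip and zip MIME type strings are assumptions (only `application/x-tar` comes from this thread). The magic numbers are the standard ones: gzip starts with 0x1F 0x8B, a zip local file header starts with "PK\3\4", and POSIX tar headers carry "ustar" at offset 257.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Aperture.
public class ArchiveSniffer {
    private static final int TAR_MAGIC_OFFSET = 257;
    private static final byte[] TAR_MAGIC = "ustar".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] GZIP_MAGIC = {0x1F, (byte) 0x8B};
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    /**
     * Guesses a MIME type from the first bytes of a file (ideally at least
     * 262 bytes) and falls back to the extension. Returns null if unknown.
     */
    public static String guessMimeType(String fileName, byte[] head) {
        if (startsWith(head, 0, GZIP_MAGIC)) {
            return "application/gzip";   // assumed type string
        }
        if (startsWith(head, 0, ZIP_MAGIC)) {
            return "application/zip";    // assumed type string
        }
        if (startsWith(head, TAR_MAGIC_OFFSET, TAR_MAGIC)) {
            return "application/x-tar";
        }
        // Old-style tar files have no magic, so the extension is the last resort.
        String lower = fileName == null ? "" : fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".tar")) {
            return "application/x-tar";
        }
        return null;
    }

    private static boolean startsWith(byte[] data, int offset, byte[] magic) {
        if (data == null || data.length < offset + magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[offset + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```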

Also, can somebody just ack that I am sending the emails properly to the list?

thanks


Re: [Aperture-devel] aperture 1.0.1 beta for text extraction

From: Leo S. <leo...@df...> - 2007-12-04 13:56:16

It was jm who said at the right time 04.12.2007 14:00 the following words:
> More feedback...
>
> I was adding gzip text extraction to my code using aperture, and as it
> is mostly related, tar extraction too. I had to add
>
> <description>
> 	<mimeType>application/x-tar</mimeType>
> 	<extensions>tar</extensions>
> </description>
>
>  in the mimetypes.xml that is inside the jar, as no reference to tar
> was found. Would it be possible to add this for next version of
> aperture?
>   
this is possible.

the core aperture developers had a telco two weeks ago about the zipfile
problem and we sketched a solution based on "microcrawlers".
Your code and approach are the kind of input we need. Based on your thorough
experience, I would also humbly ask you to review our idea and give
feedback via the list:
http://aperture.wiki.sourceforge.net/CrawlersThatCrawlDataObjects

> Also, can somebody just ack that I am sending the emails properly to the list?
>   
ack! it's good input.

we listen :-)

best
Leo


-- 
____________________________________________________
DI Leo Sauermann       http://www.dfki.de/~sauermann 

Deutsches Forschungszentrum fuer 
Kuenstliche Intelligenz DFKI GmbH
Trippstadter Strasse 122
P.O. Box 2080           Fon:   +49 631 20575-116
D-67663 Kaiserslautern  Fax:   +49 631 20575-102
Germany                 Mail:  leo...@df...

Geschaeftsfuehrung:
Prof.Dr.Dr.h.c.mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
____________________________________________________

Re: [Aperture-devel] aperture 1.0.1 beta for text extraction

From: jm <jmu...@gm...> - 2007-12-05 10:47:10

Leo, thanks for the reply. I can certainly give my input in case it
helps even a little bit.

I had already seen that wiki page (though probably not the latest
version), but I wrote my code some time ago, against an alpha of aperture,
and no such wiki page existed then.

In my code I am only using Extractors, no Crawlers or DataObjects. I
only work with input streams, not files etc. I took that decision when I
started implementing, after looking a bit at some examples; I don't
remember the details anymore. Maybe it was not the best decision, but
using only extractors has worked fine for me so far.

My approach for zip, gzip and tar has been to create extractors for
these types, and the associated factories. Then I use a custom
ExtractorRegistryImpl and add my extractor factories there (not sure
if it is the intended way, but works). In my custom
ExtractorRegistryImpl I also add a couple of my own extractors I use
to replace the ones in aperture (mime and html).
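
The registration idea in the paragraph above can be pictured with the simplified sketch below. It is not Aperture's `ExtractorRegistryImpl` and does not use Aperture's extractor or factory interfaces. `SimpleRegistry` and `TextExtractor` are hypothetical stand-ins that only show how a later registration for a MIME type can replace a default one.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a registry, not Aperture's API.
public class SimpleRegistry {
    /** Hypothetical extractor type: turns raw bytes into text. */
    public interface TextExtractor {
        String extractText(byte[] data);
    }

    private final Map<String, Supplier<TextExtractor>> factories = new LinkedHashMap<>();

    /**
     * Registers a factory for a MIME type. A later registration for the same
     * MIME type replaces the earlier one, which is how a custom extractor
     * can take the place of a default.
     */
    public SimpleRegistry register(String mimeType, Supplier<TextExtractor> factory) {
        factories.put(mimeType, factory);
        return this;
    }

    /** Returns a fresh extractor for the MIME type, or null if none is registered. */
    public TextExtractor find(String mimeType) {
        Supplier<TextExtractor> factory = factories.get(mimeType);
        return factory == null ? null : factory.get();
    }
}
```

With this shape, registering the defaults first and the custom factories afterwards is enough for the custom ones to win.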

Each extractor's code is mostly trivial. Here is the zip one, without
exception handling etc.; gzip and tar are pretty similar:
    public void extract(org.ontoware.rdf2go.model.node.URI id, InputStream is,
            Charset charset, String mimeType, RDFContainer result)
            throws ExtractorException {
        ZipInputStream zis = new ZipInputStream(is);
        while (true) {
            ZipEntry entry = zis.getNextEntry();
            if (entry == null) { break; }
            if (entry.isDirectory()) { continue; }
            // convert the stream to a markable one (for mime finding etc)
            InputStream cis = ApertureExt.convertToMarkableStream(zis);
            String zipenmime = ApertureExt.findMime(entry.getName(), cis);
            Extractor extractor = ApertureExt.findExtractor(zipenmime);
            if (extractor == null) { continue; }
            RDFContainer zentryres = ApertureExt.doApertureExtraction(
                    entry.getName(), extractor, zipenmime,
                    ContentTypeInfo.getCharsetFromContentType(zipenmime), cis);
            ApertureExt.addAll(result, zentryres);
        }
    }
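
For readers without the `ApertureExt` helper, here is a self-contained sketch of the same entry-by-entry pattern using only the JDK. It does not call Aperture at all: `ZipEntryWalker`, `NonClosingStream` and `zipOf` are hypothetical names, and decoding every entry as UTF-8 is a placeholder for real per-entry extraction. The one design point it adds is a wrapper that ignores `close()`, so that a per-entry consumer cannot close the underlying `ZipInputStream` and break the loop. Whether Aperture extractors close the stream they are given is not stated in this thread, so treat that as a precaution rather than a known problem.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, JDK only.
public class ZipEntryWalker {
    /** Ignores close() so a per-entry consumer cannot close the whole archive. */
    static final class NonClosingStream extends FilterInputStream {
        NonClosingStream(InputStream in) { super(in); }
        @Override public void close() { /* deliberately empty */ }
    }

    /** Returns entry name -> entry content decoded as UTF-8, skipping directories. */
    public static Map<String, String> readEntries(byte[] zipBytes) {
        Map<String, String> result = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // A markable view of the current entry only; closing it leaves zis open.
                try (InputStream cis = new BufferedInputStream(new NonClosingStream(zis))) {
                    result.put(entry.getName(),
                            new String(readFully(cis), StandardCharsets.UTF_8));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    private static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Builds a small in-memory zip from name/content pairs, for demonstration. */
    public static byte[] zipOf(String... nameContentPairs) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            for (int i = 0; i + 1 < nameContentPairs.length; i += 2) {
                zos.putNextEntry(new ZipEntry(nameContentPairs[i]));
                zos.write(nameContentPairs[i + 1].getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```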

hope this helps
javi

Re: [Aperture-devel] aperture 1.0.1 beta for text extraction

From: Leo S. <leo...@df...> - 2007-12-05 16:49:12

It was jm who said at the right time 05.12.2007 11:47 the following words:
> Leo, thanks for the reply. I can certainly give my input in case it
> helps even a little bit.
>
> I had already seen that wiki page (though probably not the latest
> version), but I wrote my code some time ago, against an alpha of aperture,
> and no such wiki page existed then.
>
> In my code I am only using Extractors, no Crawlers or DataObjects. I
> only work with input streams, not files etc. I took that decision when I
> started implementing, after looking a bit at some examples; I don't
> remember the details anymore. Maybe it was not the best decision, but
> using only extractors has worked fine for me so far.
>   
if your application works on a single file as input, extractors are fine.

> My approach for zip, gzip and tar has been to create extractors for
> these types, and the associated factories. Then I use a custom
> ExtractorRegistryImpl and add my extractor factories there (not sure
> if it is the intended way, but works).
you can just change the xml file and create the registry with the 
modified file.

>  In my custom
> ExtractorRegistryImpl I also add a couple of my own extractors I use
> to replace the ones in aperture (mime and html).
>
> Each extractor's code is mostly trivial. Here is the zip one, without
> exception handling etc.; gzip and tar are pretty similar:
>     public void extract(org.ontoware.rdf2go.model.node.URI id, InputStream is,
>             Charset charset, String mimeType, RDFContainer result)
>             throws ExtractorException {
>         ZipInputStream zis = new ZipInputStream(is);
>         while (true) {
>             ZipEntry entry = zis.getNextEntry();
>             if (entry == null) { break; }
>             if (entry.isDirectory()) { continue; }
>             // convert the stream to a markable one (for mime finding etc)
>             InputStream cis = ApertureExt.convertToMarkableStream(zis);
>             String zipenmime = ApertureExt.findMime(entry.getName(), cis);
>             Extractor extractor = ApertureExt.findExtractor(zipenmime);
>             if (extractor == null) { continue; }
>             RDFContainer zentryres = ApertureExt.doApertureExtraction(
>                     entry.getName(), extractor, zipenmime,
>                     ContentTypeInfo.getCharsetFromContentType(zipenmime), cis);
>             ApertureExt.addAll(result, zentryres);
>         }
>     }
>
> hope this helps
>   
yep, nice inspiration.

I notice you made a big helper object "ApertureExt" that gathers all 
"useful bits" for you as static methods.
this is a good idea, we may add something like it to aperture sooner or 
later.
(either doing it like in openrdf's "Rio", or like Rdf2Go's "RDF2Go" class.)

best
Leo


Re: [Aperture-devel] aperture 1.0.1 beta for text extraction

From: jm <jmu...@gm...> - 2007-12-10 12:23:21

One thing I forgot; it's a detail, but anyway... For PDF extraction I am
seeing some errors when extracting with pdfbox. I now use a second
library, jpod, as a fallback in case pdfbox returns nothing.
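
The fallback arrangement can be sketched generically as below. The sketch deliberately does not call pdfbox or jpod: each library would be wrapped in a function from bytes to text, and those wrappers are left out. `FallbackExtraction` and `extractWithFallback` are hypothetical names, and treating a blank result or a runtime exception as "got nothing" is an assumption about what the fallback condition should be.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical helper; the PDF libraries themselves are not used here.
public class FallbackExtraction {
    /**
     * Tries each extractor in order and returns the first non-blank text.
     * An extractor that throws or returns blank text is skipped.
     * Returns an empty string if every extractor fails.
     */
    public static String extractWithFallback(
            byte[] data, List<Function<byte[], String>> extractors) {
        for (Function<byte[], String> extractor : extractors) {
            try {
                String text = extractor.apply(data);
                if (text != null && !text.trim().isEmpty()) {
                    return text;
                }
            } catch (RuntimeException e) {
                // This extractor failed; fall through to the next one.
            }
        }
        return "";
    }
}
```

The order of the list encodes the preference, so the primary library goes first and the fallback second.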
