
TutorialCrawlingExample

2010-03-10
2013-05-13
  • Soh Guan Hoe

    Soh Guan Hoe - 2010-03-10

    Hi all, I have run TutorialCrawlingExample.java to try parsing various file formats such as MS PowerPoint, MS Word, MS Excel, HTML and Adobe PDF. Performance was OK, and I then looked at the output.

    Apart from some extracted text that is displayed as square boxes, there is a special character string, &#xD;, at the end of each line of text. Is this special string a newline representation? It occurs for all the file formats I have tried. Or is it because I am testing on the Windows platform?

    Secondly, if I would like to do extra processing on the extracted text, do I need to subclass the relevant <File>Extractor classes and override the extract(…) method?

    Thanks.

     
  • Soh Guan Hoe

    Soh Guan Hoe - 2010-03-10

    Hi all, I managed to remove the &#xD; but I don't know why it works. Basically, changing Syntax.Trix to Syntax.Turtle removes it, and I don't know why that is either. On closer inspection, Syntax is from the package org.ontoware.rdf2go.model, so I need to go to the other URL to take a look.

    May I make a small suggestion: could the next Aperture release package the API Javadocs, the relevant source code and the documentation from rdf2go together as one release? It is troublesome to toggle back and forth while trying to understand the Aperture framework.

    Thanks.

     
  • Soh Guan Hoe

    Soh Guan Hoe - 2010-03-10

    Hi all, I have a question. Can I plug Apache Tika extraction code into the Aperture Extractor interface? That is, I intend to use Aperture's crawling features but use Apache Tika code for extraction. Would I be violating any license terms of Aperture or Tika?

    It is not that I am dissatisfied with Aperture's default Extractor code, but I do see better implementations in Tika for some extractors.

    Thanks.

     
  • Antoni Mylka

    Antoni Mylka - 2010-03-10

    1. &#xD; are artifacts introduced by the TriX (XML) serialization; it is the XML character reference for a carriage return, which is why it disappears when you switch to Turtle. They come up because the original content contains Windows line breaks, and those line breaks are extracted and placed in the RDF database. If you want to remove them, a simple replaceAll should do the trick.

    2. About RDF2Go javadocs: we'll see what we can do about it.

    3. You won't be violating any licenses unless you remove the copyright headers or release the code (either Tika or Aperture) under a different license or with a statement that you own the copyright. Apart from that you can mix and match at will. As for using Tika extractors in Aperture, we've looked at this ourselves. The basic reason why this hasn't been done before is that Aperture works with RDF triples, while Tika works with XML, or more precisely SAX events. You'd have to write a class that would accept Tika's SAX events, convert them into triples and put the triples in the resulting Model/RDFContainer. We'd be more than happy to help, since such work would probably benefit us all.
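
For point 1, the replaceAll approach could be sketched as follows. This is only a minimal illustration in plain Java: the class and method names are made up for this example and are not part of Aperture or RDF2Go, and it assumes the extracted text is already available as a String.

```java
class CarriageReturnCleaner {

    // Hypothetical helper: drops every carriage return so that Windows
    // line breaks (\r\n) become plain \n and no &#xD; is serialized later.
    public static String stripCarriageReturns(String text) {
        return text.replaceAll("\\r", "");
    }

    public static void main(String[] args) {
        String extracted = "first line\r\nsecond line\r\n";
        String cleaned = stripCarriageReturns(extracted);
        // Show the cleaned text with line feeds made visible.
        System.out.println(cleaned.replace("\n", "\\n"));
    }
}
```

Applying this to the text before it is stored keeps the cleanup independent of the serialization syntax chosen afterwards.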

     
  • Soh Guan Hoe

    Soh Guan Hoe - 2010-03-11

    That is great! I look forward to the above features in the next release. It makes me wonder why the Tika developers have not approached the Aperture developers to co-develop a common framework for crawling and extraction. This could avoid much duplicated effort. Hmmm…

     
  • Antoni Mylka

    Antoni Mylka - 2010-03-11

    Let's say that the current situation is due to certain differences in background.
    They come from the Lucene community, where the most important thing is the text.
    We are from the Semantic Web community, where the most important thing is the triples.

    Tika is optimized to get text and very basic metadata.
    Aperture is optimized to get arbitrarily complex metadata and text.

    There is certainly much common ground, and developing an adapter that would allow Tika parsers to be used in Aperture is certainly not impossible. It has been floating on our todo list for a long while, never quite getting to the top.

    On the other hand, Tika doesn't use Aperture components because of the Apache policy that no Apache component may depend on anything that's not in the central Maven repository. We're not in the central Maven repository because we, in turn, depend on a lot of libraries we have repackaged ourselves (including those from the Eclipse Orbit repository).

    There is a list of ideas we want to incorporate in Aperture 2. This will be one of the more important ones, but for the time being it will stay as it is.
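
To make the adapter idea from this thread more concrete, here is a rough sketch of a SAX handler that turns XHTML-style events into triples. It is not Aperture or Tika code: the Triple class, the ex: predicate names and the convert method are hypothetical stand-ins, and the sample XHTML is parsed with the JDK's own SAX parser rather than a Tika parser. A real adapter would add statements to the Model/RDFContainer and would also need to map Tika's metadata to a proper vocabulary.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxToTriplesSketch {

    /** Hypothetical stand-in for an RDF statement; not an RDF2Go type. */
    static final class Triple {
        final String subject, predicate, object;
        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
        @Override public String toString() {
            return subject + " " + predicate + " \"" + object + "\"";
        }
    }

    /** Collects SAX events describing one document into triples. */
    static final class TripleCollectingHandler extends DefaultHandler {
        private final String subject;
        private final List<Triple> triples = new ArrayList<>();
        private final StringBuilder bodyText = new StringBuilder();
        private boolean inBody = false;

        TripleCollectingHandler(String subject) { this.subject = subject; }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("meta".equals(qName)) {
                // Each <meta name=".." content=".."/> becomes one metadata triple.
                String name = atts.getValue("name");
                String content = atts.getValue("content");
                if (name != null && content != null) {
                    triples.add(new Triple(subject, "ex:" + name, content));
                }
            } else if ("body".equals(qName)) {
                inBody = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inBody) bodyText.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(qName)) {
                // The accumulated body text becomes a single full-text triple.
                inBody = false;
                triples.add(new Triple(subject, "ex:plainTextContent", bodyText.toString().trim()));
            }
        }

        List<Triple> triples() { return triples; }
    }

    /** Parses the XHTML string and returns the triples collected for the subject. */
    public static List<Triple> convert(String xhtml, String subject) throws Exception {
        TripleCollectingHandler handler = new TripleCollectingHandler(subject);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.triples();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><meta name=\"title\" content=\"Report\"/></head>"
                + "<body><p>Hello world</p></body></html>";
        for (Triple t : convert(xhtml, "file:/docs/report.doc")) {
            System.out.println(t);
        }
    }
}
```

The handler only depends on org.xml.sax, so in principle the same class could be handed to any component that emits SAX events; whether that fits Tika's parser API without changes is an assumption that would need checking.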

     
