Kevin C. Bombardier wrote:
> 1. When using the Aperture File System Crawler GUI
> (aperture-2006.1-alpha-3.zip) and I have xls in the directory, the Trix
> output file does not seem to have all the data from each of the rows
> (fullText xml tag). It has some text but not the data from the rows and
> columns. I have checked Determine MIME type and Extract document text
> and metadata.
> It prints the following message(s) in the startup window.
> INFO: regular POI-based processing failed, falling back to heuristic
> string extraction for file:/C:/eclipse/workspace/test/28PyrmdNE2.xls
> Jan 19, 2007 12:03:05 PM
> org.semanticdesktop.aperture.extractor.util.PoiUtil extractAll
This message occurs when Apache POI, the library that we use for
extracting information from MS Office files, fails to process the document.
Fortunately the text in Excel documents is (always?) embedded as regular
human-readable text, so we use a heuristic processor as a fall-back
mechanism that tries to detect all parts of the binary stream that look
like regular text. For example, it identifies bytes that represent
alphanumeric characters and that seem to form words, it suppresses
common font names that are also often encoded as regular text, etc. This
process is highly heuristic and does not work in 100% of the cases.
Still, it's the best we can do and often preferable over no output at all.
If possible, feel free to send me an example document for which this
extraction is missing vital parts and I can see if there's anything we
can do about it.
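To give a rough idea of what such a fallback does, here is a minimal sketch in Java. It is not the actual PoiUtil code; the class name, the minimum run length and the short font list are assumptions made up for illustration. It only collects runs of printable single-byte ASCII characters and drops runs that exactly match a known font name, so text stored as 16-bit characters would be missed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is NOT Aperture's PoiUtil implementation.
final class HeuristicStrings {

    // Assumed sample of font names to suppress; a real list would be longer.
    private static final Set<String> FONT_NAMES =
            Set.of("Arial", "Times New Roman", "Courier New");

    // Returns every run of at least minLength printable ASCII bytes,
    // skipping runs that are exactly a known font name.
    static List<String> extractRuns(byte[] data, int minLength) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF;
            if (c >= 0x20 && c <= 0x7E) {
                current.append((char) c);   // printable character: extend the run
            } else {
                flush(current, minLength, runs); // binary byte: the run ends here
            }
        }
        flush(current, minLength, runs);
        return runs;
    }

    private static void flush(StringBuilder current, int minLength, List<String> runs) {
        String run = current.toString();
        if (run.length() >= minLength && !FONT_NAMES.contains(run)) {
            runs.add(run);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        byte[] sample = {0x00, 'A', 'r', 'i', 'a', 'l', 0x01,
                         'T', 'o', 't', 'a', 'l', 0x00, 'Q', '1', (byte) 0xFF,
                         'R', 'e', 'v', 'e', 'n', 'u', 'e'};
        System.out.println(extractRuns(sample, 4)); // prints [Total, Revenue]
    }
}
```

Short cell values such as "Q1" fall below the length threshold in this sketch, which shows how a heuristic of this kind can lose row and column data.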
> 2. Once the data has been extracted to XML can the XML document be
> rendered/viewed the same as the original it was extracted from? I would
> like to keep the single XML extract around and not the original. I will
> eventually parse out the XML to individual files but that is later on.
> I can keep both the original and the XML extract but if I do not need to
> then I will not. I was told that I should be able to view the images
> and formulas (or anything else from the original document) if I have the
> correct XML plugins/libs in the XML extracted file (MathXML, ...)
I am not 100% sure I understand your question. Do you want to present
the information by rendering the resulting XML using stylesheets,
instead of displaying the original document? Perhaps this is doable but
I wouldn't be surprised if this is rather suboptimal. The extractors in
Aperture are developed with full-text *indexing* in mind, so issues like
text layout are of minor importance and stuff like images and formulas
aren't even extracted at all at the moment. For human consumption, the
original document may be much more readable, but you should test this
on the types of documents that you use.
Also note that Aperture does not really extract to XML, it extracts to
RDF, which has a different data model (in short: a type of labeled
directed graph rather than a labeled tree) that can be serialized in a
variety of ways, for example using one of the two XML-based formats
(RDF/XML and TriX). Depending on your processing needs and capabilities,
one of the other formats may be easier to handle. For example, N-Triples
is well suited for processing with regular expressions.
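As an illustration of the regular-expression approach, here is a small sketch in Java. The predicate URIs and file names are made up, and the pattern is a simplification that covers one triple per line, not the full N-Triples grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the pattern is a simplification of N-Triples.
final class NTriplesGrep {

    // group 1: subject (IRI or blank node), group 2: predicate IRI,
    // group 3: object (IRI, blank node or literal) up to the final " ."
    private static final Pattern TRIPLE = Pattern.compile(
            "^\\s*(<[^>]*>|_:\\S+)\\s+<([^>]*)>\\s+(.*?)\\s*\\.\\s*$");

    // Returns the object of every line whose predicate equals the given IRI.
    static List<String> objectsOf(List<String> lines, String predicate) {
        List<String> objects = new ArrayList<>();
        for (String line : lines) {
            Matcher m = TRIPLE.matcher(line);
            if (m.matches() && m.group(2).equals(predicate)) {
                objects.add(m.group(3));
            }
        }
        return objects;
    }

    public static void main(String[] args) {
        // Example data with invented predicate URIs.
        List<String> lines = List.of(
                "<file:/C:/docs/a.xls> <http://example.org/ns#title> \"Budget 2007\" .",
                "<file:/C:/docs/a.xls> <http://example.org/ns#fullText> \"Total Revenue\" .",
                "# a comment line, skipped because it does not match");
        System.out.println(objectsOf(lines, "http://example.org/ns#title"));
    }
}
```

The object is returned exactly as serialized, so literals keep their surrounding quotes and any escape sequences.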
Finally, the file crawler UI is only meant as an example application; we
encourage people to program against the Aperture APIs instead. For
example, when you implement your own CrawlerHandler, you will receive
your metadata as a sequence of DataObjects, each representing a
file/webpage/email/... Here you can take care of e.g. filtering,
reorganizing and storing metadata.
> 3. The File Inspector GUI allows you to changed the type of metadata
> output. If I pick a format at the beginning can I change it down the
> road (ie. I decide on Trix now but want to change to Turtle or another
> serialized format down the road)? How hard would it be to convert from
> one to the other?
What do you mean by "down the road"? The File Inspector lets you change
the format at any time, and the extraction results are updated
immediately. Or do you mean something else?
> 4. I tried to crawl my test directory that had 1.2GB of data in it,
> 4400 files (mixture of xls, pdf, doc, odt, xml, ps, ppt, rtf) and while
> it started out good, it ran out of java memory at file 78. I increased
> the jvm to init with 512MB and grow to 1GB. It looks like it made it to
> the last file "Crawling completed, saving results..." but I get a stack
> trace in the startup window with OutoFMemory error. The Trix output
> file is 86.2 MB. It does not look like it finished correctly? It did
> only take 15 minutes to get the completed messages though.
> Any information on how much it can handle (# of docs on one crawl,
> types, sizes, memory, ...) Pretty much any perfomance related information.
I'm not surprised to see this happen. Again, the file crawler UI is
meant as a coding example, it has been kept as simple as possible.
Because of this and some historic reasons (the state of Sesame 2 at the
time this code was written), it uses a data structure that holds all
extracted information (full-text and metadata) in RAM and only writes it
to disk at the end of the entire crawl. Clearly, this doesn't scale even
Sesame 2 has since progressed a lot and it now contains a stable
disk-based RDF store. We can update the example code to use this native
store to improve scalability. However, I would recommend you look into
the CrawlerHandler API (see the tutorials on aperture.sourceforge.net),
for example because of your next question. As Aperture focuses on
providing middleware components that handle crawling and extraction
tasks, I'm hesitant to make the examples too complex.
> 5. Has anyone looked into multi-threading this?
To the best of my knowledge: no, not yet. This is again something you
would have to develop yourselves by implementing your own CrawlerHandler.
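One possible shape for this, sketched in plain Java: the handler hands each incoming object to a thread pool and collects the results when the crawl ends. The class and the method names objectFound and crawlFinished are stand-ins invented for this example, not the real CrawlerHandler interface, and the sketch assumes the work done per object is thread-safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch; method names are stand-ins, not Aperture's API.
final class ParallelHandlerSketch<T, R> {
    private final ExecutorService pool;
    private final Function<T, R> work;
    private final List<Future<R>> pending = new ArrayList<>();

    ParallelHandlerSketch(int threads, Function<T, R> work) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.work = work;
    }

    // Called by the single-threaded crawler for every object it finds;
    // the expensive processing runs on a pool thread.
    void objectFound(T item) {
        Callable<R> task = () -> work.apply(item);
        pending.add(pool.submit(task));
    }

    // Called once at the end of the crawl; waits for all tasks and
    // returns their results in submission order.
    List<R> crawlFinished() {
        List<R> results = new ArrayList<>();
        try {
            for (Future<R> future : pending) {
                results.add(future.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("processing failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return results;
    }

    public static void main(String[] args) {
        ParallelHandlerSketch<String, Integer> handler =
                new ParallelHandlerSketch<>(4, String::length);
        handler.objectFound("alpha");
        handler.objectFound("charlie");
        System.out.println(handler.crawlFinished()); // prints [5, 7]
    }
}
```

For a large crawl you would write each result to disk as it completes rather than keep every Future in a list, otherwise the memory problem from question 4 returns.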
> 6. Is there a way/file to configure the metadata that is pull out?
Not at the moment. The proper way to handle this IMO is to have a
CrawlerHandler that suppresses certain metadata from entering a
persistent store. Performance-wise there is little to gain from
configuring the Extractors to output only a subset of the information;
it's often an all-or-nothing matter.
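The filtering idea could look roughly like the following Java sketch. Statement, the handle method, the in-memory store and the "ex:" predicate names are all hypothetical stand-ins; in a real handler the statements would come from the RDF attached to each DataObject and go to a persistent store.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; types and names are stand-ins, not Aperture's API.
final class MetadataFilterSketch {

    // Minimal stand-in for an RDF statement.
    static final class Statement {
        final String subject;
        final String predicate;
        final String object;

        Statement(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return subject + " " + predicate + " " + object;
        }
    }

    private final Set<String> suppressedPredicates;
    private final List<Statement> store = new ArrayList<>(); // stands in for a persistent store

    MetadataFilterSketch(Set<String> suppressedPredicates) {
        this.suppressedPredicates = suppressedPredicates;
    }

    // Everything is still extracted; unwanted predicates are simply
    // dropped before they reach the store.
    void handle(List<Statement> extracted) {
        for (Statement st : extracted) {
            if (!suppressedPredicates.contains(st.predicate)) {
                store.add(st);
            }
        }
    }

    List<Statement> stored() {
        return Collections.unmodifiableList(store);
    }

    public static void main(String[] args) {
        MetadataFilterSketch filter = new MetadataFilterSketch(Set.of("ex:fullText"));
        filter.handle(List.of(
                new Statement("file:/C:/docs/a.xls", "ex:title", "Budget 2007"),
                new Statement("file:/C:/docs/a.xls", "ex:fullText", "Total Revenue")));
        System.out.println(filter.stored()); // prints [file:/C:/docs/a.xls ex:title Budget 2007]
    }
}
```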
> I know that is a lot of questions, hopefully someone has the time and
> some answers.
> If I am going to be using this I would be happy to help in any way once
> I get up to speed and get going.
That would be great! We're looking forward to hearing your feedback.