1.  I can send you any of the xls files if you tell me an address.  I have a lot of xls files that did not generate any full text.
 
2.  Yes, you understood my XML extraction and rendering question correctly, and I figured the original document was going to be needed for viewing rather than using only the XML, XSLT and CSS.  You validated what I was thinking.  I am mostly interested in extracting all the text from lots of different data sources.
 
3.  I am not sure yet which format is best for our output.  If I chose a format and it turned out to be the wrong one, I did not know what the impact would be down the road (the File Inspector was able to show each format at the click of a button).  We currently use plain text.  I need to read up on the RDF formats.  Right now all I was really interested in was plain text output to serve as an input for us (to index/query).
 
4.  I think I am just going to need to dive in with the examples and write my own little test app.  I will know more as soon as I find time to write and test it.  Basically all I am going to do is (see the sketch after the list):
a.  poll a file directory with command line app
b.  extract all contents to plain text
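 
Something along these lines is what I have in mind.  The polling loop below is plain Java; extractToPlainText is just a made-up placeholder for whatever Aperture call I end up using, so treat it as a sketch rather than working extraction code:

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Skeleton of the test app described above: poll a directory, extract each
// new file to plain text, append everything to one output file.
public class PollAndExtract {

    public static void main(String[] args) throws IOException, InterruptedException {
        File inputDir = new File(args[0]);  // directory to poll
        File output = new File(args[1]);    // single plain-text output file
        Set<String> seen = new HashSet<String>();

        while (true) {
            File[] files = inputDir.listFiles();
            if (files != null) {
                for (File f : files) {
                    // only handle files we have not seen in an earlier poll
                    if (f.isFile() && seen.add(f.getAbsolutePath())) {
                        FileWriter writer = new FileWriter(output, true); // append
                        writer.write(extractToPlainText(f));
                        writer.write("\n");
                        writer.close();
                    }
                }
            }
            Thread.sleep(10000); // wait 10 seconds before polling again
        }
    }

    // Placeholder (made up): the real version would call into Aperture's
    // extractors and return the document's full text.
    private static String extractToPlainText(File f) {
        return "[full text of " + f.getName() + " goes here]";
    }
}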
 
Eventually I would like to use wget and some of its options to crawl/poll (--proxy-user, --proxy-password), to ftp (or really sftp), and to have the option for output to go to a single file (as it does now) or to one output file per source file.
 
I would be interested in getting/using the alpha-4 since I am just starting out.
 
Thanks for the help so far
 
 
From: Christiaan Fluit <christiaan.fluit@ad...> - 2007-01-22 05:32
Kevin C. Bombardier wrote:
> 1. When I use the Aperture File System Crawler GUI
> (aperture-2006.1-alpha-3.zip) on a directory containing xls files, the TriX
> output file does not seem to have all the data from each of the rows
> (the fullText xml tag). It has some text, but not the data from the rows
> and columns. I have checked "Determine MIME type" and "Extract document
> text and metadata".
>
> It prints the following message(s) in the startup window.
>
> INFO: regular POI-based processing failed, falling back to heuristic
> string extraction for file:/C:/eclipse/workspace/test/28PyrmdNE2.xls
> Jan 19, 2007 12:03:05 PM
> org.semanticdesktop.aperture.extractor.util.PoiUtil extractAll

This message occurs when Apache POI, the library that we use for
extracting information from MS Office files, fails to process the
document contents.

Fortunately the text in Excel documents is (always?) embedded as regular
human-readable text, so we use a heuristic processor as a fall-back
mechanism that tries to detect all parts of the binary stream that look
like regular text. For example, it identifies bytes that represent
alphanumeric characters and that seem to form words, it suppresses
common font names that are also often encoded as regular text, etc. This
process is highly heuristic and does not work in 100% of the cases.
Still, it's the best we can do, and often preferable to no output at all.
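
To give an idea of the principle, here is a simplified sketch. It is not the actual PoiUtil code (the real implementation also does things like the font-name suppression mentioned above), just an illustration of keeping runs of printable bytes:

import java.io.IOException;
import java.io.InputStream;

// Simplified illustration of heuristic string extraction: keep runs of
// printable characters that are long enough to plausibly be words or
// sentences, drop everything else.
public class HeuristicTextSketch {

    public static String extract(InputStream in, int minRunLength) throws IOException {
        StringBuilder result = new StringBuilder();
        StringBuilder run = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if (b >= 0x20 && b < 0x7f) {
                // printable ASCII: extend the current run
                run.append((char) b);
            } else {
                // a non-printable byte ends the run; keep it only if it is
                // long enough to look like text rather than binary noise
                if (run.length() >= minRunLength) {
                    result.append(run).append('\n');
                }
                run.setLength(0);
            }
        }
        if (run.length() >= minRunLength) {
            result.append(run).append('\n');
        }
        return result.toString();
    }
}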

If possible, feel free to send me an example document for which this
extraction is missing vital parts and I can see if there's anything we
can do about it.

> 2. Once the data has been extracted to XML can the XML document be
> rendered/viewed the same as the original it was extracted from? I would
> like to keep the single XML extract around and not the original. I will
> eventually parse out the XML to individual files but that is later on.
> I can keep both the original and the XML extract, but if I do not need to
> then I will not. I was told that I should be able to view the images
> and formulas (or anything else from the original document) if I have the
> correct XML plugins/libs in the extracted XML file (MathML, ...)

I am not 100% sure I understand your question. Do you want to present
the information by rendering the resulting XML using stylesheets,
instead of displaying the original document? Perhaps this is doable, but
I wouldn't be surprised if this turned out to be rather suboptimal. The
extractors in Aperture are developed with full-text *indexing* in mind,
so issues like text layout are of minor importance, and things like
images and formulas aren't extracted at all at the moment. For human
consumption, the original document may be much more readable, but you
should test this on the types of documents that you use.

Also note that Aperture does not really extract to XML; it extracts to
RDF, which has a different data model (in short: a type of labeled
directed graph rather than a labeled tree) that can be serialized in a
variety of ways, for example using one of the two XML-based formats
(rdfxml and TriX). Depending on your processing needs and capabilities,
one of the other formats may be easier to handle. For example, N-Triples
is well suited for processing with regular expressions.
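
For instance, a fullText statement ends up as a single line in an N-Triples file, so something like the following can fish the full-text literals out with one regular expression. The pattern just looks for a predicate whose local name ends in "fullText"; adjust it to match the predicate URI that actually appears in your output:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: grep the object literals of "fullText"-like statements out of an
// N-Triples file. N-Triples puts one complete statement per line, which is
// what makes this kind of regex processing easy.
public class NTriplesFullTextGrep {

    // subject, predicate ending in "fullText", quoted literal, final dot
    private static final Pattern FULL_TEXT =
        Pattern.compile("^<[^>]*> <[^>]*fullText> \"(.*)\" \\.$");

    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = reader.readLine()) != null) {
            Matcher m = FULL_TEXT.matcher(line);
            if (m.matches()) {
                // note: the literal is still N-Triples-escaped (\n, \", \uXXXX)
                System.out.println(m.group(1));
            }
        }
        reader.close();
    }
}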

Finally, the file crawler UI is only meant as an example application; we
encourage people to program against the Aperture APIs instead. For
example, when you implement your own CrawlerHandler, you will receive
your metadata as a sequence of DataObjects, each representing a
file/webpage/email/... Here you can take care of e.g. filtering,
reorganizing and storing metadata.

> 3. The File Inspector GUI allows you to change the type of metadata
> output. If I pick a format at the beginning, can I change it down the
> road (i.e. I decide on TriX now but want to change to Turtle or another
> serialization format down the road)? How hard would it be to convert from
> one to the other?

"down the road"? The File Inspector lets you change it at any time, the
extraction results are immediately updated. Or do you mean something else?
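
As for how hard a later conversion would be: not very, since any RDF toolkit can parse one serialization and write another. With Sesame's Rio parsers and writers it is roughly the following (class and method names from memory, so check them against the Sesame version that ships with Aperture):

import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.openrdf.rio.RDFFormat;
import org.openrdf.rio.RDFParser;
import org.openrdf.rio.RDFWriter;
import org.openrdf.rio.Rio;

// Rough sketch: convert a TriX file to Turtle by streaming the parsed
// statements straight into a writer for the other format.
public class TrixToTurtle {

    public static void main(String[] args) throws Exception {
        FileInputStream in = new FileInputStream(args[0]);    // TriX input
        FileOutputStream out = new FileOutputStream(args[1]); // Turtle output

        RDFParser parser = Rio.createParser(RDFFormat.TRIX);
        RDFWriter writer = Rio.createWriter(RDFFormat.TURTLE, out);

        // RDFWriter is an RDFHandler, so the parser can feed it directly
        parser.setRDFHandler(writer);
        parser.parse(in, ""); // the second argument is the base URI

        in.close();
        out.close();
    }
}

One thing to keep in mind: TriX can record named-graph/context information that plain Turtle cannot express, so that part is lost in such a conversion.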

> 4. I tried to crawl my test directory that had 1.2GB of data in it,
> 4400 files (mixture of xls, pdf, doc, odt, xml, ps, ppt, rtf) and while
> it started out well, it ran out of Java memory at file 78. I increased
> the JVM to init with 512MB and grow to 1GB. It looks like it made it to
> the last file ("Crawling completed, saving results...") but I get a stack
> trace in the startup window with an OutOfMemoryError. The TriX output
> file is 86.2 MB. It does not look like it finished correctly? It did
> only take 15 minutes to get the completed messages though.
>
> Any information on how much it can handle (# of docs on one crawl,
> types, sizes, memory, ...)? Pretty much any performance-related information.

I'm not surprised to see this happen. Again, the file crawler UI is
meant as a coding example and has been kept as simple as possible.
Because of this and some historic reasons (the state of Sesame 2 at the
time this code was written), it uses a data structure that holds all
extracted information (full-text and metadata) in RAM and only writes it
to disk at the end of the entire crawl. Clearly, this doesn't scale even
remotely.

Since then, Sesame 2 has progressed a lot and it now contains a stable
disk-based RDF store. We can update the example code to use this native
store to improve scalability. However, I would recommend you look into
the CrawlerHandler API (see the tutorials on aperture.sourceforge.net),
for example because of your next question. As Aperture focuses on
providing middleware components that handle crawling and extraction
tasks, I'm hesitant to make the examples too complex.
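
For reference, opening the disk-based store is roughly the following (package and class names as I recall them for Sesame 2, so double-check them against your version):

import java.io.File;

import org.openrdf.repository.Repository;
import org.openrdf.repository.sail.SailRepository;
import org.openrdf.sail.nativerdf.NativeStore;

// Rough sketch: open Sesame 2's disk-based "native store" so that crawl
// results can be added incrementally instead of being buffered in RAM.
public class OpenNativeStore {

    public static void main(String[] args) throws Exception {
        File dataDir = new File("crawl-data"); // any directory for the store files
        Repository repo = new SailRepository(new NativeStore(dataDir));
        repo.initialize();

        // a custom CrawlerHandler would obtain a RepositoryConnection from
        // "repo" and add each crawled object's statements as they arrive

        repo.shutDown();
    }
}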

> 5. Has anyone looked into multi-threading this?

To the best of my knowledge: no, not yet. This is again something you
would have to develop yourself by implementing your own CrawlerHandler.

> 6. Is there a way/file to configure the metadata that is pulled out?

Not at the moment. The proper way to handle this, IMO, is to have a
CrawlerHandler that suppresses certain metadata from entering a
persistent store. Performance-wise there is little to gain from
configuring the Extractors to output only a subset of the information;
it's often an all-or-nothing matter.

> I know that is a lot of questions, hopefully someone has the time and
> some answers.
>
> If I am going to be using this I would be happy to help in any way once
> I get up to speed and get going.

That would be great! We're looking forward to hearing your feedback.


Regards,

Chris
 
 

