It was Antoni Myłka who said at the right time 11.12.2007 18:04 the following words:
First of all great mail.

Leo Sauermann pisze:
It was Antoni Myłka who said at the right time 08.12.2007 00:55 the 
following words:
Leo Sauermann pisze:

The FileDataObjects have in their contract that the getContent() method 
must return a mark-able inputstream.
javadoc: "The returned InputStream is required to support marking 
(markSupported() must return true)."

Hence the code in CrawlerHandlerBase at the moment is BROKEN and WRONG:
This line is superflous and not needed and breaks our ideas (line 100):
BufferedInputStream buffer = new 
BufferedInputStream(object.getContent(), bufferSize);

because if we DON'T create another buffer around the already buffered 
stream, it all works:
we call getContent() in CrawlerHandlerBase and rely on the contract of 
FileDataObject of having Mark-enabled streams.
We call this as done already for mime-type detection:

object.getContent().mark(minimumArrayLength + 10); // add some for safety
// do mimetype stuff
// now reset BOTH the underlying content in the dataobject AND the 

I don't quite get this bit. Why does anyone need to reset both if they 
are the same object.
that was wrong. What I wanted to say is that line 100 is BAD because it creates two stacked buffered streams, but the inner stream is not reset() when the outer one is reset(), the code you have written below is the one that solves the problem.
InputStream stream = object.getContent();
stream.mark(minimumArrayLength + 10);
String mimeType = getMimeType(stream);
this is how it should be .
stream used for mimetype detection - after this all is fine again:

// now this works:

Internally, the object().getFile() method has a problem:
mark() only supports marking up to a limit. For some resources, the 
limit can be maxint :-/
The problem here is, that the FileDataObject would have to call 
mark(maxint), save to file, then reset(), to guarantee that both methods 
getFile() and getContent() work after calling getFile() once.
To circumvent this, it may be necessary to save the whole stream to the 
file, when once the crawlerhandler callsed getFile(),
and on subsequent calls to getContent(), the FileDataObject may return a 
FileOutputStream based on the now-available temp-file (or on the 
already-existing filesystem file for filecrawlers)

* remove line 100 from crawlerhandlerbase - it breaks our planned 
* and document at FileDataObject:
"the getContent() method is ideally only called once for extracting the 
content of the stream using an Extractor. For mimetype detection or 
other reasons, crawler handlers or other users may <i>peek</i> into the 
content by calling getContent().mark(buffersize) and 
getContent().reset(). If you work on the content stream before the 
actual extraction, you must always reset() it.
Calls to getContent() will <b>always return the same instance</b>, no 
copies of the stream object are made."

then I think the problem is solved as we planned it to be solved in our 
initial architecture.

Yes it is. But then we have a convenience method in the FileDataObject 
class that works only if the stream has been properly reset, i.e. the 
internal correctness of the data object state does not depend on the 
DataObject class itself. If the user adheres to the instructions in this 
javadoc nothing should happen, but if he/she doesn't, we can't throw any 
exception, nor signal the abnormal situation. More below...
yes, thats a pity, but its ok as it is in the name of performance.

more by me below....

... more below ....

also, please delete FileExtractorRegsitry and move its functionality to 
ExtractorRegistry, we agreed not to do this but instead add methods to 
ExtractorRegistry (keep it simple).
I've chosen this approach because somehow I found it cleaner than
public void add(ExtractorFactory factory);
public void remove(ExtractorFactory factory);
public Set get(String mimeType);
public Set getAll();

public void add(FileExtractorFactory factory);
public void remove(FileExtractorFactory factory);
public Set getFileExtractors(String mimeType);
public Set getAllFileExtractors();

Two sets of four methods. Each subtly different. There is no common 
for the file extractors and the ordinary extractors. The get method 
cannot return a collection of both (the sets are unparametrized). The 
get() method cannot be renamed because we don't want to break existing 
code. So get() vs. getFileExtractors() are named somehow unsymetrically. 
  This is API aesthetics that is confusing IMHO. Are you sure?
Yes I am "pretty" sure. Aperture is already Frameworky enough,
keep it simple.

 I would also suggest to move all contents from the package 
org.semanticdesktop.aperture.fileextractor to
org.semanticdesktop.aperture.extractor, this is also superflous.
It just makes everything slightly more complicated.

if you care for synchronisation of the methods, deprecate the get() 
methods (leaving them working)
and add getExtractor() and getFileExtractor() methods to the 

but keep the overall amount of classes to a manageable minimum :-)))

also, its much better for us having ONE config file listing the 
Extractors & FileExtractors


thanks for agreeing, don't want to force you to anything, but I think its a good optmiziation.

I added this comment to the ticket:
regarding the other comment: Do it as planned and discussed.

it is possible assuming this: (X) marks changes, I replaced "user" with 
1. The FileDataObject is created with an InputStream
2. The crawlerhandler obtains the input stream with the getContent method
3. ... and passes it to a MimeType identifier
4. the MimeTypeIdentifier determines a mime type
4.1: (X) The crawlerhandler MUST call mark/reset before applying a 
MimeType identifier (or our rewindable buffering stream) and must call 
4.2 The crawlerhandler calls ExtractorRegistry.getExtractor, no anwer. 
Then ExtractorRegistry.GetFileExtractor, a FileExtractor is available
5. the crawlerhandler calls FileExtractor.extract(), which in itself 
calls getFile()
6. the getFile method takes the stream and tries to create a temp file

use a File.createTmpFile() for the stored and buffered file.
mark the files to be deleted on jvm exit (File.markDelete or something 
like this its called), as safeguard when somebody forgets to call 
Yes, but as I wrote. Imagine a bug that will appear if the user does NOT 
call reset() in 4.1. The getFile will create a wrong temporary file 
(silently) and the FileExtractor will fail with some kind of a parse 
error. Everyone will think that either the file is corrupted or that 
Aperture is buggy (both statements are true BTW :) ). Hardly anyone will 
think of an error in their own app.
the getContent() method has to be better documented, as said above.
The described bugs can appear, but as this is a crucial part of aperture 
in terms of performance, we must not change it.
We can issue a log-warning when the user calls FileDataObject.getContent():

if (content instanceof BufferedInputStream && 
(((BufferedInputStream)content).pos > 0))
 log.warning("possible programming error before "...printstacktrace..": 
maybe you forgot to call reset() on the content buffer. When buffering 
the stream, use a local variable and call mark() and reset() on the 
local variable to remove this warning, this warning is only printed on 

This may be a solution, but I'd rather rewrite it so that this check is 
performed on getContent() and getFile(). It's actually more important on 
getFile(). We'll need to workaround the fact that the pos field is 
protected, not public. I would suggest a following solution with two 
internal classes

// a buffered wrapper that exposes the getPos() method
private class ApertureBIS extends BufferedInputStream {
     public ABIS(InputStream stream) {
     int getPos() { return pos; }

//this is to reuse the existing buffered input streams
//so that the pos is exposed but we don't create additional
//DBIS - DelegationBufferedInputStream
private class ApertureDBIS extends BufferedInputStream {
      private BufferedInputStream internalStream;
      public ApertureDBIS(BufferedInputStream stream) {
          this.internalStream = stream;
      // the internalStream.pos is available because we inherit
      // from BufferedInputStream
      int getPos() {return internalStream.pos}
      // all other methods are delegated to the internalStream

This allows us to throw exceptions from getFile and getContent if they 
haven't been reset.

I'll try to implement something like this and see what happens.
ok, that looks weird to the eye, but is logic in the code and implementation.
I would guess you have to change a few things in the extractors also, this is a pity, but both classes do a good work.

looking forward browsing my crawled MP3s again ( :-)


Antoni Mylka

SF.Net email is sponsored by:
Check out the new Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
Aperture-devel mailing list


DI Leo Sauermann 

Deutsches Forschungszentrum fuer 
Kuenstliche Intelligenz DFKI GmbH
Trippstadter Strasse 122
P.O. Box 2080           Fon:   +49 631 20575-116
D-67663 Kaiserslautern  Fax:   +49 631 20575-102
Germany                 Mail:

Prof.Dr.Dr.h.c.mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313