Thread: RE: [Babeldoc-devel] Avoiding incomplete reads in DirectoryScanner

Brought to you by: dejank, santibegue, triphop, vharcq

babeldoc-devel

RE: [Babeldoc-devel] Avoiding incomplete reads in DirectoryScanner

From: McDonald, B. <Bru...@ba...> - 2003-09-23 15:56:31

This issue is really problematic generally!   We did build in a .done file solution into the directory scanner which looks for the existance of the .done flag file.  This file indicates that the transfer has completed.  Why dont you look into this?

-----Original Message-----
From: David Kinnvall [mailto:dav...@al...]
Sent: Tuesday, September 23, 2003 10:52 AM
To: bab...@li...
Subject: [Babeldoc-devel] Avoiding incomplete reads in DirectoryScanner

Hi people!

I am getting bitten by the fact that a slow writer using FTP to
upload files to our server for processing using Babeldoc via the
DirectoryScanner sometimes, due to timing obviously, causes the read
file to be incomplete, thereby causing parsing/transformation errors
and so forth.  (Long sentence, yeah I know... ;-p )

To fix this, I have looked at various alternatives, including potential
ftp daemon settings (nope), kernel module (not very nice), the fam thing
(from SGI, old Perl module, didn't work) to avoid the read from taking
place until the file was written completely.

After failing enough at the above trials...I turned to our beloved
Babeldoc instead, for a dirty(?) little hack:

I extended the DirectoryScanner to take another config parameter
'minLastModified', default 0, specififying the least number of
milliseconds that must have passed since the file's last modification
timestamp, as reported by the File instance.

I am working under the assumption that the modification timestamp is
indeed correctly updated during the ftp upload, and it seems to be the
case, at least on my Red Hat 7.2 installation where I am testing this.

Using <scanner_name>.minLastModified = 5000 I have yet to see any failed
reads so far. But then again, I have only tested it for about an hour or
so... :-)  I'll know better tomorrow what the actual effect will be.

Diff (unified) against DirectoryScanner.java (per current CVS) attached.

Note: I haven't made any changes to the I18N stuff, and I do use a
logInfo() here and there, for debugging, so please treat the patch as
a proof-of-concept needing some (hopefully little) cleanup.

Any hope of getting something like this into Babeldoc proper?

(Oh, yeah, I haven't had the time to properly follow up on the last
  mail I sent about the SqlPreparedWriter and FileTransfer stages.
  Sorry about that. Still pondering if it makes sense to merge the
  Sql thing into the standard SqlWriter, since my prepared version
  has no support for multiple statements nor comments in it...I'll
  get back to the list on that in the (hopefully) near future.)

Regards,

David Kinnvall

Re: [Babeldoc-devel] Avoiding incomplete reads in DirectoryScanner

From: David K. <dav...@al...> - 2003-09-23 16:37:55

Attachments: DirectoryScanner.diff

Hi, Bruce,

McDonald, Bruce wrote:
> This issue is really problematic generally! We did build in a .done file solution into the directory scanner which looks for the existance of the .done flag file. This file indicates that the transfer has completed. Why dont you look into this?

Hm...there is no .done file solution in the DirectoryScanner.

There is one in the FileWriterPipelineStage, though, but that
is of no real help in the (our) DirectoryScanner case, since:

- Someone else is doing the writing, and can be slow at
it, to make things worse...perhaps even via modem. :-)
- I can't make the writer (a different organization)
change their system to provide an additional done file
to indicate completion in this case, unfortunately.
- The scanner submits the read document, in whatever state
it is, to the pipeline queue, for processing, and then
lets go of it, hence I cannot go back to the scanner if
I detect in the PipelineStage that something is wrong,
at least not in any clean way that I can see.

Have I missed anything?

The configurable delay approach in the DirectoryScanner seems
to work so far, at least for me, and I can't really see any
obviously cleaner solution when there is no possibility of
modifying the sender's behavior.

A bit more research indicates that there is really no clean
way (even at the OS level) in Linux to detect when a file has
been completely written and closed. So it seems to me at this
time that the delay might be a viable trick.

I have modified the patch every so slightly so that it checks
the modification timestamp after each delay, to cover the case
with a really slow writer appending a number of bytes each and
every second, for some time. Manually tested using "touch". :-)

Adjusted patch attached, FWIW.

/David

Re: [Babeldoc-devel] Avoiding incomplete reads in DirectoryScanner

From: Dejan K. <dej...@nb...> - 2003-09-24 09:00:59

Hi David,
I like your idea. In fact I had problems like you but we have solved it by
ignoring corrupt files ;). Since we have used XML files, we would know if
file is incomplete. These files would be ignored in pipeline processing. But
that is not good solution and I think your solution is better and more
general. Altough it is not perfect, I believe it could be usefull in most
situations.

However, I think this option should not be mandatory and configuration shoud
work without specifying it. (false should be specified in
DirectoryScannerInfo.getTypeSpecificOptions() for this option. Some other
options should not be mandatory so I have changed this).

Also, I don't like the fact that worker will sleep as long as the file is
being modified ignoring other files that may arrive in the meantime. What do
you think about ignoring fresh files and continuing processing other files.
So new files may not be process in first or second doScan() method call.
Also, what's happening if upload is aborted? Will file remain on file
system? What if user decide to delete file while worker is sleeping?  Have
you tested with these (maybe rare but still real) situations?

Dejan


---- Original Message -----
From: "David Kinnvall" <dav...@al...>
To: "McDonald, Bruce" <Bru...@ba...>;
<bab...@li...>
Sent: Tuesday, September 23, 2003 6:37 PM
Subject: Re: [Babeldoc-devel] Avoiding incomplete reads in DirectoryScanner


> Hi, Bruce,
>
> McDonald, Bruce wrote:
> > This issue is really problematic generally!   We did build in a .done
file solution into the directory scanner which looks for the existance of
the .done flag file.  This file indicates that the transfer has completed.
Why dont you look into this?
>
> Hm...there is no .done file solution in the DirectoryScanner.
>
> There is one in the FileWriterPipelineStage, though, but that
> is of no real help in the (our) DirectoryScanner case, since:
>
> - Someone else is doing the writing, and can be slow at
>    it, to make things worse...perhaps even via modem. :-)
> - I can't make the writer (a different organization)
>    change their system to provide an additional done file
>    to indicate completion in this case, unfortunately.
> - The scanner submits the read document, in whatever state
>    it is, to the pipeline queue, for processing, and then
>    lets go of it, hence I cannot go back to the scanner if
>    I detect in the PipelineStage that something is wrong,
>    at least not in any clean way that I can see.
>
> Have I missed anything?
>
> The configurable delay approach in the DirectoryScanner seems
> to work so far, at least for me, and I can't really see any
> obviously cleaner solution when there is no possibility of
> modifying the sender's behavior.
>
> A bit more research indicates that there is really no clean
> way (even at the OS level) in Linux to detect when a file has
> been completely written and closed. So it seems to me at this
> time that the delay might be a viable trick.
>
> I have modified the patch every so slightly so that it checks
> the modification timestamp after each delay, to cover the case
> with a really slow writer appending a number of bytes each and
> every second, for some time. Manually tested using "touch". :-)
>
> Adjusted patch attached, FWIW.
>
> /David
>


----------------------------------------------------------------------------
----


> Index: DirectoryScanner.java
> ===================================================================
> RCS file:
/cvsroot/babeldoc/babeldoc/modules/scanner/src/com/babeldoc/scanner/worker/D
irectoryScanner.java,v
> retrieving revision 1.20
> diff -u -r1.20 DirectoryScanner.java
> --- DirectoryScanner.java 12 Sep 2003 01:09:16 -0000 1.20
> +++ DirectoryScanner.java 23 Sep 2003 16:30:44 -0000
> @@ -98,6 +98,7 @@
>   public static final String BUFFER_LEN = "bufferLen";
>   public static final String INCLUDE_SUB_DIRS = "includeSubfolders";
>   public static final String FILTER_FILENAME = "filter";
> +    public static final String MIN_LAST_MODIFIED = "minLastModified";
>
>   public DirectoryScanner() {
>   super(new DirectoryScannerInfo());
> @@ -112,6 +113,12 @@
>   /** flag to include sub directories */
>   private boolean includeSubDirs = false;
>
> +    /** Minimum time in ms since file was last modified.
> +     *  Attempts to guard against incomplete reads when
> +     *  the writer of the file is "slow".
> +     */
> +    private int minLastModified = 0;
> +
>   /**
>   * This method will scan for new documents. It will queue documents by
>   * itself, so it will return null no matter how many documents found!
> @@ -165,6 +172,12 @@
>   // Dont catch or do anything here - this means that the default is
accepted.
>   }
>
> +
setMinLastModified(this.getInfo().getIntValue(MIN_LAST_MODIFIED));
> +
> +        LogService.getInstance().logInfo(
> +            "Minimum time since last modification of files will be " +
getMinLastModified() + " ms"
> +        );
> +
>   //Add filename filter if exist
>   addFilter(FILTER_FILENAME);
>   }
> @@ -273,6 +286,21 @@
>   //getting message from file
>   byte[] data = new byte[1024];
>   long modified = new Date(file.lastModified()).getTime();
> +
> +            // avoid (if configured) incomplete reads due to slow
writer(s)
> +            long now = System.currentTimeMillis();
> +            while(getMinLastModified() > (now - modified)) {
> +                try {
> +                    long interval = getMinLastModified() - (now -
modified);
> +                    LogService.getInstance().logInfo("Sleeping " +
interval + " ms, since file was too fresh...");
> +                    Thread.currentThread().sleep(interval);
> +                } catch (java.lang.InterruptedException ie) {
> +                    // Ignore.
> +                }
> +                now = System.currentTimeMillis();
> +                modified = new Date(file.lastModified()).getTime();
> +            }
> +
>   fis = new FileInputStream(file);
>   baos = new ByteArrayOutputStream((int) file.length());
>
> @@ -319,6 +347,15 @@
>   public void setIncludeSubDirs(boolean includeSubDirs) {
>   this.includeSubDirs = includeSubDirs;
>   }
> +
> +    public int getMinLastModified() {
> +        return minLastModified;
> +    }
> +
> +    public void setMinLastModified(int minLastModified) {
> +        if(minLastModified < 0) minLastModified = 0;
> +        this.minLastModified = minLastModified;
> +    }
>  }
>
>  /**
> @@ -382,6 +419,14 @@
>   null,
>   true,
>   I18n.get("scanner.DirectoryScannerInfo.option.filter")));
> +
> +        options.add(
> +            new ConfigOption(
> +                DirectoryScanner.MIN_LAST_MODIFIED,
> +                IConfigOptionType.INTEGER,
> +                null,
> +                true,
> +                "Minimum time in ms since file was last modified
(attempts to guard against incomplete reads)"));
>
>   return options;
>   }
>

Re: [Babeldoc-devel] Avoiding incomplete reads in DirectoryScanner

From: David K. <dav...@al...> - 2003-09-24 09:46:47

Attachments: log.txt

Dejan Krsmanovic wrote:

> Hi David,

Hi Dejan,

> I like your idea. In fact I had problems like you but we have solved it by
> ignoring corrupt files ;). Since we have used XML files, we would know if
> file is incomplete. These files would be ignored in pipeline processing. But
> that is not good solution and I think your solution is better and more
> general. Altough it is not perfect, I believe it could be usefull in most
> situations.

Yes, it definitely helps in our situation. But as you say, it is most
certainly not perfect.

> However, I think this option should not be mandatory and configuration shoud
> work without specifying it. (false should be specified in
> DirectoryScannerInfo.getTypeSpecificOptions() for this option. Some other
> options should not be mandatory so I have changed this).

Whoops. That wasn't intentional. Of course it should be optional.

> Also, I don't like the fact that worker will sleep as long as the file is
> being modified ignoring other files that may arrive in the meantime. What do
> you think about ignoring fresh files and continuing processing other files.

Yeah, that sounds like a better idea.

> So new files may not be process in first or second doScan() method call.

Yup. I will take a second look at the code and see if I can come
up with something passable for this.

> Also, what's happening if upload is aborted? Will file remain on file
> system? What if user decide to delete file while worker is sleeping?  Have
> you tested with these (maybe rare but still real) situations?

Hm...in our specific case:

- Aborted upload -> file remains, uploader tries again later.
-> Will lead to failed parsing attempt in our case... :-/

- Delete file during sleep
-> In my current patch this will (probably) fail.
-> When your suggested change is in there, I guess that the
    scanner won't see the file any longer, and it will therefore
    not be processed. Which is better, I think. (At least logical.)

- I have, as you can see, not tested with these situations yet.

I'll try to recode my change according to your idea, and run a
few tests, including the specific cases above, and see how it
goes. If it looks ok I'll get back to the list with an updated
patch.


I have another question, regarding the multithreading support:

- When trying to use the threadpool pipeline processor I get
   into massive problems down the pipeline, where stages fail
   spectacularly that when using the sync pipeline works fine.

- Is there any known failure/problem modes when using the
   threadpool pipeline processor?

Extract from log-entries in my console attached. Background:
- Two DirectoryScanner's feeding one pipeline each.
- The two pipelines use the threadpool processor with a thread
   count of 2, each.
- The two pipelines call a third pipeline for final common
   processing as their last stage. The third pipeline also uses
   the threadpool processor, with a thread count of 4.
- The failure as shown in the attached log fragment occurs in
   one of the first pipelines, in the third stage "extract_fid",
   which is an XpathExtract stage.

Any idea(s) about cause/solution/further research in this case?

> Dejan

David

Re: [Babeldoc-devel] Avoiding incomplete reads in DirectoryScanner

From: David K. <dav...@al...> - 2003-09-24 13:40:55

Attachments: scanner_module.patch

Dejan Krsmanovic wrote:

> Hi David,

Hi again, Dejan, and list,

[snipped Dejan's suggestions regarding the DirectoryScanner]

I have read source, coded and tested a bit, and come up with a
new version of the patch to the DirectoryScanner. This one:

- Integrates the file-age criteria into the filter mechanism
   already present in the DirectoryScanner

- Creates a new private method boolean acceptFile(File file)
   that gets called instead of the original acceptEntry, but
   calls acceptEntry itself, in addition to the age check,
   allowing for easily extending acceptFile to do more things

- Changes the name from minLastModified to minimumFileAge, both
   in the code as well as for the configuration parameter

- Makes the parameter optional with default value of 0 (zero),
   i.e it behaves exactly as before and is backwards compatible

- Externalizes the description of the configuration parameter
   into the messages.properties for the scanner module

- Does not sleep any more, since it merely ignores the file if
   it is too young, until next scanner phase, when it gets a new
   chance to be considered for processing again

- Gracefully handles the removal of files between scannings,
   since it has no memory of what files were there last time

- Totally fails to handle the case where the sender deliberately
   (or not) uploads an invalid file... ;-)

Patch, against both messages.properties and DirectoryScanner.java
is attached. Please review and comment.

Regards,

David

Re: [Babeldoc-devel] Avoiding incomplete reads in DirectoryScanner

From: Dejan K. <dej...@nb...> - 2003-09-24 14:36:47

Hi David,
I have applied your patch (I hope it works correctly!).

Regarding to your question about using threadpool pipeline processor I have
to say that this is still new ferature in development phase so there could
be such errors. Anyway, stack trace you sent looks like error in particular
stage (extract_fid) so it maybe some configuration issue...
I guess Bruce could help you more on this.
Anyway, if you find some Babeldoc's bug please send us a patch. Or we could
grant you CVS access permissions if you are interested!

Dejan

Re: [Babeldoc-devel] Avoiding incomplete reads in DirectoryScanner

From: David K. <dav...@al...> - 2003-09-24 14:54:59

Dejan Krsmanovic wrote:

> Hi David,

Hi Dejan!

> I have applied your patch (I hope it works correctly!).

Thanks! It's chugging along happily here, albeit not with very
heavy traffic, but nonetheless. I'll shout loudly if I discover
any problems related to these changes, of course, but they are
rather localized, so hopefully...

> Regarding to your question about using threadpool pipeline processor I have
> to say that this is still new ferature in development phase so there could
> be such errors. Anyway, stack trace you sent looks like error in particular
> stage (extract_fid) so it maybe some configuration issue...
> I guess Bruce could help you more on this.

Well, the stage works fine when using the sync processor, and
it is very simple indeed...the config looks like this:

# XPath Extract - save the original file_id attribute
extract_fid.stageType=XpathExtract
extract_fid.nextStage=file_writer
extract_fid.XPath.orig_file_id=/newsitem/@file_id

I have only looked briefly at the source of the threadpool
processor so far, so I don't really have any informed idea
about what could be going wrong here. I will look further at
both my code, the pipeline configuration and the threadpool
source to see if I can find any clues...

> Anyway, if you find some Babeldoc's bug please send us a patch. Or we could
> grant you CVS access permissions if you are interested!

I will happily send patches for suitable scrutiny. I don't think
I dare meddle directly with the CVS yet, until I feel a bit more
comfortable with the overall codebase of Babeldoc. I hate making
changes that cause unexpected regressions elsewhere in the code,
and even moreso when it's someone else's. :-) Maybe a bit later?

Anyway, thanks much for applying the change! Certainly makes it
easier to keep synchronized with the Babeldoc core in CVS. I'll
try to clean up the code for my (two, so far) pipeline stages
as well, and submit them in a couple of weeks, for consideration.

> Dejan

David