From: Dejan K. <dej...@nb...> - 2003-09-24 09:00:59
Hi David,
I like your idea. In fact, I had the same problem as you, but we solved it by
ignoring corrupt files ;). Since we used XML files, we could tell when a
file was incomplete, and such files were skipped in pipeline processing. But
that is not a good solution, and I think yours is better and more
general. Although it is not perfect, I believe it could be useful in most
situations.
However, I think this option should not be mandatory, and the configuration
should work without specifying it (false should be specified for this option
in DirectoryScannerInfo.getTypeSpecificOptions(). Some other
options should not be mandatory either, so I have changed this).
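
To illustrate what "not mandatory" could mean in practice, here is a minimal sketch that reads an optional integer setting and falls back to a default when it is missing or malformed. It uses java.util.Properties as a stand-in for the scanner configuration; the class OptionalIntOption and the method readNonNegativeInt are hypothetical names, not part of Babeldoc.

```java
import java.util.Properties;

/** Hypothetical helper, not part of Babeldoc. */
public class OptionalIntOption {

    /**
     * Reads an optional integer option. Returns defaultValue when the key
     * is absent, blank, or not a number; negative values are clamped to 0,
     * mirroring what setMinLastModified() does in the patch.
     */
    public static int readNonNegativeInt(Properties config, String key, int defaultValue) {
        String raw = config.getProperty(key);
        if (raw == null || raw.trim().isEmpty()) {
            return defaultValue; // option not specified: keep the default
        }
        try {
            return Math.max(0, Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return defaultValue; // malformed value: keep the default
        }
    }
}
```

With a default of 0, a configuration that never mentions minLastModified would behave exactly as the scanner did before the patch.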
Also, I don't like the fact that the worker will sleep for as long as the file
is being modified, ignoring other files that may arrive in the meantime. What do
you think about skipping fresh files and continuing to process the other files?
So new files may not be processed in the first or second doScan() method call.
Also, what happens if an upload is aborted? Will the file remain on the file
system? What if a user decides to delete the file while the worker is sleeping?
Have you tested these (maybe rare but still real) situations?
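
A rough sketch of the non-blocking alternative, under the assumption that each scan simply skips files that are still too fresh and leaves them for a later scan. The class FreshFileFilter and the method selectStableFiles are hypothetical names, not existing Babeldoc code; only java.io.File calls are used.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Babeldoc. */
public class FreshFileFilter {

    /**
     * Returns the regular files in dir that were last modified at least
     * minAgeMs milliseconds before nowMs. Fresh files are skipped rather
     * than waited for, so a later scan can pick them up.
     */
    public static List<File> selectStableFiles(File dir, long minAgeMs, long nowMs) {
        List<File> ready = new ArrayList<>();
        File[] candidates = dir.listFiles();
        if (candidates == null) {
            return ready; // not a directory, or an I/O error occurred
        }
        for (File file : candidates) {
            // lastModified() returns 0L when the file no longer exists,
            // e.g. it was deleted between listing and checking.
            long modified = file.lastModified();
            if (!file.isFile() || modified == 0L) {
                continue;
            }
            if (nowMs - modified >= minAgeMs) {
                ready.add(file);
            }
        }
        return ready;
    }
}
```

This does not answer the aborted-upload question: a partial file left behind by an aborted transfer stops changing and would eventually look "stable" to any check based only on the timestamp.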
Dejan
---- Original Message -----
From: "David Kinnvall" <dav...@al...>
To: "McDonald, Bruce" <Bru...@ba...>;
<bab...@li...>
Sent: Tuesday, September 23, 2003 6:37 PM
Subject: Re: [Babeldoc-devel] Avoiding incomplete reads in DirectoryScanner
> Hi, Bruce,
>
> McDonald, Bruce wrote:
> > This issue is really problematic in general! We did build a .done
> > file solution into the directory scanner, which looks for the existence of
> > the .done flag file. This file indicates that the transfer has completed.
> > Why don't you look into this?
>
> Hm...there is no .done file solution in the DirectoryScanner.
>
> There is one in the FileWriterPipelineStage, though, but that
> is of no real help in the (our) DirectoryScanner case, since:
>
> - Someone else is doing the writing, and can be slow at
> it, to make things worse...perhaps even via modem. :-)
> - I can't make the writer (a different organization)
> change their system to provide an additional done file
> to indicate completion in this case, unfortunately.
> - The scanner submits the read document, in whatever state
> it is, to the pipeline queue, for processing, and then
> lets go of it, hence I cannot go back to the scanner if
> I detect in the PipelineStage that something is wrong,
> at least not in any clean way that I can see.
>
> Have I missed anything?
>
> The configurable delay approach in the DirectoryScanner seems
> to work so far, at least for me, and I can't really see any
> obviously cleaner solution when there is no possibility of
> modifying the sender's behavior.
>
> A bit more research indicates that there is really no clean
> way (even at the OS level) in Linux to detect when a file has
> been completely written and closed. So it seems to me at this
> time that the delay might be a viable trick.
>
> I have modified the patch ever so slightly so that it checks
> the modification timestamp after each delay, to cover the case
> with a really slow writer appending a number of bytes each and
> every second, for some time. Manually tested using "touch". :-)
>
> Adjusted patch attached, FWIW.
>
> /David
>
--------------------------------------------------------------------------------
> Index: DirectoryScanner.java
> ===================================================================
> RCS file: /cvsroot/babeldoc/babeldoc/modules/scanner/src/com/babeldoc/scanner/worker/DirectoryScanner.java,v
> retrieving revision 1.20
> diff -u -r1.20 DirectoryScanner.java
> --- DirectoryScanner.java 12 Sep 2003 01:09:16 -0000 1.20
> +++ DirectoryScanner.java 23 Sep 2003 16:30:44 -0000
> @@ -98,6 +98,7 @@
> public static final String BUFFER_LEN = "bufferLen";
> public static final String INCLUDE_SUB_DIRS = "includeSubfolders";
> public static final String FILTER_FILENAME = "filter";
> + public static final String MIN_LAST_MODIFIED = "minLastModified";
>
> public DirectoryScanner() {
> super(new DirectoryScannerInfo());
> @@ -112,6 +113,12 @@
> /** flag to include sub directories */
> private boolean includeSubDirs = false;
>
> + /** Minimum time in ms since file was last modified.
> + * Attempts to guard against incomplete reads when
> + * the writer of the file is "slow".
> + */
> + private int minLastModified = 0;
> +
> /**
> * This method will scan for new documents. It will queue documents by
> * itself, so it will return null no matter how many documents found!
> @@ -165,6 +172,12 @@
> // Dont catch or do anything here - this means that the default is accepted.
> }
>
> + setMinLastModified(this.getInfo().getIntValue(MIN_LAST_MODIFIED));
> +
> + LogService.getInstance().logInfo(
> + "Minimum time since last modification of files will be " + getMinLastModified() + " ms"
> + );
> +
> //Add filename filter if exist
> addFilter(FILTER_FILENAME);
> }
> @@ -273,6 +286,21 @@
> //getting message from file
> byte[] data = new byte[1024];
> long modified = new Date(file.lastModified()).getTime();
> +
> + // avoid (if configured) incomplete reads due to slow writer(s)
> + long now = System.currentTimeMillis();
> + while(getMinLastModified() > (now - modified)) {
> + try {
> + long interval = getMinLastModified() - (now - modified);
> + LogService.getInstance().logInfo("Sleeping " + interval + " ms, since file was too fresh...");
> + Thread.currentThread().sleep(interval);
> + } catch (java.lang.InterruptedException ie) {
> + // Ignore.
> + }
> + now = System.currentTimeMillis();
> + modified = new Date(file.lastModified()).getTime();
> + }
> +
> fis = new FileInputStream(file);
> baos = new ByteArrayOutputStream((int) file.length());
>
> @@ -319,6 +347,15 @@
> public void setIncludeSubDirs(boolean includeSubDirs) {
> this.includeSubDirs = includeSubDirs;
> }
> +
> + public int getMinLastModified() {
> + return minLastModified;
> + }
> +
> + public void setMinLastModified(int minLastModified) {
> + if(minLastModified < 0) minLastModified = 0;
> + this.minLastModified = minLastModified;
> + }
> }
>
> /**
> @@ -382,6 +419,14 @@
> null,
> true,
> I18n.get("scanner.DirectoryScannerInfo.option.filter")));
> +
> + options.add(
> + new ConfigOption(
> + DirectoryScanner.MIN_LAST_MODIFIED,
> + IConfigOptionType.INTEGER,
> + null,
> + true,
> + "Minimum time in ms since file was last modified (attempts to guard against incomplete reads)"));
>
> return options;
> }
>