
#285 Delete replaced normalised files after reprocessing

Status: open-postponed
Owner: nobody
Priority: 5
Created: 2012-10-30
Updated: 2012-10-30
Private: No

Currently, when a reprocessing job is performed, the new normalised Xena files are added to the repository, but any previous normalised files for the same input files are retained. We only really care about the binary Xena files and the latest normalised Xena files, so normalised files that have been replaced should be deleted, either when the reprocessing job completes or at some later stage.

The only exception to deleting previous normalisation files would be if we decide to allow more than one normalisation for further risk management (e.g. normalising to two different file formats). Superseded normalisation files are of no use to us and just use up space in the repository; a possible clean-up rule is sketched below.
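
For illustration, a minimal sketch of such a clean-up step in Java (Xena itself is Java-based). The Repository and NormalisedFile types below are hypothetical stand-ins, not the real Xena API; the rule shown is: always keep the binary normalisation, keep the newest normalisation per target format (which caters for the multiple-format exception above), and delete the rest.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical repository types -- stand-ins, not the real Xena API.
    class NormalisedFile {
        final String inputId, format;
        final boolean binary;        // true for the binary (bit-for-bit) normalisation
        final long createdMillis;
        NormalisedFile(String inputId, String format, boolean binary, long createdMillis) {
            this.inputId = inputId; this.format = format;
            this.binary = binary; this.createdMillis = createdMillis;
        }
    }

    interface Repository {
        List<NormalisedFile> normalisationsFor(String inputId);
        void delete(NormalisedFile file);
    }

    class ReprocessingCleanup {
        // Run after a reprocessing job completes for inputId: keep the binary
        // normalisation and the newest normalised file per target format;
        // everything else is superseded and deleted.
        static void deleteSuperseded(Repository repo, String inputId) {
            List<NormalisedFile> all = repo.normalisationsFor(inputId);
            Map<String, NormalisedFile> newestPerFormat = new HashMap<>();
            for (NormalisedFile f : all) {
                if (f.binary) continue;                       // binary is always kept
                NormalisedFile best = newestPerFormat.get(f.format);
                if (best == null || f.createdMillis > best.createdMillis) {
                    newestPerFormat.put(f.format, f);
                }
            }
            for (NormalisedFile f : all) {
                if (!f.binary && newestPerFormat.get(f.format) != f) {
                    repo.delete(f);                           // superseded copy
                }
            }
        }
    }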

Discussion

  • John

    John - 2012-10-30

    As far as I can recall, the original plans for reprocessing required that all 'intermediate' AIPs be retained. This was so we could prove that we had not 'destroyed the essence' of the original. That should be thoroughly investigated before doing much work on this request.

    Keeping each of the AIPs also allows further reprocessing of an intermediate AIP. For example, if you ingest a doc file, you then have that doc file and an odt v1.2 normalised version. If you reprocess the doc files when odt v1.3 comes out, and so on, you might end up in a situation where you cannot go directly from doc to the current version of odt. You would then have to go back to the previous generation of odt files and reprocess from that one.

    However, there is a possibility that reprocessing an original file results in an open version identical to the one generated at ingest time. I cannot see any reason to keep identical copies (a checksum comparison, sketched below, would catch this).

    Let me know if I am not making any sense - it's hard to describe in a small text box :)
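
    For illustration, a hedged sketch of this duplicate check using plain JDK SHA-256 digests; the file locations are hypothetical. If reprocessing yields a byte-identical normalisation, the new copy can simply be dropped.

        import java.io.InputStream;
        import java.io.OutputStream;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.security.DigestInputStream;
        import java.security.MessageDigest;
        import java.util.Arrays;

        // Compare an existing normalisation against a freshly reprocessed one
        // by SHA-256 digest; if the digests match there is no reason to keep both.
        class DuplicateCheck {
            static byte[] sha256(Path file) throws Exception {
                MessageDigest md = MessageDigest.getInstance("SHA-256");
                try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
                    in.transferTo(OutputStream.nullOutputStream());  // drain stream to feed the digest
                }
                return md.digest();
            }

            static boolean isIdentical(Path existing, Path reprocessed) throws Exception {
                return Arrays.equals(sha256(existing), sha256(reprocessed));
            }
        }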

     
  • Terry O'Neill

    Terry O'Neill - 2012-10-30

    I don't really see how having an intermediate has any effect on proving that we have not destroyed the original, as we always keep the original itself and can compare directly against it.

    What you say about processing from intermediates makes sense, but the particular case you give would not be an issue: you would still have the original doc file (binary normalised) and the odt v1.2 normalised version, and you would reprocess to odt v1.3 while both exist. Only once the odt v1.3 normalisation is complete would the odt v1.2 normalisation be removed.

    The disadvantage is that you would then hold only the latest odt v1.3, so if a later conversion could only work from odt v1.2 you would no longer have it. That situation seems most likely shortly after a new format version is released but not yet well supported, and could be catered for by keeping intermediate normalisations for a defined period before deleting them (sketched below). Otherwise it only seems likely to cause issues if we change the type of preservation format we use, or that format diverges; if we are worried about such divergence, a better solution would be to allow multiple 'current' normalisations in the different formats.

    I have changed the status to postponed for now so we can go over the possible issues. Another alternative for saving disk space is a process that identifies low-value intermediate normalisations and offers them for deletion; it could be run whenever space on the digital repository is in short supply (also sketched below).
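
    For illustration, a hedged sketch of both refinements, reusing the hypothetical Repository and NormalisedFile types from the first sketch: superseded intermediates become eligible for deletion only after a configurable retention period, and the job is invoked on demand when repository space runs short. The retention length is an assumed placeholder, not an agreed policy.

        import java.time.Duration;
        import java.time.Instant;
        import java.util.ArrayList;
        import java.util.List;

        class IntermediateCleanup {
            static final Duration RETENTION = Duration.ofDays(365);  // assumed placeholder

            // Delete superseded, non-binary normalisations older than the
            // retention period; intended to be run when space runs short.
            static void reclaimSpace(Repository repo, Iterable<String> inputIds, Instant now) {
                List<NormalisedFile> superseded = new ArrayList<>();
                for (String id : inputIds) {
                    for (NormalisedFile f : repo.normalisationsFor(id)) {
                        boolean replaced = !f.binary && !isLatest(repo, id, f);
                        boolean pastRetention = Instant.ofEpochMilli(f.createdMillis)
                                .plus(RETENTION).isBefore(now);
                        if (replaced && pastRetention) {
                            superseded.add(f);
                        }
                    }
                }
                superseded.forEach(repo::delete);  // delete after scanning, not mid-iteration
            }

            // True if f is the newest normalisation for its format under inputId.
            static boolean isLatest(Repository repo, String inputId, NormalisedFile f) {
                for (NormalisedFile o : repo.normalisationsFor(inputId)) {
                    if (!o.binary && o.format.equals(f.format)
                            && o.createdMillis > f.createdMillis) {
                        return false;
                    }
                }
                return true;
            }
        }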

     
  • Terry O'Neill

    Terry O'Neill - 2012-10-30
    • status: open --> open-postponed
     
