At the moment, when pages change (currently only
detected when URLs change) we store backwards diffs of
the XML files in scrapedxml/debates/*.diff*. These
need backing up.
We should probably also store and backup backwards
diffs of the HTML.
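For illustration, here is a minimal sketch of the backwards-diff idea: when a scraped XML file changes, store a diff that turns the NEW version back into the OLD one, so the current file plus the stored diffs is enough to reconstruct history. The file name and paths below are hypothetical, not taken from the actual scraper.

```shell
mkdir -p scrapedxml/debates
printf 'old content\n' > scrapedxml/debates/debates2004-01-01.xml
cp scrapedxml/debates/debates2004-01-01.xml /tmp/old.xml
printf 'new content\n' > scrapedxml/debates/debates2004-01-01.xml

# Backwards diff: from new back to old (note the argument order).
# diff exits 1 when the files differ, so ignore that status.
diff scrapedxml/debates/debates2004-01-01.xml /tmp/old.xml \
    > scrapedxml/debates/debates2004-01-01.xml.diff1 || true

# Applying the stored diff to the current file recovers the old version.
patch -s -o /tmp/recovered.xml scrapedxml/debates/debates2004-01-01.xml \
    scrapedxml/debates/debates2004-01-01.xml.diff1
cat /tmp/recovered.xml   # → old content
```

Because the diffs run backwards, losing them loses all history but never the current file, which is why backing them up matters.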
Logged In: YES
user_id=202102
All HTML page changes can now be detected, and diffs are stored
in cmpages/.
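Detecting a change in every page, not just pages whose URL changed, presumably means comparing page content. A hedged sketch of one way to do that, using content hashes (the cache layout and file names here are assumptions, not the project's actual scheme):

```shell
mkdir -p cmpages/hashes

# page_changed URL FILE: returns success (0) if FILE's content differs
# from what we last saw for URL, and records the new hash.
page_changed() {
    url_key=$(printf '%s' "$1" | sha1sum | cut -d' ' -f1)
    new_hash=$(sha1sum < "$2" | cut -d' ' -f1)
    old_hash=$(cat "cmpages/hashes/$url_key" 2>/dev/null || true)
    if [ "$new_hash" = "$old_hash" ]; then
        return 1   # identical content seen before
    fi
    printf '%s' "$new_hash" > "cmpages/hashes/$url_key"
    return 0       # new or modified page
}

printf '<html>v1</html>' > /tmp/page.html
page_changed 'http://example.org/p' /tmp/page.html && echo changed
page_changed 'http://example.org/p' /tmp/page.html || echo unchanged
printf '<html>v2</html>' > /tmp/page.html
page_changed 'http://example.org/p' /tmp/page.html && echo changed
```

A hash comparison like this catches in-place edits to an existing page, which URL-only detection misses.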
Are the diffs backed up yet?
Logged In: YES
user_id=91098
At the moment we don't back up diffs at all. We should
back up the XML diffs, but the HTML ones are less important.
We need the XML files themselves for old gids, so that URLs don't
break. We should back up both the diffs and the actual XML files.
Valuable historic data like chgpages (all the old lists of
ministerships etc.) is in CVS, hence backed up across the
world via SF.