The question arose this week: are the EpiDoc Guidelines ("latest" version) archived regularly enough by the Internet Archive that the Wayback Machine could serve as a reliable archive of each dated/numbered release? If not, how would we go about making it so?
Quick investigation suggests that one can only request a page be crawled by WM, not a whole site, and that while the EpiDoc GLs front page is crawled several times a year, most pages inside the site have not been archived since the last release. (That may also just be because they didn't change in that release, but I don't know if we can rely on that?)
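A rough way to check this per page, rather than by eye, might be the Wayback Machine availability API. What follows is only a sketch: it assumes the `https://archive.org/wayback/available` endpoint returns JSON with an `archived_snapshots.closest` entry carrying a 14-digit `timestamp`, and that with no timestamp given it reports the most recent capture. The function names are made up for illustration.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_query(page_url, timestamp=None):
    """Build the availability-API URL for one page (hypothetical helper)."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp  # e.g. "20200901"
    return AVAILABILITY_API + "?" + urlencode(params)


def closest_snapshot_time(payload):
    """Return the capture time reported in an API response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    parsed = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def archived_since(payload, release_date):
    """True if the reported capture is on or after the release date."""
    snapshot = closest_snapshot_time(payload)
    return snapshot is not None and snapshot >= release_date


def fetch_availability(page_url, timestamp=None):
    """Query the API over the network and decode the JSON response."""
    with urlopen(availability_query(page_url, timestamp), timeout=30) as resp:
        return json.load(resp)
```

Running `archived_since(fetch_availability(url), release_date)` over every page would show which ones have no capture since the release. It would not tell us whether the page content changed in that release.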
One possibility might be that we (programmatically?) construct a list of all the pages in /latest/ (it's <100 isn't it?) and construct the URLs to ask WM to archive them, which we write into the release process, to be run a couple of days after the tested release. Would that be potential abuse of archive.org? I can't decide...
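For what it's worth, a minimal sketch of what "construct the URLs and ask WM to archive them" could look like. It assumes the public Save Page Now form, where a GET request to `https://web.archive.org/save/` followed by the page URL requests a capture. The function names, the User-Agent string and the 15-second pause are all hypothetical, and the pause is only a guess at what counts as polite. Rate limits and terms of use would need checking before anyone runs this.

```python
import time
from urllib.request import Request, urlopen

# Assumed Save Page Now prefix; the target page URL is appended as-is.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_request_url(page_url):
    """Build the capture-request URL for one page (hypothetical helper)."""
    return SAVE_ENDPOINT + page_url


def request_archiving(page_urls, delay_seconds=15.0, opener=urlopen):
    """Request a capture of each page, pausing between requests.

    Returns a dict mapping each page URL to the HTTP status code,
    or to an error message if the request failed.
    """
    results = {}
    for index, page_url in enumerate(page_urls):
        if index:
            time.sleep(delay_seconds)  # spread requests out over time
        request = Request(
            save_request_url(page_url),
            headers={"User-Agent": "guidelines-archiver (illustrative)"},
        )
        try:
            with opener(request, timeout=60) as response:
                results[page_url] = response.status
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            results[page_url] = str(exc)
    return results
```

At one request every 15 seconds, a list of 100 pages would take about 25 minutes, which seems tolerable for something run once per release.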
I'm going to take a look at the Web Archive Triage scripts that Ryan Baumann's published on GitHub. They can probably be used for this purpose.
Ryan's code will work great. I'll undertake to build a list of all our Guidelines URLs that can be used with it.
Presumably this list should be generated from the Guidelines XML and/or HTML, because the pages in the site will change occasionally…
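One way the list could be derived from the generated HTML, as a sketch only: collect the links from an index or table-of-contents page and keep those that resolve to `.html` pages under the /latest/ base. The base URL in the usage note is a placeholder, the names are hypothetical, and it assumes every Guidelines page is linked from the page being parsed. That assumption would need verifying against the real site.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def guideline_pages(html_text, base_url):
    """Return the sorted, de-duplicated page URLs found under base_url."""
    parser = LinkCollector()
    parser.feed(html_text)
    pages = set()
    for href in parser.hrefs:
        # Resolve relative links and drop any #fragment part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(base_url) and absolute.endswith(".html"):
            pages.add(absolute)
    return sorted(pages)
```

Usage would be something like `guideline_pages(index_html, "https://example.org/gl/latest/")`, with the real /latest/ URL substituted, and the result fed to whatever script makes the archiving requests. Because the list is regenerated at each release, it follows the site as pages are added or removed.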
Bumped -> Future.
@paregorios do you intend to implement this in time for the September 2020 release freeze?
Sadly, no. Relinquishing.
Bumped -> Future.
It appears that nothing has changed with respect to direct support from the Wayback Machine for requesting site-wide crawls. The Internet Archive does offer a site-wide archiving service (Archive-it.org), but it is fee-based. Currently, only single-page archiving requests are available free of charge.
At present, there are c. 185 individual pages in the GLs. Thus, Ryan's scripts (or similar), which trigger single-page archiving requests, seem a feasible solution. However, because they are run from the command line, they can't be integrated into the OxygenXML-bound GL generation process until we adopt XProc, which, as discussed in the context of BR #172, would require the release technician to have Oxygen version 22 or later (the earliest version that supports XProc). Consequently, archiving would have to be treated as a separate, potentially delegatable task, de-coupled from the GL generation process.
Going forward it doesn't seem unreasonable that at least one person on the release team will have Oxygen 22+, since that release is now over a year old, so I think we can keep this in place. (Assuming it works and we understand what to do with it. Was it ever added to the ReleaseProcess?)
@filosam agreed to take a look at this and report back for the February EDAG, in consultation with @sarcanon and others as needed re final functionality. (Does the XProc scenario create a script? Does it need to be run? Should we then add it to ReleaseProc?)
Dear @sarcanon, should we tackle this for the 9.6 release?