
#120 Investigate getting Guidelines crawled by Internet Archive

Group: 9.6
Status: accepted
Priority: 4
Updated: 2024-03-06
Created: 2016-11-17
No

The question arose this week: are the EpiDoc Guidelines ("latest" version) archived regularly enough by the Internet Archive that the Wayback Machine could serve as a reliable archive of each dated/numbered release? If not, how would we go about making it so?

Related

Request Features: #104

Discussion

  • BODARD Gabriel

    BODARD Gabriel - 2016-11-17

    Quick investigation suggests that one can only request a page be crawled by WM, not a whole site, and that while the EpiDoc GLs front page is crawled several times a year, most pages inside the site have not been archived since the last release. (That may also just be because they didn't change in that release, but I don't know if we can rely on that?)

    One possibility might be that we (programmatically?) construct a list of all the pages in /latest/ (it's <100 isn't it?) and construct the urls to ask WM to archive them, which we write into the release process, to be run a couple days after the tested release. Would that be potential abuse of archive.org? I can't decide...
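
A minimal sketch of the URL-construction step proposed above, assuming the Wayback Machine's "Save Page Now" endpoint (`https://web.archive.org/save/` followed by the page URL) is the way to request a single-page capture. The base URL and the page names are assumptions for illustration; neither is stated in this ticket. The sketch only builds the request URLs and sends nothing.

```python
from urllib.parse import urljoin

# Assumption: base URL of the "latest" Guidelines; not stated in this ticket.
BASE_URL = "https://epidoc.stoa.org/gl/latest/"
# Wayback Machine "Save Page Now" prefix: a GET on prefix + page URL requests a capture.
SAVE_PREFIX = "https://web.archive.org/save/"


def save_request_urls(page_names, base_url=BASE_URL):
    """Return one Save Page Now request URL per page name."""
    urls = []
    for name in page_names:
        # Resolve the page name against the base URL of the site.
        page_url = urljoin(base_url, name)
        urls.append(SAVE_PREFIX + page_url)
    return urls


if __name__ == "__main__":
    # Hypothetical page names, for illustration only.
    for url in save_request_urls(["index.html", "toc.html"]):
        print(url)
```

Keeping URL construction separate from submission would let the list be reviewed by hand before any request goes to archive.org.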

     
  • BODARD Gabriel

    BODARD Gabriel - 2016-12-16
    • assigned_to: Tom Elliott
     
  • Tom Elliott

    Tom Elliott - 2017-01-17
    • status: unread --> accepted
     
  • BODARD Gabriel

    BODARD Gabriel - 2017-10-17
    • Group: future --> 9.0
     
  • Tom Elliott

    Tom Elliott - 2017-10-20
    • Group: 9.0 --> future
     
  • Tom Elliott

    Tom Elliott - 2018-03-07

    I'm going to take a look at the Web Archive Triage scripts that Ryan Baumann's published on GitHub. They can probably be used for this purpose.

     
  • Tom Elliott

    Tom Elliott - 2018-03-07

    Ryan's code will work great. I'll undertake to build a list of all our guidelines URLs that can be used with same.

     
    • BODARD Gabriel

      BODARD Gabriel - 2020-06-16

      Presumably this list should be generated from the Guidelines XML and/or HTML, because the pages in the site will change occasionally…
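
One way the list could be derived from the generated HTML rather than maintained by hand: a rough sketch that collects the internal `.html` links from a page's markup using only the Python standard library. The class and function names are hypothetical, and the assumption that every Guidelines page is reachable through `<a href>` links on some index page has not been checked against the actual site.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class PageLinkCollector(HTMLParser):
    """Collect absolute URLs of .html pages that live under base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pages = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links and drop any #fragment so each page is listed once.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute.startswith(self.base_url) and absolute.endswith(".html"):
            self.pages.add(absolute)


def list_pages(html_text, base_url):
    """Return the sorted, de-duplicated internal page URLs found in html_text."""
    collector = PageLinkCollector(base_url)
    collector.feed(html_text)
    return sorted(collector.pages)
```

Running this over the freshly built output at release time would pick up pages added or removed since the previous release.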

       
  • BODARD Gabriel

    BODARD Gabriel - 2019-01-22
    • Group: future --> 9.1
     
  • Scott Vanderbilt

    • Group: 9.1 --> future
     
  • Scott Vanderbilt

    Bumped -> Future.

     
  • Tom Elliott

    Tom Elliott - 2020-01-21
    • Group: future --> 9.2
     
  • BODARD Gabriel

    BODARD Gabriel - 2020-08-07

    @paregorios do you intend to implement this in time for the September 2020 release freeze?

     
  • Tom Elliott

    Tom Elliott - 2020-08-07

    Sadly, no. Relinquishing.

     
  • Tom Elliott

    Tom Elliott - 2020-08-07
    • assigned_to: Tom Elliott --> nobody
     
  • BODARD Gabriel

    BODARD Gabriel - 2020-08-07
    • Group: 9.2 --> future
    • Priority: 3 --> 4
     
  • Scott Vanderbilt

    • assigned_to: Scott Vanderbilt
     
  • Scott Vanderbilt

    Bumped -> Future.

     
  • Scott Vanderbilt

    It appears that nothing has changed with respect to direct support from the Wayback Machine for requesting site-wide crawls. The Internet Archive does offer a site-wide archiving service (Archive-it.org), but it is fee-based. Currently, only single-page archiving requests are available.
    At present, there are c. 185 individual pages in the GLs. Thus, Ryan's scripts (or similar), which trigger single-page archiving requests, seem a feasible solution. However, because they are run from the command line, they can't be integrated into the OxygenXML-bound GL generation process until we adopt XProc, which, as discussed in the context of BR #172, would require the release technician to have Oxygen version 22 or later (the earliest version that supports XProc). Consequently, archiving would have to be treated as a separate task that can be delegated, decoupled from the GL generation process.
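
For illustration, a standalone submission loop of the kind described above might look like the following. This is not Ryan's code. It is a hedged sketch that assumes each capture is requested with a plain GET on a Save Page Now URL, and the ten-second pause is an arbitrary courtesy value, not a documented archive.org limit. The function names are hypothetical.

```python
import time
import urllib.request


def request_capture(save_url, timeout=60):
    """Send one GET to a Save Page Now URL and return the HTTP status code."""
    with urllib.request.urlopen(save_url, timeout=timeout) as response:
        return response.status


def submit_all(save_urls, submit=request_capture, delay_seconds=10.0, sleep=time.sleep):
    """Submit each URL in turn, pausing between requests.

    Returns a dict mapping each URL to its status code, or to the exception
    raised for it, so that failed pages can be retried later.
    """
    results = {}
    for index, url in enumerate(save_urls):
        if index > 0:
            sleep(delay_seconds)  # pause between requests, not before the first
        try:
            results[url] = submit(url)
        except OSError as error:  # urllib.error.URLError is a subclass of OSError
            results[url] = error
    return results
```

With c. 185 pages and a ten-second pause, a full run would take roughly half an hour. Passing `submit` and `sleep` as parameters means the loop can be dry-run without contacting archive.org.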

     
  • BODARD Gabriel

    BODARD Gabriel - 2021-11-19

    Going forward it doesn't seem unreasonable that at least one person on the release team will have Oxygen 22+, since that release is now over a year old, so I think we can keep this in place. (Assuming it works and we understand what to do with it. Was it ever added to the ReleaseProcess?)

     
  • BODARD Gabriel

    BODARD Gabriel - 2022-01-18
    • labels: --> release process
    • assigned_to: Scott Vanderbilt --> Martina Filosa
     
  • BODARD Gabriel

    BODARD Gabriel - 2022-01-18

    @filosam agreed to take a look at this and report back for the February EDAG, in consultation with @sarcanon and others as needed re final functionality. (Does the XProc scenario create a script? Does it need to be run? Should we then add it to the ReleaseProcess?)

     
  • BODARD Gabriel

    BODARD Gabriel - 2022-01-18
    • Group: future --> 9.4
     
  • BODARD Gabriel

    BODARD Gabriel - 2023-06-15
    • Group: 9.4 --> 9.6
     
  • Martina Filosa

    Martina Filosa - 2024-03-06

    Dear @sarcanon, should we tackle this for the 9.6 release?

     
