
#295 Archive content and add cached document links to renderer

Milestone: v1.5
Status: open
Owner: nobody
Labels: None
Priority: 1
Updated: 2016-01-31
Created: 2016-01-31
Creator: Eric Twose
Private: No

Feature request: Have you considered adding the ability to archive web pages and other documents, and to provide links in the renderer to the cached content? This would work like Google's cached links, or like a simplified version of the Wayback Machine without multiple dated snapshots.

  • We need to check the server's response headers to confirm that we are allowed to archive the content.
  • In subsequent crawls, we want to be able to choose whether to keep the old archive or update it.
  • The archive shouldn't be updated if the content is now dead, if we're redirected to a custom 404 page, and so on.
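To make the three points above concrete, here is a rough sketch in Python (chosen only for brevity). It assumes the permission signal is the `X-Robots-Tag: noarchive` response header; the function names, the `"store"`/`"keep"`/`"skip"` return values and the `refresh` flag are all hypothetical and are not part of OpenSearchServer. Treating every redirect as suspect is a deliberately conservative assumption, and detecting a custom 404 page that answers with status 200 at the same URL is not covered.

```python
def may_archive(headers):
    """Return False if any X-Robots-Tag header carries the 'noarchive' directive."""
    for name, value in headers.items():
        if name.lower() != "x-robots-tag":
            continue
        for part in value.split(","):
            # A directive may be prefixed with a bot name ("somebot: noarchive").
            # This sketch conservatively honours it whichever bot is named.
            directive = part.split(":")[-1].strip().lower()
            if directive == "noarchive":
                return False
    return True


def archive_action(has_archive, status, requested_url, final_url, headers,
                   refresh=True):
    """Decide what to do with the archived copy after a crawl.

    Returns "store" (write or overwrite the archive), "keep" (leave the
    existing archive untouched) or "skip" (nothing archived, nothing to do).
    """
    no_write = "keep" if has_archive else "skip"
    if status != 200:
        # Dead or failing page: never overwrite a good copy with an error.
        return no_write
    if final_url != requested_url:
        # Redirected, possibly to a custom 404 page (assumption: any redirect
        # is treated as suspect).
        return no_write
    if not may_archive(headers):
        # Whether an existing archive should be dropped here is a policy
        # question this sketch leaves open.
        return no_write
    if has_archive and not refresh:
        # The user chose to keep the old archive on subsequent crawls.
        return "keep"
    return "store"
```

For example, `archive_action(True, 404, url, url, {})` leaves the old copy alone, while a first successful crawl of a page with no `noarchive` header gets stored.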

An anonymous user suggested something similar (milestone v1.2) and provided an attachment, htmlparser.java.

(The project that I'm working on needs to be self-sufficient and to provide a broad range of services even if part or all of the public internet is compromised or otherwise inaccessible.)

Discussion

  • Eric Twose

    Eric Twose - 2016-01-31

    Sorry, I'm new to OpenSearchServer. I realize now that I have the original content in the crawlcache.

    I guess I'd need to hook into the renderer somehow, get the crawlcache item filepath, then insert a hyperlink to the cached content.

    As the original poster suggested, any HTML document would need a <base> element pointing to the original site, if one is not already present; otherwise relative requests will go to the server running OpenSearchServer.
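A minimal sketch of that base-element step, in Python for brevity. The function name `ensure_base_href` is hypothetical, and the regular expressions are a simplification: a real implementation would more likely work on the parsed document than on raw text.

```python
import html
import re

# Matches an existing <base ... href=...> element.
BASE_TAG = re.compile(r"<base\b[^>]*\bhref\s*=", re.IGNORECASE)
# Matches the opening <head> tag (but not <header>).
HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def ensure_base_href(html_text, original_url):
    """Insert <base href="original_url"> unless the page already has one."""
    if BASE_TAG.search(html_text):
        return html_text
    base = '<base href="%s">' % html.escape(original_url, quote=True)
    match = HEAD_OPEN.search(html_text)
    if match:
        # Place the element right after <head> so it precedes any relative URL.
        return html_text[:match.end()] + base + html_text[match.end():]
    # No <head> found: prepend as a fallback (assumption: good enough for a
    # cached copy, although a proper fix would build a <head>).
    return base + html_text
```

With this in place, relative links and images in the cached copy resolve against the original site rather than the server hosting the cache.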

     

    Last edit: Eric Twose 2016-01-31
  • Eric Twose

    Eric Twose - 2016-01-31

    No doubt JavaScript, styles and images would change or disappear from the original host over time. I don't know whether calls to external resources could be stripped out, or whether the content could be simplified and made more readable, as services like wallabag do.
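Stripping the external resources looks feasible in principle. Below is a rough sketch using Python's standard `html.parser` module; the class name, the function name and the two tag sets are my own choices for illustration, and this is nowhere near the readability extraction that services like wallabag perform. Comments in the source HTML are dropped as a side effect.

```python
from html.parser import HTMLParser

# Elements removed together with everything inside them.
DROP_WITH_CONTENT = {"script", "style", "iframe"}
# Void elements that only exist to pull in an external resource.
DROP_VOID = {"img", "link", "embed", "source"}


class ResourceStripper(HTMLParser):
    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag in DROP_VOID:
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <img ... />.
        if self.skip_depth or tag in DROP_VOID or tag in DROP_WITH_CONTENT:
            return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if self.skip_depth or tag in DROP_VOID:
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append("&#%s;" % name)

    def handle_decl(self, decl):
        # Preserve the doctype declaration.
        self.out.append("<!%s>" % decl)


def strip_external_resources(html_text):
    parser = ResourceStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

The result keeps the text and basic markup, so the cached copy stays readable even after the original scripts, styles and images have gone.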

     
