Finding broken links (and adding redirects) significantly improves the usability of a website. Assuming the logs are in the standard place and you have permission to read them, the command:
zcat /var/log/apache2/access.log.*.gz | grep ' 404 ' | awk '{print $7,$11}' | sort | uniq -c | sort -n
will give you a list of missing URLs on the website, the URL(s) pointing to them, and a count for each.
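If you want to try the pipeline without access to the live server, the sketch below builds a tiny sample log and runs the same command against it. The directory, filenames, and log lines are invented for illustration; the field numbers ($7 for the request path, $11 for the referer) assume Apache's standard "combined" LogFormat.

```shell
# Create a throwaway directory with a small sample access log
# in Apache "combined" format (hypothetical entries).
LOGDIR=$(mktemp -d)
cat > "$LOGDIR/access.log" <<'EOF'
1.2.3.4 - - [24/Jan/2014:10:00:00 +0000] "GET /missing.html HTTP/1.1" 404 196 "http://example.org/index.html" "Mozilla/5.0"
1.2.3.4 - - [24/Jan/2014:10:00:05 +0000] "GET /missing.html HTTP/1.1" 404 196 "http://example.org/index.html" "Mozilla/5.0"
1.2.3.4 - - [24/Jan/2014:10:00:10 +0000] "GET /ok.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"
EOF
gzip "$LOGDIR/access.log"   # mimic a rotated .gz log

# Same pipeline as above, pointed at the sample directory:
# prints each 404'd path with its referer, prefixed by a hit count.
zcat "$LOGDIR"/access.log.gz | grep ' 404 ' | awk '{print $7,$11}' | sort | uniq -c | sort -n
```

Note that `grep ' 404 '` matches the status code by its surrounding spaces, so a path or referer containing the literal string " 404 " would also match; for strict matching you could test `$9 == 404` in awk instead.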
Categorising as AMBER
At the Council face-to-face, assigned to JC; GREEN. JC to ask DS to make the TEI-C webserver logs available somewhere we can see them; then action on MH to write a script to generate lists of bad links of various types. JC also to check Google Analytics/Webmaster Tools. Deadline: report back by the next conference call.
Currently, the HTTP log directory on tei-c.org is readable by root only. I'll see whether Shayne is willing to loosen up the permissions, or whether he has a better suggestion for monitoring the errors.
Logs are now world-readable under /var/log/httpd
I'll still need to be able to log into the server, though, won't I? Could they be copied to an external location or into the web space?
Having looked at the logs, I see very little that we need to worry about here. I think someone should run linkchecker against the TEI site once in a while, but anyone can do that any time. I suggest closing this ticket.
Last edit: Martin Holmes 2014-01-24
Ticket closed as per council recommendation; do this occasionally.
Actually we have a built-in checker that runs once a week on the Jenkins servers, so all we have to do is monitor the results from that.