SourceForge Infrastructure and Service Restoration

By Community Team July 18th, 2015

Downtime can often present a rare view into the infrastructure of a site. Service restoration for the Slashdot Media sites (sourceforge.net and slashdot.org) commenced Thursday after a storage fault. We’re providing more detail about our restoration activities to help keep you informed. We’re also describing our infrastructure to accurately convey its current state in light of some third-party misinformation.

Slashdot Media sites are both proponents of Open Source and backed by Open Source.

Our server platform is CentOS Linux.
We use an Open Source virtualization platform and have in recent years achieved a 75%+ reduction in physical server count through widespread virtualization.
We use an Open Source storage platform, Ceph, with spinning disks and SSD.
The storage backing our services is a mix of ext4, XFS and NFS.
Our backup solution is Open Source, backing on to popular cloud storage platforms.
Our sites use Open Source database platforms including MongoDB and flavors of MySQL and PostgreSQL.
We leverage scalable data solutions including Hadoop and ElasticSearch.
Slashdot is backed by Perl. SourceForge is backed by Python. Both language stacks are entirely Open Source.
And the SourceForge developer services are backed by the Apache Allura code base, which we Open Sourced and delivered to the Apache incubation process.

The Slashdot Media sites experienced an outage commencing Thursday. We responded immediately and confirmed the issue was related to filesystem corruption on our storage platform. This incident impacted all block devices on our Ceph cluster. We consulted with our storage vendor when forming our next steps. We have since been working 24×7 on data restoration, data validation, and service recovery.

Our response to date has been methodical and focused on safe restoration of data and service. To enable this response we split our team in half, with one half working to expedite service restoration and the other working on data validation and restoration.

During the early hours of our outage, both Slashdot and SourceForge ran from our lightweight Disaster Recovery (DR) environment.
Service was restored same-day (Thursday) for Slashdot, as well as SourceForge-related WordPress sites. This included validation by Slashdot Editorial and Engineering teams.
Friday morning we continued work on restoration of our Engineering and Operations infrastructure, to facilitate production changes for SourceForge. This included validation by Slashdot and SourceForge Engineering teams.
Friday afternoon we worked to restore functionality for the SourceForge site and Friday evening the SourceForge site was brought back online. Downloads, project summary pages, the software directory, search, and the site front page had previously been served through the DR environment and were restored to full function Friday evening. Additional validation was performed by SourceForge Engineering.
Work has continued today (Saturday) on validation of SourceForge developer services data, and will continue until services are restored. We’re working methodically and merging data from our latest backups with data from the local filesystem. Data validation largely includes checks via cryptographic sums and signatures.
We’ll be bringing services back online as the validation of backing data is completed, and anticipate bringing additional services online through mid-week. The data involved in our developer services is among the largest we house, and it takes time to perform filesystem checks and to restore data from backups. Using separate mounts, both steps are running concurrently to shorten the restore timeline.
We’re prioritizing the project web service (used by many projects using custom vhosts), mailing lists, and the ability to upload data to our download service. Downloads (40+ TB of data) are already fully functional (as of Friday night), but we are not currently accepting new releases. The Allura platform (ticketing, forums, wiki) will be brought online when the first of these services is ready to come back online.
We’re holding SCM service restoration for last, and within that process will restore Git service first because of its fast verification path. Holding SCM restoration for last allows us to take a cautious approach and frees our staff to interact with developers if any concerns exist when the service is re-enabled.
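
For readers curious what validation "via cryptographic sums" can look like in practice, here is a minimal Python sketch. It is an illustration only, not our actual tooling: the function names (`sha256_of`, `compare_trees`), the choice of SHA-256, and the directory layout are all assumptions made for the example. It walks a restored backup tree, hashes each file, and compares it against the file at the same relative path on the local filesystem.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for very large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(local_root, backup_root):
    """Classify every file under backup_root against local_root by checksum.

    Hypothetical helper: returns relative paths grouped as 'match',
    'mismatch' (present locally but different), or 'missing' (not local).
    """
    local_root, backup_root = Path(local_root), Path(backup_root)
    report = {"match": [], "mismatch": [], "missing": []}
    for backup_file in sorted(backup_root.rglob("*")):
        if not backup_file.is_file():
            continue  # skip directories
        relative = backup_file.relative_to(backup_root)
        local_file = local_root / relative
        if not local_file.is_file():
            report["missing"].append(relative.as_posix())
        elif sha256_of(local_file) == sha256_of(backup_file):
            report["match"].append(relative.as_posix())
        else:
            report["mismatch"].append(relative.as_posix())
    return report
```

In a sketch like this, files in the `match` group need no further work, while `mismatch` and `missing` entries are the candidates for closer inspection or restoration from backup. Signature checks, where they apply, would be a separate step on top of this.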

We’ll continue working 24×7 on service restoration until all services are back online. Watch for updates via the SourceForge Blog (sourceforge.net/blog) and Twitter (https://twitter.com/sfnet_ops).