SourceForge Infrastructure and Service Restoration update for 7/28

On 7/16, Slashdot Media sites (including Slashdot and SourceForge) experienced a storage fault. Work has continued 24×7 on service restoration, and updates have been provided as each key service component was restored. We’ve provided three prior updates (7/18, 7/22, 7/24) summarizing our infrastructure and service restoration status. This is our fourth large update.

The format of this update has changed. Since we are well past the 50% mark on service restoration, we will provide updates only on service outages mitigated since 7/24, along with ETA details for outstanding outages.

All services except SourceForge Developer Services were fully restored on or before 7/24. Services are online except those listed here as outstanding. For a full service listing, see our 7/24 update.

Recently restored

  • Project Web service for k* projects (project names beginning with “k”) is back online.
  • Allura-backed Subversion service is online.
  • Classic (non-Allura) Git service is online.
  • Classic (non-Allura) Subversion (SVN) service is online.
  • Classic (non-Allura) Mercurial (Hg) service is online.

Outstanding

  • File upload service data has been reconstructed and is pending final copying. ETA for service restoration is end of day 7/31.
  • Classic (non-Allura) Bzr service is pending data validation. ETA for service restoration is end of day 8/3. The dataset is undergoing analysis, particularly to identify previously-migrated repositories.
  • CVS service data is pending validation, and infrastructure is being brought back online. ETA for service restoration is end of day 8/3. Data analysis is in progress, to be followed by the restore. CVS data requires a greater degree of manual validation than data for other SCMs.
  • Interactive shell service is offline pending availability of all other service data. This service will be the last to come online. ETA for service restoration is end of day 8/3.

Additional notes

  • Targeted communications were sent to projects using the Allura-backed Subversion service where we were able to identify commits that occurred between the time of backup and the time of the incident. These projects were supplied commit metadata (committer, date, commit message) to aid in re-capturing those changes; one possible re-capture workflow is sketched after this list.
  • Post-mortem activity is anticipated after data restoration is completed.
  • Scheduled (and pre-announced) downtime of Developer Services occurred on 7/28 to support maintenance on our NFS servers. This downtime was completed successfully and ahead of schedule.
  • One additional Ceph-backed database is being migrated to the recently-provisioned SSD-backed database cluster.
  • Additional storage has been onboarded to support service restoration activities. In some cases we currently have three copies of production data to maintain during restoration.
  • Users on “Classic” non-Allura-backed SCM services should anticipate an upcoming pre-announced migration to Allura-backed service (which was restored first).
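
For projects re-capturing those Subversion changes, one possible workflow (a minimal sketch only; the project name, repository URL, and message text below are placeholders to adapt to your project) is to re-apply the lost edits against a fresh checkout and record the original metadata in the new commit message, since revision properties such as svn:author and svn:date typically cannot be rewritten by committers on a hosted repository:

    # Hypothetical example: PROJECT and the bracketed values are placeholders.
    # Check your project's Code page for the exact repository URL.
    svn checkout https://svn.code.sf.net/p/PROJECT/code/trunk PROJECT-trunk
    cd PROJECT-trunk
    # ... manually re-apply the lost edits to the working copy ...
    svn commit -m "Re-apply lost change: <original message> (originally committed by <committer> on <date>)"

Recording the original committer and date in the message keeps the change traceable even though the new revision is attributed to whoever re-commits it.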

Work continues 24×7. Thank you for your continued support and patience.

4 Responses to “SourceForge Infrastructure and Service Restoration update for 7/28”

  1. david Jul 29, 2015 at 6:55 pm

    As you perform the “manual validation” of the CVS data, make sure you check the line endings for text files. It appears that you may be restoring them all using unix format (\n) rather than preserving the original endings (which may be \r\n).

  2. Levi Saraiva Moura Jul 30, 2015 at 12:36 pm

    Good morning, friends. I would like to know why you have disabled Add Files for my projects. I am unable to upload updates to my spreadsheets. Awaiting your reply. Thank you. Brazil – Vila Velha – ES, 30/07/2015.

  3. Luis Gomez Aug 3, 2015 at 11:58 pm

    It was my pleasure to work with you today. You have a great website and responsible operators and developers, and I believe your whole team works in respectful harmony. I will try to visit and report bugs more often. Thank you. Sincerely: LRG.

  4. Gerhard Gonter Aug 27, 2015 at 5:08 am

    Thanks for the good work! Is there a root cause analysis available to read yet? It might be of interest to others who operate, or are considering running, Ceph installations. From my perspective it looked like you lost the whole Ceph cluster, had to rebuild it entirely, and restored from tape. GG