SourceForge Infrastructure and Service Restoration update for 7/22

On 7/16, Slashdot Media sites (including Slashdot and SourceForge) experienced a storage fault.  Work has continued 24×7 on service restoration.  Updates have been provided as each key service component was restored.  We’ve provided one large update summarizing our infrastructure and service restoration status, and are providing a second large update with this post.

High-level status of all Slashdot Media sites and services as of 7/22:
  • Slashdotmedia.com – online
  • Slashdot.org – online
  • Slashdot Engineering infrastructure – online
  • Slashdot Media’s WordPress sites – online
  • SourceForge Engineering infrastructure – online
  • Slashdot Media operations infrastructure – online
  • SourceForge databases – online
  • SourceForge download service – online
  • SourceForge Directory services (project summary page, download pages, search, front page, directory) – online
  • SourceForge Developer Services – partially restored (see detailed status below)

In-depth status of SourceForge Developer Services as of 7/22:
  • SourceForge site’s Developer pages backed by Apache Allura (tickets, wikis, forums) – online
  • SourceForge Mailing List services (email, web archives, archiving) – online
  • SourceForge Project Web service – offline, filesystem checks complete, 22 project letters restored to date (all except j, k, m, and s); data validation and per-letter service resumption pending. ETA 7/22 for restored letters; the remaining four to follow pending restore.
  • SourceForge User Web service – offline, filesystem checks complete, 23 user letters restored to date (all except b, h, and l); data validation pending, ETA 7/23. Service resumption planned when all letters are ready.
  • SourceForge File Upload service – offline. Filesystem checks complete.  Cryptographic summing of files on disk is at 75% completion, with summing expected to complete on 7/23.  Data preparation is in progress at 10% completion; ETA to follow (to be re-estimated when we allocate increased I/O to the data prep tasks on 7/23).
  • SourceForge Allura Git service – offline, filesystem checks complete, all project data restored.  Data validation so far: repository presence check 100%, repository data presence check 100%, ‘git fsck’ of a representative 10% sample of non-empty repositories 100%.  Git validation was aided by its feature set.  Final data validation pending; ETA 7/22 for resumption of service.
  • SourceForge Allura Mercurial (Hg) service – offline, filesystem checks complete, all project data restored.  Data validation (repository presence check, repository data presence check, and repository validation) still to occur; ETA 7/23 for service resumption.
  • SourceForge Allura Subversion (SVN) service – offline, filesystem checks complete, data restoration at 50%.  Restoration priority follows the Git and Hg services.  ETA TBD; a future update will provide an ETA.
  • SourceForge non-Allura SCM platforms and CVS service – offline; filesystem checks and data restoration have not commenced. Priority is given to modern SCMs that include internal data validation mechanisms, and to repositories fully backed by Apache Allura. Service restoration ETA TBD.
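
For readers curious what "cryptographic summing" of restored files can look like in practice, here is a minimal sketch in Python. It is not our actual tooling: the choice of SHA-256, the idea of comparing against a reference manifest, and all function names are assumptions made purely for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash one file in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(
    reference: dict[str, str], restored: dict[str, str]
) -> dict[str, list[str]]:
    """Report files that are missing, unexpected, or changed after a restore."""
    return {
        "missing": sorted(set(reference) - set(restored)),
        "unexpected": sorted(set(restored) - set(reference)),
        "mismatched": sorted(
            name
            for name in reference.keys() & restored.keys()
            if reference[name] != restored[name]
        ),
    }
```

A call such as compare_manifests(build_manifest(Path("/backup/uploads")), build_manifest(Path("/restored/uploads"))) (both paths are hypothetical) returns three lists. If all three are empty, the restored tree matches the reference byte for byte.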

Engagement with our storage platform vendor will continue, and post-mortem activity is anticipated after data restoration is completed. The team continues to split its effort between data restoration and service restoration so as to expedite the return to full service.  Knowledge capture has been continuous throughout this outage and will drive continuous improvement.

We intend to continue our existing communications approach — incremental updates will be provided on individual service restoration, and large updates (like this one) will be provided with additional metrics and technical details as work progresses.

Work continues 24×7 on restoration of SourceForge file upload, SCM, and project web services.

Thank you for your continued support and patience.

29 Responses to “SourceForge Infrastructure and Service Restoration update for 7/22”

  1. Ryan Jul 22, 2015 at 3:12 pm #

    What does “storage fault” actually mean in this context?

    • rgaloppini Jul 22, 2015 at 5:01 pm #

      You can find more info in a previous blog post.

  2. frodo Jul 22, 2015 at 3:57 pm #

    @SFnetops I am trying to download Maxima but it says you are still in maintenance mode. WTH???

    • jbarrett Jul 22, 2015 at 5:07 pm #

      You can download the Maxima project here: https://sourceforge.net/projects/maxima/ Thanks, SF Support

  3. Vinícius Adriano Machado Jul 22, 2015 at 4:00 pm #

    Thanks for the support. I am waiting to complete my work on Jasper Studio! Please fix this as soon as possible 🙂

    • hendrik Jul 22, 2015 at 4:29 pm #

      For me downloading from http://sourceforge.net/projects/maxima works without problems.

  4. Dan Wetzl Jul 22, 2015 at 4:28 pm #

    Been trying to access phplot code repository/discussion board for a week, any chance it will be fixed soon? Need this for a project I inherited from somebody who left, and don’t have any experience with it.

  5. kwittry Jul 22, 2015 at 4:47 pm #

    Waiting to be able to download symmetricDS source code! Good luck!

  6. narjesia stellarium Jul 22, 2015 at 4:51 pm #

    To which service domain does the Stellarium website stellarium.org belong? It is still not brought back online.

  7. Mark Jul 22, 2015 at 4:55 pm #

    How are you going to prevent this from happening again? This is a very important question. We cannot afford week-long outages every month. Also, I think it would be a good idea to go into more detail about what exactly caused this problem. Since so many people have been so inconvenienced, I think that disclosure is due here.

    • jbarrett Jul 22, 2015 at 5:16 pm #

      You are correct. We will be releasing a fully detailed analysis of what happened and what we’re going to do to prevent this from happening in the future. For right now our main focus is on restoring all of our services. Once all of our services are restored, we will be able to do a post-mortem of the situation. Thanks, SourceForge Support

  8. Flaimbaited Jul 22, 2015 at 5:01 pm #

    I have repos in Git and Hg, but prioritizing them over SVN is extremely arrogant, presumptuous, prejudicial, and just wrong. Very disappointing bias. Just because Git and Hg have offline commit capability does not make them superior — there are valid requirements and modern development models based on the intrinsic centralization features of SVN. Capability-wise, Git and Hg checkouts/clones are minimally impacted by this service outage. SVN checkouts, however, are dead in the water until you restore the central repo. The fact that you’re placing the lowest priority on those most affected by the outage says a lot about how you value your users and their technology requirements. Very disappointing.

    • hendrik Jul 22, 2015 at 6:57 pm #

      One of the big features of git over SVN is that it has built-in cryptographic checksums. So if a git repository has some low-level data corruption, it will be noticed automatically on fetch or push. SVN (and CVS), on the other hand, will happily serve you corrupted data, destroying your local working copies. To summarize: there is very little risk in making potentially corrupted git repositories available, but doing that for SVN or CVS is an extremely bad idea. Just remember the last big CVS incident a couple of years ago. As part of the recovery, an old backup was used as the base, and the current status was rsynced to it, but the --delete option was forgotten. Thus deleted files suddenly showed up again, resulting in all kinds of weird compiler and runtime errors. Great fun. Luckily we caught that within a couple of minutes and SF immediately disabled CVS access again before a huge number of people had an opportunity to do cvs update. In this situation, there is another advantage of git over CVS (I think this is true for SVN as well): server-side git repositories have a small number of files, as git uses blob files for storage. Writing a huge number of files tends to be significantly slower than writing a small number of files with the same total size, due to file system and seek overhead.

      • Flaimbaited Jul 23, 2015 at 12:13 am #

        jbarrett, your response is far superior rationale than what was implied by the above status update (and hendrik’s response to my complaint). I don’t agree with that prioritization strategy for the aforementioned reason that it delayed restoring those *most* harmfully impacted, but it’s at least a strategy with somewhat agnostic rationale. It’s the language and implication that any of the repo technologies were being restored based on some notion of superiority that comes across as ignorant bias and outright offensive. Thanks for your response; I do not envy being in your damage-control shoes but appreciate the massive efforts in play. hendrik, SVN also has built in checksums both client-side and server-side. It will report “Checksum mismatch while reading representation” if the server-side is corrupted. Moreover, client attempts to update that file would result in a similar checksum mismatch error. SVN does not have a nice userland recovery mechanism like git fsck because the likelihood of repo corruption is minimized to one instance — responsibility is delegated to traditional backup/recovery mechanisms. Regardless, it’s a complete non-sequitur response — the sum of any SCM’s features does not make it superior or inferior without knowing the context and requirements in which that SCM is applied. You have absolutely no way of knowing that context. To prioritize on those grounds or claim superiority is utter bandwagon BS.

  9. Nagodar Jul 22, 2015 at 5:05 pm #

    You must be working very hard! good luck and keep on!

    • jbarrett Jul 22, 2015 at 5:37 pm #

      Nagodar, thanks. We have, and we appreciate the support and well wishes. Thanks, SourceForge Support

  10. You guy are working? Jul 22, 2015 at 5:12 pm #

    Storage fault means some nerd either tinkered with the server settings and actually managed to kernel panic the server or he thought since he saw a video on youtube it would be okay to hot swap a storage node. Either way storage fault means our bad fellas.

  11. jbarrett Jul 22, 2015 at 5:34 pm #

    Flaimbaited, Ideally we would like to restore all the services at once, but that simply wasn’t possible. The order in which services were restored was based upon how much time it would take to restore the service, how popular the service is to the community, and size of the data that needed to be restored. When we had the opportunity to restore services we took it. That’s why services have been restored in the order they have been restored. We have been working as fast as we can to restore all of our services and will continue to work 24/7 until all of our services have been restored. If you have any other questions or concerns please feel free to contact us: https://sourceforge.net/support  Thanks SourceForge Support

  12. tullio Jul 22, 2015 at 5:42 pm #

    I am trying to download and install GnuWin32, but it says “This sourceforge.net website is temporarily offline. We are working hard to restore its availability.” Where can I download it? Sincerely

  13. Alberto Jul 22, 2015 at 8:13 pm #

    Hello!! Thanx for your efforts to solve the problem with the failure. I got this error after the restoration: Warning: session_start() [function.session-start]: open(/home/web-sessions/e/t/r/sess_etrvumgpkfo4goprge04ndo6m6, O_RDWR) failed: Permission denied (13) in /home/project-web/youronlineshop/htdocs/index.php on line 10

    • Alberto Jul 23, 2015 at 6:59 pm #

      It is working fine already

  14. Daniel Seagraves Jul 22, 2015 at 10:01 pm #

    I am an admin for NASSP (Project Apollo), and we were discussing a move to Git before this outage began. If CVS has last priority and is going to be down for weeks, is there any way I can get someone to just email me our CVS repository from whatever you have? I could use it to complete our migration. Our change rate was very low until very recently, and I can reconstruct the most recent changes from my working copy, I just would like to be able to save as much history as I can in the migration.

  15. Adam J Richardson Jul 23, 2015 at 4:27 am #

    Hi guys, Hope you get the Subversion service up again soon! I haven’t been able to commit my project for a week and I’m getting itchy. 😛 Thanks.

  16. Harry Thijssen Jul 23, 2015 at 4:40 am #

    It must be hell working at your site right now. Thanks for the work and good luck. I guess we can learn a lot from the post-mortem analysis.

  17. Paul Jul 23, 2015 at 8:21 am #

    I am just trying to download Dr Java onto my Mac. Can I do this now? I got a message that said Dr Java was damaged.

  18. 111 Jul 23, 2015 at 8:32 am #

    I can’t use WinSCP.

  19. Porlock Jul 23, 2015 at 8:34 am #

    Your blog post seems to distinguish between “Allura” and “non-Allura” backed SCM systems. But how can I tell whether my Subversion repository is or isn’t Allura-based? The term doesn’t mean anything to me!

  20. JMMCG Jul 27, 2015 at 4:25 am #

    To SF support: thank you so much for your excellent work in resolving this major outage! I have been a happy cvs user (started back in 2001-ish!) then migrated my repos to svn in 2009-ish (at SF’s advice, using their useful HOWTO). In all of those years this is the first time such a fault has impacted my 5-odd projects. Well done in maintaining the outstanding reliability! I look forward to confirming the restoration of my repos (they looked available on the 26th July). Yay!

  21. vivek Jul 27, 2015 at 2:49 pm #

    File Upload: I can see the file upload button, but when I add a folder I get the error “Unable to create top-level folder”. When will this bug be fixed? I am working on this project: http://sourceforge.net/projects/dataquality/