Scheduled maintenance: Lessons learned

By Gleb Budman | September 28th, 2010

Sorry and Thank You
Last week we had a maintenance window that was scheduled for 12 hours. Instead, our core services were offline for a day and a half, with some backups throttled for an additional day.

I am very sorry for any inconvenience that this caused. Rest assured, we take any disruption in service very seriously. I also want to thank our customers; I was amazed at how calm and supportive they were during this time.

As of Saturday at 3pm, everything has been working and the service is live for all users. Please note that at no time was your backed-up data at risk.

What follows is more technical detail on what happened and what we intend to do.

The Original Maintenance Plan
We worked on our “central authority” cluster that maintains customer metadata, handles billing, prepares restores, etc. This is unrelated to the storage pods where all the backed-up data is stored.

The maintenance plan was to migrate the metadata to another server running an upgraded OS and then to update permissions on that data. Operations like this with large volumes of data take time. We estimated the time based on a previous maintenance and wrote a multi-threaded script to update the permissions in an attempt to accelerate the process.

What Happened
Our data had grown a great deal since our previous maintenance, and our estimate did not properly account for the extra time that growth required. Then, the permissions update script failed to update all the files because there were too many threads. Rather than trying to fix the multi-threaded script in a rush, we ran the script single-threaded. This took quite a bit longer to run, but was safer than trying to rewrite code in a hurry.
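For readers curious what a middle ground between "too many threads" and "single-threaded" can look like, here is a minimal sketch in Python. It is not the script we ran. The function names, the file mode, and the use of a fixed-size worker pool are all assumptions made for illustration; the cap of 20 is the limit named in the lessons below. The sketch returns every file it could not update, so a partial failure is visible instead of silent.

```python
import os
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 20  # assumed cap, matching the limit named in the lessons


def update_permission(path, mode):
    """Set one file's mode; return (path, error), with error None on success."""
    try:
        os.chmod(path, mode)
        return path, None
    except OSError as exc:
        return path, exc


def update_all(paths, mode=0o640, max_workers=MAX_WORKERS):
    """Update every path using a bounded pool and return the failures."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda p: update_permission(p, mode), paths)
        # Collect failures so they can be retried or inspected afterwards.
        return [(path, err) for path, err in results if err is not None]
```

An empty list from update_all means every file was updated; anything else is the exact set of paths to retry.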

We brought the site back up on Friday afternoon, but all customers starting to back up concurrently overwhelmed the system. We brought the service down briefly and started slowly allowing customers to back up again. (During this time, restores and other services were fully functional.) By 3pm Saturday, all customers were fully operational.
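The idea of slowly letting customers back in can be sketched as follows. This is an illustration under stated assumptions, not our production code: the callbacks enable, load_ok, and wait are hypothetical stand-ins for whatever re-enables a customer, checks system load, and pauses between checks.

```python
def admission_batches(customer_ids, batch_size):
    """Yield successive groups of customers, batch_size at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(customer_ids), batch_size):
        yield customer_ids[start:start + batch_size]


def readmit(customer_ids, batch_size, enable, load_ok, wait):
    """Enable customers batch by batch, pausing while the system is overloaded."""
    for batch in admission_batches(customer_ids, batch_size):
        # Hold the next batch until the (hypothetical) load check passes.
        while not load_ok():
            wait()
        for customer in batch:
            enable(customer)
```

The batch size and the load check are the two knobs: a smaller batch or a stricter check brings users back more slowly but puts less strain on the service.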

Taking a step back, we had a few basic lessons learned:

1. Estimate better.
This is not just an “eat your vegetables” approach. We have the data to produce better estimates. Specifically, we will factor in data growth rather than using previous maintenance experience.

2. Limit the permissions updating process to 20 threads.
Threads are good. Too many threads are not.

3. Bring the site back online in stages.
We have a lot of users. They have a lot of data. When the site comes online there is a massive flood of data and requests that strain the service. Bring users back incrementally.
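To make the first lesson concrete, factoring in data growth comes down to simple arithmetic. The sketch below assumes the run time scales roughly linearly with the volume of data, which is an assumption and not something we measured. The function name, the safety margin, and the numbers in the example are invented for illustration; they are not our actual figures.

```python
def estimate_hours(previous_hours, previous_data, current_data, margin=1.0):
    """Scale a past maintenance duration by data growth (assumes linear scaling)."""
    if previous_data <= 0:
        raise ValueError("previous_data must be positive")
    growth = current_data / previous_data
    return previous_hours * growth * margin


# Hypothetical numbers: an 8-hour job, rerun after the data has tripled.
print(estimate_hours(8, 100, 300))              # 24.0
print(estimate_hours(8, 100, 300, margin=1.5))  # 36.0
```

Reusing the old 8-hour figure in this made-up case would underestimate the window by a factor of three, before any safety margin is applied.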

Again, thank you for your patience and we hope to keep helping you protect your data for a long time to come.

Gleb Budman
Co-founder and CEO of Backblaze. Founded three prior companies. He has been a speaker at GigaOm Structure, Ignite: Lean Startup, FailCon, CloudCon; profiled by Inc. and Forbes; a mentor for Teens in Tech; and holds 5 patents on security.
