Data Center Adaptability During COVID-19

There have been many heroes during the COVID-19 pandemic, people who toiled tirelessly to keep essential services running despite the obstacles. We thank them all. For Backblaze, one group that was essential to our business was our Data Center staff. While many of us could work from home this past year, it’s not yet possible for us to replace a hard drive over Zoom or Google Meet.

We’d like to share a peek inside our data centers to see how our staff adapted and persevered over this past year to allow Backblaze to meet the continuously growing data storage requirements of our customers.

“Everyone, please go home.”

On March 6, 2020, at 3:24 p.m. PST, Backblaze CEO Gleb Budman asked all on-site employees to go home due to the pandemic. Everyone. Over the next couple of days, specific departments were designated essential: Physical Media, Facilities, and our Data Center teams. Our Physical Media team manages our USB restore program, our Facilities team is charged with securing our buildings and equipment, and our data center technicians keep our data storage equipment operational.

Each of these functions would have to be on-site to do some portion of their job, so each team was tasked with forming a plan to achieve their on-site tasks with as little person-to-person contact as possible, preferably none.

At the time, precious little information was available other than videos on how to wash your hands and cough into your elbow. The definition of social distancing was still being debated, and the only thing everyone knew for sure was that there wasn’t enough toilet paper or PPE.

There’s Something in the Water

The Backblaze Data Center staff is a wonderful combination of people of all ages from all walks of life—recent college grads and techies to be sure, but also veterans, truck drivers, event planners, and more. For reasons only a social scientist could guess, between February 2020 and August 2020, nearly half of the Data Center staff went out on maternity or paternity leave, which in California ranges up to 12 weeks. Staffing and scheduling around leave is challenging under normal circumstances, but during a pandemic there was no “normal,” so it fell to our data center managers to not only keep the wheels on the bus, but to go faster.

Defining Our New Normal

When everyone was sent home in early March, Cheryl was the interim data center manager at SAC 0, our largest data center. She had been appointed to the position a few weeks earlier when the incumbent data center manager went out on paternity leave. Cheryl, along with Darren and Jon, our SAC 1 and PHX 0 data center managers, respectively, met virtually with Larry, our global data center manager, to define our operating procedures in the new COVID-19 environment.

Cheryl, Interim Data Center Manager at SAC 0.

Data Center Assignments

In the Sacramento area, the two data centers shared several personnel who moved between them as workload dictated. On occasion, one or two techs would travel to PHX 0 to help out as well. To minimize contact, this practice was suspended and all data center techs were assigned to work at a specific data center.

Work Areas

In each data center, Backblaze storage servers are in one or more work areas or cages. We decided that, whenever possible, no more than one person would be assigned to a work area at a time; when that was not possible, those present would stay socially distanced. For example, one person can use a server lift to install or remove a storage server in a rack, but a second person should be nearby in case there are any issues grappling with the 150-pound server.

Getting On-site

We are a tenant in each of our data centers with each one having a slightly different set of rules for COVID-19. We standardized our checks across the board. All workers would be temperature checked each day upon arrival, and that check was recorded and compared to previous readings. A worker showing signs of being ill (coughing, fever, etc.) was sent home to self-quarantine for two weeks. As testing became available, a negative test was required to return to work.

Biometric scanning is used for access to all data centers. There are different types of checks depending on the scanner: thumbprints, fingerprints, handprints, hand scans, eye scans, and so on. During COVID-19, hands in particular dried out from constant cleaning or puffed up with constant moisturizing, both subtly changing the biometric readings. The systems are built to correct for this, but employees heard “try a different finger” a lot from the guards at the different data centers.

Personal Protective Equipment

The team decided that all workers would wear masks and gloves unless gloves needed to be removed to perform a specific task. The Backblaze Facilities group was in charge of acquiring PPE and had the foresight to order many of the needed items weeks ahead of the office shutdown. Even so, supplies were thin the first few weeks. At one point Larry was able to find and buy face shields at a local liquor store. Whatever it took, everyone had the protection they needed.

Maternity and Paternity Returnees

For new parents returning from leave, it was decided that they should work from home until both Backblaze leadership and the employee were confident that the employee could be on-site at a data center without risk to the new child at home.

Assignment Priorities

There are three general categories of work for the data center techs: projects, recurring tasks, and maintenance requests. Projects are tasks like migrating hard drives, recurring tasks include activities such as quarterly inventory checks and wiping decommissioned hard drives, and maintenance requests are things like replacing failed hard drives and crash-carting a non-responsive server.

We initially put all projects on hold to focus on the list of maintenance requests that had stacked up. Once we understood what the on-site skeleton crew could accomplish, we would introduce planned tasks as time allowed. We prioritized the sidelined projects and planned to restart them only when additional techs were able to return on-site.

A New Kind of Clean

Cleanliness is key in a data center, but COVID-19 added a new dimension as everything someone touched needed to be wiped down when they were finished. This was especially true for the first few months of the pandemic as scientists tried to determine how COVID-19 was transmitted from person to person. This created new cleaning experiences, like swabbing a moist, alcohol-laden towelette over a hard drive you had just replaced or wiping the buttons on the server lift controller. The challenge was to keep everything clean without compromising the delicate electronics involved.

Other Changes

For obvious reasons, food and drink are not allowed on data center floors—instead there is the break room, an oasis where the staff can snack, chat, and rest. With COVID-19, chairs and sometimes tables were removed from the break rooms to discourage gathering and to meet social distancing requirements. Community snacks like donuts or a bunch of bananas were replaced with individually wrapped, single serving items. All the while, there was the great Covid coffee quandary: single serve versus a pot of coffee. As long as someone wiped things down after each use/cup, did it matter? The debate rages on.

Getting Back to Business

At SAC 0, Cheryl and the staff, or what was left of the staff, put their plan into play: one on-site shift seven days a week with two data center techs per shift, one for each work area in SAC 0. That count included Cheryl, who would be there every day. Off-hours coverage was on-call, with alerts using the standard call rotation process for anyone who could go on-site.

The first order of business was to get the facility set up for COVID-19 protocols: social distancing, masks, temperature checks, etc. That was followed by catching up on any outstanding maintenance requests such as replacing failed hard drives. The Backblaze Vault software layer was designed and built anticipating drive failure and similar malfunctions, so while Cheryl and her counterparts at the other data centers were dealing with scheduling changes, COVID-19 protocols, and the like, the Backblaze services continued to operate unaffected. Over the next few days, Cheryl and her crew caught up on maintenance requests and the new normal started to settle in.

Living the New Normal

Most of the tasks performed by the data center technicians are physical—requiring the person to be on-site—but some tasks could be done from home over the private VPN: file system checks, scheduling activities, and completing maintenance tickets, to name a few.

Additionally, the home-bound or working-from-home techs were able to take advantage of a wide range of online training courses. Backblaze backed this effort by providing access to, and funding for, sessions on Linux, power management, network security, programming, supervisory skills, and more.

Upon returning to the DC, Cheryl immediately noticed the emptiness. She spoke of the many days when she would check in a technician in the data center in the morning, take and record their temperature, and then not see that person until they checked out at the end of the day. She had the occasional video call with management or with a tech working from home, but there were many hours of nothing more than the hum of hard drives and fans to break the silence. Still, she noted that while storage servers were not the best conversationalists, they do listen very well.

One of the first operational challenges was with the assembly and testing of new Backblaze Storage Pods, our storage servers. A manufacturer makes and packages the parts and ships them to us to assemble, add drives, test, and so on. This is done by our data center techs. Before COVID-19, each data center would bring together several team members to create an assembly line where 20 and sometimes 40 servers would be assembled at a time. Now, a single person was assigned to build and test one server at a time with each completed server being wiped down and readied for deployment as part of a Backblaze Vault, which consists of 20 Storage Pods. Even with this change, the data center techs were able to meet our storage deployment schedule over the past year. This was critical as the thirst for cloud storage continued unabated the entire time.

Another challenge was debugging systems, in particular when a system required crash cart intervention. This doesn’t happen much with operational systems, but when you are testing some new hardware or software, sometimes things go wonky and crash cart debugging is needed. With COVID-19 protocols in play, this required a bit of creative thinking.

Cheryl recounted an occasion where engineers wanted to debug some drive migration code they were testing that kept crashing the target system. The engineers and other related staff would normally be on-site working shoulder to shoulder with Cheryl, but that was no longer an option. She had to be their eyes, ears, and fingers. She started by connecting the crash cart to the affected server and then positioned a laptop on the server lift so the engineers could virtually see the crash cart screen. She then set up a phone or tablet to show needed checklists and send messages to other parties as needed. With everything set up, she was ready to type in commands relayed to her by the engineers on the crash cart keyboard and perform any mechanical steps, such as cycling power or swapping a defective board, as directed. Problem solved.

In addition to her regular duties, Cheryl managed the drive migration efforts at SAC 0 throughout the pandemic. As drives age or show signs of declining durability in our environment, we retire and replace these drive models with newer, typically higher density, hard drives. Prior to COVID-19, Engineering redesigned the entire drive migration process and it was up to Cheryl to design and implement the data center operations side of the process. Once she had the process operational, she had to virtually train the staff in the other data centers so they could follow suit.

There were other twists and turns as the days went by. A few weeks into the pandemic, Darren, our SAC 1 data center manager, went on paternity leave and was replaced by Jack. A few months after that, Jack replaced Jon, our PHX 0 data center manager, for several weeks, and Elliott became the interim data center manager for SAC 1. Elliott normally worked the weekend shift, Friday through Monday, 4:30 a.m. to 3:00 p.m. each day, although most days 3:00 p.m. was more like 5:00 p.m. When he was appointed data center manager for SAC 1 in Jack’s absence, he shifted to the conventional 8:00 a.m. to 5:00 p.m. Monday to Friday schedule for the duration, and then back again when Jack returned, never missing a day along the way.

One Year In

Slowly over the past year, on-site staffing has increased at each of the data centers through a combination of new hires and returning employees. Recently, some folks have been able to snag a vaccine shot and others are scheduled. Masks, social distancing, temperature checks, and so on are still required, and the tables are still missing from the Backblaze break room, but everyone’s adjusted. There are multiple people on each shift and in each work area, and projects like updating our warehouse management system are in full swing. Things are as normal as they can be.

Over this past year, we asked for a lot from our employees who had to be on-site while the rest of us hunkered down in our houses. As Cheryl reflected on the past year, she wasn’t sure how she made it through. “I didn’t know I had it in me,” she observed. Then she shook her head, smiled, and allowed herself, perhaps for the first time, to ponder what she had accomplished. Indeed, our Physical Media group, our Facilities staff, and the entire Data Center team, especially those folks on-site day after day throughout the pandemic, met the new normal head on and persevered, and Backblaze is better for it. We humbly say thank you.

About Andy Klein

Andy Klein is the Principal Cloud Storage Storyteller at Backblaze. He has over 25 years of experience in technology marketing and during that time, he has shared his expertise in cloud storage and computer security at events, symposiums, and panels at RSA, SNIA SDC, MIT, the Federal Trade Commission, and hundreds more. He currently writes and rants about drive stats, Storage Pods, cloud storage, and more.