An Inside Look at Data Center Storage Integration: A Complex, Iterative, and Sustained Process

Data Center with Backblaze Pod

How and Why Advanced Devices Go Through Evolution in the Field

By Jason Feist, Seagate Senior Director for Technology Strategy and Product Planning

One of the most powerful features in today’s hard drives is the ability to update the firmware of drives already deployed in the field. Firmware changes can be straightforward, such as changing a power setting, or as delicate as adjusting the height a read/write head flies above a spinning platter. By combining customer input, drive statistics, and a wide range of engineering talent, Seagate can use firmware updates to optimize the customer experience for the workload at hand.

In today’s guest post we are pleased to have Jason Feist, Senior Director for Technology Strategy and Product Planning at Seagate, describe how the Seagate ecosystem works.

— Andy Klein

Storage Devices for the Data Center: Both Design-to-Application and In-Field Design Updates Are Important

As data center managers bring new IT architectures online, and as various installed components mature, technology device makers release firmware updates to enhance device operation, add features, and improve interoperability. The same is true for hard drives.

Hardware design takes years; firmware design can unlock the ability for that same hardware platform to persist in the field at the best cost structure if updates are deployed for continuous improvement over the product life cycle. In close and constant consultation with data center customers, hard drive engineers release firmware updates to ensure products provide the best experience in the field. Having the latest firmware is critical to ensure optimal drive operation and data center reliability. Likewise, as applications evolve, performance and features can mature over time to more effectively solve customer needs.

Data Center Managers Must Understand the Evolution of Data Center Needs, Architectures, and Solutions

Scientists and engineers at advanced technology companies like Seagate develop solutions based on understanding customers’ applications up front. But the job doesn’t end there; we also continue to assess and tweak devices in the field to fit very specific and evolving customer needs.

Likewise, the data center manager or IT architect must understand many technical considerations when installing new hardware. Integrating storage devices into a data center is never a matter of choosing any random hard drive or SSD that offers a certain capacity or a certain IOPS specification. The data center manager must know the ins and outs of each storage device and how it affects factors like performance, power, heat, and device interoperability.

But after rolling out new hardware, the job is not done. In fact, the job’s never done. Data center devices continue to evolve, even after integration. The hardware built for data centers is designed to be updated on a regular basis, based on a continuous cycle of feedback from ever-evolving applications and implementations.

Continued in-field quality assurance and updates keep the device aligned with the data center’s evolving needs, so a device will continue to improve in interoperability and performance until the architecture and the device together reach maturity. Managing these evolving needs and technology updates is a critical factor in achieving the best possible TCO (total cost of ownership) for the data center.

It’s important for data center managers to work closely with device makers to ensure integration is planned and executed correctly, monitoring and feedback is continuous, and updates are developed and deployed. In recent years as cloud and hyperscale data centers have evolved, Seagate has worked hard to develop a powerful support ecosystem for these partners.

The Team of Engineers Behind Storage Integration

The key to creating a successful program is to establish an application engineering and technical customer management team that’s engaged with the customer. Our engineering team meets with large data center customers on an ongoing basis. We work together from the pre-development phase to the time we qualify a new storage device. We collaborate to support in-field system monitoring, and sustaining activities like analyzing the logs on the hard drives, consulting about solutions within the data center, and ensuring the correct firmware updates are in place on the storage devices.

The science and engineering specialties on the team are extensive and varied. Depending on the topics at each meeting, analysis and discussion require a breadth of engineering expertise. Dozens of engineering degrees and years of experience are on hand, including experts in firmware, servo control systems, mechanical engineering, tribology, electrical engineering, reliability, and manufacturing. The contributors’ titles include computer engineer, aerospace engineer, test engineer, statistician, data analyst, and materials scientist. Within each discipline are unique specializations, such as ASIC engineers, channel technology engineers, and mechanical resonance engineers who understand shock and vibration factors.

The skills each engineer brings are necessary to understand the data customers are collecting and analyzing, how to deploy new products and technologies, and when to develop changes that’ll improve the data center’s architecture. It takes this team of engineering talent to comprehend the intricate interplay of devices, code, and processes needed to keep the architecture humming in harmony from the customer’s point of view.

How a Device Maker Works With a Data Center to Integrate and Sustain Performance and Reliability

After we establish our working team with a customer, and when we’re introducing a new product for integration into the customer’s data center, we meet weekly to go over qualification status. We do a full design review of new features, compare the previous design with the new one, and discuss how to address any particular requests the customer has for our next product design.

Traditionally, storage component designers would simply comply with whatever the T10 or T13 interface specification says. These days, many cloud data centers are asking for their own special sauce in some form, whether they’re trying to get a certain number of IOPS per terabyte or trying to hold latency to a certain number. For example: “I want to achieve four or five nines at this latency; I want to be able to stream data at this rate; I want to have this power consumption.”
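To make a target like “four or five nines” concrete, here is a minimal, illustrative sketch (not a Seagate tool) of computing tail-latency percentiles from a latency trace; the simulated trace simply stands in for a customer’s real I/O measurements:

```python
# Illustrative sketch: computing the tail-latency "nines" a cloud customer
# might specify as a target. The trace here is simulated; in practice the
# latencies would come from the customer's own I/O measurements.
import numpy as np

def latency_nines(latencies_ms, percentiles=(99, 99.9, 99.99, 99.999)):
    """Return the latency at each requested percentile (the 'nines')."""
    samples = np.asarray(latencies_ms)
    return {f"p{p:g}": float(np.percentile(samples, p)) for p in percentiles}

# Example with one million simulated I/O completion times in milliseconds.
rng = np.random.default_rng(0)
trace = rng.lognormal(mean=1.0, sigma=0.5, size=1_000_000)
for name, value in latency_nines(trace).items():
    print(f"{name}: {value:.2f} ms")
```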

Recently, working with a customer to solve a specific need they had, we deployed Flex dynamic recording technology, which enables a single hard drive to use both SMR (Shingled Magnetic Recording) and CMR (Conventional Magnetic Recording, i.e., perpendicular recording) methods on the same drive media. This required very close integration with the customer’s team. We spent great effort going back and forth on what the interface design should be, what the command protocol should be, and how the drive should behave in certain conditions.

Sometimes a drive design is unique to one customer, and sometimes it’s good for all our customers. There’s always a tradeoff; if you want really high performance, you’re probably going to pay for it with power. But when a customer asks for a certain special sauce, that drives us to figure out how to achieve that in balance with other needs. Then — similarly to when an automaker like Chevy or Honda builds race car engines and learns how to achieve new efficiency and performance levels — we can apply those new features to a broader customer set, and ultimately other customers will benefit too.

What Happens When Adjustments Are Needed in the Field?

Once a new product is integrated, we then continue to work closely from a sustaining standpoint. Our engineers interface directly with the customer’s team in the field, often in weekly meetings and sometimes even more frequently. We provide a full rundown on the device’s overall operation, dealing with maintenance and sustaining issues. For any error that comes up in the logs, we bring in an expert specific to that error to pore over the details.

In any given week we’ll have a couple of engineers in the customer’s data center monitoring new features and, as needed, debugging issues with the drives or with the customer’s system. Any time something seems amiss, we’ve got plans in place that let us do log analysis remotely and in the field.

Let’s take the example of a drive not performing as the customer intended. There are a number of reliability features in our drives that may interact with drive response — perhaps adding latency on the order of tens of milliseconds. We work with the customer on how we can manage those features more effectively. We help analyze the drive’s logs to tell them what’s going on and weigh the options. Is the latency a result of an important operation they can’t do without, and the drive won’t survive if we don’t allow that operation? Or is it something that we can defer or remove, prioritizing the workload goal?
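As a concrete illustration of that triage, here is a hypothetical sketch that pairs host-observed latency spikes with timestamped events parsed from a drive log. The thresholds, data shapes, and event names are invented; real drive logs are vendor-specific.

```python
# Hypothetical sketch: pair host-observed latency spikes with drive-internal
# events from a parsed log, to see whether a background operation explains
# each spike. The threshold and data shapes are invented for illustration.
from datetime import timedelta

SPIKE_THRESHOLD_MS = 50            # "tens of milliseconds" of added latency
MATCH_WINDOW = timedelta(seconds=1)

def attribute_spikes(io_samples, drive_events):
    """io_samples: (timestamp, latency_ms) pairs; drive_events: (timestamp, event_name) pairs."""
    findings = []
    for ts, latency_ms in io_samples:
        if latency_ms < SPIKE_THRESHOLD_MS:
            continue
        nearby = [name for ev_ts, name in drive_events
                  if abs(ev_ts - ts) <= MATCH_WINDOW]
        findings.append((ts, latency_ms, nearby or ["unexplained"]))
    return findings
```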

How Storage Architecture and Design Has Changed for Cloud and Hyperscale

The way we work with cloud and data center partners has evolved over the years. Back when IT managers would outfit business data centers with turn-key systems, we were very familiar with the design requirements for traditional OEM systems with transaction-based workloads, RAID rebuild, and things of that nature. Generally, we were simply testing workloads that our customers ran against our drives.

As IT architects in the cloud space moved toward designing their data centers made-to-order, on open standards, they took a different approach to reliability, using replication or erasure coding to create a more reliable environment. Understanding these workloads, gathering traces, and getting this information back from these customers was important so we could optimize drive performance under new and different design strategies: not just for performance, but for power consumption as well. The number of drives populating large data centers is mind-boggling, and when you realize what the power consumption is, you realize how important it is to optimize the drive for that particular variable.

Turning Information Into Improvements

We have always executed a highly standardized set of protocols on drives in our lab qualification environment, using racks whose characteristics are well understood, so the behavior of the drive in those scenarios is well characterized. By working directly with our cloud and data center partners, we’re constantly learning from their unique environments.

For example, the customer’s architecture may have big fans in the back to help control temperature, and the fans operate with variable levels of cooling: as things warm up, the fans spin faster. At one point we may discover these fan operations are affecting the performance of the hard drive in the servo subsystem. Some of the drive logging our engineers do has been brilliant at solving issues like that. For example, we’d look at our position error signal, and we could actually tell how fast the fan was spinning based on the adjustments the drive was making to compensate for the acoustic noise generated by the fans.

Information like this is provided to our servo engineering team when they’re developing new products or firmware so they can make loop adjustments in our servo controllers to accommodate the range of frequencies we’re seeing from fans in the field. Rather than having the environment throw the drive’s heads off track, our team can provide compensation to keep the heads on track and let the drives perform reliably in environments like that. We can recreate the environmental conditions and measurements in our shop to validate we can control it as expected, and our future products inherit these benefits as we go forward.
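As a rough illustration of the first step in that process, here is a minimal sketch (not Seagate’s servo code) of estimating a fan’s disturbance frequency from a sampled position error signal; the sample rate and analysis parameters are assumed for illustration:

```python
# Illustrative sketch: estimate a fan's disturbance frequency from a sampled
# position error signal (PES). The sample rate and record length are assumed;
# this shows only the spectral-analysis idea, not production servo code.
import numpy as np
from scipy import signal

SERVO_SAMPLE_RATE_HZ = 38_400  # assumed servo sample rate for illustration

def fan_disturbance_hz(pes: np.ndarray) -> float:
    """Return the non-DC frequency carrying the most energy in the PES spectrum."""
    freqs, psd = signal.welch(pes, fs=SERVO_SAMPLE_RATE_HZ, nperseg=4096)
    peak = np.argmax(psd[1:]) + 1  # skip the DC bin
    return float(freqs[peak])

# The servo team could then shape the control loop (for example, adding extra
# rejection around that frequency) so the disturbance no longer throws the
# heads off track.
```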

In another example, we can monitor and work to improve data throughput while also maintaining reliability by understanding how the data center environment affects the read/write head’s ability to fly with stability at a certain height above the disk platter while reading bits. Understanding the ambient humidity and temperature is essential to controlling the head’s fly height. We now have an active fly-height control system, with the controller firmware and servo systems operating based on inputs from sensors within the drive. Traditionally, a hard drive’s fly height was calibrated in the factory, a set-and-forget kind of thing. But with this field-adjustable fly-height capability, the drive continually monitors environmental data. When the environment exceeds certain thresholds, the drive recalculates what the fly height should be, so it flies at the optimal height and achieves the best error rates, ideally providing the best reliability in the field.
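The following is a conceptual sketch of that threshold-triggered recalibration loop. The sensor interface, thresholds, and window sizes are invented for illustration; the real logic lives inside the drive’s controller firmware.

```python
# Conceptual sketch of threshold-triggered fly-height recalibration.
# Sensor access, thresholds, and window sizes are invented for illustration.
import time
from dataclasses import dataclass

@dataclass
class EnvWindow:
    temp_lo: float
    temp_hi: float
    rh_lo: float
    rh_hi: float

    def contains(self, temp_c: float, rh: float) -> bool:
        return self.temp_lo <= temp_c <= self.temp_hi and self.rh_lo <= rh <= self.rh_hi

def fly_height_monitor(read_sensors, recalibrate, window, poll_s=60):
    """Poll temperature/relative humidity; when the environment leaves the
    window the current calibration assumed, recompute the fly-height target
    and adopt a new window around the new conditions."""
    while True:
        temp_c, rh = read_sensors()
        if not window.contains(temp_c, rh):
            recalibrate(temp_c, rh)  # firmware computes the new fly-height target
            window = EnvWindow(temp_c - 2, temp_c + 2, rh - 5, rh + 5)
        time.sleep(poll_s)
```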

The Benefits of In-Field Analysis

These days a lot of information can be captured in logs and gathered from a drive to be brought back to our lab to inform design changes. You’re probably familiar with the SMART logs that drives have traditionally provided; that data gives a static snapshot of a drive’s status at a point in time. In addition, field analysis reliability logs measure environmental factors the drive is experiencing, like vibration, shock, and temperature. We can use this information to consider how the drive is responding and how firmware updates might deal with these factors more efficiently. For example, we might use that data to understand how a customer’s data center architecture might need to change a little bit to enable better performance, reduce heat or power consumption, or lower vibration.
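For readers who want to see what that point-in-time SMART snapshot looks like in practice, here is a minimal sketch that pulls the attribute table with smartmontools. It assumes smartctl version 7 or newer (for JSON output), a SATA drive at /dev/sda, and sufficient privileges to query it.

```python
# Minimal sketch: capture a point-in-time SMART attribute snapshot using
# smartmontools. Assumes smartctl 7+ (JSON output), a SATA drive at /dev/sda,
# and permission to query the device.
import json
import subprocess

def smart_snapshot(device="/dev/sda"):
    """Return the drive's SMART attribute table as a list of dicts."""
    out = subprocess.run(
        ["smartctl", "-A", "--json", device],
        capture_output=True, text=True, check=False,  # smartctl uses bit-flag exit codes
    )
    data = json.loads(out.stdout)
    return data.get("ata_smart_attributes", {}).get("table", [])

for attr in smart_snapshot():
    print(attr["id"], attr["name"], attr["raw"]["value"])
```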

What Does This Mean for the Data Center Manager?

There’s a wealth of information we can derive from the field, including field log data, customers’ direct feedback, and what our failure analysis teams have learned from returned drives. By actively participating in the process, our data center partners maximize the benefit of everything we’ve jointly learned about their environment so they can apply the latest firmware updates with confidence.

Updating firmware is an important part of fleet management that many data center operators struggle with. Some data centers may keep running outdated firmware even when an update is available because they don’t have clear policies for managing firmware. Or they may avoid updates because they’re unsure whether an update is right for their drives or their situation.

Would You Upgrade a Live Data Center?

Nobody wants their team to be responsible for allowing a server to go down due to a firmware issue. How will the team know when new firmware is available, and whether it applies to specific components in the installed configuration? One method is for IT architects to set up a regular quarterly schedule to review possible firmware updates for all data center components. At a minimum, devising a review-and-upgrade schedule requires maintaining a regular inventory of all critical equipment, and setting up alerts or pull-push communications with each device maker so the team can review the latest release notes and schedule time to install updates as appropriate.
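A quarterly review of that kind can be largely mechanized. Below is a sketch of the inventory check under stated assumptions: the inventory records and the latest-firmware lookup are placeholders for an asset database and each vendor’s release feed, and the model and revision strings are invented.

```python
# Sketch of an inventory-driven firmware review. The inventory and the
# latest-firmware lookup are placeholders; model and revision strings are invented.
from dataclasses import dataclass

@dataclass
class Device:
    serial: str
    model: str
    firmware: str

def firmware_review(inventory, latest_firmware_for):
    """Return devices whose installed firmware differs from the latest known release."""
    needs_review = []
    for dev in inventory:
        latest = latest_firmware_for(dev.model)   # e.g., from vendor release notes
        if latest and latest != dev.firmware:
            needs_review.append((dev, latest))
    return needs_review

# Example placeholder data for one quarterly review run.
inventory = [Device("SERIAL-001", "ExampleModel-14TB", "FW01")]
print(firmware_review(inventory, lambda model: {"ExampleModel-14TB": "FW02"}.get(model)))
```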

Firmware sent to the field for the purpose of updating in-service drives undergoes the same rigorous testing that the initial code goes through. In addition, the payload is verified to be compatible with the code and drive model being updated; that means you can’t accidentally download firmware that isn’t intended for the drive, and internal consistency checks reject invalid code. Also, to help minimize performance impacts, firmware downloads can be segmented: the firmware is downloaded in small pieces (the user can choose the size) so the pieces can be interleaved with normal system work and have minimal impact on performance. The host decides when to activate the new code once the download is complete.
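To illustrate the segmented-download idea at the host level, here is a conceptual sketch. The send_segment and activate callables are placeholders for whatever mechanism the host’s storage stack actually exposes; no specific command set is implied.

```python
# Conceptual sketch of a segmented firmware download: the image is sent in
# user-chosen pieces that can be interleaved with normal work, and the host
# activates the new code only after the last segment. The send_segment and
# activate callables are placeholders; no specific command set is implied.
def segmented_download(firmware_path, send_segment, activate, segment_size=64 * 1024):
    with open(firmware_path, "rb") as f:
        image = f.read()
    offset = 0
    while offset < len(image):
        chunk = image[offset:offset + segment_size]
        send_segment(offset, chunk)       # host may schedule these between normal I/O
        offset += len(chunk)
    activate()                            # switch to the new code once complete
```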

In closing, working closely with data center managers and architects to glean information from the field is important because it helps bring Seagate’s engineering team closer to our customers. This is the most powerful piece of the equation. Seagate needs to know what our customers are experiencing because it may be new and different for us, too. We intend these tools and processes to help both data center architecture and hard drive science continue to evolve.

IAmA subreddit June 21

Jason Feist will join other Seagate engineers on the IAmA subreddit at 10 a.m. PDT on June 21 for an AMAA (ask me almost anything) on this article’s topic: “We are Seagate research scientists and engineers. Ask us almost anything about integrating advanced storage in data centers.”

Host account/handle: seagate_surfer
Hosting subreddit: https://www.reddit.com/r/IAmA/
Date: June 21, 2018
Time: 10am-11am PDT


About Andy Klein

Andy Klein is the Principal Cloud Storage Storyteller at Backblaze. He has over 25 years of experience in technology marketing, and during that time he has shared his expertise in cloud storage and computer security at events, symposiums, and panels at RSA, SNIA SDC, MIT, the Federal Trade Commission, and hundreds more. He currently writes and rants about drive stats, Storage Pods, cloud storage, and more.