The Network and Data Center (DC) teams recently partnered to perform a power validation for the network devices at the Michigan Academic Computing Center (MACC). The effort was a field audit to evaluate the current state of network device power, identify areas for improvement, and make recommendations for future changes.
This validation is part of a reliability initiative at Information and Technology Services and was sponsored by Andy Palms, executive director of infrastructure, Jim Behm, executive director of enterprise systems, and Bob Jones, executive director of support services.
Redundancy is key to reliability
Maintaining clean, uninterrupted power to network devices is a core requirement: If devices lose power, nothing works. For that reason, most network devices have two power supplies. The extra power supply is redundant. If power is lost on one, the entire load is carried by the other power supply.
A power supply can go down for a variety of reasons:
- The power supply itself might fail
- A rack’s power strip, also called a power distribution unit (PDU), might fail
- A failure could occur farther up the electrical infrastructure
At the MACC, the upper-level electrical infrastructure has a high level of redundancy. That is why the DC team can maintain clean and protected power to the rack devices during major and minor preventative maintenance events. Equipment can be de-energized and power can be transferred during maintenance without devices losing power at the rack level.
The MACC has redundant and independent uninterruptible power supply (UPS) systems, with enough capacity in either system to maintain the data center load without interruption if the other goes down due to a failure or planned maintenance event. Each rack has two PDUs. One power supply on a network device is fed from one rack PDU, while the other power supply is fed from the second rack PDU. When the two rack PDUs are in turn fed from different floor PDUs, the configuration is called true redundancy, and it covers any single point of failure all the way up through the power infrastructure.
There are several different possible power configurations in addition to the true redundant configuration. For example, the dual power supplies on the network device might go back to only one static switch floor PDU. This configuration provides protection for most failure modes, but does not provide protection if a single static switch floor PDU has a catastrophic failure.
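The difference between these configurations can be sketched in a few lines of Python. This is only an illustration, not the audit's actual tooling: the component labels below are hypothetical, and each feed is simplified to the chain of components upstream of one power supply. Any component that appears in both feeds is a single point of failure for the device.

```python
def shared_components(feed_a, feed_b):
    """Return the upstream components that both feeds depend on.

    Each feed is a list of component labels, ordered from the rack
    upward. A component present in both lists is a single point of
    failure for the device.
    """
    in_b = set(feed_b)
    return [component for component in feed_a if component in in_b]


# Hypothetical labels, for illustration only.
true_redundant = (
    ["rack-pdu-A", "floor-pdu-1", "ups-A"],
    ["rack-pdu-B", "floor-pdu-2", "ups-B"],
)
single_floor_pdu = (
    ["rack-pdu-A", "floor-pdu-1"],
    ["rack-pdu-B", "floor-pdu-1"],
)

print(shared_components(*true_redundant))    # []
print(shared_components(*single_floor_pdu))  # ['floor-pdu-1']
```

An empty result corresponds to the true redundant configuration. A non-empty result names the component whose failure would take down both power supplies at once.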
First steps in audit and recommendations
The Network team provided the DC team with a detailed list of all network devices in the MACC that were to be power audited. The DC team then went to every device to trace out and document the current state of the power circuits and their configuration. The audit also checked the capacity of each rack PDU to ensure that, if one PDU goes down, the other PDU and power supply can carry the full electrical load that would shift onto them. After surveying all the network devices, the team evaluated each device and categorized its current state.
- A green classification was for the true redundant configuration.
- A yellow classification indicated that an improvement could be made to bring the device to the true redundant configuration.
- A red classification indicated that the device was susceptible to a single point of failure and action was needed.
- An orange classification applied to network devices with a single power supply that are used for rack PDU monitoring. The power configuration of these devices could be improved, but this was not necessarily needed since a loss of PDU monitoring would not cause a significant incident (SI).
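The capacity check and the color categories can be sketched together as follows. This is a rough illustration under stated assumptions, not the teams' actual method: the `Device` fields, the rule that two supplies on the same rack PDU count as red, the rule that a shared floor PDU counts as yellow, and the 80 percent continuous-load derating are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    rack_pdus: list            # rack PDU feeding each power supply
    floor_pdus: list           # floor PDU behind each rack PDU, same order
    pdu_monitor: bool = False  # True for rack PDU monitoring devices


def classify(device):
    """Map a device's power configuration to an audit color (assumed rules)."""
    if len(device.rack_pdus) < 2:
        # Single power supply: orange only for PDU monitoring devices.
        return "orange" if device.pdu_monitor else "red"
    if len(set(device.rack_pdus)) < 2:
        return "red"     # both supplies depend on one rack PDU
    if len(set(device.floor_pdus)) < 2:
        return "yellow"  # rack PDUs differ but share a floor PDU
    return "green"       # true redundant configuration


def survives_failover(load_a_amps, load_b_amps, pdu_rating_amps, derate=0.8):
    """Check that one rack PDU can carry the load of both after a failure.

    The derate factor is an assumed safety margin for continuous load.
    """
    return load_a_amps + load_b_amps <= pdu_rating_amps * derate


# Hypothetical devices and readings, for illustration only.
switch = Device("switch-1", ["A", "B"], ["F1", "F2"])
router = Device("router-1", ["A", "B"], ["F1", "F1"])
monitor = Device("pdu-mon-1", ["A"], ["F1"], pdu_monitor=True)

print(classify(switch), classify(router), classify(monitor))
print(survives_failover(10, 12, 30))  # 22 A fits within 24 A
print(survives_failover(14, 12, 30))  # 26 A exceeds 24 A
```

Keeping the capacity check separate from the color makes it possible to flag a rack whose wiring is fully redundant but whose surviving PDU would be overloaded after a failure.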
With these classifications, the DC team also documented power and/or network solutions for devices in the yellow, red, and orange classifications.
ITS management will review the solutions and recommendations and approve which ones to implement. The Network team and DC team will continue to work together to plan, coordinate, and implement these improvements. In doing so, we will most likely avoid future SIs in which a device unexpectedly loses power and goes down, disrupting services.
A big thanks to both the Network and Data Center teams for their efforts with this power validation to support the reliability improvement initiative.
Brian Antosh and Eric Lakin, ITS Infrastructure, contributed to this story