On December 7, an outage in AWS US-East-1 left the Information and Technology Services (ITS) Service Center’s Amazon Connect call center software unable to receive calls or chats: callers heard only a busy signal, and chat users got no response at all. While multiple systems were impacted (including Canvas, MiVideo, and Code42 Crashplan), the Service Center turned to its Disaster Recovery (DR) plan and switched to another Amazon region for backup service. This restored reliable service quickly and allowed the Service Center to receive calls until the outage was resolved several hours later.
The Disaster Recovery process focuses on documenting the detailed procedures needed to recover the IT infrastructure, systems, and applications vital to an organization after a disaster. A “disaster” is an extreme, prolonged disruption to daily operations for your team or organization, such as one caused by a natural disaster, widespread power outage, or even ransomware. Planning for the worst-case scenario also enables teams to adapt more quickly to less intense disruptions.
When creating the Disaster Recovery Plan for your team or organization, follow the Disaster Recovery planning process:
- Write it: Identify the business-critical systems that need a Disaster Recovery plan and dedicate the time needed to document how to restore each system to an operational state.
- Ask “What if disaster strikes tomorrow?” Your team’s DR plan should cover interruptions that could happen now rather than waiting until after the next upgrade or when you believe your team may have downtime.
- Actively pursue feedback from your team, your stakeholders, and other teams whose work and services are integrated with yours.
- Test it: Exercise the plan in a tabletop, drill, or practical test. This ensures everyone with a role in the plan understands their responsibilities, and it exposes gaps or misunderstandings in your plan.
- Maintain it: Dedicate working time to review the document yearly, as well as whenever systems are updated or other factors change.
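For teams that track their plans in a script or spreadsheet export, the "Test it" and "Maintain it" steps above can be sketched as a simple check. This is only an illustration, not an ITS tool: the `DRPlan` record, the `needs_attention` function, the "ticketing" system, the dates, and the reading of "yearly" as 365 days are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Assumption: "yearly" review is interpreted as every 365 days.
REVIEW_INTERVAL = timedelta(days=365)


@dataclass
class DRPlan:
    """Hypothetical record describing one system's DR plan."""
    system: str
    last_reviewed: date
    last_tested: Optional[date] = None          # None means never exercised
    system_changed_since_review: bool = False   # e.g. an upgrade or migration


def needs_attention(plan: DRPlan, today: date) -> List[str]:
    """Return the reasons a plan should be revisited (empty list if none)."""
    reasons = []
    if today - plan.last_reviewed > REVIEW_INTERVAL:
        reasons.append("yearly review overdue")
    if plan.system_changed_since_review:
        reasons.append("system changed since last review")
    if plan.last_tested is None:
        reasons.append("plan has never been tested")
    return reasons


# Illustrative data only; system names and dates are made up.
plans = [
    DRPlan("call center", date(2020, 11, 1), date(2021, 3, 1)),
    DRPlan("ticketing", date(2021, 11, 1)),
]
today = date(2021, 12, 7)

for plan in plans:
    for reason in needs_attention(plan, today):
        print(f"{plan.system}: {reason}")
```

A check like this does not replace the review itself; it only makes an overdue or untested plan visible so the team can schedule the work.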
While creating a Disaster Recovery Plan can be challenging, it is vital work that enables your team and your organization to respond immediately in a crisis. The ITS Service Center’s DR plan aided not only the Service Center but also every member of the U-M community, who were able to receive customer support during an outage that affected many services on campus. Because the ITS Service Center prioritized Disaster Recovery work, the entire university community benefited.
For more information on how to create a Disaster Recovery Plan for your team, refer to the following resources: