UW Information Technology

September 19, 2018

The biggest disaster recovery exercise you never noticed

You rely on your UW NetID for easy access to many systems at the University. But what happens if your UW NetID suddenly stops working because the systems that support it go down during an emergency, such as an earthquake?

On August 15, that scenario was tested, when the systems that make up the UW NetID’s “single sign-on” (SSO) capability — used for MyUW, intranets, UW Libraries, Canvas and much more — were turned off in Seattle and switched to the UW’s TierPoint data center near Spokane, which serves as an emergency backup for critical UW systems. And nobody noticed.

“If a major outage or crisis hits western Washington, the infrastructure that enables use of the UW NetID proved it can take over seamlessly from a backup location in eastern Washington,” said Nathan Dors, UW-IT’s Director of Identity and Access Management (IAM).

Called a failover exercise, this validation of the systems’ “geographic diversity” is a key process within a robust disaster recovery program for UW critical systems developed and managed by UW-IT. Coordinated by UW-IT’s Technology Business Continuity, the program includes identifying the systems and applications most critical to the UW’s daily operations, duplicating them at TierPoint, and regularly testing them to make sure they switch over seamlessly.

“Just as if an emergency situation had knocked out systems in the Seattle data centers, the University’s systems that rely on the web SSO environment ‘failed over’ to TierPoint,” said Michael Brogan, IAM Technical Lead. No one in the UW user community reported any issues, he said.

Each aspect of the SSO infrastructure had previously been failed over individually, but this was the first time a comprehensive system exercise had been conducted.

“I believe this was the largest successful failover, in terms of user population and impact, of an actual production environment in UW-IT’s history,” said Mat McBride, UW-IT Disaster Recovery Manager. “And that’s a big deal given the complex architecture and interdependencies inherent in the SSO systems.”