Recovering a single system after a disaster can be difficult. Recovering hundreds is a mess. Recovering them in a consistent fashion, so that their data dependencies are all correct, is a huge task. And what if they run different operating systems? There is an effective answer.
The Problem
This customer has over 1,000 separate systems that are critical to ongoing operations and are located in a core data center. The systems are a combination of Mainframe, UNIX variations (AIX, HP-UX, Linux, etc.), and Windows. They use SAN and NAS storage (I count the Mainframe CKD over FICON as SAN), with over half a petabyte of unique data and various additional copies. They have SAP and other systems with data flowing between them, often across different operating systems, with no easy way to perform data isolation (picture an interaction of web services, shipping, billing, and warehouse data and you get the general idea). The customer wanted to implement business restart, with clear business requirements.
Recovery Point Objective (RPO)
There is value in having the data during any site recovery be as current as possible. These are not public finance systems, so the value of a minute of data was not measured in millions of dollars, but it has real corporate value. The goal was set to lose less than 5 minutes of data.
Recovery Time Objective (RTO)
These systems are critical to ongoing operation of customer-facing systems, including web sites. In the event of a true site disaster, there was a desire to have the systems recover quickly. Some systems are more customer-facing than others, so the RTOs ranged from 2 to 12 hours. Of course, the 'commutative property of Disaster Recovery' (any system that I depend on needs to be recovered as fast as I am so that I can use it) brought many systems down toward the 2-hour target. Since this short time window was not much longer than the time required to restart most of these systems, the data needed to be ready to go when the restart decision was made. Restoring from an alternate source was not going to be an option.
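To make the 'commutative property' concrete, here is a minimal Python sketch of how a tight RTO pulls down the RTO of everything underneath it. The system names, the hour values, and the function name `effective_rto` are all hypothetical; this illustrates the reasoning and is not the customer's actual dependency map or planning tool.

```python
from collections import deque


def effective_rto(own_rto_hours, depends_on):
    """Tighten each system's RTO to the tightest RTO of anything relying on it.

    own_rto_hours: dict of system name -> its own RTO target in hours
    depends_on:    dict of system name -> list of systems it needs in order to run
    """
    effective = dict(own_rto_hours)
    queue = deque(effective)
    while queue:
        system = queue.popleft()
        for dependency in depends_on.get(system, []):
            # A dependency must be back at least as fast as the system using it.
            if effective[system] < effective[dependency]:
                effective[dependency] = effective[system]
                queue.append(dependency)  # re-check what it depends on in turn
    return effective


# Hypothetical systems and targets, for illustration only.
own = {"web": 2, "billing": 8, "shipping": 12, "warehouse": 12, "reporting": 12}
deps = {
    "web": ["billing"],
    "billing": ["shipping"],
    "shipping": ["warehouse"],
    "reporting": ["warehouse"],
}
result = effective_rto(own, deps)
print(result)
```

In this made-up example, only the reporting system keeps its 12-hour target, because nothing with a tighter target depends on it. Everything in the chain below the web tier ends up at 2 hours.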
Recovery Geography Objective (RGO)
These systems support customers across the US and, with some functions, internationally as well. The recovery needed to be able to handle geographic issues such as the Northeast power outage and still be available. The plan had to deal with the loss of power or communications not just for a site, but for a 500-mile radius.
The Solution
We proposed a solution of Symmetrix arrays with SRDF/A and Multi-Session Consistency (MSC). To handle this scale of unique data, there were 8 arrays at each site. SRDF/A provided the data replication, connected with Cisco FCIP technology over 4 OC-12 links. SRDF moves the data from the (outsourced) source site to the target.
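As a rough sanity check on the link sizing, the arithmetic can be sketched in Python. The only figure taken from outside this article is the standard OC-12 line rate of 622.08 Mbit/s. The calculation ignores SONET and FCIP overhead as well as any compression, so treat the result as an upper bound rather than a measured throughput; the function names are my own.

```python
OC12_LINE_RATE_MBPS = 622.08  # standard OC-12 line rate, megabits per second
LINKS = 4                     # number of OC-12 links in this design
RPO_SECONDS = 5 * 60          # the 5-minute RPO goal


def aggregate_mbytes_per_second(links=LINKS, rate_mbps=OC12_LINE_RATE_MBPS):
    """Raw aggregate capacity in megabytes per second (no protocol overhead)."""
    return links * rate_mbps / 8


def max_transfer_gb(seconds, mbytes_per_second):
    """Most data (in GB) the links could move in the given window."""
    return seconds * mbytes_per_second / 1000


capacity = aggregate_mbytes_per_second()
window_gb = max_transfer_gb(RPO_SECONDS, capacity)
print(f"{capacity:.2f} MB/s aggregate, at most {window_gb:.1f} GB per 5 minutes")
```

The point of the exercise: if the sustained rate of changed data at the source exceeds what the links can carry, the target falls further behind with every cycle and the RPO cannot hold.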
SRDF MSC has the unique ability not only to create multiple groups within a Symmetrix system, but also to define groups across systems, and then take actions on a group as a whole without sacrificing consistency. SRDF/A sends the differences between one point in time and the next (the 'delta set') from the source to the target with each cycle switch. By default, the cycles operate within a given array and are independent of each other. MSC has the ability to synchronize the cycle switches, so that multiple arrays will make the switch at the same point in time. This creates a single image of all of the data that is consistent from a write-dependency point of view. The delta sets are then transmitted to the remote site, and are either all applied (if they all make it completely) or all discarded. In this way, the target site always has a valid, consistent view of the data on disk across the arrays.
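The all-or-nothing behavior can be illustrated with a toy model in Python. This is only a sketch of the idea described above, not EMC's implementation: the `Array` class, `coordinated_switch`, `apply_at_target`, and the block names are all hypothetical, and real arrays deal with far more than a dictionary of writes.

```python
class Array:
    """Toy source array that collects writes into the current delta set."""

    def __init__(self, name):
        self.name = name
        self.capture = {}  # writes gathered during the current cycle

    def write(self, block, value):
        # A later write to the same block replaces the earlier one in the cycle.
        self.capture[block] = value

    def switch_cycle(self):
        # Close the current delta set and start collecting a new one.
        delta, self.capture = self.capture, {}
        return delta


def coordinated_switch(arrays):
    """Switch every array at the same logical point, yielding one delta set each."""
    return {array.name: array.switch_cycle() for array in arrays}


def apply_at_target(target, delta_sets, received_ok):
    """Apply every delta set, or none of them, so the target stays consistent."""
    if not all(received_ok.get(name, False) for name in delta_sets):
        return False  # at least one delta set is incomplete: discard them all
    for name, delta in delta_sets.items():
        target.setdefault(name, {}).update(delta)
    return True


source = [Array("A"), Array("B")]
target = {}

# Cycle 1: both delta sets arrive completely, so both are applied.
source[0].write("order-1", "placed")
source[1].write("invoice-1", "open")
applied_first = apply_at_target(
    target, coordinated_switch(source), {"A": True, "B": True}
)

# Cycle 2: the delta set from array B does not arrive completely.
source[0].write("order-1", "shipped")
source[1].write("invoice-1", "paid")
applied_second = apply_at_target(
    target, coordinated_switch(source), {"A": True, "B": False}
)

print(applied_first, applied_second, target)
```

After the failed second cycle, the target still holds the image from the first cycle on both arrays. It never shows the order as shipped while the invoice is still open, which is exactly the kind of cross-array inconsistency the coordinated switch is meant to prevent.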
The big magic of MSC is that we are not trying to get all of the writes to the target in order. Given the level of I/O on these systems, the complexity of putting a precise enough time on each I/O without some common clock to reference would be massive. The common Mainframe solution for this problem is to use the clock from MVS to provide the timestamp on the I/Os, so that they can be replayed at the target site in the right order even if there are multiple arrays involved. In this case, there are many unrelated systems without the level of clock synchronization to make such a solution possible. By managing the cycle switch times across the arrays, SRDF MSC eliminates this problem.
Servers are located at both the source and target sites. People and processes are in place to deliver a quick restart after the declaration of a disaster at the source site.
The sites are over 600 miles apart, with separate power and telecommunications grids.
The Results
The customer has a single, consistent replica of over 500 TB of data, more than 600 miles away, that is within 2 minutes of the state of the source at any given time. While MVS is in control, the data also supports UNIX and Windows systems running a wide variety of applications. With this design, recovery is greatly simplified since all of the interconnecting systems will recover with data from the same point in time.
All of the business objectives have been met, and there are no near-term limits on scale or other items that would impact the future usefulness of this solution. They are free to add or change anything above the storage layer and maintain the needed business consistency.
Additional Observations
Most large businesses have a DR plan. However, many of these plans overlook critical items or assume things that may not be accurate. It is difficult for many businesses to understand the value of the last X minutes of data. For a manufacturing concern, the value of the last hour of production and shipping, or even the last day, may not make a significant difference in the value of the business. For an online business, the value of the last hour may be very significant. For an investment or banking firm, losing the last hour of records may mean the end of the business. Figuring out where each part of a business lives on the recovery value spectrum can be an arduous task, and shortcuts can lead to bad decisions.
This customer realized that they did not want to own two data centers, so they found a good partner and own only one.
They also considered the fact that the restart of the business after a disaster can actually be more difficult than ongoing operations. And if things are bad enough that you have to shut down a data center, there are probably real problems in the primary data center geography. Even if air travel or roads are available, employees may have family or other concerns that could prevent them from traveling to the alternate site. So how do you get all that expertise to the target site, and make it OK for them to stay there for what may be months?
They decided that they needed to staff both sites. They have an operational team for each. And the big difference from what many customers have done is that they have the engineering team at the target site. The primary data center is the outsourced one, supporting ongoing operations with a staff trained for that. The alternate data center is where the ongoing changes are designed, and where all the expertise lives to handle any challenges that come up in the event of a site failover. This is a very innovative solution to the problem of staffing the target site, and I believe more customers will follow this in the future.
Conclusion
This customer has a long-standing relationship with EMC. The companies have been doing business together for years, as EMC provides them with solutions that help them to meet their business needs. The customer takes the time to meet with EMC and discuss their needs in detail, including trips to meet with our management and engineering staff to help us better understand where they are going with IT and what we might do to help. By building a real partnership, we are able to work together to solve their ongoing challenges. SRDF MSC is an example of a technology that EMC has built to help our customers meet their business needs.