This customer has a data warehouse that drives internal research and is sold to external customers. The data needs to be accurate and available at all times. Their goal was a new storage architecture that would maximize availability and data protection while minimizing operational complexity and cost. We were able to help them meet those goals.
The Problem
The data warehouse started at 25 TB, covering 2 years of retained data, and would grow by about 10% per year. Some data was available to load throughout the day, while other sources provided bulk feeds during the night. With 2 years of retention, the change rate for the database was under 2% per week. Backups every few days were generally adequate, since there was a separate record of all data imported into the system and the ‘current’ image could be recreated from even a few-day-old copy within a matter of hours.
The business required that the data be available, and that backups be able to restore the system to a point in time up to 1 year old (there had been occasions when small feeds of errant data were not corrected until weeks later, so prudence dictated that extended recovery be possible). Given the value of the operational system, there needed to be space to perform a full restore while continuing production operations. This would also provide space to conduct quarterly testing of the restore processes.
Given the value of the data, the business was interested in the option of having two full images of the data. This would allow for quick resumption of production operations should there be an environmental or other event at the primary location. This was, however, a desire with limited funding, and would only be included if it could be accomplished without greatly increasing the cost of the solution.
Looking to the future, there would be business value in extending the data retention period from 2 years to 5 or even 7 years. The better the solution could scale at reasonable cost and without driving up complexity, the sooner it would be cost effective to keep the data active in the warehouse for a longer period of time.
We met this challenge by focusing on snap technology. A data warehouse like this, with large capacity and a low change rate, is a perfect candidate for storage array snaps. We delivered a pair of Symmetrix arrays with TimeFinder/Snap and SRDF, each with 35 TB of usable capacity (protected with 7+1 RAID 5 on 300 GB 15K drives). This provided room for the current capacity plus another year’s growth. By including 5 TB of capacity for the save area (Symmetrix SAVE devices), multiple snaps were possible.
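To put rough numbers to that sizing, here is a back-of-the-envelope check in Python. The figures come from the deployment above, but the drive-count estimate and the assumption that the 5 TB save area sits alongside the 35 TB of usable capacity are mine.

    import math

    start_tb = 25.0        # initial warehouse size
    growth = 0.10          # ~10% growth per year
    usable_tb = 35.0       # usable capacity per array
    save_tb = 5.0          # SAVE pool for snap copy-on-write data

    year1_tb = start_tb * (1 + growth)    # 27.5 TB after one year
    print(f"Year 1: {year1_tb:.1f} TB used of {usable_tb:.0f} TB usable")

    # RAID 5 (7+1): each 8-drive group yields 7 x 0.3 TB = 2.1 TB usable
    group_tb = 7 * 0.3
    groups = math.ceil((usable_tb + save_tb) / group_tb)
    print(f"~{groups} RAID groups, ~{groups * 8} x 300 GB drives per array")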
To provide the space for full restores, we first considered building a clone in the same array. However, that would have meant sizing the array to handle a full restore, and testing against that restore, without significant impact to production.
Given that there was also business value in having a separate image of the data, the solution included a second Symmetrix array connected with SRDF. By placing the second image in a separate array, we gained a completely separate platform on which to conduct restore testing or other operations when needed. And when it was not needed for that, it served as a business continuity solution at minimal additional cost. The two arrays were within the same campus, providing plenty of bandwidth at minimal latency.
The Results
The customer chose to make snaps 3 times per week, with the option to do more if they found value in it later. They make database snaps on Tuesday and Thursday evenings in the production (source) array, each kept for one week. On Saturdays they make alternating snaps from the target system, each kept for 2 weeks, so the two most recent Saturday images are always on hand.
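To see what restore points that schedule yields on any given day, here is a small sketch; the function and the example date are illustrative only.

    from datetime import date, timedelta

    def active_snaps(today):
        """Restore points still within retention under the schedule above."""
        snaps = []
        for back in range(21):                      # look back three weeks
            d = today - timedelta(days=back)
            if d.weekday() in (1, 3) and back < 7:  # Tue/Thu source snaps, kept 1 week
                snaps.append((d, "source"))
            if d.weekday() == 5 and back < 14:      # Saturday target snaps, kept 2 weeks
                snaps.append((d, "target"))
        return snaps

    for when, where in active_snaps(date(2009, 7, 1)):
        print(when, where)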
Backups are run weekly from the Saturday snap copy. After the backup, the Saturday snap is often mounted on a failover server for further data integrity testing, just to be sure. Quarterly restore testing is conducted against the target array, validating the backup and restore processes. Of course, this also invalidates the snaps on the target, as well as the business continuity options while this testing is conducted.
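A minimal sketch of that Saturday cycle as it might be driven with SYMCLI follows. The group name, host commands, and exact symsnap options are assumptions (they vary by Solutions Enabler version), not the customer's actual scripts.

    import subprocess

    def run(cmd):
        print("+", cmd)
        subprocess.run(cmd, shell=True, check=True)

    run("symsnap -g tgtdg create -noprompt")      # pair standard devices with VDEVs
    run("symsnap -g tgtdg activate -noprompt")    # fix the point-in-time image
    run("ssh backuphost run_weekly_backup")       # placeholder backup job
    run("ssh failoverhost mount_and_check_snap")  # placeholder integrity test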
In the event of data problems, the various snaps are mounted to standby systems to validate the most recent alternate image that does not have the problem. Then a decision can be made to either repair production, or to update the snap with all of the most recent data and a repaired copy of the damaged data feed. If the repairs are done on the local snaps, the production database is shut down. The disk restore process is almost instant, and the production database is restarted on the restored image (with the data copy completing in the background).
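Scripted, the local flow might look roughly like this; the device group and host commands are placeholders, not the production procedure.

    import subprocess

    def run(cmd):
        print("+", cmd)
        subprocess.run(cmd, shell=True, check=True)

    run("ssh dbhost stop_db")                    # placeholder: shut down production
    run("symsnap -g proddg restore -noprompt")   # pointer-based restore, near instant
    run("ssh dbhost start_db")                   # restart on the restored image; the
                                                 # track copy-back completes in background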
In the event that the repairs are done on the target array, the process is only a bit more involved. The chosen snap copy is updated just as the local snap would be. SRDF is then suspended, and the snap is restored locally. Once that restore is in process, the production database is shut down. An SRDF restore is initiated with the ‘invalidate local updates’ option, which is required since both the source and target have changed. SRDF then begins the differential resynchronization of the local data to match the desired target image. Once this resynchronization has started, the production database is restarted on the restored image (and as with the local snap, the data copy continues in the background).
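A sketch of the target-side sequence follows. Note that '-force' here merely stands in for the ‘invalidate local updates’ option described above; it is an assumption, so check the symrdf documentation for the exact syntax in your release.

    import subprocess

    def run(cmd):
        print("+", cmd)
        subprocess.run(cmd, shell=True, check=True)

    run("ssh standbyhost apply_fixes_to_snap")   # placeholder: update the chosen snap copy
    run("symrdf -g proddg suspend -noprompt")    # stop R1 -> R2 replication
    run("symsnap -g tgtdg restore -noprompt")    # restore the chosen snap on the target
    run("ssh dbhost stop_db")                    # placeholder: stop production
    # Copy R2 back to R1; both sides have changed, so local (R1) updates
    # must be invalidated -- shown here as -force, an assumed flag.
    run("symrdf -g proddg restore -force -noprompt")
    run("ssh dbhost start_db")                   # restart while resync continues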
In the event that a tape restore of an even older copy is needed, the target array is used as the restore location, and updates are then applied just as they would be to a target snap copy. Since SRDF tracks changes between the source and target as a bit flag on each track, the restore is still nominally differential. However, since the tape restore writes back all of the data (even tracks that have not changed), every track is flagged and the SRDF restore effectively becomes a full array copy. The process remains the same, though with so much more data to move, the copy will take longer to complete. And production can still be restarted while the copy is in process.
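To illustrate why this case degenerates into a full copy, here is a toy model of the per-track change flags; the class is purely illustrative and not an EMC data structure.

    class TrackTable:
        """Toy model of per-track 'owed to the other side' flags."""
        def __init__(self, tracks):
            self.invalid = [False] * tracks

        def mark(self, track):
            self.invalid[track] = True    # set when a track is written while split

        def tracks_to_copy(self):
            return sum(self.invalid)

    tt = TrackTable(1_000_000)

    # A snap restore of one bad feed touches only a few tracks:
    for t in (42, 43, 9000):
        tt.mark(t)
    print(tt.tracks_to_copy(), "tracks to resynchronize")

    # A tape restore rewrites every track, so every flag is set and the
    # 'differential' restore becomes, in effect, a full copy:
    for t in range(1_000_000):
        tt.mark(t)
    print(tt.tracks_to_copy(), "tracks -- effectively a full array copy")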
Note that for major database changes, SRDF is suspended between the arrays. In cases where there is a problem, SRDF can undo the updates at the storage level, ensuring that the change window can always be met even if something goes very wrong.
All of the devices are collected into device groups on the arrays. The Symmetrix Group Name Service (GNS) ensures that device group updates are kept consistent between the arrays, and are available to all hosts that may be used to control the replication operations. The group operations make management of the various copies simple.
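Setting up such a group might look roughly like the following; the group name and device IDs are made up, and enabling GNS itself is a Solutions Enabler configuration step not shown here (consult the documentation for your version).

    import subprocess

    def run(cmd):
        print("+", cmd)
        subprocess.run(cmd, shell=True, check=True)

    run("symdg create proddg -type RDF1")    # group for the production (R1) devices
    run("symld -g proddg add dev 0123")      # add each warehouse device (example IDs)
    run("symld -g proddg add dev 0124")
    # With GNS running, the group definition propagates to the other array and
    # to every attached control host, so symsnap/symrdf can be driven by name.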
Additional Options
The customer could run more snaps on the source or target. However, most data corruption is fixed by making changes to production, not by restoring from a copy. Since the daily change rate is so small and there are so few restores, there is no justification yet for performing the snaps more frequently. Since it can take days to discover the details behind a bad data import, doing daily snaps but only keeping them a couple of days was thought to be much less useful.
The customer could add ATA space for a clone at the target site, allowing for testing of restores without disrupting the disaster recovery copy.
The backups are currently written to physical tape for offsite vaulting. A Data Domain deduplicating backup target would both improve backup speed and be able to replicate the changes to a remote location with minimal bandwidth. A low-change-rate database like this one gets the most benefit from backup deduplication technology. It should be very cost effective to keep 3 months of weekly backups on such a system.
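As a rough, idealized sizing of that option (real deduplication results also depend on compression and metadata overhead, so treat these numbers as illustrative):

    full_tb = 27.5            # approximate database size (25 TB plus growth)
    weekly_change = 0.02      # under 2% change per week, per the numbers above
    weeks = 13                # ~3 months of weekly backups

    # The first full backup lands whole; each later weekly backup dedupes
    # down to roughly the changed data.
    stored_tb = full_tb + (weeks - 1) * full_tb * weekly_change
    print(f"~{stored_tb:.1f} TB stored for {weeks} weeklies "
          f"vs ~{weeks * full_tb:.0f} TB of full-size backup images")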
Conclusion
This customer was able to meet all of their business requirements without buying an excess of infrastructure. The snaps give them quick recovery from multiple points in time with very little additional disk capacity. All the data recovery options can be executed while production continues with the best available data. SRDF provides a target system for backups and restore testing, and also provides a target for system recovery should the source site be compromised. With the proper tools, large warehouses are easy to store and protect.