RPO and RTO- a worked example

This information i read in Wikipedia (here's the link  ) , im in the midst of designing and writing a proposal for a small organizations for their backup and DR solutions.  In DR scenario, two most important things you must remember is RTO & RPO.  Below is the explanations from wiki page:




The above figure is an example of how RPO and RTO might pan out in a practical situation. Tape is used for backup in this example. The tapes are sent offsite once per day at around the same time, but this timing is not fully guaranteed. The offsiting operation does happen to occur at roughly the same time of day in the chart above. The daily backup offsiting tasks in this example are as follows:
  • A set of backups are made to tape, possibly via a disk staging area? The synchronisation point for each set of backups is late in the backup operation in this example as several large databases have to be backed up and all of them are required for a Synchronisation Point (this is typical of such systems).
  • After that the tapes have to be ejected, collated, and catalogued as they are boxed. It is often the case that offsiting operations are batched across a wide spectrum of systems at a data centre; generally the backups for all services have to wait for the very last one to be created and boxed before they can be sent to the loading bay for transport.
  • Pickups by offsite data repositories are expensive. Generally a daily pickup with a reasonably priced contract will have only an approximate time for pickup and will be predicated on the data centre being ready with the tapes when the van turns up- extra pickups will be generally too expensive to contemplate on a regular basis so a data centre must build contingency time into the preparation period before the pickup is due to occur.
All of which must be done before the pickup- and all of which must be included in the RPO calculation because the synchronisation point being sent offsite depends on backups that were started very near to the start of these activities. So: a recovered service, after a restore from one of these daily backups, will be very likely to start up as at the end of the online day perhaps 13 or so hours or more, before the restored tapes were driven away from the Production data centre.
Against this background, suppose that a Major Incident occurs just before an offsiting pick up (worst case) and as always the assumption is "total site loss, instantly"- so the prepared backups never leave the site. In this case the RPO is set to 48 hours- only twice the normal offsiting cycle. As it happens, on this occasion pickups have been regular for a while and you might make the mistake of thinking that because two offsiting operations have occurred within the RPO period noted above, you have two sets of tapes you might be able to use and still be within the RPO. This is not the case- the earlier set of tapes will produce a recovered service as at a recovery point that is much older than it needs to be to meet the 48 hour RPO. In this example perhaps 12 or 13 hours over that time. In this example, consider the effect of the latest set of offsited tapes being rendered useless by a critically defective tape in the set (perhaps a 5-10% chance?)- as you can see by the example above, you can now NOT meet the RPO at all. Tape capacity is increasing all the time- fewer tapes mean that individual tape defects damage more backed up data.
To complete the picture, the RTO is noted above too. In this case the service was recovered well before the RTO limit was hit. It is however interesting to contemplate the fact that in this example the RTO does NOT start just after the Major Incident. In this example, as often there is in reality, there is seemingly too much delay. A quick decision to go to invocation of the ITSC Plan is always the best decision; in principle... The rule in setting an RTO should be that the RTO is the longest period of time the business can do without the IT Service in question. On the back of this appropriately economic decisions must be taken at the design stage about how the IT Service is built and run. It must be allowed however that some time has to be spent in making the decision to invoke the ITSC Plan, this decision time is an unknown variable- remember too there are often quite large sums of money spent immediately the decision to invoke is taken- staff being called in for extended periods of 24 hour working cover and large fees charged by some recovery service providers. In the example, there is the almost inevitable fudge that the RTO is set to the maximum time the business can do without the service whilst knowing full well that there is very likely to be a period of decision making before it.


1 comments:

What is RTO and RPO said...

Thanks for sharing all concepts of AWS Disaster Recovery. Also you nicely explained what is RTO and RPO.