AWS Disaster Recovery: Ensuring Business Continuity

AWS Disaster Recovery Overview

AWS Disaster Recovery Key AWS services

Key factors for Disaster Planning

AWS Disaster Recovery Scenarios

AWS Disaster Recovery Scenarios Options

AWS Disaster Recovery Pilot Light

AWS Disaster Recovery Warm Standby

AWS Disaster Recovery Multi-Site

AWS Disaster Recovery Overview

- The AWS Disaster Recovery whitepaper highlights AWS services and features that can be leveraged for disaster recovery (DR) processes to significantly minimize the impact on data, systems, and overall business operations.
- It outlines best practices to improve your DR processes, from minimal investments to full-scale availability and fault tolerance, and describes how AWS services can be used to reduce cost and ensure business continuity during a DR event.
- Disaster recovery (DR) is about preparing for and recovering from a disaster. Any event that has a negative impact on a company’s business continuity or finances could be termed a disaster. One of the AWS best practices is to always design your systems for failure.

AWS Disaster Recovery for Key AWS services

Region

- AWS services are available in multiple regions around the globe, so the DR site location can be selected as appropriate, in addition to the primary site location.

Storage

Amazon S3

- provides a highly durable (99.999999999%) storage infrastructure designed for mission-critical and primary data storage.
- stores objects redundantly on multiple devices across multiple facilities within a region.

Amazon Glacier

- provides extremely low-cost storage for data archiving and backup.
- Objects are optimized for infrequent access, for which retrieval times of several (3-5) hours are adequate.

Amazon EBS

- provides the ability to create point-in-time snapshots of data volumes.
- Snapshots can then be used to create new volumes, which can be attached to running instances.

AWS Storage Gateway

- a service that provides seamless and highly secure integration between an on-premises IT environment and the AWS storage infrastructure.

AWS Import/Export

- accelerates moving large amounts of data into and out of AWS by using portable storage devices for transport, bypassing the Internet.
- transfers data directly onto and off of the storage devices using Amazon's high-speed internal network.

Compute

Amazon EC2

- provides resizable compute capacity in the cloud, which can be easily created and scaled.
- EC2 instances can be created from preconfigured AMIs.
- EC2 instances can be launched in multiple AZs, which are engineered to be insulated from failures in other AZs.

Networking

Amazon Route 53

- is a highly available and scalable DNS web service.
- includes a number of global load-balancing capabilities that can be effective in DR scenarios, e.g. DNS endpoint health checks and the ability to fail over between multiple endpoints.

Elastic IP

- Elastic IP addresses are static IP addresses designed for dynamic cloud computing.
- They enable masking of instance or Availability Zone failures by programmatically remapping the address to another instance.

Elastic Load Balancing (ELB)

- performs health checks and automatically distributes incoming application traffic across multiple EC2 instances.

Amazon Virtual Private Cloud (Amazon VPC)

- allows provisioning of a private, isolated section of the AWS cloud where resources can be launched in a defined virtual network.

AWS Direct Connect

- makes it easy to set up a dedicated network connection from an on-premises environment to AWS.

Databases

- RDS, DynamoDB, and Redshift are provided as fully managed RDBMS, NoSQL, and data warehouse solutions, respectively, which can scale up easily.
- DynamoDB offers cross-region replication.
- RDS provides Multi-AZ and Read Replicas, as well as the ability to copy snapshots from one region to another.

Deployment Orchestration

CloudFormation

- gives developers and systems administrators an easy way to create a collection of related AWS resources and provision them in an orderly and predictable fashion.

Elastic Beanstalk

- is an easy-to-use service for deploying and scaling web applications and services.

OpsWorks

- is an application management service that makes it easy to deploy and operate applications of all types and sizes.
- The environment can be defined as a series of layers, and each layer can be configured as a tier of the application.
- has automatic host replacement, so a failed instance is replaced automatically.
- can be used in the preparation phase to template the environment, and combined with AWS CloudFormation in the recovery phase.
- Stacks can be quickly provisioned from the stored configuration to support the defined RTO.

Key factors for Disaster Planning

Recovery Time Objective (RTO): the time it takes after a disruption to restore a business process to its service level, as defined by the operational level agreement (OLA). For example, if the RTO is 1 hour and a disaster occurs at 12:00 p.m. (noon), the DR process should restore the systems to an acceptable service level within an hour, i.e. by 1:00 p.m.

Recovery Point Objective (RPO): the acceptable amount of data loss, measured in time, before the disaster occurs. For example, if a disaster occurs at 12:00 p.m. (noon) and the RPO is one hour, the system should recover all data that was in the system before 11:00 a.m., so at most the last hour of data is lost.
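
Both objectives reduce to simple time arithmetic. The sketch below reproduces the noon example from the two definitions; the function names and the calendar date are illustrative assumptions, not part of any AWS API.

```python
from datetime import datetime, timedelta


def recovery_deadline(disaster_time: datetime, rto: timedelta) -> datetime:
    """Latest time by which service must be restored to meet the RTO."""
    return disaster_time + rto


def oldest_acceptable_recovery_point(disaster_time: datetime, rpo: timedelta) -> datetime:
    """Data written before this time must be recoverable to meet the RPO."""
    return disaster_time - rpo


def meets_rpo(last_backup: datetime, disaster_time: datetime, rpo: timedelta) -> bool:
    """True if the most recent backup is recent enough to satisfy the RPO."""
    return last_backup >= oldest_acceptable_recovery_point(disaster_time, rpo)


disaster = datetime(2024, 1, 1, 12, 0)  # hypothetical date; disaster at noon
one_hour = timedelta(hours=1)

print(recovery_deadline(disaster, one_hour))                 # 2024-01-01 13:00:00
print(oldest_acceptable_recovery_point(disaster, one_hour))  # 2024-01-01 11:00:00
print(meets_rpo(datetime(2024, 1, 1, 11, 30), disaster, one_hour))  # True
```

A backup taken at 10:30 a.m. would fail the same check, because data written between 10:30 and 11:00 would be lost beyond the one-hour RPO.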

Disaster Recovery Scenarios

- Disaster Recovery scenarios can be implemented with the primary infrastructure running in your data center in conjunction with AWS.
- Disaster Recovery scenarios still apply if the primary site is running in AWS, by using multiple AWS regions.
- Combinations and variations of the options below are always possible.

Disaster Recovery Scenarios Options

1. Backup & Restore (data backed up and restored)
2. Pilot Light (only minimal critical functionality)
3. Warm Standby (fully functional, scaled-down version)
4. Multi-Site (active-active)

Across the DR scenario options, RTO and RPO decrease while cost increases as you move from the Backup & Restore option to the Multi-Site option.
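
One way to reason about this trade-off is to pick the cheapest option whose achievable RTO still meets the business requirement. The sketch below is a minimal illustration: the cost ordering follows the list of options, but the RTO figures are placeholder assumptions, not AWS measurements, and should be replaced with values measured in your own DR tests.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass(frozen=True)
class DrOption:
    name: str
    cost_rank: int             # 1 = cheapest, 4 = most expensive
    achievable_rto: timedelta  # PLACEHOLDER: substitute your own test results


# Ordered as in the options list; the RTO values are illustrative assumptions.
OPTIONS = [
    DrOption("Backup & Restore", 1, timedelta(hours=24)),
    DrOption("Pilot Light", 2, timedelta(hours=4)),
    DrOption("Warm Standby", 3, timedelta(minutes=30)),
    DrOption("Multi-Site", 4, timedelta(minutes=1)),
]


def cheapest_option(required_rto: timedelta,
                    options: List[DrOption] = OPTIONS) -> Optional[DrOption]:
    """Return the lowest-cost option whose achievable RTO meets the requirement."""
    candidates = [o for o in options if o.achievable_rto <= required_rto]
    return min(candidates, key=lambda o: o.cost_rank) if candidates else None


choice = cheapest_option(timedelta(hours=6))
print(choice.name if choice else "no option meets the requirement")  # Pilot Light
```

The same selection could be extended with an RPO field; it is left out here to keep the sketch short.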

Backup & Restore

AWS can be used to back up data in a cost-effective, durable, and secure manner, as well as to recover the data quickly and reliably.

Backup phase

In most traditional environments, data is backed up to tape and sent off-site regularly. With this process, restoring the system after a disruption or disaster takes longer.

1. Amazon S3 can be used to back up data and perform a quick restore, and it is accessible from any location.
2. AWS Import/Export can be used to transfer large data sets by shipping storage devices directly to AWS, bypassing the Internet.
3. Amazon Glacier can be used for archiving data where retrieval times of several hours are adequate and acceptable.
4. AWS Storage Gateway enables snapshots of on-premises data volumes to be transparently copied into S3 for backup. It can be used either as a backup solution (Gateway-stored volumes) or as a primary data store (Gateway-cached volumes).
5. AWS Direct Connect can be used to transfer data directly from on-premises to AWS consistently and at high speed.
6. Snapshots of Amazon EBS volumes, Amazon RDS databases, and Amazon Redshift data warehouses can be stored in Amazon S3.

Restore phase

The backed-up data can then be used to quickly restore and create compute and database instances.

Key steps for Backup and Restore:
1. Select an appropriate tool or method to back up the data into AWS.
2. Ensure an appropriate retention policy for this data.
3. Ensure appropriate security measures are in place for this data, including encryption and access policies.
4. Regularly test the recovery of this data and the restoration of the system.
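
The retention step (step 2) can be sketched as a small pruning rule. This is a minimal illustration in plain Python, assuming backups are identified by their creation timestamps; the function name `expired_backups` is hypothetical, and the rule of never expiring the newest backup is a design assumption rather than something the whitepaper prescribes.

```python
from datetime import datetime, timedelta
from typing import List


def expired_backups(backup_times: List[datetime], now: datetime,
                    retention: timedelta) -> List[datetime]:
    """Return backups older than the retention window, never the newest one."""
    if not backup_times:
        return []
    newest = max(backup_times)
    cutoff = now - retention
    # Keep the newest backup even if it is past retention, so a restore
    # point always exists.
    return sorted(t for t in backup_times if t < cutoff and t != newest)


now = datetime(2024, 3, 1)
backups = [datetime(2024, 1, 1), datetime(2024, 2, 1), datetime(2024, 2, 28)]

# With a 30-day retention, only the 1 January backup is past the cutoff.
print(expired_backups(backups, now, timedelta(days=30)))
```

In practice the returned list would feed whatever deletion mechanism the chosen backup tool provides.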

AWS Disaster Recovery Pilot Light

In the Pilot Light DR scenario, a minimal version of the environment is always running in the cloud. It hosts the critical functionality of the application, e.g. databases.

In this approach:

1. Maintain a pilot light by configuring and running the most critical core elements of your system in AWS, e.g. databases, where the data needs to be replicated and kept up to date.
2. During recovery, a full-scale production environment, e.g. application and web servers, can be rapidly provisioned (using preconfigured AMIs and EBS volume snapshots) around the critical core.
3. For networking, either use an ELB to distribute traffic to multiple instances and have DNS point to the load balancer, or use preallocated Elastic IP addresses associated with the instances.

Preparation phase steps:

1. Set up Amazon EC2 instances or RDS instances to replicate or mirror critical data.
2. Ensure that all supporting custom software packages are available in AWS.
3. Create and maintain AMIs of key servers where fast recovery is required.
4. Regularly run these servers, test them, and apply any software updates and configuration changes.
5. Consider automating the provisioning of AWS resources.

Recovery phase steps:

1. Start the application EC2 instances from your custom AMIs.
2. Resize existing database/data store instances to process the increased traffic. For example, RDS can be easily scaled vertically, while EC2 instances can be easily scaled horizontally.
3. Add additional database/data store instances to give the DR site resilience in the data tier, e.g. turn on Multi-AZ for RDS.
4. Change DNS to point at the Amazon EC2 servers.
5. Install and configure any non-AMI-based systems, ideally in an automated way.

AWS Disaster Recovery Warm Standby

- In a Warm Standby DR scenario, a scaled-down version of a fully functional environment, identical to the business-critical systems, is always running in the cloud.
- This setup can be used for testing, quality assurance, or internal use.
- In case of a disaster, the system can be easily scaled up or out to handle the production load.

Preparation phase steps:

1. Set up Amazon EC2 instances to replicate or mirror data.
2. Create and maintain AMIs for faster provisioning.
3. Run the application using a minimal footprint of EC2 instances or AWS infrastructure.
4. Patch and update software and configuration files in line with your live environment.

Recovery phase steps:

1. Increase the size of the Amazon EC2 fleets in service with the load balancer (horizontal scaling).
2. Start applications on larger Amazon EC2 instance types as needed (vertical scaling).
3. Either manually change the DNS records, or use Route 53 automated health checks to route all the traffic to the AWS environment.
4. Consider using Auto Scaling to right-size the fleet or accommodate the increased load.
5. Add resilience or scale up your database to guard against the DR environment itself going down.
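
The horizontal scaling step can be estimated with simple capacity arithmetic. The sketch below is a rough sizing aid under stated assumptions: the per-instance capacity, peak load, and 20% headroom are hypothetical inputs that you would measure or choose yourself, and the function names are illustrative.

```python
import math


def instances_needed(peak_rps: int, per_instance_rps: int,
                     headroom_pct: int = 20) -> int:
    """Fleet size for the production load plus a percentage of spare headroom."""
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    # Integer arithmetic in the numerator avoids floating-point rounding surprises.
    return max(1, math.ceil(peak_rps * (100 + headroom_pct) / (100 * per_instance_rps)))


def scale_out_delta(current: int, peak_rps: int, per_instance_rps: int) -> int:
    """How many instances to add to the standby fleet during recovery."""
    return max(0, instances_needed(peak_rps, per_instance_rps) - current)


# Hypothetical standby fleet of 2 instances facing 1000 requests/second,
# with each instance handling about 150 requests/second.
print(instances_needed(1000, 150))            # 8
print(scale_out_delta(2, 1000, 150))          # 6
```

An Auto Scaling group would normally make this adjustment automatically; the calculation is still useful for setting its minimum and maximum sizes.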

AWS Disaster Recovery Multi-Site

- Multi-Site is an active-active DR approach, in which an identical solution runs on AWS alongside your on-site infrastructure.
- Traffic can be distributed across both infrastructures as needed, using weighted routing in the DNS service.
- In case of a disaster, the DNS weighting can be adjusted to send all traffic to the AWS environment, and the AWS infrastructure scaled accordingly.

Preparation phase steps:

1. Set up your AWS environment to duplicate the production environment.
2. Set up DNS weighting, or a similar traffic routing technology, to distribute incoming requests to both sites.
3. Configure automated failover to re-route traffic away from the affected site, e.g. have the application check whether the primary DB is available and, if not, redirect to the AWS DB.

Recovery phase steps:

1. Either manually or by using DNS failover, change the DNS weighting so that all requests are sent to the AWS site.
2. Have application logic for failover to use the local AWS database servers for all queries.
3. Consider using Auto Scaling to automatically right-size the AWS fleet.
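
The weighted routing and failover behaviour can be illustrated with a small simulation. This is not Route 53 code: it is a plain Python sketch of the idea, and the site names, weights, and function names are assumptions chosen for the example.

```python
import random
from typing import Dict


def pick_site(weights: Dict[str, int], rng: random.Random) -> str:
    """Choose a site with probability proportional to its weight."""
    sites = [s for s, w in weights.items() if w > 0]
    if not sites:
        raise RuntimeError("no site has a positive weight")
    return rng.choices(sites, weights=[weights[s] for s in sites], k=1)[0]


def fail_over(weights: Dict[str, int], failed_site: str) -> Dict[str, int]:
    """Return new weights that send no traffic to the failed site."""
    return {s: (0 if s == failed_site else w) for s, w in weights.items()}


weights = {"on-site": 50, "aws": 50}   # normal operation: traffic split evenly
after = fail_over(weights, "on-site")  # disaster at the on-site data center

rng = random.Random(0)
print(after)                                        # {'on-site': 0, 'aws': 50}
print({pick_site(after, rng) for _ in range(100)})  # {'aws'}
```

With the on-site weight set to zero, every request resolves to the AWS site, which mirrors step 1 of the recovery phase.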