High Availability & Disaster Recovery in Oracle Cloud Infrastructure

Critical applications must run 24/7 and tolerate hardware failures, software failures, and even complete data center outages. Oracle Cloud Infrastructure, with its regions, Availability Domains (ADs), and Fault Domains (FDs), provides the building blocks needed to design and run high availability and disaster recovery architectures for your applications and databases.

A region is a localized geographic area composed of one or more availability domains, each of which contains three fault domains. Availability domains are isolated, fault-tolerant data centers that are unlikely to fail simultaneously. Fault domains let you distribute your resources across different physical hardware within a single availability domain.

We will use resources within a single region to construct a highly available architecture. For a true disaster recovery solution, use a second region so that the distance between your primary and DR sites meets your business continuity requirements.

Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are the two key metrics to consider first when choosing a solution that can maintain business continuity after an unexpected event:

  • RTO: targeted duration within which a business process must be restored after an outage (tolerated downtime).
  • RPO: maximum tolerable amount of data loss, expressed as the period of time before the outage for which data may be lost.

Oracle Cloud Infrastructure (OCI) provides many options for different RPO and RTO requirements.

Let’s start! First, we will consider different application designs and later add the database design to it.

Application – Basic Architecture

The minimum design is one compute instance running the application, with its boot and block volumes backed up to Object Storage.

In case of a server outage, we need to restore the compute instance from Object Storage to the same or a different availability domain. RTO depends on the amount of data to be restored (minutes to hours), and RPO depends on the backup frequency (minutes to hours).
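The dependency described above can be made concrete with a little arithmetic. The following is only a rough sketch: the restore throughput, the provisioning time, and all numbers in the example are assumptions chosen for illustration, not measured OCI figures.

```python
from dataclasses import dataclass


@dataclass
class BackupPlan:
    backup_interval_min: float            # time between two backups
    data_to_restore_gb: float             # size of boot + block volumes
    restore_throughput_gb_per_min: float  # assumed effective restore speed
    provisioning_min: float = 0.0         # assumed fixed time to launch the new instance


def worst_case_rpo_min(plan: BackupPlan) -> float:
    """An outage just before the next backup loses one full interval of data."""
    return plan.backup_interval_min


def estimated_rto_min(plan: BackupPlan) -> float:
    """Restore time grows with the data volume, plus a fixed provisioning step."""
    return plan.provisioning_min + plan.data_to_restore_gb / plan.restore_throughput_gb_per_min


# Hypothetical numbers: hourly backups, 200 GB to restore at 4 GB/min.
plan = BackupPlan(backup_interval_min=60, data_to_restore_gb=200,
                  restore_throughput_gb_per_min=4, provisioning_min=10)
print(worst_case_rpo_min(plan))  # 60
print(estimated_rto_min(plan))   # 60.0
```

With these assumed values, both the worst-case RPO and the estimated RTO come out at about one hour. Backing up more often shortens the RPO but does nothing for the RTO.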

Use a reserved public IP to assign the same IP address to your new compute instance, so that clients connecting to the application need no configuration or DNS changes.

A smarter solution is to hide the application in a private subnet and put a load balancer in front of it. Load balancers are highly available by design and are always reachable at the same IP address.

However, this doesn’t improve our RTO and RPO, as we still have only one application server, which is a Single Point of Failure (SPOF). It is a preparation for the next step.

Application – HA Architecture

For high availability, we need multiple application servers across different locations to avoid a SPOF. Depending on your design and requirements, you can spread the servers across different FDs, ADs, or regions and implement the redundancy in either standby or active mode.

  • Standby mode: the secondary or standby server runs side-by-side with the primary. When the primary fails, the standby takes over. This mode is typically used for applications that need to maintain their states.

RTO is reduced to the few seconds needed for the passive server to take over. RPO is zero, as all data is replicated in real time to another location and there is no data loss in case of an outage. A backup strategy to Object Storage is still recommended.

  • Active mode: all servers actively participate in performing the same tasks. When one of the servers fails, its tasks are simply distributed to the remaining application servers. This mode is typically used for stateless applications.

With this design we achieve a zero RTO and RPO for our application.
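To illustrate the active mode described above, here is a toy round-robin dispatcher that skips failed servers. It is only a sketch of the idea and does not represent how the OCI Load Balancer is implemented. The class name `ActivePool` and the server names are hypothetical.

```python
from itertools import cycle


class ActivePool:
    """Toy round-robin dispatcher over stateless application servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        """Record a failed health check for the given server."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Put a recovered server back into rotation."""
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        """Return the next healthy server, skipping failed ones."""
        if not self.healthy:
            raise RuntimeError("no healthy application server available")
        # One full turn of the ring visits every server once.
        for _ in range(len(self.servers)):
            server = next(self._ring)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy application server available")


pool = ActivePool(["app-fd1", "app-fd2", "app-fd3"])
pool.mark_down("app-fd2")  # simulate an outage of one server
print([pool.pick() for _ in range(4)])
# ['app-fd1', 'app-fd3', 'app-fd1', 'app-fd3']
```

Because the servers are stateless, requests that would have gone to the failed server are simply served by the others, which is why this mode needs no takeover step.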

Great! So let’s take a look at the database HA and DR design now!

Database – Basic Architecture

The simplest design is to have one database server running in one location. In Oracle Cloud Infrastructure you can deploy your databases on Virtual Machine, dedicated Bare Metal, or even Exadata systems for intensive workloads.

In this scenario, the database is our SPOF and needs to be restored from backup in case of an outage. RTO depends on the amount of data to be restored (minutes to hours). RPO depends on the archive log backup frequency, which is usually in the range of minutes.

Database – Real Application Clusters (RAC)

An Oracle RAC database provides horizontal scalability and high availability by adding more database servers (database instances) that access the same database on shared storage simultaneously in active-active mode.

When RAC is implemented on Compute VMs in OCI, each RAC node is placed in a separate Fault Domain (FD). In case of a software or hardware failure on one of the nodes, the database is still accessible through the other nodes, so we have no data loss (RPO=zero) and no service interruption (RTO=zero).

However, a storage failure or data center outage still requires a restore from backup, as in the previous design.

Database – Data Guard

To protect against a server AND a data center outage, we need to implement Data Guard, which provides a secondary database that is stored in another data center and synchronized in real time.

RTO is the time needed for failover (usually a few seconds), and there is no data loss (RPO=zero) when using Data Guard Maximum Protection mode, which ensures that zero data loss occurs if the primary database fails.

Database – RAC & Data Guard

To benefit from both solutions and achieve RTO=zero in case of a server outage and RPO=zero in case of an AD outage, a RAC database can be implemented as the primary in one data center and synchronized by Data Guard to a standby RAC database in a secondary location.

This architecture combines the benefits of both RAC and Data Guard and it is the recommended architecture for Maximum Availability Architecture (MAA).

Standard Edition – Refreshable PDB Switchover

To achieve higher availability with Standard Edition on OCI, Refreshable PDB Switchover can be implemented. However, this doesn’t replace a Data Guard solution with all its benefits, such as Automatic Failover, Reinstate on Failover, Backup from Standby, and many others.

RPO can be as short as one minute, and RTO is the time needed for the switchover.

Oracle Autonomous Database

Now, if you are thinking: which database design should I choose? How is it maintained? And who should do it? I really don’t care about the “how”! I just want a highly available database for my application! Then the solution is very easy: go for Autonomous Database, get an SLA of 99.95%, and focus on your application development and business.

In this example, the Autonomous Database is deployed on Shared Infrastructure within the Oracle Services Network.
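To put the 99.95% figure into perspective, the small calculation below converts an availability percentage into the downtime it still permits. The 30-day month and 365-day year are assumptions made for the arithmetic. How the SLA is actually measured is defined in Oracle's service terms, not here.

```python
def allowed_downtime_minutes(sla_percent: float, period_hours: float) -> float:
    """Maximum downtime in minutes that still meets the SLA over the period."""
    return (100.0 - sla_percent) / 100.0 * period_hours * 60.0


# Assumed periods: a 30-day month and a 365-day year.
print(round(allowed_downtime_minutes(99.95, 30 * 24), 1))   # 21.6
print(round(allowed_downtime_minutes(99.95, 365 * 24), 1))  # 262.8
```

In other words, 99.95% corresponds to roughly 22 minutes of downtime per month, or about 4.4 hours per year.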

Single AD Regions

If you are implementing your solution in an OCI region with a single AD, spread your components across separate Fault Domains instead of Availability Domains, or choose a second region as a disaster recovery site.
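Spreading components across Fault Domains can be pictured as a simple round-robin assignment. The sketch below shows only the placement logic and makes no OCI API calls. The function name, the instance names, and the fault domain labels are illustrative assumptions.

```python
def spread_across_fault_domains(instances, fault_domains):
    """Assign instances to fault domains round-robin so no domain holds them all."""
    if not fault_domains:
        raise ValueError("at least one fault domain is required")
    return {name: fault_domains[i % len(fault_domains)]
            for i, name in enumerate(instances)}


# Three fault domains per availability domain, as described earlier.
fds = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]
placement = spread_across_fault_domains(["app1", "app2", "app3", "app4"], fds)
for instance, fd in placement.items():
    print(instance, "->", fd)
```

With four instances and three fault domains, the fourth instance wraps around to the first fault domain, so losing any single fault domain still leaves at least two instances running.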

Cross Regions HA & DR

In this scenario we are distributing the resources across different regions instead of different FDs or ADs within the same region.

With Remote VCN Peering, it is possible to connect two VCNs in different regions and communicate using private IP addresses without routing the traffic over the internet.

Database Cloud Service on OCI allows you to set up Data Guard across regions with just a few clicks in the Console.

Failover of network traffic can be done automatically using OCI Traffic Management Steering Policies.
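The idea behind failover steering is easy to sketch: answer with the highest-priority endpoint that currently passes its health check. The code below is a simplified illustration of that logic, not the OCI Traffic Management API. The endpoint names, the `priority` field, and the health-check callback are assumptions.

```python
def steer(endpoints, is_healthy):
    """Return the first healthy endpoint in priority order (lowest number first)."""
    for endpoint in sorted(endpoints, key=lambda e: e["priority"]):
        if is_healthy(endpoint["name"]):
            return endpoint["name"]
    return None  # no healthy endpoint left in any region


endpoints = [
    {"name": "lb-primary-region", "priority": 1},
    {"name": "lb-dr-region", "priority": 2},
]

# Hypothetical health state: the primary region is down.
health = {"lb-primary-region": False, "lb-dr-region": True}
print(steer(endpoints, lambda name: health[name]))  # lb-dr-region
```

Once the primary region passes its health check again, the same rule routes traffic back to it without any manual change.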

Hybrid Cloud

A Data Guard standby database can also be implemented in Oracle Cloud while keeping the primary database in your local data center.

Another option is to ship the backups to Oracle Cloud Object Storage and benefit from its high availability, high durability, and lower cost.

Conclusion

Depending on your requirements, you can build a highly available design across multiple Fault Domains, Availability Domains, or regions. Using an active-active application design and distributing the network traffic with a Load Balancer allows you to avoid downtime in case of an outage. With Data Guard, you can create a standby database that is synchronized in real time to avoid data loss. For disaster recovery, spread your resources across multiple regions so that the distance between your primary and DR sites meets your requirements.
