Database Outage? Who cares???

Introduction

Have you ever experienced this before? You click on “Book Now” for a trip or a hotel reservation, but get “an unknown error has occurred”, or even no feedback at all? You are ultimately torn between packing your bags and repeating the booking process. To avoid such situations, Application Continuity is there.

By the way, I originally wrote this article for the German Technology Blog. To promote this powerful Oracle database features more, here once again in English.

Database High Availability

Oracle offers high availability and disaster recovery solutions such as Oracle Real Application Clusters (RAC) and Oracle (Active) Data Guard. In a 2-node RAC configuration for example, if one of the database instances or the entire database server fails, the database remains accessible through the other database instance.

But what does this outage mean from the application’s point of view? How should the application handle this event to make it transparent for the end-user? Let’s have a look at different scenarios.

In case of an outage at the database instance or server level, new connections will be forwarded to one of the remaining, survived instances. To avoid long TCP timeouts, use SCAN (recommended), SCAN VIPs, or the server VIPs in your connection string instead of the physical server IPs.

(FAILOVER=ON)
(CONNECT_TIMEOUT=3)(RETRY_COUNT=3)
(ADDRESS=(PROTOCOL=TCP)(HOST=scan-name)(PORT=1521))

Any already established and inactive connections can be failed over to a surviving database instance using:

Oracle offers these technologies since version 10g.

Using FAILOVER_TYPE=SELECT in TAF allows the clients to survive outages transparently for the end-user if they were in the middle of a SELECT statement. After connection failover, the SELECT statement is re-executed and the cursor position is adjusted. The client continues fetching rows as if nothing had happened.

Now, what about the more interesting and complicated use case, when the session is in the middle of a transaction, doing inserts, updates, and deletes? This is where Transaction Guard and (Transparent) Application Continuity come into play.

Transaction Guard (TG)

Imagine a session that started a transaction, performed several DML operations, and executed a COMMIT at the end of the transaction. However, due to an outage, the session did not get any response from the database. After re-establishing the connection (this time to a survived database instance), the new session does not have any information about the status of the last transaction. Was the COMMIT successful, or not?

If you decide to re-execute the transaction, but the first one was committed successfully, you’ll end up with a duplicate hotel reservation, which is logical data corruption in the database. In case you don’t, but the first one has been failed, you’ll end up with no reservation at all, which <irony> heads to a very satisfied customer </irony>.

It is obvious that we need a way to figure out the status of the last transaction after a session failover. And this is what Transaction Guard exactly allows. Transaction Guard was released in version 12.1 and provides a protocol and API to query the status of the last transaction.

The status of the last transaction is stored in the database. After session failover, the new session queries the Commit Outcome of the last transaction using a Logical Transaction ID (LTXID).

Transaction Guard is available in Enterprise Edition and can be used with RAC, RAC One Node, and (Active) Data Guard Physical Standby.

The drawback of Transaction Guard is that it requires changing the application code, i.e. programming the functionality of querying the commit outcome and deciding whether or not to re-execute the last transaction.

Application Continuity (AC)

Using Application Continuity, as it is based on Transaction Guard, the same functionality is provided, but this time the execution is done automatically, without any changes in the application code! Is that great? Yes! It IS great!

When a transaction is re-executed (Replay) after an outage, no error messages are returned to the user, it just appears as no more than a delayed execution of the request.

Not only outages but also planned maintenance like patching or restarting a database instance, e.g. after changing static parameters, are now entirely transparent for the end-user. The DBAs don’t need to do that in a maintenance window at night or at weekends anymore.

Furthermore, Application Continuity is aware of functions that retrieve a new value every time they are executed (mutables), e.g. SYSDATE or NEXTVAL for sequences. In the case of a Replay, the value of the first transaction is reused.

Application Continuity tracks changes within the database, but not outside of the database, like creating a file on the file system or sending an email (side effects). In the case of a Replay, actions outside of the database are executed again. However, you can deactivate this behavior.

Application Continuity covers errors caused due to failure of foreground processes, network, storage, or database servers, etc. (recoverable errors). These failures are independent of the application logic. Errors produced by a user, e.g. entering an invalid value in a numeric field (unrecoverable errors), must be handled as usual by the application itself.

Starting with release 12.1, Application Continuity supports applications using JDBC Thin Driver. This is why you’ll find this feature called “Application Continuity for Java”. Release 12.2.0.1 supports Oracle Call Interface (OCI) and ODP.NET applications. In 12.2.0.2 (18c), further clients were supported. In all cases, the use of Oracle Connection Pools like the Universal Connection Pool for JDBC or the OCI Session Pool is mandatory. Oracle Connection Pools can also be configured for third-party applications such as IBM WebSphere, RedHat JBoss, or Apache Tomcat.

Application Continuity can be used with RAC, RAC One Node, and Active Data Guard Physical Standby databases, but not with logical replication technologies such as Oracle GoldenGate or Data Guard Logical Standby. In the case of logical replication, the primary and standby databases differ in terms of physical structures (e.g. table and row IDs), therefore no replay can be performed.

Transparent Application Continuity (TAC)

Introduced in release 12.2.0.2 (18c), Transparent Application Continuity additionally supports applications that do not use a connection pool. However, the use of connection pools is still recommended. In addition, Transparent Application Continuity automatically detects side effects and does not repeat them in case of replay.

Activation of AC and TAC

The activation is very simple:

Steps on the server side

1. Create a custom database service to be able to set the service attributes for high availability. Application Continuity cannot be used with the default service (DB name).

The activation of AC and TAC is controlled by the FAILOVERTYPE parameter. The TRANSACTION value activates AC, while AUTO activates TAC.

srvctl add service -db dbname -pdb pdbname -service servicename 
-preferred inst1 -available inst2 
-stopoption IMMEDIATE
-replay_init_time 600 -retention 86400 -drain_timeout 300 
-notification TRUE #activates FAN
-commit_outcome TRUE #activates TG
-failover_restore [LEVEL1 | AUTO] 
-failovertype [TRANSACTION | AUTO] 
#LEVEL1 and TRANSACTION activates AC, while AUTO activates TAC

srvctl start servie -db dbname -service servicename

2. For pre 19c databases: grant KEEP privileges for mutables and sequences

SQL> grant KEEP DATE TIME to username;
SQL> grant KEEP SYSGUID to username;
SQL> grant KEEP SEQUENCE ON schema.seqname TO username;

3. Grant execute privilege for the AC PL/SQL package

SQL> grant EXECUTE ON DBMS_APP_CONT TO username;

4. Open port 6200 for Oracle Notification Services (ONS)

Steps on the client side

  1. Use the following TNS connection string
TNSalias = 

(DESCRIPTION = 

   (CONNECT_TIMEOUT=120) (RETRY_COUNT=50)(RETRY_DELAY=3)(TRANSPORT_CONNECT_TIMEOUT=3)

 (ADDRESS_LIST =

   (LOAD_BALANCE=on)

   (ADDRESS = (PROTOCOL = TCP)(HOST=primary-scan)(PORT=1521)))

 (ADDRESS_LIST =

   (LOAD_BALANCE=on)

   (ADDRESS = (PROTOCOL = TCP)(HOST=standby-scan)(PORT=1521)))       

(CONNECT_DATA=(SERVICE_NAME = servicename)))

2. Configure the client

For JDBC Thin clients: replace the .jar files with the most recent set from Oracle. The set includes the ons.jar, ucp.jar, and the replay driver ojdbc8.jar/ojdbc11.jar. These must be replaced as a set to ensure correct interoperability.

For Oracle Call Interface (OCI) clients: replace all occurrences of OCIStmtPrepare with OCIStmtPrepare2 (OCIStmtPrepare was deprecated in 12.2 and will not be replayed), and add HA EVENTS=TRUE to the connection string.

Have a look at this White Paper for more details.

AC and TAC in the Oracle Cloud

For co-managed databases (DBCS, ExaCS), AC or TAC can be activated just as easily and in the same way as for conventional Oracle databases.

For Autonomous Databases on Dedicated Exadata Infrastructure, TAC is enabled by default for the transaction services _tp and _tpurgent. You can enable or disable AC and TAC with the PL/SQL Package DBMS_APP_CONT_ADMIN, e.g.:

SQL> execute DBMS_APP_CONT_ADMIN.ENABLE_AC('TPURGENT');
SQL> execute DBMS_APP_CONT_ADMIN.ENABLE_TAC('HIGH');
SQL> execute DBMS_APP_CONT_ADMIN.DISABLE_FAILOVER('HIGH');

For Autonomous Databases on Shared Exadata Infrastructure, AC is disabled by default. Since there is no access to the local database servers, the service parameters cannot be adjusted as usual with the srvctl utility. Oracle provides the ENABLE_APP_CONT and DISABLE_APP_CONT procedures in the DBMS_CLOUD_ADMIN PL/SQL package for this purpose:

SQL> BEGIN
DBMS_CLOUD_ADMIN.ENABLE_APP_CONT(
    service_name => 'N7PMWSC8TE8FJTY_ATP19C_high.atp.oraclecloud.com'
    );
END;
/

Use the following query to check the service configuration in the database:

SQL> SELECT name, failover_type, commit_outcome, retention_timeout 
FROM dba_services;

What is new in 21c?

Application Continuity is evolving with each new release. Have a look at the new features in 21c.

Conclusion

Oracle RAC and Data Guard provide high availability for the database (High Availability). Application Continuity provides high availability for the application (Continuous Availability). Planned maintenance and outages become transparent to the end-user.

  • Using TAF or FCF (version 10.1), inactive or active sessions incl. SELECT statements can be recovered automatically without producing an error message on the application side.
  • Using Transaction Guard (12.1), transactions can transparently be recovered. However, it requires changing the application code.
  • Application Continuity (12.1/12.2) uses Transaction Guard automatically in the background. This means that you do not need to change your application code. Activation is done by a simple configuration. However, the use of connection pools is required.
  • Transparent Application Continuity (18c) also works without connection pools, even though they are still recommended.

Further Reading

Would you like to get notified when the next post is published?