If you’ve run an exachk report, you may have seen the following message with regard to your databases:
| Status | Type | Message | Status On | Details |
|---|---|---|---|---|
| FAIL | Database Check | Database parameter CLUSTER_INTERCONNECTS is NOT set to the recommended value | db01:dbm011, db02:dbm012 | View |
This check commonly appears when a database is created on Exadata without using the custom “Exadata” templates included with the Database Configuration Assistant. These templates include a number of recommended parameter settings from MOS note #1274318.1 (Exadata Setup/Configuration Best Practices), one of which is the CLUSTER_INTERCONNECTS parameter. This parameter determines which IP addresses will be used for communication between database instances in the cluster. If left unset, the instance defaults to the high availability IP addresses (HAIP) on the interfaces that Grid Infrastructure has designated for the cluster interconnect.
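A quick way to check whether a given database is affected is to look at both the parameter and the addresses each instance is actually using. The sketch below queries the GV$CLUSTER_INTERCONNECTS view; it is illustrative only, and the exact values depend on your environment:

```
-- Is the parameter set, and which addresses is each instance really
-- using for the interconnect?
show parameter cluster_interconnects

select inst_id, name, ip_address, source
  from gv$cluster_interconnects
 order by inst_id;
```

With the parameter unset, the addresses returned are typically the 169.254.x HAIP addresses; once it is set, the SOURCE column should report the parameter as the origin of each address.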
What is HAIP?
HAIP is a feature introduced in 11.2.0.2 that allows administrators to use multiple network interfaces for the cluster interconnect without configuring any kind of bonding. Interfaces identified within clusterware as interconnect interfaces automatically receive IP addresses when clusterware starts. These addresses fall within the familiar 169.254.0.0/16 link-local space, which is more commonly seen when a DHCP interface is unable to acquire an address. Because each of the interfaces receives an IP, HAIP allows for an easy active/active cluster interconnect configuration without host- and switch-based bonding using LACP. On Exadata, Oracle recommends not using this feature, hence the failure in the exachk report shown above. There is no way to disable HAIP; the only way to ensure it is not used is to set the CLUSTER_INTERCONNECTS parameter manually.
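As a quick illustration, you can ask clusterware which interfaces it has registered for the interconnect and look at the HAIP resource itself. These commands are a sketch run as the Grid Infrastructure owner, and the exact output and resource names can vary by version:

```
# Interfaces registered with clusterware for the private interconnect
oifcfg getif

# The HAIP resource managed by Grid Infrastructure; it has no "off" switch,
# which is why the workaround lives at the database level
crsctl stat res ora.cluster_interconnect.haip -init
```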
HAIP and Exadata
Like many Oracle features that can cause issues (I’m looking at you, automatic memory management), things will run fine for a long time until they hit a breaking point. We have seen cases where a database with CLUSTER_INTERCONNECTS unset runs fine for months or years, but when it fails, it’s a noticeable failure.
In one case, we had a client with an X3-8 full rack running 12.1.0.2 with the April 2015 patch release. I received an email saying that one day some of the database instances had crashed across the cluster and would not restart. For each affected database, the other instance continued to run without issue, but the failures were spread across the two nodes. The email included a chart like this (these aren’t the real database names):
| Database | Node 1 | Node 2 |
|---|---|---|
| DW | UP | UP |
| HR | DOWN | UP |
| OID | UP | DOWN |
| HIST | UP | DOWN |
| STG | UP | UP |
There weren’t any changes made to the databases in question – they had been patched and upgraded six weeks earlier and had been running without any issues until that day. Naturally, the first thing I did when I heard about the problem was to try to start one of the down instances. Here’s what came back from the attempt:
```
PRCR-1013 : Failed to start resource ora.hr.db
PRCR-1064 : Failed to start resource ora.hr.db on node db01
CRS-5017: The resource action "ora.hr.db start" encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0
. For details refer to "(:CLSN00107" in "/u01/app/grid/diag/crs/db01/crs/trace/crsd_oraagent_oracle.trc".
CRS-2674: Start of 'ora.hr.db' on 'db01' failed
```
How about that for an explanation? Here’s what was in the alert log for the instance that failed to start. This section is from when it was attempting to negotiate with the other instance:
```
Fri Jun 26 13:52:46 2015
 * Load Monitor used for high load check
 * New Low - High Load Threshold Range = [153600 - 204800]
Fri Jun 26 13:52:46 2015
Reconfiguration started (old inc 0, new inc 12)
List of instances (total 2) : 1 2
My inst 1 (I'm a new instance)
 Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
 Communication channels reestablished
Fri Jun 26 13:52:47 2015
 * domain 0 valid = 1 according to instance 2
Fri Jun 26 13:52:48 2015
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Fri Jun 26 13:52:48 2015
Fri Jun 26 13:52:48 2015
 LMS 6: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 5: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Jun 26 13:52:48 2015
 LMS 4: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Jun 26 13:52:48 2015
 LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Jun 26 13:52:48 2015
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Jun 26 13:52:48 2015
Fri Jun 26 13:52:48 2015
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Submitted all GCS remote-cache requests
Fri Jun 26 14:01:47 2015
LMD0 (ospid: 119535) waits for event 'process diagnostic dump' for 1 secs.
Fri Jun 26 14:01:52 2015
LMS0 (ospid: 119579) received an instance eviction notification from instance 2 [2]
Fri Jun 26 14:01:52 2015
LMON received an instance eviction notification from instance 2
The instance eviction reason is 0x2
The instance eviction map is 1
Errors in file /u01/app/oracle/diag/rdbms/hr/hr1/trace/hr1_lmhb_120095.trc  (incident=1004937):
ORA-29770: global enqueue process LMD0 (OSID 119535) is hung for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/hr/hr1/incident/incdir_1004937/hr1_lmhb_120095_i1004937.trc
Fri Jun 26 14:01:54 2015
Received an instance abort message from instance 2
```
That’s giving a little more information: it looks like the instance is being shut down by the other node before it can finish starting. Maybe the running instance’s alert log will tell us why it is evicting the new instance. Here is the same reconfiguration section from the instance that is running fine:
```
Reconfiguration started (old inc 14, new inc 16)
List of instances (total 2) : 1 2
New instances (total 1) : 1
My inst 2
 Global Resource Directory frozen
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 LMS 4: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 6: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 5: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Fri Jun 26 14:02:54 2015
Reconfiguration complete (total time 6.8 secs)
Fri Jun 26 14:03:54 2015
Increasing number of real time LMS from 0 to 7
Fri Jun 26 14:05:16 2015
LMS0 (ospid: 34439) has detected no messaging activity from instance 1
LMS0 (ospid: 34439) issues an IMR to resolve the situation
Please check LMS0 trace file for more detail.
Fri Jun 26 14:05:16 2015
Communications reconfiguration: instance_number 1 by ospid 34439
Fri Jun 26 14:06:15 2015
Evicting instance 1 from cluster
Waiting for instances to leave: 1
Fri Jun 26 14:06:15 2015
Dumping diagnostic data in directory=[cdmp_20150626140615], requested by (instance=1, osid=30058 (LMS0)), summary=[abnormal instance termination].
```
Now we are getting somewhere. It looks like there is a cluster communication issue between the nodes. Everything seemed OK, but I noticed the following in the failing instance’s alert log as it was starting up:
```
Fri Jun 26 13:52:36 2015
Cluster communication is configured to use the following interface(s) for this instance
  169.254.12.191
  169.254.65.68
  169.254.177.146
  169.254.214.117
cluster interconnect IPC version: Oracle RDS/IP (generic)
```
The host was attempting to use the HAIP interfaces instead of the static IP addresses configured on the InfiniBand interfaces. Knowing that this was not the recommended configuration, and recalling MOS notes about issues with HAIP on Exadata, I checked the CLUSTER_INTERCONNECTS setting on the database:
```
SQL> show parameter interconnect

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cluster_interconnects                string
```
As expected, the parameter was not set. We set the parameter for each of the instances in the spfile:
```
SQL> alter system set cluster_interconnects='172.16.0.185:172.16.0.186:172.16.0.187:172.16.0.188' sid='hr1' scope=spfile;

System altered.

SQL> alter system set cluster_interconnects='172.16.0.189:172.16.0.190:172.16.0.191:172.16.0.192' sid='hr2' scope=spfile;

System altered.
```
After setting the parameter, we were able to bring up the instance without any issues:
```
SQL> startup nomount
ORA-32004: obsolete or deprecated parameter(s) specified for RDBMS instance
ORACLE instance started.

Total System Global Area 1.7180E+10 bytes
Fixed Size                  5304248 bytes
Variable Size            4889346120 bytes
Database Buffers         1.2147E+10 bytes
Redo Buffers              138514432 bytes
SQL>
```
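At this point a sanity check along the following lines (illustrative; run on each instance) should show the static InfiniBand addresses supplied in the parameter rather than the 169.254.x HAIP addresses:

```
-- Confirm the interconnect addresses now come from CLUSTER_INTERCONNECTS
select name, ip_address, source
  from v$cluster_interconnects;
```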
While it is interesting to see the adverse effects of leaving this parameter unset, the outage could have been avoided entirely if the exachk failure had been recognized and addressed. The exachk script has a robust and ever-growing list of checks, and this is a good example of why it should be run, and its findings reviewed, regularly.
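If you want to make that a habit, a full run is a single command from wherever the tool is staged on a database server. The path and options below are assumptions; adjust them for your exachk version and environment, and consider scheduling the run rather than relying on memory:

```
# Run the full set of exachk checks (path and flags vary by version)
cd /opt/oracle.SupportTools/exachk
./exachk -a
```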