If you’ve run an exachk report, you may have seen the following message with regard to your databases:
| Status | Type | Message | Status On | Details |
|---|---|---|---|---|
| FAIL | Database Check | Database parameter CLUSTER_INTERCONNECTS is NOT set to the recommended value | db01:dbm011, db02:dbm012 | View |
This check commonly appears when a database is created on Exadata without using the custom “Exadata” templates included with the Database Configuration Assistant. These templates include a number of recommended parameter settings from MOS note #1274318.1 (Exadata Setup/Configuration Best Practices), one of which is the CLUSTER_INTERCONNECTS parameter. This parameter determines which IP addresses will be used for communication between database instances in the cluster. If left unset, the instance defaults to the high availability IP addresses (HAIP) on the interfaces that Grid Infrastructure has designated for the cluster interconnect.
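A quick way to check whether a given database is affected is to look at both the parameter and the addresses each instance is actually using. The sketch below queries the GV$CLUSTER_INTERCONNECTS view; it is illustrative only, and the exact values depend on your environment:

```
-- Is the parameter set, and which addresses is each instance really
-- using for the interconnect?
show parameter cluster_interconnects

select inst_id, name, ip_address, source
  from gv$cluster_interconnects
 order by inst_id;
```

With the parameter unset, the addresses returned are typically the 169.254.x HAIP addresses; once it is set, the SOURCE column should report the parameter as the origin of each address.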
What is HAIP?
HAIP is a feature introduced in 11.2.0.2 that allows administrators to use multiple network interfaces for the cluster interconnect without configuring any kind of bonding. Interfaces identified within clusterware as interconnect interfaces automatically receive IP addresses when clusterware starts. These addresses fall within the familiar 169.254.0.0/16 link-local space, which is more commonly seen when a DHCP interface is unable to acquire an address. Because each of the interfaces receives an IP, HAIP allows for an easy active/active cluster interconnect configuration without host- and switch-based bonding using LACP. On Exadata, Oracle recommends not using this feature, hence the failure in the exachk report shown above. There is no way to disable HAIP; the only way to ensure it is not used is to set the CLUSTER_INTERCONNECTS parameter manually.
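As a quick illustration, you can ask clusterware which interfaces it has registered for the interconnect and look at the HAIP resource itself. These commands are a sketch run as the Grid Infrastructure owner, and the exact output and resource names can vary by version:

```
# Interfaces registered with clusterware for the private interconnect
oifcfg getif

# The HAIP resource managed by Grid Infrastructure; it has no "off" switch,
# which is why the workaround lives at the database level
crsctl stat res ora.cluster_interconnect.haip -init
```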
HAIP and Exadata
Like many Oracle features that can cause issues (I’m looking at you, automatic memory management), things will run fine for a long time until they hit a breaking point. We have seen cases where a database with CLUSTER_INTERCONNECTS unset runs fine for months or years, but when it fails, it’s a noticeable failure.
In one case, we had a client with an X3-8 full rack running 12.1.0.2 with the April 2015 patch release. I received an email saying that one day some of the database instances had crashed across the cluster and would not restart. For each affected database, the other instance continued to run without issue, but the failures were spread across the two nodes. The email included a chart like this (these aren’t the real database names):
| Database | Node 1 | Node 2 |
|---|---|---|
| DW | UP | UP |
| HR | DOWN | UP |
| OID | UP | DOWN |
| HIST | UP | DOWN |
| STG | UP | UP |
There weren’t any changes made to the databases in question – they had been patched and upgraded six weeks earlier and had been running without any issues until that day. Naturally, the first thing I did when I heard about the problem was to try to start one of the down instances. Here’s what came back from the attempt:
```
PRCR-1013 : Failed to start resource ora.hr.db
PRCR-1064 : Failed to start resource ora.hr.db on node db01
CRS-5017: The resource action "ora.hr.db start" encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0
. For details refer to "(:CLSN00107" in "/u01/app/grid/diag/crs/db01/crs/trace/crsd_oraagent_oracle.trc".
CRS-2674: Start of 'ora.hr.db' on 'db01' failed
```
How about that for an explanation? Here’s what was in the alert log for the instance that failed to start. This section is from when it was attempting to negotiate with the other instance:
```
Fri Jun 26 13:52:46 2015
 * Load Monitor used for high load check
 * New Low - High Load Threshold Range = [153600 - 204800]
Fri Jun 26 13:52:46 2015
Reconfiguration started (old inc 0, new inc 12)
List of instances (total 2) : 1 2
My inst 1 (I'm a new instance)
 Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
 Communication channels reestablished
Fri Jun 26 13:52:47 2015
 * domain 0 valid = 1 according to instance 2
Fri Jun 26 13:52:48 2015
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Fri Jun 26 13:52:48 2015
Fri Jun 26 13:52:48 2015
 LMS 6: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 5: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Jun 26 13:52:48 2015
 LMS 4: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Jun 26 13:52:48 2015
 LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Jun 26 13:52:48 2015
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Jun 26 13:52:48 2015
Fri Jun 26 13:52:48 2015
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Submitted all GCS remote-cache requests
Fri Jun 26 14:01:47 2015
LMD0 (ospid: 119535) waits for event 'process diagnostic dump' for 1 secs.
Fri Jun 26 14:01:52 2015
LMS0 (ospid: 119579) received an instance eviction notification from instance 2 [2]
Fri Jun 26 14:01:52 2015
LMON received an instance eviction notification from instance 2
The instance eviction reason is 0x2
The instance eviction map is 1
Errors in file /u01/app/oracle/diag/rdbms/hr/hr1/trace/hr1_lmhb_120095.trc  (incident=1004937):
ORA-29770: global enqueue process LMD0 (OSID 119535) is hung for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/hr/hr1/incident/incdir_1004937/hr1_lmhb_120095_i1004937.trc
Fri Jun 26 14:01:54 2015
Received an instance abort message from instance 2
```
That’s giving a little more information: it looks like the instance is being shut down by the other node before it can finish starting. Maybe the running instance’s alert log will tell us why it is evicting the new instance. Here is the same reconfiguration section from the instance that is running fine:
```
Reconfiguration started (old inc 14, new inc 16)
List of instances (total 2) : 1 2
New instances (total 1) : 1
My inst 2
 Global Resource Directory frozen
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 LMS 4: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 6: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 5: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Fri Jun 26 14:02:54 2015
Reconfiguration complete (total time 6.8 secs)
Fri Jun 26 14:03:54 2015
Increasing number of real time LMS from 0 to 7
Fri Jun 26 14:05:16 2015
LMS0 (ospid: 34439) has detected no messaging activity from instance 1
LMS0 (ospid: 34439) issues an IMR to resolve the situation
Please check LMS0 trace file for more detail.
Fri Jun 26 14:05:16 2015
Communications reconfiguration: instance_number 1 by ospid 34439
Fri Jun 26 14:06:15 2015
Evicting instance 1 from cluster
Waiting for instances to leave: 1
Fri Jun 26 14:06:15 2015
Dumping diagnostic data in directory=[cdmp_20150626140615], requested by (instance=1, osid=30058 (LMS0)), summary=[abnormal instance termination].
```
Now we are getting somewhere. It looks like there is a cluster communication issue between the nodes. Everything seemed OK, but I noticed the following in the failing instance’s alert log as it was starting up:
```
Fri Jun 26 13:52:36 2015
Cluster communication is configured to use the following interface(s) for this instance
  169.254.12.191
  169.254.65.68
  169.254.177.146
  169.254.214.117
cluster interconnect IPC version: Oracle RDS/IP (generic)
```
The host was attempting to use the HAIP interfaces instead of the static IP addresses configured on the InfiniBand interfaces. Knowing that this was not the recommended configuration, and recalling MOS notes about issues with HAIP on Exadata, I checked the CLUSTER_INTERCONNECTS setting on the database:
```
SQL> show parameter interconnect

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cluster_interconnects                string
```
As expected, the parameter was not set. We set the parameter for each of the instances in the spfile:
```
SQL> alter system set cluster_interconnects='172.16.0.185:172.16.0.186:172.16.0.187:172.16.0.188' sid='hr1' scope=spfile;

System altered.

SQL> alter system set cluster_interconnects='172.16.0.189:172.16.0.190:172.16.0.191:172.16.0.192' sid='hr2' scope=spfile;

System altered.
```
After setting the parameter, we were able to bring up the instance without any issues:
```
SQL> startup nomount
ORA-32004: obsolete or deprecated parameter(s) specified for RDBMS instance
ORACLE instance started.

Total System Global Area 1.7180E+10 bytes
Fixed Size                  5304248 bytes
Variable Size            4889346120 bytes
Database Buffers         1.2147E+10 bytes
Redo Buffers              138514432 bytes
SQL>
```
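At this point a sanity check along the following lines (illustrative; run on each instance) should show the static InfiniBand addresses supplied in the parameter rather than the 169.254.x HAIP addresses:

```
-- Confirm the interconnect addresses now come from CLUSTER_INTERCONNECTS
select name, ip_address, source
  from v$cluster_interconnects;
```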
While it is interesting to see the adverse effects of leaving this parameter unset, the outage could have been avoided entirely if the exachk failure had been recognized and addressed. The exachk script has a robust and ever-growing list of checks, and this is a good example of why it should be run, and its findings reviewed, regularly.
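If you want to make that a habit, a full run is a single command from wherever the tool is staged on a database server. The path and options below are assumptions; adjust them for your exachk version and environment, and consider scheduling the run rather than relying on memory:

```
# Run the full set of exachk checks (path and flags vary by version)
cd /opt/oracle.SupportTools/exachk
./exachk -a
```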