EXADATA Upgrade from 11.2.3.1.0 to 11.2.3.3.0 – (5) Applying Bundle Patch 23 and Upgrading the Switches

Applying Bundle Patch 23

1. Download the latest OPatch, p6880880_112000_Linux-x86-64.zip, unzip it into the home, and verify that OPatch is now at the latest version.

[oracle@gxx2db01 bp23]$ unzip p6880880_112000_Linux-x86-64.zip -d /u01/app/11.2.0.3/grid/
[oracle@gxx2db01 bp23]$ $ORACLE_HOME/OPatch/opatch version 
2. Create the OCM response file

% $ORACLE_HOME/OPatch/ocm/bin/emocmrsp
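If you prefer to keep the response file in a fixed location for later use with opatch auto, emocmrsp can write it to a named file. A minimal sketch; the output path is only an example:

% $ORACLE_HOME/OPatch/ocm/bin/emocmrsp -no_banner -output /u01/app/oracle/patches/ocm.rsp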
3. Validate the Oracle Inventory

% $ORACLE_HOME/OPatch/opatch lsinventory -detail -oh $ORACLE_HOME
4. Unzip BP23

% cd /u01/app/oracle/patches
% unzip p18835772_112030_Linux-x86-64.zip
% cd /u01/app/oracle/patches/18835772
# chown -R oracle:oinstall /u01/app/oracle/patches/18835772
5. Run the conflict check

For Grid Infrastructure Home, as home user:

% $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /u01/app/oracle/patches/18835772/18707883
% $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /u01/app/oracle/patches/18835772/18906063
For Database home, as home user:

% $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /u01/app/oracle/patches/18835772/18707883
% $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /u01/app/oracle/patches/18835772/18906063/custom/server/18906063
6. Run the SystemSpace check, which mainly verifies that there is enough free space in the ORACLE_HOME.

For Grid Infrastructure Home, as home user:

% $ORACLE_HOME/OPatch/opatch prereq CheckSystemSpace -phBaseDir /u01/app/oracle/patches/18835772/18707883
% $ORACLE_HOME/OPatch/opatch prereq CheckSystemSpace -phBaseDir /u01/app/oracle/patches/18835772/18906063
For Database home, as home user:

% $ORACLE_HOME/OPatch/opatch prereq CheckSystemSpace -phBaseDir /u01/app/oracle/patches/18835772/18707883
% $ORACLE_HOME/OPatch/opatch prereq CheckSystemSpace -phBaseDir /u01/app/oracle/patches/18835772/18906063/custom/server/18906063
7. Apply the patch with opatch auto

# opatch auto /u01/app/oracle/patches/18835772
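opatch auto is run as root and patches the GI home and the database homes on the node in one pass. If an OCM response file was created in step 2, it can be passed with -ocmrf, and -oh restricts the run to a single home; a sketch assuming the paths used elsewhere in this note:

# /u01/app/11.2.0.3/grid/OPatch/opatch auto /u01/app/oracle/patches/18835772 -ocmrf /u01/app/oracle/patches/ocm.rsp
# /u01/app/11.2.0.3/grid/OPatch/opatch auto /u01/app/oracle/patches/18835772 -oh /u01/app/oracle/product/11.2.0.3/dbhome_1 -ocmrf /u01/app/oracle/patches/ocm.rsp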
8. Run the post-patch script

% sqlplus / as sysdba
SQL> @rdbms/admin/catbundle.sql exa apply
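Whether the bundle registered successfully can then be confirmed from the data dictionary; a minimal verification sketch:

% sqlplus -s / as sysdba <<EOF
set linesize 200
select action_time, action, version, bundle_series, comments
  from dba_registry_history
 order by action_time;
EOF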

Upgrading the InfiniBand Switches

The last step is to upgrade the InfiniBand switches. The switch software is shipped in the same patch media as the cells, and the upgrade uses the same patchmgr command and procedure as the cells. One thing to watch out for: the IP addresses configured on the switch network interfaces must match the entries in /etc/hosts on the switch itself; if they differ, the upgrade fails, as shown below:

[FAIL     ] Mismatch between address in ifcfg-eth[0,1] and /etc/hosts in gxx2sw-ib3. ACTION: Correct entry in /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth1 or /etc/hosts

In that case, log on to the InfiniBand switch, compare the output of ifconfig -a with more /etc/hosts, correct the /etc/hosts file, and rerun the precheck command; it should then pass.
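Since the switch runs a Linux-based OS (the error message above references its network-scripts files), the mismatch can be located quickly with standard commands; a sketch, using gxx2sw-ib3 only as an example:

[root@gxx2sw-ib3 ~]# grep IPADDR /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth1
[root@gxx2sw-ib3 ~]# grep gxx2sw-ib3 /etc/hosts

The IPADDR values must match the address listed for the switch in /etc/hosts; after correcting /etc/hosts, rerun the -ibswitch_precheck.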

[root@gxx2db01 patch_11.2.3.3.0.131014.1]# ./patchmgr -ibswitches -upgrade -ibswitch_precheck

2014-09-07 14:10:25 +0800 1 of 1 :SUCCESS: DO: Initiate pre-upgrade validation check on InfiniBand switch(es).
 ----- InfiniBand switch update process started Sun Sep  7 14:10:25 CST 2014 -----
[NOTE     ] Log file at /var/log/cellos/upgradeIBSwitch.log

[INFO     ] List of InfiniBand switches for upgrade: ( gxx2sw-ib3 gxx2sw-ib2 )
[PROMPT   ] Use the default password for all switches? (y/n) [n]: y
[PROMPT   ] Updating only 2 switch(es). Are you sure you want to continue? (y/n) [n]: y
[SUCCESS  ] Verifying Network connectivity to gxx2sw-ib3
[SUCCESS  ] Verifying Network connectivity to gxx2sw-ib2
[SUCCESS  ] Validating verify-topology output
[INFO     ] Master Subnet Manager is set to gxx2sw-ib2 in all Switches

[INFO     ] ---------- Starting with IBSwitch gxx2sw-ib2
[SUCCESS  ] gxx2sw-ib2 is at 1.3.3-2. Meets minimal patching level 1.3.3-2
[SUCCESS  ] Verifying that /tmp has 120M in gxx2sw-ib2, found 249M
[SUCCESS  ] Verifying that / has 80M in gxx2sw-ib2, found 200M
[SUCCESS  ] Verifying that gxx2sw-ib2 has 120M free memory, found 408M
[SUCCESS  ] Verifying host details in /etc/hosts and /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth1 for gxx2sw-ib2
[SUCCESS  ] Verifying that gxx2sw-ib2 has at least 1 NTP Server, found 1
[INFO     ] Manually validate the following entries Date:(YYYY-MM-DD) 2014-09-07 Time:(HH:MM:SS) 20:12:34
[SUCCESS  ] Pre-update validation on gxx2sw-ib2

[INFO     ] ---------- Starting with InfiniBand Switch gxx2sw-ib3
[SUCCESS  ] gxx2sw-ib3 is at 1.3.3-2. Meets minimal patching level 1.3.3-2
[SUCCESS  ] Verifying that /tmp has 120M in gxx2sw-ib3, found 249M
[SUCCESS  ] Verifying that / has 80M in gxx2sw-ib3, found 200M
[SUCCESS  ] Verifying that gxx2sw-ib3 has 120M free memory, found 410M
[SUCCESS  ] Verifying host details in /etc/hosts and /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth1 for gxx2sw-ib3
[SUCCESS  ] Verifying that gxx2sw-ib3 has at least 1 NTP Server, found 1
[INFO     ] Manually validate the following entries Date:(YYYY-MM-DD) 2014-09-07 Time:(HH:MM:SS) 20:17:29
[SUCCESS  ] Pre-update validation on gxx2sw-ib3
[SUCCESS  ] Overall status

 ----- InfiniBand switch update process ended Sun Sep  7 14:11:07 CST 2014 -----
2014-09-07 14:11:07 +0800 1 of 1 :SUCCESS: DONE: Initiate pre-upgrade validation check on InfiniBand switch(es).

Once the precheck has passed, the actual upgrade can be started. The output of an upgrade run is shown below.

[root@gxx2db01 patch_11.2.3.3.0.131014.1]# ./patchmgr -ibswitches -upgrade

2014-09-07 14:11:34 +0800 1 of 1 :SUCCESS: DO: Initiate upgrade of InfiniBand switches to 2.1.3-4. Expect up to 15 minutes for each switch
 ----- InfiniBand switch update process started Sun Sep  7 14:11:35 CST 2014 -----
[NOTE     ] Log file at /var/log/cellos/upgradeIBSwitch.log

[INFO     ] List of InfiniBand switches for upgrade: ( gxx2sw-ib3 gxx2sw-ib2 )
[PROMPT   ] Use the default password for all switches? (y/n) [n]: y
[PROMPT   ] Updating only 2 switch(es). Are you sure you want to continue? (y/n) [n]: y
[SUCCESS  ] Verifying Network connectivity to gxx2sw-ib3
[SUCCESS  ] Verifying Network connectivity to gxx2sw-ib2
[SUCCESS  ] Validating verify-topology output
[INFO     ] Proceeding with upgrade of InfiniBand switches to version 2.1.3_4
[INFO     ] Master Subnet Manager is set to gxx2sw-ib2 in all Switches

[INFO     ] ---------- Starting with IBSwitch gxx2sw-ib2
[SUCCESS  ] Disable Subnet Manager on gxx2sw-ib2
[SUCCESS  ] Copy firmware packages to gxx2sw-ib2
[SUCCESS  ] gxx2sw-ib2 is at 1.3.3-2. Meets minimal patching level 1.3.3-2
[SUCCESS  ] Verifying that /tmp has 120M in gxx2sw-ib2, found 139M
[SUCCESS  ] Verifying that / has 80M in gxx2sw-ib2, found 200M
[SUCCESS  ] Verifying that gxx2sw-ib2 has 120M free memory, found 299M
[SUCCESS  ] Verifying host details in /etc/hosts and /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth1 for gxx2sw-ib2
[SUCCESS  ] Verifying that gxx2sw-ib2 has at least 1 NTP Server, found 1
[INFO     ] Manually validate the following entries Date:(YYYY-MM-DD) 2014-09-07 Time:(HH:MM:SS) 20:14:12
[SUCCESS  ] Pre-update validation on gxx2sw-ib2
[INFO     ] Starting upgrade on gxx2sw-ib2 to 2.1.3_4. Please give upto 10 mins for the process to complete. DO NOT INTERRUPT or HIT CTRL+C during the upgrade
[SUCCESS  ] Load firmware 2.1.3_4 onto gxx2sw-ib2
[SUCCESS  ] Disable Subnet Manager on gxx2sw-ib2
[SUCCESS  ] Verify that /conf/configvalid is set to 1 on gxx2sw-ib2
[SUCCESS  ] Set SMPriority to 5 on gxx2sw-ib2
[INFO     ] Rebooting gxx2sw-ib2. Wait for 240 secs before continuing
[SUCCESS  ] Reboot gxx2sw-ib2
[SUCCESS  ] Restart Subnet Manager on gxx2sw-ib2
[INFO     ] Starting post-update validation on gxx2sw-ib2
[SUCCESS  ] Inifiniband switch gxx2sw-ib2 is at target patching level
[SUCCESS  ] Verifying host details in /etc/hosts and /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth1 for gxx2sw-ib2
[SUCCESS  ] Verifying that gxx2sw-ib2 has at least 1 NTP Server, found 1
[INFO     ] Manually validate the following entries Date:(YYYY-MM-DD) 2014-09-07 Time:(HH:MM:SS) 12:27:27
[SUCCESS  ] Firmware verification on InfiniBand switch gxx2sw-ib2
[INFO     ] Post-check validation on IBSwitch gxx2sw-ib2
[SUCCESS  ] Update switch gxx2sw-ib2 to 2.1.3_4

[INFO     ] ---------- Starting with InfiniBand Switch gxx2sw-ib3
[SUCCESS  ] Disable Subnet Manager on gxx2sw-ib3
[SUCCESS  ] Copy firmware packages to gxx2sw-ib3
[SUCCESS  ] gxx2sw-ib3 is at 1.3.3-2. Meets minimal patching level 1.3.3-2
[SUCCESS  ] Verifying that /tmp has 120M in gxx2sw-ib3, found 139M
[SUCCESS  ] Verifying that / has 80M in gxx2sw-ib3, found 200M
[SUCCESS  ] Verifying that gxx2sw-ib3 has 120M free memory, found 300M
[SUCCESS  ] Verifying host details in /etc/hosts and /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth1 for gxx2sw-ib3
[SUCCESS  ] Verifying that gxx2sw-ib3 has at least 1 NTP Server, found 1
[INFO     ] Manually validate the following entries Date:(YYYY-MM-DD) 2014-09-07 Time:(HH:MM:SS) 20:37:06
[SUCCESS  ] Pre-update validation on gxx2sw-ib3
[INFO     ] Starting upgrade on gxx2sw-ib3 to 2.1.3_4. Please give upto 10 mins for the process to complete. DO NOT INTERRUPT or HIT CTRL+C during the upgrade
[SUCCESS  ] Load firmware 2.1.3_4 onto gxx2sw-ib3
[SUCCESS  ] Disable Subnet Manager on gxx2sw-ib3
[SUCCESS  ] Verify that /conf/configvalid is set to 1 on gxx2sw-ib3
[SUCCESS  ] Set SMPriority to 5 on gxx2sw-ib3
[INFO     ] Rebooting gxx2sw-ib3. Wait for 240 secs before continuing
[SUCCESS  ] Reboot gxx2sw-ib3
[SUCCESS  ] Restart Subnet Manager on gxx2sw-ib3
[INFO     ] Starting post-update validation on gxx2sw-ib3
[SUCCESS  ] Inifiniband switch gxx2sw-ib3 is at target patching level
[SUCCESS  ] Verifying host details in /etc/hosts and /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth1 for gxx2sw-ib3
[SUCCESS  ] Verifying that gxx2sw-ib3 has at least 1 NTP Server, found 1
[INFO     ] Manually validate the following entries Date:(YYYY-MM-DD) 2014-09-07 Time:(HH:MM:SS) 12:49:54
[SUCCESS  ] Firmware verification on InfiniBand switch gxx2sw-ib3
[INFO     ] Post-check validation on IBSwitch gxx2sw-ib3
[SUCCESS  ] Update switch gxx2sw-ib3 to 2.1.3_4
[INFO     ] InfiniBand Switches ( gxx2sw-ib3 gxx2sw-ib2 ) updated to 2.1.3_4
[SUCCESS  ] Overall status

 ----- InfiniBand switch update process ended Sun Sep  7 14:47:43 CST 2014 -----
2014-09-07 14:47:43 +0800 1 of 1 :SUCCESS: DONE: Upgrade InfiniBand switch(es) to 2.1.3-4.

Post-upgrade Checks

After the whole upgrade has succeeded, a few final steps remain: set the ASM disk_repair_time back to its original value, re-enable CRS, start dbfs, and verify that every IMAGE has been upgraded to 11.2.3.3.0.

----Reset the ASM disk_repair_time
SQL> alter diskgroup DATA_GXX2 set attribute 'disk_repair_time'='3.6h';
Diskgroup altered.
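Only DATA_GXX2 is shown here; the other two diskgroups whose disk_repair_time was raised in part (3), DBFS_DG and RECO_GXX2, are presumably reset the same way:

SQL> alter diskgroup DBFS_DG set attribute 'disk_repair_time'='3.6h';
SQL> alter diskgroup RECO_GXX2 set attribute 'disk_repair_time'='3.6h';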

----Re-enable CRS
[root@gxx2db01 tmp]# dcli -g dbs_group -l root "/u01/app/11.2.0.3/grid/bin/crsctl enable crs"

----Check the IMAGE information on the database nodes and storage cells
[root@gxx2db01 mydbfs]# dcli -g /tmp/all_group -l root 'imagehistory'
gxx2db01: Version                              : 11.2.3.1.0.120304
gxx2db01: Image activation date                : 2002-05-03 22:47:44 +0800
gxx2db01: Imaging mode                         : fresh
gxx2db01: Imaging status                       : success
gxx2db01:
gxx2db01: Version                              : 11.2.3.3.0.131014.1
gxx2db01: Image activation date                : 2014-09-07 11:57:33 +0800
gxx2db01: Imaging mode                         : patch
gxx2db01: Imaging status                       : success
gxx2db01:
gxx2db02: Version                              : 11.2.3.1.0.120304
gxx2db02: Image activation date                : 2012-05-03 11:29:41 +0800
gxx2db02: Imaging mode                         : fresh
gxx2db02: Imaging status                       : success
gxx2db02:
gxx2db02: Version                              : 11.2.3.3.0.131014.1
gxx2db02: Image activation date                : 2014-09-06 22:03:14 +0800
gxx2db02: Imaging mode                         : patch
gxx2db02: Imaging status                       : success
gxx2db02:
gxx2cel01: Version                              : 11.2.2.3.5.110815
gxx2cel01: Image activation date                : 2011-10-19 16:15:42 -0700
gxx2cel01: Imaging mode                         : fresh
gxx2cel01: Imaging status                       : success
gxx2cel01:
gxx2cel01: Version                              : 11.2.3.1.0.120304
gxx2cel01: Image activation date                : 2012-05-03 03:00:13 -0700
gxx2cel01: Imaging mode                         : out of partition upgrade
gxx2cel01: Imaging status                       : success
gxx2cel01:
gxx2cel01: Version                              : 11.2.3.3.0.131014.1
gxx2cel01: Image activation date                : 2014-09-06 16:01:21 +0800
gxx2cel01: Imaging mode                         : out of partition upgrade
gxx2cel01: Imaging status                       : success
gxx2cel01:
gxx2cel02: Version                              : 11.2.2.3.5.110815
gxx2cel02: Image activation date                : 2011-10-19 16:26:30 -0700
gxx2cel02: Imaging mode                         : fresh
gxx2cel02: Imaging status                       : success
gxx2cel02:
gxx2cel02: Version                              : 11.2.3.1.0.120304
gxx2cel02: Image activation date                : 2012-05-03 02:59:52 -0700
gxx2cel02: Imaging mode                         : out of partition upgrade
gxx2cel02: Imaging status                       : success
gxx2cel02:
gxx2cel02: Version                              : 11.2.3.3.0.131014.1
gxx2cel02: Image activation date                : 2014-09-06 17:42:01 +0800
gxx2cel02: Imaging mode                         : out of partition upgrade
gxx2cel02: Imaging status                       : success
gxx2cel02:
gxx2cel03: Version                              : 11.2.2.3.5.110815
gxx2cel03: Image activation date                : 2011-10-19 16:26:59 -0700
gxx2cel03: Imaging mode                         : fresh
gxx2cel03: Imaging status                       : success
gxx2cel03:
gxx2cel03: Version                              : 11.2.3.1.0.120304
gxx2cel03: Image activation date                : 2012-05-03 02:58:38 -0700
gxx2cel03: Imaging mode                         : out of partition upgrade
gxx2cel03: Imaging status                       : success
gxx2cel03:
gxx2cel03: Version                              : 11.2.3.3.0.131014.1
gxx2cel03: Image activation date                : 2014-09-06 17:42:08 +0800
gxx2cel03: Imaging mode                         : out of partition upgrade
gxx2cel03: Imaging status                       : success

References

Sun Oracle Database Machine Owner's Guide

How to backup / restore Exadata Database Server (Linux) – community document

dbnodeupdate.sh: Exadata Database Server Patching using the DB Node Update Utility (MOS Doc ID 1553103.1)

Exadata 11.2.3.3.0 release and patch (16278923) (MOS Doc ID 1487339.1)

Information Center: Upgrading Oracle Exadata Database Machine [ID 1364356.2]

EXADATA Upgrade from 11.2.3.1.0 to 11.2.3.3.0 – (4) Reclaiming the Solaris Space and Upgrading the Database Nodes

Reclaiming the Solaris Space

Exadata ships from the factory with two operating systems installed, Linux and Solaris x86, in a dual-boot RAID 1 layout. When upgrading the database nodes, if the Solaris disks have not been reclaimed the upgrade fails with the following error:

ERROR: Solaris disks are not reclaimed. This needs to be done before the upgrade. See the Exadata Database Machine documentation to claim the Solaris disks

The factory-supplied script can be used to inspect the local disks on a database node. Here it shows a total of 4 physical disks, a RAID level of 1, and a dual-boot installation.

[root@gxx2db01 oracle.SupportTools]# ./reclaimdisks.sh -check
[INFO] This is SUN FIRE X4170 M2 SERVER machine
[INFO] Number of LSI controllers: 1
[INFO] Physical disks found: 4 (252:0 252:1 252:2 252:3)
[INFO] Logical drives found: 3
[INFO] Dual boot installation: yes
[WARNING] Some lvm logical volume(s) resizes on other than /dev/sda device
[INFO] Linux logical drive: 0
[INFO] RAID Level for the Linux logical drive: 1
[INFO] Physical disks in the Linux logical drive: 2 (252:0 252:1)
[INFO] Dedicated Hot Spares for the Linux logical drive: 0
[INFO] Global Hot Spares: 0
[INFO] Valid dual boot configuration found for Linux: RAID1 from 2 disks

Reclaiming the Solaris OS is straightforward: just run the reclaimdisks.sh script. I did hit one small problem when running it: the script only recognizes the factory-default disks and volume groups, while the Nanning Power Grid site had configured an additional VG (the datavg used for backups). Since the backups had already been taken, I removed that VG and reran the script successfully. You could also modify the script instead, but in our tests the reclaim still wipes out the configuration of any custom VG. This step is therefore quite dangerous, so make sure a good backup exists before doing it. While the script runs, you can watch /var/log/cellos/reclaimdisks.bg.log to see exactly what it is doing.
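Before deleting the custom VG (datavg here), it is worth at least saving its LVM layout and metadata to the NFS share so it can be recreated afterwards; a minimal sketch using standard LVM commands (file names are only examples):

[root@gxx2db02 ~]# vgdisplay -v datavg > /root/tar/datavg.layout.txt
[root@gxx2db02 ~]# vgcfgbackup -f /root/tar/datavg.vgcfg datavg
[root@gxx2db02 ~]# umount /backup
[root@gxx2db02 ~]# lvremove /dev/datavg/lv_data
[root@gxx2db02 ~]# vgremove datavg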

[root@gxx2db02 oracle.SupportTools]# ./reclaimdisks.sh -free -reclaim

Started from ./reclaimdisks.sh
[INFO] Free mode is set
[INFO] Reclaim mode is set
[INFO] This is SUN FIRE X4170 M2 SERVER machine
[INFO] Number of LSI controllers: 1
[INFO] Physical disks found: 4 (252:0 252:1 252:2 252:3)
[INFO] Logical drives found: 3
[INFO] Dual boot installation: yes
[INFO] Linux logical drive: 0
[INFO] RAID Level for the Linux logical drive: 1
[INFO] Physical disks in the Linux logical drive: 2 (252:0 252:1)
[INFO] Dedicated Hot Spares for the Linux logical drive: 0
[INFO] Global Hot Spares: 0
[INFO] Non-linux physical disks that will be reclaimed: 2 (252:2 252:3)
[INFO] Non-linux logical drives that will be reclaimed: 2 (1 2)
Remove logical drive 1

Adapter 0: Deleted Virtual Drive-1(target id-1)
Exit Code: 0x00
Remove logical drive 2

Adapter 0: Deleted Virtual Drive-2(target id-2)

Exit Code: 0x00
[INFO] Remove Solaris entries from /boot/grub/grub.conf
[INFO] Disk reclaiming started in the background with parent process id 17405.
[INFO] Check the log file /var/log/cellos/reclaimdisks.bg.log.
[INFO] This process may take about two hours.
[INFO] DO NOT REBOOT THE NODE.
[INFO] The node will be rebooted automatically upon completion.

Upgrading the Database Nodes

Upgrading the Exadata database nodes is straightforward. First shut down CRS and the databases on the database nodes and disable CRS, so that any reboot during the installation does not try to start the cluster and the databases.

[root@gxx2db01 tmp]# dcli -g dbs_group -l root "/u01/app/11.2.0.3/grid/bin/crsctl stop crs -f"
[root@gxx2db01 tmp]# dcli -g dbs_group -l root "ps -ef | grep d.bin"
[root@gxx2db01 tmp]# dcli -g dbs_group -l root "/u01/app/11.2.0.3/grid/bin/crsctl disable crs"

Here we use the DB Node Update Utility, the script delivered by patch 16486998. For details see dbnodeupdate.sh: Exadata Database Server Patching using the DB Node Update Utility (MOS Doc ID 1553103.1), which contains many usage examples. We use the ISO image method; an HTTP repository can also be used. Before upgrading, it is best to run a precheck with the -v option first. Also note that if the Solaris disks have not been reclaimed, the script will fail.
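The precheck run with -v looks like the real update but makes no changes; a sketch using the same ISO:

[root@gxx2db02 u01]# ./dbnodeupdate.sh -u -l /u01/p17809253_112330_Linux-x86-64.zip -v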

The whole upgrade run looks like this:

[root@gxx2db02 u01]# ./dbnodeupdate.sh -u -l /u01/p17809253_112330_Linux-x86-64.zip
##########################################################################################################################
#                                                                                                                        
# Guidelines for using dbnodeupdate.sh (rel. 3.55):                                                                      #                                                                                                              
# - Prerequisites for usage:                                                                                             #
#         1. Refer to dbnodeupdate.sh options. See MOS 1553103.1                                                         
#         2. Use the latest release of dbnodeupdate.sh. See patch 16486998                                               
#         3. Run the prereq check with the '-v' option.                                                                  #                                                                                                                  #
#   I.e.:  ./dbnodeupdate.sh -u -l /u01/my-iso-repo.zip -v                                                               #
#          ./dbnodeupdate.sh -u -l http://my-yum-repo -v                                                                 #
#                                                                                                                        
# - Prerequisite dependency check failures can happen due to customization:                                              #
#     - The prereq check detects dependency issues that need to be addressed prior to running a successful update.       
#     - Customized rpm packages may fail the built-in dependency check and system updates cannot proceed until resolved. 
#                                                                                                                        
#   When upgrading from releases later than 11.2.2.4.2 to releases before 11.2.3.3.0:                                    #
#      - Conflicting packages should be removed before proceeding the update.                                            #                                                                                                                     
#   When upgrading to releases 11.2.3.3.0 or later:                                                                      #
#      - When the 'exact' package dependency check fails 'minimum' package dependency check will be tried.               #
#      - When the 'minimum' package dependency check also fails,                                                         #
#        the conflicting packages should be removed before proceeding.                                                   #                                                                                                                       
# - As part of the prereq checks and as part of the update, a number of rpms will be removed.                            #
#   This removal is required to preserve Exadata functioning. This should not be confused with obsolete packages.        
#      - See /var/log/cellos/packages_to_be_removed.txt for details on what packages will be removed.                                                                                                                                     
# - In case of any problem when filing an SR, upload the following:                                                      #
#      - /var/log/cellos/dbnodeupdate.log                                                                                #
#      - /var/log/cellos/dbnodeupdate.<runid>.diag                                                                       #
#      - where <runid> is the unique number of the failing run.                                                          #
#                                                                                                                        #
##########################################################################################################################
Continue ? [y/n]
y
  (*) 2014-09-06 21:53:28: Unzipping helpers (/u01/dbupdate-helpers.zip) to /opt/oracle.SupportTools/dbnodeupdate_helpers
  (*) 2014-09-06 21:53:28: Initializing logfile /var/log/cellos/dbnodeupdate.log
  (*) 2014-09-06 21:53:28: Collecting system configuration details. This may take a while...
  (*) 2014-09-06 21:53:41: Validating system details for known issues and best practices. This may take a while...
  (*) 2014-09-06 21:53:41: Checking free space in /u01/iso.stage.060914215326
  (*) 2014-09-06 21:53:41: Unzipping /u01/p17809253_112330_Linux-x86-64.zip to /u01/iso.stage.060914215326, this may take a while
  (*) 2014-09-06 21:54:00: Original /etc/yum.conf moved to /etc/yum.conf.060914215326, generating new yum.conf
  (*) 2014-09-06 21:54:00: Generating Exadata repository file /etc/yum.repos.d/Exadata-computenode.repo

  Warning: Network routing configuration requires change before updating database server. See MOS 1306154.1

Continue ? [y/n]
y

  (*) 2014-09-06 21:54:17: Validating the specified source location.
  (*) 2014-09-06 21:54:18: Cleaning up the yum cache.
  (*) 2014-09-06 21:54:18: Preparing update for releases 11.2.3.3.0 and later
  (*) 2014-09-06 21:54:28: Performing yum package dependency check for 'exact' dependencies. This may take a while...
  (*) 2014-09-06 21:54:32: 'Exact'package dependency check succeeded.
  (*) 2014-09-06 21:54:32: 'Minimum' package dependency check succeeded.

Active Image version   : 11.2.3.1.0.120304
Active Kernel version  : 2.6.18-274.18.1.0.1.el5
Active LVM Name        : /dev/mapper/VGExaDb-LVDbSys1
Inactive Image version : n/a
Inactive LVM Name      : /dev/mapper/VGExaDb-LVDbSys2
Current user id        : root
Action                 : upgrade
Upgrading to           : 11.2.3.3.0.131014.1 (to exadata-sun-computenode-exact)
Baseurl                : file:///var/www/html/yum/unknown/EXADATA/dbserver/060914215326/x86_64/ (iso)
Iso file               : /u01/iso.stage.060914215326/repoimage.iso
Create a backup        : Yes
Shutdown stack         : No (Currently stack is down)
Hotspare exists        : Yes, but will NOT be reclaimed as part of this update)
                       : Raid reconstruction to add the hotspare to be done later when required
RPM exclusion list     : Not in use (add rpms to /etc/exadata/yum/exclusion.lst and restart dbnodeupdate.sh)
RPM obsolete list      : /etc/exadata/yum/obsolete.lst (lists rpms to be removed by the update)
                       : RPM obsolete list is extracted from exadata-sun-computenode-11.2.3.3.0.131014.1-1.x86_64.rpm
Exact dependencies     : No conflicts
Minimum dependencies   : No conflicts
Logfile                : /var/log/cellos/dbnodeupdate.log (runid: 060914215326)
Diagfile               : /var/log/cellos/dbnodeupdate.060914215326.diag
Server model           : SUN FIRE X4170 M2 SERVER
dbnodeupdate.sh rel.   : 3.55 (always check MOS 1553103.1 for the latest release of dbnodeupdate)
Note                   : After upgrading and rebooting run './dbnodeupdate.sh -c' to finish post steps.

The following known issues will be checked for and automatically corrected by dbnodeupdate.sh:
  (*) - Issue 1.7 - Updating database servers with customized partitions may remove partitions already in use
  (*) - Issue - 11.2.3.3.0 and 12.1.1.1.0 require disabling SDP APM settings. See MOS 1623834.1
  (*) - Issue - Incorrect validation name for ExaWatcher in /etc/cron.daily/cellos stops ExaWatcher
  (*) - Issue - tls_checkpeer and tls_crlcheck mis-configured in /etc/ldap.conf

The following known issues will be checked for but require manual follow-up:
  (*) - Issue - Database Server upgrades may hit network routing issues after the upgrade
  (*) - Issue - Yum rolling update requires fix for 11768055 when Grid Infrastructure is below 11.2.0.2 BP12
  (*) - Updates from releases earlier than 11.2.3.3.0 may hang during reboot. See MOS 1620826.1 for more details

Continue ? [y/n]
y
  (*) 2014-09-06 21:54:57: Verifying GI and DB's are shutdown
  (*) 2014-09-06 21:54:59: Collecting console history for diag purposes
  (*) 2014-09-06 21:55:15: Unmount of /boot successful
  (*) 2014-09-06 21:55:15: Check for /dev/sda1 successful
  (*) 2014-09-06 21:55:15: Mount of /boot successful
  (*) 2014-09-06 21:55:15: Disabling stack from starting
  (*) 2014-09-06 21:55:15: Performing filesystem backup to /dev/mapper/VGExaDb-LVDbSys2. Avg. 30 minutes (maximum 120) depends per environment.....
  (*) 2014-09-06 21:59:26: Backup successful
  (*) 2014-09-06 21:59:26: OSWatcher stopped successful
  (*) 2014-09-06 21:59:27: Validating the specified source location.
  (*) 2014-09-06 21:59:28: Cleaning up the yum cache.
  (*) 2014-09-06 21:59:28: Preparing update for releases 11.2.3.3.0 and later
  (*) 2014-09-06 21:59:32: Performing yum update. Node is expected to reboot when finished.
  (*) 2014-09-06 22:01:56: Waiting for post rpm script to finish. Sleeping another 60 seconds (60 / 900)

Remote broadcast message (Sat Sep  6 22:02:02 2014):

Exadata post install steps started.
It may take up to 5 minutes.
  (*) 2014-09-06 22:02:56: Waiting for post rpm script to finish. Sleeping another 60 seconds (120 / 900)
Remote broadcast message (Sat Sep  6 22:03:15 2014):
Exadata post install steps completed with success

The whole update takes 40-50 minutes and the node reboots several times. In between you will notice that after a reboot the node answers ping but SSH is not yet available; only after the final automatic reboot can you SSH back in. Be patient during this window.
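Once the node is back, the post step named in the summary above ('./dbnodeupdate.sh -c') completes the update; a sketch:

[root@gxx2db02 u01]# ./dbnodeupdate.sh -c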

EXADATA Upgrade from 11.2.3.1.0 to 11.2.3.3.0 – (3) Upgrading the Storage Cells

Before upgrading the storage cell IMAGE, the environment has to be checked. The checks and the upgrade are driven from a database node.

1. Check root SSH equivalence to all cell nodes

[root@gxx2db01 tmp]# dcli -g all_group -l root date
gxx2db01: Sat Sep  6 12:14:41 CST 2014
gxx2db02: Sat Sep  6 12:14:40 CST 2014
gxx2cel01: Sat Sep  6 12:14:41 CST 2014
gxx2cel02: Sat Sep  6 12:14:41 CST 2014
gxx2cel03: Sat Sep  6 12:14:41 CST 2014
[root@gxx2db01 tmp]# dcli -g cell_group -l root 'hostname -i'
gxx2cel01: 10.100.84.104
gxx2cel02: 10.100.84.105
gxx2cel03: 10.100.84.106
2. Check the diskgroup attribute disk_repair_time

[grid@gxx2db02 ~]$ sqlplus / as sysasm
SQL*Plus: Release 11.2.0.3.0 Production on Sat Sep 6 12:20:14 2014
Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
SQL> select dg.name,a.value from v$asm_diskgroup dg, v$asm_attribute a
where dg.group_number=a.group_number and a.name='disk_repair_time';  
NAME              VALUE
-------          -----
DATA_GXX2       3.6h
DBFS_DG         3.6h
RECO_GXX2       3.6h

The current value is 3.6 hours. We raise it mainly to avoid the griddisks being dropped on the cells once the default 3.6 hours is exceeded during the upgrade; if griddisks do get dropped, they have to be added back manually after the upgrade. For now, set it to 24 hours.

SQL> alter diskgroup DATA_GXX2 set attribute 'disk_repair_time'='24h';
Diskgroup altered.

SQL> alter diskgroup DBFS_DG set attribute 'disk_repair_time'='24h';
Diskgroup altered.

SQL> alter diskgroup RECO_GXX2 set attribute 'disk_repair_time'='24h';
Diskgroup altered.

SQL> select dg.name,a.value from v$asm_diskgroup dg, v$asm_attribute a
where dg.group_number=a.group_number and a.name='disk_repair_time';  
NAME              VALUE
-------          -----
DATA_GXX2        24h
DBFS_DG          24h
RECO_GXX2        24h
3. Check the OS kernel version

[root@gxx2db01 tmp]# dcli -g all_group -l root 'uname -a'
gxx2db01: Linux gxx2db01.gx.csg.cn 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
gxx2db02: Linux gxx2db02.gx.csg.cn 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
gxx2cel01: Linux gxx2cel01.gx.csg.cn 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
gxx2cel02: Linux gxx2cel02.gx.csg.cn 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
gxx2cel03: Linux gxx2cel03.gx.csg.cn 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
4. Check the OS release

[root@gxx2db01 tmp]# dcli -g all_group -l root 'cat /etc/oracle-release'
gxx2db01: Oracle Linux Server release 5.7
gxx2db02: Oracle Linux Server release 5.7
gxx2cel01: Oracle Linux Server release 5.7
gxx2cel02: Oracle Linux Server release 5.7
gxx2cel03: Oracle Linux Server release 5.7
5. Check the IMAGE version

[root@gxx2db01 tmp]# dcli -g all_group -l root 'imageinfo'
gxx2db01:
gxx2db01: Kernel version: 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64
gxx2db01: Image version: 11.2.3.1.0.120304
gxx2db01: Image activated: 2002-05-03 22:47:44 +0800
gxx2db01: Image status: success
gxx2db01: System partition on device: /dev/mapper/VGExaDb-LVDbSys1
gxx2db01:
gxx2db02:
gxx2db02: Kernel version: 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64
gxx2db02: Image version: 11.2.3.1.0.120304
gxx2db02: Image activated: 2012-05-03 11:29:41 +0800
gxx2db02: Image status: success
gxx2db02: System partition on device: /dev/mapper/VGExaDb-LVDbSys1
gxx2db02:
gxx2cel01:
gxx2cel01: Kernel version: 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64
gxx2cel01: Cell version: OSS_11.2.3.1.0_LINUX.X64_120304
gxx2cel01: Cell rpm version: cell-11.2.3.1.0_LINUX.X64_120304-1
gxx2cel01:
gxx2cel01: Active image version: 11.2.3.1.0.120304
gxx2cel01: Active image activated: 2012-05-03 03:00:13 -0700
gxx2cel01: Active image status: success
gxx2cel01: Active system partition on device: /dev/md6
gxx2cel01: Active software partition on device: /dev/md8
gxx2cel01:
gxx2cel01: In partition rollback: Impossible
gxx2cel01:
gxx2cel01: Cell boot usb partition: /dev/sdm1
gxx2cel01: Cell boot usb version: 11.2.3.1.0.120304
gxx2cel01:
gxx2cel01: Inactive image version: 11.2.2.3.5.110815
gxx2cel01: Inactive image activated: 2011-10-19 16:15:42 -0700
gxx2cel01: Inactive image status: success
gxx2cel01: Inactive system partition on device: /dev/md5
gxx2cel01: Inactive software partition on device: /dev/md7
gxx2cel01:
gxx2cel01: Boot area has rollback archive for the version: 11.2.2.3.5.110815
gxx2cel01: Rollback to the inactive partitions: Possible
gxx2cel02:
gxx2cel02: Kernel version: 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64
gxx2cel02: Cell version: OSS_11.2.3.1.0_LINUX.X64_120304
gxx2cel02: Cell rpm version: cell-11.2.3.1.0_LINUX.X64_120304-1
gxx2cel02:
gxx2cel02: Active image version: 11.2.3.1.0.120304
gxx2cel02: Active image activated: 2012-05-03 02:59:52 -0700
gxx2cel02: Active image status: success
gxx2cel02: Active system partition on device: /dev/md6
gxx2cel02: Active software partition on device: /dev/md8
gxx2cel02:
gxx2cel02: In partition rollback: Impossible
gxx2cel02:
gxx2cel02: Cell boot usb partition: /dev/sdm1
gxx2cel02: Cell boot usb version: 11.2.3.1.0.120304
gxx2cel02:
gxx2cel02: Inactive image version: 11.2.2.3.5.110815
gxx2cel02: Inactive image activated: 2011-10-19 16:26:30 -0700
gxx2cel02: Inactive image status: success
gxx2cel02: Inactive system partition on device: /dev/md5
gxx2cel02: Inactive software partition on device: /dev/md7
gxx2cel02:
gxx2cel02: Boot area has rollback archive for the version: 11.2.2.3.5.110815
gxx2cel02: Rollback to the inactive partitions: Possible
gxx2cel03:
gxx2cel03: Kernel version: 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64
gxx2cel03: Cell version: OSS_11.2.3.1.0_LINUX.X64_120304
gxx2cel03: Cell rpm version: cell-11.2.3.1.0_LINUX.X64_120304-1
gxx2cel03:
gxx2cel03: Active image version: 11.2.3.1.0.120304
gxx2cel03: Active image activated: 2012-05-03 02:58:38 -0700
gxx2cel03: Active image status: success
gxx2cel03: Active system partition on device: /dev/md6
gxx2cel03: Active software partition on device: /dev/md8
gxx2cel03:
gxx2cel03: In partition rollback: Impossible
gxx2cel03:
gxx2cel03: Cell boot usb partition: /dev/sdm1
gxx2cel03: Cell boot usb version: 11.2.3.1.0.120304
gxx2cel03:
gxx2cel03: Inactive image version: 11.2.2.3.5.110815
gxx2cel03: Inactive image activated: 2011-10-19 16:26:59 -0700
gxx2cel03: Inactive image status: success
gxx2cel03: Inactive system partition on device: /dev/md5
gxx2cel03: Inactive software partition on device: /dev/md7
gxx2cel03:
gxx2cel03: Boot area has rollback archive for the version: 11.2.2.3.5.110815
gxx2cel03: Rollback to the inactive partitions: Possible

[root@gxx2db01 tmp]# dcli -g all_group -l root 'imagehistory'
gxx2db01: Version                              : 11.2.3.1.0.120304
gxx2db01: Image activation date                : 2002-05-03 22:47:44 +0800
gxx2db01: Imaging mode                         : fresh
gxx2db01: Imaging status                       : success
gxx2db01:
gxx2db02: Version                              : 11.2.3.1.0.120304
gxx2db02: Image activation date                : 2012-05-03 11:29:41 +0800
gxx2db02: Imaging mode                         : fresh
gxx2db02: Imaging status                       : success
gxx2db02:
gxx2cel01: Version                              : 11.2.2.3.5.110815
gxx2cel01: Image activation date                : 2011-10-19 16:15:42 -0700
gxx2cel01: Imaging mode                         : fresh
gxx2cel01: Imaging status                       : success
gxx2cel01:
gxx2cel01: Version                              : 11.2.3.1.0.120304
gxx2cel01: Image activation date                : 2012-05-03 03:00:13 -0700
gxx2cel01: Imaging mode                         : out of partition upgrade
gxx2cel01: Imaging status                       : success
gxx2cel01:
gxx2cel02: Version                              : 11.2.2.3.5.110815
gxx2cel02: Image activation date                : 2011-10-19 16:26:30 -0700
gxx2cel02: Imaging mode                         : fresh
gxx2cel02: Imaging status                       : success
gxx2cel02:
gxx2cel02: Version                              : 11.2.3.1.0.120304
gxx2cel02: Image activation date                : 2012-05-03 02:59:52 -0700
gxx2cel02: Imaging mode                         : out of partition upgrade
gxx2cel02: Imaging status                       : success
gxx2cel02:
gxx2cel03: Version                              : 11.2.2.3.5.110815
gxx2cel03: Image activation date                : 2011-10-19 16:26:59 -0700
gxx2cel03: Imaging mode                         : fresh
gxx2cel03: Imaging status                       : success
gxx2cel03:
gxx2cel03: Version                              : 11.2.3.1.0.120304
gxx2cel03: Image activation date                : 2012-05-03 02:58:38 -0700
gxx2cel03: Imaging mode                         : out of partition upgrade
gxx2cel03: Imaging status                       : success
gxx2cel03:
6. Check the ofa package version

[root@gxx2db01 tmp]# dcli -g all_group -l root 'rpm -qa | grep ofa'
gxx2db01: ofa-2.6.18-274.18.1.0.1.el5-1.5.1-4.0.58
gxx2db02: ofa-2.6.18-274.18.1.0.1.el5-1.5.1-4.0.58
gxx2cel01: ofa-2.6.18-274.18.1.0.1.el5-1.5.1-4.0.58
gxx2cel02: ofa-2.6.18-274.18.1.0.1.el5-1.5.1-4.0.58
gxx2cel03: ofa-2.6.18-274.18.1.0.1.el5-1.5.1-4.0.58
7. Check the hardware model

[root@gxx2db01 tmp]# dcli -g all_group -l root 'dmidecode -s system-product-name'
gxx2db01: SUN FIRE X4170 M2 SERVER
gxx2db02: SUN FIRE X4170 M2 SERVER
gxx2cel01: SUN FIRE X4270 M2 SERVER
gxx2cel02: SUN FIRE X4270 M2 SERVER
gxx2cel03: SUN FIRE X4270 M2 SERVER
8. Check the cell alert history

gxx2cel01: 36    2014-08-29T08:54:27+08:00       info            "This is a test trap"
gxx2cel02: 40_1  2014-08-28T20:01:24+08:00       warning         "Oracle Exadata Storage Server failed to auto-create cell disk and grid disks on the newly inserted physical disk. Physical Disk : 20:4  Status        : normal  Manufacturer  : SEAGATE  Model Number  : ST360057SSUN600G  Size          : 600G  Serial Number : E4CK7V  Firmware      : 0B25  Slot Number   : 4  "
gxx2cel02: 41    2014-08-29T08:54:04+08:00       info            "This is a test trap"
gxx2cel03: 27_3  2014-08-13T18:28:11+08:00       clear           "Hard disk replaced.  Status        : NORMAL  Manufacturer  : HITACHI  Model Number  : HUS1560SCSUN600G  Size          : 600G  Serial Number : K7UL6N  Firmware      : A700  Slot Number   : 11  Cell Disk     : CD_11_gxx2cel03  Grid Disk     : DATA_GXX2_CD_11_gxx2cel03, RECO_GXX2_CD_11_gxx2cel03, DBFS_DG_CD_11_gxx2cel03"
gxx2cel03: 28    2014-08-29T08:54:43+08:00       info            "This is a test trap"
9. Check for offline griddisks

[root@gxx2db01 tmp]# dcli -g cell_group -l root "cellcli -e list griddisk attributes name where asmdeactivationoutcome != 'Yes'"
10. Verify that the cell network configuration matches cell.conf

[root@gxx2db01 tmp]# dcli -g cell_group -l root /opt/oracle.cellos/ipconf -verify
gxx2cel01: Verifying of Exadata configuration file /opt/oracle.cellos/cell.conf
gxx2cel01: Done. Configuration file /opt/oracle.cellos/cell.conf passed all verification checks
gxx2cel02: Verifying of Exadata configuration file /opt/oracle.cellos/cell.conf
gxx2cel02: Done. Configuration file /opt/oracle.cellos/cell.conf passed all verification checks
gxx2cel03: Verifying of Exadata configuration file /opt/oracle.cellos/cell.conf
gxx2cel03: Done. Configuration file /opt/oracle.cellos/cell.conf passed all verification checks
11. Stop CRS and the cell services

[root@gxx2db01 tmp]# dcli -g dbs_group -l root "/u01/app/11.2.0.3/grid/bin/crsctl stop crs -f"
[root@gxx2db01 tmp]# dcli -g dbs_group -l root "ps -ef | grep d.bin"
[root@gxx2db01 tmp]# dcli -g cell_group -l root "cellcli -e alter cell shutdown services all"
12. Unzip the installation media and the plug-ins

[root@gxx2db01 ExaImage]# unzip p16278923_112330_Linux-x86-64.zip
[root@gxx2db01 ExaImage]# unzip -d patch_11.2.3.3.0.131014.1/plugins/ p17938410_112330_Linux-x86-64.zip -x Readme.txt
[root@gxx2db01 ExaImage]# chmod +x patch_11.2.3.3.0.131014.1/plugins/*
13. Clean up the environment from previous patchmgr runs

[root@gxx2db01 patch_11.2.3.3.0.131014.1]# ./patchmgr -cells /tmp/cell_group -reset_force
2014-09-06 13:48:44 +0800 DONE: reset_force

[root@gxx2db01 patch_11.2.3.3.0.131014.1]# ./patchmgr -cells  /tmp/cell_group -cleanup
2014-09-06 13:49:51 +0800 DONE: Cleanup
14. Pre-installation check

[root@gxx2db01 patch_11.2.3.3.0.131014.1]# ./patchmgr -cells /tmp/cell_group -patch_check_prereq
2014-09-06 14:27:26 +0800        :Working: DO: Check cells have ssh equivalence for root user. Up to 10 seconds per cell ...
2014-09-06 14:27:27 +0800        :SUCCESS: DONE: Check cells have ssh equivalence for root user.
2014-09-06 14:27:27 +0800        :Working: DO: Initialize files, check space and state of cell services. Up to 1 minute ...
2014-09-06 14:27:49 +0800        :SUCCESS: DONE: Initialize files, check space and state of cell services.
2014-09-06 14:27:49 +0800        :Working: DO: Copy, extract prerequisite check archive to cells. If required start md11 mismatched partner size correction. Up to 40 minutes ...
2014-09-06 14:28:17 +0800 Wait correction of degraded md11 due to md partner size mismatch. Up to 30 minutes.

2014-09-06 14:28:18 +0800        :SUCCESS: DONE: Copy, extract prerequisite check archive to cells. If required start md11 mismatched partner size correction.
2014-09-06 14:28:18 +0800        :Working: DO: Check prerequisites on all cells. Up to 2 minutes ...
2014-09-06 14:29:01 +0800        :SUCCESS: DONE: Check prerequisites on all cells.
2014-09-06 14:29:01 +0800        :Working: DO: Execute plugin check for Patch Check Prereq ...
2014-09-06 14:29:01 +0800 :INFO: Patchmgr plugin start: Prereq check for exposure to bug 17854520 v1.1. Details in logfile /backup/ExaImage/patch_11.2.3.3.0.131014.1/patchmgr.stdout.
2014-09-06 14:29:01 +0800 :INFO: This plugin checks dbhomes across all nodes with oracle-user ssh equivalence, but only for those known to the local system. dbhomes that exist only on remote nodes must be checked manually.
2014-09-06 14:29:01 +0800 :SUCCESS: No exposure to bug 17854520 with non-rolling patching
2014-09-06 14:29:01 +0800        :SUCCESS: DONE: Execute plugin check for Patch Check Prereq.
15. Upgrade the storage cells

[root@gxx2db01 patch_11.2.3.3.0.131014.1]# ./patchmgr -cells /tmp/cell_group -patch
NOTE Cells will reboot during the patch or rollback process.
NOTE For non-rolling patch or rollback, ensure all ASM instances using
NOTE the cells are shut down for the duration of the patch or rollback.
NOTE For rolling patch or rollback, ensure all ASM instances using
NOTE the cells are up for the duration of the patch or rollback.

WARNING Do not start more than one instance of patchmgr.
WARNING Do not interrupt the patchmgr session.
WARNING Do not alter state of ASM instances during patch or rollback.
WARNING Do not resize the screen. It may disturb the screen layout.
WARNING Do not reboot cells or alter cell services during patch or rollback.
WARNING Do not open log files in editor in write mode or try to alter them.

NOTE All time estimates are approximate. Timestamps on the left are real.
NOTE You may interrupt this patchmgr run in next 60 seconds with control-c.


2014-09-06 14:32:49 +0800        :Working: DO: Check cells have ssh equivalence for root user. Up to 10 seconds per cell ...
2014-09-06 14:32:50 +0800        :SUCCESS: DONE: Check cells have ssh equivalence for root user.
2014-09-06 14:32:50 +0800        :Working: DO: Initialize files, check space and state of cell services. Up to 1 minute ...
2014-09-06 14:33:32 +0800        :SUCCESS: DONE: Initialize files, check space and state of cell services.
2014-09-06 14:33:32 +0800        :Working: DO: Copy, extract prerequisite check archive to cells. If required start md11 mismatched partner size correction. Up to 40 minutes ...
2014-09-06 14:34:00 +0800 Wait correction of degraded md11 due to md partner size mismatch. Up to 30 minutes.


2014-09-06 14:34:01 +0800        :SUCCESS: DONE: Copy, extract prerequisite check archive to cells. If required start md11 mismatched partner size correction.
2014-09-06 14:34:01 +0800        :Working: DO: Check prerequisites on all cells. Up to 2 minutes ...
2014-09-06 14:34:43 +0800        :SUCCESS: DONE: Check prerequisites on all cells.
2014-09-06 14:34:43 +0800        :Working: DO: Copy the patch to all cells. Up to 3 minutes ...
2014-09-06 14:35:15 +0800        :SUCCESS: DONE: Copy the patch to all cells.
2014-09-06 14:35:17 +0800        :Working: DO: Execute plugin check for Patch Check Prereq ...
2014-09-06 14:35:17 +0800 :INFO: Patchmgr plugin start: Prereq check for exposure to bug 17854520 v1.1. Details in logfile /backup/ExaImage/patch_11.2.3.3.0.131014.1/patchmgr.stdout.
2014-09-06 14:35:17 +0800 :INFO: This plugin checks dbhomes across all nodes with oracle-user ssh equivalence, but only for those known to the local system. dbhomes that exist only on remote nodes must be checked manually.
2014-09-06 14:35:17 +0800 :SUCCESS: No exposure to bug 17854520 with non-rolling patching
2014-09-06 14:35:18 +0800        :SUCCESS: DONE: Execute plugin check for Patch Check Prereq.
2014-09-06 14:35:18 +0800 1 of 5 :Working: DO: Initiate patch on cells. Cells will remain up. Up to 5 minutes ...
2014-09-06 14:35:30 +0800 1 of 5 :SUCCESS: DONE: Initiate patch on cells.
2014-09-06 14:35:30 +0800 2 of 5 :Working: DO: Waiting to finish pre-reboot patch actions. Cells will remain up. Up to 45 minutes ...
2014-09-06 14:36:30 +0800 Wait for patch pre-reboot procedures


2014-09-06 15:03:13 +0800 2 of 5 :SUCCESS: DONE: Waiting to finish pre-reboot patch actions.
2014-09-06 15:03:13 +0800        :Working: DO: Execute plugin check for Patching ...
2014-09-06 15:03:13 +0800        :SUCCESS: DONE: Execute plugin check for Patching.
2014-09-06 15:03:13 +0800 3 of 5 :Working: DO: Finalize patch on cells. Cells will reboot. Up to 5 minutes ...
2014-09-06 15:03:33 +0800 3 of 5 :SUCCESS: DONE: Finalize patch on cells.
2014-09-06 15:03:33 +0800 4 of 5 :Working: DO: Wait for cells to reboot and come online. Up to 120 minutes ...
2014-09-06 15:04:33 +0800 Wait for patch finalization and reboot

||||| Minutes left 076

2014-09-06 16:01:39 +0800 4 of 5 :SUCCESS: DONE: Wait for cells to reboot and come online.
2014-09-06 16:01:39 +0800 5 of 5 :Working: DO: Check the state of patch on cells. Up to 5 minutes ...
2014-09-06 16:02:14 +0800 5 of 5 :SUCCESS: DONE: Check the state of patch on cells.
2014-09-06 16:02:14 +0800        :Working: DO: Execute plugin check for Post Patch ...
2014-09-06 16:02:14 +0800 :INFO: /backup/ExaImage/patch_11.2.3.3.0.131014.1/plugins/001-post_11_2_3_3_0 - 17718598: Correct /etc/oracle-release.
2014-09-06 16:02:14 +0800 :INFO: /backup/ExaImage/patch_11.2.3.3.0.131014.1/plugins/001-post_11_2_3_3_0 - 17908298: Preserve password quality policies where applicable.
2014-09-06 16:02:15 +0800        :SUCCESS: DONE: Execute plugin check for Post Patch.

While the patch script runs, it prints a series of Working/SUCCESS messages; if any step shows Failed, the upgrade stops and the problem has to be resolved first. The cells reboot automatically during the patch, and on the database node you can see the message "SUCCESS: DONE: Wait for cells to reboot and come online." The whole run, driven from the database node, usually takes more than an hour and a half. Afterwards, check the image version to confirm the upgrade succeeded. Because the upgrade is launched from a database node, the network connection must stay up for the whole run, so it is best to run it from a VNC session to avoid unpredictable problems if the terminal suddenly disconnects.
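A quick way to confirm that every cell is on the new image before moving on, reusing the dcli group from the prechecks:

[root@gxx2db01 tmp]# dcli -g /tmp/cell_group -l root "imageinfo | grep 'Active image version'"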

EXADATA Upgrade from 11.2.3.1.0 to 11.2.3.3.0 – (2) Backing Up the Environment and Upgrading the LSI Disk Array Controller Firmware

1. Configure the NFS environment

To be able to roll back to the pre-upgrade state if the upgrade goes wrong, part of the Exadata environment has to be backed up. We back up over NFS: a Linux server on the internal LAN that the Exadata can reach was chosen as the NFS server, and 1 TB of space had already been mounted on it.

On the NFS server, add the following entries to /etc/exports:

/media/_data/  10.100.82.1(rw)
/media/_data/  10.100.82.2(rw)

Note: these IP addresses are the addresses the Exadata requests are mapped to, not the physical IPs of the database nodes. Check /var/log/messages on the NFS server to see which source IPs the Exadata clients actually use, and put those IPs into /etc/exports, otherwise the export will not work. Because the customer has a firewall between the network segments, fixed ports also have to be configured. On the server, add the following ports to /etc/sysconfig/nfs:

MOUNTD_PORT="4002"
STATD_PORT="4003"
LOCKD_TCPPORT="4004"
LOCKD_UDPPORT="4004"
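
After /etc/exports and /etc/sysconfig/nfs have been changed, the NFS services on the server need to be restarted and the exports re-published; a minimal sketch for an OEL 5-era server (service names assumed):

service portmap restart
service nfs restart
exportfs -ra
exportfs -v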

The OS firewall on the server must be turned off completely:

service iptables stop
chkconfig iptables off

Check that NFS is configured correctly:

rpcinfo -p      Run on the NFS server to check that the ports are correct.
showmount -e    Run on the NFS server to list the exported file systems.
showmount -e <NFS server IP>    Run on the client to confirm the exports are visible from the client.

Mount the NFS file system on both Exadata database nodes:

mount -t nfs -o rw,intr,soft,proto=tcp,nolock 10.194.42.11:/media/_data /root/tar

2. Back up the existing environment

With NFS in place, we can back up the database node operating systems, the clusterware and database software, and the databases. The storage cells do not need to be backed up because they can be restored from the CELL BOOT USB Flash Drive.

2.1 Back up the database node operating system
[root@gxx2db01 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys1
                       30G   14G   15G  49% /
/dev/sda1             502M   36M  441M   8% /boot
/dev/mapper/VGExaDb-LVDbOra1
                       99G   55G   39G  59% /u01
tmpfs                  81G   26M   81G   1% /dev/shm
/dev/mapper/datavg-lv_data
                      549G  355G  166G  69% /backup
dbfs-dbfs@dbfs:/      800G  4.9G  796G   1% /data
10.194.42.11:/media/_data
                      985G  199M  935G   1% /root/tar

As shown, the 1 TB NFS share is mounted. The operating system lives on two LVs, /dev/mapper/VGExaDb-LVDbSys1 and /dev/mapper/VGExaDb-LVDbOra1, while datavg-lv_data is a volume we created ourselves for database backups. Backing up the operating system therefore means backing up those two LVs, which is done as follows.

[root@gxx2db01 ~]# lvcreate -L1G -s -n root_snap /dev/VGExaDb/LVDbSys1
  Logical volume "root_snap" created
[root@gxx2db01 ~]# e2label /dev/VGExaDb/root_snap DBSYS_SNAP
[root@gxx2db01 ~]# mkdir /root/mnt
[root@gxx2db01 ~]# mount /dev/VGExaDb/root_snap /root/mnt -t ext3

[root@gxx2db01 ~]# lvcreate -L5G -s -n u01_snap /dev/VGExaDb/LVDbOra1
  Logical volume "u01_snap" created
[root@gxx2db01 ~]# e2label /dev/VGExaDb/u01_snap DBORA_SNAP
[root@gxx2db01 ~]# mkdir -p /root/mnt/u01
[root@gxx2db01 ~]# mount /dev/VGExaDb/u01_snap /root/mnt/u01 -t ext3

[root@gxx2db01 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys1
                       30G   14G   15G  49% /
/dev/sda1             502M   36M  441M   8% /boot
/dev/mapper/VGExaDb-LVDbOra1
                       99G   55G   39G  59% /u01
tmpfs                  81G   26M   81G   1% /dev/shm
/dev/mapper/datavg-lv_data
                      549G  355G  166G  69% /backup
dbfs-dbfs@dbfs:/      800G  4.9G  796G   1% /data
10.194.42.11:/media/_data
                      985G  199M  935G   1% /root/tar
/dev/mapper/VGExaDb-root_snap
                       30G   14G   15G  49% /root/mnt
/dev/mapper/VGExaDb-u01_snap
                       99G   55G   39G  59% /root/mnt/u01

After these steps there are two additional snapshot LVs, capturing VGExaDb-LVDbSys1 and VGExaDb-LVDbOra1, mounted as file systems. Next, tar the snapshot file systems off to the NFS mount.

[root@gxx2db01 ~]# cd /root/mnt
[root@gxx2db01 ~]# tar -pjcvf /root/tar/mybackup.tar.bz2 * /boot \
    --exclude tar/mybackup.tar.bz2 --exclude /root/tar \
    > /tmp/backup_tar.stdout 2> /tmp/backup_tar.stderr

When the tar completes, check /tmp/backup_tar.stderr for errors. If it is clean, unmount the snapshot mount points and remove the snapshot LVs.

[root@gxx2db01 ~]# cd /
[root@gxx2db01 ~]# umount /root/mnt/u01
[root@gxx2db01 ~]# umount /root/mnt
[root@gxx2db01 ~]# /bin/rm -rf /root/mnt
[root@gxx2db01 ~]# lvremove /dev/VGExaDb/u01_snap
[root@gxx2db01 ~]# lvremove /dev/VGExaDb/root_snap

Perform the steps above on both database nodes.

2.2 Back up the databases

Three database instances run on the database nodes: gxypdb, orcl, and jjscpd. gxypdb and orcl are backed up with RMAN, while jjscpd is backed up with exp into the dbfs file system on the database nodes. For the RMAN-backed databases we use the script below, which writes the backups to /backup/orcl and /backup/gxypdb; copying those directories to the NFS mount completes the database backup. For the exp backup, we only need to copy the dmp files from the dbfs file system to the NFS mount.

--->Back up the databases
export ORACLE_SID=orcl2
source /home/oracle/.bash_profile
$ORACLE_HOME/bin/rman log=/backup/log/full_`date +%Y%m%d%H%M`.log <<EOF
connect target /
run
{
# Backup Database full
BACKUP
     SKIP INACCESSIBLE
     TAG hot_db_bk_level
     FORMAT '/backup/orcl/bk_s%s_p%p_t%T'
    DATABASE
    INCLUDE CURRENT CONTROLFILE;
}
run
{
# Backup Archived Logs

sql 'alter system archive log current';
change archivelog all crosscheck;
BACKUP
    FORMAT '/backup/orcl/ar_s%s_p%p_t%T'
    ARCHIVELOG ALL;

# Control file backup
BACKUP
    FORMAT '/backup/orcl/cf_s%s_p%p_t%T'
    CURRENT CONTROLFILE;
}
delete noprompt archivelog until time "sysdate - 5";
crosscheck backup;
delete force noprompt expired backup;
allocate channel for maintenance type disk;
delete force noprompt obsolete device type disk;
list backup summary;
exit;
EOF
--->Copy the backup sets to NFS
[root@gxx2db01 ~]# cp  -rp /backup/orcl/ /root/tar
[root@gxx2db01 ~]# cp  -rp /backup/gxypdb/ /root/tar
[root@gxx2db01 ~]# cp -rp /data/*.dmp  /root/tar
2.3 Back up the Grid Infrastructure and database software

The clusterware and database software on the database nodes are backed up mainly to guard against unexpected errors while applying the QUARTERLY DATABASE PATCH FOR EXADATA (BP 23), i.e. the GI and DB patches, so that we can roll back. The databases and the GI stack should be stopped before taking this backup.

[oracle@gxx2db01 ~]$ srvctl stop instance -i orcl1 -d orcl
[oracle@gxx2db01 ~]$ srvctl stop instance -i orcl2 -d orcl
[oracle@gxx2db01 ~]$ srvctl stop instance -i gxypdb1 -d gxypdb
[oracle@gxx2db01 ~]$ srvctl stop instance -i gxypdb2 -d gxypdb
[oracle@gxx2db01 ~]$ srvctl stop instance -i jjscpd1 -d jjscpd
[oracle@gxx2db01 ~]$ srvctl stop instance -i jjscpd2 -d jjscpd
[root@gxx2db01 ~]# /u01/app/11.2.0.3/grid/bin/crsctl stop crs -f
[root@gxx2db01 ~]# cd /root/tar
[root@gxx2db01 ~]# tar -cvf oraInventory.tar /u01/app/oraInventory 
[root@gxx2db01 ~]# tar -cvf grid.tar /u01/app/11.2.0.3/grid 
[root@gxx2db01 ~]# tar -cvf oracle.tar /u01/app/oracle/product/11.2.0.3/dbhome_1
2.4 Back up the ILOM configuration files

Log in to any of the ILOM web interfaces, for example gxx2db01-ilom at https://10.100.84.118. Click the Maintenance tab, then the Backup/Restore tab, set Operation to Backup and Method to Browser, enter a Passphrase, and click Run; the browser will download an XML backup file.


3. Reset the ILOMs and reboot the Exadata

To make the upgrade go smoothly, it is best to reboot the whole Exadata once beforehand. The order is: first Reset the SP from each ILOM management interface, then stop the cell services and reboot all the cells, and once they are back up, reboot the database nodes. This Exadata has five ILOM interfaces, one for each of the two database nodes and three storage cells; they are reached by URL, and because of the firewall the network administrator had to open the ports first. In the interface choose Maintenance, then Reset SP, wait a short while, and you can reconnect. The five ILOM addresses are:

gxx2db01-ilom		https://10.100.84.118
gxx2db02-ilom		https://10.100.84.119
gxx2cel01-ilom		https://10.100.84.126
gxx2cel02-ilom		https://10.100.84.127
gxx2cel03-ilom       https://10.100.84.128

For the storage cells, stop the cell services first by running the following command on every cell:

cellcli -e alter cell shutdown services all

After stopping them, verify that all cell services are down:

cellcli -e list cell attributes msstatus,cellsrvstatus,rsstatus

Reboot the storage cell hosts:

sync
reboot

Once the cells are back up, check that the cell services started successfully; if so, all is well and the database nodes can be rebooted. The databases and clusterware were already stopped during the software backup; if they are still running, stop the databases first, then the clusterware, and then reboot the database nodes.

sync
reboot

4. Check the SSH trust relationships

For a smooth upgrade, the database node must have a trust relationship with the storage cells, implemented through SSH. First create /tmp/all_group containing the host names of the two database nodes and the three storage cells, and /tmp/cell_group containing the three cell host names, then run the commands below; if the output is returned without prompting for passwords, the trust relationships are fine.

 [root@gxx2db01 tmp]# dcli -g all_group -l root date
gxx2db01: Sat Sep  6 12:14:41 CST 2014
gxx2db02: Sat Sep  6 12:14:40 CST 2014
gxx2cel01: Sat Sep  6 12:14:41 CST 2014
gxx2cel02: Sat Sep  6 12:14:41 CST 2014
gxx2cel03: Sat Sep  6 12:14:41 CST 2014
[root@gxx2db01 tmp]# dcli -g cell_group -l root 'hostname -i'
gxx2cel01: 10.100.84.104
gxx2cel02: 10.100.84.105
gxx2cel03: 10.100.84.106

If the trust relationships are broken, rebuild them with the following commands:

ssh-keygen -t rsa
dcli -g cell_group -l root -k

5. Upgrade the LSI Disk Array Controller Firmware

The LSI Disk Array Controller Firmware can be installed in rolling or non-rolling mode; since we had requested a maintenance window, we used the non-rolling mode.

1. Upload the installation media FW12120140.zip to the /tmp directory on every cell.

2. Unzip the FW12120140.zip file:

[root@gxx2db01 tmp]# unzip FW12120140.zip -d /tmp
[root@gxx2db01 tmp]# mkdir -p /tmp/firmware
[root@gxx2db01 tmp]# tar -pjxf  FW12120140.tbz -C /tmp/firmware

A file like the following (name and MD5 checksum) should now exist under /tmp/firmware:

12.12.0.0140_AF2108_FW_Image.rom 5ff5650dd92acd4e62530bf72aa9ea83
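
The second field is the expected MD5 checksum of the firmware image, which can be verified before flashing; a minimal sketch:

[root@gxx2cel01 tmp]# md5sum /tmp/firmware/12.12.0.0140_AF2108_FW_Image.rom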

3. Verify the FW12120140.sh script:

#!/bin/ksh
echo date > /tmp/manual_fw_update.log
logfile=/tmp/manual_fw_update.log
HWModel=`dmidecode --string system-product-name | tail -1 | sed -e 's/[ \t]\+$//g;s/ /_/g'`
silicon_ver_lsi_card="`lspci 2>/dev/null | grep 'RAID' | grep LSI | awk '{print $NF}' | sed -e 's/03)/B2/g;s/05)/B4/g;'`"
silicon_ver_lsi_card=`echo $silicon_ver_lsi_card | sed -e 's/B2/B4/g'`
lsi_card_firmware_file="SUNDiskControllerFirmware_${silicon_ver_lsi_card}"
echo $lsi_card_firmware_file
echo "`date '+%F %T'`: Now updating the disk controller firmware ..." | tee -a $logfile
echo "`date '+%F %T'`: Now disabling cache of the disk controller ..." | tee -a $logfile
sync
/opt/MegaRAID/MegaCli/MegaCli64 -AdpCacheFlush -aALL -NoLog | tee -a $logfile
/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp WT -Lall -a0 -NoLog | tee -a $logfile
/opt/MegaRAID/MegaCli/MegaCli64 -AdpCacheFlush -aALL -NoLog | tee -a $logfile
/opt/MegaRAID/MegaCli/MegaCli64 -v | tee -a $logfile
/opt/MegaRAID/MegaCli/MegaCli64 -AdpFwFlash -f /tmp/firmware/12.12.0.0140_AF2108_FW_Image.rom  -NoVerChk -a0 -Silent -AppLogFile /tmp/manual_fw_update.log
if [ $? -ne 0 ]; then
   echo "`date '+%F %T'`: [ERROR] Failed to update the Disk Controller firmware. Will continue anyway ..." | tee -a $logfile
else
   echo "`date '+%F %T'`: [INFO] Disk controller firmware update command completed successfully." | tee -a $logfile
fi

Give the script permission 700:

chmod 700 /tmp/FW12120140.sh

4. Stop the databases and CRS

[oracle@gxx2db01 ~]$ srvctl stop instance -i orcl1 -d orcl
[oracle@gxx2db01 ~]$ srvctl stop instance -i orcl2 -d orcl
[oracle@gxx2db01 ~]$ srvctl stop instance -i gxypdb1 -d gxypdb
[oracle@gxx2db01 ~]$ srvctl stop instance -i gxypdb2 -d gxypdb
[oracle@gxx2db01 ~]$ srvctl stop instance -i jjscpd1 -d jjscpd
[oracle@gxx2db01 ~]$ srvctl stop instance -i jjscpd2 -d jjscpd
[root@gxx2db01 ~]# /u01/app/11.2.0.3/grid/bin/crsctl stop crs -f
[root@gxx2db01 ~]# /u01/app/11.2.0.3/grid/bin/crsctl check crs

5. Stop the services on all storage cells

[root@gxx2db01 ~]# dcli -l root -g cell_group "cellcli -e alter cell shutdown services all"

6. Create the file DISABLE_HARDWARE_FIRMWARE_CHECKS

[root@gxx2db01 ~]# dcli -l root -g cell_group "touch /opt/oracle.cellos/DISABLE_HARDWARE_FIRMWARE_CHECKS"

7. Disable the exachkcfg service

[root@gxx2db01 ~]# dcli -l root -g cell_group "chkconfig exachkcfg off"

8. Run the FW12120140.sh script on the cell

[root@gxx2cel01 tmp]# /tmp/FW12120140.sh
SUNDiskControllerFirmware_B4
2014-09-06 11:15:31: Now updating the disk controller firmware ...
2014-09-06 11:15:31: Now disabling cache of the disk controller ...

Cache Flush is successfully done on adapter 0.

Exit Code: 0x00
Set Write Policy to WriteThrough on Adapter 0, VD 0 (target id: 0) success
Set Write Policy to WriteThrough on Adapter 0, VD 1 (target id: 1) success
Set Write Policy to WriteThrough on Adapter 0, VD 2 (target id: 2) success
Set Write Policy to WriteThrough on Adapter 0, VD 3 (target id: 3) success
Set Write Policy to WriteThrough on Adapter 0, VD 4 (target id: 4) success
Set Write Policy to WriteThrough on Adapter 0, VD 5 (target id: 5) success
Set Write Policy to WriteThrough on Adapter 0, VD 6 (target id: 6) success
Set Write Policy to WriteThrough on Adapter 0, VD 7 (target id: 7) success
Set Write Policy to WriteThrough on Adapter 0, VD 8 (target id: 8) success
Set Write Policy to WriteThrough on Adapter 0, VD 9 (target id: 9) success
Set Write Policy to WriteThrough on Adapter 0, VD 10 (target id: 10) success
Set Write Policy to WriteThrough on Adapter 0, VD 11 (target id: 11) success

Exit Code: 0x00
Cache Flush is successfully done on adapter 0.
Exit Code: 0x00

      MegaCLI SAS RAID Management Tool  Ver 8.02.21 Oct 21, 2011
    (c)Copyright 2011, LSI Corporation, All Rights Reserved.
Exit Code: 0x00
95%   Completed2014-09-06 11:16:09: [INFO] Disk controller firmware update command completed successfully.

9.脚本执行成功之后,需要重启,这里需要注意的一点是,需要重启两次。

[root@gxx2cel01 tmp]#sync
[root@gxx2cel01 tmp]#shutdown -fr now

10.重启完成之后,可以检查LSI MegaRaid Disk Controller Firmware的版本。

[root@gxx2cel01 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -a0 -NoLog | grep 'FW Package Build'
FW Package Build: 12.12.0-0079
FW Version         : 2.120.203-1440
Current Size of FW Cache       : 399 MB
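
To run the same check on every cell in one pass, the command can be wrapped in dcli (a sketch, run from the database node):

dcli -g cell_group -l root "/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -a0 -NoLog | grep 'FW Package Build'"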

11.升级成功之后,移除文件DISABLE_HARDWARE_FIRMWARE_CHECKS

[root@gxx2cel01 ~]# dcli -l root -g cell_group "rm -fr /opt/oracle.cellos/DISABLE_HARDWARE_FIRMWARE_CHECKS"

12.开启exachkcfg服务

[root@gxx2cel01 ~]# dcli -l root -g cell_group "chkconfig exachkcfg on"

13.查看cells服务状态

[root@gxx2cel01 ~]# dcli -l root -g cell_group "cellcli -e list cell attributes msstatus,cellsrvstatus,rsstatus"
   running         running         running

Repeat the steps above, starting from step 5, on each of the remaining storage cells. Once every cell has been updated and the LSI MegaRaid Disk Controller Firmware has been verified, restart the services on all of the storage cells.
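
The final restart of the cell services is the mirror image of step 5; a minimal sketch, run from the database node once every cell has been flashed and rebooted:

dcli -g cell_group -l root "cellcli -e alter cell startup services all"
dcli -g cell_group -l root "cellcli -e list cell attributes msstatus,cellsrvstatus,rsstatus"    # all three should report running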

EXADATA升级—从11.2.3.1.0到11.2.3.3.0–(1)升级简介

客户的Exadata一体机版本目前是11.2.3.1.0,为了提高安全性和稳定性,提高产品的可用性,减少Bug发生,本次将对Exadata一体机进行改造,将11.2.3.1.0升级到11.2.3.3.0。

本期升级的操作流程如下:

1. 提前做备份,防止升级过程中出现意外,需要回退;

2. 重置ILOM(重启),重启存储节点,计算节点;

3. 升级LSI DISK Array Controller Firmware(存储节点);

4. 升级存储节点的IMAGE到11.2.3.3.0;

5. 释放计算节点Solaris操作系统;

6. 升级计算节点的IMAGE到11.2.3.3.0;

7. 升级BP23或者BP24,包括GI和RDBMS;

8. 升级交换机,从1.3.3-2升级到2.1.3-4;

本次升级所需要的安装介质如下:

Exadata 11.2.3.3.0 Storage Server Patch and InfiniBand switches —-> p16278923_112330_Linux-x86-64.zip

DBNODEUPDATE – ONE STEP UPDATE UTL. FOR LINUX DB SERVERS —-> p16486998_121111_Linux-x86-64.zip

EXADATA COMPUTE NODE 11.2.3.3.0 BASE REPOSITORY ISO —-> p17809253_112330_Linux-x86-64.zip

QUARTERLY DATABASE PATCH FOR EXADATA (JUL 2014 – 11.2.0.3.24) —-> p18835772_112030_Linux-x86-64.zip

EXADATA 11.2.3.3.0 PATCHMGR PLUG-INS DOWNLOAD —-> p17938410_112330_Linux-x86-64.zip

Bug 16397592 disk controller fw check is incorrect 12.9.0049 >= 12.12.00140 —-> FW12120140.zip

Patch 6880880 OPatch patch of version 11.2.0.3.6 for Oracle software releases 11.2.0.x (DEC 2013) —-> p6880880_112000_Linux-x86-64.zip

本次升级可能存在的风险点:

1. 存储节点损坏;

If a storage cell is damaged, it can be recovered using the CELL BOOT USB Flash Drive; see section 7.32 "Using the Oracle Exadata Storage Server Software Rescue Procedure" in the Oracle® Exadata Database Machine Owner's Guide 11g Release 2 (11.2);

2. 计算节点损坏;

对于计算节点的损坏,我们会对计算节点做备份,并拷贝到NFS服务器上,方便做恢复;

3. 集群软件、数据库软件损坏;

对于集群软件、数据库软件的损坏,我们会对集群软件和数据库软件做备份,并拷贝到NFS服务器上,方便做恢复;

4. 数据库损坏;

对于数据库的损坏,直接用RMAN进行恢复,事先做全备把备份文件迁移到NFS服务器上;
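
For the full backup mentioned in item 4, a minimal RMAN sketch could look like the following; the mount point /nfs/backup is purely illustrative and must be replaced with the real NFS path:

rman target / <<'EOF'
run {
  # /nfs/backup is a hypothetical NFS mount point used only for illustration
  allocate channel c1 device type disk format '/nfs/backup/full_%U';
  backup as compressed backupset database plus archivelog;
  backup current controlfile format '/nfs/backup/ctl_%U';
  release channel c1;
}
EOF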

ORA-00600: internal error code, arguments: [15709], [29], [1]故障解决

客户一套10.2.0.4的数据库,一个实例突然的Crash掉了。客户想让我们帮忙分析宕机的原因。对于这种数据库突然Crash的问题,我们首先就会看数据库的Alert日志,可以看到在宕机之前,SMON进程报了ORA-00600[15709]的错误,紧接数据库就输出了一条信息“Fatal internal error happened while SMON was doing active transaction recovery.”也就是说SMON在做活动事务恢复的时候出现了异常。最终导致了数据库实例的宕机。日志输出如下所示:

Fri Sep 26 10:53:35 2014
Errors in file /oracle/app/oracle/admin/wxyydb/bdump/wxyydb_smon_28997.trc:
ORA-00600: internal error code, arguments: [15709], [29], [1], [], [], [], [], []
ORA-30319: Message 30319 not found;  product=RDBMS; facility=ORA
Fri Sep 26 10:53:55 2014
Fatal internal error happened while SMON was doing active transaction recovery.
Fri Sep 26 10:53:55 2014
Errors in file /oracle/app/oracle/admin/wxyydb/bdump/wxyydb_smon_28997.trc:
ORA-00600: internal error code, arguments: [15709], [29], [1], [], [], [], [], []
ORA-30319: Message 30319 not found;  product=RDBMS; facility=ORA
SMON: terminating instance due to error 474
Termination issued to instance processes. Waiting for the processes to exit
Fri Sep 26 10:54:05 2014
Instance termination failed to kill one or more processes
Instance terminated by SMON, pid = 28997

我们再来分析一下wxyydb_smon_28997.trc文件的信息。可以看到数据库的SMON进程一直尝试在做并行恢复事务。在恢复的过程中遇到了ORA-00600错误,最终底层代码异常触发了数据库的宕机。

*** 2014-09-26 10:10:36.236
Parallel Transaction recovery caught error 30319 
*** 2014-09-26 10:15:10.643
Parallel Transaction recovery caught exception 30319
*** 2014-09-26 10:15:21.816
Parallel Transaction recovery caught error 30319 
*** 2014-09-26 10:19:51.707
Parallel Transaction recovery caught exception 30319
*** 2014-09-26 10:53:35.830
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [15709], [29], [1], [], [], [], [], []
ORA-30319: Message 30319 not found;  product=RDBMS; facility=ORA
----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
ksedst()+64          call     ksedst1()            000000000 ? 000000001 ?
ksedmp()+2176        call     ksedst()             000000000 ?
                                                   C000000000000C9F ?
                                                   4000000004057F40 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ?
ksfdmp()+48          call     ksedmp()             000000003 ?
kgeriv()+336         call     ksfdmp()             C000000000000695 ?
                                                   000000003 ?
                                                   40000000095185E0 ?
                                                   00000EC33 ? 000000000 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ?
kgeasi()+416         call     kgeriv()             6000000000031770 ?
                                                   6000000000032828 ?
                                                   4000000001A504E0 ?
                                                   000000002 ?
                                                   9FFFFFFFFFFFA138 ?
$cold_kxfpqsrls()+1  call     kgeasi()             6000000000031770 ?
168                                                9FFFFFFFFD3D2290 ?
                                                   000003D5D ? 000000002 ?
                                                   000000002 ? 0000003E7 ?
                                                   000003D5D ?
                                                   9FFFFFFFFD3D22A0 ?
kxfpqrsod()+1104     call     $cold_kxfpqsrls()    C0000004FDF7A838 ?
                                                   C0000004FDF74430 ?
                                                   000000004 ?
                                                   9FFFFFFFFFFFA200 ?
                                                   C0000000000011AB ?
                                                   4000000003AA1250 ?
                                                   00000EDF5 ? 000000001 ?
kxfpdelqrefs()+640   call     kxfpqrsod()          C0000004FDF74430 ?
                                                   000000001 ?
                                                   60000000000B6300 ?
                                                   C000000000000694 ?
                                                   4000000003DD14F0 ?
                                                   00000EE2D ?
                                                   60000000000C6708 ?
kxfpqsod_qc_sod()+2  call     kxfpdelqrefs()       00000003E ? 000000001 ?
016                                                60000000000B6300 ?
                                                   C000000000001028 ?
                                                   40000000025DE5A0 ?
                                                   4000000001B1A110 ?
                                                   60000000000C2D04 ?
                                                   60000000000C2E90 ?
kxfpqsod()+816       call     kxfpqsod_qc_sod()    000000010 ? 000000001 ?
                                                   9FFFFFFFFFFFA260 ?
                                                   60000000000B6300 ?
                                                   9FFFFFFFFFFFA7F0 ?
                                                   C000000000001028 ?
                                                   40000000025DF810 ?
                                                   00000EE65 ?
ktprdestroy()+208    call     kxfpqsod()           C0000004FDF7A838 ?
                                                   000000001 ?
                                                   9FFFFFFFFFFFA810 ?
                                                   60000000000B6300 ?
                                                   9FFFFFFFFFFFAD90 ?
ktprbeg()+8272       call     ktprdestroy()        C000000000001026 ?
                                                   40000000025615B0 ?
                                                   000006E61 ? 000000000 ?
                                                   4000000001052E40 ?
                                                   000000000 ?
ktmmon()+10096       call     ktprbeg()            9FFFFFFFFFFFBE70 ?
                                                   9FFFFFFFFFFFADA0 ?
                                                   60000000000B6300 ?
                                                   40000000028B75A0 ?
                                                   00000EF21 ?
                                                   9FFFFFFFFFFFADD8 ?
                                                   9FFFFFFFFFFFADE0 ?
ktmSmonMain()+64     call     ktmmon()             9FFFFFFFFFFFD140 ?
ksbrdp()+2816        call     ktmSmonMain()        C000000100E1CA60 ?
                                                   C000000000000FA5 ?
                                                   000007361 ?
                                                   4000000003B5AE10 ?
                                                   C000000000000205 ?
                                                   400000000409DCD0 ?
opirip()+1136        call     ksbrdp()             9FFFFFFFFFFFD150 ?
                                                   60000000000B6300 ?
                                                   9FFFFFFFFFFFDC90 ?
                                                   4000000002863EF0 ?
                                                   000004861 ?
                                                   C000000000000B1D ?
                                                   60000000000318F0 ?
$cold_opidrv()+1408  call     opirip()             9FFFFFFFFFFFEA70 ?
                                                   000000004 ?
                                                   9FFFFFFFFFFFF090 ?
                                                   9FFFFFFFFFFFDCA0 ?
                                                   60000000000B6300 ?
                                                   C000000000000DA1 ?
sou2o()+336          call     $cold_opidrv()       000000032 ?
                                                   9FFFFFFFFFFFF090 ?
                                                   60000000000C2C78 ?
$cold_opimai_real()  call     sou2o()              9FFFFFFFFFFFF0B0 ?
+640                                               000000032 ? 000000004 ?
                                                   9FFFFFFFFFFFF090 ?
main()+368           call     $cold_opimai_real()  000000003 ? 000000000 ?
main_opd_entry()+80  call     main()               000000003 ?
                                                   9FFFFFFFFFFFF598 ?
                                                   60000000000B6300 ?
                                                   C000000000000004 ?
 

Searching Oracle Support for ORA-00600 [15709] turns up the note "SMON may fail with ORA-00600 [15709] Errors Crashing the Instance (Doc ID 736348.1)", whose error signature matches what we saw. The note lists the failing call stack as: kxfpqsrls <- kxfpqrsod <- kxfpdelqrefs <- kxfpqsod_qc_sod <- kxfpqsod <- ktprdestroy <- ktprbeg <- ktmmon, which is essentially the same stack shown in our SMON trace. So the instance hit bug 6954722 during transaction recovery; and if that patch is already installed yet the same symptom appears, it is most likely the similar bug 9233544. Oracle really does have a lot of bugs.

Bug 6954722 affects 9.2.0.8 and 10.2.0.4, and is fixed in 10.2.0.4.2, 10.2.0.5, 11.1.0.7 and 11.2.0.1. The options for resolving bug 6954722 are:

1.Use the following workaround

Set fast_start_parallel_rollback=false and recovery_parallelism=0

OR

2.Apply one-off  <<Patch:6954722>>, if available for your platform/version here.

OR

3.Upgrade to fixed release 10.2.0.5, 11.1.0.7 or 11.2.0.1.
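
A minimal sketch of workaround 1 above (assuming an spfile is in use; FAST_START_PARALLEL_ROLLBACK is dynamic, while RECOVERY_PARALLELISM is static and only takes effect after the next restart):

sqlplus / as sysdba <<'EOF'
alter system set fast_start_parallel_rollback=false scope=both;
alter system set recovery_parallelism=0 scope=spfile;
EOF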

bug 9233544会影响10.2.0.4,11.1.0.7和11.2.0.1这三个版本,并且在11.2.0.3和12.1上得到了修复,解决bug 9233544的方法是:

1.Apply patchset 11.2.0.3, in which Bug: 9233544 is fixed.

OR

2.Check if one-off Patch:9233544 is available for your release and platform here.

A careful check of the installed patches showed that patch 6954722 was already in place, which points to bug 9233544. The choice is therefore either to upgrade to 11.2.0.3 or to apply the one-off patch 9233544. Since an upgrade to 11.2.0.3 would be far too disruptive, we suggested to the customer that applying the one-off patch was the way to go.

Oracle数据库升级后保障SQL性能退化浅谈

一、数据库升级后保障手段

为了保障从10.2.0.4版本升级到11.2.0.4版本更加平稳,我们事先采用了oracle性能分析器(SQL Performance Analyzer)来预测数据库的关键SQL在Oracle 11.2.0.4版本上的性能情况。以便提前发现问题并做相关性能优化。这一部分的SQL已经提前进行了优化处理。但是Oracle SPA功能这只是预测,我们并不能完全仿真真实应用业务压力上来之后对数据库性能造成的影响。因此,我们需要对其他SQL问题进行快速的处理,保障系统升级后平稳的运行。

二、SQL语句性能下降

2.1  和10g对比,检查执行计划有无发生变化

The first thing to do is compare the execution plans generated on the new database with those from the old one. Doing this by hand normally means logging in to the new database to query, then logging in again to the old one, which is fairly tedious. When we ran SPA we had already captured the 10g SQL statements and saved the result set in a SQLSET; that result set can be exported as a table and imported into the 11g environment under the SPA schema. The table holds one to two months of cursors from the 10g database, roughly 2 to 3 million rows; after de-duplicating and removing literal-only SQL, only 200,000 to 300,000 rows remain. We can then use a high-CPU-consumption monitoring script, or the ASH and AWR reports, to find the sql_id causing the performance problem, and compare the PLAN_HASH_VALUE of the same SQL_ID between this table and v$sql to check whether the execution plan has changed. The query is as follows:

select distinct 'NEW   ',sql_id,PLAN_HASH_VALUE from V$SQL where sql_id='&sqlid'
union 
select distinct 'OLD   ',sql_id,PLAN_HASH_VALUE from spa.SQLSET2_TAB a where sql_id='&sqlid';
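
Building on the query above, the same staging table can also be scanned in bulk to list every captured statement whose plan has changed on 11g (a sketch; spa.SQLSET2_TAB is the table named in this post):

sqlplus -s / as sysdba <<'EOF'
select distinct s.sql_id,
       o.plan_hash_value as old_phv,
       s.plan_hash_value as new_phv
  from v$sql s, spa.sqlset2_tab o
 where s.sql_id = o.sql_id
   and s.plan_hash_value <> o.plan_hash_value;
EOF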
2.2  快速切换统计信息

When a changed execution plan turns out to be caused by statistics, we want two sets of statistics ready to switch between at any time. After the data has been migrated to 11g, gather statistics on all tables once and export them into a table with export_database_stats; then import the 10g statistics into 11g as well. That way the statistics can be switched quickly whenever a problem shows up.

exec dbms_stats.IMPORT_TABLE_STATS(OWNNAME=>'TABLE_OWNER',TABNAME=>'TABLE_NAME',STATTAB=>'STAT_11G',statown=>'SPA');
exec dbms_stats.IMPORT_TABLE_STATS(OWNNAME=>'TABLE_OWNER',TABNAME=>'TABLE_NAME',STATTAB=>'STAT_10G',statown=>'SPA');
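
For completeness, a sketch of how the 11g statistics can be staged in the first place (owner SPA and table name STAT_11G as used above; the 10g set is exported the same way on the old database and moved over with Data Pump):

sqlplus / as sysdba <<'EOF'
exec dbms_stats.create_stat_table(ownname=>'SPA', stattab=>'STAT_11G');
exec dbms_stats.export_database_stats(stattab=>'STAT_11G', statown=>'SPA');
EOF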
2.3 使用SPM快速固定执行计划

For statements whose statistics are identical but whose plan has still regressed, we use SPM to pin the execution plan. As mentioned earlier, the SPA work leaves us with a SQLSET: we convert the imported staging table back into an 11g SQL tuning set and then fix the plans with LOAD_PLANS_FROM_SQLSET.

declare
my_plans pls_integer;
begin
my_plans:=DBMS_SPM.LOAD_PLANS_FROM_SQLSET(SQLSET_NAME=>'SQLSET1',SQLSET_OWNER=>'SPA',basic_filter => 'sql_id=''6j2pfum10dvxg''');
end;
/
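
The conversion of the imported staging table back into an 11g SQL tuning set, mentioned above, is typically done with DBMS_SQLTUNE.UNPACK_STGTAB_SQLSET; a sketch, with owner and table names taken from this post (adjust them to your own):

sqlplus / as sysdba <<'EOF'
begin
  dbms_sqltune.unpack_stgtab_sqlset(
    sqlset_name          => '%',            -- unpack every tuning set packed into the staging table
    sqlset_owner         => 'SPA',
    replace              => true,
    staging_table_name   => 'SQLSET2_TAB',  -- the staging table imported from 10g
    staging_schema_owner => 'SPA');
end;
/
EOF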

For statements that are hard parsed with literals, there may be too many SQL_IDs to bind one by one; in that case use the following syntax in basic_filter:

basic_filter => 'sql_text like ''select /*LOAD_STS*/%'''

以下是一个SPM绑定的示例:

SQL> explain plan for select distinct prod_id from pd_prod_rel a,pd_userprc_info_41 b where a.element_idb= :ELEMENT_IDB           and a.relation_type in('3','4') and a.element_ida=b.prod_id  and b.exp_date>sysdate and b.id_no=20310013952141  and  PROD_MAIN_FLAG='1'and RELPRCINS_ID=0;
Explained.

SQL> select * from table(dbms_xplan.display());
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------
Plan hash value: 2479329866

---------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name               | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                    |     1 |    51 |    23   (9)| 00:00:01 |
|   1 |  HASH UNIQUE                 |                    |     1 |    51 |    23   (9)| 00:00:01 |
|*  2 |   TABLE ACCESS BY INDEX ROWID| PD_USERPRC_INFO_41 |     1 |    33 |     5   (0)| 00:00:01 |
|   3 |    NESTED LOOPS              |                    |     1 |    51 |    22   (5)| 00:00:01 |
|   4 |     INLIST ITERATOR          |                    |       |       |            |          |
|*  5 |      INDEX RANGE SCAN        | IDX_PRODREL        |     1 |    18 |    17   (0)| 00:00:01 |
|*  6 |     INDEX RANGE SCAN         | IDX_USERPRC_41_02  |     8 |       |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("PROD_MAIN_FLAG"='1' AND "B"."EXP_DATE">SYSDATE@! AND
              "A"."ELEMENT_IDA"="B"."PROD_ID")
   5 - access(("A"."RELATION_TYPE"='3' OR "A"."RELATION_TYPE"='4') AND
              "A"."ELEMENT_IDB"=:ELEMENT_IDB)
       filter("A"."ELEMENT_IDB"=:ELEMENT_IDB)
   6 - access("B"."ID_NO"=20310013952141 AND "RELPRCINS_ID"=0)

Note
-----
   - SQL plan baseline "SQL_PLAN_9z5mbs0jxkucn6ea6bdfe" used for this statement
27 rows selected.

SQL> select SQL_HANDLE,PLAN_NAME,ENABLED,ACCEPTED,FIXED,OPTIMIZER_COST from dba_sql_plan_baselines; 
SQL_HANDLE                     PLAN_NAME                      ENA ACC FIX OPTIMIZER_COST
------------------------------ ------------------------------ --- --- --- --------------
SQL_9f966bc023d96994           SQL_PLAN_9z5mbs0jxkucn6ea6bdfe YES YES NO              10

删除SPM的示例:

declare
my_plans pls_integer;
begin
my_plans:=DBMS_SPM.DROP_SQL_PLAN_BASELINE(SQL_HANDLE=>'SQL_9f966bc023d96994');
end;
/

SQL> select SQL_HANDLE,PLAN_NAME,ENABLED,ACCEPTED,FIXED,OPTIMIZER_COST from dba_sql_plan_baselines; 
2.4 使用SQL Profile来绑定执行计划(COE脚本)

有时候使用SPM无法固定执行计划,需要我们使用SQL PROFILE来进行执行计划的固定。

-------------在10g环境下执行
SQL>START coe_xfr_sql_profile.sql

When the script finishes it generates a profile SQL script; take that script to the 11g database and run it there. An example follows:

SQL> start coe_xfr_sql_profile.sql 
Parameter 1:
SQL_ID (required)

Enter value for 1: 7jduw71f2w00p

PLAN_HASH_VALUE AVG_ET_SECS                ==》这里可以选择最好的计划,生成profile。
--------------- -----------
     2344910864        .051
Parameter 2:
PLAN_HASH_VALUE (required)

Enter value for 2: 2344910864

Values passed to coe_xfr_sql_profile:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SQL_ID         : "7jduw71f2w00p"
PLAN_HASH_VALUE: "2344910864"

SQL>BEGIN
  2    IF sql_text IS NULL THEN
  3      RAISE_APPLICATION_ERROR(-20100, 'SQL_TEXT for SQL_ID &&sql_id. was not found in memory (gv$sqltext_with_newlines) or AWR (dba_hist_sqltext).');
  4    END IF;
  5  END;
  6  /
SQL>SET TERM OFF;
SQL>BEGIN
  2    IF other_xml IS NULL THEN
  3      RAISE_APPLICATION_ERROR(-20101, 'PLAN for SQL_ID &&sql_id. and PHV &&plan_hash_value. was not found in memory (gv$sql_plan) or AWR (dba_hist_sql_plan).');
  4    END IF;
  5  END;
  6  /
SQL>SET TERM OFF;

Execute coe_xfr_sql_profile_7jduw71f2w00p_2344910864.sql
on TARGET system in order to create a custom SQL Profile
with plan 2344910864 linked to adjusted sql_text.
COE_XFR_SQL_PROFILE completed.

Then simply run coe_xfr_sql_profile_7jduw71f2w00p_2344910864.sql on the 11g database to create the profile. To drop a SQL profile, use the following command:

exec dbms_sqltune.DROP_SQL_PROFILE(name=>'');