Some RAC notes
reference JulianDyke’s Presentations.
Global Cache Wait Events: Overview
It includes 5 categories: placeholders, Block-oriented,Message-oriented,Contention-oriented,Load-oriented.
Placeholders.
gc current/cr request: These wait events are relevant only while a gc request for a cr or current buffer is in progress. They act as placeholders until the request completes.
Block-oriented
gc current block 2-way
gc current block 3-way
gc cr block 2-way
gc cr block 3-wayThe block-oriented wait event statistics indicate that a block was received as either the result of a 2-way or a 3-way message, that is, the block was sent from either the resource master requiring 1 message and 1 transfer, or was forwarded to a third node from which it was sent, requiring 2 messages and 1 block transfer.
Message-oriented
gc current grant 2-way
gc cr grant 2-wayThe message-oriented wait event statistics indicate that no block was received because it was not cached in any instance. Instead a global grant was given, enabling the requesting instance to read the block from disk or modify it.
If the time consumed by these events is high, then it may be assumed that the frequently used SQL causes a lot of disk I/O (in the event of the cr grant) or that the workload inserts a lot of data and needs to find and format new blocks frequently (in the event of the current grant).
Contention-oriented
gc current block busy
gc cr block busy
gc buffer busy
gc buffer busy acquire/releaseThe gc current block busy and gc cr block busy wait events indicate that the local instance that is making the request did not immediately receive a current or consistent read block. The term busy in these events’ names indicates that the sending of the block was delayed on a remote instance. For example, a block cannot be shipped immediately if Oracle Database has not yet written the redo for the block’s changes to a log file.
In comparison to block busy wait events, a gc buffer busy event indicates that Oracle Database cannot immediately grant access to data that is stored in the local buffer cache. This is because a global operation on the buffer is pending and the operation has not yet completed. In other words, the buffer is busy and all other processes that are attempting to access the local buffer must wait to complete.
The existence of gc buffer busy events also means that there is block contention that is resulting in multiple requests for access to the local block. Oracle Database must queue these requests. The length of time that Oracle Database needs to process the queue depends on the remaining service time for the block. The service time is affected by the processing time that any network latency adds, the processing time on the remote and local instances, and the length of the wait queue.
The average wait time and the total wait time should be considered when being alerted to performance issues where these particular waits have a high impact. Usually, either interconnect or load issues or SQL execution against a large shared working set can be found to be the root cause.The contention-oriented wait event statistics indicate that a block was received which was pinned by a session on another node,
was deferred because a change had not yet been flushed to disk or because of high concurrency, and therefore could not be shipped immediately.
A buffer may also be busy locally when a session has already initiated a cache fusion operation and is waiting for its completion
when another session on the same node is trying to read or modify the same data.
High service times for blocks exchanged in the global cache may exacerbate the contention,
which can be caused by frequent concurrent read and write accesses to the same data.
Load-oriented
gc current block congested
gc cr block congestedThe load-oriented wait events indicate that a delay in processing has occurred in the GCS, which is usually caused by high load,
CPU saturation and would have to be solved by additional CPUs, load-balancing,
off loading processing to different times or a new cluster node.
For the events mentioned,the wait time encompasses the entire round trip
from the time a session starts to wait after initiating a block request until the block arrives.
Below pick up some most common global cache wait events.
NAME PARAMETER1 PARAMETER2 PARAMETER3
————————————————– ———- ———- ———-
ASM PST query : wait for [PM][grp][0] grant
Streams: RAC waiting for inter instance ack
gc assume le
gc block recovery request file# block# class#
gc buffer busy file# block# id#
gc claim
gc cr block 2-way
gc cr block 3-way
gc cr block busy
gc cr block congested
gc cr block lost
gc cr block unknown
gc cr cancel le
gc cr disk read
gc cr disk request file# block# class#
gc cr failure
gc cr grant 2-way
gc cr grant busy
gc cr grant congested
gc cr grant unknown
gc cr multi block request file# block# class#
gc cr request file# block# class#
gc current block 2-way
gc current block 3-way
gc current block busy
gc current block congested
gc current block lost
gc current block unknown
gc current cancel le
gc current grant 2-way
gc current grant busy
gc current grant congested
gc current grant unknown
gc current multi block request file# block# id#
gc current request file# block# id#
gc current retry
gc current split
gc domain validation file# block# class#
gc freelist
gc object scan
gc prepare
gc quiesce
gc recovery free
gc recovery quiesce
gc remaster file# block# class#
lock remastering
pi renounce write complete file# block#
retry contact SCN lock master48 rows selected.
SQL> select event,total_waits,TOTAL_TIMEOUTS,TIME_WAITED,AVERAGE_WAIT from v$system_event
where WAIT_CLASS=’Cluster’ order by event;EVENT TOTAL_WAITS TOTAL_TIMEOUTS TIME_WAITED AVERAGE_WAIT
——————————————— ———– ————– ———– ————
gc buffer busy 834046 15 58008 .07
gc cr block 2-way 364991 0 24321 .07
gc cr block busy 10276 5 2252 .22
gc cr disk read 47 0 2 .05
gc cr failure 2 0 0 .03
gc cr grant 2-way 458415 0 20068 .04
gc cr grant congested 2 0 0 .2
gc cr multi block request 93680 6 5519 .06
gc current block 2-way 179535 0 10782 .06
gc current block busy 861 0 1871 2.17
gc current block congested 1 0 0 .27EVENT TOTAL_WAITS TOTAL_TIMEOUTS TIME_WAITED AVERAGE_WAIT
——————————————— ———– ————– ———– ————
gc current block unknown 19 0 1 .06
gc current grant 2-way 3002 0 130 .04
gc current grant busy 50221 0 2840 .06
gc current multi block request 57531 0 2644 .05
gc current retry 5 0 12 2.34
gc current split 11 0 15 1.38
gc object scan 1 1 0 0
gc remaster 24 0 243 10.1119 rows selected.
Also there’re lots of Cache Fusion View.
gv$cache_transfer
gv$transfer: Monitor blocks transferred by object
gv$class_cache_transfer
gv$transfer: Monitor block transfer by class
gv$file_cache_transfer
gv$transfer: Monitor the blocks transferred per file
gv$temp_cache_transfer
gv$transfer: Monitor the transfer of temporary tablespace blocks
V$INSTANCE_CACHE_TRANSFER:theGlobal Cache Transfer Stats section contains dataV$GES_STATISTICS: the Global Enqueue Statistics
V$CR_BLOCK_SERVER: the Global CR Served Stats
V$CURRENT_BLOCK_SERVER:the Global CURRENT Served StatsThe Segment Statistics section also includes the GC Buffer Busy Waits, CR Blocks Received, and CUR Blocks Received information for relevant segments.
How to tune hot table?
Less rows per blockLess block: ALTER TABLE ……MINIMIZE RECORDS_PER_BLOCKRECORDS_BLOCK
Tune Sequences
use NoOrder, cached sequences
Tune index block contention
Use reverse index
Use partitioned index, multiple index segment header reduce single-index-header contention. you can hash partition index.
Measure Global cache transfer
With cache fusion, blocks are transferred from remote cache if a suitable block is found in the remote cache avoiding costly disk reads.
Block transfer between caches are done by LMS processes.Until 10.2.0.1, LMS processes are running in normal priority.If there is CPU starvation in any server, then all instances will be
affected due to LMS latency.
After 10gR2, LMS runs in real time mode.
Contention in blocks can be measured by using the block transfer time. To determine
block transfer time, examine the statistics global cache cr block receive time and global cache
cr blocks received. The time is determined by calculating the ratio of global cache cr block
receive time to global cache cr blocks received. The values for these statistics are taken
from the gv$sysstat view shown below:The following script shows this calculation.
column “AVG RECEIVE TIME (ms)” format 9999999.9
col inst_id for 9999
prompt GCS CR BLOCKS
set numwidth 20
column “AVG CR BLOCK RECEIVE TIME (ms)” format 9999999.9
select
b1.inst_id,
b2.value “GCS CR BLOCKS RECEIVED”,
b1.value “GCS CR BLOCK RECEIVE TIME”,
((b1.value/b2.value) * 10) “AVG CR BLOCK RECEIVE TIME (ms)”
from gv$sysstat b1,
gv$sysstat b2
where b1.name=’gc cr block receive time’
and b2.name=’gc cr blocks received’
and b1.inst_id=b2.inst_id;INST_ID RECEIVED RECEIVE TIME AVG RECEIVE TIME (ms)
——- ———- ———— ———————
1 2791 3287 14.8
2 3760 7482 19.9
So, Global cache latency ~= Interconnect latency for message from & to LMS + LMS processing latency + LGWR log flush latency
Any performance problem of interconnect latency, log buffer writes latency, LMS CPU starvation would cause long global cache transfer time and global buffer busy wait.
MTU and jumbo_frames
In concept, large MTU (9000bytes) perform better than default value (1500 bytes) during transfering global buffer between nodes.
Per this test, jumbo frames reduce UDP transferred dramatically, however, it doesn’t reduce CPU usage too much.
To tune interconnect , always verify that you use a private network for your interconnect, and that your private network is configured properly. Ensure that a network link is operating in full duplex mode. More powerful network interface and Ethernet switches is prefered.
High-Water Mark Considerations and ASSM
When application load Wait events would be,
A high percentage of wait time for enq: HW – contention
A high percentage of wait time for gc current grant events
The former is a consequence of the serialization on the HWM enqueue, and the latter is because of the fact that current access to the new data blocks that need formatting is required for the new block operation.
A certain combination of wait events and statistics presents itself in applications where the insertion of data is a dominant business function and new blocks have to be allocated frequently to a segment.
If data is inserted at a high rate, new blocks may have to be made available after unfruitful searches for free space. This has to happen while holding the high-water mark (HWM) enqueue.
Therefore,the main recommendation to alleviate the symptoms is to define uniform and large extent sizes for the locally managed and automatic space managed segments (ASSM) that are subject to high-volume inserts.
Automatic Segment Space Management (ASSM) manages free space in objects using bitmaps rather than chains. Each instance is allocated a different bitmap block from which to manage free space and therefore there is less contention.
Application Scale on RAC
RAC imposes an overhead in terms of shared memory, CPU and I/O. Consequently linear scalability is difficult to achieve. This is mainly due to the additional cost of supporting Global Cache Services (GCS) and Global Enqueue Services (GES).
A scaling factor of 1.8 for a second instance is considered good by most experienced users.
However, the amount an application scales is almost entirely dependent on the design of the application. Applications which minimizing locking and less parsing and also which achieve high levels of node affinity for database blocks (logical partitioning) are most likely to scale well.
Applications where there is significant parsing or other contention are unlikely to scale.
Process
lms and cssd etc are running in RT mode, others are in TS mode.
On Unix platforms an additional daemon process called OPROCD is configured. This process is locked into memory to monitor the cluster and provide I/O fencing. It provides similar functionality to the hangcheck timer on Linux.
OPROCD performs its check, stops running, and if the wake up is beyond the expected time, then OPROCD reboots the node.
An OPROCD failure results in Oracle Clusterware restarting the node.
ps -ef -o class,pri,pid,nlwp,args,fname | egrep ‘TEST|crs_1|ASM|init|d.bin|ocls|sleep|evmlogger|oprocd|diskmon’ | sort
RT 100 13775 1 asm_lms0_+ASM2 oracle
RT 100 21028 1 ora_lms0_TEST2 oracle
RT 100 21032 1 ora_lms1_TEST2 oracle
RT 100 21036 1 ora_lms2_TEST2 oracle
RT 100 21040 1 ora_lms3_TEST2 oracle
RT 100 21044 1 ora_lms4_TEST2 oracle
RT 100 21048 1 ora_lms5_TEST2 oracle
RT 100 21052 1 ora_lms6_TEST2 oracle
RT 100 21056 1 ora_lms7_TEST2 oracle
RT 100 21060 1 ora_lms8_TEST2 oracle
RT 100 21064 1 ora_lms9_TEST2 oracle
RT 100 21068 1 ora_lmsa_TEST2 oracle
RT 100 21072 1 ora_lmsb_TEST2 oracle
RT 100 21076 1 ora_lmsc_TEST2 oracle
RT 100 21088 1 ora_lmsd_TEST2 oracle
RT 100 21092 1 ora_lmse_TEST2 oracle
RT 100 21096 1 ora_lmsf_TEST2 oracle
RT 110 13408 1 /bin/sh /etc/init.d/init.cssd oclsomon init.css
RT 110 13431 1 /bin/sh /etc/init.d/init.cssd daemon init.css
RT 110 13549 1 csh -c /bin/sh -c ‘cd /export/home/oracle/products/10204/crs_1/log/qadb007/css csh
RT 110 13550 1 /bin/sh -c cd /export/home/oracle/products/10204/crs_1/log/qadb007/cssd/oclsom sh
RT 110 13551 1 /export/home/oracle/products/10204/crs_1/bin/oclsomon.bin oclsomon
RT 110 13623 15 /export/home/oracle/products/10204/crs_1/bin/ocssd.bin ocssd.bi
RT 159 13537 1 /export/home/oracle/products/10204/crs_1/bin/oprocd.bin run -t 1000 -m 500 -f oprocd.b
TS 50 12880 1 /bin/sh /etc/init.d/init.cssd fatal init.css
Log directory
$CRS_HOME/log/ directory contains:
- Oracle Clusterware alert log e.g. alertserver6.log
- client - logfiles for OCR applications including CLSCFG, CSS, OCRCHECK, OCRCONFIG, OCRDUMP and OIFCFG
- crsd - logfiles for CRS daemon including crsd.log
- cssd - logfiles for CSS daemon including ocssd.log
- evmd - logfiles for EVM daemon including evmd.log
- racg - logfiles for node applications including VIP and ONS
Where configured OPROCD logs are written to /etc/oracle/oprocd or /var/opt/oracle/oprocd
Memory
Significant memory areas in a RAC instance include
- gcs resources - a segmented array that contains the block resources for the global resource directory
- gcs shadows - a segmented array that contains the shadow structures for blocks held by other instances
- ges resource - a heap-based structure that contains the lock resources for the global resource directory
- ges enqueues - a segmented array that contains the enqueues for the global resource directory
- ges big msg buffers - an internal segmented array that contains buffers used to build messages to pass down the interconnect to other instances.
SQL> select * from v$sgastat where name like ‘%ges%’ or name like ‘%gcs%’ or name like ‘%KCL%’ order by name;
POOL NAME BYTES
———— ————————– ———-
shared pool gcs I/O statistics struct 32
shared pool gcs affinity 4160
shared pool gcs commit sga state 159768
shared pool gcs mastership buckets 6144
shared pool gcs opaque in 4104
shared pool gcs res hash bucket 262144
shared pool gcs res latch table 196608
shared pool gcs resource freelist arr 3328
shared pool gcs resource freelist dyn 1024
shared pool gcs resources 27989248
shared pool gcs scan queue array 2816
shared pool gcs shadow locks dyn seg 1024
shared pool gcs shadow locks freelist 3328
shared pool gcs shadows 14560896POOL NAME BYTES
———— ————————– ———-
shared pool ges big msg buffers 8652008
shared pool ges deadlock xid freelist 11264
shared pool ges deadlock xid hash tab 17800
shared pool ges enqueue cur. usage pe 256
shared pool ges enqueue max. usage pe 256
shared pool ges enqueue multiple free 10240
shared pool ges ipc instance maps 384
shared pool ges lmd process descripto 2760
shared pool ges lms process descripto 44160
shared pool ges process array 407040
shared pool ges process hash table 14080
shared pool ges recovery domain table 176
shared pool ges regular msg buffers 478808
shared pool ges res mastership bucket 4096
shared pool ges reserved msg buffers 2480248
shared pool ges resource 3421336
shared pool ges resource hash seq tab 8192
shared pool ges resource hash table 360448
shared pool ges resource pools 15872
shared pool ges scan queue array 176
shared pool ges shared global area 2392837 rows selected.;
PI Image
Before an instance passes a dirty block to another instance it
- Flushes redo buffer to redo log
- Retains past image (PI) of block in buffer cache
- PI is retained until another instance writes block to disk
- It can be used to reduce recovery times
Recorded in V$BH.STATUS as pi . Based on X$BH.STATE (value 8 in Oracle 10.2)
redo log flushes
Uncommitted changes MUST be flushed to the redo log before the LMS process can ship a consistent block to another instance.
Reading process must wait until redo log changes have been written to redo log by LGWR process. This is a bad wait event for standard RAC databases as “Reads must wait for redo log writes”.
- Number of redo log flushes is recorded in the FLUSHES column of V$CR_BLOCK_SERVER.
- Redo log flush time is recorded in the gc cr block flush time statistic for the LMS process.
so if LGWR has performance problem, it would impact cache fusion performance.
Remaster objects
You can force remastering of an object in Oracle 10.1 and above using the LKDEBUG tool in ORADEBUG.
First you need to identify a data object id, for example:
SELECT data_object_id FROM dba_objectsWHERE owner = ‘US01′AND object_name = ‘T1′;
OBJECT_ID
———
52084To remaster the object to the current instance use:
ORADEBUG LKDEBUG -m pkey 52084
All blocks will now be mastered by the current instance.
To redistribute the masters back to the original instances use:ORADEBUG LKDEBUG -m dpkey 52084
Blocks will be mastered back to all instances again.
The V$GCSPFMASTER_INFO dynamic performance view shows which objects have been remastered. In Oracle 11.1 and above, Dynamic Remastering statistics are reported in V$DYNAMIC_REMASTER_STATS.
How to find master node of a block
The current master for any block is reported by X$KJBR.KJBRMASTER. And block database block addresss (DBA) is reported by X$KJBR.KJBRNAME.
Names have the format
[<block_number>][<file_number>].[BL]
For example
[0×1234][0×20000].[BL]
You can use below function and SQL to parse and query block master.
CREATE OR REPLACE FUNCTION get_file_number (p_resource_name VARCHAR2)
RETURN INTEGER IS
pos1 INTEGER := INSTR (p_resource_name,’x',1,2);
pos2 INTEGER := INSTR (p_resource_name,’]',1,2);
s VARCHAR2(30) := SUBSTR (p_resource_name,pos1+1,pos2-pos1-1);
BEGIN
RETURN TO_NUMBER (s,’XXXXXXXX’) / 65536;
END;
/CREATE OR REPLACE FUNCTION get_block_number (p_resource_name VARCHAR2)
RETURN INTEGER IS
pos1 INTEGER := INSTR (p_resource_name,’x',1,1);
pos2 INTEGER := INSTR (p_resource_name,’]',1,1);
s VARCHAR2(30) := SUBSTR (p_resource_name,pos1+1,pos2-pos1-1);
BEGIN
RETURN TO_NUMBER (s,’XXXXXXXX’);
END;
/SELECT get_file_number (kjbrname) AS file_number, get_block_number (kjbrname) AS block_number,
kjbrmaster AS master
FROM x$kjbr;


















