Failover occurs in the cluster during Security Policy installation
Symptoms
  • Failover occurs in the cluster during Security Policy installation.
Cause

Possible root causes are:

  1. A problem was reported on at least one of the Critical Devices (Pnotes).

  2. In "Switch to higher priority Gateway" cluster configuration - a member with Lower Priority was in Active state prior to policy installation.

  3. In "Maintain current active Gateway" cluster configuration - the currently Standby member installs the policy faster, than the currently Active member, therefore it is the first member to load the new configuration, and as a result it is the first member to check if there are any Active members with new configuration, so it assumes the Active state.

Solution

(1) When a problem was reported on at least one of the Critical Devices (Pnotes), identify the failed device and troubleshoot the reported problem.
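
For example, the list of Critical Devices and their current states can be inspected on each cluster member (the exact output format depends on the version):

    [Expert@HostName]# cphaprob -l list

A Critical Device that was reported in "problem" state around the time of the policy installation points at this root cause.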


(2) Failover will occur by design in "Switch to higher priority Gateway" cluster configuration when a member with Lower Priority was in Active state prior to policy installation.

During a policy installation, a member with Higher Priority (according to SmartDashboard - cluster object properties - 'Cluster Members' tab) should be the Active member.
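
To verify which member currently holds the Active state, check the ClusterXL state on each member:

    [Expert@HostName]# cphaprob stat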

If such failover is unwanted, either change the configuration of the cluster object in SmartDashboard from "Switch to higher priority Gateway" to "Maintain current active Gateway", or enable the "freeze" mechanism on each cluster member (see the sketch below).
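
As a sketch only: on many versions, the "freeze" mechanism is controlled by a cluster kernel parameter; the parameter name used below (fwha_freeze_state_machine_timeout) and the timeout value are assumptions that should be verified against the ClusterXL documentation for your version:

    [Expert@HostName]# fw ctl get int fwha_freeze_state_machine_timeout
    [Expert@HostName]# fw ctl set int fwha_freeze_state_machine_timeout 30

A value set with 'fw ctl set int' does not survive reboot; to make it permanent, add the parameter (e.g., fwha_freeze_state_machine_timeout=30) to the $FWDIR/boot/modules/fwkern.conf file on each cluster member.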


(3) By design, in "Maintain current active Gateway" cluster configuration, the current Standby member installs the policy faster than the current Active member.

The first member that is able/allowed to become an Active member will change its state to Active.

If such failover is unwanted, enable the "freeze" mechanism on each cluster member (see the sketch in section (2) above).

Important Note:

Policy installation might cause high CPU load on the cluster members. Under high CPU load, the CCP packets might not be sent/received correctly (in time) by the cluster members. As a result, failover still occurs during the policy installation - however, due to a different root cause.

In order to confirm this, run the following debug on the cluster members:

  1. Prepare:

    [Expert@HostName]# fw ctl debug 0
    [Expert@HostName]# fw ctl debug -buf 32000
    [Expert@HostName]# fw ctl debug -m cluster + conf pnote stat if

  2. Verify:

    [Expert@HostName]# fw ctl debug -m cluster

  3. Start:

    [Expert@HostName]# fw ctl kdebug -T -f > /var/log/$(uname -n)_cluster_debug.txt &
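
    Note: in this syntax, '-T' adds a time stamp to each debug line and '-f' keeps reading the kernel debug buffer until the debug is stopped; the time stamps are needed later (step 10) to correlate the debug with the CPU data.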

  4. In 2nd shell, collect the data about the CPU load.

    The following shell script is suggested
    (it runs the 'vmstat -n 1' command in the background; in addition,
    it runs the 'ps auxwwwwf' and 'top -b -n 1' commands in a loop):

    Important Note: Before starting the script, run the 'top' command once - press the digit 1 to display all CPU cores - press Shift+W to save this configuration - exit from 'top'.

    #!/bin/sh
    
    # GLOBAL VARIABLES
    # how many seconds between the probings
    SLEEP_TIME=1
    # time stamp
    DATE="/bin/date +%d-%b-%Y_%Hh-%Mm-%Ss"
    # output directory with all output files
    OUTPUTDIR="/var/log/$(uname -n)_CPU_data"
    
    # INTRODUCTION
    echo "---------------------------------------------"
    echo "The script will probe the system continuously - every _${SLEEP_TIME}_ second(s)."
    echo "To stop the script, press CTRL+C"
    echo " "
    echo "Do NOT forget to kill the VMSTAT that runs in the background."
    echo "---------------------------------------------"
    echo " "
    sleep 3
    
    # PART 1: check and create the output folder
    echo "PART 1: Checking for output folder..."
    echo " "
    if [ ! -d ${OUTPUTDIR} ];
       then
          mkdir ${OUTPUTDIR}
    else
       echo "- First move the files from ${OUTPUTDIR} folder and delete that folder."
       echo "- Exiting..."
       echo " "
       exit 1
    fi
    
    # PART 2: collect VMSTAT in the background
    echo "PART 2: Starting to collect VMSTAT in the background..."
    echo " "
    vmstat -n 1 1>> ${OUTPUTDIR}/vmstat.txt 2>> ${OUTPUTDIR}/vmstat.txt &
    sleep 2
    PID_OF_VMSTAT=$(ps auxw | grep 'vmstat' | grep -v 'grep' | awk '{print $2}')
    if [ -n "$PID_OF_VMSTAT" ];
       then
          if [ "$PID_OF_VMSTAT" -gt 0 ];
             then
                echo "- VMSTAT was started"
                echo "- PID of VMSTAT is : $PID_OF_VMSTAT"
                echo " "
                echo "- Do NOT forget to kill the VMSTAT after stopping this script."
                echo " "
          fi
    else
          echo "- Could NOT detect the PID of VMSTAT"
          echo "- VMSTAT was NOT started for some reason"
          echo " "
    fi
    sleep 3
    
    # PART 3: collect PS and TOP in the loop
    echo "PART 3: Starting to collect PS and TOP in the background..."
    echo " "
    
     # define a local auxiliary variable - the number of probing rounds, initially set to 0
    ROUND=0
    
    while true
       do
          # increase the ROUND counter
          ROUND=$(expr $ROUND + 1)
    
          echo "- Probing the system. . . To stop the script, press CTRL+C (and kill the VMSTAT in the background)" 
    
          echo "ps auxwwwwf (time stamp $($DATE)) (round "$ROUND") :" >> ${OUTPUTDIR}/ps.txt
          echo "------------------------------------------------------" >> ${OUTPUTDIR}/ps.txt
          ps auxwwwwf 1>> ${OUTPUTDIR}/ps.txt 2>> ${OUTPUTDIR}/ps.txt
          echo " " >> ${OUTPUTDIR}/ps.txt
    
          echo "top -n 1 (time stamp $($DATE)) (round "$ROUND") :" >> ${OUTPUTDIR}/top.txt
          echo "------------------------------------------------------" >> ${OUTPUTDIR}/top.txt
          top -n 1 1>> ${OUTPUTDIR}/top.txt 2>> ${OUTPUTDIR}/top.txt
          echo " " >> ${OUTPUTDIR}/top.txt
    
          sleep "$SLEEP_TIME"
       done
    
    exit 0
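
    To use the script, save it on each cluster member and make it executable - for example (the file name /var/log/cpu_probe.sh is arbitrary):

    [Expert@HostName]# chmod +x /var/log/cpu_probe.sh
    [Expert@HostName]# /var/log/cpu_probe.sh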
    


  5. Install the policy in SmartDashboard.

  6. Wait for policy installation to complete and for fail-over to occur.

  7. Stop cluster debug:

    [Expert@HostName]# fw ctl debug 0

  8. Stop collecting data about CPU load:

    1. Press CTRL+C to stop the shell script that runs VMSTAT, PS and TOP.

    2. Kill VMSTAT that runs in the background:

      [Expert@HostName]# PID_OF_VMSTAT=$(ps auxw | grep 'vmstat' | grep -v 'grep' | awk '{print $2}')
      [Expert@HostName]# echo $PID_OF_VMSTAT
      [Expert@HostName]# kill -KILL $PID_OF_VMSTAT
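
      Alternatively, if the 'pkill' utility is available on the system, the background VMSTAT can be stopped in a single step:

      [Expert@HostName]# pkill -KILL -f 'vmstat -n 1'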


  9. Analyze cluster debug:

    Search for these lines:

    • is dying (since
    • is dead (since
    • probe_local_net: About to ICMP probe
    • fwha_next_local_state: new state changed
    • fwha_update_local_state: local (state
    • fwha_update_state: ID
    • fwha_set_new_local_state: entering
    • CPHA : changing state to
    • fwha_confirm_state: ID
    • CPHA:


    Examples:
    FW-1: check_other_machine_activity: ID 0 (state ACTIVE) is dying (since 943194.5 now 943195.2);
    FW-1: check_other_machine_activity: ID 0 (state ACTIVE) is dead (since 943194.5 now 943195.5);
    FW-1: probe_local_net: About to ICMP probe 0x604a8c0 on ifn 5;
    FW-1: fwha_next_local_state: new state changed (ACTIVE -> READY) (prev state: STANDBY);
    FW-1: fwha_update_local_state: local (state STANDBY -> READY) (time 943195.5);
    FW-1: fwha_set_new_local_state: entering with current state=STANDBY and new state=READY;
    CPHA : changing state to READY;
    FW-1: fwha_confirm_state: ID 0 confirmed my state as STANDBY (first confirm);
    FW-1: fwha_set_conf: setting HA configuration
    CPHA: Sending Policy ID change request
    FW-1: fwha_update_state: ID 0 (state FAILURE -> ACTIVE) (time 943206.3);
    FW-1: Starting ClusterXL
    


  10. Correlate the cluster debug (the lines with 'is dying', 'is dead' and 'probe_local_net: About to ICMP probe') to the outputs of 'vmstat', 'ps' and 'top' commands.

    You should be able to see when the CCP packets were not received/sent correctly (in time), on which interface(s) the issue occurred, and whether the CPU load was high at that time (caused by either Kernel Space or User Space).
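
    As a starting point for this correlation, the relevant events and their time stamps can be extracted from the debug file with a simple filter - for example:

    [Expert@HostName]# grep -E 'is dying|is dead|About to ICMP probe' /var/log/$(uname -n)_cluster_debug.txt

    The time stamps of the extracted lines can then be matched against the time stamps recorded in the ps.txt and top.txt files, and against the per-second samples in vmstat.txt.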
