Failover occurs in the cluster during Security Policy installation
Symptoms
  • Failover occurs in the cluster during Security Policy installation.
Cause

Possible root causes are:

  1. A problem was reported on at least one of the Critical Devices (Pnotes).

  2. In "Switch to higher priority Gateway" cluster configuration - a member with Lower Priority was in Active state prior to policy installation.

  3. In "Maintain current active Gateway" cluster configuration - the currently Standby member installs the policy faster, than the currently Active member, therefore it is the first member to load the new configuration, and as a result it is the first member to check if there are any Active members with new configuration, so it assumes the Active state.

Solution

(1) When a problem was reported on at least one of the Critical Devices (Pnotes), identify the failed device and troubleshoot the reported problem.
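
For example, the list of Critical Devices and their current states can be inspected on each cluster member (the exact output format depends on the version):

    [Expert@HostName]# cphaprob -l list

A Critical Device that was reported in "problem" state around the time of the policy installation points at this root cause.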


(2) Failover will occur by design in "Switch to higher priority Gateway" cluster configuration when a member with Lower Priority was in Active state prior to policy installation.

During a policy installation, a member with Higher Priority (according to SmartDashboard - cluster object properties - 'Cluster Members' tab) should be the Active member.
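
To verify which member currently holds the Active state, check the ClusterXL state on each member:

    [Expert@HostName]# cphaprob stat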

If such failover is unwanted, either change the configuration of the cluster object in SmartDashboard from "Switch to higher priority Gateway" to "Maintain current active Gateway", or enable the "freeze" mechanism on each cluster member (see the sketch below).
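
As a sketch only: on many versions, the "freeze" mechanism is controlled by a cluster kernel parameter; the parameter name used below (fwha_freeze_state_machine_timeout) and the timeout value are assumptions that should be verified against the ClusterXL documentation for your version:

    [Expert@HostName]# fw ctl get int fwha_freeze_state_machine_timeout
    [Expert@HostName]# fw ctl set int fwha_freeze_state_machine_timeout 30

A value set with 'fw ctl set int' does not survive reboot; to make it permanent, add the parameter (e.g., fwha_freeze_state_machine_timeout=30) to the $FWDIR/boot/modules/fwkern.conf file on each cluster member.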


(3) By design, in "Maintain current active Gateway" cluster configuration, the current Standby member installs the policy faster than the current Active member.

The first member that is able/allowed to become an Active member will change its state to Active.

If such failover is unwanted, enable the "freeze" mechanism on each cluster member (see the sketch in section (2) above).

Important Note:

Policy installation might cause high CPU load on the cluster members. Under high CPU load, the CCP packets might not be sent/received correctly (in time) by the cluster members. As a result, failover still occurs during the policy installation - however, due to a different root cause.

In order to confirm this, run the following debug on the cluster members:

  1. Prepare:

    [Expert@HostName]# fw ctl debug 0
    [Expert@HostName]# fw ctl debug -buf 32000
    [Expert@HostName]# fw ctl debug -m cluster + conf pnote stat if

  2. Verify:

    [Expert@HostName]# fw ctl debug -m cluster

  3. Start:

    [Expert@HostName]# fw ctl kdebug -T -f > /var/log/$(uname -n)_cluster_debug.txt &
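
    Note: in this syntax, '-T' adds a time stamp to each debug line and '-f' keeps reading the kernel debug buffer until the debug is stopped; the time stamps are needed later (step 10) to correlate the debug with the CPU data.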

  4. In 2nd shell, collect the data about the CPU load.

    The following shell script is suggested
    (it runs the 'vmstat -n 1' command in the background; in addition,
    it runs the 'ps auxwwwwf' and 'top -b -n 1' commands in a loop):

    Important Note: Before starting the script, run the 'top' command once - press the digit 1 to display all CPU cores - press Shift+W to save this configuration - exit from 'top'.

    #!/bin/sh
    
    # GLOBAL VARIABLES
    # how many seconds between the probings
    SLEEP_TIME=1
    # time stamp
    DATE="/bin/date +%d-%b-%Y_%Hh-%Mm-%Ss"
    # output directory with all output files
    OUTPUTDIR="/var/log/$(uname -n)_CPU_data"
    
    # INTRODUCTION
    echo "---------------------------------------------"
    echo "The script will probe the system continuously - every _${SLEEP_TIME}_ second(s)."
    echo "To stop the script, press CTRL+C"
    echo " "
    echo "Do NOT forget to kill the VMSTAT that runs in the background."
    echo "---------------------------------------------"
    echo " "
    sleep 3
    
    # PART 1: check and create the output folder
    echo "PART 1: Checking for output folder..."
    echo " "
    if [ ! -d ${OUTPUTDIR} ];
       then
          mkdir ${OUTPUTDIR}
    else
       echo "- First move the files from ${OUTPUTDIR} folder and delete that folder."
       echo "- Exiting..."
       echo " "
       exit 1
    fi
    
    # PART 2: collect VMSTAT in the background
    echo "PART 2: Starting to collect VMSTAT in the background..."
    echo " "
    vmstat -n 1 1>> ${OUTPUTDIR}/vmstat.txt 2>> ${OUTPUTDIR}/vmstat.txt &
    sleep 2
    PID_OF_VMSTAT=$(ps auxw | grep 'vmstat' | grep -v 'grep' | awk '{print $2}')
    if [ -n "$PID_OF_VMSTAT" ];
       then
          if [ "$PID_OF_VMSTAT" -gt 0 ];
             then
                echo "- VMSTAT was started"
                echo "- PID of VMSTAT is : $PID_OF_VMSTAT"
                echo " "
                echo "- Do NOT forget to kill the VMSTAT after stopping this script."
                echo " "
          fi
    else
          echo "- Could NOT detect the PID of VMSTAT"
          echo "- VMSTAT was NOT started for some reason"
          echo " "
    fi
    sleep 3
    
    # PART 3: collect PS and TOP in the loop
    echo "PART 3: Starting to collect PS and TOP in the background..."
    echo " "
    
     # define a local auxiliary variable - the number of probing rounds, initially set to 0
    ROUND=0
    
    while true
       do
          # increase the ROUND counter
          ROUND=$(expr $ROUND + 1)
    
          echo "- Probing the system. . . To stop the script, press CTRL+C (and kill the VMSTAT in the background)" 
    
          echo "ps auxwwwwf (time stamp $($DATE)) (round "$ROUND") :" >> ${OUTPUTDIR}/ps.txt
          echo "------------------------------------------------------" >> ${OUTPUTDIR}/ps.txt
          ps auxwwwwf 1>> ${OUTPUTDIR}/ps.txt 2>> ${OUTPUTDIR}/ps.txt
          echo " " >> ${OUTPUTDIR}/ps.txt
    
          echo "top -n 1 (time stamp $($DATE)) (round "$ROUND") :" >> ${OUTPUTDIR}/top.txt
          echo "------------------------------------------------------" >> ${OUTPUTDIR}/top.txt
          top -n 1 1>> ${OUTPUTDIR}/top.txt 2>> ${OUTPUTDIR}/top.txt
          echo " " >> ${OUTPUTDIR}/top.txt
    
          sleep "$SLEEP_TIME"
       done
    
    exit 0
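
    To use the script, save it on each cluster member and make it executable - for example (the file name /var/log/cpu_probe.sh is arbitrary):

    [Expert@HostName]# chmod +x /var/log/cpu_probe.sh
    [Expert@HostName]# /var/log/cpu_probe.sh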
    


  5. Install the policy in SmartDashboard.

  6. Wait for policy installation to complete and for fail-over to occur.

  7. Stop cluster debug:

    [Expert@HostName]# fw ctl debug 0

  8. Stop collecting data about CPU load:

    1. Press CTRL+C to stop the shell script that runs VMSTAT, PS and TOP.

    2. Kill VMSTAT that runs in the background:

      [Expert@HostName]# PID_OF_VMSTAT=$(ps auxw | grep 'vmstat' | grep -v 'grep' | awk '{print $2}')
      [Expert@HostName]# echo $PID_OF_VMSTAT
      [Expert@HostName]# kill -KILL $PID_OF_VMSTAT
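
      Alternatively, if the 'pkill' utility is available on the system, the background VMSTAT can be stopped in a single step:

      [Expert@HostName]# pkill -KILL -f 'vmstat -n 1'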


  9. Analyze cluster debug:

    Search for these lines:

    • is dying (since
    • is dead (since
    • probe_local_net: About to ICMP probe
    • fwha_next_local_state: new state changed
    • fwha_update_local_state: local (state
    • fwha_update_state: ID
    • fwha_set_new_local_state: entering
    • CPHA : changing state to
    • fwha_confirm_state: ID
    • CPHA:


    Examples:
    FW-1: check_other_machine_activity: ID 0 (state ACTIVE) is dying (since 943194.5 now 943195.2);
    FW-1: check_other_machine_activity: ID 0 (state ACTIVE) is dead (since 943194.5 now 943195.5);
    FW-1: probe_local_net: About to ICMP probe 0x604a8c0 on ifn 5;
    FW-1: fwha_next_local_state: new state changed (ACTIVE -> READY) (prev state: STANDBY);
    FW-1: fwha_update_local_state: local (state STANDBY -> READY) (time 943195.5);
    FW-1: fwha_set_new_local_state: entering with current state=STANDBY and new state=READY;
    CPHA : changing state to READY;
    FW-1: fwha_confirm_state: ID 0 confirmed my state as STANDBY (first confirm);
    FW-1: fwha_set_conf: setting HA configuration
    CPHA: Sending Policy ID change request
    FW-1: fwha_update_state: ID 0 (state FAILURE -> ACTIVE) (time 943206.3);
    FW-1: Starting ClusterXL
    


  10. Correlate the cluster debug (the lines with 'is dying', 'is dead' and 'probe_local_net: About to ICMP probe') to the outputs of 'vmstat', 'ps' and 'top' commands.

    You should be able to see when the CCP packets were not received/sent correctly (in time), on which interface(s) the issue occurred, and whether the CPU load was high at that time (caused by either Kernel Space or User Space).
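
    As a starting point for this correlation, the relevant events and their time stamps can be extracted from the debug file with a simple filter - for example:

    [Expert@HostName]# grep -E 'is dying|is dead|About to ICMP probe' /var/log/$(uname -n)_cluster_debug.txt

    The time stamps of the extracted lines can then be matched against the time stamps recorded in the ps.txt and top.txt files, and against the per-second samples in vmstat.txt.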
