State of the Standby cluster member in a High Availability cluster constantly changes between 'Standby' and 'Down'
ClusterXL is configured in High Availability mode. Each cluster member is connected to a separate switch, and the switches are connected to each other. Hosts are connected to each of the switches.
When a cable is disconnected, the Standby member cannot send/receive CCP packets on the given interface because the physical link is disrupted.
By design, when a cluster member is not able to send/receive CCP packets on a given interface, it will initiate a probing process.
Probing is a best-effort algorithm that lets the cluster member test its local interface and determine whether all traffic fails to pass, or only CCP packets.
The cluster member starts sending a series of ARP Requests to the subnet of the given interface, hoping that some hosts will respond with an ARP Reply. The member then sends a series of ICMP Requests to the hosts that responded. If these ICMP Requests are answered, the cluster member concludes that its local interface must be intact. The next logical conclusion is that the CCP packets are not sent/received because of a problem with a peer member.
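The two-step probing logic described above can be sketched roughly as follows. This is only an illustration of the decision flow, not the actual ClusterXL implementation: the function name `probe_interface`, the callback parameters, and the result strings are all hypothetical, and no real packets are sent.

```python
from typing import Callable, Iterable


def probe_interface(
    subnet_hosts: Iterable[str],
    arp_replies: Callable[[str], bool],
    icmp_replies: Callable[[str], bool],
) -> str:
    """Illustrative probing decision (hypothetical names, no real traffic).

    arp_replies(host)  -> True if the host answered an ARP Request
    icmp_replies(host) -> True if the host answered an ICMP Request
    """
    # Step 1: send ARP Requests to the subnet and keep the hosts that replied.
    responders = [host for host in subnet_hosts if arp_replies(host)]

    # Step 2: send ICMP Requests only to the hosts that answered ARP.
    for host in responders:
        if icmp_replies(host):
            # At least one host answered, so the local interface passes
            # traffic; the CCP problem is then attributed to a peer member.
            return "interface_intact"

    # Nobody answered: probing cannot confirm that the interface works.
    return "inconclusive"


if __name__ == "__main__":
    hosts = ["10.0.0.10", "10.0.0.11"]  # example addresses, assumed
    print(probe_interface(hosts, lambda h: True, lambda h: h == "10.0.0.11"))
```

Passing the ARP and ICMP checks as callbacks keeps the sketch testable without network access; in a real environment they would be replaced by actual probes.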
In parallel, hard-coded timeouts define how long a member waits for CCP packets to be sent/received before it changes a state: either its own state, or the state of its local interface. If these timeouts expire and CCP packets are still not sent/received correctly, the member has to change its state, depending on the outcome of multiple internal tests.
By default, the Standby cluster member starts the probing process only after its state changes to Down. Therefore, while its state is still Standby, the member does not run the probing process and does not check its own local interface. When the timeouts expire, the state of the interface is declared as Down, and then the state of the whole member is declared as Down. Only after changing its state to Down does the Standby member start the probing process, receive the necessary responses from directly connected hosts, and conclude that the local interface is intact. As a result, it changes its state back to Standby.
If the CCP packets are still not sent/received correctly on the Standby member, the cycle repeats:
- The CCP packets are not sent/received correctly
- The hard-coded timeouts start to expire
- The probing process is not started as long as the state is not Down
- Eventually the state of a given interface is declared as Down
- Eventually the state of the whole member is declared as Down
- The probing process starts
- The member receives the necessary responses from the hosts on the subnet
- Eventually the state of a given interface is declared as Up
- Eventually the state of the whole member is declared as Standby
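The cycle above can be modeled as a small state machine. The sketch below is a simplified simulation under stated assumptions: the function `simulate_standby` and its parameters are hypothetical, time is counted in abstract ticks, and the tick values are arbitrary because the real timeouts are hard-coded and not given here. CCP is assumed to stay broken for the whole run.

```python
from typing import List


def simulate_standby(
    ticks: int,
    timeout_ticks: int = 3,   # assumed: ticks without CCP before declaring Down
    probe_ticks: int = 2,     # assumed: ticks the probing needs to succeed
    hosts_answer: bool = True,
) -> List[str]:
    """Return the member state observed at each tick (illustrative model)."""
    state = "Standby"
    waited = 0
    history: List[str] = []

    for _ in range(ticks):
        history.append(state)
        waited += 1

        if state == "Standby" and waited >= timeout_ticks:
            # Probing does not run in Standby, so the timeouts simply expire:
            # the interface, and then the whole member, is declared Down.
            state, waited = "Down", 0
        elif state == "Down" and hosts_answer and waited >= probe_ticks:
            # Probing runs only in Down; local hosts answer, the interface is
            # declared Up, and the member returns to Standby.
            state, waited = "Standby", 0

    return history


if __name__ == "__main__":
    print(simulate_standby(10))
```

With local hosts answering the probes, the model alternates between Standby and Down for as long as CCP remains broken. With `hosts_answer=False`, the member stays Down after the first timeout, which shows that the flapping depends on the probe succeeding against hosts on the local switch.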