ATRG: CloudGuard for Azure - High Availability (HA) Technical Level
Solution

Table of Contents

  • Introduction
  • Important Information
  • FAQ
  • Errors and Troubleshooting
    • General
    • Tester
    • Failover
    • Traffic
  • Contact Check Point Support


Note:
To stay up to date, refer to the CloudGuard for Azure latest updates.


Introduction

The CloudGuard cloud security solution delivers advanced threat protection to private or public cloud infrastructures. It controls and manages the security in both the physical and virtual environments with one unified management solution.

A cluster is a group of Virtual Machines that work together in High Availability mode. One Cluster Member is Active, and the other Cluster Member is Standby. The cluster fails over from the Active Cluster Member to the Standby Cluster Member when necessary.

  • Cluster Members communicate with each other using unicast IP addresses.
  • For inbound, outbound, and East-West traffic, Cluster Members rely on Azure Load Balancers to represent their external and internal Virtual IP addresses. Load Balancers only forward traffic to the Active Cluster Member.
  • For VPN traffic, Cluster Members use API calls to Azure to communicate the failover from the Active Cluster Member. The Standby Cluster Member then promotes itself to Active. During cluster failover, the Standby Cluster Member associates the Active Cluster Member's private and public cluster IP addresses with its external interface.

To get additional information about the solution, please refer to:


Important Information


SK | Description | Versions | Fix
sk171553 | Default policy is loaded on Azure Gateways after a policy install or after a crash on the Gateway | R81, R80.40, R80.30, R80.20, R80.10 | Images with build higher than 800. Jumbo Hotfix: R81 Take 25, R80.40 Take 114, R80.30 Take 235, R80.20 Take 39
sk164838 | Corrupted memory and/or user-mode code dumps on Azure Gateways and OCI | R80.30 | Images with build higher than 614.
sk169417 | User space core dumps on Security Gateways that use the AES-GCM encryption algorithm | R80.40, R80.30 | Images with build higher than 800. Jumbo Hotfix: R80.40 Take 87, R80.30 Take 227
sk170660 | Important certificate update to CloudGuard Controller, Cloud Management Extension (CME), and Azure HA Security Gateways | R80.40, R80.30, R80.20, R80.10 | Images with build higher than 710. Jumbo Hotfix: R80.40 Take 89, R80.30 Take 226, R80.20 Take 188, R80.10 Take 287
sk164838 | Corrupted memory and/or user-mode code dumps on Azure Gateways and OCI for D2v2 | R80.30 | Images with build higher than 614.
sk165055 | High CPU on waagent on Azure Gateway | R80.30 | Images with build higher than 590.
sk166674 | Azure Gateway freezes / crashes during Azure server maintenance | R80.40, R80.30 | Images with build higher than 614. Jumbo Hotfix: R80.40 Take 38, R80.30 Take 195
sk173631 | "hv_utils: Shutdown request received - graceful shutdown initiated" error in message file | All versions | Open a ticket to Azure

 


Frequently Asked Questions


  • How can I know that my cluster is well configured?
    1. Make sure that the tester ($FWDIR/scripts/azure_ha_test.py) passes and that there are no errors in $FWDIR/log/azure_had.log on each member.
    2. Make sure that the cluster members run a Jumbo Hotfix (JHF) that contains the fixes for the relevant issues listed under Important Information.
    3. Make sure that the daemon responsible for communicating with Azure is running on each cluster member: run cpwd_admin getpid -name AZURE_HAD and confirm that the output is not 0.
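
The check in step 3 can be scripted. The following is a minimal sketch, not a Check Point tool: it assumes that cpwd_admin getpid prints the PID as the first integer in its output, and the helper names parse_pid and azure_had_is_running are hypothetical.

```python
import re
import subprocess


def parse_pid(output: str) -> int:
    """Return the first integer found in the command output, or 0 if none.

    Assumption: the PID is the first integer that the command prints.
    """
    match = re.search(r"\d+", output)
    return int(match.group()) if match else 0


def azure_had_is_running(run=subprocess.run) -> bool:
    """Run the command from step 3 and report whether the PID is non-zero.

    The `run` parameter exists only so the check can be exercised
    without a real Cluster Member.
    """
    result = run(
        ["cpwd_admin", "getpid", "-name", "AZURE_HAD"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_pid(result.stdout) != 0
```

Run azure_had_is_running() on each Cluster Member. A False result corresponds to an output of 0 in step 3.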

     

  • What is the expected failover time?
    Use case | Expected failover time | Comments
    Site-to-site VPN | Less than 2 minutes | Depends on the Azure API.
    Inbound inspection through the External Load Balancer | Less than 15 seconds | Depends on the Load Balancer health probe.
    Outbound inspection | Less than 2 minutes | Depends on the Load Balancer health probe and Azure API.
    East-West inspection through the Internal Load Balancer | Less than 15 seconds | Depends on the Load Balancer health probe.

     

  • Is the solution stateful?
    There are two possible scenarios for the solution:
    1. Using Azure load balancers
      While the HA solution supports stateful failover, failover is not stateful for connections that pass through an Azure load balancer. Once a connection is established through a load balancer, the load balancer keeps forwarding it to the same instance regardless of the health probe results or the instance status. As a result, an existing connection is still sent to the same instance after a failover. Once the connection times out and is re-initiated, it is established through the new Active member.

       

    2. Using the HA Virtual IP (VIP)
      The failover is stateful.
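
The load balancer behavior described in scenario 1 can be illustrated with a toy model. This is only a sketch of the reasoning in the answer above, not Azure's implementation: the class ToyLoadBalancer and the member names are hypothetical.

```python
class ToyLoadBalancer:
    """Toy model: pins each flow to the backend chosen for its first packet."""

    def __init__(self, active: str):
        self.active = active  # member that currently answers health probes
        self.flows = {}       # flow id -> backend chosen when the flow started

    def forward(self, flow_id: str) -> str:
        # Existing flows keep their original backend; only new flows are
        # sent to the current Active member.
        return self.flows.setdefault(flow_id, self.active)

    def failover(self, new_active: str) -> None:
        # Health probes now succeed on the other member only.
        self.active = new_active

    def expire(self, flow_id: str) -> None:
        # Models the connection timing out on the load balancer.
        self.flows.pop(flow_id, None)


lb = ToyLoadBalancer("member1")
lb.forward("conn-A")             # established before the failover
lb.failover("member2")
print(lb.forward("conn-A"))      # still pinned to the old member
print(lb.forward("conn-B"))      # new connection reaches the new Active
lb.expire("conn-A")
print(lb.forward("conn-A"))      # re-initiated connection reaches the new Active
```

In this model, the connection that existed before the failover stays pinned to the former Active member until it expires, which is why the failover is not stateful in scenario 1.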
  • Can I terminate my VPN connection on the load balancer?

    No. Azure load balancers do not support IPsec.
  • What exactly do the API calls for HA do?

    During failover, the only API calls are from the Cluster Member that gets promoted to Active to attach the cluster private and public IP addresses to itself.

  • Can I use a public IP other than the one deployed by the solution?

    Yes.

    On each Cluster Member, edit the configuration file $FWDIR/conf/azure-ha.json. Under "clusterNetworkInterfaces" -> "eth0", replace the public IP name with the original cluster VIP resource ID. Then run: $FWDIR/scripts/azure_ha_cli.py reconf

    Note: For more information, refer to the "Upgrading a Check Point CloudGuard IaaS High Availability Solution to a Newer Version" section of the CloudGuard Network HA admin guide.
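
Before editing the file, it can help to print the current "eth0" entry so that you know what you are replacing. The sketch below relies only on the two key names mentioned in the answer above. The structure of the value under "eth0" is not described here, so the code prints it as-is. The helper names are hypothetical.

```python
import json


def load_config(path: str) -> dict:
    """Read the HA configuration file as JSON."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def eth0_entry(config: dict):
    """Return whatever is stored under clusterNetworkInterfaces -> eth0."""
    return config["clusterNetworkInterfaces"]["eth0"]


# Example use on a Cluster Member (path taken from the answer above):
#   import os
#   config = load_config(os.path.expandvars("$FWDIR/conf/azure-ha.json"))
#   print(json.dumps(eth0_entry(config), indent=2))
```

Keep a copy of the original file before changing it, and repeat the edit on each Cluster Member.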

 

Errors and Troubleshooting

  • General
  • Tester
  • Failover
  • Traffic

 


General Errors

 

Troubleshooting Tester Errors

 

Troubleshooting Failover Issues

  • Azure HA log file shows the "rest.TimeoutException: b'curl: (28) Operation timed out after 20000 milliseconds with 0 bytes received\n" error
    Cause: The Azure API server did not respond to the Active member's API requests within 20 seconds.

    How to resolve:
    1. If the message does not appear repeatedly, there is no functional impact.
    2. If the message appears repeatedly, make sure there is no connectivity issue to Azure. Run curl_cli --verbose https://management.azure.com?api-version=1.3 --cacert $FWDIR/conf/ca-bundle-public-cloud.crt. The output should include Connected to management.azure.com (X.X.X.X) port 443. If the output instead shows Failed to connect to management.azure.com port 443: Connection timed out, restore connectivity to Azure.
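
When the check in step 2 is repeated on several members, the two outcomes quoted above can be told apart automatically. This is a minimal sketch that only matches the strings given in step 2. The function name classify_curl_output is hypothetical, and it assumes that the full verbose output of the command was captured as text. Standard curl writes its verbose trace to stderr, so capture both streams.

```python
def classify_curl_output(output: str) -> str:
    """Classify captured curl_cli --verbose output using the strings from step 2.

    Returns "connected", "failed", or "unknown" when neither string appears.
    """
    if "Connected to management.azure.com" in output:
        return "connected"
    if "Failed to connect to management.azure.com port 443" in output:
        return "failed"
    return "unknown"
```

A result of "unknown" means the output contained neither message and should be read manually.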
  • Azure HA log file shows the "HTTP status code 429 Too many requests" error
    Cause: Azure is throttling the API requests.

    How to resolve:
    Follow sk131932.
  • Azure HA log file shows the "Couldn't resolve host 'management.azure.com'" error.
    Cause: There is no outbound connectivity to Azure.

    How to resolve:
    1. Ensure that the DNS settings are correct:
      The command nslookup management.azure.com should return output similar to the following:
      Server: X.X.X.X
      Address: X.X.X.X#53
      .
      If the command does not return similar output, configure your DNS accordingly.
      It is recommended to set the Primary or Secondary DNS to the Azure default, as mentioned in sk122274.

       

    2. Resolve Azure connectivity:
      The command curl_cli --verbose https://management.azure.com?api-version=1.3 --cacert $FWDIR/conf/ca-bundle-public-cloud.crt should return output that includes: Connected to management.azure.com (X.X.X.X) port 443.
      If the command does not return similar output, restore connectivity to Azure.
    Note: There is no impact on cluster performance, as only the Active member performs API calls.
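
The DNS part of this check can also be done from Python with the standard library resolver. This is a sketch under assumptions: the function name resolves is hypothetical, and the resolver parameter exists only so the logic can be tested without network access. It is not a replacement for the nslookup step.

```python
import socket


def resolves(hostname: str, resolver=socket.getaddrinfo) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(resolver(hostname, 443)) > 0
    except OSError:
        # socket.gaierror (name resolution failure) is a subclass of OSError.
        return False


# Example use on a Cluster Member:
#   print(resolves("management.azure.com"))
```

A False result matches the "Couldn't resolve host" symptom and points to step 1 (DNS settings) rather than step 2.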

 

Troubleshooting Traffic Issues

  • Hide NAT and static NAT (to the public IP) are being applied to the Standby Gateway

    Hide NAT behind VNET:
    • Cause:
      When Hide NAT is selected on the VNET, requests from the Standby member are hidden behind the cluster IP (VIP) on eth0. The Active member owns this address, so Azure drops the packets because they do not come from the device that holds the address.
    • How to resolve:
      Disable NAT for this traffic. Add a No-NAT rule for HTTP, HTTPS, and DNS traffic that originates from each Gateway.

    Static NAT to public IP:
    • Cause:
      Azure owns the public IP and does not expect traffic sourced from the public IP to come from the firewalls.
    • How to resolve:
      Disabling this NAT rule resolves the problem.
  • The gateways don't respond to Azure health probe requests
    Cause: There are several possible reasons:
    1. The Cluster Members do not receive the load balancer health probe requests.
    2. The Active member does not handle the health probes as expected.

    How to resolve:
    1. Ensure that health probes using port 8117 are configured in the solution's load balancers. See the official Azure health probe documentation for more information.
    2. Follow the resolution in sk171584.
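
As a rough complement to these steps, you can test from another host in the VNET whether a member accepts connections on the probe port. This is a sketch under assumptions: it assumes the probe is a plain TCP probe, the function name tcp_port_open is hypothetical, and a successful connection from your own host does not prove that Azure's probes reach the member.

```python
import socket

PROBE_PORT = 8117  # health probe port named in the resolution above


def tcp_port_open(host: str, port: int = PROBE_PORT, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example use, with a placeholder member address:
#   print(tcp_port_open("X.X.X.X"))
```

If the connection is refused or times out on the Active member, continue with the resolution steps above.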
  • The Standby cluster member cannot access the Internet
    Cause: The Standby cluster member's traffic is hidden behind the VIP when it should leave through the member IP.

    How to resolve:

    Follow the resolution in sk175108.
    Note: There is no impact on cluster performance, as only the Active member performs API calls.

 

     

    Contacting Check Point Support

    If you still encounter issues after following the sections above, contact your local Check Point Support and attach the following files and command outputs to the case to speed up the process:

    • /etc/cloud-version
    • $FWDIR/conf/azure-ha.json
    • $FWDIR/log/azure_had.elg*
    • /var/log/cloud_config.log
    • $FWDIR/boot/modules/fwkern.conf
    • curl_cli --verbose https://management.azure.com --cacert $FWDIR/conf/ca-bundle-public-cloud.crt > /home/admin/azure-connectivity.txt
    • $FWDIR/scripts/azure_ha_test.py > /home/admin/ha-tester-output.txt
    • cphaprob state > /home/admin/cphaprob-stat.txt

     

    This solution has been verified for the specific scenario, described by the combination of Product, Version and Symptoms. It may not work in other scenarios.
