The issue is related to a firmware update of the MLX5 driver on the Azure host that enables Mellanox QoS support.
Check Point has identified a Mellanox driver issue that occurs when connections originate on or are destined for the Check Point Gateway in Azure.
If the crash occurs during a policy installation, a race condition may cause the Check Point Gateway to apply the default policy upon reboot.
Affected Environment
Note: See sk132192 - CloudGuard for Azure Latest Updates for more information about released image versions.
Unaffected Environment
- Management Servers and Multi-Domain Management Servers.
- Security Gateway solutions deployed on cloud platforms other than Azure.
- Security Gateways running versions R77.30 or R80.10.
- R80.30-based Security Gateways using image versions R80.30-273.801 and higher.
  - Note: Do not install an R80.30 Jumbo HF lower than Take 235.
- R80.40-based Security Gateways using image versions R80.40-294.801 and higher.
  - Note: Do not install an R80.40 Jumbo HF lower than Take 114.
- R81-based Security Gateways using image versions R81-392.807 and higher.
  - Note: Do not install an R81 Jumbo HF lower than Take 25.
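The minimum Jumbo HF Takes in the notes above can be encoded as a small lookup for use in a pre-installation check. This is only a sketch: the function names `min_jumbo_take` and `take_is_safe` are hypothetical, and the Take you plan to install must be supplied by you as an argument.

```shell
# Sketch: prints the minimum Jumbo HF Take listed in the notes above
# for a given version. Returns non-zero for versions not listed there.
min_jumbo_take() {
    case "$1" in
        R80.30) echo 235 ;;
        R80.40) echo 114 ;;
        R81)    echo 25 ;;
        *)      return 1 ;;
    esac
}

# Succeeds if the candidate Take ($2) is not lower than the minimum
# for the version ($1). Returns 2 if the version is not listed.
take_is_safe() {
    min=$(min_jumbo_take "$1") || return 2
    [ "$2" -ge "$min" ]
}
```

For example, `take_is_safe R80.40 114` succeeds, while `take_is_safe R81 24` fails.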
Solution
Install the applicable Hotfix on the Security Gateway:
Warning: If you configured a workaround cronjob as described below, you must remove it before you install the hotfix.
Important note: If you installed the hotfix on top of a specific Jumbo Hotfix Take, do not later install a Jumbo Hotfix Take lower than the one listed as including the fix.
Hotfix installation instructions:
Refer to sk168597 - How to install a Hotfix.
In a VMSS environment, the recommended method is to scale out to the latest image that includes the above hotfix.
- R80.30-based Security Gateways using image versions R80.30-273.801 and higher.
- R80.40-based Security Gateways using image versions R80.40-294.801 and higher.
- R81-based Security Gateways using image versions R81-392.807 and higher.
Note: If the workaround was added prior to the hotfix installation, it is recommended to remove the workaround.
To verify that the hotfix is installed, run this command:
ethtool -i <Name of Mellanox Interface>
The driver and version lines in the output should show:
driver: mlx5_core
version: 4.6-1.0.1hf1 (14 Jan 21)
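This check can be scripted. The following is a minimal sketch, assuming the `ethtool -i` output contains `driver:` and `version:` lines in the form shown above. The function name `check_mlx5_hotfix` is hypothetical, and the interface name in the usage comment is a placeholder.

```shell
# Sketch: reads `ethtool -i` output on stdin and reports whether the
# hotfixed mlx5_core driver (version 4.6-1.0.1hf1) is loaded.
check_mlx5_hotfix() {
    driver=""
    version=""
    while IFS=':' read -r key value || [ -n "$key" ]; do
        # Strip leading spaces left after the "key:" separator
        while [ "${value# }" != "$value" ]; do
            value="${value# }"
        done
        case "$key" in
            driver)  driver="$value" ;;
            version) version="$value" ;;
        esac
    done
    case "$driver/$version" in
        mlx5_core/4.6-1.0.1hf1*)
            echo "hotfix driver present"
            return 0
            ;;
    esac
    echo "hotfix driver not detected (driver=$driver, version=$version)"
    return 1
}

# Usage (replace the placeholder with the Mellanox interface name):
#   ethtool -i <Name of Mellanox Interface> | check_mlx5_hotfix
```

The function only inspects text, so it can be tested against saved output without access to the Security Gateway.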
Recovery of a Security Gateway that rebooted with default policy:
To regain access to the Security Gateway:
- Console into the Check Point VM using the Azure portal.
- Unload the default policy that is applied to the Security Gateways:
[Expert@HostName:0]# fw unloadlocal
- Install the policy on the Azure Gateways.
If policy installation continues to fail, follow these instructions:
- Clear the state directory on both the Security Gateway and the Management Server. See sk33328 - How to clear $FWDIR/state/ directory to resolve policy corruption issues.
Note: If the Security Gateway and/or Management Server is running R80.40 version, you must create the following directories:
- Clearing the state directory on the Security Gateway removes all defined Dynamic Objects.
In VMSS and HA environments, the LocalGatewayExternal and LocalGatewayInternal objects must be added again.
If these Dynamic Objects are missing, traffic may fail in your environment.
- Check whether the Dynamic Objects have been removed:
[Expert@HostName:0]# dynamic_objects -l
- Add the required Dynamic Objects again on the Security Gateway, if they were removed:
[Expert@HostName:0]# dynamic_objects -n LocalGatewayExternal -r <IP Address of eth0> <IP Address of eth0> -a
[Expert@HostName:0]# dynamic_objects -n LocalGatewayInternal -r <IP Address of eth1> <IP Address of eth1> -a
Example:
[Expert@HostName:0]# dynamic_objects -n LocalGatewayExternal -r 10.0.1.12 10.0.1.12 -a
[Expert@HostName:0]# dynamic_objects -n LocalGatewayInternal -r 10.0.2.12 10.0.2.12 -a
- Verify that the Dynamic Objects were successfully added:
[Expert@HostName:0]# dynamic_objects -l
Example:
[Expert@HostName:0]# dynamic_objects -l
object name : LocalGatewayExternal
range 0 : 10.0.1.12 10.0.1.12
object name : LocalGatewayInternal
range 0 : 10.0.2.12 10.0.2.12
Operation completed successfully
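The check-then-add steps above can be combined in a script. The following is a minimal sketch, assuming `dynamic_objects -l` prints each object on a line of the exact form `object name : <name>`, as in the example output. The function name `has_dynamic_object` is hypothetical.

```shell
# Sketch: reads `dynamic_objects -l` output on stdin and succeeds if
# the named object ($1) is listed. Assumes the exact
# "object name : <name>" line format shown in the example output.
has_dynamic_object() {
    name="$1"
    found=1
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            "object name : $name") found=0 ;;
        esac
    done
    return "$found"
}

# Usage on the Security Gateway (Expert mode), re-adding an object
# only when it is missing:
#   if ! dynamic_objects -l | has_dynamic_object LocalGatewayExternal; then
#       dynamic_objects -n LocalGatewayExternal -r <IP Address of eth0> <IP Address of eth0> -a
#   fi
```

Matching the whole line avoids false positives from object names that merely contain the name being searched for.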
Note: If policy installation fails with an error about the local.magic file, follow the applicable scenario from sk33893 - 'Installation failed. Reason: Load on Module failed - failed to load security policy' error during policy installation.
This solution has been verified for the specific scenario, described by the combination of Product, Version and Symptoms. It may not work in other scenarios.