Security Gateway running SecurePlatform / Gaia OS freezes, crashes, or reboots randomly, and core dump files are not created.
Solution
Table of Contents:
Background
Procedure
Related solutions
Background
For various reasons, the Security Gateway might freeze or crash. In such cases, no information can be written to the system logs.
To understand what caused the failure, we have to extract the necessary information (the memory stack) from the operating system. The Gaia / SecurePlatform operating system can be configured to dump core files. However, the failure might be so severe that the core files cannot be dumped. In these cases, we need to "seize" the moment when Gaia / SecurePlatform freezes or crashes and extract the memory stack directly from the kernel.
The following procedure explains how to prepare the problematic machine for the next occurrence of a freeze or crash.
Procedure
Note: This procedure is a simple change to the Linux kernel configuration and has no impact on performance.
We need to configure the Gaia / SecurePlatform OS kernel on the problematic machine so that, when the problem occurs, it produces output on the console and accepts input from the keyboard.
The procedure involves the following basic steps:
Configuring the Gaia / SecurePlatform OS kernel to produce output on the console and accept input from the keyboard during the freeze or crash
Connecting another machine to the problematic Security Gateway using a Serial (RS232) cable, or connecting to the LOM card on the appliance (and opening the Console session)
Rebooting the problematic Security Gateway into "Debug Mode"
Testing that we can communicate with the kernel directly
Waiting for the next occurrence of the freeze or crash
Extracting the necessary information from the kernel (memory stack)
title SecurePlatform NG with Application Intelligence (R55) [Debugging]
root (hd0,0)
kernel /vmlinuz-smp_kdb ro root=/dev/sda3 console=CURRENT 3
initrd /initrd-smp_kdb
Note: In some rare cases, the USB drivers have been found to conflict with KDB.
If KDB generates an 'Oops' message when attempting to switch into KDB mode, add the 'nousb' parameter at the end of the kernel line in the /boot/grub/grub.conf file (e.g., '...console=ttyS0 kdb=on 3 nousb').
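The edit described in the note above could be sketched as follows. This is only an illustration: the function name add_kernel_param is hypothetical, and it assumes that the debug entry can be recognized by the string kdb on its kernel line. It prints the modified file to standard output so that the result can be reviewed before /boot/grub/grub.conf is changed.

```shell
#!/bin/bash
# Hypothetical helper (not part of Gaia / SecurePlatform):
# print FILE with PARAM appended to each "kernel" line that mentions kdb,
# unless that line already carries PARAM as a separate word.
add_kernel_param() {
    local file="$1" param="$2"
    awk -v p="$param" '
        $1 == "kernel" && $0 ~ /kdb/ {
            found = 0
            for (i = 2; i <= NF; i++) if ($i == p) found = 1
            if (!found) $0 = $0 " " p   # append at the end of the line
        }
        { print }
    ' "$file"
}

# Example (review the output before replacing the real file):
#   add_kernel_param /boot/grub/grub.conf nousb
```

Because the function only prints, running it twice or on a file that already contains the parameter leaves the kernel line unchanged.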
Using a serial cable, connect a separate machine to port COM# (see Step 2 above). Alternatively, connect to the LOM card on the appliance and open the Console session.
From this separate machine, connect to the Security Gateway using a terminal program such as PuTTY / SecureCRT / TeraTerm Web / HyperTerminal. Use the RS232 default connection parameters: 9600 bps, 8 data bits, no parity, 1 stop bit, no flow control.
You should be able to communicate with the crashing machine through the COM port. Make sure you have a serial connection with the machine - you should see the usual prompt [Expert@HostName] and you should be able to run commands.
Reboot the crashing machine. When you see the prompt "Press any key to see the boot menu..." - press any key and start the machine in Debugging Mode:
Wait for the login prompt (login:) - log in to the system.
Note: On a virtual machine, it is recommended to add a serial device (refer to sk164893). As a quick option for one-time debugging, you can remove ",ttyS0" instead.
Try switching to the kernel prompt by one of the following methods:
For Gaia with kernel 3.10:
To get into KDB, you need to send the SysRq-G command.
Keep in mind that entering KDB stops all processes. To resume normal operation, type go at the kdb> prompt.
There are several ways to send SysRq-G, depending on how you are connected to the device.
Run the following commands:
[Expert@gw:0]# echo "1" > /proc/sys/kernel/sysrq
[Expert@gw:0]# echo g > /proc/sysrq-trigger
Output:
SysRq : DEBUG
Entering kdb (current=0xffff880851436ba0, pid 10683) on processor 0 due to Keyboard Entry
Using minicom terminal emulator:
Press CTRL-A f g
Using telnet via PortServer/Digi:
Press CTRL-]
Type: send break and press "g"
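For the command-line method on kernel 3.10 shown above, the two echo commands can be wrapped in a small function that also records the previous SysRq setting. This is a sketch only: the function name enter_kdb and its optional path arguments are hypothetical, and the defaults are the standard Linux procfs paths used in the commands above.

```shell
#!/bin/bash
# Hypothetical wrapper around the two commands shown above.
# On a real gateway, this stops all processes until "go" is typed at the kdb prompt.
enter_kdb() {
    local ctl="${1:-/proc/sys/kernel/sysrq}"
    local trigger="${2:-/proc/sysrq-trigger}"
    local previous
    previous=$(cat "$ctl") || return 1
    echo "Previous SysRq setting: $previous"
    echo 1 > "$ctl" || return 1        # enable all SysRq functions
    echo g > "$trigger" || return 1    # SysRq-G: request the kernel debugger
}

# Example (as root, from the serial console session):
#   enter_kdb
```

Printing the previous value makes it possible to restore the original SysRq setting after the debugging session, if that is desired.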
For Gaia with kernel 2.6.18 press one of the following:
pressing CTRL+A (you will see ^A) and then pressing Enter
pressing CTRL+AA (you will see ^A^A) and then pressing Enter
pressing CTRL+C (you will see ^C) and then pressing Enter
pressing Esc+K+D+B and then pressing Enter
using the Send a Break Signal option in the Terminal program
pressing CTRL+q on the AZERTY keyboard
Important Note: The keyboard layout must be English; otherwise, the machine will freeze.
The prompt should change:
from the usual Bash prompt:
[Expert@HostName]#
to the kernel prompt:
kdb> Entering kdb (current=0xXXXXXXXX, pid 0) due to Keyboard Entry
Type ? or help to see all available commands (in sub-menu more>, type q or Q and then press Enter to quit).
From the kernel prompt, run:
kdb> bt
For each function, the following information is given:
The pointer to the stack frame (ESP).
The current address within this frame (EIP).
The return address from the previous frame converted to a function name and an offset of the address within the function. If this is the first frame, the address is of the current location (EIP Register).
The function arguments.
You should get a Stack output on the screen - multiple lines:
Return to the normal shell by running go at the kernel prompt. You should exit the on-line kernel debugger and return to the regular prompt [Expert@HostName]#
Note: Occasionally, the 'go' command might not work, and the error 'Catastrophic error detected' is displayed. In that case, instead of returning to the regular prompt, the machine might get stuck at the KDB prompt, or even crash. This kernel behavior can have many causes. If the machine is stuck at the KDB prompt, reboot it manually. Otherwise, let the machine crash and reboot itself.
Example:
[0]kdb> go
Catastrophic error detected
kdb_continue_catastrophic=0, type go a second time if you really want to continue
[0]kdb> go
Catastrophic error detected
kdb_continue_catastrophic=0, attempting to continue
Kernel panic - not syncing
Make sure the serial connection is still working and that you can run commands.
Leave this serial machine working and connected. Wait until the freeze/crash occurs again.
When the freeze/crash occurs, use the machine that is still connected over the serial console and repeat Step 8-C to switch to the kernel debug prompt kdb>
See what processes are running and copy their complete list (to match the PID and the stacks):
kdb> ps
You should get an output on the screen - multiple lines that look like these:
Task Addr    Pid    Parent  [*]  cpu  State  Thread      Command
0x9fdcaaa0   1      0       0    0    S      0x9fdcac50  init
0x9ffe2550   1263   1       0    0    S      0x9ffe2700  syslogd
0x9fe95550   1272   1       0    0    S      0x9fe95700  klogd
0x9fc0b000   1518   1       0    0    S      0x9fc0b1b0  sshd
0x9fc0baa0   1563   1       0    0    S      0x9fc0bc50  crond
................................................
more>
................................................
0x97325000   1935   1922    0    0    S      0x973251b0  login
0x9fc5caa0   1936   1935    0    0    S      0x9fc5cc50  bash
Then press Enter until you see the kernel prompt again kdb> - copy the complete list of the processes.
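Once the process list has been copied into a text file, matching PIDs to stacks is easier with a compact table. The following is a minimal sketch, assuming the copied output has the eight whitespace-separated columns shown in the example above; the function name kdb_ps_table is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: reduce a saved "kdb> ps" capture to
# "PID PARENT STATE COMMAND" lines.
# Only lines that start with a hexadecimal task address are kept, so the
# header, the "more>" prompts and the dotted separator lines are skipped.
kdb_ps_table() {
    awk '$1 ~ /^0x[0-9a-fA-F]+$/ && NF >= 8 { print $2, $3, $6, $NF }' "$1"
}

# Example:
#   kdb_ps_table kdb_ps_capture.txt
```

The file name in the example is a placeholder for wherever the terminal output was saved.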
In NGX R60 and above, display the syslog buffer and copy the complete output:
kdb> dmesg
text
text
more>
Then press Enter until you see the kernel prompt again kdb> - copy the complete output.
Note: The output will be long.
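Because these outputs are long and must be copied completely, it can help to let the terminal program log the whole session to a file and to check afterwards which kdb commands were captured. The sketch below assumes such a log exists as plain text with Unix line endings and that prompts look like kdb> or [0]kdb> as in this article; the function name kdb_log_summary is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: for each command typed at a kdb prompt in a saved
# session log, print the number of lines that followed it (up to the next
# prompt), a tab, and the command itself.
kdb_log_summary() {
    awk '
        /^(\[[0-9]+\])?kdb> / {
            if (cmd != "") print count "\t" cmd
            cmd = $0
            sub(/^(\[[0-9]+\])?kdb> /, "", cmd)   # strip the prompt
            count = 0
            next
        }
        cmd != "" { count++ }
        END { if (cmd != "") print count "\t" cmd }
    ' "$1"
}

# Example:
#   kdb_log_summary kdb_session.log
```

A command followed by very few lines may indicate that its output was cut off before the kdb> prompt returned.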
On Gaia / SecurePlatform 2.6, show the summary:
kdb> summary
You should get an output on the screen - multiple lines that look like these:
Important for SecurePlatform 2.6: In SecurePlatform 2.6, the kernel panic stack (the output of the bt command) can be missing some information. Therefore, run an additional command ("Display Memory Symbolically") to collect the missing information:
kdb> mds %esp
You should get a Stack output on the screen - multiple lines that look like these:
Return to the normal shell. In the kernel prompt, run:
kdb> go
You should exit the on-line kernel debugger and enter a regular prompt [Expert@HostName]#
Note: Occasionally, the 'go' command might not work, and the error 'Catastrophic error detected' is displayed. In that case, instead of returning to the regular prompt, the machine might get stuck at the KDB prompt, or even crash. This kernel behavior can have many causes. If the machine is stuck at the KDB prompt, reboot it manually. Otherwise, let the machine crash and reboot itself.
Example:
[0]kdb> go
Catastrophic error detected
kdb_continue_catastrophic=0, type go a second time if you really want to continue
[0]kdb> go
Catastrophic error detected
kdb_continue_catastrophic=0, attempting to continue
Kernel panic - not syncing
If the keyboard does not respond, then press CTRL+A or CTRL+AA or CTRL+C, or send a Break Signal from the Terminal program, and then try the "bt" command again.
In NGX R60 and above, disable the on-line kernel debugger: