How to Deal With a VMware PSOD?

A PSOD, or “Purple Screen of Death”, is a diagnostic screen with white type on a purple background. It is displayed when the VMkernel of an ESX/ESXi host experiences a critical error, becomes inoperative, and terminates any virtual machines that are running.

The purple screen displays the state of memory at the time of the crash, along with additional details that are essential for troubleshooting the cause: ESXi version and build, exception type, register dump, backtrace, server uptime, error messages, and core dump information. The core dump is a file generated by the error that contains additional diagnostic information.

In some cases, the error message on the purple screen is a “NOT_IMPLEMENTED” or an “ASSERT” message.

Why does PSOD occur?

A PSOD is a kernel panic. The ESXi kernel (VMkernel) triggers this safety measure in response to an unrecoverable event or error, when continuing to run would pose a high risk to the services and VMs. Although ESXi is not UNIX-based, its panic implementation fits the UNIX definition.

The most common causes of PSOD are:

1. Hardware defects, mainly related to RAM or CPU. They usually produce an “MCE” or “NMI” error.

  • “MCE” – Machine Check Exception, a CPU mechanism for detecting and reporting hardware problems. The codes displayed on the purple screen contain essential details for identifying the root cause.
  • “NMI” – Non-Maskable Interrupt, a hardware interrupt that the CPU cannot ignore. Because an NMI signals a critical hardware failure, the default response in ESXi 5.0 and later is to trigger a PSOD. Earlier versions logged the error and continued. As with the MCE, the purple screen caused by an NMI provides essential troubleshooting codes.

2. Software bugs

  • improper interactions between ESXi software components (e.g., KB2105711)
  • race conditions (e.g., KB2136430)
  • resource exhaustion: memory, stack, buffer (e.g., KB2034111, KB2150280)
  • infinite loop and stack overflow (e.g., KB2105522)
  • incorrect or unsupported configuration parameters (e.g., KB2012125, KB2127997)

3. Misbehaving drivers: driver bugs that try to access an incorrect index or a non-existent method (e.g., KB2146526, KB2148123).
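The categories above can be sketched as a small triage helper. This is only an illustration in Python: the keyword patterns and the `classify_psod` name are assumptions made for this example, not an official or exhaustive list, and the result is a first hint rather than a root cause.

```python
import re

# Illustrative keyword patterns (assumptions, not an official list).
CAUSE_PATTERNS = [
    ("hardware (MCE)", re.compile(r"\bMCE\b|machine check", re.IGNORECASE)),
    ("hardware (NMI)", re.compile(r"\bNMI\b|non-?maskable", re.IGNORECASE)),
]


def classify_psod(text: str) -> str:
    """Return a rough cause category for the text copied from a purple screen."""
    for label, pattern in CAUSE_PATTERNS:
        if pattern.search(text):
            return label
    # Nothing points at hardware: software bugs and drivers are the remaining
    # categories, and telling them apart needs the backtrace and the KB articles.
    return "software or driver (check the backtrace and KB articles)"


if __name__ == "__main__":
    # The sample strings are made up for the example, not real ESXi output.
    print(classify_psod("Machine Check Exception reported by the CPU"))
    print(classify_psod("NMI received"))
```

A match on MCE or NMI suggests starting with hardware diagnostics; anything else needs the backtrace to narrow down.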

Why the screen turns purple

According to official VMware documentation, the Purple Screen of Death, or “ESX/ESXi host purple diagnostic screen”, is a diagnostic information screen displayed when the VMkernel encounters a critical error on your ESXi hosts.

The error message generated by the VMkernel depends on the failure and on the ESXi version. Examples include a “Lost heartbeat” message and a “Spin count exceeded” message, which indicates a possible deadlock.

The error halts the kernel and interrupts any virtual machines running on the host. It is often caused by RAM or CPU problems, or by hardware defects such as damaged CPUs, fried memory sticks, failed system boards, and damaged internal boot cards.

The PSOD screen, white text on a purple background, displays diagnostic information such as the state of memory at the time of the crash, the exception type, and error messages such as a machine check failure.

Almost any hardware problem can trigger a purple screen of death, including problems reported by out-of-band management.

The impact of PSOD

When a panic occurs and the host crashes, all services running on it and all virtual machines it hosts are terminated. The VMs are not shut down gracefully; they are powered off abruptly.

If the host is a member of a cluster and you have configured HA, these VMs will be restarted on the other hosts in the cluster. Besides the VMs being interrupted and unavailable during the downtime, critical applications such as database servers, message queues, or backup tasks can be affected by “dirty” shutdowns.

Also, all other services the host provides will be terminated, so if your host is a member of a vSAN cluster, the PSOD will also affect vSAN.

The most problematic aspect of a PSOD is the loss of confidence in your infrastructure and the concern it creates, at least until you get to the bottom of it.

How to fix the PSOD – purple screen of death errors

In most cases, you will need the services of professional PSOD recovery providers to fix purple screen of death problems and recover the valuable contents of the file system. Here are the steps to take to resolve a purple screen:

Examine the purple screen message

One of the most important things to do when you have a PSOD is to take a screenshot. If you connect to the console remotely, taking a screenshot is easy, but if you need to go to the data center, you may need to take out your phone and photograph the screen. There is a lot of helpful information on that screen about the cause of the crash.
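If you transcribe the screenshot afterwards, a simple record helps confirm that nothing was missed. The sketch below is a Python illustration: the `PsodRecord` class and its field names are hypothetical, chosen for this example to mirror the details the purple screen shows (version and build, exception type, registers, backtrace, uptime, error messages, core dump information).

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class PsodRecord:
    """Hypothetical checklist of the details to copy from a purple screen."""
    esxi_version_and_build: Optional[str] = None
    exception_type: Optional[str] = None
    register_dump: Optional[str] = None
    backtrace: Optional[str] = None
    server_uptime: Optional[str] = None
    error_messages: Optional[str] = None
    core_dump_info: Optional[str] = None

    def missing(self) -> List[str]:
        """Return the names of the fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


if __name__ == "__main__":
    # Placeholder values stand in for text copied from the screenshot.
    record = PsodRecord(exception_type="(copied from the screen)")
    print("Still to capture:", ", ".join(record.missing()))
```

Having all of these fields at hand when you open a support case tends to shorten the first round of questions.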

Contact VMware support

If you have a support agreement, contact VMware Support before starting further investigation and troubleshooting. In parallel with your own research, they will be able to assist you with a root cause analysis (RCA).

Reboot the affected ESXi host

To recover the server, you will need to reboot it. I would advise you to keep it in maintenance mode until you complete the RCA, identify the cause, and fix it. If you cannot afford to keep it in maintenance mode, at least adjust the DRS rules so that only unimportant VMs run on it, so that the impact is minimal if another PSOD occurs.

Get the core dump

Once the server has booted, you must collect the core dump. The core dump, also called vmkernel-zdump, is a file containing information similar to, but more detailed than, what is shown on the purple diagnostic screen. It will be used for further troubleshooting.
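As a starting point, you can check where the host is configured to write core dumps. The Python sketch below wraps two `esxcli system coredump` commands; the `run_commands` helper and its `dry_run` flag are names made up for this example. It assumes a Python 3 interpreter on the machine where the commands run, and by default it only returns the command lines, so they can also be typed directly into an SSH session on the host.

```python
import subprocess
from typing import List

# Commands that report the configured core dump partition and core dump files.
# They are meant to be run on the ESXi host itself (for example over SSH).
COREDUMP_COMMANDS: List[List[str]] = [
    ["esxcli", "system", "coredump", "partition", "get"],
    ["esxcli", "system", "coredump", "file", "list"],
]


def run_commands(commands: List[List[str]], dry_run: bool = True) -> List[str]:
    """Run each command and return its output.

    With dry_run=True (the default) nothing is executed and the command
    lines themselves are returned instead.
    """
    results = []
    for argv in commands:
        if dry_run:
            results.append(" ".join(argv))
        else:
            results.append(subprocess.check_output(argv, universal_newlines=True))
    return results


if __name__ == "__main__":
    for line in run_commands(COREDUMP_COMMANDS):
        print(line)
```

If you are working with VMware Support, running `vm-support` on the host generates a support bundle that you can attach to the case.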

Conclusion

The ESXi kernel (VMkernel) triggers the purple screen as a safety measure in response to an unrecoverable error, when continuing to run would pose a high risk to the services and VMs.

When an ESXi host detects that its state has become corrupted, it displays the purple screen with a long message and an error code. The purple screen of death is most often caused by hardware (RAM or CPU) failures.