Last week, one of our x3650s, which serves as a primary file and print server, crashed during a backup.
This wasn’t nice, especially because I couldn’t start the machine using the RSA adapter. After going on-site (at around 23:00), I unplugged the power, reconnected it, and the machine booted. And crashed. And booted.
I booted it in safe mode and disabled all DFS replication tasks by disabling the DFS service, after which the machine was finally able to boot. But as soon as something I/O-intensive happened, it crashed. I opened a call with IBM, and IBM replaced the system board.
This worked again for about a day. Then the machine crashed again, this time with “Planar Voltage Channel Fault”. This was at around 20:00. A few hours later, IBM arrived on-site and replaced the system board again.
The ServeRAID Controller then started AutoSynchronization of the array, which made the machine unbearably slow. And then it crashed again. I disabled the Automatic Server Reboot service, and set the synchronization priority to low. I also upgraded the hard disk firmware to version 1.03.
After about a day, the synchronization completed. Performance was comparatively normal, but the machine still required about 20 minutes to boot, which was unacceptable. Looking at the event log, I saw some entries from before the waiting time and some from after it. All of them were normal; none looked like errors.
IBM came on-site twice, replacing the ServeRAID memory and the CPU and HD backplane power connector. However, neither replacement fixed the issue. I began to suspect a software issue.
Today, IBM was on-site again. They brought a replacement server. After our server was tested with one of their hard disks (and was fast, as usual), it seemed clear that this was a software issue.
I had an idea, mostly prompted by the fact that the first crash happened during a backup. I opened Device Manager, set it to show hidden devices, and found about 100 shadow copies.
I ran vssadmin list shadows, which also showed a lot of shadow copies. I removed all shadow copies on one of the drives that contained them, but some were still left. I deleted those using vssadmin delete shadows. The reboot was fast again: instead of 20 minutes, it took only around 5 (including the BIOS).
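The check above can also be scripted, which helps when there are many volumes. What follows is only a minimal sketch in Python, not what was run on this server. It assumes English-language vssadmin output in which each shadow copy has an "Original Volume: (X:)..." line. The function names and the sample listing (including its volume GUIDs) are made up for illustration.

```python
import re
import subprocess
from collections import Counter

# Assumption: English vssadmin output, where every shadow copy is listed
# with a line like:  Original Volume: (D:)\\?\Volume{...}\
ORIGINAL_VOLUME = re.compile(
    r"^\s*Original Volume:\s*\((?P<drive>[^)]+)\)", re.MULTILINE
)


def count_shadows_per_volume(listing: str) -> Counter:
    """Count shadow copies per original volume in a vssadmin listing."""
    return Counter(m.group("drive") for m in ORIGINAL_VOLUME.finditer(listing))


def read_vssadmin_listing() -> str:
    """Run 'vssadmin list shadows' (Windows, elevated prompt) and return its text."""
    result = subprocess.run(
        ["vssadmin", "list", "shadows"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical, shortened listing used only to demonstrate the parser.
SAMPLE_LISTING = r"""
Contents of shadow copy set ID: {00000000-0000-0000-0000-000000000001}
   Contained 1 shadow copies at creation time: 4/20/2009 7:00:00 AM
      Shadow Copy ID: {00000000-0000-0000-0000-00000000000a}
         Original Volume: (D:)\\?\Volume{00000000-0000-0000-0000-0000000000d0}\
Contents of shadow copy set ID: {00000000-0000-0000-0000-000000000002}
   Contained 1 shadow copies at creation time: 4/20/2009 12:00:00 PM
      Shadow Copy ID: {00000000-0000-0000-0000-00000000000b}
         Original Volume: (D:)\\?\Volume{00000000-0000-0000-0000-0000000000d0}\
Contents of shadow copy set ID: {00000000-0000-0000-0000-000000000003}
   Contained 1 shadow copies at creation time: 4/21/2009 7:00:00 AM
      Shadow Copy ID: {00000000-0000-0000-0000-00000000000c}
         Original Volume: (C:)\\?\Volume{00000000-0000-0000-0000-0000000000c0}\
"""

counts = count_shadows_per_volume(SAMPLE_LISTING)
for drive, number in counts.most_common():
    print(f"{drive} has {number} shadow copies")
```

On a real server, pass the result of read_vssadmin_listing() to count_shadows_per_volume() instead of the sample. A volume with an unexpectedly high count is a candidate for cleanup, for example with vssadmin delete shadows /for=D: /all. That is one form of the command; which form was used here is not recorded.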
So, if you have Windows Server 2008 hanging at the Black Screen of Death, err, Waiting, it’s probably a good idea to look at your shadow copies. They may be the culprit.
Also, big thanks to the techs from IBM for taking this issue seriously.
Friday, April 24, 2009