I recently had a fun experience. My Mageia 1 Linux system seemed to be experiencing hard lockups requiring a push of the reset button to “resolve”. By “hard”, I mean no keyboard input, no program updates showing in X and sometimes no ping response from another PC on the LAN. I had run some updates, including a new kernel update, and these lockups appeared after running the updates. Cause and effect. Yes? Well, no. It turned out I was having a hardware problem. Here is how I figured that out.
All was well with the world and my PC on Friday, 11 November 2011. Okay, maybe not with the world, but my PC was humming along just fine. Then Saturday came along and it was Update Day. Since my PC is my business system as well as my personal system, I usually try to run my updates on the weekends to avoid down-time during the week. The updates completed successfully and these packages were updated:
flash-player-plugin-188.8.131.52-1.mga1.nonfree Sat 12 Nov 2011 04:16:02 PM CST
libmsn0.3-4.1-5.1.mga1 Sat 12 Nov 2011 04:15:52 PM CST
kernel-source-latest-184.108.40.206-8.mga1 Sat 12 Nov 2011 04:15:51 PM CST
kernel-desktop-latest-220.127.116.11-8.mga1 Sat 12 Nov 2011 04:15:51 PM CST
kernel-desktop-devel-latest-18.104.22.168-8.mga1 Sat 12 Nov 2011 04:15:50 PM CST
kernel-desktop-devel-22.214.171.124-8.mga-1-1.mga1 Sat 12 Nov 2011 04:15:40 PM CST
kernel-desktop-126.96.36.199-8.mga-1-1.mga1 Sat 12 Nov 2011 04:14:47 PM CST
kernel-source-188.8.131.52-8.mga-1-1.mga1 Sat 12 Nov 2011 04:14:31 PM CST
(Output from ‘rpm -qa --last | less’)
Of course, after getting a kernel update a reboot is required to load the new kernel. So I rebooted the system and everything seemed to be working fine. Sunday evening came along and I decided to play a bit of Unreal Tournament 2004, a.k.a. UT2004, and frag some bots for relaxation. Yeah, I am a “violent guy”, NOT. I was in the middle of a Capture the Flag run when my game froze hard. What was my first thought? Yup, you guessed it, “That kernel update has messed up my box!” It is sad how we humans jump to the wrong conclusions so quickly, is it not? How many folks would be happier if we all took a step back from our assumptions and reconsidered before acting? Not only that, jumping to the wrong conclusion in my case turned into a week of unnecessary frustration and angst.
I spent several wasted hours every day looking for a solution to my “kernel problem”. You may wonder why I started hunting for kernel problems. I was using an nVidia based graphics adapter running the non-free nVidia driver supplied with my distribution. During forensics after my first hang I saw this in /var/log/messages:
kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
I started thinking the new kernel and nVidia driver Had A Problem and based on that poor assumption I forged ahead to find The Solution To The Problem. I will spare you the details of all the wrong turns and dead-ends I found during my week of agony. Let us just say, it was not fun trying to fix a problem that did not exist anywhere else but in my fevered imagination.
What happened to get me on the correct track? Yesterday, 22 November 2011, my system suddenly started hanging when not running a 3D game. I was just looking up parts to order for a new PC build for one of our Linux clients. This was the “Ah ha!” moment for me. I had not had this happen before under any Linux system except when I had a hardware problem. So, I shut down the PC, pulled my CD case out of my brief-case and loaded the Live Parted Magic CD to run hardware tests. The RAM was tested first – no errors. Then I rebooted to the PM GUI and ran GSmartControl disk tests simultaneously on my four SATA drives.
The first drive finished with no errors. While waiting for the other drives to finish, Parted Magic had a hard hang. My very next thought was, “It is the graphics card!” That was because the only other time I have had Parted Magic hang like that was when I encountered a bad graphics card on a client’s PC I was trying to diagnose. Yes, this is another bit of assumption. But this time it was correct.
However, I wanted to be sure I was following the correct path this time. So I did some more forensic investigation before marching ahead. I rebooted my PC off the boot drive into Mageia 1, logged in to the console on my Linux router and used ‘ssh’ to get in to my PC. Then I used ‘su’ to become root and ran ‘tail -F /var/log/messages’ to watch what was happening while I used the PC. After a few minutes of use the PC “froze” and these were the last lines displayed in the log:
Nov 22 21:41:33 era4 kernel: NVRM: Xid (0000:01:00): 8, Channel 00000000
Nov 22 21:41:35 era4 kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Nov 22 21:41:37 era4 kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Nov 22 21:41:42 era4 kernel: NVRM: Xid (0000:01:00): 16, Head 00000000 Count 00017c0c
Nov 22 21:42:09 era4 kernel: NVRM: Xid (0000:01:00): 8, Channel 00000020
Nov 22 21:42:26 era4 kernel: NVRM: Xid (0000:01:00): 8, Channel 00000020
Nov 22 21:42:41 era4 kernel: NVRM: Xid (0000:01:00): 16, Head 00000000 Count 00017c0d
Nov 22 21:42:50 era4 kernel: NVRM: Xid (0000:01:00): 8, Channel 00000020
Nov 22 21:42:59 era4 kernel: NVRM: Xid (0000:01:00): 8, Channel 00000020
Nov 22 21:42:59 era4 kernel: NVRM: Xid (0000:01:00): 16, Head 00000000 Count 00017c0e
Nov 22 21:43:13 era4 kernel: NVRM: GPU at 0000:01:00.0 has fallen off the bus.
See that last line? That means the driver could not “see” my video card any longer. I think the hard lockups happened because the non-free, proprietary nVidia 3D driver has hooks into the kernel to do its “3D magic”, so a video failure can hang the entire system. If there is a way to do fast 3D processing under X without hooking into the Linux kernel, well, I vote for that. Why? So that an ‘ssh’ into a desktop Linux box with a dead video card has a chance of succeeding, and a savvy troubleshooter can do forensics on the running system. In any case, a switch to a new video card solved the problem.
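For anyone who would rather not eyeball the log, the check described above can be sketched in a few lines of Python. This is only a rough illustration: the function name summarize_nvrm and both regular expressions are assumptions based solely on the NVRM messages quoted in this post, and other driver versions may word their messages differently. The sample text is a subset of the log lines shown above.

```python
import re

# Patterns modelled only on the NVRM lines quoted in this post (an assumption;
# other driver versions may format these messages differently).
XID_RE = re.compile(r"NVRM: Xid \(([0-9a-fA-F:.]+)\): (\d+),")
BUS_RE = re.compile(r"NVRM: GPU at ([0-9a-fA-F:.]+) has fallen off the bus")


def summarize_nvrm(lines):
    """Count Xid codes and report whether the GPU dropped off the bus."""
    xid_counts = {}
    fell_off_bus = False
    for line in lines:
        match = XID_RE.search(line)
        if match:
            code = int(match.group(2))
            xid_counts[code] = xid_counts.get(code, 0) + 1
        elif BUS_RE.search(line):
            fell_off_bus = True
    return xid_counts, fell_off_bus


# A few of the lines from the log above, used as sample input.
SAMPLE = """\
Nov 22 21:41:33 era4 kernel: NVRM: Xid (0000:01:00): 8, Channel 00000000
Nov 22 21:41:35 era4 kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Nov 22 21:41:42 era4 kernel: NVRM: Xid (0000:01:00): 16, Head 00000000 Count 00017c0c
Nov 22 21:43:13 era4 kernel: NVRM: GPU at 0000:01:00.0 has fallen off the bus.
"""

if __name__ == "__main__":
    counts, fell_off = summarize_nvrm(SAMPLE.splitlines())
    print("Xid counts:", counts)
    print("GPU fell off the bus:", fell_off)
```

To run it against a real log, pass an open file instead of the sample, for example summarize_nvrm(open("/var/log/messages")), as root if the log is not world-readable. A pile of Xid errors followed by the "fallen off the bus" line is a hint to suspect the card before blaming the kernel, though it is not proof on its own.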
My new video card? It is an “old”, unused ATI Radeon X1650 Pro I had sitting on a shelf here. It is using the “free” ATI driver supplied with my distribution. Oh, and my 3D game, Unreal Tournament 2004, works just fine with that. However, I have not gotten Quake 4 to run yet with the “new” setup. I expect I will be able to get Quake 4 working if I decide to take the time to look into that. But for now, I am happy with what I have. At least I can get my work done, which is much more important than any game.
Now, if you will excuse me, I need to get back to building Linux based systems for our clients. Thanks for stopping by though.