Open Source Horror Story – A Linux Recovery Tale

Hi children! I know it is a bit early for scary tales. We usually get to those in October. But I have one for you that you just might want to hear now. So. Get your hot cocoa, your S’mores and your sleeping bag and come over here by the fire. I have a tale of chills and thrills to tell you young’uns. There now. Are you all snuggled in and ready for a scary tale? Good. Here goes …

It was late on an August evening. August 30th to be exact. A brave independent consultant and Linux administrator was finishing up a long, slow upgrade from Mandriva 2010.2 to Mandriva 2011 for a client. He had noticed the upgrade was taking an excessively long time, but as this was only his second upgrade to Mandriva 2011, he chalked it up to the new release. Little did he suspect the slow upgrade was due to … due to … oh, I can hardly say it to you sweet, innocent young’uns. But to tell the tale properly I must say it … A FAILING HARD DRIVE! (Look! I have goose bumps!)

When he rebooted following the last stage of the upgrade, he saw a … a … a … KERNEL PANIC! The system could not find the root / boot partition. So, he booted a PartedMagic Live CD to access the drive and see what was wrong. But PartedMagic refused to mount the partitions too. When he checked with GParted he saw that the /home partition, which he knew to be an XFS file system, was being “reported” as a “damaged” EXT4 file system. This looked bad. Very bad. So, he ran GSmartControl and tested the drive. Oh no! The drive was giving errors by the megabyte! Oh the horror! The angst! The tearing out of the hair … Okay, so he’s 50ish and mostly bald on top with a ponytail. He really avoids pulling out what hair he has left. But you get the picture.

Okay, not to worry. He had sold the client a new, spare hard drive just the right size to replace the failing drive. He also “knew” the client had backups, because he had set up the backups for them and told them how to run them. Plus they had periodic automatic backups as well and had been told how to check that the backups were running and completing successfully. But when he checked for the most recent backup … it was in May! No one had been running the manual backups and the automated backups were returning error logs that NO ONE WAS READING! (Yeah, he should have run an “extra” backup himself, but time was pressing because he had a time limit from the client to get the upgrade done. The time limit left no time for a backup.)

Now things were starting to look grim. He knew that losing three months of financial data stored in QuickBooks in the XP Professional virtual machine on the /home partition of the client’s drive could be a disaster for this small business client. Thinking it over, he decided the only solution was to run xfs_repair on the /home partition. So he did. Lo and behold, it worked! Well, somewhat. There were hundreds of megabytes in lost+found but the user directories showed up and most of the files were there, including what appeared to be the XP Professional virtual machine directory named .VirtualBox in the user account that ran the VM. Unless you have been in this position, my children, you have no idea the sense of relief this brave Linux denizen felt. But it was a premature relief, as you shall see.

He immediately shut down the system and installed the spare hard drive. Then our brave lad rebooted with the PartedMagic Live CD and ran GParted again to create a new partition layout. Then he ran Clonezilla to clone the recovered /home partition to the new drive, keeping his fingers, toes, arms, legs and eyes crossed for luck. (Did I mention he is a contortionist? No? Well, he’s not. That sentence is just for “color”.) The clone completed successfully and our intrepid Linux fellow shut down the system, removed the naughty hard drive, and gave it proper rites before smashing it with a sledge hammer. (Yeah, you guessed it, more “color”.)

Then he reran the “upgrade”, which was now morphed into a fresh install of Mandriva 2011 on the new hard drive. It was 4:00 AM on August 31st at this point. He was now into his 14th hour of an “upgrade” that had been supposed to take less than six hours by prearranged agreement with the client. By 7:30 AM, when the client’s staff began arriving, he had the system “finished”. The printer was printing. The scanner was scanning. The VM was booting. The rooster was crowing … just checking to see if you are paying attention. All appeared well and the client was understanding about hardware failures happening. After going over backup procedures with the client, again, our weary Linux consultant headed home for a short nap before starting his new business day.

Later that day he received a call. Yes, children, it was the client. The QuickBooks data was showing nothing past April 2010. Since this was August 2011, that was a Very Bad Thing. So, our fine Linux fellow headed back to the client and the “problem” system as he was now calling it. Upon review he discovered the restored virtual disk was one that had been a backup made in April of 2010 prior to an upgrade of VirtualBox at the time. Where was the most recent virtual disk with the client’s data? Gone. Vanished. Eaten by an evil hard drive. But, a light appeared above our hero’s head! Due to having had some sleep and some caffeine, he remembered that QuickBooks had been reinstalled with a new release in late June of 2011. He Had A Backup Of The System On A USB Drive From That Day! Yes, it would still mean losing two months of data. But that was much more acceptable in the client’s view than losing a year and a half of financial data. Which would mean near certain doom for almost any small business.

So, our Linux protagonist retrieved the USB hard drive, attached it to the system and ran a restore to get the virtual machine back from June 2011. This worked successfully and the VM booted. A check of the VM showed the data from June was there and intact. Our nice Linux guy packed up his gear, went over backup procedures with the client, again. (See a trend here?) Then headed home for supper and a good night’s rest. The End …

Well, not yet. You see, losing data really irritates our Linux Paladin. His mind would not let go of the problem. He kept thinking there was something he missed. Something he could have done to get all the data back. Something … something … some* … Ah HA! He recalled that lost+found directory with the hundreds of megabytes in it! He quickly called the client and arranged to go on-site after hours on that 1st day of September 2011. He combed through the lost+found directory with the ‘find’ command, searching for files around the correct size of our missing, most recent, virtual machine file. There was one hit, just one. But it was enough. He had found the latest copy of the virtual machine. After making a backup(!) he copied this file to the correct directory, set the virtual machine back up using this found file and all the financial data was recovered. Everyone rejoiced and there was much feasting. (Yep, “color”.) The Real End.
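
For the curious young Tuxes, here is a minimal sketch in Python of that kind of size-based search. It is not the command our hero ran (the tale only says he used ‘find’), and the function name, the 10% tolerance and the example path and size below are all assumptions made up for illustration. The idea is simply to walk a directory tree and keep the regular files whose size lands near a known target, closest match first.

```python
import os
import stat


def find_files_near_size(root, target_bytes, tolerance=0.10):
    """Return (path, size) pairs for regular files under root whose size
    is within +/- tolerance (a fraction) of target_bytes, closest first.

    Hypothetical helper: the name and default tolerance are illustrative.
    """
    low = target_bytes * (1 - tolerance)
    high = target_bytes * (1 + tolerance)
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so links are skipped below
                info = os.lstat(path)
            except OSError:
                # Entries in a recovered tree may be unreadable; skip them
                continue
            if not stat.S_ISREG(info.st_mode):
                continue
            if low <= info.st_size <= high:
                matches.append((path, info.st_size))
    # Put the best candidates (closest to the target size) at the top
    matches.sort(key=lambda item: abs(item[1] - target_bytes))
    return matches
```

A call such as `find_files_near_size("/mnt/home/lost+found", 20 * 1024**3)` (a hypothetical mount point and a hypothetical 20 GiB disk image) would list the candidates. Files in lost+found usually lose their names, which is why size is about the only handle left to search by. And as in the tale, copy a hit somewhere safe before doing anything else with it.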

What is the moral of our story young Tuxes? It is this: Never rely on someone else to do a backup. Backup, backup, backup, backup, backup for yourself. Then when you think you have enough backups, do another backup. You can be sure our Linux star has learned that lesson … again.

12 comments to Open Source Horror Story – A Linux Recovery Tale

  • Yes, this really happened to me this week. I thought it would make a good object lesson for those of you who do not make backups. I know it was an object lesson for me … again. ;)

    Added later: Some of you folks still are not following our comment policy. I am deleting your comments if they do not follow the requirement for a real e-mail address. No, you are all NOT billg@microsoft.com, or some other obviously fake address. :)

    Even later: oh yeah, someone on reddit mentioned I should have run ‘ddrescue’ before running xfs_repair. Actually I ran ‘dd’ and got a bit level copy of the partition first. I just forgot to include that step. It is good advice though.

  • Frank

    I blame Murphy.  It seems like you really only need the backup 1 out of 100 times you’re doing an upgrade.  So 99 times you do the backup and it isn’t needed.  That 1 time in 100 when you decide not to do the backup (because the client doesn’t want to ‘waste the time, we have backups’ or time constraints), when you decide to take that 1% chance, it ALWAYS bites you in the arse.  It’s Murphy and his damn law at work I tell ya.

  • susecaboose

    Funny- this story kind of reminds me of my first ‘oh sh**’ backup-less experience.
    Using YAST, I created a software RAID-5 stack out of 5 x 1TB hard drives. I was quite proud of this, as it was my first time running a raid of any sort- and it worked rather well. Because of the expected single hard drive failure that is covered by RAID, I thought I was immune to the stack ever blowing up on me.
    Boy was I wrong.
    RAID 5, while able to kick a lot of a**- has one weakness…. 2 or more hard drive failures. One hard drive was in the process of failing, whereas another had a lot of bad sectors. I tried to rebuild, reassemble, and 50 million other attempts. Due to my inexperience, and arrogance, I have lost 900GB of stuff that is not too easy to replace.
    Several hours of cursing later, I replaced the hard drives, took the ‘Well, derpington, you should backup’ comments on the SUSE IRC channel to heart, and decided to rebuild the array as RAID-6.
    Now I generally try and back things up that really matter to me. Especially things that can’t be easily replaced.
    Moral of the story: BACKUP, dammit. Marmalade.

    (Admin: Swearing edit. Please see our comment policy. Thanks for the comment though. :) )

  • James Dixon

    Allow me to add my recommendation wrt ddrescue. It is a lifesaver. We had a hard drive failure here at work (Windows XP machine, no backups) and Windows wouldn’t even mount the drive. Chkdsk just kept giving more and more errors each time it was run, and the drive still wouldn’t mount. Ghost hung at 6% in trying to back up the data. In short, none of the Windows tools at my disposal could do anything with the drive. I took the drive home, hooked it up to my Slackware box, and started ddrescue running. By morning all the data had been recovered save for a few files and one directory.

  • Well written with good SoH.
    Enjoyed!

  • nelsoncs

    IMHO a single drive is bad news for any mission critical machine.  I would suggest always using a RAID mirror.  My personal favorite is zfs.  The cost to the client is pretty minimal compared to the possible loss of data.
    Great story, especially the last part about the lost+found.

    • Don il

      A single disk in this scenario is also not the best option for an upgrade. When I need to upgrade a machine with important data on it, I usually make a fresh installation on a new disk, reconfigure all that’s needed, and then copy said critical info from the old disk to the new one. I never touch the old disk again, and I even put it inside a sealed anti-static bag, and keep it in a keyed locker at the customer’s facility. You see, it happened to me once that the new disk died just six days after installing it. No hassle then. I just brought a new one, reinstalled, and copied the info from the old disk, and restored four days’ worth of backup data –they don’t work on weekends– and everything was ok.

  • Wonderfully written; truly a great read that took me back in time to the day in which I lost my thesis.
