Wednesday, October 16, 2013

Update on steevie's downtime

So you all probably deserve an update on steevie. That update is this update.
steevie has been down for approximately a month. Here's what happened:
  1. I upgraded steevie.
  2. I rebooted steevie, due to systemd cgroup hierarchy changes.
  3. steevie refused to boot (he failed to mount the root partition).
So basically, here's what's supposed to happen on a normal boot:
  1. GRUB loads.
  2. GRUB loads the Linux kernel.
  3. GRUB loads the initial ramdisk.
  4. LVM, in the initial ramdisk, in userspace, searches for Volume Groups.
  5. LVM creates the device nodes that represent the LVM Logical Volumes in /dev.
  6. systemd mounts (or swapons) the created devices: one as /home, one as /, and one as swap.
  7. The initial ramdisk exits, the Linux kernel changes all the mounts to be mounted on the real root, and the system boots.
The problem is that somehow, the system cannot properly complete step 5. This means that the boot process "completes" like this:
  1. Steps 1-4 above complete normally.
  2. LVM tries to create the device nodes. For some reason, this hangs forever.
  3. Eventually, something (possibly systemd, I'm not sure) times out waiting for the device to be created, and kicks you back to a ramdisk shell (which means Busybox).
  4. The shell waits for you to do something to fix the boot attempt.
This is extremely unfortunate. Right now, it's looking like the LVM problem is being caused by a hard drive failure.
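To make the failure path above a bit more concrete, here's a rough Python sketch of the "wait for the device, then give up and drop to a shell" logic. This is not what the initial ramdisk actually runs; the function names, the timeout values, and the /dev/mapper/vg-root path are all made up for illustration.

```python
import os
import time


def wait_for_device(path, timeout=5.0, poll_interval=0.1, exists=os.path.exists):
    """Poll until `path` exists or `timeout` seconds pass. Returns True if it showed up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if exists(path):
            return True
        time.sleep(poll_interval)
    # One last check in case the node appeared during the final sleep.
    return exists(path)


def boot_step(root_device, timeout=5.0, exists=os.path.exists):
    """Decide what happens next: mount the root device, or fall back to a rescue shell."""
    if wait_for_device(root_device, timeout=timeout, exists=exists):
        return "mount"
    return "rescue-shell"


if __name__ == "__main__":
    # Hypothetical device path; on steevie the node never shows up, so this
    # prints "rescue-shell" after the timeout.
    print(boot_step("/dev/mapper/vg-root", timeout=1.0))
```

On a healthy boot the device node appears well before the deadline and the mount goes ahead. On steevie, the node never appears, so the timeout always wins.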
You can read all the gory details at this Stack Exchange question, and then this followup, but the tl;dr is that there isn't much I can do. There's still a little more to try, but I don't hold out much hope.
Worst case, I have to completely wipe the drive. Any data in your home directory will be preserved, because there are no problems mounting the /home partition. But if you have any data anywhere else, it will probably be lost. I'll run data recovery tools, of course, but I don't hold out much hope. Unfortunately, this also means that my beautiful README will be lost. :(
I'm not sure what I'll end up doing once the drive is wiped. It's possible I'll use btrfs on the new root, since it seems to be pretty resistant to this kind of stuff (and at the filesystem level instead of the block level, so it will probably be more effective).
Sorry for the downtime! If you have any questions or any concerns, feel free to reach out to me in the comments or on Twitter (mention either @strugee2 or @strugee_dot_net).