Domhnall Currie's picture

I did a regular ol' update on my TK server yesterday through Cockpit like I normally do and now the server won't boot.  I can't remember what my base image is, but I downloaded the boot-repair-disk and tried that out, but I'm not having any luck with it.  When I updated the server and rebooted it, it came up to the grub prompt.  After several attempts of running the boot-repair-disk, I'm currently at the grub repair prompt.

Question:  When the boot-repair-disk asks you if you are using raid, is that talking about linux's software raid?  I'm using an HP DL360 server with hardware raid, so I didn't let the repair disk install the mdadm util.  Should I tell it I'm using raid and let it install that? 

I can see everything in my TK partition, but it won't let me copy anything to the grub directory, giving me access denied on everything.  I've tried everything I know (which isn't much) but haven't had much success.  My test server that I try things out on before installing them on my main server is not a 1:1 copy of the main server, but it's running the same TK setup and I updated it at the same time and it is working ok.  Any thoughts?

Jeremy Davis's picture

So first thing, just for clarity, it sounds like you have this installed to bare metal? Your note about RAID certainly suggests that.

Also you don't explicitly note what version of TurnKey and/or Debian. E.g. if you've updated a v15.x TurnKey (Debian 9/Stretch based) system to the latest stable Debian version (currently 10/Buster), then it will be something of a hybrid of TurnKey v15.x and Debian 10/Buster.

You also didn't elaborate on what/how you ran updates. E.g. did you run "apt-get update && apt-get upgrade -y" from the commandline? Or something else entirely? Did you "autoremove" anything? If so do you recall what got removed?

You noted that you get a grub prompt, but didn't note exactly which one. It should have been either 'grub>' or 'grub rescue>'. Boot problems can sometimes leave you at an initramfs or busybox prompt too (seeing as you explicitly said grub prompt, I'm guessing it was neither of those, but I mention it for completeness). You also didn't note whether there was any other message. Often there will be a message just before the prompt (might be as simple as "error: file not found"). Also FWIW there are commands that can be run from a grub prompt and it's often possible to manually boot the system via the grub prompt...

Finally, re info about your situation, you mention a "rescue disk" but didn't elaborate on exactly what it is? As a general rule, it's best to use a rescue/live disk that is based on the same OS (with the same or at least similar software versions). Having said that, you could probably get away with an Ubuntu based disk of a similar version to the Debian that your TurnKey instance is based on (Ubuntu 20.04 is close to Debian 10/Buster; Ubuntu 18.04 is close to Debian 9/Stretch).


Somewhat tangentially, whilst TurnKey does support hardware install, personally it's not something I'd do. As I'm sure you're aware, I personally run Proxmox on (relatively low spec, albeit with lots of RAM) bare metal. I know that you've mentioned (elsewhere) that you've had issues with trying to install Proxmox, but I'm unclear on what the issues were. Your hardware sounds more than capable (TBH better than anything I've installed it on). I've installed it on lots of different hardware and TBH I've found it to have much better hardware support than TurnKey itself does. I'm no Proxmox expert, but I know Debian (the base of both TurnKey and Proxmox) quite well. If you want my 2c on Proxmox issues, feel free to start a different thread for that. Anyway, I digress and I'll come back to the immediate issue...


The fact that this occurred following updates does sound very much like they may have been an influential factor. However, you also note that even from a rescue disk the filesystem is only being mounted read-only?! If you aren't explicitly doing that yourself (accidentally or otherwise) then that sounds suspiciously like filesystem corruption (by default filesystems should be mounted read/write, but will fall back to read-only when faults are detected). And 9 times out of 10 (unless there's been power loss while writing or similar) filesystem corruption is an early sign of hardware failure (usually physical HDD(s), but sometimes cords, controller(s), PSU, etc).

Obviously, I'm only guessing, but from my experience it seems possible.

As such, the first thing I'd try to do is to run a filesystem check. To do that ensure that the filesystem is NOT MOUNTED! I.e. to be safe, I suggest a clean boot from your rescue disk and then (without mounting anything) run fsck on the filesystem(s).

You'll need to be root to do that, so either 'sudo su -' first, or prefix 'sudo' to each line. To check the /dev/sda1 volume and auto attempt fixes:

fsck -y /dev/sda1

If you have multiple volumes, then re-run that for each volume. If you are using LVM, then you will need to enable the volume(s) before you fsck them (but DON'T mount them!). I.e. assuming a Volume Group of 'turnkey' and a root volume of 'root', that will go something like this:

vgchange -ay
fsck /dev/mapper/turnkey-root
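
Because running fsck on a mounted filesystem can do more harm than good, a small guard may help here. This is only a sketch, assuming a POSIX shell on the rescue disk. `is_mounted` and `safe_fsck` are hypothetical helper names (they are not part of any tool), and the check only matches the exact device path as it appears in /proc/mounts, so a volume mounted under a different alias would slip through:

```shell
#!/bin/sh
# Hypothetical helpers (not part of fsck or any package).

# is_mounted DEVICE [MOUNTS_FILE]
# Succeeds if DEVICE appears in the first column of the mounts file.
is_mounted() {
    dev=$1
    mounts=${2:-/proc/mounts}
    awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$mounts"
}

# safe_fsck DEVICE
# Refuses to run fsck if the device looks mounted; otherwise checks it
# and attempts automatic fixes (-y), as in the commands above.
safe_fsck() {
    if is_mounted "$1"; then
        echo "refusing: $1 is mounted" >&2
        return 1
    fi
    fsck -y "$1"
}
```

After loading those functions as root, something like `safe_fsck /dev/sda1` (or `safe_fsck /dev/mapper/turnkey-root` once `vgchange -ay` has been run) would do the check.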

If that doesn't find any issue, then that makes hardware issues much less likely and does definitely point toward an update issue. And even if it does find and fix issues, it still doesn't necessarily mean my guess is right (personally I'd do more diagnostic tests on your hardware before feeling confident about that).

As for your question about the rescue disk and RAID, I suggest that you consult with the documentation for your rescue disk. I have nearly zero experience with RAID, but I'll give you my 2c anyway. Usually, questions about installing and/or configuring software in rescue/live environments, relate to installing software within the live OS (i.e. it will only be temporary and exist within RAM, not be written to disk). So usually it won't matter if you do install additional stuff that you actually don't need. Having said that AFAIK mdadm is related to software RAID. IIRC it does include tools that you could use with hardware raid, but there are probably better tools available. The fact that you seem to be able to mount the filesystem (even if it only is read-only) suggests that mdadm wasn't needed anyway. Although TBH, I have no idea; perhaps that's something to do with lack of tools to deal with RAID?

So give the filesystem check a go. It almost certainly won't resolve the booting issue, but it may allow you to get a bit further (e.g. allow you to mount the fs r/w). Once you've posted back with a bit more info, I'll see if there is anything more of value I can add...

Domhnall Currie's picture

Ok, I'm going to print this out and work through it step by step, but I'll give as many facts as I know right now.  Yes, this was installed to bare metal.  It's an HP DL360G6 server with 32GB ram that I bought used off eBay.  I think it was TKL Core I installed on it, but I know it was 16.x.  I've done Debian updates on it since I installed it, but no upgrades and there's no mix of TKL versions on it, it's all 16.x. 

I did the update with Cockpit. https://cockpit-project.org/ It's a web based server manager that I mostly use to help me manipulate Docker containers once I get my docker-compose config files setup.  I was going to do something to the server (can't remember what now), logged into Cockpit and saw that it was showing some bugfix updates available.  I don't think it does autoremove behind the scenes or anything in its default config, but I haven't configured it that way and I didn't choose anything like that when I updated the other day.  I just clicked update, it did its thing, said it recommended a reboot after the updates completed and I rebooted.  After the reboot, that's when it came up to just a Grub> prompt. 

After fiddling around with it a while and trying to repair Grub, that's when I got the Grub Rescue> prompt.  In my OP I said grub repair, but it was the Grub Rescue> prompt.  I fiddled around with that a while, but I've made no further progress since then.  Reason I didn't let it add the mdadm utility was I was concerned that it might try to do something and corrupt my hardware raid setup, so I wanted to try that as a last resort type of thing.  I have backups, but it's been a couple weeks (smh) and I've added a lot of Redmine data in the last couple weeks. (yes, I'm an idiot) 

The boot repair disk I tried was https://sourceforge.net/projects/boot-repair-cd/ . IIRC, it is Lubuntu 18.04 based, which I think is Debian Stretch and my TKL 16.x is running Buster.  Maybe that's a problem, but it is trying to install/fix Grub 2, which is what I have, so I think it should work if it's going to.  Which it apparently isn't at least with the current knowledge I have at hand.  :)  I know I probably could have done the repairs at the Grub> prompt manually, but I was hoping for a boot-run-reboot simple fix.  At 54 years old, you'd think I'd know by now that is not how things work in my world....  Anyway....

The server I'm running has redundant p/s on a UPS, so I'm guessing there was no power interruption.  I have four 500GB drives in there with a hardware raid controller.  2 drives are in an LVM giving me 1TB (actually it's 800 something GB because I let it leave some blank space when I let the Debian install configure them for "ease of adding drives to the LVM in the future").  Then those two drives are mirrored to the remaining two drives.  I know that's not all perfect if the raid controller writes bad info or if I've got a memory corruption problem writing bad info to the drives, etc, but everything seemed to be working ok before I did the update, so hope everything is ok there.  I never take the time to check logs though, I just look at them when problems start and I'm trying to figure out what's going on (bad habit, I know), so I could have had a hw problem and not known it.

As for the rest of your recommendations, I'll work on them and report back the results.  Thank you very much for taking the time!

Don

 

Jeremy Davis's picture

Ok, so unlike me (haha), this ended up being a bit long and rambly. Running the fsck as I recommended previously is still a good idea, but that won't fix it (but may rule out a hardware issue). If you want to skip straight to the "next steps", please scroll down to the next horizontal line like this:

----


Yeah, what you've described certainly does sound like something went wrong with an update. I'm not super familiar with Cockpit so I'm not sure what might have gone wrong, or even how it installs updates. As it's been around a while, I would imagine that they've got running updates pretty well figured out. Having said that, I can't help but wonder if perhaps there was something interactive within the update that the Cockpit GUI didn't know how to deal with (and perhaps just forced the wrong option?). FWIW, similar issues have occurred with Webmin in the past. I personally always use the commandline for updates (actually I pretty much use the commandline for everything I can, it's just so much easier and less ambiguous IMO...).

It's obviously a bit late this time, and it's possibly a lesson you've now learned, but as a tip for others: As TurnKey auto installs security updates, I would never recommend running updates on a whim. When you want to run updates, ensure that you have a bit of time up your sleeve and a current backup. FWIW as you have LVM and have left some free space, the creation of system snapshots is pretty easy (what you've noted should be plenty for a rollback snapshot for a few updates). Although I would still recommend creating a current backup (one that is stored elsewhere, rather than just a snapshot stored on the same disk(s)).
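
As a rough illustration of the snapshot idea, here is a minimal sketch. It assumes the Volume Group is 'turnkey' and the root volume is 'root' (as in the fsck example earlier in the thread), and that the VG has at least 5G of unallocated space. The helper names, the '-pre-update' suffix and the 5G size are all made up for the example. Setting DRY_RUN=1 just prints the commands instead of running them:

```shell
#!/bin/sh
# Hypothetical helpers; VG/LV names, snapshot name and size are assumptions.

# run CMD [ARGS...]: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# snapshot_before_update VG LV [SIZE]
# Creates an LVM snapshot named LV-pre-update from free space in the VG.
snapshot_before_update() {
    vg=$1 lv=$2 size=${3:-5G}
    run lvcreate --snapshot --size "$size" --name "${lv}-pre-update" "/dev/${vg}/${lv}"
}

# rollback_update VG LV
# Merges the snapshot back into its origin, undoing changes made since.
rollback_update() {
    vg=$1 lv=$2
    run lvconvert --merge "/dev/${vg}/${lv}-pre-update"
}
```

So before updating: `snapshot_before_update turnkey root` (as root). If the update goes badly: `rollback_update turnkey root`. AFAIK, when the origin volume is in use (as a mounted root filesystem will be), the merge is deferred until the volume is next activated, so expect to reboot. If the update is fine, the snapshot can be dropped with `lvremove`.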

I try to keep track of relevant Debian security updates (they auto install in TurnKey). They're almost always rock solid (hence why they auto install), but on the very, very rare occasion they can cause issues (hence why I try to keep track of them). But unless it was a security update, unfortunately, I have no idea what might have been updated recently on your system. The only thing that comes to mind is the auto grub update that broke. Although as an ISO install, that shouldn't have affected you (in other words, I'm still just guessing...). Assuming that Cockpit just ran apt updates, I'm not sure how useful the info might be, but if you look at the apt logs in /var/log/apt then you may get some insight into what was updated (and thus what might have gone wrong).
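
For what it's worth, the apt history log can be skimmed quite quickly. The following is a sketch only: `apt_history_summary` is a hypothetical helper name, and it assumes the usual history.log layout where each transaction records lines starting with Start-Date, Commandline, Install, Upgrade, Remove or Purge:

```shell
#!/bin/sh
# Hypothetical helper: show what each apt transaction installed/changed.

# apt_history_summary [LOGFILE]
# Defaults to the live system's log; from a rescue disk pass the path
# under the mount point instead (e.g. /mnt/var/log/apt/history.log).
apt_history_summary() {
    log=${1:-/var/log/apt/history.log}
    grep -E '^(Start-Date|Commandline|Install|Upgrade|Remove|Purge):' "$log"
}
```

From the rescue disk, with the broken system mounted at /mnt, that would be `apt_history_summary /mnt/var/log/apt/history.log`. Older, rotated logs are gzipped, so `zgrep` with the same pattern should work on those.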

FWIW the "grub rescue>" prompt means that only the first stage of the boot process (that is usually embedded in the MBR/GPT) has worked and the bootloader can't find the boot directory (or its contents are missing/corrupted). The "grub>" prompt means that the second stage has started and the bootloader has found the boot info and loaded at least some basic modules. But it has been either unable to locate or unable to use the boot config (i.e. usually errors or corruption in grub.cfg). In my experience, the latter is often caused by not being able to find the kernel, or the disk that the kernel is on (if not on the same disk as the grub config). Usually you can manually (at the "grub>" prompt) give grub the required info and boot into your OS. That won't actually fix grub, but will allow you to boot, so you can (hopefully) fix grub by re-installation and/or re-configuration. The "grub rescue>" prompt is much more limited and use of a live disk is certainly recommended. Although FWIW a manual boot is still often possible from there, it will require more commands, so use of a boot disk is generally going to be much easier.
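
To give a rough idea of what a manual boot looks like, here is a sketch of the sort of commands involved. These are typed at the GRUB prompt itself (not in a Linux shell). Everything specific is an assumption for illustration: (hd0,msdos1) as the partition holding /boot, /dev/mapper/turnkey-root as the root volume, and the `<version>` placeholders, which need to match the kernel files actually present (tab completion at the prompt helps). If /boot lives inside LVM, `insmod lvm` and a root like (lvm/turnkey-root) would likely be needed instead.

```text
# At the "grub>" prompt: find the boot files, then load the kernel and initrd
ls
ls (hd0,msdos1)/boot/
set root=(hd0,msdos1)
linux /boot/vmlinuz-<version> root=/dev/mapper/turnkey-root ro
initrd /boot/initrd.img-<version>
boot

# At the "grub rescue>" prompt: point grub at its directory and load normal mode
set prefix=(hd0,msdos1)/boot/grub
set root=(hd0,msdos1)
insmod normal
normal
```

If `insmod normal` fails (e.g. because the module files are missing from the grub directory), then a live disk really is the way to go.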

The fact that you now have the "grub rescue>" prompt suggests that whatever has been changed has actually made the issue worse (initially it could find its config but it just wasn't working, now it can't even find that). When fixing (grub) boot issues, as a general rule it is recommended to use a live disk with the exact same version of grub as what is installed on your system. As Ubuntu is not binary compatible with Debian, using a Debian live CD (e.g. the live component of the TurnKey ISO) is preferred. Having said that the version of grub in Ubuntu 18.04 is basically the same (both have v2.02) so I would expect that to work ok (although Ubuntu have their own custom patchset - so I'm not sure...).
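
For reference, re-installing grub from a live disk usually goes via a chroot into the broken system. Below is a minimal sketch only, not a tested recipe for this server. It assumes legacy BIOS/MBR boot, a root volume of /dev/mapper/turnkey-root, no separate /boot partition, and /dev/sda as the boot disk (a hardware RAID controller may well present the array under a different device name). `run` and `reinstall_grub` are hypothetical helper names; with DRY_RUN=1 the commands are only printed.

```shell
#!/bin/sh
# Hypothetical helpers; device names and mount point are assumptions.

# run CMD [ARGS...]: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# reinstall_grub [ROOT_DEVICE] [BOOT_DISK] [MOUNT_POINT]
reinstall_grub() {
    root_dev=${1:-/dev/mapper/turnkey-root}
    boot_disk=${2:-/dev/sda}
    mnt=${3:-/mnt}

    # Activate LVM volumes and mount the broken system's root filesystem
    run vgchange -ay || return 1
    run mount "$root_dev" "$mnt" || return 1

    # Make the live system's devices and kernel interfaces visible in the chroot
    for d in dev proc sys; do
        run mount --bind "/$d" "$mnt/$d" || return 1
    done

    # Use the installed system's own grub to rewrite the bootloader and config
    run chroot "$mnt" grub-install "$boot_disk" || return 1
    run chroot "$mnt" update-grub || return 1
}
```

Running `DRY_RUN=1 reinstall_grub` first shows exactly what would be executed. Because the chroot uses the grub binaries from the installed system, the grub version on the live disk matters less with this approach.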

Noting that you have redundant PSU and this occurred immediately following updates certainly does suggest a pure update issue. Although the fact that the rescue disk took you backwards does raise some questions... The disk check I recommended is still worth a go. If that doesn't find any issues, then you can rule out any disk corruption.

----


I note that the rescue disk appears to have the ability to produce a "boot info summary". Perhaps that report has some info of value? If you can get that and post it here, it might be helpful. And/or please provide the following:

From the broken system (i.e. when mounted to the live rescue disk, these paths will be relative to the mount point, e.g. if you mounted to /mnt the boot dir will be /mnt/boot/):

/boot/grub/grub.cfg
/etc/fstab

Also the output of these commands (I'm assuming that the default rescue disk user isn't root and you need sudo):

sudo fdisk -l
sudo blkid --garbage-collect
sudo blkid
sudo pvdisplay 
sudo vgdisplay
sudo lvdisplay

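
To save copying each output by hand, the files and commands above could be gathered into a single text file to post. This is just a sketch: `section` and `collect_boot_info` are hypothetical helper names, the report filename is arbitrary, and /mnt is assumed to be where the broken system's root filesystem is mounted. It needs to run as root (e.g. after 'sudo su -').

```shell
#!/bin/sh
# Hypothetical helpers for gathering boot diagnostics into one file.
REPORT=${REPORT:-boot-report.txt}

# section TITLE CMD [ARGS...]
# Appends a titled block with the command's output (including errors).
section() {
    title=$1
    shift
    {
        printf '===== %s =====\n' "$title"
        "$@" 2>&1
        printf '\n'
    } >> "$REPORT"
}

# collect_boot_info [MOUNT_POINT]
collect_boot_info() {
    mnt=${1:-/mnt}
    section "grub.cfg" cat "$mnt/boot/grub/grub.cfg"
    section "fstab" cat "$mnt/etc/fstab"
    section "fdisk -l" fdisk -l
    section "blkid" blkid
    section "pvdisplay" pvdisplay
    section "vgdisplay" vgdisplay
    section "lvdisplay" lvdisplay
}
```

Then `collect_boot_info /mnt` leaves everything in boot-report.txt. A command that fails (e.g. a missing grub.cfg) just has its error message recorded in its section, which is useful info in itself.
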
Domhnall Currie's picture

I'm thinking my problem was somehow related to the auto grub update issue you mentioned above, Jeremy.  After I got Redmine reinstalled on another server and my data migrated over, I re-initialized the server that broke, so I guess I'll never know what caused that problem.  I do remember update mentioning some "packages held back" or some other issue that I briefly looked into and figured I should wait on trying to fix.  I mentioned in Ewen's post on his new server that I had a hard time getting Proxmox to boot on this server, but it booted fine on a different HP machine.  My solution was to use Ventoy and its default method of booting was MBR.  All the drives in this server seemed to check out ok and I also found the memory test in the system BIOS that I ran and it reported the memory was ok.  I found a memory check package in the Debian repository and I'm going to run that just to see what it reports.  After the update that broke the server, all the hardware *seemed* to check out ok and all my data was present, it was just that the boot files were in grub.bak instead of grub and the .mod files were "missing".  Between the hardware and the software, it looks like there was just some snafu between GPT and MBR and my boot situation got twisted up.  If I had asked for help before I ran that recovery disk (I'm sometimes slow to ask for help, not because I don't like to ask for help, but usually because I don't realize I have a problem at that point) :), I think I could have easily restored grub or reverted to the previous kernel and I would have been able to boot as you mentioned above.  Linux has always been so rock-solid for me and I've never had a problem like this before, so that's why I've always just let it update and "do its thing" and I didn't realize the situation I was in. 

The G6 server line was introduced in 2009 and with the split of HP/HPE, you have to have a subscription for some of the drivers/firmware, etc.  (The availability of support/drivers for everything since the beginning of time is one reason I've been loyal to HP for so many years, but with this split, I'm re-thinking that)  I'm not sure I'm running the latest of everything, but I'm guessing that with the issue I had with Proxmox on one machine, but not the other, maybe there's some hw issue with my BIOS or RAID card that caused the update to burp.  I'll be a lot more conscious of running updates in the future, especially on that machine, but maybe Proxmox will help keep my environments separated and backed up more reliably.  I don't really think I had a hardware issue, a software issue or a recovery disk issue, like usual the issue was between the keyboard and the seat.  :)  As always, I appreciate your help!  I'm back running with a lot more info at my disposal for future use.

Jeremy Davis's picture

I'm glad to hear that you worked through the issues and are now back up and running.

Re not asking for help in time, TBH, historically, I've been the worst at asking for help, but I've learned a lot that way...! :)

In the future, feel free to ask though! :) The earlier you ask when you notice anything not as it "should be" (that you don't understand) the better.

Re installing updates, personally, I would highly recommend a backup (ideally both a backup and a snapshot IMO) before installing updates. 99.99% of the time it will be wasted effort as it will all "just work", but that one time it doesn't will save you a whole ton of pain!

If you ever see "packages held back" when updating, and you haven't (or don't recall having) explicitly held packages to particular versions, it's always worth investigating! Even if there was previously a reason to hold a package, it's still worth double-checking that it's still relevant. In a current TurnKey install, the only packages that should ever be held (by us) will be some PHP packages when using a non-default PHP version. But as per always, feel free to ask for specifics. Often I can answer off the top of my head, otherwise, I can usually at least assist you to find the relevant config/settings.
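
As a starting point for investigating, `apt-mark showhold` lists any packages that have been explicitly held. Packages that apt declines to upgrade for other reasons show up in a simulated upgrade instead. The sketch below pulls those names out of the simulation output. `kept_back` is a hypothetical helper name, and it assumes the English message "The following packages have been kept back:" followed by indented package names (hence LC_ALL=C in the usage).

```shell
#!/bin/sh
# Hypothetical helper: list packages that a (simulated) upgrade would keep back.

# kept_back: reads apt-get output on stdin, prints one package name per line.
kept_back() {
    awk '
        /^The following packages have been kept back:/ { inblock = 1; next }
        inblock && /^ / { for (i = 1; i <= NF; i++) print $i; next }
        { inblock = 0 }
    '
}
```

Usage would be something like `LC_ALL=C apt-get -s upgrade | kept_back` (the -s flag only simulates, nothing is changed). Then `apt-cache policy <package>` on anything listed should give a hint as to why it isn't being upgraded.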

Regarding drivers that you need, if drivers required for your hardware are available from Debian (rather than upstream, such as HP) then I would highly recommend using them, rather than installing from elsewhere... Other than really new hardware, it should generally just be a case of installing missing packages (from "non-free" if they aren't auto installed).

Anyway good luck with it all.

Domhnall Currie's picture

You can bet everything will be in order, as in the recommended backups/snapshots, before I ever do an update again on a production machine!  :)  Speaking of updates, I noticed when I freshly installed RM on this other server, there was a "Yarn Packaging" repository that wasn't signed so it wouldn't allow updates from that repository.  Is that something of concern or can it just be ignored? 

Jeremy Davis's picture

My guess is that the signing key has expired. Check out this thread on how to update it.

Although having said that, if you don't want to update yarn, then it doesn't really matter (FWIW yarn is a package/dependency manager for Javascript - an alternative to npm).
