Bug Alert! v17.0 security updates include a kernel update. Reboot results in kernel panic.

2022-05-17 Update: All affected appliances have now been bugfixed and rebuilt as v17.1. The v17.x release will continue, with all new appliances to be released as v17.1.


As first reported by user ZZRabbit in our forums, there is a serious bug in TurnKey v17.0! Once security updates are installed on v17.0 appliances (specifically the kernel update), a reboot will result in a kernel panic. Others have since confirmed the issue and I have a workaround. Other issues have been noted, but they seem to be unique to specific users, so I'll only cover the confirmed resolution for now.

Updates will be clearly noted in this post as additional info comes to hand.

Am I affected?

This issue only affects v17.0 appliances. If you are running an earlier or different version, you are not affected. If you are running v17.0, I suggest that you assume you are affected. The fix (below) will not do any harm if you aren't affected (on earlier releases it will simply fail, as they don't include this kernel package).
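
If you're not sure what you're running, you can check the appliance version and the current kernel from the command line. This is just a quick sanity check; on a standard TurnKey install the version string should be in /etc/turnkey_version:

# show the TurnKey appliance/version string
cat /etc/turnkey_version
# show the currently running kernel
uname -r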

Don't reboot yet! Apply this fix first.

Before you reboot your v17.0 server, please be sure to run these commands:

# reinstall the initramfs tooling and regenerate the initrd for the new kernel
apt install --reinstall -y initramfs-tools
dpkg-reconfigure linux-image-5.10.0-14-amd64

That should ensure that the issue does not affect you.
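
If you'd like an extra sanity check before rebooting, you can confirm that a kernel image and matching initrd for the new kernel are present in /boot. This is only a rough check (it doesn't prove the initrd is complete), but a missing file would definitely be a problem:

# both files should be present for the new 5.10.0-14 kernel
ls -l /boot/vmlinuz-5.10.0-14-amd64 /boot/initrd.img-5.10.0-14-amd64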

Reboot to be sure:

reboot && exit
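
Once it comes back up, you can confirm that you're actually running the new kernel:

# should report 5.10.0-14-amd64 after a successful reboot
uname -r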

What if it's already broken?

If your server is already broken, then please reboot back into the old kernel. You can do that from the early boot menu. On the "GNU GRUB" boot screen (blue options box on a black background), hit the down arrow to select "Advanced options for Debian GNU/Linux", then hit enter. Then hit the down arrow twice to select the older kernel; it should be "5.10.0-13" (NOT the one that says "recovery mode"). Then hit enter to boot. It should "just work". Run the fix noted above and you should be good. Reboot into the new kernel to be sure.
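
If you want to double check which kernel you've booted before (and after) applying the fix, uname will tell you:

# should report 5.10.0-13-amd64 while booted into the old kernel,
# and 5.10.0-14-amd64 after applying the fix and rebooting
uname -r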

What went wrong?

I still haven't pinned down the exact cause, but it's clear that this issue is the result of a bug within our build system (a lot of infrastructure code was upgraded between v16.x and v17.0). I will update with more info once I am certain.

Still having problems? Want to report success?

If you are still having problems related to rebooting v17.0 with the new kernel, please post below in the comments. Please feel free to also report success.

What happens next?

Well, once you have applied the above fix and confirmed that you can successfully reboot into the new kernel, you can keep using your server. The above fix resolves the issue on your server, so you can continue to use your v17.0 appliance for the expected lifetime of v17.x.

Please note that there may be other repercussions of this bug that we're not yet aware of. Regardless, the above resolves this specific kernel update issue. I will thoroughly investigate the root cause and share any further information if/as it comes to light.

As for the v17.0 release in general, I will be pausing that for the moment. I have removed all the v17.0 build download links from the relevant appliance pages (the images are still on the mirror for now). I will also add a note to the top of all the v17.0 blog post announcements so users are aware.

The next step will be to pinpoint the bug in our build tools. I've already got a fair idea of where it occurred, but not yet why. Only once I am completely sure I understand what happened can I also be sure of what (if any) other implications there may be. If any further potential issues are discovered, I'll post back with a relevant workaround/fix.

Once I am sure that I have clarity on the bug and a fix for any and all relevant issues, the next step will be to fix the existing v17.0 appliances and re-release them as v17.1. This will mean many appliances will not see a v17.0 release, but that's ok. There are also a few more appliances that have been updated but not yet published/released; they will also make it into this "rebuilt as v17.1" batch. So it will be close to half the library.

When we've got to that point, we will essentially be back to where we were before this bug cropped up. I'll then return my attention to squashing this bug in the build code (if I haven't already). Once we've confirmed that all is well again there, we will resume the v17.x release, except that the remaining appliances will be released as v17.1.

Comments

Ian Hind

Just reinstalled Core on Proxmox 7.2; without installing security updates, it boots to Linux 5.10.0-13. "apt update" and "apt upgrade" followed by a reboot shows NO kernel panic and Linux 5.10.0-14. If I do the security updates during install, there's NO kernel panic and it reboots to Linux 5.10.0-14. Is there some subtle difference between the Core and LAMP builds causing the problem?
Chris

Is that Core as a container or a VM? I built a Core 17.0 container yesterday and found the same as you, i.e. it runs without problems on Proxmox 7.2 after applying updates.

Chris

Not sure what happened to my last post. It definitely had content when I saved it!

What I said was...

Ian - is your Core 17.0 install a Proxmox CT or VM? I've built a Core 17.0 container which I've successfully installed onto Proxmox 7.2. It's survived updates/reboots so it's not exhibiting the 'problem'. However, a container might be a special case because it's not running the full payload of a VM. If you've got a VM running then I'm sure Jeremy will be interested to know about that.

Jeremy Davis

The forums occasionally do swallow posts with no apparent warning. I'm not a PHP guy; we did have a PHP expert lined up to have a look at it, but he disappeared...

I assume Ian was referring to Core in a VM (although I could be wrong). For what it's worth, containers would not be affected because they don't include a kernel (they leverage the host's kernel).
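
For example, on a Proxmox host a CT will report the host's kernel rather than its own (the container ID 100 below is just an example placeholder):

# on the Proxmox host
uname -r
# inside the container - reports the same kernel as the host
pct exec 100 -- uname -r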

As I noted in a direct reply to Ian, Core (and TKLDev) weren't affected. There is a chance that others may have been ok too, but it's easier to just patch them all and re-release as v17.1.

Jeremy Davis

TBH, I don't understand why, but it appears that both Core and TKLDev weren't affected?!

The only obvious difference between Core and TKLDev versus the rest of the library is that neither includes a webserver, so I suspect that they weren't affected because they take fewer resources to build. It appears that the issue is caused by a race condition, exacerbated by lower resources (FWIW, I initially couldn't reproduce it locally until I reduced my TKLDev's resources). Regardless, considering that the "fix" is idempotent and won't cause any harm, I've run all of them through the patch process.

BTW, the v17.1 builds should be ready really soon (they're actually on the mirror, but I still need to update the pages).

Ian Hind

It is a VM created from the initial TKL17Core release. Good to know about the Core container. I built one as well, so I might spin that up to test.
