Richard's picture

Just an FYI: I'm getting an issue on my TKL containers where sshd stops running; version details below.

It seems there was a recent update that caused it.

The fix I used was workaround 1 from here:

https://askubuntu.com/questions/1109934/ssh-server-stops-working-after-reboot-caused-by-missing-var-run-sshd/1110843#1110843

Followed by:

service sshd restart
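
For anyone else hitting this, my understanding of that workaround is that it adds a tmpfiles.d entry so sshd's runtime directory gets recreated automatically. Roughly like this (a sketch from memory - see the linked answer for the exact content):

# /usr/lib/tmpfiles.d/sshd.conf
d /run/sshd 0755 root root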

On a related note, can you please change the update interval to weekly (or even monthly) rather than daily in /etc/cron.d/cron-apt? I changed mine to Sunday:

54 3 * * SUN root test -x /usr/sbin/cron-apt && /usr/sbin/cron-apt

 

Distributor ID: Debian
Description:    Debian GNU/Linux 10 (buster)
Release:        10
Codename:       buster

 

Richard's picture

TurnKey GNU/Linux 16.1 (Debian 10/Buster)

Jeremy Davis's picture

That certainly doesn't sound good! I assume that you used the Proxmox pct tool to enter the container and rescue it?! Regardless, thanks tons for posting. I'm sure it'll be helpful for others.

FWIW I just tried to recreate the issue with our v16.1 WordPress LXC container (running on Proxmox). I installed the security updates and it's still working (my SSH connection did glitch for a moment, but came good fairly quickly so I assume it was just because SSH was restarted). I then also did all available updates just in case, and that made no difference either. I have tested rebooting and SSH consistently comes back up ok?! So there must be some specific difference between our setups? I'm still running an older version of Proxmox, so perhaps that's the difference?

Regardless, as noted in your link, /var/run should be a symlink to /run. The sshd directory should exist in /run (on my system it's an empty directory, but it's there). So on a properly configured system, /var/run/sshd should definitely exist and point to the same place as /run/sshd.

root@wordpress ~# turnkey-version 
turnkey-wordpress-16.1-buster-amd64
root@wordpress ~# ls -l /var/run
lrwxrwxrwx 1 root root 4 Mar  5  2020 /var/run -> /run
root@wordpress ~# ls /run | grep sshd
sshd
sshd.pid
root@wordpress ~# ls /var/run | grep sshd
sshd
sshd.pid
root@wordpress ~# ls -l /var/run/sshd.pid
-rw-r--r-- 1 root root 5 Apr 13 00:22 /var/run/sshd.pid

You didn't mention whether it happens intermittently, or only after a reboot. I don't understand how the directory might randomly disappear, so I'll assume it's only after a reboot. I could imagine this issue being caused by some sort of race condition on boot, i.e. under some circumstance, perhaps /run has not (yet) been created (it's a tmpfs created on the fly at boot) by the time SSH tries to start (so /var/run/sshd doesn't yet exist)? That would explain why the workaround works, as it explicitly ensures /run/sshd exists before SSH needs it. Systemd is probably smart enough not to try starting SSH until /run has been set up though. I'm only guessing...
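
If it does happen again, the boot journal for the ssh unit should show whether it failed at startup (and why). Just a sketch of what I'd check, nothing TurnKey specific:

journalctl -b -u ssh.service --no-pager
journalctl --list-boots    # then e.g. 'journalctl -b -1 -u ssh.service' for the previous boot (assuming the journal is persistent)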

However, a closer look at the SSH service file (/usr/lib/systemd/system/ssh.service) shows that the runtime directory SSH should use is already set to /run/sshd by this line:

RuntimeDirectory=sshd

The base directory for 'RuntimeDirectory' is /run, so systemd should be creating the required /run/sshd directory when it starts SSH?!
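
A quick way to sanity check that on any of your containers (again, just a sketch):

systemctl show ssh.service -p RuntimeDirectory
ls -ld /run/sshd    # should exist whenever ssh.service is running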

That makes me wonder if you have done some sort of system upgrade/migration steps at some point? If you did a TKLBAM data migration, perhaps something from the old server has inadvertently been included when it shouldn't have? Or perhaps you did an "in place" Debian upgrade? Please share any other details. Also, I'd be interested to see what your ssh.service file includes and if there are any overrides configured. So please share the output of these commands:

cat /usr/lib/systemd/system/ssh.service
ls -la /etc/systemd/system

Regardless though, if the workaround you noted resolves your problem, then that's a good thing (the workaround seems pretty reasonable, albeit it should be unneeded with the default config). My only concern is that your wording suggests that perhaps the issue was intermittent? If that was the case, then perhaps something else is going on and the workaround you applied hasn't actually done anything (and the issue just hasn't occurred again since - coincidentally)?!


As to your other point, re the cron job adjustment: your change looks ok in essence, but personally I think that for a production server, installing security updates nightly is the desirable and preferable way to go. So unless I hear a really convincing argument for why it should run less often, I won't be changing the default in TurnKey servers. I might be prompted to include a confconsole plugin to change the frequency, but it isn't currently a priority (you are one of only a handful of users to make some sort of complaint about the auto updates in 10 years, so they seem to suit most users).

Why do you want it to run less often? Without understanding your rationale for wanting to reduce the frequency of security update installs, it's hard to make alternative, superior suggestions. If it's related to network traffic, then setting up apt repo caching (via either apt-cacher-ng or squid) might be a better way to resolve that concern.

Richard's picture

Hi Jeremy,

The background is: fairly vanilla 16.1 containers running on Proxmox VE 7.1-4. Haven't messed with sshd. Have installed other stuff like crowdsec, web app framework, redis. Deployed from scratch

It happened to 4 containers simultaneously AFAIK, one is nginx & mariadb, the others are TKL core. Yes, proxmox local console to the rescue. It has only happened once so far AFAIK. I don't ssh in every day.

I guess the other explanation is it was caused by the containers being suspended during backup? But not sure why that would kill sshd and nothing else.

So do you think I shouldn't need /usr/lib/tmpfiles.d/sshd.conf ?

Maybe a restart of sshd would have fixed it, but I just Googled the problem and went with the first solution that looked reasonable. I have one container without the file, so I'll see if the problem recurs, but I have made the KillMode=process fix to that one, so that might change the behaviour?

Config requested:

erp@host ~$ cat /usr/lib/systemd/system/ssh.service
[Unit]
Description=OpenBSD Secure Shell server
Documentation=man:sshd(8) man:sshd_config(5)
After=network.target auditd.service
ConditionPathExists=!/etc/ssh/sshd_not_to_be_run

[Service]
EnvironmentFile=-/etc/default/ssh
ExecStartPre=/usr/sbin/sshd -t
ExecStart=/usr/sbin/sshd -D $SSHD_OPTS
ExecReload=/usr/sbin/sshd -t
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
RestartPreventExitStatus=255
Type=notify
RuntimeDirectory=sshd
RuntimeDirectoryMode=0755

[Install]
WantedBy=multi-user.target
Alias=sshd.service

erp@host ~$ ls -la /etc/systemd/system
total 52
drwxr-xr-x 12 root root 4096 Apr 13 02:17 .
drwxr-xr-x  5 root root 4096 Dec 13 19:33 ..
drwxr-xr-x  2 root root 4096 Apr 13 02:17 basic.target.wants
drwxr-xr-x  2 root root 4096 Dec 13 19:22 boot-complete.target.requires
-rw-r--r--  1 root root  499 Dec 14 15:00 crowdsec.service
lrwxrwxrwx  1 root root   33 Dec 13 19:22 ctrl-alt-del.target -> /lib/systemd/system/reboot.target
lrwxrwxrwx  1 root root   44 Dec 13 19:22 dbus-org.freedesktop.network1.service -> /lib/systemd/system/systemd-networkd.service
lrwxrwxrwx  1 root root   44 Dec 13 19:22 dbus-org.freedesktop.resolve1.service -> /lib/systemd/system/systemd-resolved.service
lrwxrwxrwx  1 root root   45 Feb 12  2021 dbus-org.freedesktop.timesync1.service -> /lib/systemd/system/systemd-timesyncd.service
lrwxrwxrwx  1 root root    9 Dec 14 14:39 fail2ban.service -> /dev/null
drwxr-xr-x  2 root root 4096 Apr 12 01:17 getty.target.wants
drwxr-xr-x  2 root root 4096 Apr  7 19:05 multi-user.target.wants
drwxr-xr-x  2 root root 4096 Dec 13 19:22 network-online.target.wants
drwxr-xr-x  2 root root 4096 Dec 13 19:22 paths.target.wants
drwxr-xr-x  2 root root 4096 Dec 13 19:22 sockets.target.wants
lrwxrwxrwx  1 root root   31 Feb 12  2021 sshd.service -> /lib/systemd/system/ssh.service
lrwxrwxrwx  1 root root    9 Feb 12  2021 stunnel4.service -> /dev/null
drwxr-xr-x  2 root root 4096 Dec 13 19:22 sysinit.target.wants
lrwxrwxrwx  1 root root   35 Feb 12  2021 syslog.service -> /lib/systemd/system/rsyslog.service
drwxr-xr-x  2 root root 4096 Dec 13 19:22 timers.target.wants
drwxr-xr-x  2 root root 4096 Feb 12  2021 webmin.service.d

Security updates:

My point of view and rationale is: if it ain't broke don't fix it :-) As far as I'm concerned security vulnerabilities fall into two categories: notified before release and after release.

If maintainers are notified before release then they have plenty of time to fix and release a patch before the world gets to know, so a few more days before the patch is applied makes no difference.

If notified after it will take them a while to fix it anyway, so a few more days of being vulnerable really isn't going to make much difference either.

As far as I'm concerned, regular updates are much more likely to introduce issues themselves, by interfering with applications or simply being badly tested patches. Microsoft update has got the world into bad habits. Any change on a production server should really go through a change control procedure where it is tested first, but who has the time? So by delaying it, you're allowing other people to do it first and fix it before you apply the patch. I see every update as Russian roulette. The fewer you do, the better, within reason. Plus I was getting inundated with cron-apt emails from every container.

Jeremy Davis's picture

The background is: fairly vanilla 16.1 containers running on Proxmox VE 7.1-4. Haven't messed with sshd. Have installed other stuff like crowdsec, web app framework, redis. Deployed from scratch

All sounds good so far ... (BTW crowdsec looks awesome - any more to say about your experiences with that?)

It happened to 4 containers simultaneously AFAIK, one is nginx & mariadb, the others are TKL core. Yes, proxmox local console to the rescue. It has only happened once so far AFAIK. I don't ssh in every day.

The fact that it happened somewhat simultaneously to 4 TurnKey v16.x containers certainly does suggest some common cause. I'm assuming that you didn't have any non-TurnKey containers affected, but just to be clear, did you have any other containers running on the same host that weren't affected (TurnKey and/or otherwise)? If so, what were they?

I guess the other explanation is it was caused by the containers being suspended during backup? But not sure why that would kill sshd and nothing else.

Yeah, I agree, I wouldn't have expected that to be the cause. Although I guess that'd be fairly easy to test?!

So do you think I shouldn't need /usr/lib/tmpfiles.d/sshd.conf ?

Short answer: I don't know, but I don't think so.

Long answer: This is what we know for sure:

  • SSH stopped simultaneously on 4 TKL v16.1 CTs (at least within a few days of each other)
  • After adding a file (/usr/lib/tmpfiles.d/sshd.conf) to 3 of the 4 (as per your notes elsewhere) and restarting SSH on all 4 servers, it appears to be working fine again
  • SSH has continued to run fine since then

So we still don't actually have any idea what caused the issue. Thus all we can say about your changes is that they haven't made things worse. I suspect that they aren't actually doing anything, but you'd need to test to be sure. The fact that you didn't implement the change on one of the 4, and that one is also still fine, lends weight to the idea that your change makes no difference.

Maybe a restart of sshd would have fixed it, but I just Googled the problem and went with the first solution that looked reasonable.

As I noted above, unless you diagnosed the issue or at least tried restarting SSH first, we have no idea whether your change made any difference or not (but as I explained, probably not).

I have one container without the file, so I'll see if the problem recurs, but I have made the KillMode=process fix to that one, so that might change the behaviour?

Initially I was like, "what 'KillMode=process fix' are you talking about?!?" 'KillMode=process' is in the default ssh.service file. It's in your output and it looks exactly like mine?! I even checked within Debian and it looks like it was added to ssh.service with the initial systemd support - 8 years ago.

Then I remembered your other thread. I missed that you were editing the template file (i.e. ssh@.service), not the default service file (ssh.service), and I didn't actually double check the ssh.service file to realise that.

By default, ssh uses a single long running service to manage all ssh connections and traffic. The service file it uses is 'ssh.service'.

It can instead be configured to use multiple ssh services, one for each connection. I'm not particularly familiar with that configuration, but AFAIK it uses a socket, and each ssh session is triggered on connection. Scenarios like that (i.e. multiple instances of a service) use a service template file, in this case an instance of 'ssh@.service'. You can tell it's a template because of the '@' symbol.

I'll post on your other thread too, but unless you've made some SSH config changes (which your output here suggests you haven't), what you've noted on your other thread would not make any difference (the ssh@.service file is never used by default).
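
For what it's worth, switching between the two modes is just a matter of which unit is enabled - something like this (a sketch only, not a suggestion to change anything):

# default: one long running sshd daemon
systemctl enable --now ssh.service

# alternative: socket activation - an ssh@.service instance per connection
systemctl disable --now ssh.service
systemctl enable --now ssh.socket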

I have one container without the file, so I'll see if the problem recurs

Circling back to this; ah ok. So on at least one server, you just restarted SSH and that appeared to be sufficient?

Config requested: [...]

That all looks fine. Your service matches mine (as per the Debian default) and the other output suggests that you are running the same single long running SSH service as per default.

So in summary, we still don't know why SSH stopped, but it seems likely that simply restarting it was sufficient to get it running again.

Unfortunately, I didn't have any long running v16.x containers (I have v16.x VMs, but not CTs). As I've noted, I haven't been able to recreate any of the issues you've reported.

I have had a look at SSH updates, and there haven't been any since the app was built (so it hasn't been updated). So it wasn't an SSH update that caused any of your issues. There have been updates to systemd since the build, but they were some time ago too, and the most recent is a security update, so it should have been installed within 24 hours of your initial launch (either at firstboot or by the first cron-apt run) - unless of course you changed the cron-apt schedule before it ran?

So unfortunately, why SSH stopped on these 4 servers is still a mystery... I personally really hate problems like that. I sort of got used to it with Windows, but find it much rarer on Linux. Although if a restart appears to have fixed it and it doesn't stop again anytime soon, then perhaps the "why" doesn't really matter that much?!


My point of view and rationale is: if it ain't broke don't fix it :-)

I'm inclined to agree. Although it could also fairly be argued that if the Debian security team have released an update, then something is broken! :) They don't do that lightly...

As far as I'm concerned security vulnerabilities fall into two categories: notified before release and after release. If maintainers are notified before release then they have plenty of time to fix and release a patch before the world gets to know, so a few more days before the patch is applied makes no difference. If notified after it will take them a while to fix it anyway, so a few more days of being vulnerable really isn't going to make much difference either.

I get your point and it's not an unreasonable rationale. Although your argument could be used to argue for daily updates instead of hourly, so the timeframe is somewhat arbitrary. So it comes down to which timeframe is the most appropriate. We settled on daily as it seems like the best balance to us. You disagree, which is fair enough. But I'm not sure that your argument is convincing enough for me to change time tested config.

As far as I'm concerned, regular updates are much more likely to introduce issues themselves, by interfering with applications or simply being badly tested patches.

I 100% agree! For what it's worth, the 'stable' in Debian stable doesn't refer to how stable the software itself is (although generally it is). It actually refers to the stability of the software's behaviour. It's been well documented that bugs in some specific packages have been left unpatched in Debian stable explicitly because fixing them would change expected behaviour!

Microsoft update has got the world into bad habits. Any change on a production server should really go through a change control procedure where it is tested first, but who has the time?

As an ex-Windows administrator, I would add the qualifier of "recent" or "modern" Microsoft update! I pity the poor fool who left auto updates running in Win XP on a Win Server 2003r2 network (I learned the hard way...). It's really only been since Windows 10 that I've found the Windows updates robust enough to auto enable (and still sleep at night).

But to your point, again I'm broadly inclined to agree. But that's why we only install the security updates (not all updates). The Debian security team carefully crafts patches that introduce the minimal possible changes to address the security issue.

Even with the security updates enabled, minor CVEs may remain unpatched. Those sometimes just go to the "updates" repo, so are rolled out together at the next "point release" (neither of which TurnKey users will ever get unless they manually install available updates).
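
For completeness, picking those up manually is just the usual apt routine - nothing TurnKey specific:

apt update
apt upgrade    # or 'apt full-upgrade' if you also want packages that pull in new dependencies or removals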

So by delaying it, you're allowing other people to do it first and fix it before you apply the patch. I see every update as Russian roulette. The fewer you do, the better, within reason. Plus I was getting inundated with cron-apt emails from every container.

Again I understand your concern. I think there is some validity in changing the frequency, or even disabling the auto updates altogether. But in our experience, the risk is low. For its entire ~14 year history, TurnKey has always installed available updates from the security repo (only - never updated packages in main or updates). In the time that I've been closely involved with TurnKey (incidentally about as long as we've been based on Debian - roughly 10 years), IIRC the auto security updates have caused 2 hiccups. On both occasions the issue was more to do with the way that updates are installed, rather than the updates themselves: new dependencies (from main) were required, and as our updates (can) only install from "security", the updates failed. This did cause a DoS, but no significant data loss.

I do also vaguely recall a Samba security update that introduced a regression. But the Debian security team released a revised update the next day.

So I'm totally open to documenting how to reduce the frequency. I'd even be open to making it easier to change (e.g. via a script or a confconsole plugin), but I'm not sure I'd want to change the default behaviour. Actually, I would like to make it so that it could be configured to allow installation of new dependencies from main if needed (thus eliminating the failure mode behind the 2 times it has caused issues). That config should be possible, but I haven't spent tons of time trying to work it out. And obviously it would need testing prior to implementation.

Anyway, apologies on the essay...

Richard's picture

Just to answer your questions:

Crowdsec: I have run fail2ban and liked it, but just thought I'd try crowdsec for this server. I haven't really monitored or played with it much (I really should), but it just seemed like a good idea to have detections from the community replicated. fail2ban was blocking so many IPs from ssh attempts alone that I figured it was a good idea to share that info with the community. Though there is a lot of trust to put into something with not a great deal of history, from what I gather. I have it installed in "no-api" mode on my containers, then have the "lapi" and "firewall-bouncer" installed on the Proxmox PVE host. The container agents are registered with the PVE instance. So in theory any strange internal behaviour should cause a block on the public interface. Though if they're already inside a container, blocking an internal IP isn't going to do much good for inter-container traffic. I do have ports forwarded through to the containers though.
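
In case it helps anyone, the agent-to-LAPI registration was roughly along these lines (from memory, so treat it as a sketch and check cscli's help for the exact subcommands/flags; the host and names are placeholders):

# on each container (agent only, no local API):
cscli lapi register --url http://pve-host:8080

# on the PVE host, where the LAPI runs, approve the new agent:
cscli machines list
cscli machines validate <machine-name>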

I'm not running any other containers, just 4x TKL.

Backups run nightly and I haven't seen any other issues.

Just to recap the timeline:

It seems like sshd was killed for some reason at some point on 3 containers; a service restart probably would have fixed it.

A daemon log snippet from a random time before I noticed the issue on one container. No idea if it's related:

Apr 11 02:44:49 shared1 systemd[1]: ssh@13-10.0.254.1:22-10.0.0.1:44938.service: Succeeded.
Apr 11 02:44:49 shared1 systemd[1]: ssh@13-10.0.254.1:22-10.0.0.1:44938.service: Consumed 1.154s CPU time.
Apr 11 03:58:26 shared1 systemd[1]: Starting Daily apt download activities...
Apr 11 03:58:27 shared1 systemd[1]: apt-daily.service: Succeeded.
Apr 11 03:58:27 shared1 systemd[1]: Started Daily apt download activities.
Apr 11 03:58:27 shared1 systemd[1]: apt-daily.service: Consumed 537ms CPU time.
Apr 11 04:57:52 shared1 systemd[1]: Started OpenBSD Secure Shell server per-connection daemon (10.0.0.1:44940).
Apr 11 04:58:57 shared1 systemd[1]: ssh.socket: Succeeded.
Apr 11 04:58:57 shared1 systemd[1]: Closed OpenBSD Secure Shell server socket.
Apr 11 04:58:57 shared1 systemd[1]: Starting OpenBSD Secure Shell server...
Apr 11 04:58:57 shared1 systemd[1]: Started OpenBSD Secure Shell server.
Apr 11 04:59:06 shared1 systemd[1]: ssh@14-10.0.254.1:22-10.0.0.1:44940.service: Succeeded.
Apr 11 04:59:06 shared1 systemd[1]: ssh@14-10.0.254.1:22-10.0.0.1:44940.service: Consumed 624ms CPU time.

Then I restored a backup of one container and modified it, and on that one, for some reason, the socket took over from the service. Since then everything seems to have been stable.

Thanks for the essay anyway!

Jeremy Davis's picture

Crowdsec certainly looks pretty cool. I'll keep a bit of an eye on it...

Also, that log output you shared shows SSH using the socket (i.e. the first line notes an instance of the template: 'ssh@13-10.0.254.1:22-10.0.0.1:44938.service').

So it's still a crazy mystery...

Richard's picture

Just as a quick update.

I ran into the `Missing privilege separation directory: /var/run/sshd` problem again on a vanilla Debian 11 container and it was caused by sshd crashing/stopping. Still don't know why but it was resolved with a `service sshd restart`.

It seems that when the service stops/crashes, the socket takes over, because I was still able to log in before the restart.

I noticed because mail-in-a-box runs a status check every night, which runs `sshd -T`, and that failed with the above error.

Oddly, I had to close the ssh socket connection (log out) and restart sshd from a local console, otherwise I couldn't log in even though the service said it was up.
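
For anyone else trying to work out which of the two is actually answering connections, this is roughly what I looked at (just a sketch):

systemctl status ssh.service ssh.socket
sshd -T | head    # the check mail-in-a-box runs; it fails with the above error if /run/sshd is missing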

More crowdsec info:

If you're using it in containers on a parent server and forwarding ports, make sure you enable the FORWARD chain in crowdsec-firewall-bouncer.yaml; otherwise bouncer blocks won't affect your forwarded traffic.
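
That is, something like this in crowdsec-firewall-bouncer.yaml (a sketch from memory - adjust the chain list to suit your setup and bouncer version):

iptables_chains:
  - INPUT
  - FORWARD    # needed so blocks also apply to traffic forwarded into the containers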

It's under active development so make sure you install the latest version.

Once you've learnt the cli tool it's pretty good; I prefer it over fail2ban now. If you like to use fail2ban's recidive jail to permanently block IPs, crowdsec has the advantage that you get a huge list of blacklisted community IPs out of the box.

The only issue I found was that when an ssh rule breach occurred, there appeared to be an internal loop trying to process it, which made 300,000 attempts. I raised it as an issue and nothing was done except a 'try the latest version' response, even though the loop code hadn't changed. I haven't checked back to see if the upgrade made a difference.
