Richard's picture

Please can you make the following change for the next release?

 

Add KillMode=process to the [Service] section of /lib/systemd/system/ssh@.service

 

Taken from here

Jeremy Davis's picture

I just tested a number of TurnKey servers (both as LXC CT and KVM VM) and I cannot recreate the issue you are noting?!

Here are the steps I took (from a Debian Bullseye desktop - to a fairly fresh v16.1 WordPress LXC CT running on PVE):

  • Open terminal application (on Desktop).
  • Log into (v16.1 CT) server via SSH.
  • Start screen session on server.
  • Close terminal application on desktop (thus severing the SSH connection, without closing the screen session).
  • Open a new terminal window on the desktop and start a new SSH session into the (v16.1 CT) server.
  • Confirm existing screen session still exists. It does!?

Digging a little deeper: whilst I have the ssh@.service file you noted, it's not currently in use for me?! So I'm at a bit of a loss as to why your system seems so different to the ones I've looked at. You mentioned in your other recent post that you are running as an LXC container on Proxmox. OTTOMH, the only "simple" difference that might explain such significant differences between our servers (beyond you doing lots of customisation) is perhaps if you are running it as a privileged container?

If you can recreate these issues from a fresh TurnKey instance, please share how you do that. If I can recreate the issue, I can almost certainly fix it (or at least provide a solid workaround).

Richard's picture

Hi Jeremy,

Here are my steps on a TKL Core 16.1 container deployed from scratch on Proxmox 7.1-4 (roughly sketched as shell commands after the list):

  • Start top in a local console to the container in Proxmox
  • Open terminal application (on Desktop)
  • Log into (v16.1 CT) server via SSH from pve as root using publickey
  • Start screen session on server
  • Ctrl-A, d    - to detach
  • exit
  • Watch sshd & screen both killed in top
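
In rough shell terms (just a sketch; <CTID> and <container-ip> are placeholders for your own values):

# on the PVE host: open a local console into the container and run top
pct console <CTID>
top
# in a second terminal on the PVE host:
ssh root@<container-ip>
screen
# Ctrl-a d to detach the screen session, then:
exit
# back in the first console, watch sshd and screen both disappear from top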

The above change fixed the behaviour. How can you tell if your system is using the systemd .socket or .service? The article talked about swapping from socket to service but didn't say how, so I went with the easy option and it worked.

Unprivileged container = Yes

Just tried it on a freshly deployed turnkey-core 16.1-1 container and got the same result.

Called testTKL, oh how I smiled :-)

Richard's picture

Ctrl-a, d (not uppercase)

Jeremy Davis's picture

This is a super weird one!

I definitely cannot reproduce this at all!

I had left the container running the other day when I first responded to you. I logged in via pct (from the Proxmox host) and SSH, and my 2 screen sessions are still running from when I was testing the other day. My laptop has gone to sleep (at least once every 24 hours) within that time frame, so even if I hadn't exited out of any remaining SSH sessions, all connections would have been closed. Regardless, I double checked for any open SSH sessions (within the pct terminal) to be sure. There weren't any.

I logged in via SSH (in another terminal) and again confirmed both of my screen sessions. It was interesting to note that a new 'sshd' process did start when I logged in (for a total of 2). That new 'sshd' process exited once I exited the SSH session. I tried opening 2 separate SSH sessions and there were 3 'sshd' processes running. So it seems additional processes are still used by SSH, even if not using the socket/one instance per connection config. I double checked and my system is definitely not using the socket.
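
(For reference, a quick way to check which of the two is handling connections - assuming a stock systemd setup - is something like the following; whichever reports 'active' is the one in use:)

systemctl is-enabled ssh.service ssh.socket
systemctl is-active ssh.service ssh.socket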

Bottom line, the change you are suggesting will only make a difference if other config is changed. Unfortunately, I have no idea how to move to the socket config and I'm not clear if there is any real advantage.

Also, unless you understand the consequences of what you are doing, editing files in /usr/lib (or /usr in general, with the exception of /usr/local) is generally frowned upon. That is because those file trees are generally managed by package management. Creating new files that don't already exist is rarely going to be an issue (although most apps that accept modifications to data in /usr usually have "safe" places for changes in /etc). The main issue is that future apt updates will nuke your changes to existing files.

For the specific case of systemd service files, the right way is to either create a complete replacement service file, or add an "override" snippet with just the additions/changes.

To create a full replacement service file, put it in the same subpath as the file you're overriding, but under /etc. E.g. to override /lib/systemd/system/ssh@.service, put your new service file at /etc/systemd/system/ssh@.service.
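
A minimal sketch of that (assuming root and stock Debian paths):

cp /lib/systemd/system/ssh@.service /etc/systemd/system/ssh@.service
# edit the copy in /etc as needed, then reload systemd so it's picked up:
systemctl daemon-reload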

To just add a setting as you have, you can use an override snippet instead. Do that by creating a directory in /etc with the same name as the service, but with '.d' on the end. E.g.: /etc/systemd/system/ssh@.service.d/. Then create a file 'override.conf' inside that, containing just the section heading and the setting(s) you are adding. E.g. for your case, you would have a file "/etc/systemd/system/ssh@.service.d/override.conf" and its contents would be:

[Service]
KillMode=process

Both options have pros and cons, although for a single addition such as you are noting here, I'd suggest that this latter method is cleaner.
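
A rough sketch of doing that from the shell (assuming root) would be something like:

mkdir -p /etc/systemd/system/ssh@.service.d
cat > /etc/systemd/system/ssh@.service.d/override.conf <<EOF
[Service]
KillMode=process
EOF
systemctl daemon-reload

('systemctl edit ssh@.service' does essentially the same thing, opening the override file in an editor for you.)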

Richard's picture

I think the thing we need to establish is why my containers are using the socket and yours appear to be using the service. Is there some sort of decision or set of conditions? I honestly haven't messed with anything and don't have time to figure it out right now, sorry.

Jeremy Davis's picture

Yes, I'd be very interested to understand how that happened too!

It may not explain it, but could you please share the output of the following:

systemctl status ssh.service
systemctl status ssh.socket

Thinking some more, it's not default Debian behaviour, nor how we configure it. It's definitely not like that on any of our other builds (I've double checked and it's not like that on a full clean install from ISO, not even in the extracted ISO squashfs filesystem). I've also had a quick grep through our ISO to LXC conversion tool (all our builds start as ISO, then we use that base for all our other builds). As you can see, that doesn't touch anything to do with the ssh service config.

Perhaps there's something that I'm missing, but I feel fairly confident that it's something that's happening after you've downloaded the template. As you are 100% sure that it's nothing you've done AND you've recreated the issue locally, the only thing that makes any sense to me is that it's Proxmox doing it (at launch time I assume?). Or perhaps there is some specific config and/or issue on your server that forces SSH service to fail and it tries to fall back on the socket? As you are likely aware, LXC leverages the host system, so unlike most other virtual platforms the host config can impact the container. (Although this specific scenario makes no sense to me).

Actually, another possibility that does occur to me is that my Proxmox was originally installed when it was v1.x (about 10 years ago). Since then I've done Debian-style "in place" upgrades, plus I'm still only running v6.x (we've just started our v17.0 release and I rely on my PVE server during development). I will update it, but only after I've done the initial v17.0 LXC builds (to confirm they work OK on v6.x, as well as v7.x). So there is a possibility that my install has some old cruft that affects the behaviour, and/or that PVE v7.x does some additional container pre/post launch configuration that changes the behaviour. I don't really understand why they might do that, and after looking through the release notes for v7.0beta1, v7.0 & v7.1 there doesn't seem to be anything explicit that might be a cause...

Maybe it's worth asking about it on the Proxmox forums? Perhaps the installation and configuration subforum? (the other PVE subforum is networking and firewall which doesn't seem appropriate).

FWIW we have lots of Proxmox users, and this has never been reported elsewhere. Perhaps very few of them use screen, tmux or similar? Or perhaps there is something specific to your setup? I've also spent a fair bit of time searching to see if I could find anyone else reporting this; unfortunately I couldn't. Obviously that doesn't mean that it's just you experiencing this, but it does suggest that it's not that common. OTOH it is worth noting that there is precedent for people experiencing TurnKey PVE LXC guest issues that I can't reproduce. So perhaps this is another of those...?

Ultimately though, the (spirit of the) change you suggest doesn't have any significant impact. We try to avoid changing Debian defaults as much as possible, beyond explicit security hardening and/or bugfixes. But even though I can't recreate it, making this change isn't really that big a deal. So perhaps I might consider it. I've opened a new issue on our tracker so it doesn't get forgotten.

Final word though, if I haven't already mentioned it (and apologies if I have) you shouldn't edit the files in /lib/systemd. They will be overwritten on package update. Instead use 'systemctl edit' (or manually create the override file structure in /etc/systemd). E.g.:

systemctl edit ssh@.service

Richard's picture

The plot thickens!

Here's the output you asked for:

root@host /home/erp# systemctl status ssh.service
* ssh.service - OpenBSD Secure Shell server
   Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:sshd(8)
           man:sshd_config(5)
root@host /home/erp# systemctl status ssh.socket 
* ssh.socket - OpenBSD Secure Shell server socket
   Loaded: loaded (/lib/systemd/system/ssh.socket; enabled; vendor preset: enabled)
   Active: active (listening) since Tue 2022-04-12 01:17:48 BST; 1 weeks 3 days ago
   Listen: [::]:22 (Stream)
 Accepted: 47; Connected: 1; Refused: 1
    Tasks: 0 (limit: 9426)
   Memory: 0B
      CPU: 0
   CGroup: /system.slice/ssh.socket
Apr 12 01:17:48 host systemd[1]: Listening on OpenBSD Secure Shell server socket.

The odd thing is I can't reproduce it on my other containers and the same commands yield very different results! The only thing unusual about this container is that it was created by restoring a backup of one of my other containers and modifying the IP, hostname, etc. in Proxmox. Here's the output of the one that was cloned:

root@shared1 /home/erp# systemctl status ssh.service
* ssh.service - OpenBSD Secure Shell server
   Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2022-04-12 06:18:24 BST; 1 weeks 3 days ago
     Docs: man:sshd(8)
           man:sshd_config(5)
 Main PID: 442028 (sshd)
    Tasks: 9 (limit: 9426)
   Memory: 25.0M
      CPU: 2.548s
   CGroup: /system.slice/ssh.service
           |-442028 /usr/sbin/sshd -D
           |-498772 sshd: erp [priv]
           |-498792 sshd: erp@pts/2
           |-498793 -bash
           |-498804 sudo su
           |-498805 su
           |-498806 bash
           |-498815 systemctl status ssh.service
           `-498816 less -X -R -F
Apr 22 20:44:17 shared1 sshd[498717]: Received disconnect from 10.0.0.1 port 44954:11: disconnected by user
Apr 22 20:44:17 shared1 sshd[498717]: Disconnected from user erp 10.0.0.1 port 44954
Apr 22 20:44:17 shared1 sshd[498698]: pam_unix(sshd:session): session closed for user erp
Apr 22 20:53:32 shared1 sshd[498772]: Accepted publickey for erp from 10.0.0.1 port 44956 ssh2: ED25519 SHA256:O6EnwdTuWtvFY
Apr 22 20:53:32 shared1 sshd[498772]: pam_unix(sshd:session): session opened for user erp by (uid=0)
Apr 22 20:54:05 shared1 sudo[498804]:      erp : TTY=console ; PWD=/home/erp ; USER=root ; COMMAND=/usr/bin/su
Apr 22 20:54:05 shared1 sudo[498804]: pam_unix(sudo:session): session opened for user root by LOGIN(uid=0)
Apr 22 20:54:05 shared1 su[498805]: (to root) erp on pts/2
Apr 22 20:54:05 shared1 su[498805]: pam_limits(su:session): Could not set limit for 'core' to soft=0, hard=-1: Operation not permitted; uid=0,euid=0
Apr 22 20:54:05 shared1 su[498805]: pam_unix(su:session): session opened for user root by erp(uid=0)
root@shared1 /home/erp# systemctl status ssh.socket
* ssh.socket - OpenBSD Secure Shell server socket
   Loaded: loaded (/lib/systemd/system/ssh.socket; enabled; vendor preset: enabled)
   Active: inactive (dead) since Mon 2022-04-11 04:58:57 BST; 1 weeks 4 days ago
   Listen: [::]:22 (Stream)
 Accepted: 15; Connected: 0;
Feb 28 22:48:46 shared1 systemd[1]: Listening on OpenBSD Secure Shell server socket.
Apr 11 04:58:57 shared1 systemd[1]: ssh.socket: Succeeded.
Apr 11 04:58:57 shared1 systemd[1]: Closed OpenBSD Secure Shell server socket.

Maybe I should try a restart?

I have only just started using screen so I just assumed all my containers would be the same. I guess to assume really does make an ass out of u and me!

Jeremy Davis's picture

Even the CT you have that is using ssh.service (not ssh.socket) has different ssh.socket config to mine:

Here's yours:

* ssh.socket - OpenBSD Secure Shell server socket
   Loaded: loaded (/lib/systemd/system/ssh.socket; enabled; vendor preset: enabled)
   Active: inactive (dead) since Mon 2022-04-11 04:58:57 BST; 1 weeks 4 days ago

And here's mine:

* ssh.socket - OpenBSD Secure Shell server socket
   Loaded: loaded (/lib/systemd/system/ssh.socket; disabled; vendor preset: enabled)
   Active: inactive (dead)

So yours is enabled and has been running (but has since been stopped or killed). Mine is disabled and hasn't ever run.

Richard's picture

Like this issue, I think this is caused by the sshd service crashing/stopping.

I think it's a fair assumption that if the service is stopped, the socket-based ssh session will die on logout and take screen with it.

Whereas if the service is running, processes are persistent.

Now I just need to figure out why my container's sshd has a tendency to crash.

Jeremy Davis's picture

I've never struck that scenario, but I can see the value of it falling back to the socket if the service fails. Although it shouldn't fail in the first place, really!

So long as you can get access via an alternate method to restart the crashed ssh daemon (e.g. 'pct enter VMID', where VMID is your container's ID number), you could disable the ssh.socket. You'll then know straight away if ssh fails again and be able to check what happened to the ssh service more easily.

systemctl disable --now ssh.socket

I suggest first checking the ssh.service journal:

journalctl -u ssh.service

It might also be useful to view other journal messages from around the same time. You can view the full journal for the last hour like this:

journalctl --since "1 hour ago"

You can also use '--since' and/or '--until' with explicit timestamps to limit the results. The format is "YYYY-MM-DD HH:MM:SS", e.g. "2022-07-12 23:15:00".
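
For example, to see ssh.service messages within a specific (hypothetical) time window:

journalctl -u ssh.service --since "2022-04-12 00:00:00" --until "2022-04-12 02:00:00"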

Feel free to share any logs, etc if you want a hand trying to work it out.
