Buyasta's picture

I'm setting up a small Proxmox cluster and had intended to deploy several TKL appliances on it, but I've run into an issue with Webmin.

I've done some troubleshooting, and narrowed the issue down to systemd, presumably the unit file.

If launched by systemd, it'll run for a few seconds before dying, then relaunch, then rinse and repeat.

If executed manually (/usr/share/webmin/miniserv.pl /etc/webmin/miniserv.conf), it'll run just fine and continue to work.

Almost all of my testing was with TKL Core, but I did also test the LAMP template to make sure it wasn't isolated to Core.

I fired up a full VM running Core, and Webmin worked perfectly there, so it appears to be specific to LXC containers.

I'm using unprivileged containers - I quickly tested it in a privileged container, and it was broken there as well (albeit in a different manner), but from what I gather, v16 appliances shouldn't work properly in privileged containers anyway.

Unfortunately there was zero helpful information in either the webmin logs or syslog, however it is trivial to replicate, so you shouldn't have much trouble gathering further info if needed.

 

root@lamptest ~# ps aux | grep miniserv
root        4110  0.0  0.1   3080   664 pts/0    S+   01:24   0:00 grep --color=auto miniserv
root@lamptest ~# ps aux | grep miniserv
root        4116  0.0  2.4  22508 12788 ?        Ss   01:24   0:00 /usr/bin/perl /usr/share/webmin/miniserv.pl /etc/webmin/miniserv.conf
root        4118  0.0  0.1   3212   736 pts/0    S+   01:24   0:00 grep --color=auto miniserv
root@lamptest ~# ps aux | grep miniserv
root        4116  0.0  2.4  22508 12788 ?        Ss   01:24   0:00 /usr/bin/perl /usr/share/webmin/miniserv.pl /etc/webmin/miniserv.conf
root        4125  0.0  0.1   3212   680 pts/0    S+   01:24   0:00 grep --color=auto miniserv
root@lamptest ~# ps aux | grep miniserv
root        4116  0.0  2.4  22508 12788 ?        Ss   01:24   0:00 /usr/bin/perl /usr/share/webmin/miniserv.pl /etc/webmin/miniserv.conf
root        4132  0.0  0.1   3212   740 pts/0    S+   01:24   0:00 grep --color=auto miniserv
root@lamptest ~# ps aux | grep miniserv
root        4116  0.0  2.4  22508 12788 ?        Ss   01:24   0:00 /usr/bin/perl /usr/share/webmin/miniserv.pl /etc/webmin/miniserv.conf
root        4139  0.0  0.1   3212   732 pts/0    S+   01:24   0:00 grep --color=auto miniserv
root@lamptest ~# ps aux | grep miniserv
root        4116  0.0  2.4  22508 12788 ?        Ss   01:24   0:00 /usr/bin/perl /usr/share/webmin/miniserv.pl /etc/webmin/miniserv.conf
root        4146  0.0  0.1   3212   700 pts/0    S+   01:24   0:00 grep --color=auto miniserv
root@lamptest ~# ps aux | grep miniserv
root        4116  0.0  2.4  22508 12788 ?        Ss   01:24   0:00 /usr/bin/perl /usr/share/webmin/miniserv.pl /etc/webmin/miniserv.conf
root        4153  0.0  0.1   3212   736 pts/0    S+   01:24   0:00 grep --color=auto miniserv
root@lamptest ~# ps aux | grep miniserv
root        4161  0.0  0.1   3080   728 pts/0    S+   01:24   0:00 grep --color=auto miniserv
root@lamptest ~# ps aux | grep miniserv
root        4168  0.0  0.1   3080   732 pts/0    S+   01:24   0:00 grep --color=auto miniserv
root@lamptest ~# ps aux | grep miniserv
root        4175  0.0  0.1   3080   732 pts/0    S+   01:24   0:00 grep --color=auto miniserv
root@lamptest ~# ps aux | grep miniserv
root        4181  0.0  1.4  14572  7408 ?        Rs   01:24   0:00 /usr/bin/perl /usr/share/webmin/miniserv.pl /etc/webmin/miniserv.conf
root        4183  0.0  0.1   3080   740 pts/0    R+   01:24   0:00 grep --color=auto miniserv
root@lamptest ~# ps aux | grep miniserv
root        4181  0.0  2.4  22516 12788 ?        Ss   01:24   0:00 /usr/bin/perl /usr/share/webmin/miniserv.pl /etc/webmin/miniserv.conf
root        4190  0.0  0.1   3212   736 pts/0    S+   01:24   0:00 grep --color=auto miniserv
root@lamptest ~# ps aux | grep miniserv
root        4181  0.0  2.4  22516 12788 ?        Ss   01:24   0:00 /usr/bin/perl /usr/share/webmin/miniserv.pl /etc/webmin/miniserv.conf
root        4197  0.0  0.1   3212   700 pts/0    S+   01:24   0:00 grep --color=auto miniserv
root@lamptest ~# ps aux | grep miniserv
root        4181  0.0  2.4  22516 12788 ?        Ss   01:24   0:00 /usr/bin/perl /usr/share/webmin/miniserv.pl /etc/webmin/miniserv.conf
root        4204  0.0  0.1   3212   664 pts/0    S+   01:24   0:00 grep --color=auto miniserv
root@lamptest ~# ps aux | grep miniserv
root        4181  0.0  2.4  22516 12788 ?        Ss   01:24   0:00 /usr/bin/perl /usr/share/webmin/miniserv.pl /etc/webmin/miniserv.conf
root        4211  0.0  0.1   3212   700 pts/0    S+   01:24   0:00 grep --color=auto miniserv
root@lamptest ~# ps aux | grep miniserv
root        4181  0.0  2.4  22516 12788 ?        Ss   01:24   0:00 /usr/bin/perl /usr/share/webmin/miniserv.pl /etc/webmin/miniserv.conf
root        4218  0.0  0.1   3212   664 pts/0    S+   01:24   0:00 grep --color=auto miniserv
root@lamptest ~# ps aux | grep miniserv
root        4181  0.0  4.1  28940 21792 ?        Rs   01:24   0:00 /usr/bin/perl /usr/share/webmin/miniserv.pl /etc/webmin/miniserv.conf
root        4225  0.0  0.1   3080   664 pts/0    R+   01:24   0:00 grep --color=auto miniserv
root@lamptest ~# ps aux | grep miniserv
root        4233  0.0  0.1   3080   664 pts/0    S+   01:24   0:00 grep --color=auto miniserv

 

Jeremy Davis's picture

Thanks for your report. I use LXC lots myself, but I don't use Webmin. I did the Webmin service development and testing on a Core VM. Although, I did test on my local Proxmox too. I just tested it again on my Proxmox server and Core at least seemed to work ok?!

However, I'm still using Proxmox v5.4.x (I know I need to upgrade...). I assume that you are on v6.x? So perhaps the differences in the host have an impact?

Also, after your prompting, I looked at the webmin entries from the journal (i.e. 'journalctl -u webmin'). I notice that whilst it seems to be running fine now (I just logged in and browsed around), it did restart numerous times (5 to be precise), which I have no explanation for?! See here:

-- Logs begin at Fri 2020-06-12 04:34:54 UTC, end at Fri 2020-06-12 04:39:34 UTC. --
Jun 12 04:34:55 jed-test-core systemd[1]: Starting Webmin Web based Admin UI...
Jun 12 04:34:56 jed-test-core systemd[1]: Started Webmin Web based Admin UI.
Jun 12 04:34:57 jed-test-core perl[236]: pam_unix(webmin:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost=  user=root
Jun 12 04:34:59 jed-test-core webmin[236]: Webmin starting
Jun 12 04:35:00 jed-test-core systemd[1]: webmin.service: Succeeded.
Jun 12 04:35:01 jed-test-core systemd[1]: webmin.service: Service RestartSec=1s expired, scheduling restart.
Jun 12 04:35:01 jed-test-core systemd[1]: webmin.service: Scheduled restart job, restart counter is at 1.
Jun 12 04:35:01 jed-test-core systemd[1]: Stopped Webmin Web based Admin UI.
Jun 12 04:35:01 jed-test-core systemd[1]: Starting Webmin Web based Admin UI...
Jun 12 04:35:01 jed-test-core systemd[1]: Started Webmin Web based Admin UI.
Jun 12 04:35:01 jed-test-core perl[446]: pam_unix(webmin:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost=  user=root
Jun 12 04:35:03 jed-test-core webmin[446]: Webmin starting
Jun 12 04:35:03 jed-test-core systemd[1]: webmin.service: Succeeded.
Jun 12 04:35:04 jed-test-core systemd[1]: webmin.service: Service RestartSec=1s expired, scheduling restart.
Jun 12 04:35:04 jed-test-core systemd[1]: webmin.service: Scheduled restart job, restart counter is at 2.
Jun 12 04:35:04 jed-test-core systemd[1]: Stopped Webmin Web based Admin UI.
Jun 12 04:35:04 jed-test-core systemd[1]: Starting Webmin Web based Admin UI...
Jun 12 04:35:04 jed-test-core systemd[1]: Started Webmin Web based Admin UI.
Jun 12 04:35:04 jed-test-core perl[448]: pam_unix(webmin:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost=  user=root
Jun 12 04:35:06 jed-test-core webmin[448]: Webmin starting
Jun 12 04:35:07 jed-test-core systemd[1]: webmin.service: Succeeded.
Jun 12 04:35:08 jed-test-core systemd[1]: webmin.service: Service RestartSec=1s expired, scheduling restart.
Jun 12 04:35:08 jed-test-core systemd[1]: webmin.service: Scheduled restart job, restart counter is at 3.
Jun 12 04:35:08 jed-test-core systemd[1]: Stopped Webmin Web based Admin UI.
Jun 12 04:35:08 jed-test-core systemd[1]: Starting Webmin Web based Admin UI...
Jun 12 04:35:08 jed-test-core systemd[1]: Started Webmin Web based Admin UI.
Jun 12 04:35:08 jed-test-core perl[450]: pam_unix(webmin:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost=  user=root
Jun 12 04:35:10 jed-test-core webmin[450]: Webmin starting
Jun 12 04:35:14 jed-test-core systemd[1]: Stopping Webmin Web based Admin UI...
Jun 12 04:35:14 jed-test-core systemd[1]: webmin.service: Main process exited, code=killed, status=2/INT
Jun 12 04:35:14 jed-test-core systemd[1]: webmin.service: Succeeded.
Jun 12 04:35:14 jed-test-core systemd[1]: Stopped Webmin Web based Admin UI.
Jun 12 04:35:14 jed-test-core systemd[1]: Starting Webmin Web based Admin UI...
Jun 12 04:35:14 jed-test-core systemd[1]: Started Webmin Web based Admin UI.
Jun 12 04:35:14 jed-test-core perl[1426]: pam_unix(webmin:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost=  user=root
Jun 12 04:35:15 jed-test-core webmin[1426]: Webmin starting
Jun 12 04:35:16 jed-test-core systemd[1]: webmin.service: Succeeded.
Jun 12 04:35:17 jed-test-core systemd[1]: webmin.service: Service RestartSec=1s expired, scheduling restart.
Jun 12 04:35:17 jed-test-core systemd[1]: webmin.service: Scheduled restart job, restart counter is at 1.
Jun 12 04:35:17 jed-test-core systemd[1]: Stopped Webmin Web based Admin UI.
Jun 12 04:35:17 jed-test-core systemd[1]: Starting Webmin Web based Admin UI...
Jun 12 04:35:17 jed-test-core systemd[1]: Started Webmin Web based Admin UI.
Jun 12 04:35:17 jed-test-core perl[2292]: pam_unix(webmin:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost=  user=root
Jun 12 04:35:19 jed-test-core webmin[2292]: Webmin starting
Jun 12 04:37:38 jed-test-core webmin[3870]: Non-existent login as admin from 127.0.0.1
Jun 12 04:37:45 jed-test-core perl[3869]: pam_unix(webmin:session): session opened for user root by (uid=0)
Jun 12 04:37:46 jed-test-core webmin[3869]: Successful login as root from 127.0.0.1
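
For anyone wanting to pull the restart count out of a journal dump like the one above without counting by eye, here is a rough sketch. It assumes the unit's Description is "Webmin Web based Admin UI" as in this log; the function names are made up for illustration.

```shell
#!/bin/sh
# Count how many times systemd began starting the unit, reading a journal
# dump from stdin. Matches the "Starting <Description>..." lines only.
count_starts() {
    grep -c 'Starting Webmin Web based Admin UI'
}

# Restarts = starts minus the initial one (never below zero).
count_restarts() {
    # grep -c exits non-zero when the count is 0, hence the "|| true"
    starts=$(count_starts || true)
    if [ "$starts" -gt 0 ]; then
        echo $((starts - 1))
    else
        echo 0
    fi
}
```

Usage would be something like "journalctl -u webmin --no-pager | count_restarts". Fed the log above, it should report 5, since there are six "Starting" lines.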

It should restart at least once (when the SSL certs are generated) and it's not unreasonable for it to restart a second time, but 5 seems excessive. But it's weird that it eventually settles down... I'm not sure if it is a factor, but FTR my container has 2 cores (@2.4GHz) and 2GB RAM allocated.

We supply the systemd service file for Webmin in v16.0. Previously we provided an old style SysVInit script (i.e. a /etc/init.d/ script). systemd has a backwards compatible generator mode which will generate a service file from an init.d script on the fly. That works fine, but I thought it'd be neater to actually provide a proper service file.

Part of my rationale was that I split the stunnel config up so the webmin stunnel service is now completely separate to the webshell (aka shellinabox) stunnel service. In theory it should work fine with the dynamically generated service file, but my experience was that wasn't the case. FWIW the rationale behind splitting the stunnel services is that it would make it easier to disable one, without affecting the other. I haven't got there yet, but I hope to add a confconsole plugin to disable/enable these. But I digress...

Anyway, at some point during development, I did notice that at times, Webmin wouldn't be running at boot (TBH, I'm still not sure why that was/is) but restarting it worked fine. I did put a little time into trying to work out why it wasn't always running at boot, but couldn't see why, hence why I added the auto restart bit to the service file. That seemed to resolve it ok, but it seems not under LXC?!

I wonder why the behaviour is so different under LXC? FWIW, you can see that the service calls "/usr/share/webmin/miniserv.pl /etc/webmin/miniserv.conf" - the same command that you note works fine outside of a service?!

I don't recall exactly why I changed it, but the service file I initially created (last year) was quite different to what we have now. So perhaps that's worth trying? Another option would be to delete the current webmin.service file altogether and let it revert to the old behaviour of generating one. Although I'm not sure how that might impact the stunnel service?

Actually, another thought is that perhaps the stunnel service is a factor in all this. IIRC there is some dependency relationship between the stunnel service (stunnel4@webmin.service) and Webmin (and webshell too, with its associated stunnel service).

I'd be really interested to hear how you go with your investigations and would warmly welcome a better set up. I'd be more than happy to include it in the package, or if it's only relevant to LXC, then just implement it for LXC builds.

Also FWIW, the systemd docs are pretty reasonable (although things don't always seem to work as I expected and I did lots of trial and error). StackExchange sites are also pretty good IMO.

You can find the webmin.service file here: /usr/lib/systemd/system/webmin.service and if you modify it, before you can test it, you need to run daemon-reload. E.g.:

systemctl stop webmin
vim /usr/lib/systemd/system/webmin.service
# tweak the webmin service file and save the changes
systemctl daemon-reload
systemctl start webmin
systemctl status webmin
Buyasta's picture

Yeah, I'm really not a big fan of Webmin either, I find it much quicker and easier to just edit the config files. In this case though, I need to provide management access to people with little to no linux sysadmin experience, so a GUI is essential.

I was in a bit of a hurry as I finished that post, an hour or so later I was thinking to myself "sure, trivial to replicate in your sample size of 1". I was working under the assumption it's just an inherent problem with LXC containers, but it's certainly possible that it may be specific to my setup.

As you guessed, I'm running Proxmox 6.2, although I can pretty easily throw 5.4 on one of the servers, or spin up a container on my Arch desktop, for testing other hosts.


Yesterday while I was testing, I'd done a fair bit of messing with the systemd unit file to try to get it working. Initially I'd hoped looking at another unit file for it might be helpful, but the upstream repo for Debian was just wrapping the old init.d scripts, and the Arch repo was wrapping a custom bash script.

I'd mostly been messing with the environment variables, and I'd also noticed the stunnel4 dependency and wondered if that might be part of it, but disabling it made no difference.

When I started the node back up this morning and launched my testing containers to start doing some more digging, I realized webmin was actually running properly on the initial TKL Core container I'd been doing all my testing on. Unfortunately I was unable to replicate that on other containers, and then at some point in my testing, the first one broke again for no apparent reason, so that didn't wind up being useful.

However while I was looking for a way to debug the process a little better, I came across this stack exchange answer that suggested the unit file should be using Type=forking rather than Type=exec - exec isn't actually addressed there, but it's most similar to simple:
https://superuser.com/questions/1274901/systemd-forking-vs-simple/1274913

After that, I tried Type=forking and everything worked nicely, I've also quickly tested it on a fresh container with none of my other changes, and on a VM to make sure it didn't introduce any issues there.

Also just FYI, I'm pretty sure PIDFile is ignored unless Type=forking - with Type=simple or Type=exec, the process shouldn't be forking and systemd already knows its PID, and for a oneshot it doesn't care. It was definitely irrelevant for me - Webmin never actually got far enough through launching to create the pidfile or one of the logfiles.

So yeah, some more testing is no doubt in order to make sure it doesn't introduce issues anywhere else, but as long as nothing else crops up, it looks like a pretty easy fix.

root@tkltest2 ~# cat /etc/systemd/system/multi-user.target.wants/webmin.service 

[Unit]
Description=Webmin Web based Admin UI
Requires=local-fs.target
After=network.target remote-fs.target nss-lookup.target
StartLimitBurst=20
StartLimitIntervalSec=30

[Service]
Type=forking
Environment="PERLLIB=/usr/share/webmin" LANG= PERLIO=
ExecStart=/usr/share/webmin/miniserv.pl /etc/webmin/miniserv.conf
KillSignal=SIGINT
PIDFile=/var/webmin/miniserv.pid
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target

Jeremy Davis's picture

Thanks for the testing and debugging help! That's interesting that forking is proving more reliable. I'm not sure whether I ever tried that or not?! I do recall spending a ton of time testing though... It was ages ago when I did the initial development and unfortunately, I didn't take any detailed notes.

Re use of 'exec' vs 'forking' service type, my understanding is that 'exec' is a more recent addition, hence why it probably didn't get a mention in that post you linked to. My understanding is that they are (or at least should be) mostly the same. The fundamental difference is that 'forking' considers the service started as soon as the executable systemd starts, calls 'fork()'. Whereas 'exec' waits for calls to both 'fork()' & 'execve()'. FWIW, here's the relevant systemd doc page.

Reading through that doc page again, makes me wonder if my choice of 'exec' was related to trying to order stuff for some reason?! Anyway, thanks for sharing and I see that you've opened a PR too, thanks for that as well. I'll aim to do some more testing ASAP and assuming that it works better (or at least doesn't work any worse) I'll rebuild Webmin sometime soon. Having said that, considering that Webmin (upstream) hasn't had an update for a while, it wouldn't surprise me if one comes out soon. So perhaps waiting might be pertinent? Regardless, thanks for your contribution.

It would be great if you could give me an indication of the container specs too though (especially RAM & CPU specs). As it seems likely that it may be some sort of race condition, system specs may be a factor that I need to consider in my testing.

Buyasta's picture

No worries, happy to help. And yeah, I've often found getting things working just right in systemd unit files tricky - much harder than the days of simple init scripts.

I could certainly be wrong, but my understanding is that simple and exec are appropriate for processes that run interactively - they don't fork themselves, so systemd does the fork for them. I noticed when executing webmin manually that it was returning my shell, so it was clearly forking itself, which suggests to me that Type=forking is the right one to be using here, and that using Type=exec is causing 2 forks per launch - one from systemd, one from webmin/miniserv.pl.
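
The point about Webmin "returning my shell" can be illustrated with a toy stand-in. This is only a sketch: fake_daemon is a hypothetical function, not miniserv.pl, and it just mimics the pattern of a launcher that backgrounds a child and immediately returns success.

```shell
#!/bin/sh
# Hypothetical stand-in for a self-daemonizing program (NOT miniserv.pl):
# it starts a background child, then the launcher returns straight away
# with status 0, much like a daemon's parent does after forking.
fake_daemon() {
    sleep 5 >/dev/null 2>&1 &
    child_pid=$!
    return 0
}

fake_daemon
launcher_status=$?
echo "launcher exit status: $launcher_status"

# The child is still alive even though the launcher has already "finished".
if kill -0 "$child_pid" 2>/dev/null; then
    echo "child $child_pid still running"
fi

# Clean up the demo child.
kill "$child_pid" 2>/dev/null
```

With Type=exec or Type=simple, systemd watches the launcher, so a clean exit like this shows up as "webmin.service: Succeeded." in the journal even though the real work carries on in the child. Type=forking tells systemd to expect exactly this pattern.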

My working theory is that it's due to the speed of the restarting. As soon as webmin itself forks, systemd thinks it's dead and launches it again, so you wind up with multiple instances still in the early stage of launching, they fight over control of the port and/or pid/log files (or something else entirely), and they die.

On a faster/more powerful machine, it might launch fast enough that the pid file is created before systemd launches a second instance, which suggests that bumping up the RestartSec a bit might also do the trick. That would let you keep it as exec, although I still think forking is the correct type to be using here.

So yeah, I may be finding it easy to replicate because it's older and slower hardware.
I was just using the default container specs, so 1 core and 512MB of RAM, 8GB disk.
The host system is a 2U Supermicro server circa roughly 2012-2014, running 2x E5-2640 2.5-3.0Ghz 6C/12T CPUs, 128GB of DDR3-1333 Reg ECC, a pair of 120GB SSDs in ZFS RAID-1 for vm/container storage, and 3x6TB HDDs in RAIDz1 for bulk storage.

Tomorrow morning I'll do a bit more testing and see what happens if I bump RestartSec up, and also see what happens as I scale a container up and down a bit, although obviously host frequency and IPC can't be changed.

Buyasta's picture

Ok, I did some more testing this morning. Changing RestartSec did nothing, so I guess it's not down to how quickly it's restarting - even when it was waiting 5 seconds to relaunch, the initial launch was failing.

Scaling up the VM did fix it though - I started with 12C & 8GB, but RAM didn't appear to make any difference, and noticeable differences only occurred between 1-3 cores.
Using 1C & 512MB, it was consistently broken, although once (out of probably 20-25 attempts) it did succeed on boot.
Using 2C & 512MB, it was consistently succeeding on boot, on a restart it'd fail a few times before successfully launching.
Using 3C and 512MB, it would often fail once or twice before successfully launching.

So yeah, adding more cores does fix it, but I suspect probably raw frequency and IPC are the largest factors, so even a single core container on modern hardware likely wouldn't run into this issue.

I did find another issue with the unit file, which also solves the issue on 1C containers - Restart=always is telling it to always restart the process if it's not running, even if it exited 0, ie success. When a process forks, the initial process will generally return success when it exits, and definitely Webmin does.
If you change it to Restart=on-failure, which behaves like always except that a clean/successful exit code or signal doesn't trigger a restart, that fixes the problem.

I'd also probably remove the StartLimitIntervalSec and StartLimitBurst from the unit file, or at least tweak them - you've got it set to stop trying to restart if it exceeds 20 attempts to launch within 30 seconds, but it takes longer than 1.5s for it to try and fail to launch (at least on my hardware, newer stuff maybe not), so it's never actually going to hit that cap even when it's consistently failing to launch.
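
The arithmetic behind that can be sketched quickly. The burst and interval values come from the unit file; the 3 second cycle time is only an assumed figure for illustration, and attempts_in_window is a made-up helper name.

```shell
#!/bin/sh
# How many start attempts fit inside the StartLimitIntervalSec window?
# $1 = window length in ms, $2 = duration of one start/fail/restart cycle in ms
attempts_in_window() {
    echo $(( $1 / $2 ))
}

burst=20            # StartLimitBurst from the unit file
window_ms=30000     # StartLimitIntervalSec=30
cycle_ms=3000       # ASSUMED cycle time, for illustration only

# A cycle has to be shorter than this for the burst limit to be exceeded
echo "cycle must be under $(( window_ms / burst )) ms to exceed the limit"

attempts=$(attempts_in_window "$window_ms" "$cycle_ms")
if [ "$attempts" -gt "$burst" ]; then
    echo "$attempts attempts in the window: limit of $burst is exceeded"
else
    echo "$attempts attempts in the window: limit of $burst is never exceeded"
fi
```

With those numbers only 10 attempts fit into the 30 second window, so the limit never kicks in.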

So yeah, overall my recommendation would be to change Type=exec to Type=forking, Restart=always to Restart=on-failure, and maybe remove or tweak the StartLimit stuff, although that shouldn't really be relevant unless something is pretty borked, so yeah, up to you really. *shrugs*
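
For anyone who wants to try the Type and Restart changes without touching the packaged unit file, a systemd drop-in is one option. A minimal sketch, assuming the unit is named webmin.service; the function name and the scratch directory are made up for illustration, and the StartLimit settings are left alone here.

```shell
#!/bin/sh
# Write the two recommended overrides as a systemd drop-in fragment.
# The target here is a scratch directory; on a real system a drop-in for
# this unit would go in /etc/systemd/system/webmin.service.d/ (assumption:
# an override is preferred to editing the packaged unit file).
write_webmin_override() {
    target_dir=$1
    mkdir -p "$target_dir"
    cat > "$target_dir/override.conf" <<'EOF'
[Service]
# miniserv.pl daemonizes itself, so let systemd follow the forked child
Type=forking
# restart after a crash, but not after a clean exit
Restart=on-failure
EOF
}

scratch=$(mktemp -d)
write_webmin_override "$scratch/webmin.service.d"
cat "$scratch/webmin.service.d/override.conf"
```

Once a file like that is in place under /etc/systemd/system/webmin.service.d/, the usual "systemctl daemon-reload" followed by restarting webmin applies, same as when editing the unit file directly.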

Jeremy Davis's picture

I really wish that I'd taken more incremental notes when I did the dev on this...

Anyway, looking back through the commit history shows that I originally had Restart=on-failure but changed it to Restart=always for some reason?!

Also thanks for the feedback re StartLimitIntervalSec and StartLimitBurst. FWIW those values were somewhat arbitrary. I added them in just to be sure that it didn't give up too early, but didn't wait too long (because using 'Type=exec' is blocking, unlike 'Type=forking'). The rationale was to try restarting no more than 20 times or over 30 seconds, whichever comes first (I figured with RestartSec=1, even if the restart time was up to 2 secs, it should retry ~15-20 times).

FWIW, I tested 'Type=forking' on a VM and that seems to work fine. On a container it also worked fine, but I did start seeing this message in the journal:

systemd[1]: webmin.service: Can't open PID file /var/webmin/miniserv.pid (yet?) after start: No such file or directory

However, it doesn't seem to cause any real issue.

I also noticed that according to the journal, the time to start was much longer (5sec vs 1sec - on this same container) after changing to 'Type=forking'. (FWIW this was a LAMP container with 1 core@2.4GHz, 512MB RAM).

After a bit more reading, I realise my misunderstanding/mistake. Ultimately, it comes from my lack of familiarity with C and a conclusion that I jumped to when reading the systemd docs. When I read that 'Type=exec' waits for "both 'fork()' & 'execve()'" before deeming a service "started", I did do a bit of reading on 'execve()' but just assumed that both were related to the guest process; essentially that a call to 'fork()' was related to the process forking itself. After a little more research and re-reading the systemd docs, it seems that the mention of 'fork()' in relation to 'Type=exec' is systemd calling 'fork()' (& 'execve()'), not the guest process... Doh!

So I agree, let's go with 'Type=forking' and 'Restart=on-failure'. Assuming we go that way, we can possibly get rid of 'StartLimitBurst=20' & 'StartLimitIntervalSec=30'? Any thoughts on that?

Buyasta's picture

Hmm, that message is interesting. If you manually check for the PID file, is it there? Given it has to fork before creating it, I would assume it's probably just a case of it looking for the file a few seconds early - the pid file was definitely being created for me, I just don't know for sure that I ever looked in a container where I hadn't run it manually at least once.
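
A quick way to do that manual check, as a rough sketch. pidfile_alive is a made-up helper name, and the demo runs against a scratch file rather than the real /var/webmin/miniserv.pid.

```shell
#!/bin/sh
# Return 0 if the PID file exists and names a live process, non-zero otherwise.
# Note: kill -0 only works for processes you are allowed to signal, so run it
# as root when checking a root-owned daemon.
pidfile_alive() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    [ -n "$pid" ] || return 1
    kill -0 "$pid" 2>/dev/null
}

# Demo against a scratch file holding this shell's own PID.
demo_pidfile=$(mktemp)
echo "$$" > "$demo_pidfile"
if pidfile_alive "$demo_pidfile"; then
    echo "PID file present and process alive"
fi
rm -f "$demo_pidfile"
```

On the container it would be "pidfile_alive /var/webmin/miniserv.pid", and the PID in the file can be compared with what "systemctl show -p MainPID webmin" reports.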

And yeah, that's also interesting that it's taking longer to start. Admittedly I was never really paying much attention to how quickly, just whether it successfully launched or not, but I hadn't noticed any difference.

I definitely noticed when looking at the systemd.service docs that it's very easy to get confused, because it keeps talking about forking in both the simple and exec sections, without making it clear that systemd itself will be performing the fork.
Once you read the first bit of the forking section, it becomes a little easier to tell, but if I were them, I'd definitely make that a bit clearer - even knowing the difference between the three of them, every time I look at that page, I spend a few seconds thinking I must have it wrong.

I'd probably lean toward just removing StartLimitBurst and StartLimitIntervalSec, but not especially strongly - dropping StartLimitBurst down to 10 or bumping StartLimitIntervalSec up to 40-60 would be fine too.

I was curious how commonly used those options are, based on my Arch desktop, it looks to be pretty rare:

root@heimdall ~ $ find /usr/lib/systemd/ -type f -name '*.service' | wc -l
290

root@heimdall ~ $ grep StartLimit /usr/lib/systemd/* -R
/usr/lib/systemd/system/ceph-crash.service:StartLimitInterval=10min
/usr/lib/systemd/system/ceph-crash.service:StartLimitBurst=10
/usr/lib/systemd/system/ceph-fuse@.service:StartLimitInterval=30min
/usr/lib/systemd/system/ceph-fuse@.service:StartLimitBurst=3
/usr/lib/systemd/system/ceph-mds@.service:StartLimitInterval=30min
/usr/lib/systemd/system/ceph-mds@.service:StartLimitBurst=3
/usr/lib/systemd/system/ceph-mon@.service:StartLimitInterval=30min
/usr/lib/systemd/system/ceph-mon@.service:StartLimitBurst=5
/usr/lib/systemd/system/ceph-osd@.service:StartLimitInterval=30min
/usr/lib/systemd/system/ceph-osd@.service:StartLimitBurst=3
/usr/lib/systemd/system/ceph-radosgw@.service:StartLimitInterval=30s
/usr/lib/systemd/system/ceph-radosgw@.service:StartLimitBurst=5
/usr/lib/systemd/system/ceph-rbd-mirror@.service:StartLimitInterval=30min
/usr/lib/systemd/system/ceph-rbd-mirror@.service:StartLimitBurst=3
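
To get the "how many units use it" figure directly rather than reading it off the grep output, something along these lines should do. It is only a sketch: both function names are made up, and the demo uses a scratch directory with dummy unit files rather than /usr/lib/systemd.

```shell
#!/bin/sh
# Count .service files under a directory.
count_services() {
    find "$1" -type f -name '*.service' | wc -l | tr -d ' '
}

# Count .service files under a directory that mention a given fixed string.
# $1 = directory, $2 = string to look for
count_services_using() {
    find "$1" -type f -name '*.service' -exec grep -lF -- "$2" {} + | wc -l | tr -d ' '
}

# Demo with three dummy unit files, one of which sets a start limit.
demo_dir=$(mktemp -d)
printf '[Unit]\nStartLimitBurst=3\n' > "$demo_dir/a.service"
printf '[Unit]\nDescription=b\n' > "$demo_dir/b.service"
printf '[Unit]\nDescription=c\n' > "$demo_dir/c.service"
echo "$(count_services_using "$demo_dir" StartLimit) of $(count_services "$demo_dir") units use StartLimit"
```

Going by the listing above, the real numbers on that desktop are 7 unit files out of 290.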

 

Jeremy Davis's picture

A bit more reading (and the message itself) suggests that it's likely a(n insignificant) race condition. By the time the process has started, the PID file is there. And systemd seems happy to kill the process and remove the PID file when the service is stopped, so doesn't seem to be a deal breaker. Out of interest, I could only reproduce that behaviour in a container, it doesn't seem to occur in a "proper" VM (even with lots of resources assigned).

Re StartLimitBurst and StartLimitIntervalSec, yeah let's just ditch them! FWIW I don't have any services that use them! :)

Buyasta's picture

Yeah, I figured it was probably just a case of the PID file not having been created yet when it first checks for it - as you said, it's insignificant as long as it doesn't cause it to kill the process.

Awesome, I think we've made some good changes here, thanks a lot for your help! :)

The first couple of Turnkeys I was going to set up were the domain controller and the file server, so depending on how much free time I have, I might have a go at updating those to v16 - if I do, you'll see a couple more pull requests from me.

I can also knock up a brief guide on joining the file server to the domain - I see you've got a fairly empty spot in the docs for that.

Jeremy Davis's picture

So I've made the changes as discussed and pushed to my personal repo. And here is the updated file. TBH, I haven't actually tested it at all yet (it's just the PR you issued plus I've added the further changes that we've discussed; it should work, but there may be typos?!).

Re Domain Controller and Fileserver, that would be awesome! FWIW I have actually started doing some work on the Domain Controller, including a total rewrite of the inithook but I don't recall what state it's in (I was doing that in my own time for another purpose and I've just committed and pushed what I had). I'm not attached to that being completed for v16.0 if you would rather just leverage the existing code. Although, it is worth noting that inithooks is now python3 so the inithooks on both apps will need to be ported to that at the least.

Most of the inithook update work can be handled by 2to3. Other than that, the shebang needs to be updated (to the python3 path). And the only other main thing is that we've deprecated executil and are instead using subprocess. If you need any more guidance, please feel free to ask.

Also much of the fileserver code is in common, as it's used in multiple appliances. Please ask if you need any pointers there.
