Dan

The new LAMP v15 appliance is now in production for me. I've noticed that after it has been running for about 4 or 5 days, logging in to the console pauses for a really long time after I type my password.

 

I'm worried that something is building up that isn't getting cleaned up or cleared out. What should I do to troubleshoot this?

 

Thanks so much for all your detailed and highly informative answers, Jeremy!

 

Jeremy Davis

If it definitely appears to be related to how long the LAMP server has been running, my immediate guess is a memory (RAM) leak somewhere. But I'm only guessing...

Assuming that you mean logging in via SSH, a better way to properly diagnose the issue would be to make the initial SSH connection really verbose. Then you'll see which part of the process is causing the delay. Hopefully that will make the next steps (i.e. what to look at) a bit clearer, rather than just jumping to random conclusions like I often do! :) I'm not sure how you'd do that from Windows (using PuTTY, for example) and I'm not 100% sure for Mac OS X either (although 'ssh --help' might head you in the right direction). But I know that when connecting from Linux, using the '-v' switch with ssh will make it really verbose. E.g.:

ssh -v root@YOUR_SERVER

If you can't work out how to do that from your OS (and you're not running Linux as a desktop), then possibly the easiest way would be to set up a local TurnKey Core VM, SSH into that first, then run ssh -v from there into the problematic server.

Then you should see where the connection is stalling. Hopefully that will give you insight into where the real issue may be.
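With plain `-v` output it can still be hard to tell exactly how long each step takes. A minimal sketch, assuming a POSIX shell on the client (Linux or macOS): prefix every line of ssh's debug output with the local time, so the gap before the stalled step stands out. The function name `timestamp_lines` is hypothetical, and running the remote command `true` (rather than an interactive login) assumes the stall also happens for a non-interactive session; if it doesn't, log in interactively and watch the plain `-v` output instead.

```shell
#!/bin/sh
# Prefix each line read on stdin with the current local time (HH:MM:SS),
# so long pauses between ssh debug messages become visible.
timestamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date +%T)" "$line"
    done
}

# Example (YOUR_SERVER is a placeholder). ssh writes its debug output to
# stderr, hence the 2>&1 so it flows through the pipe:
#   ssh -v root@YOUR_SERVER true 2>&1 | timestamp_lines
```

The function only relies on `read`, `printf` and `date`, so it should behave the same on a Linux desktop, a TurnKey Core VM or macOS.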

Monitoring resources is another angle that may be worth considering. Webmin has a resource usage dashboard which may help, plus all TurnKey servers include Monit. Monit has a web UI, but we disable it by default because our default config is pretty simple and we don't believe the web UI adds a ton of value. Still, it may be worth re-enabling it as an easy way of checking resource usage. To re-enable the web UI, edit the Monit config and uncomment the line `set httpd port 2812 and`, plus one of the `allow` lines (likely `allow admin:monit`). Then restart Monit (e.g. `systemctl restart monit`).

Hope those ideas put you on the right track. Please share what you learn, as perhaps there's something that TurnKey could be doing better/differently to improve the user experience.

Dan

As always, Jeremy, you are top-notch with your information and help.

On OS X (Mac), the -v option works just like on a Linux desktop, and using it I can see that it hangs where marked below.

**** SSH Output Below ****

debug1: Entering interactive session.

debug1: pledge: network

**** HANGS AWHILE ****

debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0

debug1: Sending environment.

**** SSH Output Above ****

Also, rather than a memory leak (which, as you say, would be bad), I think this may be related to the CIFS drive mappings to 2 Windows machines that I am using.

One of them gets turned off at night and isn't turned on again until later in the day, due to its time zone. The Linux server is in Central time at the main warehouse location, where one of the folders is mounted; the other is 2 hours behind, in Pacific time.

Also, these folders have lots of very small files in them. I am working on a way to clean them up, since the many small files are not needed, so that each folder could start clean each day. I just sort of feel that it's hanging because the machine is turned off at night and the Linux server is unhappy about that. When I move about the console with ls while that machine is off, I get a "Host down" message, so something keeps retrying that mount, and that might be what's "hanging" up my system.

I am going to see if I can get them to leave their machine on all the time. Then I am going to clean up those two folders so they are empty at the start of each day, and see if this hanging issue goes away.

The "hanging" doesn't affect anything but the process that's working with those folders, so I think I am on the right track. Other DB and web page access does not appear to be affected, so I think I can rule out the memory leak.

Thanks so much for helping with this Jeremy!

Jeremy Davis

TBH, a memory leak is my "go-to" guess when things get sluggish after a few days of running. But as I said, it was just a guess, and the only way to be sure is to diagnose... :)

Also FWIW, ssh (on Linux at least) allows you to crank up the verbosity by adding extra "v"s (to a max of 3). E.g. ssh -vvv root@REMOTE_SERVER. Although in your case, I doubt that would add much of value.

The fact that it hangs at "debug1: pledge: network" led me to find this Q&A on ServerFault. Judging by the range of answers, it seems that there can be a myriad of causes! Note though that unless you are using something other than the default user management (i.e. default Linux user authentication via PAM), I would advise you NOT to test the first answer! Setting "UsePAM no" in your server's sshd_config file WILL BLOCK SSH LOGIN (via PAM authentication)! So don't do that unless you understand the consequences!

I would urge you not to just try any of those suggestions without first confirming that they are relevant and ensuring that you understand the potential consequences. But they may give you some pointers on what to look out for. Viewing logs such as the syslog, dmesg and/or the systemd journal should give you some more clues.
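A minimal sketch for a first pass over those logs, assuming the CIFS shares are the suspect: kernel messages from the Linux CIFS client are normally tagged with "CIFS", so filtering for that (and "smb") quickly narrows things down. The function name `cifs_log_lines` is hypothetical; run the examples as root on the server.

```shell
#!/bin/sh
# Keep only log lines that mention CIFS or SMB (case-insensitive).
# grep exits 1 when nothing matches; "|| true" keeps that from looking
# like an error to callers.
cifs_log_lines() {
    grep -iE 'cifs|smb' || true
}

# Example sources on the server (run as root):
#   dmesg | cifs_log_lines
#   journalctl -b | cifs_log_lines
#   cifs_log_lines < /var/log/syslog
```

Timestamps on the matching lines can then be compared with the times of the slow logins.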

Your theory about the "missing" network share may indeed be a better guess. As noted above, the syslog (/var/log/syslog), dmesg and/or the (systemd) journal may help confirm that. Also, a fairly easy way to test your hypothesis would be to ensure that the shares are unmounted when the remote server (that hosts the shares) is down. If the "hang" goes away with the remote shares unmounted, then that would be pretty good evidence that it is indeed the issue.
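To run that test you first need to know which CIFS shares are mounted and where. A minimal sketch, assuming the shares use the kernel CIFS client (filesystem type `cifs`, or `smb3` on newer kernels): reading the kernel's mount table avoids touching the share itself, whereas `df` or `ls` on a dead mount can block. The function name and the `/mnt/winshare` path are hypothetical placeholders.

```shell
#!/bin/sh
# Print "source mountpoint" for every CIFS/SMB3 entry in a mount table.
# $1: mount table to read (defaults to the kernel's /proc/mounts).
# Only the table is read; the remote share itself is never accessed.
list_cifs_mounts() {
    awk '$3 == "cifs" || $3 == "smb3" { print $1, $2 }' "${1:-/proc/mounts}"
}

# Example: list the shares, then lazily unmount one while its host is down
# (/mnt/winshare is a placeholder). "umount -l" detaches the mount right
# away rather than waiting on the unreachable server.
#   list_cifs_mounts
#   umount -l /mnt/winshare
```

After unmounting, reconnecting with `ssh -v` and comparing the login time gives a quick yes/no on the hypothesis.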

If you can confirm that the issue is related to the missing shares, then you could set up a cron job to check the availability of the remote machine. If the remote machine is available and the share isn't mounted, mount it. If the remote machine isn't available and the share is mounted, unmount it.
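That cron approach could look roughly like the following. This is a minimal sketch under several assumptions: the share has an /etc/fstab entry (ideally with `noauto`) so that `mount /mnt/winshare` works on its own; a single `ping` is a good enough availability check (a Windows firewall may block ICMP, in which case you'd need a different test); and the host address, mount point and script name are hypothetical placeholders. The mounted check reads /proc/mounts rather than stat'ing the mount point, so it shouldn't block on a dead share, and the decision logic sits in its own function so it can be checked without touching real mounts.

```shell
#!/bin/sh
# Mount or unmount a CIFS share depending on whether its host is up.
# REMOTE_HOST and MOUNTPOINT are hypothetical placeholders; override them
# via the environment or edit them here.
REMOTE_HOST="${REMOTE_HOST:-192.168.1.50}"
MOUNTPOINT="${MOUNTPOINT:-/mnt/winshare}"

# decide_action REACHABLE MOUNTED (each 1 or 0) -> prints mount|unmount|none
decide_action() {
    if [ "$1" -eq 1 ] && [ "$2" -eq 0 ]; then
        echo mount
    elif [ "$1" -eq 0 ] && [ "$2" -eq 1 ]; then
        echo unmount
    else
        echo none
    fi
}

# Prints 1 if a single ping gets a reply within 2 seconds, else 0.
host_reachable() {
    if ping -c 1 -W 2 "$REMOTE_HOST" >/dev/null 2>&1; then echo 1; else echo 0; fi
}

# Prints 1 if $MOUNTPOINT appears in the kernel mount table, else 0.
share_mounted() {
    if awk -v mp="$MOUNTPOINT" '$2 == mp { found = 1 } END { exit !found }' \
        /proc/mounts 2>/dev/null; then echo 1; else echo 0; fi
}

main() {
    case "$(decide_action "$(host_reachable)" "$(share_mounted)")" in
        mount)   mount "$MOUNTPOINT" ;;     # relies on an fstab entry
        unmount) umount -l "$MOUNTPOINT" ;; # lazy: don't block on the dead host
        *)       : ;;                       # nothing to do
    esac
}

main
```

To run it every 5 minutes, you could save it as e.g. /usr/local/sbin/cifs-watch (hypothetical name), make it executable, and add a file under /etc/cron.d containing `*/5 * * * * root /usr/local/sbin/cifs-watch`. For the second share, a second cron line with different `REMOTE_HOST=... MOUNTPOINT=...` values in front of the command would do.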

FWIW, I just did a quick bit of googling and found what appears to be a pretty cool approach. It's quite an old thread on the Ubuntu forums, but if nothing else, it could give you some good ideas. (Seeing as it's over 10 years old, I personally wouldn't just copy-paste it, although it may still work.)

I also found a few other things that may assist; here, here and here.

Dan

Jeremy, just reporting back: it is the network share, for sure. If it's taking a long while to log in, once the login finally completes and I unmount the share, everything is fast again.

Also, if the network share comes back online, everything is fast again.

So the issue is indeed that it must keep trying to reach that share each time an SSH session starts, or even when you move about in the terminal, causing it to pause unless the share is live or I unmount it.

Once again, you are a wealth of knowledge and pointers in the right direction. Thank you so much again for all your help.

P.S. I signed up for the TurnKey Hub and am loving how easy it is to roll out a new box in the AWS cloud. That is SO sweet.

 

Jeremy Davis

TBH, I haven't hit that before and I'm not sure of the mechanism behind it, but thanks for confirming that is the issue.

You're most welcome on the assistance. Thanks for your kind words. And glad to hear that you got to the bottom of it.

As an additional option to help you work around this, perhaps autofs is worth investigating? TBH, I've never used it myself, but at a glance it looks like it may be perfect for your purposes! The man page will probably give some more insight, plus I found a tutorial that might also help. Note the tutorial is a bit dated, but will hopefully be close enough (especially if read alongside the current Stretch man page).
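For orientation, an autofs setup for a CIFS share usually consists of a line in /etc/auto.master pointing at a map file, plus one map entry per share. A minimal sketch, assuming the `autofs` and `cifs-utils` packages are installed and a credentials file already exists; the function name, map file, host and paths are hypothetical, so please check the exact syntax against the auto.master(5) and autofs(5) man pages before relying on it.

```shell
#!/bin/sh
# Print an autofs map entry that mounts //SERVER/SHARE on demand under KEY.
# Usage: autofs_cifs_entry KEY SERVER SHARE CREDENTIALS_FILE
# autofs CIFS map entries conventionally write the location as
# ://server/share (note the leading ':').
autofs_cifs_entry() {
    printf '%s -fstype=cifs,rw,credentials=%s ://%s/%s\n' "$1" "$4" "$2" "$3"
}

# Example (hypothetical names): create the map, reference it from
# /etc/auto.master with a 60-second idle timeout, then restart autofs.
#   autofs_cifs_entry scans winbox scans /root/.smbcredentials > /etc/auto.winshares
#   echo '/mnt/win /etc/auto.winshares --timeout=60' >> /etc/auto.master
#   systemctl restart autofs
```

With that in place, accessing /mnt/win/scans should trigger the mount, and autofs unmounts the share again after it has been idle for the timeout period, so it isn't left mounted overnight while the remote machine is off.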

Glad to hear that you're loving the Hub. Sorry about the hiccup we just had! Hopefully I'll hear back from my colleague Alon once he's investigated a bit more deeply.
