Bjorn Struebing's picture

I have a weird issue and I don't know where to start with it. I am running the TurnKey WordPress VM and it has been solid for almost a year now. However, in the last few months the website has been going down, and the only fix is to reboot the server. I can get to the Webmin page with no problem; it's just the WordPress site that will not come up.

My question is: where do I start troubleshooting this?

Thanks for the help in advance.

 

Bjorn

Jeremy Davis's picture

All the log files will be in /var/log, with the Apache-specific ones in /var/log/apache2. Also, are there any errors showing up when you browse to the WP site? They might give you clues (e.g. if it is a 404, something is probably up with Apache; if it is complaining about a DB connection, then maybe it's MySQL).

One other thing that might be worth a look is whether phpMyAdmin works (https://domain-or-ip:12322), as that runs under Apache too and requires a MySQL connection. That might help you pin it down to an Apache/MySQL issue or a strictly WP issue.
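
A rough way to script that check, as a sketch only: the Python below requests a URL and reports whether anything answered, then turns the two results (WordPress front page and phpMyAdmin on port 12322) into a hint about where to look. The function names are made up for illustration, and certificate checking is switched off on the assumption that the appliance is using a self-signed certificate.

```python
import http.client
import ssl
import urllib.error
import urllib.request


def probe(url, timeout=10):
    """Return the HTTP status code for url, or a short error string."""
    # Assumption: self-signed certificate on the appliance, so skip verification.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, http.client.HTTPException, OSError) as exc:
        return f"no response ({exc})"


def diagnose(site_result, pma_result):
    """Turn the two probe results into a rough hint about where to look."""
    site_up = isinstance(site_result, int) and site_result < 500
    pma_up = isinstance(pma_result, int) and pma_result < 500
    if site_up and pma_up:
        return "both respond: nothing obviously wrong right now"
    if pma_up:
        return "phpMyAdmin responds but WordPress does not: look at WordPress itself"
    if site_up:
        return "WordPress responds but phpMyAdmin does not: check phpMyAdmin"
    return "neither responds: look at Apache and MySQL"
```

With `example.com` standing in for your own domain or IP, running `diagnose(probe("http://example.com/"), probe("https://example.com:12322/"))` while the site is down should show which side is failing.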

OnePressTech's picture

Hi Bjorn,

As Jeremy indicated, you can have a look in the Apache log files, but if nothing jumps out I would suggest the following:

1) Clarify "goes down". Do you get 500s? Do you get no response at all? How long does it last if you wait? How often are you rebooting (daily, weekly, monthly)? Are you on your own hardware appliance (CPU fan failure is a typical culprit) or on a cloud service provider (AWS Micro instance auto-throttle)? Is your disk full? Does the site "go down" for users, for WordPress admin, or both?

2) Enable WordPress debug (http://codex.wordpress.org/Debugging_in_WordPress)

3) Set up a Pingdom account and monitor the site's responsiveness ahead of it "going down". Does performance degrade, or does it just stop?

4) Clone the appliance and see if the problem goes away on the clone appliance.

5) Turn off all plugins on the clone appliance and see if the problem goes away.

NOTE: If you are using an AWS Micro instance, AWS will shut it off completely for 10+ minutes if you breach peak usage. See http://www.turnkeylinux.org/forum/support/20120626/cpu-usage-spikes-100-... for details.

NOTE: If you are using an AWS Micro instance, increase it to a Small instance and see if the problem goes away.
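
For the "is your disk full?" question, a minimal Python sketch along these lines would do. The 90% warning threshold is an arbitrary assumption, and the figure can differ slightly from what `df` reports because of reserved blocks.

```python
import shutil


def disk_report(path="/", warn_percent=90.0):
    """Return (percent_used, is_nearly_full) for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    percent_used = usage.used / usage.total * 100
    return percent_used, percent_used >= warn_percent


percent, nearly_full = disk_report("/")
print(f"/ is {percent:.1f}% used" + (" (nearly full!)" if nearly_full else ""))
```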

 

Cheers,

Tim (Managing Director - OnePressTech)

Bjorn Struebing's picture

Thanks for the info. Let me run through this and see what I can find.

Bjorn Struebing's picture

It took about 10 days, but it finally went down. When I go to the page I get either "This page can't be displayed" in IE or "No data received" in Chrome.

Here is what the Apache log shows from the time Pingdom tells me it went down:

 

[Fri Jul 25 05:42:56 2014] [notice] child pid 28961 exit signal Segmentation fault (11)
[Fri Jul 25 05:57:37 2014] [notice] child pid 28914 exit signal Segmentation fault (11)
[Fri Jul 25 05:57:39 2014] [notice] child pid 26434 exit signal Segmentation fault (11)
[Fri Jul 25 06:06:33 2014] [notice] child pid 28069 exit signal Segmentation fault (11)
[Fri Jul 25 06:10:33 2014] [notice] child pid 29476 exit signal Segmentation fault (11)
[Fri Jul 25 06:55:28 2014] [notice] child pid 29691 exit signal Segmentation fault (11)
[Fri Jul 25 07:58:01 2014] [notice] child pid 30391 exit signal Segmentation fault (11)
[Fri Jul 25 08:57:24 2014] [notice] child pid 29688 exit signal Segmentation fault (11)
[Fri Jul 25 08:58:51 2014] [notice] child pid 32022 exit signal Segmentation fault (11)
[Fri Jul 25 09:52:57 2014] [error] server reached MaxClients setting, consider raising the MaxClients setting
[Fri Jul 25 11:01:28 2014] [notice] child pid 2281 exit signal Segmentation fault (11)
[Fri Jul 25 11:28:46 2014] [notice] child pid 1068 exit signal Segmentation fault (11)
[Fri Jul 25 11:42:33 2014] [notice] child pid 2859 exit signal Segmentation fault (11)
[Fri Jul 25 17:15:49 2014] [notice] child pid 7730 exit signal Segmentation fault (11)
[Fri Jul 25 17:15:50 2014] [notice] child pid 8150 exit signal Segmentation fault (11)
[Fri Jul 25 17:15:52 2014] [notice] child pid 6726 exit signal Segmentation fault (11)
[Fri Jul 25 17:42:51 2014] [notice] child pid 8147 exit signal Segmentation fault (11)
[Fri Jul 25 23:43:08 2014] [notice] child pid 13949 exit signal Segmentation fault (11)
[Sat Jul 26 01:20:53 2014] [notice] child pid 16463 exit signal Segmentation fault (11)
[Sat Jul 26 01:20:54 2014] [notice] child pid 16469 exit signal Segmentation fault (11)
[Sat Jul 26 02:54:29 2014] [notice] child pid 16289 exit signal Segmentation fault (11)
[Sat Jul 26 02:55:52 2014] [notice] child pid 16965 exit signal Segmentation fault (11)
[Sat Jul 26 03:32:58 2014] [notice] child pid 16492 exit signal Segmentation fault (11)
[Sat Jul 26 03:32:59 2014] [notice] child pid 16135 exit signal Segmentation fault (11)
[Sat Jul 26 05:43:09 2014] [notice] child pid 16548 exit signal Segmentation fault (11)
[Sat Jul 26 05:43:33 2014] [notice] child pid 18754 exit signal Segmentation fault (11)
[Sat Jul 26 05:43:44 2014] [notice] child pid 16554 exit signal Segmentation fault (11)
[Sat Jul 26 07:59:09 2014] [notice] child pid 16500 exit signal Segmentation fault (11)
[Sat Jul 26 08:00:52 2014] [notice] child pid 20710 exit signal Segmentation fault (11)
[Sat Jul 26 13:01:28 2014] [notice] child pid 20703 exit signal Segmentation fault (11)
[Sat Jul 26 15:02:53 2014] [notice] child pid 23326 exit signal Segmentation fault (11)
[Sat Jul 26 18:14:00 2014] [notice] child pid 30017 exit signal Segmentation fault (11)

 

 

Any idea what this means?

[Fri Jul 25 09:52:57 2014] [error] server reached MaxClients setting, consider raising the MaxClients setting

I also tried to hit :12322 and that will not load either. So I guess I'm looking at either Apache or MySQL?

Is there any other log I should look at?

Thanks again for the help.

OnePressTech's picture

If it takes 10 days for a failure to occur, it could be a memory leak in one of the plugins, the disk filling up due to excessively verbose logging, or an external DoS. Regarding answers to the questions I asked previously...

1) What ISP are you using?

2) If on AWS, what size instance are you subscribed to (Micro)?

3) What did the Pingdom response log show: declining responsiveness ahead of the lockup, good responsiveness right up to the lockup, or a quick spike in load just before the lockup? The first could indicate a memory leak or zombie process, the second could be an Apache resource load buildup, and the third could indicate a DoS attack.

Cheers,

Tim (Managing Director - OnePressTech)

Bjorn Struebing's picture

1) I'm self-hosting in my own VMware cluster. ISP is AIS.

2) Not using AWS.

3) Pingdom shows two spikes of 1000 to 3000 ms; normal is 850-900 ms. Could I be looking at a DoS?

Jeremy Davis's picture

A DoS is possible, especially considering the "server reached MaxClients setting" error.

But like Tim said, it could be other stuff...

Segfaults in Apache can be caused by a myriad of different things; running out of RAM is one possibility. A flaky PHP module, WP plugin or other PHP issue is a common cause too. Have you installed any new PHP modules or WP plugins, or adjusted any PHP settings, prior to this issue starting?
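
To see whether the segfaults cluster around particular times, something like the following Python sketch could tally them per hour and pick out the MaxClients errors. It assumes the Apache 2.2 style error log lines posted above in this thread; the function name is just for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
# [Fri Jul 25 05:42:56 2014] [notice] child pid 28961 exit signal Segmentation fault (11)
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\] (?P<msg>.*)$")


def summarize(lines):
    """Count segfaults per hour and collect the timestamps of MaxClients errors."""
    segfaults = Counter()
    max_clients = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines in any other format
        when = datetime.strptime(m["ts"], "%a %b %d %H:%M:%S %Y")
        if "Segmentation fault" in m["msg"]:
            segfaults[when.strftime("%Y-%m-%d %H:00")] += 1
        elif "MaxClients" in m["msg"]:
            max_clients.append(when)
    return segfaults, max_clients
```

Feeding it the log, e.g. `summarize(open("/var/log/apache2/error.log"))` (assuming that is where your error log lives), gives the per-hour counts to compare against the Pingdom graphs.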

This blog post looks like it gives a pretty clear explanation of how to narrow down segfaults, which might be useful.

Also did you check any of the things that Tim mentioned? E.g. HDD space?

As for the MaxClients setting, I don't know a lot about it, but a quick google suggests that it relates to the number of concurrent connections that Apache can handle (as you would probably expect from the name). It may be related to the segfaults (e.g. system running out of RAM? DDoS attack?) or it may be coincidental. This answer on StackOverflow explains it quite well (although note that they are discussing a CentOS server, so whilst the theory is the same, the way the config is tweaked is different in Debian). Also, FWIW, here are the official Apache docs on it.
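
For what it's worth, a common rule of thumb (not something from the posts linked above) is to size MaxClients as the RAM left over for Apache divided by the memory used by one Apache child process. Here is a tiny Python sketch of that arithmetic; all of the figures in it are hypothetical and would need to be measured on the actual VM.

```python
def estimate_max_clients(total_ram_mb, reserved_mb, avg_child_mb):
    """Rule-of-thumb MaxClients: RAM left for Apache divided by the size of one child."""
    if avg_child_mb <= 0:
        raise ValueError("avg_child_mb must be positive")
    available = total_ram_mb - reserved_mb
    return max(1, int(available // avg_child_mb))


# Hypothetical figures: 1024 MB VM, 384 MB kept for MySQL and the OS, 40 MB per Apache child.
print(estimate_max_clients(1024, 384, 40))  # prints 16
```

If the configured MaxClients is much higher than an estimate like this, Apache could push the system into swap or out of RAM before it ever reaches the limit.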

Bjorn Struebing's picture

OK, let me look over those posts. Thanks for pointing me in the right direction.

Jeremy Davis's picture

Whilst I am an 'official' TurnKey guy, Tim is a long-term trusted TurnKey user and always has valuable input. He has much more 'real world' experience than I do. My knowledge comes from a lot less experience, a lot more googling and a fair bit of 'best guessing'! :)

Good luck! And keep us posted.

OnePressTech's picture

Thanks for the vote of confidence J ...let's hope our advice helps out in this case :-)

 

Cheers,

Tim (Managing Director - OnePressTech)

OnePressTech's picture

It's up to you how much further diagnosis you want to do.

A 1-3 second response would not be abnormal if not sustained, i.e. not likely a DoS attack. WordPress admin is a bit of a pig. Enough logged-in users at the same time on a low-CPU server would show a temporary high load.

Regarding MaxClients: if you get enough hits in a short period, MaxClients can peak on an Apache server because the default socket timeout is 300 seconds, so a traffic burst, even a legitimate one, would suck up all your sockets and it would take 5 minutes before you could access the server again. A quick load test should verify whether this is the problem: http://performance-test.compuware.com/instant-load-test

My suggestion...

1) Check the Apache access logs to see what the traffic looks like.

2) After the server locks up, wait for 10 minutes and see if it clears once the traffic is reduced.

3) If you are self-hosting, see what external network traffic logs are available so you can see when the traffic peaks and clears.

4) Set up a cron job that restarts the server every night at 2 a.m. and see if the server stays up over the long term. There are times in the technology world when solving the problem is more important than knowing definitively what the problem is.
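
For the access log check, a short Python sketch like this one could show the busiest minutes and the busiest client IPs. It assumes the usual common/combined log format with the client IP first and the timestamp in square brackets; if your log has the vhost prefixed to each line, the regex would need adjusting.

```python
import re
from collections import Counter

# Common/combined log format: client IP first, timestamp in square brackets.
ACCESS_RE = re.compile(r"^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\]")


def traffic_summary(lines, top=5):
    """Return (busiest minutes, busiest client IPs) as lists of (key, hits)."""
    per_minute = Counter()
    per_ip = Counter()
    for line in lines:
        m = ACCESS_RE.match(line)
        if not m:
            continue  # skip lines that do not look like access log entries
        # "25/Jul/2014:09:52:57 +0000" -> "25/Jul/2014:09:52"
        per_minute[m["ts"][:17]] += 1
        per_ip[m["ip"]] += 1
    return per_minute.most_common(top), per_ip.most_common(top)
```

A handful of IPs accounting for most of the hits in the minutes before a lockup would point towards a DoS or an aggressive bot; an even spread would point more towards a legitimate burst.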

Up to you.

Cheers,

Tim (Managing Director - OnePressTech)
