Bill Carney's picture

I'm having a similar issue to https://www.turnkeylinux.org/forum/support/20151116/tklbam-backup-mysql-...

When running tklbam-backup, it gets to the point where it has dumped the very last table, at which point MySQL should carry on operating as normal.  But it doesn't, and tklbam-backup pauses.  If I log in and issue a service mysql restart, then tklbam-backup continues and moves on to the Duplicity stage.

table: wp_votetonya/wp_votetonya_wfStatus
table: wp_votetonya/wp_votetonya_wfThrottleLog
table: wp_votetonya/wp_votetonya_wfVulnScanners
(this is where it pauses until a  service mysql restart is issued)

UNCOMPRESSED BACKUP SIZE: 10.32 GB in 330877 files

Executing Duplicity to backup system changes to encrypted, incremental archives
===============================================================================

# export PASSPHRASE=$(cat /var/lib/tklbam/secret)

I've turned off nightly backups and have to run them manually since I have to be sitting on the machine to restart the MySQL server for my websites to be available. 

Any ideas on what's causing this?  I've restored from a backup to a new instance and the error carries over to that one as well.  Running on an m4.large instance so I wouldn't think it's RAM related.

 

Jeremy Davis's picture

Apologies on the slow response. I've been laser focussed on getting the RC (release candidate) of v15.0 ready for public consumption. It's been a long time coming, but it's really close now...!

Anyway, first thing, that link you posted gives me a 404?! I wonder if it's something to do with the website update?

Regarding your issue, that does sound a bit weird. I suggest that you check what the RAM and CPU are doing when it "hangs". I'm totally guessing, but I'm wondering if perhaps you are running out of RAM?

top is probably a good option to check the CPU/RAM usage (and what's using it). If you log in with 2 SSH sessions, and run TKLBAM in one and top in the other, hopefully that will at least give us some additional info.
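
For example (just a sketch): in one session run tklbam-backup, and in the other either run top (pressing Shift+M sorts by memory) or poll overall usage with something like:

watch -n 5 free -m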

Another thing worth doing is to check the MySQL logs to see if they are giving any additional info. They should be in /var/log although I forget the exact location OTTOMH.
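
On Debian, MySQL errors often end up in syslog rather than a dedicated file, so (as a sketch, the exact file names may differ) something like this should cover both:

grep -i mysql /var/log/syslog | tail -n 50
ls -la /var/log/mysql/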

Out of interest, how long have you let it run for to see if it eventually gets over it? I'm guessing quite a while...

Bill Carney's picture

Some extraneous characters were appended to the URL, give this a shot:  
https://www.turnkeylinux.org/forum/support/20151116/tklbam-backup-mysql-error

It appears I'm very close to running out of RAM; I would have expected 8GB would be plenty though.

Here is top right after starting tklbam-backup: http://www.billcarney.com/DuringDump.txt  The lines ending with "/usr/sbin/+" are Apache.

During the process where it's dumping tables, some - but not all - of the Wordpress sites become unavailable due to MySQL not being able to serve data.  Here is top at that point: http://www.billcarney.com/AtFreeze.txt

I have to issue service mysql restart for tklbam-backup to continue.  When I first encountered this, it had been stalled out for over four hours.    

And here is a few moments after tklbam-backup completion.  http://www.billcarney.com/AfterCompletion.txt

The odd thing is that I can run tklbam-backup --skip-databases and it works, and I can do a manual mysqldump without a problem, but tklbam-backup on its own takes it down.

Looks like I'll have to start looking into what is gobbling up so much RAM.
 

Jeremy Davis's picture

Firstly, yes that new link works a treat, thanks! :) That issue is a really weird one. It's not completely unique in my experience, but we're yet to be able to pin it down to a TKLBAM bug (or even a bug in something else that we could work around). It only appears for some users under some specific circumstances. The few cases we have been able to replicate (like 2 or 3 in the last 8 years, using the users' MySQL data) appeared to be caused by MySQL edge case bugs under very specific circumstances. TBH, I'm still not (yet) completely convinced that your case is one of those.

Depending on a myriad of factors, in theory 8GB should be more than enough RAM for a small to medium traffic site. However, doing a dump of a largish DB is a very RAM intensive process. Essentially it requires MySQL to replicate everything it has into RAM and write it to the disk. By my understanding (which I admit is somewhat limited) some DB data may be present in memory up to 3 times! (Once for the DB to use with Apache and PHP, once for the DB dump and a third time for the disk cache as it's being written.) It's like that because to back up the data it needs to be in a consistent state, but as it's also being used by the webapp (and is in a volatile state), it must be copied before it's backed up!
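
If you want a rough idea of how much of that is MySQL itself at any given moment (as opposed to Apache/PHP or the backup process), something like this shows its resident memory (RSS, in KB):

ps -C mysqld -o pid,rss,vsz,cmd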

My guess is that you are running out of RAM during the backup. So operations are being pushed to swap (assuming you have swap enabled?) which is causing disk thrashing and hence the hang. The fact that MySQL doesn't completely crash suggests that you have swap enabled (since if you run out of accessible RAM/swap altogether the OOM killer would step in and likely kill MySQL altogether - as I noted in that other thread).
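
A quick way to check both of those things (whether swap is enabled, and whether the OOM killer has been active) would be something like:

swapon -s
free -m
dmesg | grep -iE 'oom|killed process'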

The other factor regarding RAM consumption is the interaction between Apache and PHP. From the bit of tklbam log you've posted, I gather you are using WordPress, right? Third party WordPress plugins have a bit of a reputation for being poorly written (obviously that doesn't apply to all of them, but an alarmingly high number). Historically, many have been notorious for memory leaks. So the first thing I would recommend is making sure you only have wp plugins enabled which you are actually using. Any that you have installed but aren't using, get rid of them (disable them first, then remove them once you're sure they don't break anything).
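
As an aside, if you happen to have WP-CLI installed (it's not part of the stock appliance, and the site path below is just an example), it makes it easy to spot inactive plugins per site:

wp plugin list --status=inactive --path=/var/www/example-site --allow-root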

You probably already do, but I'd also double check that all your plugins are up to date. I'd also double check that all the plugins you are using are still being maintained. If they haven't had any updates in a while, it may be time to see if you can get the same or similar functionality from an alternative (better maintained) plugin?

If MySQL appears to be using excessive RAM, it may also be worth seeing if you can tune MySQL a bit. I'm certainly no expert in that field, so I won't give you any advice, but I'm sure you could find some pointers online. Once you've done a bit of research yourself, then any remaining questions or queries are possibly best asked on a DB focused/specialist forum, such as https://dba.stackexchange.com/
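
As a starting point (and without changing anything), you can at least see what the relevant settings currently are, e.g.:

mysql -u root -p -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; SHOW VARIABLES LIKE 'key_buffer_size'; SHOW GLOBAL STATUS LIKE 'Max_used_connections';"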

As I mentioned, `top` may assist you to troubleshoot what is actually using all the RAM (although IMO `htop` is even better, but you'll need to install it). My guess is that during the backup MySQL (and probably TKLBAM itself) are chewing up plenty of RAM. But I'm also guessing that the major culprit of the RAM usage is Apache. If that's the case, try restarting Apache to see if that improves things. If it is a PHP memory leak somewhere, you should see an almost immediate improvement, which will again degrade over time.
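
To see whether Apache really is the big consumer, something along these lines lists its largest processes by resident memory (and restarting it is just a service apache2 restart):

ps -C apache2 -o pid,rss,cmd --sort=-rss | head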

With regard to your logs, it appears that authentication is required to view them? If you don't want to post raw logs publicly, then perhaps you could obfuscate them (i.e. remove any sensitive bits)? If you do that, please make sure that you note where/what you have obfuscated so I can ask for further info if it appears that something that might be useful is missing.

Bill Carney's picture

My apologies, I added password protection to the site while I was working on another project and neglected to remove it before I shut down for the evening.  The links/files are now available.

I did not have a swap file enabled.  This morning I added a 2GB swap file per these directions https://www.cyberciti.biz/faq/linux-add-a-swap-file-howto/ and ran tklbam-backup.  MySQL stopped responding as before, and watching top showed that swap usage was minimal.  I wound up removing the swapfile.
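
For reference, the steps I followed were roughly along these lines (per that guide):

fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile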

I do have a number of Wordpress sites on the server.  I try to keep the installs as lean as possible with only necessary plugins.  I also use InfiniteWP to update the plugins, themes, etc daily.

I issued a service apache2 stop, then ran tklbam-backup --skip-files; this too caused tklbam-backup to stop at the end of dumping the tables and it would not continue until I issued a service mysql restart.  Here is the usage at the point it stopped:

KiB Mem:   8194396 total,  6695696 used,  1498700 free,   293560 buffers

KiB Swap:        0 total,        0 used,        0 free.  5344376 cached Mem

It's odd that the problem is immediately after the database dump, rather than during the dump - I would expect that if MySQL has 3 copies of the db in memory, it would die off during the process rather than immediately afterward.  Very perplexing.

Jeremy Davis's picture

Thanks for the additional info Bill.

It sounds like all my guesses were wrong then and the issue runs a little deeper than I had hoped. So sorry for my rant above. It sounds like you are already all over the points I noted...

Now I can see the top info you've posted (thanks for fixing that) I'm even more perplexed! Even when it freezes, there's still about 180MB RAM free and almost no CPU load! You are right that it is weird that it stops after dumping the DB and that restarting MySQL allows it to continue?!

At first glance, I noticed that when it freezes ksoftirqd is the second highest user of CPU. TBH, I don't know a lot about it, but AFAIU excessive ksoftirqd CPU usage is often related to hardware issues. But on closer look, it's only using 0.7% CPU - which is hardly excessive!

So it doesn't appear to be RAM or CPU related (at least not on face value). It seems to me, like TKLBAM is waiting for MySQL to finish, but as far as MySQL is concerned, it's done...!

Did you check the MySQL logs to see if that is reporting anything amiss at that point? TBH, my guess is that there won't be much (if anything) of interest there, but probably worth a look. Beyond a MySQL and/or TKLBAM edge case bug, the only other thing that I can think to check would be I/O stuff (again I doubt it's that, but who knows...). I just discovered SysStat and it looks like it's in the Debian repos. TBH, I've never used it, but it looks like it may assist with double checking IO related stuff?!
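
I've never used it, but I gather something along these lines would show per-device I/O stats while the backup runs (iostat is part of the sysstat package):

apt-get update && apt-get install sysstat
iostat -x 5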

If you run out of ideas and you're willing to share your backup with us, we can do some testing? We possibly just need a copy of your DB? If you do, please feel free to sanitise the data first (i.e. remove anything sensitive, e.g. user info). But either way, please make sure that the issue still occurs with the data you share. Your backup looks pretty big, so assuming that it's reproducible with just the DB, that's probably best. I'm not 100% sure on the best way to get it to me, but if you can upload it somewhere, then I can download it. Obviously don't post it somewhere public! If I need credentials to access it, you can email me directly via jeremy AT turnkeylinux.org. If you use google drive, then that could be an option (my google account is jeremy AT jeremydavis.org).

Bill Carney's picture

Nothing of note in the MySQL log.

My workplace blocks NTP so I can't run a restore to a local VM unless I'm at home.  I'll try to get you a sanitised, tested copy of the db over the weekend.  Thanks, as always!

Bill Carney's picture

I just sent you a link to the file, Jeremy.  Hope that helps.

Bill

Jeremy Davis's picture

Thanks Bill, I got it! I'll try to have a look ASAP.

Jeremy Davis's picture

Can you please test just doing a tklbam dump to the local FS and see if the same thing happens?

mkdir /tklbam-dump
tklbam-backup --dump=/tklbam-dump

TBH, I don't expect it to give a different result, but I'd like to be sure...

I've just loaded your DB into MySQL on a locally installed v14.2 LAMP appliance (I figured that would be close enough). FWIW the VM is running on KVM with a 15GB vHDD, 2GB RAM and 2 vCPU cores (the hardware CPU is an Intel Atom octo-core running @2.4GHz) - so it should be much lower powered than your AWS instance, but it completes quite quickly and with no obvious issues.

[...]
table: wp_votetonya/wp_votetonya_wfScanners
table: wp_votetonya/wp_votetonya_wfStatus
table: wp_votetonya/wp_votetonya_wfThrottleLog
table: wp_votetonya/wp_votetonya_wfVulnScanners

UNCOMPRESSED BACKUP SIZE: 1.88 GB in 6978 files

TIP: test your backups with a trial restore BEFORE something bad happens.
root@lamp ~# 

Unfortunately I missed it finishing, so I'm not sure whether it paused or anything, but I will try again over the weekend. I may even do a screen recording so I can replay it.

It's probably also worth double checking that we're using the same versions. Please give me the output of:

apt-cache policy tklbam mysql-server | grep Installed

FWIW this is what I get:

  Installed: 1.4.1+17+g71478bd
  Installed: 5.5.59-0+deb8u1

(top one is tklbam; bottom one is mysql)

So the next step will be to try to reproduce the issue on AWS. Assuming we're on the same version etc, my hunch at this point is that there is some corner case AWS hypervisor issue, or perhaps something else entirely. Can you please clarify the version of TurnKey you are using and the size and region the server is running in? Hopefully that way I have a good shot at reproducing it.

Bill Carney's picture

I'm running 14.2, one on an m4.large instance in us-east-1c, and one on a tklbam-restored local virtual machine (2CPU, 1GB RAM).  Both are LAMP appliances.  The error appears on both.  I left the local VM running overnight, and it never could get past the last table dump on its own without having to restart the MySQL server.  

I did the tklbam dump to the local FS on my AWS instance, and it showed the same problem.  

Here is the output on the AWS instance:

  Installed: 1.4.1+17+g71478bd

  Installed: 5.5.59-0+deb8u1

Since it worked for you, I'm starting to become concerned it's something *I've* done.  Do you happen to know what's being used at that point in the backup process?  Maybe I edited a .ini file in a way that tklbam-backup doesn't like.  I know I had to bump PHP's memory_limit in php.ini from 32M to 512M to accommodate InfiniteWP (the WordPress updater I referred to earlier in the thread).  Yeah, I know, grasping at straws here.

 

 

Jeremy Davis's picture

Ok so that confirms we're using the same versions of TKLBAM and MySQL. The fact that you can also recreate on a local VM rules out AWS (sorry I got sidetracked over the weekend and never did test that myself...).

That suggests that there is something fundamentally different going on between your server and mine. IMO that means one of two things:

  1. it's nothing to do with MySQL at all (seems unlikely considering your testing shows that restarting MySQL works around it); or
  2. there is something specifically different between your MySQL setup and mine

At this point, I'm fairly sure it must be something specifically related to MySQL (not PHP or anything else) but here's one way to test that theory. Assuming that you'll need to do this at home, to speed up the testing, I suggest that you just do a TKLBAM local dump of your VM (no need to wait for it to upload then). Do that like this:

mkdir /tklbam-dump
tklbam-backup --skip-files --skip-packages --dump=/tklbam-dump

If /tklbam-dump already exists (from previous testing) either make up a different dump dir, or remove the existing one (if you have no further need for it). Let me know how that goes and we can take it from there.

One thing that does come to mind, is did this backup originally come from a previous version of TurnKey? If so, perhaps there is something (old) in the MySQL config that is causing it to choke in this scenario?

FWIW, TurnKey uses etckeeper to keep track of changes in /etc. They're stored in a git repo. So if you cd into /etc, find the initial commit ID and check against that, you will be able to see any changes to the MySQL config files. Something like this should do the trick:

cd /etc
git log --oneline | tail -1

Should return:

<COMMIT_ID> initial commit

Then do a git diff for files in /etc/mysql:

git diff <COMMIT_ID>..HEAD -- /etc/mysql

To use my test server as an explicit example:

root@lamp ~# cd /etc
root@lamp /etc# git log --oneline | tail -1
5bf34d6 initial commit
root@lamp /etc# git diff 5bf34d6..HEAD -- /etc/mysql

Another way to check whether it's some config that has changed is to do a restore (to a new VM) but exclude the whole /etc directory. I.e.:

tklbam-restore --limits="-/etc"

If you'd like to do some testing around including/excluding certain directories during the restore, then you may be better off downloading the whole backup first, then just doing test restores (which you can roll back, then retry, without needing to redownload the whole thing again). In other words, just do a download first:

mkdir /tklbam-download
tklbam-restore BACKUP_ID --raw-download=/tklbam-download

Then to restore from this cache, instead of specifying a backup ID, specify the file location (and in this example, I'm just excluding the /etc/mysql dir):

tklbam-restore /tklbam-download --limits="-/etc/mysql"

Bill Carney's picture

This fails:

mkdir /tklbam-dump
tklbam-backup --skip-files --skip-packages --dump=/tklbam-dump

So I tried 

git diff <COMMIT_ID>..HEAD -- /etc/mysql

and the results are as I expected (I add my name to comments so I can more easily see my changes).  They're below; mostly values I increased to deal with running so many WordPress sites.

diff --git a/mysql/my.cnf b/mysql/my.cnf
index c2390e8..bbd3391 100644
--- a/mysql/my.cnf
+++ b/mysql/my.cnf
@@ -48,21 +48,27 @@ bind-address                = 127.0.0.1
 #
 # * Fine Tuning
 #
-key_buffer             = 16M
+# key_buffer           = 16M                   #Removed 12/19/16 CARNEY since its deprecated
+key_buffer_size                = 32M           #Added 12/9/16 CARNEY
 max_allowed_packet     = 16M
 thread_stack           = 192K
-thread_cache_size       = 8
+thread_cache_size       = 64           #Changed to 64 from 8  12/9/16 CARNEY
 # This replaces the startup script and checks MyISAM tables if needed
 # the first time they are touched
-myisam-recover         = BACKUP
+#myisam-recover         = BACKUP       #Removed 12/19 CARNEY since its deprecated
+myisam-recover-options = BACKUP
 #max_connections        = 100
 #table_cache            = 64
 #thread_concurrency     = 10
 #
 # * Query Cache Configuration
 #
-query_cache_limit      = 1M
-query_cache_size        = 16M
+query_cache_limit      = 8M                    #Changed to 8M from 1M  12/9/16 CARNEY
+query_cache_size        = 128M         #Changed to 128M from 16M  12/21/16 CARNEY
+query_cache_type                  =0            #Added 12/21/16 CARNEY
+join_buffer_size        = 1024K        #Added 12/21/16 CARNEY
+
+
 #
 # * Logging and Replication
 #
@@ -126,3 +132,12 @@ key_buffer         = 16M
 #   The files must end with '.cnf', otherwise they'll be ignored.
 #
 !includedir /etc/mysql/conf.d/
+
+# CARNEY added on 7/26/16 to help deal with MySQL stopping for no apparent reason
+innodb_buffer_pool_size = 3G  #Changed to 3G from 20M  12/19/16 CARNEY
+innodb_log_file_size      = 768M 
+tmp_table_size        = 24M    #Added 12/21/16 CARNEY
+max_heap_table_size   = 24M    #Added 12/21/16 CARNEY
+table_open_cache                = 400  #Added 12/21/16 CARNEY
+innodb_file_per_table   = ON           #Added 12/21/16 CARNEY
 +                                                             
\ No newline at end of file
Bill Carney's picture

I have two Turnkey accounts, one for my small business, and one for my day job.  My work account has virtually the same LAMP setup that my small biz account has, only it's on an m4.xlarge instance.  So for kicks I launched an m4.xlarge instance on my small biz account, and restored the backup to it last night.  No joy, it has the same issue as well.  But, that's one more thing ruled out.

Bill Carney's picture

Yes, this backup started back at 12.1 I believe.  Quite some time ago, early 2013.

Jeremy Davis's picture

This is just a hunch, but my guess is that it's something to do with your my.cnf

If this backup originally came from a previous major TurnKey version (i.e. my guess is v13.x if the original backup was early 2013) then perhaps you are missing some config option(s) (or perhaps have some that are causing unexpected results)? When I first saw your changes for deprecated settings, I thought that may have been the issue. On reflection, those deprecations were probably just down to the version change, but perhaps something else important was missed along the way?

The thing that I previously overlooked when I suggested checking the /etc git changes, was that the .git directory (where the changes are saved) would also have come from your backup! So whilst it's interesting, in this case it won't show what I was hoping it would show!

So IMO, what you'd want to do next is check what the default my.cnf for v14.x should be and compare that to your current one.

Unfortunately, I don't have that LAMP server I was working with previously, but I just quickly created a new one to double check. This should work, but no guarantees...

Make sure everything is committed in /etc (it should automatically commit when you run the package management commands I'm about to get you to use, but let's be sure...).

cd /etc
git add .
git commit -m "manual commit before reinstalling default config"

I've just double checked where all the MySQL config comes from, but for the record, here's how you do it:

dpkg -S /etc/mysql

On my test system that returned this:

mysql-server-5.5, mysql-common: /etc/mysql

So the contents of /etc/mysql come from those 2 packages (mysql-server-5.5 & mysql-common). So now we'll reinstall those and specify that we want the current config files overwritten:

apt-get update
apt-get -o Dpkg::Options::="--force-confnew" install --reinstall mysql-server-5.5 mysql-common

Now the current HEAD of the git repo in /etc should be the new (original) config files from the mysql-server-5.5 & mysql-common packages. So this should do the trick:

git diff HEAD^ HEAD

FWIW "HEAD" is the latest commit and "^" is shorthand for the parent commit, so "HEAD^" is the parent of (i.e. one before) "HEAD". So essentially we are doing a diff between the latest commit and the one before (actually the other way round, but hopefully you know what I mean!).

My guess is that there is something in there that isn't as it should be. Either something missing that it needs, or something that doesn't act as it probably should. Whilst it won't be tuned to your ideal production scenario, I suggest doing a quick test as per above, i.e.:

mkdir /tklbam-dump
tklbam-backup --skip-files --skip-packages --dump=/tklbam-dump

That will at least confirm (or not) my suspicion.

Hopefully it does and the backup now works! If so, the next step will be troubleshooting which bit was actually problematic. I suggest that you adjust the defaults in your new my.cnf to what you note above. Although I'd recommend that you don't add any new ones (at least not at this stage). If you want to keep track of your changes as you go, you can manually commit to the /etc git repo and use git tools to roll back changes that break things.
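
E.g. something along these lines (the commit message and the commit ID are just placeholders):

git commit -am "bump query_cache_size"
git checkout <COMMIT_ID> -- mysql/my.cnf

The first records your current state; the second puts a single file back the way it was at an earlier commit.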

Fingers crossed that works as I hope it does. If you can narrow it down to exactly what the issue was/is, that'd be super awesome, but even just confirming (or not) that I'm on the right track may assist others in the future!

If I'm completely off on the wrong track, then I'm not really sure what to fiddle with next. But let's cross that bridge if we get to it! :)

(PS hope you don't mind but I edited the layout of your git diff output above to make it a bit easier to read)

Bill Carney's picture

Just for the record, I downloaded the my.cnf file from my other working server and moved it to the local VM. That did not solve the issue. So I reverted back to my previous snapshot.

I am not well versed with git at all. The command

git commit "manual commit before reinstalling default config"

gave me the error

error: pathspec 'manual commit before reinstalling default config' did not match any file(s) known to git.

So I did what a typical user would do – I just ignored it and moved on to the next step. ;)

Got to this step

git diff HEAD^ HEAD

and it showed me pretty much the same dump as I posted yesterday (thanks for reformatting it for me).

I ran a service mysql restart, then the tklbam-backup command noted above.

It worked!

Out of curiosity I re-uploaded the modified my.cnf file, restarted mysql, and ran tklbam-backup again. It worked as well. So it appears this is what fixed it:

 

apt-get update
apt-get -o Dpkg::Options::="--force-confnew" install --reinstall mysql-server-5.5 mysql-common

All this was done on a copy of the server on a virtual machine locally. Over the weekend when there is less interruption I'll apply this to the server on AWS; I'm expecting it to work, but if it doesn't I'll post back here.

Thanks again as always, this was really “above and beyond”!  You are a good teacher.

 

Jeremy Davis's picture

Awesome to hear that worked! And thanks for your kind words.

Apologies on the broken git commit command! I forgot the '-m' switch (or '--message')! It should have been:

git commit -m "manual commit before reinstalling default config"

So all you'll need to do now is tune your MySQL server config for improved performance. FWIW TurnKey actually comes bundled with a MySQL tuner script. The version included is quite old (and I've added an issue to the tracker so a new version gets bundled for our next release), so if you do plan to use it, I'd recommend updating it first. It may be useful to assist you in tuning things up. FWIW, it operates in a read-only fashion, so it only gives advice on what you may want to do. You'd be well advised to research its suggestions before applying anything though (as they advise).
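
If you do update it, I believe the upstream project documents a shortcut URL along these lines (treat the exact URL as an assumption on my part):

wget http://mysqltuner.pl/ -O mysqltuner.pl
perl mysqltuner.pl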

As an aside, if you're running WordPress and want to optimize a bit more, then it may be worth considering cleaning up the DB. I've never used it, but have heard good things about the WP-Sweep module. From what I gather, Lester Chan (the module developer) is a respected WP developer.

Bill Carney's picture

I ran 

apt-get update
apt-get -o Dpkg::Options::="--force-confnew" install --reinstall mysql-server-5.5 mysql-common

and then tested with 

mkdir /tklbam-dump
tklbam-backup --skip-files --skip-packages --dump=/tklbam-dump

Same problem as before.  I checked my.cnf and it still had my tuned configuration, so I created a new virtual machine and copied the default my.cnf from it to my server.  That didn't work either.

So I thought what the heck, I'll reboot.  Three times it got stuck after "Loading initial ramdisk ..." which freaked me out, but after the fourth attempt it finally came back online.  I tested the backup again, and it did not work.

I think eventually, probably when 15 comes out, I'll just create a new LAMP server and migrate my sites over to it by hand rather than continuing to troubleshoot this one.  

Jeremy Davis's picture

What a pain! I thought we had it nailed for sure... :(

v15.0 shouldn't be too far away. I'm hoping to publish the Core RC this week. Once that's published, the others should follow within a week or 2. Most of them have already been updated, we just need to do final testing.
