Ronan0's picture

I have TKLBAM installed on a server that runs an application on the Java Virtual Machine (through a Java servlet).

The log file generated when the JVM crashes is attached to this post.

The server has 2GB of RAM and 2 cores (full details in the log file). Normally, when the backup runs (in the morning, when there is no other load on the machine), I see the CPU spike to around 40%, which already seems like a lot.

But when the backup crashes the JVM, the CPU spikes to ~115%.

I do not have a real time monitor on the memory load at present, but you should be able to get information on this from the attached log.

I have another machine with 4GB running a similar setup (although there only the database is backed up, whereas on this machine everything is backed up, as per the default settings). I do not experience this problem on that machine.

How can I avoid this? What causes the CPU and memory load? Can I reduce it, say, by only backing up the database?

Or could it be some type of memory leak? It only happens every month or two. It happened this time on an incremental backup.

Just to clarify, the backup completes successfully.

Thanks.

Jeremy Davis's picture

From looking at your log it seems clear that your server is running out of RAM. As you are no doubt aware, Java apps tend to be quite hard on resources themselves so on a system with only 2GB RAM it's not really surprising that something would crash when the system is already under load and TKLBAM starts.

TKLBAM uses RAM when collecting all the files to backup; particularly when dumping databases (as the DB is dumped, it is stored in memory before it is written to disk). It also uses memory when creating the archive files which are to be uploaded.

TKLBAM mainly uses CPU when creating the archive of backup files. AFAIK TKLBAM uses gzip for compression. By default the archive process (gzip) runs as a single thread so will soak up as much CPU as it can - but only on one core. On a system with a single core, it is expected behaviour to see CPU spikes up to (and often over) 100%. However, I can't explain why you are seeing that on a system with 2 cores.

So you have a few options. You may find that one of these resolves the issue for you, or perhaps a combination of them will be your best approach. Here are the ones that come to mind:

  1. Increase RAM:

    This may not be that practical or desirable in your instance, but it's certainly the best way to avoid this issue if you have the ability to increase RAM. You'll possibly notice improved performance of your Java app too.

  2. Configure swap:

    It seems that you don't have swap enabled. TBH there's nothing wrong with that (it's generally my preference so long as I have enough RAM). But it does mean that when your system runs out of RAM, it has nowhere to turn. Whether this is a good and/or cost effective idea depends lots on your individual scenario. On a hosted platform such as AWS, enabling swap can often be false economy as you pay for storage and disk I/O. So if your server is using swap lots, simply using a larger server (with more RAM) will often give you improved performance at a comparable cost. OTOH if your system rarely uses swap, then a small swap file/partition may be a reasonable solution.

  3. Specify the time of your backups:

    If you have windows where your server is only ever under low load, then simply timing your TKLBAM backup to run at that time may be sufficient to make it more reliable? TKLBAM backups are triggered by cron.

  4. Split your backups:

    It is possible to have multiple backup sets for the one server. So you could have one backup job which backs up your DB, and another that just backs up the files. Or you could have one job which just collects all the files (inc DB) and dumps them somewhere (on the local filesystem), then another that collects those files and uploads them. The downside of the first method is that when the different parts are backed up separately, they may be at different points, so a restore may cause unexpected results. The second method may make your restores significantly more mucking around. If you choose either of these possibilities, I urge you to spend some time testing restores (probably to a clean install) to ensure that it will work if/when you need it!

  5. Use a TKLBAM hook (or something else) to restart the JVM (if/when required):

    TBH I've never used TKLBAM hooks for this sort of purpose, but I can't think of a reason why it wouldn't work. There are also other apps that you could use to monitor the JVM and restart it if it crashes.
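
As a rough sketch of option 5, a small watchdog could check whether the servlet container is still running and restart it if it is not. The service name `tomcat8` and the use of `systemctl` are assumptions made for illustration (neither comes from this thread), so substitute whatever your server actually uses. This is not a TKLBAM feature, just a plain script that cron or a hook could call.

```python
import subprocess

SERVICE = "tomcat8"  # assumption: replace with your servlet container's service name


def is_active(service, run=subprocess.run):
    """Return True if `systemctl is-active --quiet <service>` exits with 0."""
    result = run(["systemctl", "is-active", "--quiet", service])
    return result.returncode == 0


def ensure_running(service, run=subprocess.run):
    """Restart the service if it is down. Returns True if a restart was issued."""
    if is_active(service, run):
        return False
    # check=True raises CalledProcessError if the restart command itself fails
    run(["systemctl", "restart", service], check=True)
    return True
```

Calling `ensure_running(SERVICE)` shortly after the backup window would bring the JVM back if the backup knocked it over. It only treats the symptom, though; it does nothing about the memory shortage itself.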

Some relevant links to docs and info:

  • TKLBAM docs
  • tklbam-backup man page
  • tklbam hooks documentation
  • multiple backups of single server
  • exclude DB or files from backup
  • old forum thread on multiple backups of same server

    Hope that helps.

  • Ronan0's picture

    Thanks. It would seem the solution is to only back up the database (the files don't change), since I already run the backup when there is no load on the system.

    I am a little confused about this: does the very first backup back up all the files?

    Or when you go to do a restore, do you need to restore to a server that already has all the files you skipped in tklbam in place, so that tklbam only restores what was specified in the backup (in my case, the MySQL database)?

    Following on from that, if I have been backing up the whole system, and then specify only the database, can that cause problems?

    Thanks again, Jeremy.

    Jeremy Davis's picture

    Ok, if the files rarely change and it only happens once per month, then it could well be that backing up both files and DB is the load that broke the camel's back to mix my metaphors...

    FYI TKLBAM does a full DB dump even for incremental backups. So if it is only dying on the full backups then splitting your backups may well work, at least in the short to medium term (depends how fast your DB is growing).

    If the crash doesn't always coincide with your full backups then splitting backups may not help. If it is truly intermittent and the files rarely change then that would suggest to me that during backups it's probably close to the edge a lot of the time.

    To answer your question directly; if you split your backups into files and DB, then yes there is a risk you may encounter issues down the track. Ideally you still want to be doing occasional full backups to be safe. And you should also be regularly testing restores to make sure it's all working as it should (before you need it).

    I would not recommend only backing up your DB (and nothing else). If the files rarely change, then doing full (or perhaps file only?) backups less frequently and a separate more regular DB backup job may work.

    Regardless, if I were you, I'd document what you do as you go. And I can't stress enough to make sure that you test your assumptions.

    You can test your ideas, by restoring backups and/or subsets of backups to a test server to see if it will work as you hope. E.g. do a restore of a month old backup to a new server, then restore just the DB from your most recent backup and see what happens.

    On another tangent, it's perhaps worth monitoring RAM usage on your server anyway. Bottom line is that you either need to increase the RAM it has, or reduce the RAM it's using. If it's running close to the edge lots, then spending time tuning TKLBAM may be wasted once your DB grows a bit more (if you are left with no other option than to add RAM).

    Ronan0's picture

    The issue is not related to the server "running close to the edge". There is plenty of headroom even when under a large load. But as I said, there is never any other load on the server when the backups happen.

    Neither is it related to doing full backups. This happens on incremental backups.

    The database is only around 150MB.

    I regularly monitor the memory from within the application. I have 1GB committed to the heap, of which typically 100 MiB to 350 MiB is used by the application, and 150 MiB committed to non-heap.

    So there should be plenty of memory for tklbam, especially considering that it works fine on 0.5GB Amazon instances.

    The oom crash seems completely random, although usually around 5-7 weeks elapses between each event.

    I guess the strong CPU spike I am observing is related to the system trying to find the needed memory through garbage collection. Again, an AWS micro instance with much less processor power than I have available does not seem to have much problem with the processing demands of tklbam.

    I suspect some kind of memory leak. I am going to put an additional realtime monitor on the server memory so I can more carefully observe whether it is slowly creeping up over time, and whether tklbam contributes to that. But in the meantime if your developers can offer any insights that would be great.
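
One way such a monitor could look is sketched below: it samples `/proc/meminfo` and fits a straight line to the readings, so a slow upward creep shows up as a positive slope. The function names and the sampling scheme are illustrative assumptions, not the monitor actually used in this thread, and the sketch measures system-wide memory rather than the JVM alone.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of {field: integer value}."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[name.strip()] = int(parts[0])  # sizes are reported in kB
    return info


def used_kb(info):
    """Memory in use, excluding what the kernel reports as reclaimable."""
    # MemAvailable exists on kernels >= 3.14; otherwise fall back to a rough estimate.
    fallback = info["MemFree"] + info.get("Buffers", 0) + info.get("Cached", 0)
    return info["MemTotal"] - info.get("MemAvailable", fallback)


def slope_per_sample(values):
    """Least-squares slope of the values against their index (kB per sample)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def sample(count, interval_s, path="/proc/meminfo"):
    """Collect `count` readings of used memory, `interval_s` seconds apart."""
    readings = []
    for _ in range(count):
        with open(path) as f:
            readings.append(used_kb(parse_meminfo(f.read())))
        time.sleep(interval_s)
    return readings
```

For example, `slope_per_sample(sample(60, 60))` gives the trend over roughly an hour. With crashes 5-7 weeks apart, the readings would need to be logged over days or weeks, and compared before and after each backup run, to tell a leak apart from normal fluctuation.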

    Thanks.

     

    Ronan0's picture

    Ok, I put on a memory monitor and it looks like there is a memory leak in my own application causing this. 

    Appreciate the feedback. Your diagnosis was correct.

     

    Jeremy Davis's picture

    I'm glad to hear I wasn't leading you astray! :) Bit of a pain that you now need to troubleshoot your app's memory leak though. :(

    Please let me know if you discover anything interesting. Especially if you find something TKLBAM related. If you'd rather not send stuff publicly, you can email it to me direct (jeremy AT turnkeylinux.org) and I can pass it on to Liraz.

    Good luck.
