Rick Kipfer's picture

Hi Folks!

I have just started to realize the devastating effect that TKLBAM can have on the performance of a Micro instance. It seems that, realistically, anyone wanting both uniform website performance and an hourly backup of their server must either move to a Small instance or use some sort of CPU limiter to reduce or eliminate the CPU spiking responsible for AWS CPU throttling.

I would like to experiment with setting up the CPULIMIT utility to accomplish practical throttling of the TKLBAM job, and I invite anyone interested in participating in the experiment to give me feedback on how to do it. Maybe by the end of the experiment we'll have a bit of an instruction manual on how to practically set it up, so TKL enthusiasts can keep their micro instances as long as possible before needing to upgrade. Hopefully the last post in this thread can be a comprehensive CPULIMIT AND TKLBAM HOW-TO.

To start, I think I can work on setting CPULIMIT up, but I have no idea what the names of the processes I would need to limit are. Anyone know:

a) What processes does TKLBAM itself use?

b) Is it possible to limit the MYSQLDUMP for the database portion, or would this be counterproductive because it would simply delay table unlocking?

c) What part of TKLBAM is most CPU intensive? Is it the rsync (or whatever it is that does the comparison to build the incrementals) that takes the most power, or is it simply the TKLBAM compression that is the biggest consumer?

I'm starting off in the dark with my understanding of the true nature of TKLBAM, so any tips would be helpful.

Jeremy Davis's picture

I must admit that something similar crossed my mind too, although it never got any further than that...

A few bits and pieces of info which may be useful:

TKLBAM uses Duplicity as a backend, so that may be worth further investigation. I suspect that the backup compression also loads up the CPU when running backups (and I assume that it uses tar); whether tar is called by TKLBAM itself or by Duplicity I'm not sure. IIRC the devs released the source code for TKLBAM, so that may be worth investigating too (if they have, it should be on GitHub - otherwise you could download the deb from the TKL repo and pull it apart).

top is probably the best tool to find out what is chewing CPU cycles (it's part of procps, so it should already be installed). Run it while running a TKLBAM backup and see what is happening. From my quick reading, cpulimit at some point implemented the ability to limit a process and all its child processes (ie if you limit TKLBAM it will also limit any other processes TKLBAM launches), but I'm not sure whether that is part of the version in the Debian repos. It looks like it's hosted on GitHub now though, so it would be easy to pull down the latest version.
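If you want to try it straight away, something like the following might be a starting point - note the 30% cap is just an illustrative figure, and -i/--include-children is only in the newer GitHub version of cpulimit as far as I know (untested):

# in one SSH session, watch what TKLBAM spawns and what is eating CPU
top

# in another, cap tklbam-backup (and, with -i, its child processes) at ~30% of one CPU
cpulimit -e tklbam-backup -l 30 -i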

Some time ago I discovered cpulimit daemon, which may also be useful?

Chris Musty's picture

Personally I think you are wasting your time.

You are taking a cheap car and trying to make it perform like a sports car. Stripping things out will just make it harder for it to do what it was designed for. In this instance, Micros were designed for peaky CPU usage (up to 2 cores) with low background processing. Break either of those rules and you're throttled.

I have tried websites, web apps, email and a raft of the available TurnKey images, but consistently, once I used 100% CPU for more than 5 mins I was penalised by throttling. Many people have commented on the same thing - "I use it once per day and after 5 days it's slow". It is this fact that makes them unreliable, and trying to limit one process will not stop it.

I have used trickle with some success

apt-get install trickle

It limits bandwidth and seems to also reduce processor load, but my results varied. It is especially useful when several things happen at once.

Its usage is simple. In your cron job replace

tklbam-backup

with

trickle -u 250 -d 250 tklbam-backup

where -u is upload speed and -d is download speed; the units are KB/s.
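So if, for example, your hourly backup is run from a cron file, the amended entry might look something like this (the file location and schedule below are just placeholders - adjust them to wherever your tklbam-backup job is actually defined):

# e.g. in /etc/cron.d/tklbam-backup (hypothetical location and schedule)
# hourly backup, capped at 250 KB/s up and down
0 * * * * root trickle -u 250 -d 250 tklbam-backup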

The following is also helpful

  • schedule processor intensive tasks at different times (good luck predicting web traffic though)
  • realise that memory is only about 613MB and swap usage will produce processor spikes
  • websites can serve up to ~1000 pages daily, but only with an even distribution - all at once will crash it
  • AJAX (general term) kills the processor with just a few users
  • some Joomla components, modules and plugins use more CPU time than others

As for limiting a process, I did not bother with this. 1 CPU is not enough to share around in most of my testing. Small instances are faster and better in every respect; micro is only used for the lightest loads in my business.

Chris Musty

Director

Specialised Technologies

Jeremy Davis's picture

But in some usage scenarios (eg low traffic sites for personal or hobby purposes) I think that ensuring your Micro instance won't be throttled heavily for long periods of time is not an unreasonable aim.

And the cpulimit daemon script I linked to above (I just fixed the links...) resolves the issue cpulimit has of only limiting the CPU usage of a single process (although it will also limit child processes). I haven't actually tested it, but the idea is that you can either use it with a 'black process list' (it only limits the specified 'blacklisted' processes) or a 'white process list' (it limits all processes except those whitelisted).
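If anyone would rather roll their own than use that script, the basic idea is just a loop that watches for blacklisted processes and attaches cpulimit to each one. A rough, untested sketch (the process names and the 30% cap are examples only):

#!/bin/bash
# rough sketch of a 'blacklist' cpulimit daemon - untested
BLACKLIST="tklbam-backup duplicity"   # example process names
LIMIT=30                              # example cap (% of one CPU)

while true; do
    for name in $BLACKLIST; do
        for pid in $(pgrep -x "$name"); do
            # only attach one limiter per PID; -z makes cpulimit exit when its target does
            if ! pgrep -f "cpulimit -p $pid " >/dev/null; then
                cpulimit -p "$pid" -l "$LIMIT" -z &
            fi
        done
    done
    sleep 10
done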

Chris Musty's picture

You were a politician in a past life, weren't you Jeremy!

It's ok if you don't agree.

Micros are basically unused host resources. How this is managed exactly is a mystery, but AWS give some indication here.

The contention rate of the host server is critical to your performance, so given the wide range of processing ability, limiting a process to say 30% is going to give variable results. Also remember it's the background processing that's monitored.

To what extent AWS scale you back under load is also a mystery, but this is why my tests have been inconclusive and why I never use micros for mission-critical stuff.

I currently host 3 low-volume websites on micros, and all of them have been restarted at some point because of scaling.

I am very interested in the outcome, but I feel, as mentioned above and because of the variables, it's not going to work too well in reality.

EDIT: forgot to mention this one, but often people use the nice command to prioritise processes. It does not limit CPU; it modifies priority. A combination of both might work better.
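Off the top of my head, combining the two (plus ionice for disk priority, which I haven't tested in this setup) might look something like this in the cron job - the numbers are just examples:

# lowest CPU priority, idle-class disk I/O, bandwidth capped at 250 KB/s
nice -n 19 ionice -c3 trickle -u 250 -d 250 tklbam-backup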

Chris Musty

Director

Specialised Technologies

Jeremy Davis's picture

Perhaps...! And yes, I am known for my diplomacy... :) Although I have been known to 'go off' on the odd occasion. I just often tend to tread carefully when I'm not completely sure of myself (and in writing too...).

I had a read of your link and did a little more online research, and I still think it could be done, but not 100% reliably without considerable effort and a bit more scripting knowledge than I have.

...Having said that, I did read some posts on Amazon forums a while ago (when I originally found the cpulimit daemon thread) where others claimed to have successfully tweaked their micro servers to not throttle.

Rick Kipfer's picture

Thanks guys, I appreciate both your input. I'm going to try your CPU daemon script Jeremy; wish me luck and I hope it's not beyond me. I only touched my first Linux shell 3 or 4 months ago, which is why I love TKL so much in the first place. ;o)

Chris, that link you provided for an outline of AWS instance resource use was exactly what I needed to get an idea of what the frak they are doing. It makes so much sense now. You have to hand it to AWS, they sure seem to be trying hard to educate - that was a great article and WAY more honest than I thought they'd be. The problem for me with AWS is there are so MANY services, I just get lost in the sea.

I'll let you guys know if the daemon smooths out performance at all. After all, the cloud is all about a safe place to test!

I also like your idea about nicing TKLBAM, because maybe if it were a lower priority, even with the throttling I might still get a 6-8 second page load during backups instead of the 45-60...

Rick

Rick Kipfer's picture

After some testing and playing around, I have to agree with Chris that this may be a futile effort. When I conceived the idea of balancing backup load over time on a micro instance, I was under the false impression that AWS uses a simple reactionary formula to throttle the CPU. I thought it was:

"CPU spiked high? Cut it down and let it sneak back up."

The link Chris provided tells a much different story (thanks for that Chris!); the reality is:

"CPU spiked to high? CPU MODERATELY HIGH FOR TOO LONG? Cut it down and punish." (haha)

They have it figured out; they will not allow these micros to be used for any CPU load other than very low, occasional use. Just as Chris said, they are designed to very effectively absorb a short-term spike in load (such as, maybe, some serious database work with a maintenance script for 15 seconds) - maybe even better for that spike than a small instance - but anything that naturally falls into "moderate background work" causes the throttling just as much as an extended spike in CPU use, leaving the web server painfully slow.

Thanks for the input Chris. I hereby abandon the project :o) My plan is to do my own dumps and transfer them offsite by means other than TKLBAM, and once our budget allows for an upgrade to small instances, revisit TKLBAM as an option.
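For anyone curious, what I have in mind is nothing fancier than a regular dump piped through gzip and copied offsite - roughly like the sketch below (the paths and host are placeholders, and I haven't settled on the details yet):

#!/bin/bash
# rough sketch of a manual backup - paths and host are placeholders
OUT=/var/backups/db-$(date +%F).sql.gz
nice -n 19 mysqldump --all-databases --single-transaction | gzip > "$OUT"
scp "$OUT" backupuser@offsite.example.com:/backups/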

Rick

Chris Musty's picture

I think there will be benefit in quantifying the point of throttling.

The biggest hurdle is the amount of CPU they allow when things are cruising on the server vs when it's under load - in this case restarting seems to fix the issue.

I might look into this further, given enough time.

Chris Musty

Director

Specialised Technologies

Chris Musty's picture

OK so here we go.

The green server in the graphic below is a small Joomla appliance that served up 2754 pages during the 2.5 hours shown. This was remarkably even over the entire time period - just shy of 20 page views per minute, which equates to a background processing utilisation of about 10%. The blip to 40% (during the period 7:45 to 8:45) was a newsletter sent to 8000+ subscribers through php-mailer, 5 per batch with a 1 second pause between batches.

The blue server is a micro running Core, and I compiled stress on it last night before retiring for the night. The background processes on this server equate to low single-digit percentages, so it has the advantage here.

Stress has been used extensively by my company to stress-test servers' ability to remove heat from processors, and to test I/O and HDDs. I honestly don't know exactly what it does, but it works great - possibly just some floating point calcs or something.
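For anyone wanting to reproduce the runs below, it was nothing more exotic than the standard stress tool (I compiled mine, but it should also be available via apt) - roughly:

apt-get install stress    # or compile it from source as I did

stress --cpu 1 --timeout 60     # hog one CPU worker for 60 seconds
stress --cpu 2 --timeout 600    # two workers for 10 minutes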

The following is the timeline for each server (approx times used).

7:35 - micro bootup
7:50 - 8:00 - ran 1-CPU stress for 60 sec, 120 sec and 240 sec (note that the graph did not have the resolution to detect the periods of no processing, ie I performed the tests within 1 min of each other)
8:10 - 10 sec run on 2 CPUs (note utilisation stayed below single-CPU capacity, ie it did not spike to 2 like AWS promise)
8:50 - 6 minutes stress on 1 CPU
9:10 - 10 minutes stress on 2 CPUs (again, only a single core's worth of capacity was used)

If you are wondering what the blip on the end of the small appliance's graph is - TKLBAM!

Conclusion

Running stress slowed the server's response as expected, but did not cripple it. I ran top in a separate session to monitor while under stress, and it did not fail. Not once was the server scaled back like I have seen before. No real conclusion could be made and I need to perform more tests.

My hypothesis is that TKLBAM, an update and serving web pages together would cause enough load for AWS to start scaling. I need to prove this with another mechanism. Maybe the scaling threshold is greater than 10 mins?

Chris Musty

Director

Specialised Technologies
