
Large Backups Failing / Stalling - TurnKey Fileserver

Scott Howard

Hi Guys,

I have been successfully using the TurnKey Fileserver appliance for over 12 months now to serve files and back up the system on a daily basis. It is a bare-metal install of the Lucid appliance.

TKLBAM settings are the defaults (full backup monthly, incrementals in between, and 50 MB volsize).

Each night I run tklbam-backup to both a local machine and the remote Amazon Hub.

Recently the backup has grown to a 12-13 GB uncompressed data footprint, and I have noticed some remote backup failures, which I put down to internet dropouts.

I duplicated my install on another machine to test a full backup. While the initial backup of the system (only 1 MB, without any of my data) succeeded, the next backup, which includes my data, fails 100% of the time. The original machine now also fails 100% of the time because it has reached the full-backup stage as well.

What appears to be happening is that after doing the initial preparation, the backup starts uploading volumes but stalls after a certain number. I have repeated this on both my original machine and the duplicate machine with the same result.

The screen output hangs at the "Uploading ..." line.

Analysing the output of the 'ps aux' command indicates tklbam-backup is still running in the SL+ state, i.e. interruptible sleep, waiting for an event to happen.
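To check the same thing yourself, this is roughly what I looked at (the PID and the output line below are made up; the interesting part is the STAT column):

ps aux | grep [t]klbam
# root  1234  0.1  2.3 ...  SL+  22:10  0:05  /usr/bin/python /usr/bin/tklbam-backup
# "S" = interruptible sleep, "+" = foreground process group: the process is alive but blocked waiting on something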

'netstat' shows the Amazon socket in CLOSE_WAIT status, indicating that it has been closed at the remote end and is waiting for my socket to close.
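The socket state can be checked with something like this (the addresses and PID shown are placeholders):

netstat -tnp | grep CLOSE_WAIT
# tcp  0  0  192.168.1.10:51234  203.0.113.5:443  CLOSE_WAIT  1234/python
# CLOSE_WAIT means the remote side has already sent its FIN and closed; the local process has not yet closed its end of the socket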

After <Control-C> to kill tklbam-backup I get the following traceback ...

 

Uploading s3://s3-ap-southeast-1.amazonaws.com/tklbam-yfcamybrzeeisvc4/duplicity-inc.20110825T035002Z.to.20110829T061239Z.vol49.difftar.gpg to STANDARD Storage
Processed volume 49
Uploading s3://s3-ap-southeast-1.amazonaws.com/tklbam-yfcamybrzeeisvc4/duplicity-inc.20110825T035002Z.to.20110829T061239Z.vol50.difftar.gpg to STANDARD Storage
^CTraceback (most recent call last):
  File "/usr/bin/tklbam-backup", line 266, in <module>
    main()
  File "/usr/bin/tklbam-backup", line 239, in main
    trap = UnitedStdTrap(transparent=True)
  File "/usr/lib/python2.6/dist-packages/stdtrap.py", line 266, in __init__
    self.stdout_splice = self.Splicer(sys.stdout.fileno(), usepty, transparent)
  File "/usr/lib/python2.6/dist-packages/stdtrap.py", line 213, in __init__
    vals = self._splice(spliced_fd, usepty, transparent)
  File "/usr/lib/python2.6/dist-packages/stdtrap.py", line 175, in _splice
    events = poll.poll()
KeyboardInterrupt

Traceback (most recent call last):
  File "/usr/bin/tklbam-backup", line 266, in <module>
    main()
  File "/usr/bin/tklbam-backup", line 242, in main
    b.run()
  File "/usr/lib/tklbam/backup.py", line 311, in run
    backup_command.run(passphrase, conf.credentials)
  File "/usr/lib/tklbam/duplicity.py", line 77, in run
    exitcode = child.wait()
  File "/usr/lib/python2.6/subprocess.py", line 1170, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/usr/lib/python2.6/subprocess.py", line 465, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt

I don't really know what any of that means, but I'm hoping the devs do.

Anyway, to me it appears that the Amazon end is timing out, or sitting in some indefinite loop waiting for something at my end to happen, and then gives up and closes, while tklbam continues to run on my machine.

 There is no other problem at my location to indicate internet connection issues. I can transfer large files in other scenarios with no problems, and up to now tklbam has been working for me with no issues.

The log file is of no help as it just mirrors the screen output. I tried redirecting stderr to stdout but got nothing; that, coupled with the fact that tklbam-backup is still running, indicates to me that no error has actually occurred in the program.
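What I mean by redirecting is something along these lines (the log path is only an example):

tklbam-backup 2>&1 | tee /var/log/tklbam-manual.log
# merges stderr into stdout and writes both to a file while still printing to the screen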

A by-product of this is that I now have 1 backup showing on my hub dashboard that still indicates "First backup in progress" and there is no way to delete it.

I would appreciate any ideas from devs/others as to where and why my problems are now occurring; hopefully the traceback helps.

Thanks in anticipation and once again sorry about the length of the post.

Scott H.

Jeremy

Hey Scott

Glad to hear that things have been good up until now. Not so great to hear about your issues now though.

Sorry I don't have anything at all to add. Anything I may have suggested seems to have been ruled out by your tests...

Regardless, I can suggest that if you haven't already, you use the Hub feedback feature to at least get the devs to delete your incomplete backup. I'd be inclined to mention this thread (and post a link) in your feedback; hopefully they may have some ideas.

Chris Musty

Same Here

I have the same issues but with larger file sizes.

I have given up on using TKLBAM for fileservers for now.

Chris Musty

Director

Specialised Technologies

Chris Musty

That sounded harsh

I didn't mean to sound so scathing - I love TKL! I was just stating that I have not deployed it for file servers for a while :)

Chris Musty

Director

Specialised Technologies

Liraz Siri

This may be a Duplicity bug...

Thanks Scott for reporting this issue, and thanks to Chris for confirming he can reproduce it consistently. TKLBAM uses Duplicity on the back end. There's a fairly robust timeout/retry mechanism that should have handled any temporary error on the Amazon side, but this may be failing for some reason. I have a long overdue maintenance round for TKLBAM coming up soon, and amongst other things I'll look into upgrading the branch of Duplicity we're using. There has been a round of promising bugfixes which may fix the issue. I'll be asking you guys to help me test whether it does. If you can provide (private) access to a server where the bug can be reproduced consistently, that will be very useful.

Chris Musty

Happy to help

I will send you admin access to my affected server if you wish, where should I send the details?

Chris Musty

Director

Specialised Technologies

Alon Swartz

Replied to Hub feedback

I replied to the Hub feedback you sent with the relevant info.

Scott Howard

Thanks Liraz, Chris

Thanks for replying, guys. Alon has also contacted me regarding this. I have a TurnKey Core server running on Amazon with OpenVPN, on which I am going to install Samba and try a "backup" file transfer (a tarball of my data, ~7.5 GB) that way, to see whether it goes through to completion at that file size, and I'll report back. Naturally I'll be happy to be a guinea pig for any TKLBAM testing you need to do to resolve this issue, as I'd really prefer to keep doing things the way I've been doing them.

 

Thanks again

Scott H.

Any developments on this issue?

I am running into the same problem.  We also store large files on our server.  Some are 7GB and larger.

A disturbing additional symptom I observed was that TKLBAM seemed to retry failed transfers over and over. The impact on our office network was so severe that our VOIP phone system became unusable!  I had to disable TKLBAM nightly backups and was called to task for the business interruption.

When I examined the logs it seemed the errors might be a symptom of either request timeout or request-entity-too-large errors. I know from my development experience that trying to ram several GB of data through a single REST call (or a call to any HTTP-based API) is bound to run into one of those errors.

Often, to protect against DoS attacks, the request entity size is limited on the server side. So to support large file uploads over HTTP, the maximum allowed request data size might be increased on an application server. But there are limits to how large a size one should allow.

When a server has been configured to support large request entities as described above, request timeouts may still occur. The truth is that no request containing data that might take two minutes or more to service should ever be made. That's begging for trouble.

The solution to the problem is to do exactly what I read above: obviously it's good practice to archive many smaller files to minimize the number of network I/Os, but very large files must be split into pieces, as the example below illustrates. From what I read above, this issue was anticipated and designed for by the TKLBAM developers.
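Just to illustrate the principle with standard tools (an example of the idea only, not what TKLBAM actually does internally):

# pack a large tree into an archive and split the stream into 50 MB pieces,
# so that no single upload request has to carry gigabytes of data
tar -czf - /srv/storage | split -b 50m - backup-part-
# each resulting backup-part-* piece can then be uploaded (and retried) independently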

So I'm scratching my head.  If such logic has been incorporated these sorts of errors should not be occurring.

And I'm concerned that it appears that this thread has been dormant for quite a while.  Is anyone on the TKLBAM development team able to find time to run this issue down and resolve it?

Jeremy

What version of TKLBAM are you using?

apt-cache policy tklbam

The latest/current version is v1.4

This issue should have been resolved some time ago (hence the lack of activity on this thread...). We updated our forked version of duplicity to resolve this and a few other bugs.

TKLBAM is Liraz's baby, so he'd be able to give more detail, but AFAIK the data is broken up into chunks (default 50MB IIRC) and uploaded like that, so it should be irrelevant how big the actual files are (although obviously with a lot of data there is more chance that you will have the odd chunk fail).
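For anyone curious, the equivalent knob in duplicity itself is the --volsize option, given in MB. This is just an illustrative invocation against a local target; TKLBAM builds its own duplicity command line:

duplicity --volsize 50 /srv/storage file:///mnt/backup/fileserver
# the source tree is packed into ~50 MB volumes, each transferred as a separate request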

As we have literally thousands of TKLBAM users and you are currently the only user having this issue (that I am aware of) my suspicion is that it is something to do with your network/internet connection, although I'm only guessing... Perhaps there is some other edge case scenario going on here?

Regardless, if you could post more of your TKLBAM logs, I'll get Liraz to have a quick glance.

Liraz Siri

To fix an issue we have to be able to reproduce it first

For what it's worth, I use TKLBAM to back up our rsync master, which has nearly 1TB worth of storage. No problems, though it is running inside AWS, so that might give it smoother access to S3 than your network has.

In any case, the first step in fixing a bug is reproducing it reliably. The harder it is to reproduce a problem the harder it is to track down and fix.

If the problem has something to do with your network configuration (e.g., a misbehaving router/proxy) then that's going to be hard for me to reproduce and it might not even be something that I can fix within TKLBAM. Sometimes there are workarounds for these issues, or you can add more redundancy, sometimes not.

For what it's worth, TKLBAM uses Duplicity as the storage backend. Duplicity is fairly well tested and, like Jeremy mentioned, it breaks big files down into volumes. You can configure the volume size and that might help, I guess. Take a look at /etc/tklbam/conf if you want to try that.
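For example, something along these lines in /etc/tklbam/conf would raise the volume size (the option name here is from memory, so double-check against the comments in the file itself):

# /etc/tklbam/conf
# size of each backup volume in MB (the default is 50)
volsize 100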

Also, if you want the backups but don't want to use Duplicity, you can dump the raw backups to the local filesystem using the --dump option and then use whatever method works best under your circumstances to stash it safely somewhere. Though I do recommend incremental backups over just keeping dumb copies.
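For example (the destination path is just a placeholder):

tklbam-backup --dump=/mnt/backups/fileserver-raw
# writes the raw backup extract to that directory instead of uploading it anywhere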

Finally, you can ask Duplicity to store your data on other backends. You don't have to use AWS S3 if it doesn't work well for you.
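For reference, duplicity target URLs for a few other backends look like this (hosts and paths are placeholders, and going direct to duplicity bypasses the Hub's automatic configuration):

duplicity /srv/storage ftp://backupuser@backup.example.com/fileserver
duplicity /srv/storage sftp://backupuser@backup.example.com/fileserver
duplicity /srv/storage file:///media/usbdrive/fileserver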
