Hi Guys,
I have been successfully using the TurnKey Fileserver appliance for over 12 months to serve files and back up the system on a daily basis. It's a bare-metal install of the Lucid appliance.
TKLBAM settings are the defaults (full backup monthly, incrementals in between, and 50 MB volsize).
Each night I run tklbam-backup both to a local machine and to the remote Amazon hub.
Recently the backup has grown to a 12-13 GB uncompressed data footprint, and I have noticed some remote backup failures, which I put down to internet dropouts...
I duplicated my install on another machine to test a full backup. While the initial backup of the system without my data succeeded (it was only 1 MB), the next backup including my data fails 100% of the time, and the original machine now also fails 100% of the time because it has reached the full backup stage as well.
What appears to be happening is that after the initial preparation the backup starts uploading volumes, but stalls after a certain number. I have repeated this on both my original machine and the duplicate with the same result.
The screen output hangs at the "Uploading ..." line.
Analysing the output of 'ps aux' indicates that tklbam-backup is still running in state Sl+, i.e. interruptible sleep, waiting for an event to happen.
'netstat' shows the Amazon socket in CLOSE_WAIT status, indicating that the remote end has closed and is waiting for my side to close the socket.
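For reference, these are roughly the commands I used to check (the grep patterns are just illustrative):

    # STAT column shows Sl+ (interruptible sleep, foreground)
    ps aux | grep '[t]klbam'
    # look for the S3 connection stuck half-closed in CLOSE_WAIT
    netstat -ntp | grep CLOSE_WAIT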
After a <Control-C> to kill tklbam-backup I get the following traceback:
Uploading s3://s3-ap-southeast-1.amazonaws.com/tklbam-yfcamybrzeeisvc4/duplicity-inc.20110825T035002Z.to.20110829T061239Z.vol49.difftar.gpg to STANDARD Storage
Processed volume 49
Uploading s3://s3-ap-southeast-1.amazonaws.com/tklbam-yfcamybrzeeisvc4/duplicity-inc.20110825T035002Z.to.20110829T061239Z.vol50.difftar.gpg to STANDARD Storage
^CTraceback (most recent call last):
  File "/usr/bin/tklbam-backup", line 266, in <module>
    main()
  File "/usr/bin/tklbam-backup", line 239, in main
    trap = UnitedStdTrap(transparent=True)
  File "/usr/lib/python2.6/dist-packages/stdtrap.py", line 266, in __init__
    self.stdout_splice = self.Splicer(sys.stdout.fileno(), usepty, transparent)
  File "/usr/lib/python2.6/dist-packages/stdtrap.py", line 213, in __init__
    vals = self._splice(spliced_fd, usepty, transparent)
  File "/usr/lib/python2.6/dist-packages/stdtrap.py", line 175, in _splice
    events = poll.poll()
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/bin/tklbam-backup", line 266, in <module>
    main()
  File "/usr/bin/tklbam-backup", line 242, in main
    b.run()
  File "/usr/lib/tklbam/backup.py", line 311, in run
    backup_command.run(passphrase, conf.credentials)
  File "/usr/lib/tklbam/duplicity.py", line 77, in run
    exitcode = child.wait()
  File "/usr/lib/python2.6/subprocess.py", line 1170, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/usr/lib/python2.6/subprocess.py", line 465, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt
I don't really know what any of that means, but I'm hoping the devs do.
Anyway, to me it appears that the Amazon end is timing out, or is stuck in some indefinite loop waiting for something at my end to happen, and then gives up and closes, while tklbam continues to run on my machine.
There is no other problem at my location to suggest internet connection issues: I can transfer large files in other scenarios with no problems, and up to now TKLBAM has been working for me without issue.
The log file is no help as it just mirrors the screen output. I tried redirecting stderr to stdout but got nothing, which, coupled with the fact that tklbam-backup is still running, suggests to me that no error has actually occurred in the program.
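For what it's worth, the redirect I tried was along these lines (the log path is just an example):

    # capture both stdout and stderr to a file while still watching the screen
    tklbam-backup 2>&1 | tee /root/tklbam-debug.log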
A by-product of this is that I now have one backup showing on my Hub dashboard that still says "First backup in progress", and there is no way to delete it.
I would appreciate any ideas from devs/others as to where and why my problems are occurring; hopefully the traceback helps.
Thanks in anticipation, and once again sorry about the length of the post.
Scott H.
Hey Scott
Glad to hear that things have been good up until now. Not so great to hear about your issues now though.
Sorry, I don't have anything at all to add. Anything I might have suggested seems to have been ruled out by your tests...
Regardless, I can suggest that if you haven't already, you use the Hub feedback feature to at least get the devs to delete your incomplete backup. I'd be inclined to mention this thread (and post a link) in your feedback; hopefully they may have some ideas.
Same Here
I have the same issues but with larger file sizes.
I have given up on using TKLBAM for fileservers for now.
Chris Musty
Director
Specialised Technologies
That sounded harsh
I didn't mean to sound so scathing - I love TKL! I was just stating that I haven't deployed it for file servers for a while :)
Chris Musty
Director
Specialised Technologies
This may be a Duplicity bug...
Happy to help
I will send you admin access to my affected server if you wish; where should I send the details?
Chris Musty
Director
Specialised Technologies
Replied to Hub feedback
I replied to the Hub feedback you sent with the relevant info.
Thanks Liraz, Chris
Thanks for replying, guys. Alon has also contacted me regarding this. I have a TurnKey Core server running on Amazon with OpenVPN, on which I am going to install Samba and try a "backup" file transfer that way (a tarball of my data, ~7.5 GB) to see if it will run to completion at this file size, and I'll report back. Naturally I'll be happy to be a guinea pig for any TKLBAM testing you need to do to resolve this issue, as I'd really prefer to keep doing things the way I have been.
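The test I have in mind is something like this (the VPN address, share name and paths are just placeholders):

    # mount the Samba share exported by the EC2 server over the VPN
    mount -t cifs //10.8.0.1/backup /mnt/remote -o username=backupuser
    # write a tarball of the data straight onto the share
    tar czf /mnt/remote/data.tar.gz /srv/storage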
Thanks again
Scott H.
What version of TKLBAM are you using?
The latest/current version is v1.4
This issue should have been resolved some time ago (hence the lack of activity on this thread...). We updated our forked version of duplicity to resolve this and a few other bugs.
TKLBAM is Liraz's baby so he'd be able to give more detail, but AFAIK the data is broken up into chunks (default 50MB IIRC) and uploaded like that, so the size of the actual files should be irrelevant (although obviously with a lot of data there is more chance that the odd chunk will fail).
As we have literally thousands of TKLBAM users and you are currently the only one having this issue (that I am aware of), my suspicion is that it is something to do with your network/internet connection, although I'm only guessing... Perhaps there is some other edge-case scenario going on here?
Regardless, if you could post more of your TKLBAM logs, I'll get Liraz to have a quick glance.
To fix an issue we have to be able to reproduce it first
In any case, the first step in fixing a bug is reproducing it reliably. The harder it is to reproduce a problem the harder it is to track down and fix.
If the problem has something to do with your network configuration (e.g., a misbehaving router/proxy) then that's going to be hard for me to reproduce, and it might not even be something I can fix within TKLBAM. Sometimes there are workarounds for these issues, or you can add more redundancy; sometimes not.
For what it's worth, TKLBAM uses Duplicity as the storage backend. Duplicity is fairly well tested and, as Jeremy mentioned, it breaks big files down into volumes. You can configure the volume size and that might help, I guess. Take a look at /etc/tklbam/conf if you want to try that.
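A smaller volume size means each failed chunk costs less to retry. A sketch of what that might look like in /etc/tklbam/conf (check the comments in the file itself for the exact option names and defaults on your version):

    # /etc/tklbam/conf
    # volume size in MB (default is 50); smaller volumes mean cheaper retries
    volsize 25
    # frequency of full backups (1M = monthly, the default)
    full-backup 1M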
Also, if you want the backups but don't want to use Duplicity, you can dump the raw backup to the local filesystem using the --dump option, and then use whatever method works best under your circumstances to stash it safely somewhere. Though I do recommend incremental backups over just keeping dumb copies.
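Something like this, assuming the destination directory exists (the path is just an example):

    # dump a raw backup extract locally instead of uploading it anywhere
    tklbam-backup --dump=/mnt/backups/extract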
Finally, you can ask Duplicity to store your data on other backends; you don't have to use AWS S3 if it doesn't work well for you.
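If I remember right, you can point tklbam-backup at a custom target with the --address option, using any URL scheme Duplicity supports (the host and path here are hypothetical; check tklbam-backup --help for the exact usage):

    # back up to an SSH target instead of the Hub-provisioned S3 bucket
    tklbam-backup --address scp://user@backup.example.com/tklbam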