TurnKey Linux Virtual Appliance Library

Large Backups Failing / Stalling - TurnKey Fileserver

Scott Howard's picture

Hi Guys,

I have been successfully using the TurnKey Fileserver appliance for over 12 months to serve files and back up the system on a daily basis. It is a bare-metal install of the Lucid appliance.

TKLBAM settings are the defaults (full backup monthly, incrementals in between, and a 50 MB volsize).

Each night I run tklbam-backup to a local machine and to the remote Amazon hub.

Recently the backup has grown to between 12 and 13 GB of uncompressed data, and I have noticed some remote backup failures, which I put down to internet dropouts.
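For a sense of scale, the figures above imply that a full backup is split into a few hundred volumes. The sketch below shows the arithmetic, assuming 1 GB = 1024 MB and ignoring compression and encryption overhead, both of which change the real count. `estimate_volumes` is a hypothetical helper for illustration, not part of TKLBAM.

```python
import math


def estimate_volumes(data_gb, volsize_mb=50):
    """Rough number of volumes needed to hold data_gb of data.

    Assumes 1 GB = 1024 MB and no compression or encryption overhead,
    so the result is only an order-of-magnitude guide.
    """
    return int(math.ceil(data_gb * 1024.0 / volsize_mb))


for size_gb in (12, 13):
    print("%d GB -> about %d volumes" % (size_gb, estimate_volumes(size_gb)))
```

Under these assumptions a full backup of this size is somewhere around 250 to 270 uploads, so a single stalled upload is enough to hang the whole run.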

I duplicated my install on another machine to test a full backup. The initial backup of the system, without any of my data, succeeded (it was only 1 MB), but the next backup, which included my data, fails 100% of the time. The original machine now also fails 100% of the time because it has reached the full-backup stage as well.

What appears to be happening is that, after the initial preparation, the backup starts uploading volumes but stalls after a certain number. I have repeated this on both my original machine and the duplicate machine with the same result.

The screen output hangs at the "Uploading ..... etc "

Analysing the output of 'ps aux' shows tklbam-backup still running in state SL+, i.e. in interruptible sleep, waiting for an event to happen.

'netstat' shows the Amazon socket in "CLOSE_WAIT" status, indicating that Amazon has closed its end of the connection and my end has not yet closed its socket.
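The same check can be scripted. Below is a minimal sketch that lists sockets in CLOSE_WAIT by parsing text in the format of Linux's /proc/net/tcp, where state code 08 means CLOSE_WAIT. The helper names are hypothetical, only IPv4 is handled, and the address decoding assumes a little-endian host such as x86.

```python
import socket
import struct

TCP_CLOSE_WAIT = "08"  # state code for CLOSE_WAIT in /proc/net/tcp


def _decode(addr):
    """Turn 'HEXIP:HEXPORT' into (dotted_ip, port); assumes little-endian host."""
    ip_hex, port_hex = addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def close_wait_sockets(proc_net_tcp_text):
    """Return [(local, remote), ...] for every IPv4 socket in CLOSE_WAIT."""
    found = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields: sl, local_address, rem_address, st, ...
        if fields[3] == TCP_CLOSE_WAIT:
            found.append((_decode(fields[1]), _decode(fields[2])))
    return found
```

To use it, pass in the contents of /proc/net/tcp, for example `close_wait_sockets(open("/proc/net/tcp").read())`. An entry that stays in the list for a long time means the local process never closed a connection the remote side has already shut down.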

After pressing <Control-C> to kill tklbam-backup I get the following traceback:

 

Uploading s3://s3-ap-southeast-1.amazonaws.com/tklbam-yfcamybrzeeisvc4/duplicity-inc.20110825T035002Z.to.20110829T061239Z.vol49.difftar.gpg to STANDARD Storage
Processed volume 49
Uploading s3://s3-ap-southeast-1.amazonaws.com/tklbam-yfcamybrzeeisvc4/duplicity-inc.20110825T035002Z.to.20110829T061239Z.vol50.difftar.gpg to STANDARD Storage
^CTraceback (most recent call last):
  File "/usr/bin/tklbam-backup", line 266, in <module>
    main()
  File "/usr/bin/tklbam-backup", line 239, in main
    trap = UnitedStdTrap(transparent=True)
  File "/usr/lib/python2.6/dist-packages/stdtrap.py", line 266, in __init__
    self.stdout_splice = self.Splicer(sys.stdout.fileno(), usepty, transparent)
  File "/usr/lib/python2.6/dist-packages/stdtrap.py", line 213, in __init__
    vals = self._splice(spliced_fd, usepty, transparent)
  File "/usr/lib/python2.6/dist-packages/stdtrap.py", line 175, in _splice
    events = poll.poll()
KeyboardInterrupt

Traceback (most recent call last):
  File "/usr/bin/tklbam-backup", line 266, in <module>
    main()
  File "/usr/bin/tklbam-backup", line 242, in main
    b.run()
  File "/usr/lib/tklbam/backup.py", line 311, in run
    backup_command.run(passphrase, conf.credentials)
  File "/usr/lib/tklbam/duplicity.py", line 77, in run
    exitcode = child.wait()
  File "/usr/lib/python2.6/subprocess.py", line 1170, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/usr/lib/python2.6/subprocess.py", line 465, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt

I don't really know what any of that means, but I'm hoping the devs do.

Anyway, to me it appears that the Amazon end is timing out, or is in some indefinite loop waiting for something to happen at my end, and then gives up and closes the connection, while tklbam continues to run on my machine.

There is no other problem at my location to indicate internet connection issues. I can transfer large files in other scenarios with no problems, and until now tklbam has been working for me with no issues.

The log file is of no help as it just mirrors the screen output. I tried redirecting stderr to stdout but got nothing more. That, coupled with the fact that tklbam-backup is still running, indicates to me that the program itself has not reported any error.
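Because the process hangs silently rather than failing, one way to at least detect the stall is to wrap the command in a watchdog that gives up when no output has appeared for a while. The following is a minimal sketch for POSIX systems. `run_with_stall_watchdog` is a hypothetical helper, not a TKLBAM feature, and the thread does not establish whether aborting tklbam-backup mid-upload is safe.

```python
import os
import select
import subprocess
import sys


def run_with_stall_watchdog(cmd, stall_timeout):
    """Run cmd, echo its output, and kill it if it is silent too long.

    Returns (exit_code, stalled). stall_timeout is in seconds.
    Only the direct child is killed; any processes it spawned are not.
    """
    child = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                             stderr=subprocess.STDOUT)
    fd = child.stdout.fileno()
    stalled = False
    while True:
        # Wait until the child writes something or the timeout expires.
        ready, _, _ = select.select([fd], [], [], stall_timeout)
        if not ready:
            stalled = True
            child.kill()
            break
        chunk = os.read(fd, 4096)
        if not chunk:  # EOF: the child closed its output and is exiting
            break
        sys.stdout.write(chunk.decode(errors="replace"))
        sys.stdout.flush()
    child.stdout.close()
    return child.wait(), stalled
```

A call such as `run_with_stall_watchdog(["tklbam-backup"], 900)` would abort after 15 minutes without output. The timeout value is illustrative and would need to be longer than the slowest legitimate single-volume upload.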

A by-product of this is that I now have one backup on my Hub dashboard that still shows "First backup in progress", and there is no way to delete it.

I would appreciate any ideas from the devs or others as to where and why my problems are now occurring. Hopefully the traceback helps.

Thanks in anticipation and once again sorry about the length of the post.

Scott H.

Jeremy's picture

Hey Scott

Glad to hear that things have been good up until now. Not so great to hear about your issues now though.

Sorry I don't have anything at all to add. Anything I may have suggested seems to have been ruled out by your tests...

Regardless, if you haven't already, I suggest you use the Hub feedback feature to at least get the devs to delete your incomplete backup. I'd be inclined to mention this thread (and post a link) in your feedback, and hopefully they may have some ideas.

Chris Musty's picture

Same Here

I have the same issues but with larger file sizes.

I have given up on using TKLBAM for fileservers for now.

Chris Musty

Director

Specialised Technologies

Chris Musty's picture

That sounded harsh

I didn't mean to sound so scathing - I love TKL! I was just stating that I have not deployed it for file servers for a while :)

Chris Musty

Director

Specialised Technologies

Liraz Siri's picture

This may be a Duplicity bug...

Thanks Scott for reporting this issue and thanks to Chris for confirming he can reproduce it consistently. TKLBAM uses Duplicity on the back-end. There's a fairly robust timeout / retry mechanism that should have handled any temporary error on the Amazon side, but this may be failing for some reason. I have a long overdue maintenance round for TKLBAM coming up soon, and amongst other things I'll look into upgrading the branch of Duplicity we're using. There has been a round of promising bugfixes which may fix the issue. I'll be asking you guys to help me test whether it does. If you can provide (private) access to a server where the bug can be reproduced consistently, that will be very useful.
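To illustrate the general idea of a timeout / retry mechanism, here is a generic sketch of retrying an upload with exponential backoff. It is not Duplicity's actual implementation; `retry` and its parameters are hypothetical names chosen for this example.

```python
import time


def retry(operation, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or attempts run out.

    Retries only on IOError/OSError (which covers socket errors and
    socket timeouts), waiting base_delay * 2**n seconds between tries.
    The last error is re-raised if every attempt fails.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except (IOError, OSError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

A loop like this only helps if a stalled connection actually raises an error. If the underlying socket has no timeout (in Python, one can be set with `socket.setdefaulttimeout(seconds)`), a read on a dead connection can block indefinitely and the retry logic is never reached, which would look much like the hang described in this thread.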

Chris Musty's picture

Happy to help

I will send you admin access to my affected server if you wish. Where should I send the details?

Chris Musty

Director

Specialised Technologies

Alon Swartz's picture

Replied to Hub feedback

I replied to the Hub feedback you sent with the relevant info.

Scott Howard's picture

Thanks Liraz, Chris

Thanks for replying guys, Alon has also contacted me regarding this. I have a TurnKey Core server running on Amazon with OpenVPN, on which I am going to install Samba and try a "backup" file transfer this way (a tarball of my data, about 7.5 GB) to see whether a file of that size will go through to completion, and I'll report back. Naturally I'll be happy to be a guinea pig for any tklbam testing that you need to do to resolve this issue, as I'd really prefer to keep doing things the way I've previously been.

 

Thanks again

Scott H.
