Indefinite restore time to hub Amazon instance

David McNeill - Wed, 2017/04/05 - 22:26

So, needed to restore a 100k file from yesterday's nightly tklbam.

Disk space is a bit tight locally, the backup image is 400GB, and duplicity seems to want to go back to the previous full backup and work its way forward to the target file.

So, restore to Amazon server instance seems like a good idea. Couple of clicks and it's on its way.

12 days later, it's still going...

https://hub.turnkeylinux.org/servers/ dashboard says "Restoring backup... (spinner)"

I can ssh into the instance, and it's running. However, it only has its own small 10GB root partition. Who knows where it's putting all the data it's restoring. Perhaps on another volume that it will bring online when the restore completes. And who knows what machine is running the mystical restore process, as it doesn't appear in the process list of the target machine. Perhaps hub's own control server.

There don't appear to be any other controls in either hub or AWS to see what's going on, or to set things up differently when initiating the restore.

So the restore rate seems to be less than 400 kB/s (400GB over 12 days, and it still hasn't finished), which makes the whole hub server restore fairly useless for reasonably large backups. 12 days later, the original file user is getting annoyed.
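The arithmetic behind that estimate, as a rough sketch. It assumes decimal units (400GB = 400 x 10^9 bytes) and treats the restore as if it had completed at exactly the 12-day mark, so the figure is an upper bound on the real rate:

```python
# Upper bound on the average restore rate, assuming the whole 400GB
# backup had been transferred in exactly 12 days (it had not finished).
BACKUP_BYTES = 400 * 10**9            # assumption: decimal gigabytes
ELAPSED_SECONDS = 12 * 24 * 60 * 60   # 12 days


def average_rate_kb_per_s(total_bytes: int, seconds: int) -> float:
    """Average throughput in kilobytes (1000 bytes) per second."""
    return total_bytes / seconds / 1000


rate = average_rate_kb_per_s(BACKUP_BYTES, ELAPSED_SECONDS)
print(f"{rate:.0f} kB/s")  # prints "386 kB/s"
```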

Any clues anyone?

Forum: Support

Tags: tklbam, restore


Hmmm, that sounds like a real pain!

Jeremy Davis - Fri, 2017/04/07 - 13:26

The behaviour of TKLBAM going back to an old backup and then processing newer and newer backups is expected behaviour. It's because TKLBAM (by default) does monthly full backups and daily incremental backups. An incremental backup is essentially just the changes that have occurred. For text files, that will likely be in the form of a diff. And a diff has no real value, unless you have the original file (and the full chain of diffs between the latest diff and the original file).
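Why the whole chain is needed can be shown with a toy model. This is only an illustration, not how TKLBAM or duplicity store data: each increment here records whole-file replacements and deletions rather than real diffs, and all paths and contents are made up.

```python
from typing import Dict, List, Optional

# A snapshot maps path -> content; None in an increment marks a deletion.
Snapshot = Dict[str, str]
Increment = Dict[str, Optional[str]]


def restore(full: Snapshot, increments: List[Increment]) -> Snapshot:
    """Rebuild the latest state by replaying every increment, oldest first."""
    state = dict(full)
    for inc in increments:
        for path, content in inc.items():
            if content is None:
                state.pop(path, None)   # file was deleted in this increment
            else:
                state[path] = content   # file was added or changed
    return state


# Hypothetical monthly full backup followed by two daily increments.
full = {"/etc/motd": "v1", "/srv/report.txt": "draft"}
increments = [
    {"/srv/report.txt": "draft 2"},
    {"/etc/motd": None, "/srv/report.txt": "final"},
]

latest = restore(full, increments)
print(latest)  # {'/srv/report.txt': 'final'}
```

Neither the full backup alone nor the last increment alone yields the latest state of every file, which is why a restore has to start from the full backup and walk forward.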

Have you checked whether the instance is full? I suspect that the instance has run out of space and TKLBAM has failed. For some reason, it was unable to communicate that to the Hub. I would suggest that you check the TKLBAM log on the instance (/var/log/tklbam-backup).

As for your mention of the lack of configuration for the server which you launch (to restore your backup to), you are absolutely right! It's certainly a flaw. Unfortunately, at this point the only way to work around that is to manually launch a server, then log into it and manually trigger the restore.

So in the short term, you'll need to launch a new server with sufficient free space for the full backup, plus some headroom. Personally, I'd be inclined to go 2x backup size as a bare minimum, although it will depend on what is in the backup. Text files compress really well (up to 95%) so can explode when restored. Other files which are essentially already compressed, like pictures, movies and music (e.g. jpeg, gif, mp3, avi, etc.), barely compress at all.
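A quick way to sanity-check a volume before starting a restore, as a minimal sketch. The 2x factor is just the rule of thumb above, and the function names are hypothetical helpers, not part of TKLBAM:

```python
import shutil


def has_room_for_restore(free_bytes: int, backup_bytes: int, factor: float = 2.0) -> bool:
    """True if free space is at least `factor` times the backup size."""
    return free_bytes >= backup_bytes * factor


def check_path(path: str, backup_bytes: int, factor: float = 2.0) -> bool:
    """Apply the same check to the filesystem that holds `path`."""
    free = shutil.disk_usage(path).free
    return has_room_for_restore(free, backup_bytes, factor)


# A 10GB volume (even if it were completely empty) against a 400GB backup.
print(has_room_for_restore(10 * 10**9, 400 * 10**9))  # False
```

On the target server, `check_path("/", 400 * 10**9)` would run the same test against the root filesystem.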

Also if you ensure that you launch your server in the same region as your backup is stored (check the backup record), that will make the restore run as fast as you're likely to get.

Moving forward, a 400GB backup is MASSIVE! Perhaps it's worth trying to tune it a bit? FWIW all of mine are less than 500MB, many less than 100MB. Once a backup gets much bigger than that, I usually either tune it to trim down cruft or create additional backup sets. It sounds like your usage may be a great candidate for that.

You can do it a few different ways. One way is to keep one backup set as you currently do (i.e. a full backup) and add a second that only contains files that change fairly often and are more likely to need restoration. You can read more about that idea in the TKLBAM FAQ. Alternatively, depending on where/what all the big files are, perhaps it would be easier to create a full backup which excludes the big files, and a separate backup with just the big files?
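To find out where the big files are before deciding how to split things, something like the following sketch could help. It only reports sizes per top-level entry of a directory; it does not touch TKLBAM's configuration, and the function names are hypothetical:

```python
import os
from typing import Dict, List, Tuple


def size_by_top_level(root: str) -> Dict[str, int]:
    """Total bytes of regular files under each immediate child of root."""
    totals: Dict[str, int] = {}
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            total = 0
            for dirpath, _dirs, files in os.walk(entry.path):
                for name in files:
                    path = os.path.join(dirpath, name)
                    if os.path.islink(path):
                        continue  # do not count symlink targets
                    try:
                        total += os.path.getsize(path)
                    except OSError:
                        pass  # file vanished or is unreadable; skip it
            totals[entry.name] = total
        elif entry.is_file(follow_symlinks=False):
            totals[entry.name] = entry.stat(follow_symlinks=False).st_size
    return totals


def largest(root: str, count: int = 10) -> List[Tuple[str, int]]:
    """The `count` biggest top-level entries, largest first."""
    items = size_by_top_level(root).items()
    return sorted(items, key=lambda kv: kv[1], reverse=True)[:count]


if __name__ == "__main__":
    for name, size in largest("."):
        print(f"{size / 10**6:10.1f} MB  {name}")
```

Directories that dominate the list are the natural candidates for a separate backup set.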

No more detail

David McNeill - Sat, 2017/04/08 - 02:01

Thanks for your considered reply.

Yes, I've checked many times whether the instance is full, or even changing at all...

df
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/xvda2      10186040 1140080   8505496  12% /

There is no /var/log/tklbam-backup. Perhaps because the restore was running elsewhere? I'm still not clear on where it actually runs, or whether it ran, failed and left a stale "running" tag.

I'd consider 500MB backups test loads, or just configuration. 400GB is real user data: 100 staff in a complex business with lots of photos, detailed documents etc.

Perhaps you can see the typical scale of actual user backups across the whole hub? What is the average size? How frequently are people using or testing restore?

Yes, I'll reorganise the backup into smaller chunks, and set up a local rsync for quick day to day file restore requirements. Then the remote tklbam would only be for full disaster recovery.

I see you've dropped automated restore to hub AWS server instance anyway, perhaps as you refactor it.

I'm proceeding with manually starting an instance, and restoring to that.

How'd you end up going?

Jeremy Davis - Fri, 2017/04/21 - 01:50

Your experience all sounds a bit strange.

I'm guessing that you used the Hub to "restore to new cloud server" (i.e. clicked that button from the relevant backup record within the Hub)? Assuming so, then it should have launched a new server (which should then show in your server list; booting, installing updates, restoring backup, etc). It sounds like that bit all happened according to plan. Except that something obviously went wrong. Considering the size of your backup (and the flaw in the Hub that doesn't allow you to set the default volume size), I'm not surprised it failed. However, it should have filled the server with your backup until it failed and there certainly should have been a log!

Although I just realised that I made a mistake! The log file should be /var/log/tklbam-restore (not tklbam-backup). Deep apologies for the misdirection! I'm guessing that the server is long gone by now, but if it's not, then it'd be useful to have a quick look at that log, just to confirm that it failed because of backup size.

Regarding your question about other users' backups, I don't have hard stats, but I do have anecdotal evidence. From what I've seen, you are one of very few with backups that large. As the info that I see only gives total storage size and number of records, I can only guess at the actual backup sizes. But I would guess that most individual "full backup" sessions are less than 1GB.
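If the server is still around, a small sketch like this would show the end of that log, or nothing at all if the file does not exist. The `tail` helper is hypothetical; only the log path comes from this thread:

```python
from collections import deque
from pathlib import Path
from typing import List


def tail(path: str, lines: int = 20) -> List[str]:
    """Return the last `lines` lines of a text file, or [] if it is missing."""
    p = Path(path)
    if not p.is_file():
        return []
    with p.open(errors="replace") as fh:
        # deque with maxlen keeps only the final lines while streaming the file.
        return [line.rstrip("\n") for line in deque(fh, maxlen=lines)]


if __name__ == "__main__":
    for line in tail("/var/log/tklbam-restore"):
        print(line)
```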

As for testing backups, we always do preliminary TKLBAM testing for each release we provide. But generally I would suggest that many users (probably most) don't check their "real world" backups as often as they should (IMO). IMO backups have limited value if they aren't tested regularly. In a perfect world, I'd be inclined to do monthly tests, however I don't think I've ever done them that regularly.

As for the automated restore being "dropped" in the Hub, that is only supported for versions of TurnKey provided by the Hub. When we release a new version, the Hub is one of the first places to get it. In the past we have kept the old version around for a little while, but in more recent times, we have just replaced the old version with the new. So I suspect that whatever appliance you are using, there has been a new version released. Currently only Core has been released for v14.2 so I'm guessing that's what your server is based on, or perhaps you were looking at an older backup record when you noticed that?
