I am testing a restore of a v14 Fileserver appliance (with some extra packages) to a v15 Fileserver. The backup is approx 80GB, managed by the Hub. The target is an AWS t2.medium instance with 600GB storage.

After more than 36 hours, tklbam-restore seems stuck. The last lines in the log show:

  chown -h www-data:www-data /var/www/ajaxplorer/data/plugins/auth.serial/jacob
  chown -h www-data:www-data /var/www/ajaxplorer/data/plugins/auth.serial/root
  chown -h www-data:www-data /var/www/ajaxplorer/data/tmp/update

There is enough disk space available:

root@fileserver /# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           396M   43M  353M  11% /run
/dev/xvda2      550G  287G  242G  55% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup

But tklbam-restore is still hogging the CPU:

top - 10:20:56 up 1 day, 15:03,  3 users,  load average: 1.20, 1.06, 1.02
Tasks: 123 total,   2 running, 121 sleeping,   0 stopped,   0 zombie
%Cpu(s): 37.9 us, 12.1 sy,  0.0 ni, 49.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.2 st
KiB Mem :  4049088 total,  1856228 free,   282844 used,  1910016 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  3457636 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1433 root      20   0  255432  70180   3588 R 100.0  1.7   2317:00 tklbam-restore
  765 root      20   0  213680  15800   6312 S   0.3  0.4   0:35.99 fail2ban-server
 5003 root      20   0   41332   3388   2732 R   0.3  0.1   3:44.81 top
    1 root      20   0   57488   7192   5304 S   0.0  0.2   0:04.15 systemd

I/O and network traffic only spiked during the first few hours and are almost zero now.

What is keeping tklbam-restore so busy? Why hasn't it finished or errored? What else can I check?

Jeremy Davis:

Could you provide any further info?

Regardless, my first thought is that perhaps you've run out of free space on your HDD? To check that:

df -h

If that's not the case, please post back with as much info as you can.

Odd, I provided extensive info in my opening post, but somehow I can't see it anymore. Well, here we go again:

I have a v14 Fileserver appliance (with some mods) running on bare metal. I intend to replace my server and upgrade to v15 Fileserver on VMware. The backup is approx 80GB. I want to test the procedure first, so I started a t2.medium instance on AWS with a 600GB HDD. tklbam-restore has been running for over 36 hours now. After peaking in the first few hours, it now shows very little I/O or network activity.

Last lines in /var/log/tklbam-restore:

chown -h www-data:www-data /var/www/ajaxplorer/data/plugins/auth.serial/jacob
chown -h www-data:www-data /var/www/ajaxplorer/data/plugins/auth.serial/root
chown -h www-data:www-data /var/www/ajaxplorer/data/tmp/update

Output of df -h:

root@fileserver /# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           396M   43M  353M  11% /run
/dev/xvda2      550G  287G  242G  55% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup

tklbam-restore keeps hogging the CPU:

top - 16:40:36 up 1 day, 21:23,  3 users,  load average: 1.00, 1.00, 1.00
Tasks: 123 total,   2 running, 121 sleeping,   0 stopped,   0 zombie
%Cpu(s): 37.9 us, 12.0 sy,  0.0 ni, 49.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.2 st
KiB Mem :  4049088 total,  1892076 free,   275680 used,  1881332 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  3462884 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1433 root      20   0  255432  70180   3588 R  99.7  1.7   2696:39 tklbam-restore
  652 stunnel4  20   0  579080   6812   4448 S   0.3  0.2   0:01.60 stunnel4
  667 shellin+  20   0   37456   3316   2552 S   0.3  0.1   0:00.30 shellinaboxd
 5003 root      20   0   41332   3504   2732 R   0.3  0.1   4:21.04 top

Why is tklbam so busy? What is it doing? Why doesn't it finish or throw an error? What else can I check?

Jeremy Davis:

Firstly, I've just finished processing a new release of TKLBAM that fixes the issue where paths containing a space in the /etc/tklbam/overrides file caused a failure. That is fixed as of v1.4.2. It's not yet uploaded to the Stretch/v15.x repository, but you can get it from GitHub (the install notes are there too).

Regarding the current issue of the restore taking a long time, TBH I'm not at all sure what might be causing it. But the length of time it's taking certainly seems very unreasonable! Unfortunately, as you've noticed, TKLBAM doesn't do a very good job of reporting exactly what it's up to, so it's unclear why it may have stalled.
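TKLBAM may not say much, but on Linux you can at least check from the outside whether a busy process is still writing anything. A minimal sketch, assuming a Linux /proc filesystem with per-process I/O accounting (io_write_delta is a name made up for this example, not a TKLBAM command), that samples the wchar counter in /proc/PID/io twice and prints the difference:

```shell
# Hypothetical helper (not part of TKLBAM): print how many bytes a process
# passed to write() during an interval, using the "wchar" counter in
# /proc/PID/io. Linux only; run as root to inspect root-owned processes.
#   $1  PID to inspect
#   $2  sampling interval in seconds (default 10)
io_write_delta() {
    pid=$1
    secs=${2:-10}
    before=$(awk '/^wchar:/ {print $2}' "/proc/$pid/io") || return 1
    sleep "$secs"
    after=$(awk '/^wchar:/ {print $2}' "/proc/$pid/io") || return 1
    echo $((after - before))
}
```

For example, io_write_delta 1433 60 (1433 being the tklbam-restore PID shown in top). If it keeps printing 0 while top shows ~100% CPU, the process is busy without writing any files; attaching strace -p 1433 would then show which system calls, if any, it is making.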

I'm not convinced that it's likely, but considering the amount of CPU it's using, I wonder if you've used up all your CPU credits and are being throttled to the "baseline" CPU performance? AWS T2 instances have what is referred to as "burstable CPU performance". That means that, unlike traditional instances, they have a fairly low "baseline" level of CPU performance, but with the ability to burst well above it. The ability to "burst" is controlled by a system of "CPU credits"; one CPU credit is equal to one vCPU running at 100% utilization for one minute. t2.medium servers start with 60 "launch credits" and after that earn credits at 24 per hour (to a maximum of 576 credits). t2.medium servers have 2 vCPUs and TKLBAM will (mostly) run on a single thread. I'm only guessing, but if your backup is particularly large, perhaps your CPU credits have been exhausted? Because the CPU is running so hard, the credit balance has never recovered, so it's running at the "baseline" rate (essentially throttled).
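To put rough numbers on the credit theory, here's a back-of-the-envelope sketch using the t2.medium figures above (credit_minutes is just an illustrative name). It assumes a steady load and treats credit earning as continuous, so it's only an approximation:

```shell
# Rough estimate of how many minutes a T2 credit balance lasts under a
# steady load. One CPU credit = one vCPU at 100% for one minute.
#   $1  current credit balance (e.g. 60 launch credits, max 576 on t2.medium)
#   $2  credits earned per hour (24 on t2.medium)
#   $3  number of vCPUs pegged at 100% (1 for a single-threaded process)
credit_minutes() {
    balance=$1 earn=$2 busy=$3
    burn=$((busy * 60))      # credits spent per hour
    net=$((burn - earn))     # net drain per hour
    if [ "$net" -le 0 ]; then
        echo never           # earning keeps pace with spending
    else
        echo $((balance * 60 / net))   # integer minutes, rounded down
    fi
}
```

With one vCPU pegged, credit_minutes 60 24 1 gives 100 (the launch credits last under 2 hours) and credit_minutes 576 24 1 gives 960 (even a full balance lasts only about 16 hours). After that the instance is held to its baseline: 24 credits per hour works out to 40% of one vCPU.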

If that seems possible, then you may just need to wait. Beyond that, you could retry and break the restore into 2 distinct parts (with a bit of a rest in between): first download the backup, then restore it. FWIW, that's probably the best way to go when migrating between major versions anyway, as there may well be components that are best left out. E.g. v14.x Fileserver came with Ajaxplorer, but v15.x uses WebDAV-CGI (which leverages the Samba user accounts).

So to download your backup:

mkdir /tklbam-dump
tklbam-restore BACKUP_ID --raw-download="/tklbam-dump"

(Where BACKUP_ID is the actual ID of your backup.) Then you can restore from the downloaded backup. E.g. to try a full restore and see how it goes:

tklbam-restore /tklbam-dump

If you have problems, you can roll it all back:

tklbam-restore-rollback

Then to try again, limited to some particular paths, include them (as a space-separated list) via the --limits switch, e.g.:

tklbam-restore /tklbam-dump --limits="/path/to/restore /another/path/to/restore /and/so/on"

Alternatively, exclude particular paths (prefixed with '-'), similar to how you do via the overrides file:

tklbam-restore /tklbam-dump --limits="-/path/to/exclude -/another/path/to/exclude -/and/so/on"

You can mix and match exclusions and inclusions to pinpoint specific files/directories. Please note, though, that rollback data is only stored for a single restore. Generally that's all you need, but if you do a staged restore, it's important to keep in mind.

I remembered the CPU throttling from a while ago. That's why I chose a bigger instance and enabled the T2/T3 unlimited option. That allows applications to burst beyond the baseline for as long as needed, at any time. You just get billed for the extra usage if you can't offset it with CPU credits.

t2.medium has 2 vCPUs and TKLBAM is mostly single-threaded. That did show: usage was 100% across both CPUs while it was really working on the restore. But after that it was 50% overall, with tklbam close to 100% on one CPU. After more than 48 hours I killed tklbam-restore.

I ran the whole restore again on a t2.small, also with unlimited enabled. That ended with: "We're done. You might want to reboot now to reload all service configurations." I didn't change anything else...

Splitting the download and the actual restore will help me a lot when I move my testing to the VM host.

I think I will need to exclude some packages from the restore. The first thing I noticed is that the shiny new TurnKey-branded Webmin on the v15 appliance gets destroyed by my restore :( And I never used Ajaxplorer, so I want to exclude that too.

Do I need to make a config file containing a list of exclusions? And is that by pathname, or is it different for Webmin and Ajaxplorer?

Jeremy Davis:

Ah ok, so it sounds like you were way ahead of me on the potential for CPU throttling. So TKLBAM on the t2.medium obviously "got stuck" somehow/somewhere. Unfortunately, I have no idea what might have gone wrong, but I'm glad to hear that it worked ok on the t2.small instance.

Re splitting the backup, I'm glad to hear that you will likely find that useful. FWIW the path you use to dump the backup files to (and restore from) doesn't really matter. But my suggestion of /tklbam-dump makes it clear exactly what it is and ensures that it won't automatically be included in new backups on the new server. So I'll continue to use that in examples. If you use a different path, obviously adjust my examples as required.

Regarding the shiny new Webmin theme disappearing, I expect that is because the default Webmin config (/etc/webmin) is being overwritten from your backup. It's your call on whether you exclude that from the backup on your original server via an override (and rerun a backup) or exclude it from the restore.

To do the former, add '-/etc/webmin' to your overrides file as per usual. To do the latter (assuming 2-part restore as noted previously):

tklbam-restore /tklbam-dump --limits="-/etc/webmin"

Also, if you haven't installed any custom packages, then it might be worth considering skipping the package install step. You do that via the '--skip-packages' switch. E.g. (assuming that you're also skipping Webmin conf as above):

tklbam-restore /tklbam-dump --skip-packages --limits="-/etc/webmin"

That way you'll just get the default set of packages that v15.x ships with. Unfortunately, regarding packages there isn't any middle ground; it's all or nothing. However, if you split the restore into 2 steps as discussed, you can view the packages it will want to install like this:

cat /tklbam-dump/TKLBAM/newpkgs

Note that if there are no additional custom packages installed, that will throw an error like this:

cat: /tklbam-dump/TKLBAM/newpkgs: No such file or directory

Feel free to share the list of packages if you're unsure on what (if any) to install.
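A small sketch that wraps that check and reports either the package count and list, or that there are none. It assumes newpkgs holds one package name per line and that the dump lives at /tklbam-dump unless you pass another path; show_newpkgs is just a name chosen for this example:

```shell
# Hypothetical helper: summarise the extra packages a TKLBAM dump would
# install. Assumes one package name per line in TKLBAM/newpkgs.
#   $1  dump directory (default /tklbam-dump)
show_newpkgs() {
    dump=${1:-/tklbam-dump}
    list="$dump/TKLBAM/newpkgs"
    if [ -f "$list" ]; then
        count=$(grep -c . "$list")   # count non-empty lines
        echo "$count package(s) would be installed:"
        cat "$list"
    else
        echo "No extra packages recorded in $dump"
    fi
}
```

Running show_newpkgs with no argument checks /tklbam-dump, which is handy for deciding whether --skip-packages is worth using.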

Re Ajaxplorer (or any other stuff you don't want), as per the Webmin config above, you can explicitly exclude it by adding it as a "limit". Note that the limits on restore are a space-separated list enclosed in double quotes. E.g. to add a new limit of /some/path/to/exclude:

tklbam-restore /tklbam-dump --skip-packages --limits="-/etc/webmin -/some/path/to/exclude"

It's worth noting that you can mix and match inclusions and exclusions to pinpoint specific files and/or directories. E.g. to exclude all of /some/path/to/exclude, except for /some/path/to/exclude/file-to-keep:

tklbam-restore /tklbam-dump --skip-packages --limits="-/etc/webmin -/some/path/to/exclude /some/path/to/exclude/file-to-keep"

Alternatively, if you want to do a minimalist restore, you could go the other way and rather than excluding paths you don't want, you could just ensure that you only restore explicit paths. I'd love to give you an example that should probably work for you, but I don't have a fileserver handy and don't really want to risk leading you astray. But off the top of my head, the files that you'll most likely want to keep when transferring are your samba config files (/etc/samba & /var/lib/samba), Linux user account info (/etc/passwd & /etc/shadow) as well as your fileserver files (/srv/storage). Assuming that is enough (it may not be...):

tklbam-restore /tklbam-dump --skip-packages --limits="/etc/samba /var/lib/samba /etc/passwd /etc/shadow /srv/storage"
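Since the files in the dump restore relative to / (apart from the TKLBAM directory), you can sanity-check a --limits list against the dump before restoring. A minimal sketch (missing_in_dump is a hypothetical name); a path it reports may simply mean TKLBAM had nothing to back up there, for example because it was unchanged from the original install, so treat the output as a hint rather than an error:

```shell
# Hypothetical helper: report limit paths that don't exist in a TKLBAM dump.
# A leading '-' (exclusion) is stripped before checking.
#   $1     dump directory
#   $2...  limit paths, e.g. /etc/samba -/etc/webmin
# Returns 1 if any path is missing, 0 otherwise.
missing_in_dump() {
    dump=$1
    shift
    status=0
    for p in "$@"; do
        case $p in -*) p=${p#-} ;; esac
        if [ ! -e "$dump$p" ]; then
            echo "not in dump: $p"
            status=1
        fi
    done
    return $status
}
```

E.g. missing_in_dump /tklbam-dump /etc/samba /var/lib/samba /etc/passwd /etc/shadow /srv/storage lists anything from the minimalist example that isn't present in the dump.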

FWIW, if you want to see what files are included in your backup dump (e.g. to tune the above command), 'tree' is a pretty cool way to view that via a terminal. You'll need to install it first:

apt update && apt install tree

Then to view all files within your tklbam dump:

tree -a /tklbam-dump

With the exception of the TKLBAM directory (which is the internal tklbam backup info), all the files in there will restore relative to /. E.g. "/tklbam-dump/etc/samba" will restore as "/etc/samba". Hopefully that gives you plenty of info to work with. You may also find the doc (aka man) pages for tklbam-backup and tklbam-restore useful.
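If you'd rather not install tree, find can produce a similar listing, already translated to where each file will land. A minimal sketch (restore_targets is a hypothetical name) that skips the TKLBAM metadata directory and strips the dump prefix so each line shows the restore path under /:

```shell
# Hypothetical helper: list where each file in a TKLBAM dump would be
# restored, skipping the internal TKLBAM directory.
#   $1  dump directory (default /tklbam-dump)
restore_targets() {
    dump=${1:-/tklbam-dump}
    dump=${dump%/}    # drop a trailing slash so prefix stripping works
    find "$dump" -mindepth 1 -path "$dump/TKLBAM" -prune -o -print |
        sed "s|^$dump||" | sort
}
```

E.g. restore_targets | grep samba shows every Samba-related path that a full restore would write.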

If you need any further clarification or you have any further questions, please ask.

Thanks Jeremy. I have a love/hate relationship with tklbam. When it works, it is perfect. But when she's not in the mood, it's hard to understand why, and she doesn't give many clues ;)

It will probably take me a week or so before I can get back to testing, but I will let you know how it goes.

Jeremy Davis:

I agree it would certainly be better if it gave more feedback. It's overdue for some love really, so I'd certainly like to see that included in any future works (among other things...).

I can't promise too much at this stage, but one way or another I really hope to be able to get some work done on it in the not too distant future...

Regardless, please post back when you circle back around to this. I'm more than happy to help out as best I can.

I have over 200 new packages/libs. I think I will do a big spring cleaning and go for a clean install with a minimal restore. I estimate reinstalling a handful of packages will be less hassle and give a cleaner system.

I noticed sudo isn't installed by default on TKL-FS. Is there a particular reason for that?

Jeremy Davis:

I think your idea of a bit of a "spring clean" is a good one! :)

Re no sudo by default, the only reason is that we try to keep the images as trim as possible, and when running as root there is no need for it (you can use 'su - USERNAME' to run as the USERNAME user). However, installing it does no harm.

I have the following list of packages I want to reinstall:

Name                  Why
Sudo                  Security
ProFTPD               Receive scans from old MF printers
Webmin proftpd        Configure ProFTPD from Webmin
Gdrive                Upload to Google Drive (a GitHub download, not via apt)
Incrontab             Trigger hot folder actions
Incrontab user stuff  Config, allow, deny, rules, user scripts
Pdftk                 Process PDFs in the hot folder

A test of a clean install, with the Linux and Samba users, is already up and running on the VMware host. Could I edit /tklbam-dump/TKLBAM/newpkgs to include only the packages above, and let tklbam handle the configs etc.?

Jeremy Davis:

By default, all the config for all of those should be within /etc, so unless you explicitly exclude it (or, if you do, as long as you explicitly re-include it) you should be good.

I manually installed and reconfigured the additional packages. That gave me maximum control, and I have a nice clean install now. The old and new servers are running in parallel now, keeping data synced with rsync, until we are ready for the cutover.

Just to make sure: what is the correct way to escape spaces in paths in the tklbam overrides?

Jeremy Davis:

On v15.x, if you are adding overrides to the overrides conf file (i.e. /etc/tklbam/overrides), first be sure to install the latest release (v1.4.2) as per the install notes on GitHub (unfortunately, I don't yet have push access to the apt repo, but I do on GitHub). With that version installed, overrides added to the conf file just have to be on their own line (i.e. there's no need to escape spaces; just ensure each is on its own line).
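A minimal sketch for adding entries to the overrides file without creating duplicates. The add_override function and the OVERRIDES variable are made up for this example; the format (one entry per line, '-' marking an exclusion, spaces left unescaped) is as described above, and entries with spaces need v1.4.2 or later:

```shell
# Hypothetical helper: append TKLBAM override entries, one per line,
# skipping any that are already present. Set OVERRIDES to a scratch file
# to try it out; it defaults to the real overrides file.
add_override() {
    file=${OVERRIDES:-/etc/tklbam/overrides}
    for entry in "$@"; do
        # -x whole line, -F fixed string, -- because entries may start with '-'
        if ! grep -qxF -- "$entry" "$file" 2>/dev/null; then
            printf '%s\n' "$entry" >> "$file"
        fi
    done
}
```

E.g. add_override "-/etc/webmin" "/srv/storage/My Scans" adds both lines, and running it again changes nothing.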

If you are specifying paths with spaces when restoring via the command line (i.e. using the --limits= switch), escape each space with a backslash. E.g.:

tklbam-restore BACKUP_ID --limits="/path\ with\ spaces /another\ path\ with\ spaces"
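If you're building the --limits value in a script, a small helper can add the backslashes for you. This is a sketch (limits_value is hypothetical, not part of TKLBAM) that assumes, as in the command above, that TKLBAM reads backslash-space as a literal space and plain spaces as separators; the '-' exclusion prefix passes through unchanged:

```shell
# Hypothetical helper: join paths into a single --limits value, escaping
# each space inside a path with a backslash.
limits_value() {
    out=
    for p in "$@"; do
        esc=$(printf '%s' "$p" | sed 's/ /\\ /g')
        out="${out:+$out }$esc"   # separate entries with one plain space
    done
    printf '%s\n' "$out"
}
```

E.g. tklbam-restore /tklbam-dump --limits="$(limits_value "/srv/storage/My Scans" "-/etc/webmin")". The double quotes around the command substitution keep the backslashes intact.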

When installing the bug fix from GitHub, I get a lot of messages like this:

...
E: Release 'tklbam_1.4.2_all.deb' for 'linux-modules-3.16.0-8-amd64' was not found
E: Release 'tklbam_1.4.2_all.deb' for 'linux-image-3.16.0-5-amd64' was not found
E: Release 'tklbam_1.4.2_all.deb' for 'linux-modules-3.16.0-5-amd64' was not found
E: Release 'tklbam_1.4.2_all.deb' for 'turnkey-core-14.0' was not found
E: Release 'tklbam_1.4.2_all.deb' for 'linux-image-3.16.0-4-amd64' was not found
E: Release 'tklbam_1.4.2_all.deb' for 'linux-modules-3.16.0-4-amd64' was not found

Is this to be expected?

NVM, found the problem between screen and keyboard. I was updating TKLBAM on my old fileserver. On the new one it worked like a charm, happily backing up to the Hub as we speak :)

Jeremy Davis:

FWIW, the errors/warnings you noted are a bit weird and I'm not quite sure why they are occurring.

Although I'm glad that it worked ok on the newer server! :)
