tklbam-restore 100% cpu 48 hrs now. Please help

I'm testing my restore procedure, and for the first time ever I'm using an AWS cloud server, deployed via the Hub. It's almost 24 hours later now and the restore seems to be done: I can see files from the backup, though I'm not sure it's 100% complete.

But in top, tklbam-restore is still hogging all available CPU, and tail /var/log/tklbam-restore doesn't show any new progress.

What is keeping it busy, and why? How can I find out more about what is happening?

Attached: a screenshot of top.

There seems to be enough free space:

root@fileserver /var# df
Filesystem  1K-blocks      Used Available Use% Mounted on
rootfs      103212320  49295524  48675068  51% /
udev            10240         0     10240   0% /dev
tmpfs           61108      1096     60012   2% /run
/dev/xvda1  103212320  49295524  48675068  51% /
tmpfs            5120         0      5120   0% /run/lock
tmpfs          122200         0    122200   0% /run/shm
/dev/xvda1  103212320  49295524  48675068  51% /tmp
/dev/xvda1  103212320  49295524  48675068  51% /var/cache/tklbam

The last 60 or so lines from /var/log/tklbam-restore (the log ends with 2 empty lines):

  chown -h root:root /var/lib/samba/perfmon
  chmod 0755 /var/lib/samba/perfmon
  chown -h Debian-exim:Debian-exim /var/spool/exim4/db/retry.lockfile
  chown -h www-data:www-data /var/www/ajaxplorer/data/plugins/auth.serial/Luzina
  chown -h www-data:www-data /var/www/ajaxplorer/data/plugins/auth.serial/jacob
  chown -h www-data:www-data /var/www/ajaxplorer/data/plugins/auth.serial/root
  chown -h www-data:www-data /var/www/ajaxplorer/data/tmp/update
  rm /etc/nologin
  rm /etc/rc0.d/K01confconsole
  rm /etc/rc1.d/K01confconsole
  rm /etc/rc6.d/K01confconsole
  rm /etc/ssl/certs/03f0efa4.0
  rm /etc/ssl/certs/1155c94b.0
  rm /etc/ssl/certs/11a09b38.0
  rm /etc/ssl/certs/1df5ec47.0
  rm /etc/ssl/certs/2cfc4974.0
  rm /etc/ssl/certs/4597689c.0
  rm /etc/ssl/certs/6adf0799.0
  rm /etc/ssl/certs/8f7b96c4.0
  rm /etc/ssl/certs/9818ca0b.0
  rm /etc/ssl/certs/Equifax_Secure_eBusiness_CA_2.pem
  rm /etc/ssl/certs/Firmaprofesional_Root_CA.pem
  rm /etc/ssl/certs/TC_TrustCenter_Universal_CA_III.pem
  rm /etc/ssl/certs/TDC_OCES_Root_CA.pem
  rm /etc/ssl/certs/Wells_Fargo_Root_CA.pem
  rm /etc/ssl/certs/ce026bf8.0
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/oxygen/images/actions/32/bottom.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/oxygen/images/actions/32/button_cancel.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/oxygen/images/actions/32/button_ok.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/oxygen/images/actions/32/configure.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/oxygen/images/actions/32/database.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/oxygen/images/actions/32/documentinfo_koffice.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/oxygen/images/actions/32/download_manager.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/oxygen/images/actions/32/drive_remote.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/oxygen/images/actions/32/edit_clear_history.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/oxygen/images/actions/32/edit_rename.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/oxygen/images/actions/32/editcopy.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/oxygen/images/actions/32/editdelete.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/oxygen/images/actions/32/editpaste.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/oxygen/images/actions/32/eraser.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/oxygen/images/actions/32/favorite-folder.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/oxygen/images/actions/32/file_info.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/umbra/images/actions/16/view_choose.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/umbra/images/actions/16/view_icon.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/umbra/images/actions/16/view_left_close.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/umbra/images/actions/16/view_left_right.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/umbra/images/actions/16/view_list_tree.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/umbra/images/actions/16/view_text.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/umbra/images/actions/16/view_tree.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/umbra/images/actions/16/viewmag1.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/umbra/images/actions/16/window_close.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/umbra/images/actions/16/window_fullscreen.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/umbra/images/actions/16/window_nofullscreen.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/umbra/images/actions/16/xmag.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/umbra/images/actions/16/yast_backup.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/umbra/images/actions/16/yast_kuser.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/umbra/images/actions/16/yast_security.png
  rm /var/www/ajaxplorer/plugins/gui.ajax/res/themes/umbra/images/actions/16/yast_user_add.png


Jeremy Davis:

Micro servers can have high CPU spikes, but if CPU usage remains high for any extended period of time, AWS throttles the CPU. Once a CPU has been throttled, a vicious cycle can often occur: processing essentially comes to a halt, so if your server was doing something CPU intensive, then as soon as the CPU becomes unthrottled there is a lot of CPU work that has been backed up. CPU usage goes high straight away, triggering another round of throttling... TBH I haven't heard of that happening with a restore (a few users have experienced it with backups), but IMO there is no reason why it couldn't happen on restore (if CPU usage stays too high for too long).

Bottom line is that I don't have a 100%-for-sure answer. You can check the logs to see if there is anything there (logs are in /var/log and should start with "tklbam"). Another option is to use the '--debug' switch. This will do the restore in stages and drop you into a shell (so you can have a poke around) between the stages.

A further option may be to do the restore manually in stages: use the '--raw-download=/path/to/backup/' switch, which just downloads the files (without doing a restore), then add that path to the TKLBAM command to restore the backup over the top of your filesystem.
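If it helps, the two stages could be sketched like this. Note this is only a sketch using echoed command strings (running them for real needs a configured TKLBAM/Hub setup), and BACKUP_ID and DOWNLOAD_DIR are placeholder values you'd substitute yourself:

```shell
# Hypothetical sketch of a staged restore; BACKUP_ID and DOWNLOAD_DIR are
# placeholders -- substitute your own Hub backup ID and a path with enough space.
BACKUP_ID=1
DOWNLOAD_DIR=/var/cache/tklbam/raw

# Stage 1: download the backup only -- nothing is applied to the filesystem yet
STAGE1="tklbam-restore $BACKUP_ID --raw-download=$DOWNLOAD_DIR"

# Stage 2: after inspecting the download, restore from the local copy
STAGE2="tklbam-restore $DOWNLOAD_DIR"

echo "$STAGE1"
echo "$STAGE2"
```

The point of splitting it up is that you can poke around the downloaded files between the two stages, before anything is overwritten.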

It is a micro instance, backed with elastic block storage. I rebooted the thing after 3 days at max load, like you would back in the Windows days ;) and now CPU behaviour is normal. What creeps me out is killing the restore job before it finishes. I will never be sure it completed 100% and put back all my files :(

Now I will test the restore to a bare metal machine at my place. But the whole idea of using the cloud server for this test was to save time and bandwidth, and to be able to quickly test the different scenarios I am considering for upgrading to the new release of the fileserver appliance.

How about a beefier server instance? I am not on a plan.

Jeremy Davis:

Assuming that I was right in my guess about the Micro CPU throttling issue, it certainly shouldn't be an issue locally. It also shouldn't be an issue on a Small server. And considering that it was only a test, the cost difference (between Micro and Small AWS servers) shouldn't be too significant for future tests (if you do them).

I'd love to hear how things go with your local restore so please post back if you get a chance.

As for the new fileserver: the restore should go fine, but note that v14.0 no longer includes Ajaxplorer. It still has a web-accessible file frontend, but it uses Samba users only, so keep that in mind. If you have any Ajaxplorer-only users, you will need to tweak things a bit by creating Samba users and moving their files to the relevant filesystem location...

I just installed the Fileserver appliance (v14) on bare metal. Later today I will start the tklbam restore, and will let you know how that goes. My backup is from a v13 appliance, and I installed some extras on that appliance, so I'm curious to see how the upgrade will go. But if the restore process does hog my CPU again and not terminate properly, what should I do? How can I gather intel to analyze what is going on?

I never liked the Ajaxplorer ;) so that won't be a problem.

Using the Hub I cannot deploy any instance other than a micro without being on 'some plan'. And since this is my first test of a cloud server, I'm not sure if I want or need a plan; I cannot really foresee the costs. The Amazon calculator told me I shouldn't do reserved instances.

Will get back to you later.

One more thing: the editor box I'm typing this comment in does not really like me (Mac, Safari). Every time I hit Enter for a new paragraph, the cursor jumps back to the start of my comment.

Jeremy Davis:

Re forum editor box

Hmmm, that's weird! So when you hit Enter it goes back to the top of the box? Personally I haven't experienced that myself (Chrome & IceWeasel/Firefox on Debian), but TBH I mostly use the plain text editor anyway...

The only workarounds that I can think of are:

  • Switch to the plain text editor - you'll need to use HTML though (for formatting)
  • Install/use an additional/alternate web browser

    Restore to bare metal took approx 90 mins, for a 20GB backup and using a 100Mbit download.

    Seems it has gone through without any serious CPU hogging. I used the --debug switch and couldn't notice anything strange in between. Question: does the --debug switch disable logging to /var/log/tklbam-restore? Because there is no log file :(

    Do you have any advice on how I can compare my live fileserver to the one I restored just now, to make sure all my data files came through?

    Jeremy Davis:

    I wouldn't have expected the debug switch to stop it from logging, but perhaps you've discovered a bug? Is the log just empty, or just missing entries? We should probably lodge that as a bug report...

    Did the restore totally complete without error? If so then there is no reason to suspect that it doesn't include all the files from your backup. However I can understand you wanting to check.

    When it restores, TKLBAM keeps track of what was changed and allows you to roll back. However I'm not actually sure where we'd find that... I should do some testing.

    If you haven't rebooted yet it may be worth seeing if the cache still exists locally (IIRC it uses either /tmp or a custom /TKLBAM directory). You could then compare that against the local filesystem. If you can't find that then you could redownload your backup (without restoring) and do the same comparison.
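For the comparison itself, one generic approach (nothing TKLBAM-specific; the /tmp/cmp_a and /tmp/cmp_b paths below are just stand-ins for, say, the cached download and the live filesystem path) is to build sorted checksum manifests of the two trees and diff them:

```shell
# Build a checksum manifest of each tree, then diff the manifests.
# /tmp/cmp_a and /tmp/cmp_b are stand-in example paths.
mkdir -p /tmp/cmp_a/sub /tmp/cmp_b/sub
echo "same content" | tee /tmp/cmp_a/file1 > /tmp/cmp_b/file1
echo "only here" > /tmp/cmp_a/sub/extra

# sort by filename (field 2) so the two manifests line up
(cd /tmp/cmp_a && find . -type f -exec md5sum {} + | sort -k2) > /tmp/manifest_a
(cd /tmp/cmp_b && find . -type f -exec md5sum {} + | sort -k2) > /tmp/manifest_b

# lines prefixed '<' exist only in the first tree, '>' only in the second;
# no output at all means the file contents match
diff /tmp/manifest_a /tmp/manifest_b || true
```

In practice you'd run the find/md5sum line on each server, copy one manifest across (e.g. with scp), and diff locally. On a large tree this takes a while, since every file gets read in full.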

    Assuming that you still have your old server, you could run a TKLBAM backup dump (--dump=DIR) on that and compare it to the same on your new server.

    The file /var/log/tklbam-restore is just not there. It does not exist. I even cd'd into /var/log and ls'd the contents to check. It is not there.

    I still have the live server running, because this is just testing the restore procedures. Although I noticed a SMART warning last week ;) so better to be prepared. Is there any standard tool or script to compare the contents of a data directory between 2 servers? With no log file, and not enough lines in the screen buffer, I am not sure if all went well. That is why I need to compare the 2 servers.


    Jeremy Davis:

    First up I have made a bug report on you not having a log after using the debug switch.

    Good plan on testing! AFAIK there is no standard tool for comparing directories between servers; but Linux being Linux there are probably a number of different ways to do it. A quick google turned up a suggestion to use rsync. TurnKey is actually based on Debian (not Red Hat) but rsync is a common Linux tool so that'll be fine. Assuming that you can connect to one server from the other I think that's a great idea!

    To make it easier to read, I have made a little script for you. I did a couple of quick tests and it seems good, but to be on the safe side I urge you not to take my word for it: have a bit of a read about rsync and inspect the output file (defaults to /tmp/rsync.tmp) yourself. You can run this on either server; in the script, adjust REMOTE to be the IP of the other server. The CHK_PATH may also need some tweaking. Note that this will take some time to run on a lot of files...

    #!/bin/bash
    
    # adjust these values as relevant
    REMOTE=192.168.1.50
    CHK_PATH=/srv/storage
    
    TMP_FILE=/tmp/rsync.tmp
    
    # -n = dry run (nothing is transferred); doubling -i (-ii) also lists
    # unchanged files, which the "Files on BOTH" section below relies on
    rsync -anii --delete --size-only $CHK_PATH/ root@$REMOTE:$CHK_PATH > $TMP_FILE
    
    echo "# Files (& empty dirs) on REMOTE - but NOT LOCAL:"
    FIND="\*deleting"
    grep "^$FIND" $TMP_FILE | sed "s|^$FIND *||"
    echo
    
    echo "# Files on LOCAL - but NOT REMOTE:"
    FIND="<f"
    grep "^$FIND" $TMP_FILE | sed "s|^$FIND.* ||"
    echo
    
    echo "# Files on BOTH:"
    FIND="\.f"
    grep "^$FIND" $TMP_FILE | sed "s|^$FIND.* ||"
    
    Assuming you have called it check.sh, make it executable and run it:
    chmod +x check.sh
    ./check.sh
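If you want to sanity-check /tmp/rsync.tmp by eye first, the itemized lines look roughly like this (an illustrative made-up sample, not real output), and the script's three patterns (anchored here to the line start) classify them as follows:

```shell
# Illustrative sample of rsync --itemize-changes dry-run output (made up)
cat > /tmp/rsync.sample <<'EOF'
*deleting   old/only-on-remote.txt
<f.st...... docs/only-on-local.txt
.f          docs/on-both.txt
EOF

# '*deleting' = would be deleted on remote (i.e. exists only there),
# '<f...'     = file would be sent (exists/differs locally),
# '.f...'     = file present and unchanged on both sides
echo "remote-only:"; grep '^\*deleting' /tmp/rsync.sample | sed 's|^\*deleting *||'
echo "local-only:";  grep '^<f' /tmp/rsync.sample | sed 's|^<f.* ||'
echo "both:";        grep '^\.f' /tmp/rsync.sample | sed 's|^\.f.* ||'
```

That should make it easier to spot which category any surprises fall into when you read the raw output file.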
    

    Not sure if it is a bug or a feature ;) but rolling back a restore does not leave a log file either (this was without the --debug option).

    Jeremy Davis:

    I think it should log rollback too so I added a new feature request.
