Dan Dennedy's picture

I am having a new problem that seems to have appeared sometime in the last few weeks, as I do not recall running into it the last time I worked on my project in late October. When restoring from a tklbam backup, symbolic links are being converted into unlinked files. That is not the worst problem. If the target of the link does not yet exist, then the link/file is not restored at all! This appears in /var/log/tklbam-restore as something like:

OVERLAY ERROR @ /usr/local/lib/libmlt.so: [Errno 2] No such file or directory: '/tmp/tklbam-wqBv6Z/usr/local/lib/libmlt.so'
 
and breaks library dependencies of my custom software making it unable to execute! Anybody else seen this? It is not limited to my files but also things like:
OVERLAY ERROR @ /etc/initramfs-tools/modules: [Errno 2] No such file or directory: '/tmp/tklbam-wqBv6Z/etc/initramfs-tools/modules'

I looked for problem reports and at the command-line options for duplicity but did not come up with anything. Moreover, duplicity is supposed to restore symbolic links as symbolic links, yet within tklbam they are not!

Liraz Siri's picture

I'm not sure what is going on there but I don't believe your hypothesis regarding symbolic links is it.

Yes, of course TKLBAM restores symbolic links as symbolic links. Anything else would be ridiculous on a Linux system. If you doubt that, I propose you launch a Core instance, create a symbolic link in your home directory, back it up, and see if it's restored on another test instance.

Actually that's a good approach to problem solving in general. Try to reproduce the problem in its simplest, most isolated occurrence.

Tim Collins's picture

Sorry Liraz, I missed the later post where there is an admission of a problem with TKLBAM symbolic links. Please see the second half of my post, though, regarding the further problem with ownership and permissions. This problem happens even when the correct volume is mounted.

Liraz Siri's picture

Tim, if you could help isolate the problem you're experiencing with ownership and permissions so I can reproduce it independently from Core, that would be very helpful! I'm just a few days from getting back to TKLBAM (currently pushing out the 11.3 maintenance release for tomorrow).

Dan Dennedy's picture

Steps to reproduce:

  1. launch a new server on Hub with Core
  2. ssh new-server
  3. touch foo
  4. ln -s foo bar
  5. root@core ~# ls -l
    total 0
    lrwxrwxrwx 1 root root 3 Nov 27 19:49 bar -> foo
    -rw-r--r-- 1 root root 0 Nov 27 19:49 foo
  6. tklbam-backup
  7. wait for backup to complete and note backup ID
  8. go to hub/backups
  9. locate backup by ID and choose "Restore this backup to a new cloud server"
  10. wait for new server to finish booting and restoring
  11. ssh restored server
  12. root@core ~# ls -l
    total 0
    -rw-r--r-- 1 root root 0 Nov 27 19:49 foo

    notice "bar" does not exist

  13. less /var/log/tklbam-restore:
...
Restoring filesystem
====================

MERGING USERS AND GROUPS


OVERLAY:

/root/foo
/root/.penv
OVERLAY ERROR @ /root/bar: [Errno 2] No such file or directory: '/tmp/tklbam-b22UHX/root/bar'
...
/etc/ssl/certs/cert.pem
OVERLAY ERROR @ /etc/ssl/certs/50e684e0.0: [Errno 2] No such file or directory: '/tmp/tklbam-b22UHX/etc/ssl/certs/50e684e0.0'
/etc/ssl/certs/ca-certificates.crt
OVERLAY ERROR @ /etc/ssl/certs/6f5d9899.0: [Errno 2] No such file or directory: '/tmp/tklbam-b22UHX/etc/ssl/certs/6f5d9899.0'
...
OVERLAY ERROR @ /etc/rc2.d/K01confconsole: [Errno 2] No such file or directory: '/tmp/tklbam-b22UHX/etc/rc2.d/K01confconsole'
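
For a long restore log it can help to pull out just the paths that failed. The following is a minimal sketch, assuming the log lines keep the "OVERLAY ERROR @ path: [Errno N] message" shape shown above; the function name `failed_overlay_paths` and the sample text are made up for illustration and are not part of TKLBAM.

```python
import re

# Assumed line shape, copied from the log excerpt above:
# OVERLAY ERROR @ /root/bar: [Errno 2] No such file or directory: '/tmp/...'
OVERLAY_ERROR = re.compile(r"^OVERLAY ERROR @ (?P<path>.+?): \[Errno (?P<errno>\d+)\]")


def failed_overlay_paths(log_text):
    """Return the paths named in OVERLAY ERROR lines, in log order."""
    paths = []
    for line in log_text.splitlines():
        match = OVERLAY_ERROR.match(line.strip())
        if match:
            paths.append(match.group("path"))
    return paths


# Hypothetical sample built from the lines quoted in this post.
SAMPLE = """/root/foo
/root/.penv
OVERLAY ERROR @ /root/bar: [Errno 2] No such file or directory: '/tmp/tklbam-b22UHX/root/bar'
/etc/ssl/certs/cert.pem
"""

for path in failed_overlay_paths(SAMPLE):
    print(path)
```

To run it against a real restore, read the contents of /var/log/tklbam-restore and pass them to `failed_overlay_paths` instead of `SAMPLE`.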

Liraz Siri's picture

Sigh. Thanks for taking the time to isolate this. I just confirmed that you are absolutely right and we do in fact have a problem. Mea culpa.

TKLBAM is supposed to back up symbolic links as symbolic links. This is a regression. I'm not currently sure when it was introduced and what is causing it, but I'll get to the bottom of it as soon as I get back to TKLBAM development. That was supposed to happen a couple of weeks ago, but various distractions have intervened. Real soon now.

Tim Collins's picture

I'm afraid I can't shed any light on why this is happening; there are no errors in the backup or restore logs. I've just completed a test exercise and the screen dumps below show before and after. I used TKLBAM via webmin to do this. Sequence of events: did a FULL backup then restored it, and permissions were wrong as below. Did 'chown -R www-data:www-data' to correct them and ran the backup again (incremental), then restored again. The files again had their ownership changed to root. This is happening only in this directory tree, which has several thousand files in it.

The behaviour I've seen is not consistent either as I've also experienced it changing the directory permissions and the read/write/x attributes. Am I going mad or is something very weird going on?

 

Before

-rwxrwxrwx  1 www-data     www-data       3144 Nov 30 11:15 99_5d668bdb4b54de5.icc
-rwxrwxrwx  1 www-data     www-data     192138 Nov 30 11:15 99_5d668bdb4b54de5.jpg
-rwxrwxrwx  1 www-data     www-data       2857 Nov 30 11:15 99col_b196093d59b5da2.jpg
-rwxrwxrwx  1 www-data     www-data      24503 Nov 30 11:15 99pre_f206f68bca78a65.jpg
-rwxrwxrwx  1 www-data     www-data      92464 Nov 30 11:15 99scr_8d542e809620471.jpg
-rwxrwxrwx  1 www-data     www-data       7247 Nov 30 11:15 99thm_e4749084031f557.jpg
-rwxrwxrwx  1 www-data     www-data       2270 Nov 30 12:20 metadump.xml

 

After

-rwxrwxrwx  1 root     root       3144 Nov 30 11:15 99_5d668bdb4b54de5.icc
-rwxrwxrwx  1 root     root     192138 Nov 30 11:15 99_5d668bdb4b54de5.jpg
-rwxrwxrwx  1 root     root       2857 Nov 30 11:15 99col_b196093d59b5da2.jpg
-rwxrwxrwx  1 root     root      24503 Nov 30 11:15 99pre_f206f68bca78a65.jpg
-rwxrwxrwx  1 root     root      92464 Nov 30 11:15 99scr_8d542e809620471.jpg
-rwxrwxrwx  1 root     root       7247 Nov 30 11:15 99thm_e4749084031f557.jpg
-rwxrwxrwx  1 root     root       2270 Nov 30 12:20 metadump.xml

OK, well, it's not actually showing my screen dumps, so I've pasted the text too.
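
One way to make a before/after comparison like this repeatable over several thousand files is to record the metadata of the tree and diff it after the restore. This is only a sketch under assumptions: the helper names `snapshot_ownership` and `diff_snapshots` are invented here, and how the "before" snapshot is carried from the original machine to the restored one is left out.

```python
import os
import stat


def snapshot_ownership(root):
    """Map each path below root (relative to root) to
    (uid, gid, permission bits, is_symlink)."""
    snapshot = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            # lstat describes a symlink itself instead of following it
            st = os.lstat(full)
            snapshot[os.path.relpath(full, root)] = (
                st.st_uid,
                st.st_gid,
                stat.S_IMODE(st.st_mode),
                stat.S_ISLNK(st.st_mode),
            )
    return snapshot


def diff_snapshots(before, after):
    """Return the sorted paths from 'before' that are missing from
    'after' or whose recorded metadata differs."""
    return sorted(path for path in before if after.get(path) != before[path])
```

Taking a snapshot of the affected directory before the backup and another after the restore, then passing both to `diff_snapshots`, would list every file whose owner, group, mode, or link status changed.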

Liraz Siri's picture

Note, both issues were related to bugs in the Python standard library when dealing with file moves across file systems. My development environment is a local installation of TKLBAM where /tmp was on the same filesystem as everything else. That's why the problem slipped through in the first place and why it was difficult to reproduce. Finally I admitted I must be missing something and just debugged TKLBAM in a sort of ad-hoc fashion on EC2. Sometimes it's the tiniest differences that getcha! For those interested, here are links to the Python bug tracking system:

http://bugs.python.org/issue9993

http://bugs.python.org/issue1355826
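
Whether a given machine is exposed to this cross-filesystem case can be checked by comparing device IDs: when source and destination are on different devices, a plain rename is not possible and shutil.move has to copy instead. The snippet below is a small sketch of that check; the function name `on_same_filesystem` is hypothetical, and the choice of "/" as the restore target is an assumption for illustration.

```python
import os
import tempfile


def on_same_filesystem(path_a, path_b):
    """True if both paths report the same device ID (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


# Compare the temporary directory with the root filesystem.
# On a setup where /tmp is bind-mounted from another volume this
# would be expected to print False.
print(on_same_filesystem(tempfile.gettempdir(), "/"))
```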

Once I figured out what was going on the fix was straightforward - a wrapper around shutil.move that fixed symlinks and ownership (there don't seem to be any other issues):


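# workaround for shutil.move across-filesystem bugs
import os
import shutil
import stat

def move(src, dst):
    # lstat so that a symlink is inspected itself rather than its target
    st = os.lstat(src)

    is_symlink = stat.S_ISLNK(st.st_mode)

    # moving into an existing directory keeps the source's basename
    if os.path.isdir(dst):
        dst = os.path.join(dst, os.path.basename(os.path.abspath(src)))

    if is_symlink:
        # recreate the link at the destination instead of copying its target
        linkto = os.readlink(src)
        os.symlink(linkto, dst)
        os.unlink(src)
    else:
        shutil.move(src, dst)
        # restore the original owner and group after the move
        os.lchown(dst, st.st_uid, st.st_gid)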


Brad Rhoads's picture

One strange thing I noticed is that in my master system /var/lib/tomcat6/conf is also a symlink (to /etc/tomcat6). But when I start up a new system from a backup, the work and logs folders are missing (as I said above), but there seems to be a real /var/lib/tomcat6/conf folder with a copy of the contents from /etc/tomcat6.

Brad Rhoads's picture

Any progress on this  (https://bugs.launchpad.net/turnkeylinux/+bug/910515)?

The priority of the bug is "uncertain". It should be a blocker.

Liraz Siri's picture

Don't worry about the bug tracker status. It's definitely a blocker and will be fixed in the upcoming version of TKLBAM.

robotnut's picture

The WordPress sites no longer work after a reboot - same goes for the Vanilla Forums LAMP Appliance. So it's across the board.

Hope this gets fixed soon - can't wait to use the hub with my current VMs

Liraz Siri's picture

Sorry this took so long, it turned out to be a bit of an edge case which I couldn't reproduce in my local development environment. That explains why it slipped through all my usual regression tests. Made my life a bit more difficult than usual.

Anyhow, I traced this issue back to an obscure Python bug that affects how shutil.move works when moving symlinks between filesystems:

http://bugs.python.org/issue9993

This only affected tklbam-restore's behavior on small and medium sized EC2 instances since 27/10/2011, when we started to mount --bind /tmp to /mnt. Restores to non-EC2 TurnKey installations worked fine. Also, there was never a problem with the backups themselves, just the restore process.

Bottom line: I committed a fix, and we'll push out an automatic update tomorrow so that everyone gets a new version of tklbam automatically. After that restores should work just fine.

Tim Collins's picture

I shall be over the moon if you've cracked this one! As you know the original /tmp problem and the restore symlinks prob have both caused me a lot of headaches. Since you say it only affects the restore would I now be able to fix all my links by going back to an original full backup and restoring all the incrementals? Because I still have many issues with broken links.

Were you able to shed any light on my posts showing problems with permissions being incorrectly restored? This appeared to happen when what should have been a symlink was replaced during restore with the file structure it pointed to.

Steve SC's picture

Great to see this is being worked on ... however, the problem I reported here occurred on a micro instance, not a small or medium.

Liraz Siri's picture

Hmmm.... I guess Micro instances might have been affected too. I was mainly inferring from the fact that it doesn't have temporary storage on /tmp. But I think we may still be mount --bind'ing /tmp on Micro instances to /mnt/tmp in which case it may still have triggered the Python shutil.move moving-a-symlink-across-filesystems bug.

Anyhow, I'm pretty confident I cracked the symlink problem for good and that any related permission/ownership issues will go away as well. As soon as the archive is updated Alon will probably post a comment on this forum post.

Alon Swartz's picture

The package archive has been updated with the latest version of TKLBAM which fixes the symlink bug, as well as some other issues found during testing. You can install the latest version manually as follows (alternatively it will also be installed automatically with the daily security updates):

apt-get update
apt-get install tklbam

Tim Collins's picture

Now you've fixed the bug I still have to solve the problems caused by all the missing symlinks and all the errors I have left right and center when I install stuff. I know you said the backups are actually ok and it affected the restore. But I have this historic scenario now in event order...

Create and install appliance

Did original backup + some incrementals

Did a Restore, lost the symlinks - they were replaced with files of the same name instead of link

Have done many more full and incrementals since resulting in the backups containing files instead of symlinks.

With me so far? If I now restore the original backup I guess the symlinks will return because bug is fixed but surely when I then apply later backups they will overwrite the symlinks with the files caused by the earlier restore problem.

SO is there a way to tell the restore not to overwrite a symlink with a file of the same name????

I.E. I want to find a way to fully restore my appliance up-to-date with the original symlinks in place.

Jeremy Davis's picture

And unfortunately I don't have a quick fix, but I'm sure it can be done. A couple of ideas that spring to mind are:

  1. restore an old backup that includes symlinks
  2. overwrite with a new backup that includes your desired data only (ie none of the files that should be symlinks)

Or

  1. do a backup of symlinks only (from an old backup that includes them). I'm not sure if you can make TKLBAM just backup certain filetypes, but that'd be cool if you could. Otherwise tar and/or rsync should be able to do that. This will find all symlinks and store their location in a file called /root/symlinks
    find / -type l -exec ls -l {} \; > /root/symlinks
    (don't run 'find /' on a system that has mounted network filesystems eg NFS - otherwise it may cause issues and/or take forever...) You could then compare that against the profile and just tar/rsync the ones that are in backup locations
  2. overwrite your current server with this backup

I know neither of those solutions are pretty and I'm sorry I can't provide you with a nice little script that can do what you want but it should be possible.
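
As a rough starting point for the second idea, the listing step could be done in Python so that each link is stored together with its target and can be checked on the live server later. This is a sketch only: `collect_symlinks` and `missing_or_replaced` are made-up names, it reports problems without fixing anything, and the same caution about walking mounted network filesystems applies.

```python
import os


def collect_symlinks(root):
    """Return {link_path: target} for every symlink below root.
    Run this on a server restored from a backup that still has the links."""
    links = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # symlinks to directories show up in dirnames, others in filenames
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                links[full] = os.readlink(full)
    return links


def missing_or_replaced(links):
    """Return the entries that are no longer the same symlink on this
    machine: gone, turned into a regular file, or pointing elsewhere."""
    return {
        path: target
        for path, target in links.items()
        if not (os.path.islink(path) and os.readlink(path) == target)
    }
```

The dictionary from `collect_symlinks` would need to be saved on the reference server and loaded on the live one; each entry returned by `missing_or_replaced` is then a candidate for a manual `ln -s`.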

Tim Collins's picture

Thanks Jeremy, good idea. I agree pretty it's not, but definitely helpful. I'm still a little nervous about the integrity of my appliance, though, doing a manual fix like this. I think I'll restore the original and the incrementals made prior to the restore that caused the symlink problems to a new appliance. From there I'll use the find command you've created, then see if I can use that info to fix them by hand on the live appliance. Hope there's not loads of them!

Thanks.

robotnut's picture

Has this issue been solved? I just updated everything on my VMs, did a fresh backup, tried to restore to a new Amazon instance, and got the same thing.

Webmin still has no permissions and the restores don't work 100%. My WordPress appliances work - but the webmin doesn't. Project Pier restores also do not work - it seems like it's a fresh Project Pier install. Webmin doesn't work on that one either.

Also have a Lamp appliance running open cart - which just gives internal server errors - while the VM I have in my data center - works flawlessly.

Ideas? Is there something I can run to get this going? I was totally excited to run on Amazon but this just isn't working.

Jeremy Davis's picture

So the machine you backed up was all working correctly? In other words, you're sure that it wasn't a backup that already had the symlink/permission problems?

You could also ensure that the new version of TKLBAM is installed on both origin and target (it should be installed via security updates but just to be sure you could 'apt-get update && apt-get install tklbam')

I'll try to test myself sometime soon to see if I can reproduce the issue(s).

robotnut's picture

Yeah I double checked it all. I updated all VMs in my environment with the latest updates for EVERYTHING including TKLBAM. Then once the updates were done - I rebooted each VM. Once rebooted and everything was in working order I started a fresh set of backups.

Once the backups were done - I did a restore from that backup to a new VM and then whammo. Same thing. Some of the VMs in Amazon work - like WordPress and LAMP running Vanilla - however - I still get the access issue when getting into Webmin. This is consistent across the board for all of them.

The other thing - is that the LAMP appliances running Project Pier, OpenCart and Atmail - don't work at all after I restore. Project Pier thinks it's a fresh install. OpenCart just gives internal server errors - can't even get to webmin. And Atmail - well - Atmail we can disregard. That doesn't work period on Amazon no matter what I do. Fresh install or not.

Hope this helps.

Dan Dennedy's picture

As the original poster, I just want to report this fix is working fine for my use case. I am just using core and running custom, unpackaged software with uninstalled libs. Now the links in libs/ are fine. Also, while I do not use webmin much, it seems to be working, and I no longer see the errors in /var/log/tklbam-restore.

Thank you
