Drew Ruggles's picture

Of course a long holiday weekend and a client project due make for perfect timing for a server crash and a good test of my computer problem solving skills, patience, and TKLBAM restore.

Not sure what the server was doing when I accidentally shut it down, but it didn't want to come up. As a matter of fact, there's a good probability its days are done, and I'm just using costly end-of-life care at the moment, such as the Vertex 4 SSD I bought to see if I could revive it.

Biggest problem was the BIOS not recognizing the relatively new (~ 1 year) WD 500GB Green HDD, which had my TKL LAMP as the main OS, as well as a few VMs and other goodies -- Cloudprint, Logitech Media Server, PHPVirtualBox, etc.

So I can't get the box to boot off the WD500 HDD, even though all the tools in Parted Magic and Boot-Repair-Disk say it's just fine (this is one reason I'm thinking the mobo is hosed). Replaced the SATA cables. Installed the SSD and used UNetbootin to create a LiveUSB of the latest TurnKey LAMP. I can get it to boot from the SSD, get to Webmin and run TKLBAM Restore, then it all goes to hell...

I tried it once, and couldn't reboot the machine. Wiped the SSD and reinstalled TKL LAMP, booted, ran TKLBAM Restore, and now have this error:

Download s3://s3.amazonaws.com/tklbam-[ lettersandnumberscode ]/duplicity-full.[ lookslikeacacatenationofmylastfullbackdateandid ].vol28.difftar.gpg failed (attempt #1, reason: BotoServerError: BotoServerError: 500 Internal Server Error
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>C9AF3DD7C870D7F1</RequestId><HostId> [ areallylongseriesoflettersandnumbersthatIdon'tknowiftheyareimportant ]</HostId></Error>) 

Any thoughts?

[UPDATE] I "tried again" on the recommendation of the above error:

Local and Remote metadata are synchronized, no sync needed.
Last full backup date: Sat Nov 10 12:37:26 2012 

Yeah... I don't think so.

Any way to "force" the Restore to happen? For example:

tklbam-restore 4 --noninteractive --force

FYI, I'm restoring a backup of a 10.x LTS appliance on top of the 12.x Debian version.
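The S3 response in the log is a server-side HTTP 500 whose own message says to try again, so the usual handling is a retry with a growing delay. Here is a minimal Python sketch of that pattern. It is not what TKLBAM or duplicity does internally; `TransientError`, `retry` and the delay values are hypothetical names and numbers chosen for illustration.

```python
import time


class TransientError(Exception):
    """Stand-in for a transient server-side failure such as an HTTP 500."""


def retry(operation, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    The wait doubles after each failure (1s, 2s, 4s, ...). The sleep
    function is injectable so the logic can be exercised without waiting.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError as exc:
            last_error = exc
            if attempt < attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    raise last_error
```

In practice the download step would be wrapped in `operation` and would raise `TransientError` only for 5xx responses. A 4xx response (bad credentials, missing key) should not be retried.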

Jeremy Davis's picture

But it sounds like in the meantime perhaps your new server has done a backup? Thus when you tried to restore there was nothing to restore (the state was the same as the last backup). Either that or it thought the restore was successful?

If it is because it thinks the restore was successful, try doing a rollback then try again.

If it is because it has done its own backup - then you'll need to go back to a (the?) previous backup (the one with your data). Sorry I don't recall the exact steps OTTOMH but if you look either in the Hub (or the TKLBAM webmin module or from the command line) then you should be able to find the backup that you want and manually tell it to use that one.
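From the command line, the sequence described above might look roughly like the sketch below. It assumes the standard TKLBAM commands `tklbam-restore-rollback`, `tklbam-list` and `tklbam-restore` are on the appliance's PATH. The helper names `restore_plan` and `run_plan` are hypothetical, and the sketch only prints the commands unless `dry_run=False` is passed.

```python
import shlex
import subprocess


def restore_plan(backup_id, rollback_first=True):
    """Return the command sequence as argument lists, without running anything."""
    plan = []
    if rollback_first:
        # Undo the previous restore attempt before trying again.
        plan.append(["tklbam-restore-rollback"])
    # List the backups on the Hub account to confirm which ID holds the data.
    plan.append(["tklbam-list"])
    # Restore that specific backup by ID.
    plan.append(["tklbam-restore", str(backup_id)])
    return plan


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in plan:
        print("+", shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


run_plan(restore_plan(4))
```

Backup ID 4 is the one mentioned earlier in the thread. Check the `tklbam-list` output before running the restore step for real, since a fresh install may already have registered a newer backup of its own.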

Good luck!

Drew Ruggles's picture

Yeah, I was able to discern which backup it was restoring (it was number 4, btw) and yes, it had created a backup 5, but I think even more problematic is it may have been running that backup as I was trying to restore 4 at the same time, which is probably why it barfed out all those error messages.

Eventually, it was able to restore -- at least I could tell from the Webmin TKLBAM log that it was putting the correct stuff back in the right places. The issue is that not all the services are running after the restore, so it recommends a reboot. This is when things go catastrophic. From Webmin, I tell it to Reboot. On the ConfConsole screen, I can see messages (sorry, I didn't write them down, but I'm sure it said something like, "Need maintenance password or hit control-d to continue."). Entering the password or hitting control-d has almost no effect, as "focus" (not really) goes to the confconsole window, but I can't navigate it with the keyboard.

Forcing it to reset at this point restarts the system, and then lots and lots and lots of error messages that say the file system is read-only, and it won't even fully boot up, and it completely locked up.

Do you think it would help to go back to a 10.x LTS ISO of the LAMP appliance as a starting point, then try to restore onto that clean version?

Thanks, Jeremy.

Drew

Jeremy Davis's picture

My guess is that something is wrong with your filesystem. I suggest that you boot with a live CD/USB and run fsck on the whole filesystem.

Drew Ruggles's picture

Hmmm... That's a thought, except does that mean the TKLBAM Restore corrupted the filesystem? It was working prior to the restore (though, thinking about that... I don't think I ever "rebooted" the new LAMP appliance -- just went straight to TKLBAM restore).

It's also weird that the SSD is new, so I used GParted to partition it as a single volume (128GB) using ext4 filesystem. Then, the Debian Linux installer added the extended Linux swap partition. It just seems odd to me that the filesystem would be the culprit, but I still really appreciate your feedback (as always).

Drew

Jeremy Davis's picture

Not sure why the data corruption occurred but as the restore takes place at an OS level (not at a disk hardware level) then I find the possibility of the restore causing the corruption unlikely (perhaps impossible? - although I've been wrong before! :D). Perhaps the FS corruption is a symptom of the same problem which caused your HDD to die and your new one to not work properly?

The symptoms you describe sound just like filesystem corruption. "Need maintenance password or hit control-d to continue." - suggests to me that a partition/disk has been flagged as 'dirty' and an auto fsck has been scheduled. Then the auto check has found errors which require manual running of fsck on the affected partitions. Also the fact that the FS has been mounted read only is another sign to me that corruption exists (by default if a FS reports as damaged Linux will not allow it to be mounted read/write to reduce the chances of further corruption and increase the likelihood that you will be able to recover data).
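One quick way to confirm the read-only symptom from a shell (if you can get one) is to look at the mount options in `/proc/mounts`. A minimal Python sketch follows. The helper name `read_only_mounts` and the sample device names are made up for illustration; on a real box you would pass in the contents of `/proc/mounts`.

```python
def read_only_mounts(mounts_text):
    """Return (device, mount_point) pairs whose mount options include 'ro'.

    Expects text in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fs_type, options = fields[:4]
        if "ro" in options.split(","):
            result.append((device, mount_point))
    return result


# Hypothetical sample; the device names are illustrative only.
SAMPLE = """\
/dev/mapper/turnkey-root / ext4 ro,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
/dev/sda1 /boot ext2 rw,relatime 0 0
"""

print(read_only_mounts(SAMPLE))
```

If the root filesystem shows up here, the repair itself is best done from a live CD/USB with the filesystem unmounted, for example with `e2fsck -f` on the affected device. With LVM, the volume group has to be activated first (`vgchange -ay`).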

Unless you chose to use the existing partition when installing TKL then your partitioning would have been ignored and new partitions (based on LVM) would have been made (but I guess you probably already knew that).

Considering your problems perhaps the issue wasn't with your HDD at all, but the SATA controller (motherboard) or perhaps RAM (RAM issues can often cause HDD corruption). Often motherboards have multiple SATA controllers: most of the ones I have handy have one bank of SATA ports controlled by the chipset (seems like 2 ports on this Gigabyte board I'm looking at now) and another bank (the other 4 on this board - or perhaps I've got that around the wrong way...?) controlled by an additional chip. Perhaps you could try plugging your WD HDD into one of the other ports. The fact that your hardware won't recognise a standard SATA HDD is a concern I would think... Also most older boards with SATA ports often allow the ports to operate in a 'legacy IDE' mode rather than pure SATA (ie it presents to the OS as an old PATA drive rather than a SATA one). Perhaps have a look in your BIOS and see if you have any options like that (exactly what it is called will depend on your BIOS).

Also have you tested your (original) HDD on other hardware at all? Have you run memtest on your RAM? I would suggest both of these things before you proceed too much further.

L. Arnold's picture

I would see if you can do a Cloud Install and Restore to that.  Alternatively try to restore to a different machine.  Otherwise I would expect that TKLBAM would have the same trouble that Linux would if there were HD problems during the backup process.

I am facing a different issue.  Something corrupted MYSQL last weekend and when I go to restore one big Data File is failing.  I would love to get some "micro access" to the Backup sets so I could "pre fix" a file before restoring it.

 

Drew Ruggles's picture

Both good thoughts... My guess is the SATA controller on the motherboard is hosed. I've researched it, and apparently it's a rare condition, but oh, well. I'll dig a little deeper in the BIOS settings and also try a memtest. I'm assuming the automatic one that happens each time it boots is not good enough...?

Not that I really have the time, but I'm very interested in trying out the restore on an AWS micro-server. Damn! ...and my Raspberry Pi just showed up at the door today, too! (It's ridiculously small)

Thanks, Jeremy & L. Arnold!

Drew

Jeremy Davis's picture

But I have had it happen on a fairly new board before (although it was actually an old IDE chip). It started as random HDD corruption and then within about 2 weeks the machine wouldn't boot - and the board was less than 6mths old!

As for memtest, no the BIOS self test at startup is nowhere near enough. I have had one high end board which did actually include a proper memtest option in BIOS but generally it requires booting from a CD/USB (many Linux distros actually include it as a boot option from GRUB). If you don't already have it though this is the one! And to be sure I highly recommend running it overnight (anything less than 8hrs will leave too much doubt).

Ignore this next paragraph if you get no errors on an overnight run...
If you get ANY errors (even just one), then rerun on each stick individually (assuming you have multiple RAM sticks). Ideally make sure they go back in the slots they came out of (probably good to label them). Again run each stick 8hrs+ (this will obviously take a while if you have 4 sticks...!). Once you isolate which stick/slot is the problem, then try the 'bad' stick in a 'good' slot. If you get no errors then try a confirmed good stick in the 'bad' slot (basically isolate whether it is the slot or the stick). Bottom line if you have one stick that errors on multiple slots (but the others are ok) then bin it (or RMA it if still under warranty). If you have multiple sticks that error on one slot then you have a mobo issue (but perhaps it'll be good enough to just avoid that slot for now). If you have multiple sticks in multiple slots causing issues then either the RAM and/or the mobo is not good (more testing required to reach a conclusion...)
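The stick/slot elimination above boils down to a small decision table. Here is a rough Python sketch of it, assuming you record each overnight run as a (stick, slot) pair with a pass/fail result. The function name `diagnose` and the labels are hypothetical.

```python
def diagnose(results):
    """Interpret per-run memtest results.

    results maps (stick, slot) -> True if memtest reported ANY error
    with that stick alone in that slot.
    """
    failing = [pair for pair, had_errors in results.items() if had_errors]
    if not failing:
        return "no errors observed"

    bad_sticks = {stick for stick, _ in failing}
    bad_slots = {slot for _, slot in failing}

    # One stick failing in several slots points at the stick itself.
    if len(bad_sticks) == 1 and len(bad_slots) > 1:
        return f"replace stick {next(iter(bad_sticks))}"
    # Several sticks failing in the same slot points at the board.
    if len(bad_slots) == 1 and len(bad_sticks) > 1:
        return f"avoid slot {next(iter(bad_slots))} (motherboard issue)"
    # A single failing combination cannot separate stick from slot yet.
    if len(bad_sticks) == 1 and len(bad_slots) == 1:
        return "swap-test needed: try this stick in a good slot, and a good stick in this slot"
    # Several sticks across several slots: RAM and/or board, keep testing.
    return "inconclusive: RAM and/or motherboard, more testing required"


runs = {("A", 1): True, ("A", 2): True, ("B", 1): False, ("B", 2): False}
print(diagnose(runs))
```

This only encodes the rules of thumb from the post. It does not account for intermittent faults, which is why each run still needs the full 8hrs+.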

As for the Raspi, have a look at Rik's thread on using TKL patch to recreate TKL Core if you are interested...! :)

Drew Ruggles's picture

Looks like Memtest86+ is included on the Parted-Magic Live CD I've been using, so I'll toss that in and run it. Don't have to do overnight as the server is not being used during the day in this state... ugh. I'm still leaning towards hosed SATA controller (even replaced the SATA cables), but willing to check out other possibilities.

As for Raspi, I'm all over Rik's thread. I haven't looked into it much, but I'm hoping to set up a bootloader that allows me to choose between Raspbian OS, XBMC and TKL LAMP server. It's just amazing how tiny it really is in hand, with full-sized ethernet, 2xUSB, and full-sized HDMI ports -- for US$35.

Jeremy Davis's picture

As for Raspi, I know nothing about it but I would assume that you could install GRUB2 (aka grub-pc v1.9x - as used by most Linux distros these days inc TKL) and get that to handle your multibooting requirements.

Drew Ruggles's picture

I mentioned Raspi to the SO. She replied, "Well what would you use that for? No, seriously, tell me..."

[ sigh... ]

Memtest is still running... Does it ever stop? It's been over 9 hours. Looks like the error count is up to 16, almost all of which occurred on Test #5, whateverthehell that means (classic lack of documentation -- let me guess, it's available in the command line man page?) I gotta give a shout out to the Extras Menu in Parted Magic that has Memtest included in it. It appears most errors occurred between 1200MB and 1800MB. Can I assume this is the middle DIMM of the 3 x 1GB DIMMS on this board?

I'll see what I can accomplish tomorrow, but my travel schedule will prevent further investigation (on the hardware front). I still plan on setting up an AWS instance to test.

Drew

Jeremy Davis's picture

Memtest will run forever if you don't stop it, it will just loop and loop!

In my experience Memtest results are quite like pregnancy tests. They can sometimes give false negatives, but rarely ever give false positives! So either you have a RAM issue or you have a motherboard issue. I would put money on it, I'm that confident!

And no you can't assume which stick it is (well you can but it'd really only be a marginally educated guess). It would depend completely on how the RAM is wired into the mobo. If they are running as 3 individual sticks in sequence (which they rarely do) then perhaps you are right, but on boards that have 4 slots, when running single channel, they are often slot 1, slot 3, slot 2, slot 4. But perhaps your board is triple channel and they are all running in parallel? In which case it would be an out and out guess which stick (or perhaps it's all 3??)
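To see why the failing address range alone doesn't name a stick, compare two simplified mappings for 3 x 1GB DIMMs: one where the sticks are laid out back to back, and one where consecutive small blocks rotate across all three channels. The Python below is a toy model only. Real memory controllers are more complicated, and the 64-byte stripe size is an assumption chosen for illustration.

```python
MB = 1024 * 1024


def sequential_dimm(address, dimm_size=1024 * MB):
    """DIMM index if the sticks are mapped one after another."""
    return address // dimm_size


def interleaved_dimm(address, channels=3, stripe=64):
    """DIMM index if consecutive stripes rotate across the channels."""
    return (address // stripe) % channels


def dimms_touched(start, end, mapper, step):
    """Sorted DIMM indexes hit when sampling [start, end) every `step` bytes."""
    return sorted({mapper(address) for address in range(start, end, step)})


# The 1200MB-1800MB range reported by Memtest, under the sequential model.
print(dimms_touched(1200 * MB, 1800 * MB, sequential_dimm, MB))

# Just the first 4KB of that range, under the interleaved model.
print(dimms_touched(1200 * MB, 1200 * MB + 4096, interleaved_dimm, 64))
```

Under the sequential model the whole range lands on the middle stick (index 1). Under the interleaved model even a 4KB window touches all three, so only the one-stick-at-a-time testing settles it.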

So you need to re-read my post above (and like I say 8hrs is the minimum, the longer you run it the lower the chance of a false negative). Don't hesitate to ask for clarity if there is something you don't get...

As for the Raspi - they just don't get it do they! :)

Drew Ruggles's picture

OK, so I took L. Arnold's advice and plunked down the US$1.83 for an EC2 Cloud instance. I was easily able to get the latest LAMP stack up and running, but when I got to TKLBAM in Webmin and tried to restore, it looked like it started, then hung there for quite a while... I don't know whether closing my browser stops the restore, or exactly how the command structure works.

I see I have the option of restoring to a cloud instance from my TKLBAM dashboard. While this is appealing -- and I'll probably try it out -- it's not really what I want to do, as it doesn't replicate restoring to bare metal.

Any thoughts on how to make the Restore actually work on a new LAMP instance...?

Drew

Drew Ruggles's picture

Tried running tklbam-restore {backup_id -- in this case, 4} from the Webmin command line on my EC2 instance:

Access denied : User root is not allowed to use the Backup and Migration (TKLBAM) module

How's about them apples?!

Drew


Drew Ruggles's picture

OK, so it looks like from the logs that _something_ was restored (not convinced it's entirely correct, but that's less important at the moment).

The more important issue is that this instance is out there for the world to see, and I only want it for my eyes. How do I make it so all these web interfaces that I've built up are only available to me?

At home, it was fine because if I was out, I could VPN and see it. Is it possible to have it make a VPN connection with my router (my VPN host is my router), then be able to see it, just like it's on my network again?

Thanks.

Drew

PS: I really don't need to be sharing with the world my copy of the Rolling Stones Hampton Coliseum show. (probably don't need that in TKLBAM backup, either...)

L. Arnold's picture

Pretty sure you can control this in the same place you control your elastic IPs etc.  It has been a while since I was there.

The same restore process though should work on a Local Hosted Server.  Then go kill the original backup so you don't pay for the Rolling Stones show every month on your Amazon bill.
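On the access question, the place where EC2 controls this is the instance's security group. One way to lock the instance down is to allow inbound traffic only from your own public IP. The Python sketch below builds such rules in the shape boto3's `authorize_security_group_ingress` expects. It is an assumption-laden example: the port list (SSH, HTTP, HTTPS, plus 12320 and 12321 for the TurnKey Web Shell and Webmin) should be adjusted to what your appliance really exposes, and the helper names are hypothetical.

```python
import ipaddress

# Assumed port list: SSH, HTTP, HTTPS, TurnKey Web Shell and Webmin.
DEFAULT_PORTS = (22, 80, 443, 12320, 12321)


def owner_only_permissions(my_ip, ports=DEFAULT_PORTS):
    """Build ingress rules that admit a single IPv4 address on each port."""
    address = ipaddress.IPv4Address(my_ip)  # raises ValueError on bad input
    cidr = f"{address}/32"
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": cidr}],
        }
        for port in ports
    ]


def apply_permissions(group_id, permissions):
    """Send the rules to EC2; needs boto3 and AWS credentials configured."""
    import boto3  # imported lazily so the sketch runs without boto3 installed

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(GroupId=group_id, IpPermissions=permissions)


# 203.0.113.7 is a documentation-only example address.
for rule in owner_only_permissions("203.0.113.7", ports=(22, 443)):
    print(rule)
```

Adding these rules does not remove existing ones, so any rule that already opens a port to 0.0.0.0/0 has to be revoked as well. If your home IP changes, the rules need updating.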

Drew Ruggles's picture

I've mostly finished building a new box with modern hardware (and quiet fans, thankfully) and decided to load up my server as a VM for the time being so I could dissect it onto multiple hardware appliances. At first, I tried the new Turnkey Core 13.0 RC (amd64 wheezy) but found out quickly I couldn't (or shouldn't...?) restore my LAMP 10.04LTS appliance to a Core appliance (tsk, tsk).

So then I tried the current LAMP 12 stack (i386 squeeze), and this is where it got interesting, as this is where I last left my previous hardware. When I ran TKLBAM Restore on this appliance, it seemed to work, however, when I went to reboot, I ran into the same READ ONLY FILESYSTEM errors on boot that I had seen with the old box on trying to restore to the newly purchased SSD. There was just NO WAY this could be a bad filesystem error. I have all new hardware that had been tested and running flawlessly on the workstation OS, as well as test workstation OS in a VM.

Then it occurred to me -- as it did back in my second post in this thread -- that I should start with the previous Ubuntu Server based version of the TKL LAMP Stack (Lucid 11.3) rather than the R12 Debian version. Fortunately, SourceForge.net has the previous versions available. I installed this on a new VM, started the TKLBAM Restore -- it takes several hours -- and voila! It's back. Not everything is 100% -- I still can't log in to PHPVirtualbox, for instance, but since I moved all the VMs off the server, it's no big deal for now (...and I'm not sure if running a VM inside a VM would even be plausible or desirable).

So while I've seen it on here many o' times that the way to upgrade an appliance is to run a TKLBAM full backup, do a clean install of the new appliance, then a TKLBAM Restore, my experience was this did not work -- at least between 10.04LTS and R12. Maybe it's possible between 11.3 and 12, but I haven't tested this, yet. This would also explain why the AWS server failed in the TKLBAM Restore. I don't know how to start an AWS instance with an 11.3 appliance (and I canceled my EC2 account so I'm not going to test this anytime soon).
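Going only by what happened in this thread (the restore onto the Ubuntu-based appliance booted and the one onto the Debian-based appliance didn't), a pre-restore sanity check could compare the base distro of the source and target appliances. The Python sketch below assumes version strings of the form `turnkey-lamp-11.3-lucid-x86`, as found in `/etc/turnkey_version`. That format, the codename table and the function names are assumptions, and this is one user's experience rather than a documented TKLBAM restriction.

```python
# Codename -> base distro for the releases mentioned in this thread.
BASE_DISTRO = {
    "lucid": "ubuntu",
    "squeeze": "debian",
    "wheezy": "debian",
}


def parse_turnkey_version(text):
    """Split e.g. 'turnkey-lamp-11.3-lucid-x86' into its parts."""
    name, version, codename, arch = text.strip().rsplit("-", 3)
    if name.startswith("turnkey-"):
        name = name[len("turnkey-"):]
    return {"app": name, "version": version, "codename": codename, "arch": arch}


def same_base(source, target):
    """True if both appliances are built on the same base distro."""
    src = BASE_DISTRO.get(parse_turnkey_version(source)["codename"])
    dst = BASE_DISTRO.get(parse_turnkey_version(target)["codename"])
    return src is not None and src == dst


print(same_base("turnkey-lamp-11.3-lucid-x86", "turnkey-lamp-12.0-squeeze-x86"))
print(same_base("turnkey-lamp-11.3-lucid-x86", "turnkey-lamp-11.3-lucid-x86"))
```

A `False` here wouldn't prove the restore will fail. It just flags the situation that went wrong above, so you can test on a throwaway VM first.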

Drew

Jeremy Davis's picture

But it does sound very strange. I don't understand why it would do that.

If your system is heavily customised (as it sounds it is) then I get why it might not work perfectly when restored to v12 (Debian based). Even though Debian and Ubuntu are very similar, they are not quite the same. But I'm not sure what could be in your backup that could cause the FS to go read-only... I don't have any time at the moment but if you are open to it, perhaps I could have a look sometime...?

Drew Ruggles's picture

Of course! Just let me know what's the best way to set it up for you to poke around. Remember, I was not able to get the v12 Debian version to work -- only the v11.3 Ubuntu Server version (original backup was v10).

"Highly" customized is probably relative. If you just stick with each appliance as the way it came from TKL, then, yes, it's customized, however I don't feel there is a lot of customization. Who knows?

Jeremy Davis's picture

I've just been reviewing this thread (and correcting a few of my typos) and have realised that I don't recall what you are trying to achieve here (sorry I remember that there were discussions previous to this thread, but don't recall what they were...).

Basically I'd like to have a look and see what is going on re the disk corruption issue. However, I obviously won't be able to recreate it with your hardware and it won't be possible to get remote access to a server that won't boot! But I also notice that you speak about a VM. Is your instance running on a VM (and displaying the same corruption issues)? If so can you give me a little more info? Ie what OS you have installed to bare metal, what the VM environment is?

Also if you are looking to host VMs I seriously can't recommend Proxmox enough! (Although knowing me I've probably told you that already!) Unless you have some real reason to be running your server on hardware, I wouldn't (it's such a waste of hardware...) I am more than happy to step you through how I'd set it up if you can detail your hardware a little more and your ultimate plans for this box.

As for testing your backup, the ideal way to do that for me would be to restore to a AWS instance (I'm happy to host it on my Hub account, but we can discuss those details later), but perhaps that's not going to adequately test your problem? Again it all depends on the specifics of what you are trying to achieve. If there is stuff you'd rather not discuss publicly then you can post an overview here and PM me the specifics...

Also as an idea (re your original plan to restore to the new v13RC) I think it should be possible and here's how I'd do it (for interest's sake - I haven't tested it, I think I might sometime though - it'd be an interesting exercise...). Run a TKLBAM backup of your server, but force it to use the Core profile (this will make it a bigger backup, but will include the stuff that Core needs, but isn't in your original backup). Then restore to the v13RC appliance...
