Hi,

Google noticed that my server stopped responding to robots.txt in the past 24-48 hours.

No response on either web console (ports 12320 and 12321 are not responding).

No web server is running at all!  SSH is running but is useless as no console access is available.

Looking at the TurnKey Hub console I see this at the bottom:

Begin: Running /scripts/init-bottom ...
Done.
 * Starting AppArmor profiles        [ OK ]
 * Starting Initialization hooks        The disk drive for /dev/sda3 is not ready yet or not present
Continue to wait; or Press S to skip mounting or M for manual recovery

I tried rolling back to the TKLBAM backup from 7th July, but the same error appears after forcing the reboot.

Hmm - what to do? Open to any suggestions... Quite urgent, as my company website is down and restoring from backup appears not to be an option.

Second question: why does restoring from TKLBAM fail? It should *never* fail IMHO - that is its purpose, to recover from failures...

Third question: how does one access the console? Showing me the output and allowing input are very different things.

Many thanks in advance.


Following on from the above: digging deeper, it appears that the server was force dist-upgraded on Monday. This occurred without my knowledge or consent.

Apache and MySQL were both upgraded and now the entire Joomla site is offline - thanks (NOT).

If this is not resolved in 12 hours I will have to cancel this TurnKey/Amazon subscription and move to Rackspace - my server there is always restorable and is *never* fiddled with without my knowledge or consent...


I think this is the final nail in the coffin for me - looking back at the logs, the backups ceased on Sun 14th.

I want to know:

- who authorised this change to turn off the backup?

- why it was done?

- when you were going to bother letting me (the owner) know? I sincerely hope that you weren't going to wait until someone (eg: one of my customers) pointed out to me that my company server is offline - thanks to Google for advising me, no thanks to TK...

Reliability is one thing, maintainability is another. Whilst TK has been eminently reliable for the past couple of years, the maintenance nightmare I am in now far, far outweighs the reliability factor.

Forced upgrade, broken website, broken backups, no communication - I want a refund.


Alon Swartz's picture

It's technically impossible for the TurnKey Hub to perform actions on your servers as it does not have access. Once the Hub deploys a server, the only actions it can perform (directed by you) are to stop / start / destroy the instance via Amazon's API. The Hub cannot access the server's filesystem, nor can it execute commands.

Regarding updates: all TurnKey appliances are configured to automatically install security updates (not regular updates) directly from the Debian package archive daily. Debian carefully backports security updates so as not to cause breakage. We haven't had any issues with the auto security updates since moving to Debian.

Does anyone else have access to the server who might have performed a dist-upgrade?

It sounds like you now have shell access to the server, correct? If this is the case, it should be possible to see what happened, why, and most likely fix the issues.
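If the logs survived, the apt and dpkg histories will usually show what changed and when. A rough sketch, assuming the stock Debian log locations:

    # apt's history of installs / upgrades / removals, with timestamps
    less /var/log/apt/history.log

    # per-package status changes recorded by dpkg
    grep -E " (install|upgrade|remove) " /var/log/dpkg.log

    # older, rotated logs if the change happened a while back
    zless /var/log/apt/history.log.1.gz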

As for the backups, maybe you were attempting to restore a backup that was performed after the breakage? You could try restoring an earlier backup to get back up and running, and then look into the issue further.
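A minimal sketch of that from the shell (the record ID shown is hypothetical - use whatever tklbam-list reports for the backup you want):

    # list the backup records associated with your Hub account
    tklbam-list

    # restore an earlier record by its ID, e.g. record 2
    tklbam-restore 2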

If there is anything I can do to help, feel free to ask.

Thanks Alon, my concern was that things had changed and they should not have. Restoration from what should be a working backup has failed.

I will retry with an older backup tomorrow, as it is very late here now - after 4am on a work day (but this has kept me up for a while)...

Re: other people doing upgrades - no one else (is supposed to have) access to this machine except me. I try to keep it reasonably up to date: minimal Joomla functionality, the Joomla admin account renamed, .htaccess protection on the admin pages, etc.

I created a new instance and restored the backup there, in the vain hope that it was the original instance itself that had become corrupted. Now I cannot even SSH to the newly restored instance, though it does respond to pings - weird. (I checked the firewall rules; yes, SSH, web and ping are allowed.)

Its console output looks as though it is hung, asking for a new root password - I provided one during instance creation, so I'm completely at a loss now. By the looks of it I cannot regain control of this instance at all, since I am unable to log in via SSH or the web shell, or gain control through the web interface, and so there is no way to restore another backup to it. As above, I will try an earlier backup later today after some Zzzz.

Thanks again Alon for your offer of assistance and clear thinking - my fears bubbled out and overflowed too quickly...


Update: Launched another new instance and tried restoring from January this year. I *know* the server was operational from January through to at least early June, so that backup must work, or there is something very wrong with the system.

Same issue as above:

I set the root password during the instance creation.

After the machine is up, I can ssh in with that password no problem.

Once in, I restore the backup. I reset the root password after the restore.

At this point I can access the (default) Joomla page, Webmin, the web shell, phpMyAdmin and SSH.

I reboot the server to restart all services cleanly.

After the reboot, there is no access to the server whatsoever (connection refused on SSH). Console output indicates (again) that the password needs to be set for the root account - see the log below.

As I had numerous punctuation characters in the password, including forward and backward slashes, pipe characters, etc., it occurred to me that the password might be getting munged by some script that bombs out on those punctuation characters - so I removed them all and reverted to a shorter password containing only underscores. Same result.
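(For reference, a quick way to test that theory - a sketch only, and not necessarily how the firstboot hook sets the password - is to set it by hand over SSH with chpasswd, which reads the password from stdin and so sidesteps most shell-quoting surprises:

    # hypothetical password shown; the single quotes stop the shell from
    # interpreting |, \ and friends before chpasswd ever sees them
    echo 'root:p@ss/w0rd|with\punct' | chpasswd
)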

Console log after the reboot following the restore (yes, I reset the root password immediately prior to the reboot). At the end it is clearly asking for a new root password, which is preventing the machine from starting at all... but the errors in register_finalize seem to be the culprit:

...
writing new private key to '.tmpkey.pem'
-----
writing RSA key
Traceback (most recent call last):
  File "/usr/bin/hubclient-register-finalize", line 42, in <module>
    main()
  File "/usr/bin/hubclient-register-finalize", line 34, in main
    subkey, secret = hubapi.Server().register_finalize(conf.serverid)
  File "/usr/lib/hubclient/hubapi.py", line 44, in register_finalize
    response = self.api.request('POST', url, attrs)
  File "/usr/lib/python2.6/dist-packages/pycurl_wrapper.py", line 169, in request
    raise self.Error(response.code, name, description)
pycurl_wrapper.Error: 401 - HubServer.Finalized (Hub Server already finalized registration)
Traceback (most recent call last):
  File "/usr/lib/inithooks/firstboot.d/25ec2-userdata", line 36, in <module>
    main()
  File "/usr/lib/inithooks/firstboot.d/25ec2-userdata", line 32, in main
    executil.system(fh.path)
  File "/usr/lib/python2.6/dist-packages/executil.py", line 56, in system
    raise ExecError(command, exitcode)
executil.ExecError: non-zero exitcode (1) for command: /tmp/ec2userdataIXDPFe
[The remainder of the console log is the ncurses first-boot dialog rendered as raw terminal escape codes. The readable parts are the title "TurnKey Linux - First boot configuration" and a "Root Password" dialog reading "Please enter new password for the root account."]

I found your comments regarding the DNS issues at the link below, Alon, so I will try those suggestions to see if they alleviate my pain:

https://github.com/turnkeylinux/tracker/issues/46#issuecomment-20793652

[Edit: Correction to description of log content - not full log, only tail]


Jeremy Davis's picture

But it seems that there is something in the current TKL instance that is not compatible with the data contained within your backup... It seems incredibly strange to me that this occurred prior to a backup restore, though...

I understand how the backup data could theoretically cause this to happen (especially if you used TKLBAM to migrate data from a particularly old TKL version instance to a brand new one) - hence the point that having a backup is never enough: to make sure your backups are good, you need to test them regularly. But I still don't understand how this could have happened without manual intervention (i.e. your running server developing this issue all on its own, and your restored backups from ages ago also hitting the same issue...).

To assist with troubleshooting, it may be worth spinning up a local instance of the relevant TKL appliance (on VirtualBox or similar), restoring your data there, and seeing how that goes.
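One wrinkle there: a freshly installed local appliance isn't linked to the Hub yet, so TKLBAM needs initialising with your Hub API key before a restore will work (the key placeholder below is obviously not real):

    # link the fresh local appliance to your Hub account, then restore as usual
    tklbam-init YOUR_HUB_APIKEY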

Andrew's picture

It was the sysvinit package that I installed on the original 11.1 server about 3 years ago that was ultimately the root cause... Installing it resulted in the removal of the upstart package, which caused all of my backups to become non-restorable (in the sense that restoring them did not result in a running system).

[As for why I installed that non-standard package: it was late one night, I was in a hurry to get the site up, and I've used PC Unix since the 40-floppy download of 386BSD in 1992 - so of course, why would I switch to a different set of startup utilities in 2010 after 18 years of familiarity with the SysV method... Besides, after the first, second and third upgrades all succeeded, why would I expect the next upgrade to fail?]

I have rebuilt my site manually and it is up already.

Rebuild steps:
1) Download TKL Joomla iso, install to local vmware.
2) Boot it up, restore over the top from my online backup of 30 June 2013.
3) Reboot following restore, answer the first run prompts in the console window.
4) Manually diagnose and repair (i.e. apt-get install upstart, answering "Yes, do as I say!" to the warning - see the sketch after this list).
5) Fix all the broken things that resulted. (NB: it worked perfectly for the past 3 years with sysvinit installed, and survived the upgrades to 11.2, 11.3 and 12.0, but died at 12.1...)
6) Backup new local system using tklbam-backup.
7) Launch new hub instance using the backup from step 6.
8) Correct a few more minor issues - as I rushed the first-run prompts in step 3, I set different passwords from those in the backup, and hence the MySQL restore failed. I used Akeeba Backup to copy the entire Joomla instance and DB from the local server back to the new Hub instance.
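Roughly, the commands behind steps 4 and 6 were along these lines (run on the local VM; a sketch rather than an exact transcript):

    # step 4: install upstart, which pulls sysvinit back out - apt treats
    # removing an essential package as dangerous and demands that you type
    # the "Yes, do as I say!" phrase to continue
    apt-get update
    apt-get install upstart

    # step 6: take a fresh TKLBAM backup of the repaired local system
    tklbam-backup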

Et Voila! Site is back up.

So it was clearly something I did about 3 years ago (replacing upstart with sysvinit) that ultimately took the server down. It just happened to require the upgrade from 12.0 to 12.1 to break it.

This should serve as a warning to others - modifying your TKL system may well result in your complete inability to restore from *any* backup, ***irrespective of whether you tested that backup***.

The last part of that sentence needs to be read several times to make sure you fully understand the implications, i.e. it doesn't matter whether you have tested your backups or not - they may still not be fully recoverable after an automatic upgrade. The only solution in such a case is a manual rebuild on a local machine.

As Jeremy says above, tested backups are good things to have. I'm not too sure how I would have been able to test this without installing 12.1 myself locally and doing a restore onto that, prior to the Hub's auto-upgrade, though.

Andrew's picture

Installing it resulted in the removal of the upstart package, which after the auto upgrade to 12.1 caused all of my backups to become non-restorable (in the sense that restoring them did not result in a running system).

Jeremy Davis's picture

You can restore just parts of TKLBAM backups. E.g. you can choose to restore only files and DBs (and exclude packages) with the '--skip-packages' switch, which it sounds like would probably have got you past your issues. FYI, the tklbam-restore man page is in the docs, here.
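For example (a sketch; the record ID is hypothetical):

    # restore files and databases from record 2, but skip replaying the
    # recorded package selections
    tklbam-restore 2 --skip-packages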

Also, for testing your backups, you could always launch a fresh (EC2) instance and restore your backup to that. As long as you leave your 'proper' server running, most (if not all) of it should work, I would think... although you may need to temporarily add an entry to your local hosts file so the domain name resolves to the test server rather than the real one. Don't forget to remove it when you've finished testing though! It could cause some unnecessary panic if you forget! :)
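Something like this in the hosts file on your own workstation would do (hypothetical IP and domain) - just remember to take it out again afterwards:

    # /etc/hosts on your workstation: point the site's name at the test instance
    203.0.113.25    www.example.com example.com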
