Hi,

Google has notified me that my server stopped responding to requests for robots.txt within the past 24-48 hours.

No response on either web console (ports 12320 and 12321 are not responding).

No web server is running at all!  SSH is running but is useless as no console access is available.

Looking at the TurnKey Hub console, I see this at the bottom:

Begin: Running /scripts/init-bottom ...
Done.
 * Starting AppArmor profiles        [ OK ]
 * Starting Initialization hooks
The disk drive for /dev/sda3 is not ready yet or not present
Continue to wait; or Press S to skip mounting or M for manual recovery

I tried rolling back to the TKLBAM backup from 7th July, but the same error appears after forcing the reboot.

Hmm - what to do? Open to any suggestions... Quite urgent, as my company website is down and restoring from backup appears not to be an option.

Second question: Why does restoring from TKLBAM fail? It should *never* fail, IMHO - that is its purpose, to recover from failures...

Third question: How does one access the console? Showing me the output and allowing input are very different things.

Many thanks in advance.


Following on from the above: delving deeper, it appears that the server was force dist-upgraded on Monday. This occurred without my knowledge or consent.

Apache and MySQL were both upgraded and now the entire Joomla site is offline - thanks (NOT).

If not resolved in 12 hours I will have to cancel this TurnKey/Amazon subscription and move to Rackspace - my server there is always restorable and is *never* fiddled with without my knowledge or consent...


I think this is the final nail in the coffin for me - looking back at the logs, the backups ceased on Sun 14th.

I want to know:

- who authorised this change to turn off the backup?

- why was it done?

- when were you going to bother letting me (the owner) know? I sincerely hope that you weren't going to wait until someone (e.g. one of my customers) pointed out to me that my company server is offline - thanks to Google for advising me, no thanks to TK...

Reliability is one thing, maintainability is another. Whilst TK has been eminently reliable for the past couple of years, the maintenance nightmare I am in now far, far outweighs the reliability factor.

Forced upgrade, broken website, broken backups, no communication - I want a refund.


Alon Swartz:

It's technically impossible for the TurnKey Hub to perform actions on your servers, as it does not have access. Once the Hub deploys a server, the only actions it can perform (and only when directed by you) are to stop / start / destroy the instance via Amazon's API. The Hub cannot access the server's filesystem, nor can it execute commands.
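
Just to illustrate the scope of that: the only operations involved are instance-level calls to Amazon's API, something along the lines of the following (shown with the AWS command line tools purely as an example - the Hub's internal implementation may differ, and i-0123456789abcdef0 is a placeholder instance ID):

# stop / start / destroy an instance - none of these touch anything inside it
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 start-instances --instance-ids i-0123456789abcdef0
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0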

Regarding updates. All TurnKey appliances are configured to automatically install security updates (not regular updates) directly from the Debian package archive daily. Debian carefully backports security updates so they do not result in breakage. We haven't had any issues with the automatic security updates since moving to Debian.

Does anyone else have access to the server who might have performed a dist-upgrade?

It sounds like you now have shell access to the server, correct? If this is the case, it should be possible to see what happened, why, and most likely fix the issues.
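
If you can get a shell, the Debian package logs will show exactly what was upgraded and when. For example (these are the standard Debian log locations; adjust paths and dates as needed):

# recent package installs/upgrades recorded by dpkg
grep -E ' (install|upgrade) ' /var/log/dpkg.log | tail -n 50
# apt's own history, including the command line that triggered each run
less /var/log/apt/history.log
# rotated (compressed) logs, if the change happened a while ago
zgrep -h ' upgrade ' /var/log/dpkg.log.*.gz | sort | tail -n 50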

As for the backups, maybe you are attempting to restore a backup that was performed after the breakage occurred? You could try restoring an earlier backup to get back up and running, and then look into the issue further.
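
Restoring an earlier backup from the command line would look something like this (the record ID is a placeholder - pick one from the list):

# list the backup records linked to your Hub account
tklbam-list
# restore a specific backup record by its ID
tklbam-restore 2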

If there is anything I can do to help, feel free to ask.

Thanks Alon - my concern was that things had changed that should not have. Restoring from what should be a working backup has failed.

I will retry with an older backup tomorrow as it is very, very late here now - after 4am on a work day (but this has kept me up a while now)...

Re: other people doing upgrades - no one else has (or is supposed to have) access to this machine except me. I try to keep it reasonably locked down and up to date: minimal Joomla functionality enabled, the Joomla admin account renamed, .htaccess protection on the admin pages, etc.

I created a new instance and restored the backup there, in the vain hope that it was the instance itself that had become corrupted. Now I cannot even SSH to the newly restored instance, though it is responding to pings - weird. (I checked the firewall rules; yes, SSH, web and ping are allowed.)

Its console output looks as though it is hung, asking for a new root password - I provided that during the instance creation, so I'm completely at a loss now. By the looks of it I cannot regain control of this instance at all: I am unable to log in via SSH or the web shell, or gain control through the web interface, so there is no way to restore another backup to it. As above, I will try an earlier backup later today after some Zzzz.

Thanks again Alon for your offer of assistance and clear thinking - my fears bubbled out and overflowed too quickly...


Update: Launched another new instance and tried restoring from January this year. I *know* the server was operational in January through to at least early June, so that backup must work or there is something very wrong with the system.

Same issue as above:

I set the root password during the instance creation.

After the machine is up, I can ssh in with that password no problem.

Once in, I restore the backup. I reset the root password after the restore.

At this point I can access the (default) Joomla page, Webmin, the web shell, phpMyAdmin and SSH.

I reboot the server to restart all services cleanly.

After reboot, there is no access to the server whatsoever (connection refused on SSH). The console output indicates (again) that the password needs to be set for the root account - see the log below.
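
For clarity, the sequence I am running on each fresh instance is essentially this (the address and backup record ID are placeholders; I run the restore from the shell):

ssh root@new-instance-address
tklbam-restore 1      # restore the chosen backup record
passwd root           # reset the root password after the restore
reboot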

As I had numerous punctuation characters in the password, including forward and backward slashes, pipe characters, etc., it occurred to me that the password might be getting munged by some script that bombs out on those punctuation characters - so I removed them all and reverted to a shorter password containing only underscores. Same result.

Console log after the reboot following the restore (yes, I reset the root password immediately prior to the reboot). Clearly it is asking at the end for a new root password, which is preventing the machine from starting at all... but the errors in register_finalize seem to be the culprit:

...
writing new private key to '.tmpkey.pem'
-----
writing RSA key
Traceback (most recent call last):
  File "/usr/bin/hubclient-register-finalize", line 42, in <module>
    main()
  File "/usr/bin/hubclient-register-finalize", line 34, in main
    subkey, secret = hubapi.Server().register_finalize(conf.serverid)
  File "/usr/lib/hubclient/hubapi.py", line 44, in register_finalize
    response = self.api.request('POST', url, attrs)
  File "/usr/lib/python2.6/dist-packages/pycurl_wrapper.py", line 169, in request
    raise self.Error(response.code, name, description)
pycurl_wrapper.Error: 401 - HubServer.Finalized (Hub Server already finalized registration)
Traceback (most recent call last):
  File "/usr/lib/inithooks/firstboot.d/25ec2-userdata", line 36, in <module>
    main()
  File "/usr/lib/inithooks/firstboot.d/25ec2-userdata", line 32, in main
    executil.system(fh.path)
  File "/usr/lib/python2.6/dist-packages/executil.py", line 56, in system
    raise ExecError(command, exitcode)
executil.ExecError: non-zero exitcode (1) for command: /tmp/ec2userdataIXDPFe
[ANSI escape sequences omitted - the console then displays the first-boot dialog:]
TurnKey Linux - First boot configuration
Root Password
Please enter new password for the root account.
< OK >

I found your comments, Alon, regarding the DNS issues at the link below, so I will try those suggestions to see if they alleviate my pain:

https://github.com/turnkeylinux/tracker/issues/46#issuecomment-20793652

[Edit: Correction to description of log content - not full log, only tail]


Jeremy Davis:

But it seems that there is something in the current TKL instance that is not compatible with the data contained in your backup... It seems incredibly strange to me that this occurred prior to a backup restore, though...

I understand how the backup data could theoretically cause this to happen (especially if you used TKLBAM to migrate data from a particularly old (TKL version) instance to a brand new one). Hence, having a backup is never enough - to make sure your backups are good, you need to test them regularly. But I still don't understand how this could have happened without manual intervention (i.e. your running server developing this issue all on its own, and your restored backups from ages ago also causing the same issue...)

To assist with troubleshooting, perhaps it's worth spinning up a local instance of the relevant TKL appliance (e.g. on VirtualBox or similar), restoring your data there, and seeing how that goes.
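
A rough sketch of that test on a fresh local VM would be something like this (the API key is a placeholder - use the one from your Hub account, and pick the record ID from the list):

# link the fresh appliance to your Hub account, then restore as usual
tklbam-init YOUR_HUB_APIKEY
tklbam-list
tklbam-restore 2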

Jeremy Davis:

You can restore just parts of TKLBAM backups. E.g. you can choose to restore just files and DBs (and exclude packages) with the '--skip-packages' switch, which it sounds like would probably have got you past your issues. FYI, the tklbam-restore man page is in the docs, here.
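
For example, a files-and-databases-only restore would look something like this (the backup record ID is a placeholder):

# restore everything except the package selections
tklbam-restore 2 --skip-packages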

Also, for testing your backups you could always launch a fresh (EC2) instance and restore your backup to that. As long as you leave your 'proper' server running, most (if not all) of it should work, I would think... although perhaps you may need to temporarily add an entry to your local hosts file, so the domain name resolves to the test server rather than the proper one. Don't forget to remove it when you've finished testing though! It could cause some unnecessary panic if you forget! :)
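
For example, a temporary entry in the hosts file on your own workstation might look like this (both the address and the domain are placeholders - substitute the test instance's public IP and your real domain):

# /etc/hosts on Linux/Mac (C:\Windows\System32\drivers\etc\hosts on Windows)
203.0.113.10    www.example.com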
