Rick Kipfer's picture

This is something I didn't expect. One of the greatest things about Amazon is that you can start and stop instances at will. That's great, but what happens when you can't stop or kill a server? You can't connect and can't reboot, yet it is still running in the background and doing stuff. I can't unplug it, I can't press off, I can't smash it with a sledgehammer.

This happened to me today. I thought TKL had ultimate low-level control over server instances, and that if push came to shove, we could pull the plug. And with a script running on that server that is designed to perform real-life connection tasks to services controlling phones, SMS, email, etc., it was truly like Joshua, the computer in the 1983 movie WarGames that asked Matthew Broderick, "SHALL WE PLAY A GAME?"

In the movie, Matthew Broderick got the idea to get Joshua to play tic-tac-toe against itself until it slowed the computer down enough that it stopped playing the GLOBAL THERMONUCLEAR WAR game (which was interfaced into the US military nuclear arsenal). But I couldn't even do that, because I couldn't get a shell prompt or Webmin. All I had was a runaway server that responded only to ping but was still interfacing with the world through automation.

It was a harrowing experience. It finally did shut down, but it took about 30 minutes. The 5 reboots I tried in the Hub interface (before I sent the STOP command) told me "the server is rebooting", but a continuous ping showed that was not actually happening. While it was "stopping", it continued to respond to pings, make phone calls and send SMS messages. In that time I was able to build another server from an image, change the DNS entries, and get the new server up and running.

I'm glad this happened, because with a server running a script that connects with real-world services like telephony and SMS, I realize I need to add another layer of control, maybe a relay on another server hosted by another company, so I have a way of stopping its activity if TKL's interface fails again.

Has anyone had a similar experience? Does anyone know what could possibly have happened to make an instance freeze like that and become unresponsive at the lowest level available to my control?

Rick

P.S. UPDATE: It's about 90 minutes after my first attempt at reboot and about an hour after I initiated the STOP. It is still 'stopping' in the TKL hub. I have a good level of confidence that it may be stopped because I'm not getting direct ping responses from the IP, but I have no way to verify for sure that it is actually stopped.

Jeremy Davis's picture

It sounds like something is seriously messed up on your server...

Have you been able to restart the server and see what may have been going on?

My suspicion is that something locked up somewhere along the line (bit of an Einstein, aren't I!?)... If that is the case, Shutdown and Reboot not responding is expected behaviour (although that's probably no comfort to you...). Both the Shutdown and Reboot commands cause a 'nice' stop, i.e. all running programs/processes are 'gracefully' stopped. This means that the server OS politely asks them to stop. So if a process has gone rogue and is 'busy', then the server will wait until it says it's finished (to avoid corruption and/or data loss)... AFAIK the Stop and Destroy commands are somewhat akin to pulling the plug. IIRC you can configure your server to force processes to close if they refuse to close gracefully within a given timeframe. That may be another useful thing to investigate and configure. Another thing that may be worth doing is to have an external log collection appliance (so you can get some clues on what may have caused things like this in the future).
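To make the 'ask politely, then force' idea concrete, here is a minimal Python sketch. It is only an illustration of the general pattern, not how the Hub, AWS or any particular init system does it. The function name `stop_with_timeout` and the `grace_seconds` parameter are made up for the example, and the signal behaviour described in the comments assumes a POSIX system such as Linux.

```python
import subprocess


def stop_with_timeout(proc: subprocess.Popen, grace_seconds: float) -> str:
    """Ask a child process to stop; force it if it has not exited in time.

    Returns "graceful" if the process exited after the polite request,
    or "forced" if it had to be killed.
    """
    # Polite request: on POSIX this sends SIGTERM, which a process
    # may handle, delay, or ignore entirely.
    proc.terminate()
    try:
        proc.wait(timeout=grace_seconds)
        return "graceful"
    except subprocess.TimeoutExpired:
        # Grace period is over: on POSIX this sends SIGKILL, which the
        # process cannot catch or ignore.
        proc.kill()
        proc.wait()  # reap the process so it does not linger as a zombie
        return "forced"
```

A well-behaved process returns "graceful" almost immediately, while one that ignores the polite request is killed once `grace_seconds` has passed. How long a grace period is safe depends on what the process might be in the middle of writing.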

Also as far as I know the Hub is merely something of a wrapper for the AWS Control Panel (using the AWS API). So the Hub should have just as much control of your appliances as the AWS Control Panel.

Rick Kipfer's picture

Yes, Jeremy, you are a bit of an Einstein. But maybe only the part in which he couldn't control his own bladder and regularly wet himself? hehe. :o)

I eventually found this instance destroyed about 3 hours after the initial 'DESTROY' request in the Hub. So, ultimately, the thing finally died on its own. Everything has been running smoothly ever since with my new monster created from the last snapshot. So the experience ranged from "Oh my god, this is crazy, approaching the dark side" to "Oh my god, once we got over the hump, AWS snapshots made the full recovery back into action a breeze." Our confidence was severely crushed on one hand by the utter lack of control, and then we were impressed with that feeling that comes with an instant restore. Pretty much a wash.

What we did end up deciding after this experience is that Amazon cannot be fully trusted (of course they can't, they aren't any more perfect than we are). We may, at some point down the road when there's more to lose, build a third-party proxy/break-glass-in-case-of-emergency/throw-the-main-switch logic node that would reside on another server with another provider. It would act as a failsafe, letting us intervene in the logic of the script if something like this were ever to happen again.
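For anyone weighing up the same idea, here is a rough Python sketch of what such a switch could look like from the script's side. Everything in it is an assumption made for illustration: the URL `https://switch.example.com/status` is a placeholder, the convention that the switch answers with the plain text `RUN` is invented, and the function names are hypothetical. The one design point it tries to show is failing closed: if the switch says anything other than `RUN`, or cannot be reached at all, the script stops acting.

```python
import urllib.request
from typing import Callable, Iterable, List

# Placeholder address of a tiny status page hosted with a different provider.
KILL_SWITCH_URL = "https://switch.example.com/status"


def fetch_status(url: str = KILL_SWITCH_URL, timeout: float = 3.0) -> str:
    """Read the switch state as plain text from the external server."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def allowed_to_act(fetch: Callable[[], str] = fetch_status) -> bool:
    """Return True only if the switch explicitly answers 'RUN'."""
    try:
        return fetch() == "RUN"
    except Exception:
        # Fail closed: an unreachable or broken switch means "do nothing".
        return False


def run_jobs(jobs: Iterable[Callable[[], object]],
             fetch: Callable[[], str] = fetch_status) -> List[object]:
    """Run jobs in order, checking the switch before every single one."""
    results = []
    for job in jobs:
        if not allowed_to_act(fetch):
            break  # switch thrown or unreachable: stop making calls/SMS
        results.append(job())
    return results
```

Because the check happens before each outbound action, changing the page on the external server to anything other than `RUN` stops new calls and messages even when the instance itself cannot be reached or shut down. The trade-off of failing closed is that an outage at the second provider also pauses the script.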

I still love TKL.

Chris Musty's picture

AWS reuse IP addresses, so unless you have reserved yours (Elastic IP) it can be reassigned and you may be pinging something else. The Hub interface can be buggy. I have had destroyed instances hang around in the Hub, but logging into the AWS console shows otherwise. Why it fixed itself 3 hours later is beyond me. I tend not to shut down servers from the Hub or AWS but rather kill them from an SSH session. When it tells me I have been disconnected, I know I can destroy or remove whatever I need to from the Hub.

Chris Musty

Director

Specialised Technologies

Rick Kipfer's picture

Thanks Chris. I do love TKL, but you're right, the Hub can be kind of buggy from time to time.

I was under the impression I couldn't see my TKL instances in the AWS console, but after reading your post I went in and realized it just wasn't set to the right region. THANKS!!

This is good news for me, as you can imagine. Of course I want the lowest-level control possible for my instances, and if I had known I could see them in AWS I would have tried to kill the frankenstein there, or at least verify it was actually still alive.

On another note, by stopping the instance in SSH, do you mean just using the shutdown command? Would that actually "stop" the instance so that it would have to be 'started' again in AWS or the TKL Hub?

Thanks again

Chris Musty's picture

I know what you mean. The AWS console needs a little thought; it can get confusing.

What you do in AWS generally appears in the Hub, so there are some things I do in either.

My city (Sydney) is not available in the Hub yet, so a lot of my latency experiments are yet to include the Hub.

Now that AWS have released AMI copy this should be a snap for the devs (hint hint)

As for using an SSH session, I use PuTTY, not the web SSH gizmo. My appliances have user/pass login disabled, so without a 4096-bit key you ain't getting in. To shut the thing down I use

shutdown -h now

or if I just want to reboot

reboot

That way, when the PuTTY session gets killed, I know for certain it's off, whatever the Hub or AWS reports, and I can destroy it from either.

Chris Musty

Director

Specialised Technologies
