Design overview

Background

Many batch tasks can be easily broken down into units of work that can be executed in parallel. On cloud services such as Amazon EC2, running a single server for 100 hours costs the same as running 100 servers for 1 hour. In other words, for problems that can be parallelized this makes it advantageous to distribute the batch on as many cloud servers as required to finish execution of the job in just under an hour. To take full advantage of these economics an easy to use, automatic system for launching and destroying server instances reliably on demand and distributing work amongst them is required.

Terms and definitions

  • Job: a shell command representing an atomic unit of work
  • Task: a sequence of jobs
  • Task template: a pre-configured task
  • Session: the state of a task run at a particular time. This includes the task configuration, the status of jobs that have finished executing, and a list of jobs still pending execution.
  • Split: the number of workers the task jobs of task are split amongst.
  • Worker: a server running SSH on which we execute jobs. This can be a persistent server or a dynamically allocated EC2 cloud server instance.
  • TurnKey Hub: a web service that cloudtask may use to launch and destroy TurnKey servers preconfigured to perform a given task.

Features

  • Jobs are just simple shell commands executed remotely: there is no special API. Shell commands are well understood, language agnostic and easy to test and develop.

  • Ad-hoc task configuration via command line options / environment: cloudtask can be used directly from the command line, which is useful for one-off tasks, or for experimenting/debugging a new routine task.

  • Pre-configured task templates: the configuration parameters for routine tasks can be embedded within a pre-configured task template, which is itself executable just like cloudtask, and inherits its interface.

    Under the hood a task template is implemented by defining a Python class that inherits Task:

#!/usr/bin/python 
from cloudtask import Task

class HelloWorld(Task):
    DESCRIPTION = "This is a hello world cloudtask template"
    COMMAND = 'echo hello world'
    SPLIT = 2
    REPORT = 'mail: cloudtask@example.com liraz@example.com'

HelloWorld.main()
  • Transparent execution with real-time logging: cloudtask provides real-time logging to make it easy for the user to following the progress of a task. For example, the progress of any command executed over SSH can be followed by tailing the worker's session log:

cd ~/.cloudtask/$session_id/workers/
tail -f 1234
  • Fault tolerance: CloudTask is designed to reliably survive multiple types of failure. For example:

    • worker servers are continually monitored for failure so that a job executing on a failed server may be rerouted to a working server. A task will continue executing so long as a single worker survives.
    • the user can specify a per-job timeout so that jobs that freeze up for whatever reason will time out gracefully without jamming up the worker indefinitely.
    • In case of Hub API failure cloudtask will wait a few seconds and try again.
  • Abort and resume capability: a task can be aborted at any time by pressing Ctrl-C, or sending the TERM signal to the main process. After all automatically launched server instances are destroyed, the state of the session is saved so that it may be resumed later from where it left off.

  • Reporting hook: when the execution of a session finishes a reporting hook may be configured to perform an arbitrary action (e.g., sending a notification e-mail, updating a database, etc.). Three types of reporting handlers are currently supported:

    1. mail: send out an e-mail with the session log to one or more recipients.
    2. sh: execute a shell command. The current working directory is set to the session path and the environment is populated with the session context.
    3. py: execute an arbitrary snippet of Python code. The session and task configuration are accessible as local variables.

How it works

When the user executes a task, the following steps are performed:

  1. A temporary SSH session key is created.

    The initial authentication to workers assumes you have set up an SSH agent or equivalent (cloudtask does not support password authentication).

    The temporary session key will be added to the worker's authorized keys for the duration of the task run, and then removed. We need to authorize a temporary session key to ensure access to the workers without relying on the SSH agent.

  2. Workers are allocated.

    Worker cloud servers are launched automatically by cloudtask to satisfy the requested split unless enough pre-allocated workers are provided via the --workers option.

    A TKLBAM backup id may be provided to install the required job execution dependencies (e.g., scripts, packages, etc.) on top of TurnKey Core.

  3. Worker setup.

    After workers are allocated they are set up. The temporary session key is added to the authorized keys, the overlay is applied to the root filesystem (if the user has configured an overlay) and the pre command is executed (if the user has configured a pre command).

  4. Job execution.

    CloudTask feeds a list of all jobs that make up the task into an job queue. Every remote worker has a local supervisor process which reads a job command from the queue and executes it over SSH on the worker.

    The job may time out before it has completed if a --timeout has been configured.

    While the job is executing, the supervising process will periodically check that the worker is still alive every 30 seconds if the job doesn't generate any console output. If a worker is no longer reachable, it is destroyed and the aborted job is put back into the job queue for execution by another worker.

  5. Worker cleanup

    When there are no job commands left in the input Queue to provide a worker it is cleaned up by running the post command, removing the temporary session key from the authorized keys.

    If cloudtask launched the worker, it will also destroy it at this point to halt incremental usage fees.

  6. Session reporting

    A reporting hook may be configured that performs an action once the session has finished executing. 3 types of reporting hooks are supported:

    1. mail: uses /usr/sbin/sendmail to send a simple unencrypted e-mail containing the session log in the body.

    2. sh: executes a shell command, with the task configuration embedded in the environment and the current working directory set to the session path. You can test the execution context like this:

      --report='sh: env && pwd'
    3. py: executes a Python code snippet with the session values set as local variables. You can test the execution context like this:

      --report='py: import pprint; pprint.pprint(locals())'