Finding the closest data center using GeoIP and indexing

We are about to release the TurnKey Linux Backup and Migration (TKLBAM) mechanism, which is designed to be the simplest way ever to back up a TurnKey appliance across all deployments (VM, bare-metal, Amazon EC2, etc.). It can also restore a backup anywhere, which in effect provides appliance migration or upgrade.

Note: We'll be posting more details really soon. In this post I just want to share an interesting issue we solved recently.

Backups need to be stored somewhere - preferably somewhere that provides unlimited, reliable, secure and inexpensive storage. After exploring the available options, we decided on Amazon S3 for TKLBAM's storage backend.
 

The problem

Amazon have 4 data centers, called regions, spanning the world: Northern California (us-west-1), Northern Virginia (us-east-1), Ireland (eu-west-1) and Singapore (ap-southeast-1).
 
The problem: Which region should be used to store a server's backups, and how should it be determined?
 
One option was to require the user to specify the region during backup, but we quickly decided against polluting the user interface with options that can be confusing, and opted for a solution that automatically determines the best region.
 

The solution

The map below plots the countries/states with their associated Amazon region:
 
Generated automatically using Google Maps API from the indexes.
 
The solution: Determine the location of the server, then look up the closest Amazon region to the server's location.
 

Part 1: GeoIP

This was the easy part. The TurnKey Hub is developed using Django which ships with GeoIP support in contrib. Within a few minutes of being totally new to geo-location I had part 1 up and running.
 
When TKLBAM is initialized and a backup is initiated, the Hub is contacted to get authentication credentials and the S3 address for backup. The Hub performs a lookup on the IP address and determines the country/state.
 
In a nutshell, adding GeoIP support to your Django app is simple: Install MaxMind's C library and download the appropriate dataset. Then, once you update your settings.py file, you're all set.
 
settings.py

GEOIP_PATH = "/volatile/geoip"
GEOIP_LIBRARY_PATH = "/volatile/geoip/libGeoIP.so"

code

from django.contrib.gis.utils import GeoIP

# inside a view, where `request` is the incoming HttpRequest
ipaddress = request.META['REMOTE_ADDR']
g = GeoIP()
g.city(ipaddress)
# returns a dict such as:
#   {'area_code': 609,
#    'city': 'Absecon',
#    'country_code': 'US',
#    'country_code3': 'USA',
#    'dma_code': 504,
#    'latitude': 39.420898,
#    'longitude': -74.497703,
#    'postal_code': '08201',
#    'region': 'NJ'}
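
As a rough sketch of the next step, the record above can be reduced to the country/state pair that the rest of this post works with. The helper name `location_key` is hypothetical, and the sketch assumes a failed lookup yields `None` and that the `region` field is only meaningful as a state code for US addresses.

```python
def location_key(record):
    """Reduce a GeoIP city record to (country_code, us_state_or_None).

    `record` is assumed to be a dict shaped like the g.city() output,
    or None when the IP address could not be resolved.
    """
    if not record:
        return (None, None)
    country = record.get('country_code')
    # Only treat `region` as a state code for US addresses.
    state = record.get('region') if country == 'US' else None
    return (country, state or None)


record = {'country_code': 'US', 'region': 'NJ', 'city': 'Absecon'}
key = location_key(record)
```

Keeping this step separate from the lookup itself means the index code never has to know about the GeoIP record layout.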
 

Part 2: Indexing

This part was a little more complicated.
 
Now that we have the server's location, we can look up the closest region. The problem is creating an index of each and every country in the world, as well as each US state - and associating them with their closest Amazon region.
 
Creating the index manually would have been painstaking, boring and error-prone - so I devised a simple automated solution:
  • Generate a mapping of country and state codes with their coordinates (latitude and longitude).
  • Generate a reference map of the server farms' coordinates.
  • Using a simple distance-based calculation, determine the closest region to each country/state, and finally output the index files.
I was also planning on incorporating data about internet connection speeds and trunk lines between countries, and adding weight to the associations, but decided that was overkill.
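
The steps above can be sketched roughly as follows. The post does not say which distance calculation was first used, so this sketch assumes the haversine (great-circle) formula mentioned in the update further down. The region coordinates are approximate values chosen for illustration, and the names `REGIONS`, `haversine_km` and `closest_region` are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

# Approximate (latitude, longitude) per region; assumed for illustration.
REGIONS = {
    'us-west-1': (37.35, -121.96),      # Northern California
    'us-east-1': (38.95, -77.45),       # Northern Virginia
    'eu-west-1': (53.35, -6.26),        # Ireland
    'ap-southeast-1': (1.35, 103.82),   # Singapore
}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against floating point rounding before asin().
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(h)))


def closest_region(point, regions=REGIONS):
    """Return the name of the region nearest to `point`."""
    return min(regions, key=lambda name: haversine_km(point, regions[name]))


# Example: the Absecon, NJ coordinates from the GeoIP record in Part 1.
nearest = closest_region((39.420898, -74.497703))
```

Running `closest_region` once per country and state centroid would yield the entries for the index files.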
 
We are making the indexes available for public use (countries.index, unitedstates.index).
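
For illustration, here is a minimal sketch of how a lookup against the two indexes might work once they are loaded. The on-disk format of the index files is not shown in this post, so the sketch uses small in-memory dicts with made-up sample entries. The names `COUNTRY_INDEX`, `US_STATE_INDEX`, `DEFAULT_REGION` and `lookup_region` are hypothetical.

```python
# Hypothetical sample entries; the real index files cover every
# country and US state.
COUNTRY_INDEX = {
    'US': 'us-east-1',
    'IE': 'eu-west-1',
    'SG': 'ap-southeast-1',
}
US_STATE_INDEX = {
    'CA': 'us-west-1',
    'NJ': 'us-east-1',
}
# Fallback when the location is unknown; this choice is an assumption.
DEFAULT_REGION = 'us-east-1'


def lookup_region(country_code, state_code=None):
    """Prefer the per-state entry for US addresses, else use the country."""
    if country_code == 'US' and state_code in US_STATE_INDEX:
        return US_STATE_INDEX[state_code]
    return COUNTRY_INDEX.get(country_code, DEFAULT_REGION)


region = lookup_region('US', 'CA')
```

Falling back to a default keeps backups working even when GeoIP cannot resolve the server's address.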
 
More importantly, we need your help to tweak the indexes - as you have better knowledge and experience of your connection latency and speed. Please let us know if you think we should associate your country/state with a different Amazon region.
 
[update] We updated the indexes to include the new AWS regions (Oregon, Sao Paulo, Tokyo), tweaked automatic association to use the haversine formula, and added overrides based on underwater internet cables. Lastly, we've open sourced the whole project on GitHub (check out the live map mashup).
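
One simple way to layer manual overrides on top of the automatically computed associations is sketched below. This is an assumption about the approach, not the project's actual code. The function name `apply_overrides` and the sample entries are hypothetical, and the Australia override only mirrors the kind of case a reader describes in the comments.

```python
def apply_overrides(computed, overrides):
    """Return a new index in which manual overrides win over computed entries."""
    merged = dict(computed)   # copy, so the computed index is left untouched
    merged.update(overrides)
    return merged


# Illustrative values only, not taken from the published indexes.
computed = {'AU': 'ap-southeast-1', 'NZ': 'ap-southeast-1'}
overrides = {'AU': 'us-west-1'}  # e.g. a better cable route despite the distance

index = apply_overrides(computed, overrides)
```

Keeping the overrides in their own mapping means the computed index can be regenerated whenever regions are added without losing the manual tweaks.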

Comments

Alon Swartz's picture

I thought I'd post this as reference, seems to correlate quite nicely with the generated indexes.

Source: Greg's cable map

Jeremy Davis's picture

Interesting to have a look at where they all go. Thanks for posting Alon!

Jeremy Davis's picture

But does that mean that TKLBAM will only do remote backup/migration to Amazon EC2? Will there be some facility to specify your own remote location? Or even a local location (server or even HDD/etc)?

I note that you mention more details to come, so pls feel free to ignore this and make me wait :)

Alon Swartz's picture

I won't ignore you completely, but I will make you wait for the full details. The default storage backend will be Amazon S3, as it provides unlimited, reliable, secure and inexpensive storage.

Did I mention inexpensive? With all our testing we have yet to break the $0.01 barrier. TKLBAM only stores the delta of changes, and performs incremental backups, so space isn't really an issue (at least for most use cases).

It will be possible to use alternative storage backends, but they won't be officially supported. There are a couple of caveats though, so the user experience won't be as smooth as using the default backend, but it will be possible.

Detailed information will be posted soon, really soon...

Alon Swartz's picture

This comment is long overdue, but better late than never. I'd like to thank ReadWriteWeb and Audrey Watters for their article entitled: A map to your nearest data center. Hopefully with enough eyeballs we'll be able to tweak the indexes to account for latency - not just location.

Fast connections are always better. Reminds me of a T-Shirt I saw a couple of years back saying "I'll work for bandwidth".

Alon Swartz's picture

We updated the indexes to include the new AWS regions (Oregon, Sao Paulo, Tokyo), tweaked automatic association to use the haversine formula, and added overrides based on underwater internet cables.

Lastly, we've open sourced the whole project on GitHub (check out the live map mashup).

Jeremy Davis's picture

Prior to AWS opening a datacentre in Sydney, my closest one was Singapore (~6500km/~4000miles) but as it turns out, due to latency on that pipe it was much better to use California (about twice as far as the crow flies... ~12800km/~8000miles - let alone distance via the pipe!)

There was a pretty cool map linked to above by Alon. Have a look! :)

But who knows maybe you're right... Perhaps do some research and testing and see...

If you have an AWS datacentre closer (and better connection) than Tokyo, then perhaps restore your appliance to an AWS instance in that location, then run another TKLBAM backup and restore it back to you (assuming that you're running it locally). It's good to test your backups regularly anyway!
