TurnKey Linux Virtual Appliance Library

Finding the closest data center using GeoIP and indexing

We are about to release the TurnKey Linux Backup and Migration (TKLBAM) mechanism, which aims to be the simplest way yet to back up a TurnKey appliance across all deployments (VM, bare-metal, Amazon EC2, etc.). It also lets you restore a backup anywhere, which essentially amounts to appliance migration or upgrade.

Note: We'll be posting more details really soon. In this post I just want to share an interesting issue we solved recently.

Backups need to be stored somewhere - preferably somewhere that provides unlimited, reliable, secure and inexpensive storage. After exploring the available options, we decided on Amazon S3 for TKLBAM's storage backend.
 

The problem

Amazon has 4 data centers, called regions, spanning the world: Northern California (us-west-1), Northern Virginia (us-east-1), Ireland (eu-west-1) and Singapore (ap-southeast-1).
 
The problem: Which region should be used to store a server's backups, and how should it be determined?
 
One option was to require the user to specify the region during backup, but we quickly decided against polluting the user interface with potentially confusing options, and opted for a solution that automatically determines the best region.
 

The solution

The map below plots the countries/states with their associated Amazon region:
 
Generated automatically using Google Maps API from the indexes.
 
The solution: Determine the location of the server, then look up the closest Amazon region to the server's location.
 

Part 1: GeoIP

This was the easy part. The TurnKey Hub is developed using Django which ships with GeoIP support in contrib. Within a few minutes of being totally new to geo-location I had part 1 up and running.
 
When TKLBAM is initialized and a backup is initiated, the Hub is contacted to get authentication credentials and the S3 address for backup. The Hub performs a lookup on the IP address and determines the country/state.
 
In a nutshell, adding GeoIP support to your Django app is simple: Install MaxMind's C library and download the appropriate dataset. Then, once you update your settings.py file, you're all set.
 
settings.py

GEOIP_PATH = "/volatile/geoip"
GEOIP_LIBRARY_PATH = "/volatile/geoip/libGeoIP.so"

code

from django.contrib.gis.utils import GeoIP

# IP address of the client, as seen by the Django view
ipaddress = request.META['REMOTE_ADDR']
g = GeoIP()
location = g.city(ipaddress)
# location is a dict such as:
#   {'area_code': 609,
#    'city': 'Absecon',
#    'country_code': 'US',
#    'country_code3': 'USA',
#    'dma_code': 504,
#    'latitude': 39.420898,
#    'longitude': -74.497703,
#    'postal_code': '08201',
#    'region': 'NJ'}
 

Part 2: Indexing

This part was a little more complicated.
 
Now that we have the server's location, we can look up the closest region. The problem is creating an index of each and every country in the world, as well as each US state, and associating them with their closest Amazon region.
 
Creating the index manually would have been really painstaking, boring and error-prone, so I devised a simple automated solution:
  • Generate a mapping of country and state codes to their coordinates (latitude and longitude).
  • Generate a reference map of the server farms' coordinates.
  • Using a simple distance-based calculation, determine the closest region to each country/state, and finally output the index files.
I was also planning on incorporating data about internet connection speeds and trunk lines between countries, and weighting the associations accordingly, but decided that was overkill.
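
The three steps above can be sketched roughly as follows. This is only an illustration, not the actual generator: the region coordinates are approximate values I picked for the example, the function names are made up, and the "simple distance" here is an equirectangular approximation standing in for whatever calculation you prefer.

```python
import math

# Approximate (latitude, longitude) of each region; illustrative values only.
REGIONS = {
    "us-west-1": (37.4, -122.0),       # Northern California
    "us-east-1": (39.0, -77.5),        # Northern Virginia
    "eu-west-1": (53.3, -6.3),         # Ireland
    "ap-southeast-1": (1.35, 103.8),   # Singapore
}

EARTH_RADIUS_KM = 6371.0


def simple_distance_km(a, b):
    """Equirectangular approximation of the distance between two points."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    # Wrap the longitude difference into [-pi, pi) so we go the short way round.
    dlon = (lon2 - lon1 + math.pi) % (2 * math.pi) - math.pi
    x = dlon * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return EARTH_RADIUS_KM * math.hypot(x, y)


def closest_region(coords):
    """Return the name of the region nearest to coords (lat, lon)."""
    return min(REGIONS, key=lambda name: simple_distance_km(coords, REGIONS[name]))


def build_index(locations):
    """Map each country/state code to its closest region."""
    return {code: closest_region(coords) for code, coords in locations.items()}


# Hypothetical input: a few codes with rough coordinates.
sample = {"NJ": (39.42, -74.50), "FR": (48.86, 2.35), "AU": (-33.87, 151.21)}
index = build_index(sample)
```

Keeping the distance function separate from the lookup makes it easy to swap in a more accurate formula later without touching the rest.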
 
We are making the indexes available for public use (countries.index, unitedstates.index).
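
To give an idea of how such indexes might be consumed, here is a minimal sketch of the lookup step. The in-memory dictionaries, the fallback region and the function name are all assumptions made for the example (the real index file format is not shown in this post); the input is a dict shaped like the one returned by GeoIP's city() call.

```python
# Hypothetical excerpts of the two indexes, loaded into plain dicts.
COUNTRY_INDEX = {"IE": "eu-west-1", "SG": "ap-southeast-1", "US": "us-east-1"}
US_STATE_INDEX = {"CA": "us-west-1", "NJ": "us-east-1"}

# Assumed fallback when the location is unknown or not in the index.
DEFAULT_REGION = "us-east-1"


def preferred_region(location):
    """Return the preferred region for a GeoIP city() result (or None)."""
    if not location:
        return DEFAULT_REGION

    country = location.get("country_code")

    # US servers are resolved per state; other countries per country code.
    if country == "US":
        state = location.get("region")
        if state in US_STATE_INDEX:
            return US_STATE_INDEX[state]

    return COUNTRY_INDEX.get(country, DEFAULT_REGION)


region = preferred_region({"country_code": "US", "region": "CA"})
```

Because the indexes are just static lookups, changing the association for a country or state is a one-line edit and needs no recalculation.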
 
More importantly, we need your help to tweak the indexes - as you have better knowledge and experience on your connection latency and speed. Please let us know if you think we should associate your country/state to a different Amazon region.
 
[update] We updated the indexes to include the new AWS regions (Oregon, Sao Paulo, Tokyo), tweaked the automatic association to use the haversine formula, and added overrides based on underwater internet cables. Lastly, we've open sourced the whole project on GitHub (check out the live map mashup).
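
For reference, the haversine formula gives the great-circle distance between two points on a sphere. The sketch below is a generic implementation, not code lifted from the project; the function name and the mean Earth radius of 6371 km are my own choices.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)

    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating point rounding near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# A quarter of the way around the equator.
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
```

Unlike a flat approximation, this stays accurate over long distances and across the 180 degree meridian, which matters when comparing regions on opposite sides of the Pacific.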

Comments

Alon Swartz's picture

Map of worldwide underwater cables

I thought I'd post this as reference, seems to correlate quite nicely with the generated indexes.

Source: Greg's cable map

Jeremy's picture

That is seriously cool!

Interesting to have a look at where they all go. Thanks for posting Alon!

Jeremy's picture

Australia connecting to Asia's Amazon servers makes sense.

But does that mean that TKLBAM will only do remote backup/migration to Amazon EC2? Will there be some facility to specify your own remote location? Or even a local location (server or even HDD/etc)?

I note that you mention more details to come, so pls feel free to ignore this and make me wait :)

Alon Swartz's picture

The default storage backend will be Amazon S3, but...

I won't ignore you completely, but I will make you wait for the full details. The default storage backend will be Amazon S3, as it provides unlimited, reliable, secure and inexpensive storage.

Did I mention inexpensive? With all our testing we have yet to break the $0.01 barrier. TKLBAM only stores the delta of changes, and performs incremental backups, so space isn't really an issue (at least for most use cases).

It will be possible to use alternative storage backends, but they won't be officially supported. There are a couple of caveats, so the user experience won't be as smooth as with the default backend, but it will be possible.

Detailed information will be posted soon, really soon...

Is GeoIP going to give you enough information?

This seems interesting but GeoIP data isn't always that accurate. It seems like a better way of going about this would be to measure actual latency and bandwidth from each location in the same country or a close country. I would imagine using ping and downloading a reasonably large file would do the trick.

Alon Swartz's picture

Clarification

I totally agree that relying solely on geographic location is not adequate, so let me clarify the implementation details.

In production, we use GeoIP to determine the country/state of the server in question, and then perform a lookup in the generated indexes to determine the preferred region. I say preferred (and not closest) because the indexes are static, and are not calculated on the fly. This was a design decision to allow us to tweak the indexes with the help of community feedback.

The location based calculation described above was used to generate the baseline indexes to provide a relatively good starting point. One of the reasons for writing this blog post and publishing the indexes, was to take us to the next phase of tweaking the indexes, hence the closing paragraph:

we need your help to tweak the indexes - as you have better knowledge and experience on your connection latency and speed. Please let us know if you think we should associate your country/state to a different Amazon region.

The countries and US indexes consist of 249 and 62 entries, respectively. We don't have the resources to perform latency testing in each and every location, for that we need your help.

locations and distances

I did some research that may be relevant to this discussion, measuring distances to cloud providers. We could apply the same approach to this, but I need a little time to figure this out.

Have a look at the 'cloud encounters' on slideshare.

http://www.slideshare.net/pveijk/cloud-encounters-sept-2009-for-cmg-dec-6

 

any ideas?

Alon Swartz's picture

Write up on ReadWriteWeb

This comment is long overdue, but better late than never. I'd like to thank ReadWriteWeb and Audrey Watters for their article entitled: A map to your nearest data center. Hopefully with enough eyeballs we'll be able to tweak the indexes to account for latency - not just location.

Fast connections are always better. Reminds me of a T-Shirt I saw a couple of years back saying "I'll work for bandwidth".

Alon Swartz's picture

Major update, open sourced project

We updated the indexes to include the new AWS regions (Oregon, Sao Paulo, Tokyo), tweaked the automatic association to use the haversine formula, and added overrides based on underwater internet cables.

Lastly, we've open sourced the whole project on GitHub (check out the live map mashup).

Hi I just want the

Hi I just want the hostnames.. The map does nothing as so many ISPs have crazy routing

wrong location

Even after updating my IP-location at Maxmind (https://www.maxmind.com/en/correction), the backup is stored in Tokyo, which is 9300 miles away from the server. That can't be efficient....

Jeremy's picture

I guess... Depends where you are and what the pipes are like...

Prior to AWS opening a datacentre in Sydney, my closest one was Singapore (~6500km/~4000miles) but as it turns out, due to latency on that pipe it was much better to use California (about twice as far as the crow flies... ~12800km/~8000miles - let alone distance via the pipe!)

There was a pretty cool map linked to above by Alon. Have a look! :)

But who knows maybe you're right... Perhaps do some research and testing and see...

If you have an AWS datacentre closer (and better connection) than Tokyo, then perhaps restore your appliance to an AWS instance in that location, then run another TKLBAM backup and restore it back to you (assuming that you're running it locally). It's good to test your backups regularly anyway!
