TurnKey Linux Virtual Appliance Library

Exploring S3 based filesystems S3FS and S3Backer

In the last couple of days I've been researching Amazon S3 based filesystems, to figure out whether we could integrate one into an easy-to-use backup solution for TurnKey Linux appliances.

Note that S3 could only be a part of the solution. It wouldn't be a good idea to rely exclusively on S3 based automatic backups because of the problematic security architecture it creates. If an attacker compromises your server, he can easily compromise and subvert or destroy any S3 based automatic backups. That's bad news.

S3 performance, limitations and costs

S3 performance

  • S3 itself is faster than I realized. I've been fully saturating our server's network connection and uploading/downloading objects to S3 at 10MBytes/s.

  • Each S3 transaction comes with fixed overhead of 200ms for writes and about 350ms for reads.

    This means you can only access about 3 objects a second sequentially, which will of course massively impact your data throughput (e.g., if you read many 1-byte objects sequentially you'll get 3 bytes a second).
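To put numbers on that, here is a rough back-of-the-envelope sketch. It assumes each sequential request costs the fixed latency plus the time to move the payload at full line speed. The function name and the simple additive model are my own illustration, not a measurement.

```python
def sequential_throughput(object_bytes, latency_s, bandwidth_bytes_per_s):
    """Bytes per second when objects are fetched strictly one after another."""
    time_per_object = latency_s + object_bytes / bandwidth_bytes_per_s
    return object_bytes / time_per_object


READ_LATENCY = 0.350        # seconds per read, from the figures above
BANDWIDTH = 10 * 1000 ** 2  # 10 MBytes/s, from the figures above

# Small objects are dominated by latency, large ones by bandwidth.
for size in (1, 4 * 1024, 1024 ** 2, 100 * 1024 ** 2):
    rate = sequential_throughput(size, READ_LATENCY, BANDWIDTH)
    print(f"{size:>10} byte objects: {rate:,.0f} bytes/s")
```

Under this model the fixed overhead only stops mattering once objects reach tens of megabytes.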

S3 performance variability

S3 is usually very fast, but it's based on a complex distributed storage network behind the scenes that is known to vary in its behavior and performance characteristics.

Use it long enough and you will come across requests that take 10 seconds to complete instead of 300ms. Point is, you can't rely on the average behavior ALWAYS happening.

S3 limitations

  • Objects can contain a maximum of 5GB.
  • You can't update part of an object. If you want to update 1 byte in a 1GB object you'll have to reupload the entire GB.

S3 costs

Three components:

  1. Storage: $0.15 per GB per month
  2. Data transfer: $0.10 per GB in, $0.17 per GB out
  3. Requests: $0.01 per 1,000 PUT requests, $0.01 per 10,000 GET and other requests

A word of caution, some people using S3 based filesystems have made the mistake of focusing on just the storage costs and forgotten about other expenses, especially requests, which look so inexpensive.

You need to watch out for that: with one filesystem under its default configuration (4KB blocks), storing 50GB of data costs about $130 in PUT request fees alone, more than 17X the monthly storage fees!
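The arithmetic behind that figure can be checked with a short sketch. It assumes 1GB = 2^30 bytes and that every block is written exactly once. The function name is mine, and the prices are the ones listed above.

```python
STORAGE_PER_GB_MONTH = 0.15  # dollars, from the price list above
PUT_PER_1000 = 0.01          # dollars per 1,000 PUT requests

GB = 1024 ** 3


def put_fees(data_bytes, block_bytes):
    """Dollars spent on PUT requests if each block is uploaded exactly once."""
    blocks = -(-data_bytes // block_bytes)  # ceiling division
    return blocks * PUT_PER_1000 / 1000


storage = 50 * STORAGE_PER_GB_MONTH
for block_kb in (4, 64, 256, 1024):
    fees = put_fees(50 * GB, block_kb * 1024)
    print(f"{block_kb:>5}KB blocks: ${fees:8.2f} in PUT fees "
          f"({fees / storage:.1f}x one month of storage)")
```

Rewriting blocks costs further PUT requests, so on a live filesystem the real figure can only be higher.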

Filesystems

s3fs

Which s3fs?

It's a bit confusing but there are two working projects competing for the name "s3fs", both based on FUSE.

One is implemented in C++, last release Aug 2008:

http://code.google.com/p/s3fs/wiki/FuseOverAmazon

Another implemented in Python, last release May 2008:

https://fedorahosted.org/s3fs/

I've only tried the C++ project, which is better known and more widely used (e.g., the Python project comes with warnings regarding data loss) so when I say s3fs I mean the C++ project on Google Code.

Description

s3fs is a direct mapping of S3 to a filesystem paradigm. Files are mapped to objects. Filesystem metadata (e.g., ownership and file modes) is stored inside the object's metadata. Filenames are keys, with "/" as the delimiter to make listing more efficient, etc.

That's significant because it means there is nothing terribly magical about a bucket being read/written to by s3fs, and in fact you can mount any bucket with s3fs to explore it as a filesystem.
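To illustrate how a flat key namespace can be presented as directories, here is a minimal in-memory sketch of a prefix and delimiter listing. It does not talk to S3 and it is not s3fs's actual code. The function name and the sample keys are made up for illustration, and how s3fs itself decides what to show as a directory may differ.

```python
def list_dir(keys, prefix="", delimiter="/"):
    """Emulate a delimiter-based listing: (files, subdirs) directly under prefix."""
    files, subdirs = set(), set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if not rest:
            continue  # the key is the prefix itself
        head, sep, _ = rest.partition(delimiter)
        if sep:
            subdirs.add(head)  # something deeper lives under head/
        else:
            files.add(head)    # a plain object at this level
    return sorted(files), sorted(subdirs)


# Hypothetical bucket contents
bucket_keys = ["readme.txt", "photos/cat.jpg", "photos/2010/dog.jpg", "logs/app.log"]

print(list_dir(bucket_keys))             # top level
print(list_dir(bucket_keys, "photos/"))  # inside photos/
```

The point of the delimiter is that a listing only has to return one level at a time rather than every key in the bucket.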

s3fs's main advantage is its simplicity. There are however a few gotchas:

  • If you're using s3fs to access a bucket it didn't create, and the bucket has objects with directory-like components in their names (e.g., mypath/myfile), you'll need to create a dummy directory in order to see them (e.g., mkdir mypath).

  • The project seems to be "regretware". The last open source release was in August 2008. Since then the author seems to have continued all development of new features (e.g., encryption, compression, multi-user access) under a commercial license (subcloud), and with that inherent conflict of interest the future of the GPL-licensed open source version is uncertain.

    In fact a few of the unresolved bugs (e.g., deep directory renames) in the open source version have been long fixed in the proprietary version.

  • No embedded documentation. Probably another side-effect of the proprietary version, though the available options are documented on the web site.

  • Inherits S3's limitations: no file can be over 5GB, and you can't partially update a file so changing a single byte will re-upload the entire file.

  • Inherits S3's performance characteristics: operations on many small files are very efficient (each is a separate S3 object after all).

  • Though S3 supports partial/chunked downloads, s3fs doesn't take advantage of this so if you want to read just one byte of a 1GB file, you'll have to download the entire GB.

    OTOH, s3fs supports a disk cache, which can be used to mitigate this limitation.

  • Watch out, the ACL for objects/files you update/write to will be reset to s3fs's global ACL (e.g., by default "private"). So if you rely on a richer ACL configuration for objects in your bucket you'll want to access your S3FS bucket in read-only mode.

  • By default, s3fs doesn't use SSL, but you can get that to work by using the -o url option to specify https://s3.amazonaws.com/ instead of http://s3.amazonaws.com/

    It's not documented very well of course. Cough. Proprietary version. Cough.

S3Backer

S3Backer is a true open source project under active development, which has a very clever design.

It is also based on FUSE, but instead of implementing a usable filesystem on top of S3 it implements a virtual loopback device on top of S3:

mountpoint/
    file       # (e.g., can be used as a virtual loopback)
    stats      # human readable statistics

Except for this simple virtual filesystem S3Backer doesn't know anything about filesystems itself. It just maps that one virtual file to a series of dynamically allocated blocks inside S3.
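As a rough sketch of what mapping one virtual file to blocks involves, the following shows which blocks a byte range touches. The function names and the key naming scheme (a prefix followed by a zero-padded hex block number) are hypothetical, chosen for illustration, and are not taken from S3Backer's source.

```python
def blocks_for_range(offset, length, block_size):
    """Block numbers touched by a read or write of `length` bytes at `offset`."""
    if length <= 0:
        return []
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))


def block_key(block_number, prefix=""):
    # Hypothetical naming scheme: prefix plus zero-padded hex block number.
    return f"{prefix}{block_number:08x}"


# A 2-byte write straddling a 4K block boundary touches two blocks,
# so two whole objects would have to be rewritten.
for n in blocks_for_range(4095, 2, 4096):
    print(block_key(n, prefix="dev1/"))
```

Blocks that have never been written need no object at all, which is what makes dynamic allocation possible.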

The advantage of this approach is that it allows you to leverage well tested code built into the kernel to take care of the higher level business of storing files. For all intents and purposes, it's just a special block device (e.g., use any filesystem, LVM, software RAID, kernel encryption, etc.).

In practice it seems to work extremely well, thanks to a few clever performance optimizations. For example:

  • In-memory block cache: so rereads of the same unchanged block don't have to go across the network if it's still cached.
  • Delayed, multi-threaded write queue: dirty blocks aren't written immediately to S3 because that can be very inefficient (e.g., the smallest operation would update the entire block). Instead changes seem to be accumulated for a couple of seconds and then written out in parallel to S3.
  • Read-ahead algorithm: will detect and try to predict sequential block reads so the data is available in your cache before you actually ask for it.
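The first of these optimizations can be sketched in a few lines. This is only an illustration of the idea of a bounded in-memory block cache. The class name, the least-recently-used eviction policy and the fake fetch function are my assumptions, not S3Backer's actual implementation.

```python
from collections import OrderedDict


class BlockCache:
    """Bounded read cache that evicts the least recently used block (an assumption)."""

    def __init__(self, fetch, max_blocks=1000):
        self.fetch = fetch            # called on a miss, e.g. a network GET
        self.max_blocks = max_blocks
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, block_number):
        if block_number in self.blocks:
            self.blocks.move_to_end(block_number)  # mark as recently used
            self.hits += 1
            return self.blocks[block_number]
        self.misses += 1
        data = self.fetch(block_number)
        self.blocks[block_number] = data
        if len(self.blocks) > self.max_blocks:
            self.blocks.popitem(last=False)        # drop the oldest entry
        return data


def fake_fetch(block_number):
    # Stand-in for a slow S3 request; returns 4 bytes derived from the number.
    return block_number.to_bytes(4, "big")


cache = BlockCache(fake_fetch, max_blocks=2)
for n in (0, 1, 0, 2, 1):
    cache.read(n)
print(f"hits={cache.hits} misses={cache.misses}")
```

Every hit here is a request that never has to pay the per-transaction overhead discussed earlier.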

Bottom line, under the right configuration S3Backer works well enough that it can easily saturate our Linode's 100Mbit network connection. Impressive.

In my testing with Reiserfs I found the performance good enough so that it would be conceivable to use it as an extension of working storage (e.g., as an EBS alternative for a system outside of EC2).

There are a few gotchas however:

  • High risk of data corruption, due to the delayed writes (e.g., if your system or the connection to AWS fails). Journaling doesn't help because as far as the filesystem is concerned the blocks have already been written (i.e., to S3Backer's cache).

    In other words, I wouldn't use this as a backup drive. You can reduce the risk by turning off some of the performance optimizations to minimize the amount of data in limbo waiting to be written to Amazon.

  • Block sizes that are too small (e.g., the 4K default) can add significant request costs (e.g., about $130 in PUT request fees to store 50GB with 4K blocks).

  • Block sizes that are too large can add significant data transfer and storage fees.

  • Memory usage can be prohibitive: by default it caches 1000 blocks. With the default 4K block size that's not an issue but most users will probably want to increase block size.

    So watch out: 1000 x 256KB blocks = 256MB.

    You can adjust the amount of cached blocks to control memory usage.
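A quick way to size the cache, sketched under the assumption that cache memory is simply the number of cached blocks times the block size (any per-block bookkeeping overhead is ignored). The function names are mine.

```python
def cache_memory_bytes(cached_blocks, block_bytes):
    """Approximate memory held by the block cache."""
    return cached_blocks * block_bytes


def max_cached_blocks(memory_budget_bytes, block_bytes):
    """Largest block count that fits in a given memory budget."""
    return memory_budget_bytes // block_bytes


MIB = 1024 ** 2

# Memory used by the default of 1000 cached blocks at various block sizes.
for block_kb in (4, 64, 256, 1024):
    mem = cache_memory_bytes(1000, block_kb * 1024)
    print(f"{block_kb:>5}KB blocks x 1000 cached = {mem / MIB:.0f} MiB")

# How many 256KB blocks fit if you only want to spend 64 MiB on the cache?
print(max_cached_blocks(64 * MIB, 256 * 1024))
```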

Future versions of S3Backer will probably include disk based caching which will mitigate data corruption and memory usage issues.

Tips:

  • Use Reiserfs. It can store multiple small files in the same blocks.

    It also doesn't populate the filesystem with too many empty blocks when the filesystem is created which makes filesystem creation faster and more efficient.

    Also, with Reiserfs I tested expanding and shrinking of the filesystem (after increasing/decreasing the size of the virtual block device) and it seemed to work just fine.

  • Supports storing multiple block devices in the same bucket using prefixes.

Conclusions

  • s3fs: safe, efficient storage of medium-large files. Perfect for backup / archiving purposes.
  • S3Backer: high performing live storage on top of S3. EBS alternative outside of EC2. Not safe for backups at this stage.

Comments

It seems to me that S3 is a

It seems to me that S3 is a great option if you are dealing with small file sizes.  The real hassle has to do with users changing the block size to a larger number ... if that's all it takes, I'm in!  I look forward to the future versions of S3Backer that include disk based caching.

cannot view directories already in the S3 bucket

Hi,

This is a really good article explaining s3fs and comparing it with s3backer. I have a task to complete: I have an S3 bucket and need to sync all the data from the bucket to an EBS volume, but when I mount the bucket on the Linux instance I am not able to view the data in it. I wonder why the contents are visible when we check using S3Fox but the mount point shows up empty when we run ls. Can you please help me solve this issue?

there has to be 1 file per

there has to be 1 file per directory if you want s3fs to understand your folder hierarchy. for instance:

consider key: bucketA/folder/x

You have to create an object called bucketA/folder

and the object

bucketA/folder/x

S3 itself doesn't understand the concept of folders, it's just keys and storage. Hierarchical layout is just a guise on top of lexicographical object listing.

Folders

If you use PyFileSystem http://pythonhosted.org/fs/s3fs.html you can use folders with the prefix attribute: prefix='/user/img'

High performance?

Greetings.

After reading your post I have been trying out s3backer on a CentOS machine - with abysmal results.

The  results of a test (dd'ing files to an xfs filesystem on top of s3backer, 128kB blocks) are appended at the end. Basically write speed dwindles to a trickle as the test progresses, and the log fills with "PUT timeouts" and "rec'd 500" errors. An earlier attempt with rsync failed in a similar way. I was wondering how could our experiences diverge so dramatically...

Any ideas?

Thanks & cheers.

10485760 bytes (10 MB) copied, 0.377319 seconds, 27.8 MB/s
20971520 bytes (21 MB) copied, 0.43774 seconds, 47.9 MB/s
31457280 bytes (31 MB) copied, 0.873447 seconds, 36.0 MB/s
52428800 bytes (52 MB) copied, 0.516218 seconds, 102 MB/s
104857600 bytes (105 MB) copied, 0.461421 seconds, 227 MB/s
10485760 bytes (10 MB) copied, 6.07941 seconds, 1.7 MB/s
20971520 bytes (21 MB) copied, 7.50843 seconds, 2.8 MB/s
31457280 bytes (31 MB) copied, 10.6714 seconds, 2.9 MB/s
52428800 bytes (52 MB) copied, 20.5768 seconds, 2.5 MB/s
104857600 bytes (105 MB) copied, 151.619 seconds, 692 kB/s
10485760 bytes (10 MB) copied, 46.6603 seconds, 225 kB/s
20971520 bytes (21 MB) copied, 36.3313 seconds, 577 kB/s
31457280 bytes (31 MB) copied, 91.5108 seconds, 344 kB/s
52428800 bytes (52 MB) copied, 156.001 seconds, 336 kB/s
104857600 bytes (105 MB) copied, 362.794 seconds, 289 kB/s
10485760 bytes (10 MB) copied, 45.0143 seconds, 233 kB/s
20971520 bytes (21 MB) copied, 82.4619 seconds, 254 kB/s
31457280 bytes (31 MB) copied, 110.109 seconds, 286 kB/s
52428800 bytes (52 MB) copied, 178.252 seconds, 294 kB/s
 

WHAT YEAR??

What YEAR was this post written? It is next to meaningless if it was written 4 years ago and all these projects or whatever have moved on since then.

 

thanks

Agreed, how old is this?

I was just thinking the same thing.  This has to be multiple years based on the order of the dates.

December, January, October, August....

The site theme needs to be updated. The date is worthless without the year.

Posted by Liraz Siri on 7 Apr 2010 - 16:49

If you use the search form you'll see when they wrote the article: Posted by Liraz Siri on 7 Apr 2010 - 16:49

S3FS Memory leaks

s3fs has a big problem: memory leaks! If you use it in a fairly loaded production environment you will notice that after some time it uses too much memory, and its usage just grows and grows.
