In the last couple of days I've been researching Amazon S3 based filesystems, to figure out if maybe we could integrate that into an easy to use backup solution for TurnKey Linux appliances.
Note that S3 could only be a part of the solution. It wouldn't be a good idea to rely exclusively on S3 based automatic backups because of the problematic security architecture it creates. If an attacker compromises your server, he can easily compromise and subvert or destroy any S3 based automatic backups. That's bad news.
S3 performance, limitations and costs
S3 itself is faster than I realized. I've been fully saturating our server's network connection and uploading/downloading objects to S3 at 10MBytes/s.
Each S3 transaction comes with fixed overhead of 200ms for writes and about 350ms for reads.
This means you can only access about 3 objects a second sequentially, which will of course massively impact your data throughput (e.g., if you read many 1-byte objects sequentially you'll get about 3 bytes a second).
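To make the effect of the fixed overhead concrete, here is a minimal sketch of the arithmetic. It assumes a simple model in which every sequential request pays the fixed read overhead plus transfer time at the 10MBytes/s measured above; the function name and the sample object sizes are made up for illustration.

```python
def sequential_throughput(object_size_bytes, overhead_s, bandwidth_bytes_per_s):
    """Effective bytes/second when objects are fetched one after another.

    Simplified model (an assumption): each request costs a fixed overhead
    plus the time needed to move the payload at full bandwidth.
    """
    transfer_s = object_size_bytes / bandwidth_bytes_per_s
    return object_size_bytes / (overhead_s + transfer_s)


READ_OVERHEAD_S = 0.35       # per-request read overhead quoted in the text
BANDWIDTH = 10 * 1000 ** 2   # 10 MBytes/s, the rate measured in the text

# Illustrative object sizes: 1 byte, 4 KB, 1 MB, 100 MB
for size in (1, 4 * 1024, 1024 ** 2, 100 * 1024 ** 2):
    rate = sequential_throughput(size, READ_OVERHEAD_S, BANDWIDTH)
    print(f"{size:>11} byte objects: {rate:>12.1f} bytes/s")
```

Under this model small objects are dominated by the per-request overhead, while large objects get close to the raw network rate.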
S3 performance variability
S3 is usually very fast, but it's based on a complex distributed storage network behind the scenes that is known to vary in its behavior and performance characteristics.
Use it long enough and you will come across requests that take 10 seconds to complete instead of 300ms. Point is, you can't rely on the average behavior ALWAYS happening.
- Objects can contain a maximum of 5GB.
- You can't update part of an object. If you want to update 1 byte in a 1GB object you'll have to reupload the entire GB.
- Storage: $0.15 per GB per month
- Data transfer: $0.10 per GB in, $0.17 per GB out
- Requests: $0.01 per 1,000 PUT requests, $0.01 per 10,000 GET and other requests
A word of caution: some people using S3 based filesystems have made the mistake of focusing on just the storage costs and forgetting about other expenses, especially requests, which look so inexpensive.
You need to watch out for that because with one filesystem under its default configuration (4KB blocks), storing 50GB of data costs about $130 in PUT request fees alone, more than 17X the monthly storage fees!
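The request fee figure can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes every block is uploaded exactly once and that 1GB means 2^30 bytes; the prices are the ones listed above, and the function name is made up for illustration.

```python
PUT_PRICE_PER_1000 = 0.01          # dollars per 1,000 PUT requests (from the text)
STORAGE_PRICE_PER_GB_MONTH = 0.15  # dollars per GB per month (from the text)
GB = 1024 ** 3                     # assumption: binary gigabytes


def put_fees(data_bytes, block_bytes):
    """Dollars spent on PUT requests if each block is written exactly once."""
    blocks = -(-data_bytes // block_bytes)  # ceiling division
    return blocks * PUT_PRICE_PER_1000 / 1000


data_gb = 50
monthly_storage = data_gb * STORAGE_PRICE_PER_GB_MONTH
for block_kb in (4, 64, 256, 1024):
    fee = put_fees(data_gb * GB, block_kb * 1024)
    print(f"{block_kb:>5}KB blocks: ${fee:8.2f} in PUT fees "
          f"({fee / monthly_storage:5.2f}x one month of storage)")
```

Rewriting blocks adds further PUT requests, so real usage can cost more than this lower bound.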
It's a bit confusing but there are two working projects competing for the name "s3fs", both based on FUSE.
One is implemented in C++, last release Aug 2008:
Another implemented in Python, last release May 2008:
I've only tried the C++ project, which is better known and more widely used (e.g., the Python project comes with warnings regarding data loss) so when I say s3fs I mean the C++ project on Google Code.
s3fs is a direct mapping of S3 to a filesystem paradigm. Files are mapped to objects. Filesystem metadata (e.g., ownership and file modes) is stored inside the object's metadata. Filenames are keys, with "/" as the delimiter to make listing more efficient, etc.
That's significant because it means there is nothing terribly magical about a bucket being read/written to by s3fs, and in fact you can mount any bucket with s3fs to explore it as a filesystem.
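To illustrate why "/" works well as a delimiter, here is a small in-memory emulation of a prefix/delimiter listing, the general mechanism S3 offers for grouping keys. It is only a sketch of the idea, not s3fs's actual code, and the function name and sample keys are made up.

```python
def list_directory(keys, prefix="", delimiter="/"):
    """Emulate a prefix/delimiter listing over a plain list of keys.

    Returns (files, subdirs): keys directly under the prefix, and the
    first-level "directory" names found below it.
    """
    files, subdirs = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if not rest:
            continue  # the key is the prefix itself
        head, sep, _ = rest.partition(delimiter)
        if sep:
            subdirs.add(head + delimiter)  # deeper keys collapse into one entry
        else:
            files.append(rest)
    return sorted(files), sorted(subdirs)


# Hypothetical bucket contents
keys = ["notes.txt", "photos/a.jpg", "photos/b.jpg", "photos/2009/c.jpg"]
print(list_directory(keys))             # top level of the bucket
print(list_directory(keys, "photos/"))  # one level down
```

Each listing returns only one level of the hierarchy, no matter how many keys live deeper in the tree.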
s3fs's main advantage is its simplicity. There are however a few gotchas:
If you're using s3fs to access a bucket it didn't create, and that bucket has objects with directory-like components in their names (e.g., mypath/myfile), you'll need to create a dummy directory in order to see them (e.g., mkdir mypath).
The project seems to be "regretware". The last open source release was in August 2008. Since then the author seems to have continued all development of new features (e.g., encryption, compression, multi-user access) under a commercial license (subcloud), and with that inherent conflict of interest the future of the GPL-licensed open source version is uncertain.
In fact a few of the unresolved bugs (e.g., deep directory renames) in the open source version have long been fixed in the proprietary version.
No embedded documentation. Probably another side-effect of the proprietary version, though the available options are documented on the web site.
Inherits S3's limitations: no file can be over 5GB, and you can't partially update a file so changing a single byte will re-upload the entire file.
Inherits S3's performance characteristics: operations on many small files are very inefficient (each is a separate S3 object after all, so each one pays the per-request overhead).
Though S3 supports partial/chunked downloads, s3fs doesn't take advantage of this so if you want to read just one byte of a 1GB file, you'll have to download the entire GB.
OTOH, s3fs supports a disk cache, which can be used to mitigate this limitation.
Watch out, the ACL for objects/files you update/write to will be reset to s3fs's global ACL (e.g., by default "private"). So if you rely on a richer ACL configuration for objects in your bucket you'll want to access your S3FS bucket in read-only mode.
It's not documented very well of course. Cough. Proprietary version. Cough.
S3Backer is a true open source project under active development, which has a very clever design.
It is also based on FUSE, but instead of implementing a usable filesystem on top of S3 it implements a virtual loopback device on top of S3:
    mountpoint/
        file     # (e.g., can be used as a virtual loopback)
        stats    # human readable statistics
Except for this simple virtual filesystem S3Backer doesn't know anything about filesystems itself. It just maps that one virtual file to a series of dynamically allocated blocks inside S3.
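The mapping from one big virtual file to fixed-size blocks can be sketched as follows. This is an illustration of the general idea only: the key naming scheme (a prefix followed by a zero-padded hexadecimal block number) and both function names are assumptions, not taken from S3Backer's source.

```python
def blocks_for_range(offset, length, block_size):
    """Split a byte range of the virtual file into per-block spans.

    Returns a list of (block_number, start_in_block, end_in_block) tuples,
    where end_in_block is exclusive.
    """
    if length <= 0:
        return []
    first = offset // block_size
    last = (offset + length - 1) // block_size
    spans = []
    for block in range(first, last + 1):
        block_start = block * block_size
        start = max(offset, block_start) - block_start
        end = min(offset + length, block_start + block_size) - block_start
        spans.append((block, start, end))
    return spans


def block_key(prefix, block):
    """Hypothetical object key for a block: prefix + 8 hex digits."""
    return f"{prefix}{block:08x}"


# A 10-byte write straddling the boundary between blocks 0 and 1 (4KB blocks)
for block, start, end in blocks_for_range(4090, 10, 4096):
    print(block_key("disk1/", block), start, end)
```

A write that touches only a few bytes still dirties every block it overlaps, which is why block size matters so much for request and transfer costs.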
The advantage of this approach is that it allows you to leverage well tested code built into the kernel to take care of the higher level business of storing files. For all intents and purposes, it's just a special block device (e.g., use any filesystem, LVM, software RAID, kernel encryption, etc.).
In practice it seems to work extremely well, thanks to a few clever performance optimizations. For example:
- In-memory block cache: so rereads of the same unchanged block don't have to go across the network if it's still cached.
- Delayed, multi-threaded write queue: dirty blocks aren't written immediately to S3 because that can be very inefficient (e.g., the smallest operation would update the entire block). Instead changes seem to be accumulated for a couple of seconds and then written out in parallel to S3.
- Read-ahead algorithm: will detect and try to predict sequential block reads so the data is available in your cache before you actually ask for it.
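The first two optimizations can be illustrated with a toy write-back cache. This is a simplified, single-threaded sketch and not S3Backer's implementation: FakeS3 is a counting dictionary standing in for the network, and all class and method names are made up.

```python
from collections import OrderedDict


class FakeS3(dict):
    """Dictionary that counts reads and writes, standing in for S3."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.gets = 0
        self.puts = 0

    def __getitem__(self, key):
        self.gets += 1
        return super().__getitem__(key)

    def __setitem__(self, key, value):
        self.puts += 1
        super().__setitem__(key, value)


class BlockCache:
    """LRU block cache with delayed (write-back) flushing."""

    def __init__(self, backend, max_blocks):
        self.backend = backend
        self.max_blocks = max_blocks
        self.blocks = OrderedDict()  # block number -> data, oldest first
        self.dirty = set()           # blocks changed but not yet uploaded

    def read(self, n):
        if n in self.blocks:
            self.blocks.move_to_end(n)   # cache hit: no network round trip
            return self.blocks[n]
        data = self.backend[n]           # cache miss: fetch from the backend
        self._store(n, data)
        return data

    def write(self, n, data):
        self._store(n, data)
        self.dirty.add(n)                # upload is deferred until flush()

    def flush(self):
        for n in sorted(self.dirty):
            self.backend[n] = self.blocks[n]
        self.dirty.clear()

    def _store(self, n, data):
        self.blocks[n] = data
        self.blocks.move_to_end(n)
        while len(self.blocks) > self.max_blocks:
            old, old_data = self.blocks.popitem(last=False)
            if old in self.dirty:        # never drop unwritten data silently
                self.backend[old] = old_data
                self.dirty.discard(old)


s3 = FakeS3({0: b"a", 1: b"b"})
cache = BlockCache(s3, max_blocks=2)
cache.read(0)
cache.read(0)            # second read is served from memory
cache.write(1, b"x")
cache.write(1, b"y")     # overwrites the pending change, still no upload
cache.flush()
print(s3.gets, s3.puts)
```

Anything sitting in the dirty set when the process dies is lost, which is the trade-off a delayed write queue makes for speed.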
Bottom line, under the right configuration S3Backer works well enough that it can easily saturate our Linode's 100Mbit network connection. Impressive.
In my testing with Reiserfs I found the performance good enough so that it would be conceivable to use it as an extension of working storage (e.g., as an EBS alternative for a system outside of EC2).
There are a few gotchas however:
High risk of data corruption, due to the delayed writes (e.g., your system or the connection to AWS fails). Journaling doesn't help because as far as the filesystem is concerned the blocks have already been written (i.e., to S3Backer's cache).
In other words, I wouldn't use this as a backup drive. You can reduce the risk by turning off some of the performance optimizations to minimize the amount of data in limbo waiting to be written to Amazon.
Block sizes that are too small (e.g., the 4K default) can add significant extra costs (e.g., about $130 in PUT request fees for 50GB worth of storage with 4K blocks).
Block sizes that are too large can add significant data transfer and storage fees.
Memory usage can be prohibitive: by default it caches 1000 blocks. With the default 4K block size that's not an issue, but most users will probably want to increase the block size.
So watch out: 1000 x 256KB blocks = 256MB.
You can adjust the amount of cached blocks to control memory usage.
Future versions of S3Backer will probably include disk based caching which will mitigate data corruption and memory usage issues.
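The cache memory arithmetic is easy to check before picking a block size. This is a back-of-the-envelope sketch that ignores per-block bookkeeping overhead; the function names and the 64MB budget are made up for illustration. In binary units, 1000 blocks of 256KB come to 250MB, or about 262MB in decimal units.

```python
KB = 1024
MB = 1024 ** 2


def cache_memory_bytes(cached_blocks, block_size_bytes):
    """Memory needed to hold the given number of cached blocks."""
    return cached_blocks * block_size_bytes


def blocks_for_budget(budget_bytes, block_size_bytes):
    """How many blocks fit in a given memory budget."""
    return budget_bytes // block_size_bytes


DEFAULT_CACHED_BLOCKS = 1000  # default cache size mentioned in the text

for block_kb in (4, 64, 256):
    used = cache_memory_bytes(DEFAULT_CACHED_BLOCKS, block_kb * KB)
    print(f"{block_kb:>4}KB blocks x {DEFAULT_CACHED_BLOCKS}: {used / MB:6.1f} MB")

# Hypothetical budget: how many 256KB blocks fit in 64MB of RAM?
print(blocks_for_budget(64 * MB, 256 * KB))
```

Working backwards from a memory budget like this gives a sensible cached block count to use when raising the block size.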
Use Reiserfs. It can store multiple small files in the same blocks.
It also doesn't populate the filesystem with too many empty blocks when the filesystem is created which makes filesystem creation faster and more efficient.
Also, with Reiserfs I tested expanding and shrinking of the filesystem (after increasing/decreasing the size of the virtual block device) and it seemed to work just fine.
S3Backer supports storing multiple block devices in the same bucket using prefixes.