Background: how a backup key works
In TKLBAM, the backup key is a secret that is encrypted with a passphrase and then uploaded to the Hub. Decrypting the backup key yields the secret, which is passed on to duplicity (and eventually to GnuPG) to be used as the symmetric key with which backup volumes are encrypted on backup and decrypted on restore.
When you create a new backup, or change the passphrase on an existing backup, a new backup key is uploaded to the Hub where it is stored in the key field for that backup record.
When you restore, tklbam downloads the backup key from the Hub and decrypts it locally on the computer performing the restore. Note that the Hub only allows you to download the backup key for backup records to which you have access (e.g., you are the owner).
Only you can decrypt your passphrase protected backups
All of this matters because it means that as long as you use a passphrase to protect the key, even the Hub can't decrypt your backups, only you can - provided you remember the passphrase (or failing that, at least have the escrow key stored in a safe place).
In other words, the decryption of the backup key happens locally and at no point does the passphrase reach the Hub, so we couldn't decrypt your backup even if you asked us to. Neither could an attacker who has hypothetically compromised the Hub, or a government agency that comes kicking down our door with a court warrant.
The problem with cryptographic passphrases
But wait. If an attacker has local access to the key, his ability to run dictionary attacks to find the key's passphrase is limited only by the computational resources he can throw at it.
Remember there's a critical difference between a passphrase used for authentication purposes (e.g., to an online service) and a passphrase used for cryptographic purposes.
A passphrase used for authenticating to an online service doesn't need to be as strong as a passphrase that is used cryptographically, because with an online service, even if no explicit countermeasures are used (e.g., IP blacklisting after too many failed attempts), there is still a network between the attacker and the service. The available bandwidth places a debilitating upper limit on how many passphrases can be tried per second. Also, in practice there are usually bottlenecks in other places which would slow down an online dictionary attack even further.
But a passphrase used for cryptographic purposes assumes the attacker has access to the ciphertext, and that's a whole different ball game.
To better understand what we're up against, here's the formula for calculating the size of the passphrase search space:
log(howmany_different_possible_values ** howmany_values) / log(2)
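As a minimal sketch, the formula translates directly into a small Python function. The name search_space_bits and its parameters are mine, purely for illustration, and the 4-digit PIN is just a convenient example input.

```python
import math

def search_space_bits(possible_values, how_many):
    """Bits of search space for `how_many` symbols, each drawn uniformly
    from `possible_values` alternatives."""
    # log(possible_values ** how_many) / log(2), computed without
    # building the huge intermediate number.
    return how_many * math.log2(possible_values)

# Illustrative input: a 4-digit PIN (10 possible values, 4 positions).
print(round(search_space_bits(10, 4), 1))  # 13.3
```

Note that the formula only holds when every symbol is picked uniformly at random, so it gives an upper bound for anything a human chose.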
For example, consider a typical 6 letter password.
6 ASCII printable letters = maximum 42-bits of search space.
That's a maximum of 4 trillion possible combinations. Which sounds like a lot. But it really isn't, since:
- You can probably squeeze out about 1 million local passphrase tests per second from a modern multi-core workstation, assuming a typical passphrase encryption method is used.
- This is one of those problems that are trivial to parallelize. If you rent just 100 computers (e.g., in the cloud) you could exhaustively search through 42-bits in about 12 hours. And remember, today the bad guys often have millions of computers at their disposal via botnets.
- People are very bad at choosing truly random passwords. A clever attacker will try the low hanging fruit first, so they're likely to find out your passphrase much sooner than by brute forcing blindly through the full search space.
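The arithmetic behind an exhaustive search estimate can be sketched in a few lines. The function name is hypothetical, and the inputs are the rough figures quoted above (42 bits, about 1 million tests per second per machine, 100 machines), not measurements.

```python
def seconds_to_exhaust(bits, tests_per_second_per_machine, machines):
    """Worst-case seconds needed to try every value in a `bits`-bit space."""
    total_candidates = 2 ** bits
    return total_candidates / (tests_per_second_per_machine * machines)

# Rough figures from the text: 42 bits, 1 million tests/sec, 100 machines.
hours = seconds_to_exhaust(42, 1_000_000, 100) / 3600
print(round(hours, 1))  # 12.2
```

This is the worst case. On average the attacker finds the passphrase after searching half the space, and sooner still if the passphrase isn't truly random.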
For example, say you know a 6 letter password is much too short for an encryption key and instead you're using a longer random combination of 10,000 common English words:
- 2 words = 18-bits worth of search space.
- 3 words = 27-bits worth of search space.
- 4 words = 36-bits worth of search space.
English words aren't very random so your "paranoid" 3 word, 17 letter passphrase may actually be easier to crack than a truly random combination of just 4 ASCII printable characters (28-bits).
For comparison, let's see what happens if you use 6 random individual characters:
- If you just use random lowercase characters, the search space is reduced to 27-bits, which is 32,768 times easier to search through than the full 42-bit search space of 6-letter ASCII printable passwords.
- If you just use random lowercase characters and numbers, the search space is 30-bits, which is 4,096 times easier to search through.
- If you just use random lowercase and uppercase characters and numbers, the search space is 35-bits, which is 128 times easier to search through.
The good news is that each bit of search space doubles the expense for the attacker.
The bad news is that it takes a truly random combination of 11 uppercase, lowercase characters and numbers just to reach 64-bits worth of search space, and a 10M strong botnet could crack even that in an average of 10 days.
Bottom line: even your supposedly ultra-paranoid passphrase (e.g., r0m4n14nv4mp1r344rdv4rkn3st) of 4 random words from a dictionary of 150K words (in l33t speak) only has about 50-bits worth of entropy, despite being 27 characters long. A 10,000-strong botnet could crack that in about a day.
Countermeasures: increase computational cost
Though it's impossible to prevent these attacks entirely I've implemented a couple of countermeasures in the way TKLBAM generates passphrase protected keys:
1) The first trick: increase how computationally expensive it is to calculate the cipher key from the passphrase:
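    import hashlib

    def _repeat(f, input, count):
        # Apply f to its own output `count` times.
        for x in range(count):
            input = f(input)
        return input

    def _cipher_key(passphrase, repeats):
        # Feed SHA-256 into itself `repeats` times to derive the cipher key.
        cipher_key = _repeat(lambda k: hashlib.sha256(k).digest(),
                             passphrase, repeats)
        return cipher_key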
The principle is that calculating a hash costs CPU time so by feeding the hash into itself enough times we can linearly increase how expensive it is to map the passphrase-space to the key-space.
For example, repeating the hash routine 100,000 times takes about a quarter second on one of the cores of my computer. If I use all 4 cores this limits me to generating 16 cipher keys per second. Down from 1.6 million cipher keys per second. So that's one way to dramatically reduce the practical feasibility of a dictionary or exhaustive brute force attack.
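If you want to see what the repeated hash costs on your own hardware, here's a rough timing sketch. It's written for Python 3, the names stretch and keys_per_second are mine rather than TKLBAM's, and multiplying by the core count assumes the work parallelizes perfectly.

```python
import hashlib
import time

def stretch(passphrase, repeats):
    """Feed SHA-256 into itself `repeats` times, starting from the
    passphrase (given as bytes)."""
    key = passphrase
    for _ in range(repeats):
        key = hashlib.sha256(key).digest()
    return key

def keys_per_second(repeats, cores=1, passphrase=b"example passphrase"):
    """Time one stretch on this machine and extrapolate to `cores` cores
    (assumes perfect scaling across cores)."""
    start = time.perf_counter()
    stretch(passphrase, repeats)
    elapsed = time.perf_counter() - start
    return cores / elapsed

# The result depends entirely on the machine running it.
print(keys_per_second(100_000, cores=4))
```

The absolute number will vary from machine to machine. What matters is that it falls roughly linearly as the repeat count goes up.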
Note that an attacker can't circumvent this calculation by searching through the key-space directly, because even after we make mapping the passphrase-space to keys 100,000 times more expensive, the cost of trying to brute-force the 256-bit key-space directly is still countless trillions of times greater.
The weakness of this technique is that an attacker would have to pay the cost of mapping the passphrase-space (e.g., a dictionary) to the key-space only once when trying to crack multiple keys.
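To make that weakness concrete, here's a small sketch of the precomputation. The candidate list and the function names are hypothetical, and the repeat count is kept low so the example runs quickly.

```python
import hashlib

def stretch(passphrase, repeats):
    """Iterated SHA-256, as in the first trick."""
    key = passphrase
    for _ in range(repeats):
        key = hashlib.sha256(key).digest()
    return key

def precompute_keys(candidates, repeats):
    """Pay the stretching cost once per candidate passphrase."""
    return {word: stretch(word, repeats) for word in candidates}

# Hypothetical candidate list, built once by the attacker.
candidate_keys = precompute_keys([b"password", b"letmein", b"hunter2"], 1000)

# The derived key depends only on the passphrase and the repeat count, so
# two different users who picked the same passphrase get the same cipher key.
user_a_key = stretch(b"letmein", 1000)
user_b_key = stretch(b"letmein", 1000)
print(user_a_key == user_b_key == candidate_keys[b"letmein"])  # True
```

Because nothing in the mapping is specific to a given key packet, the same table of candidate keys can be tried against as many packets as the attacker has collected.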
2) The second trick: increase how computationally expensive it is to decrypt the key packet by increasing the number of times we pass it through encryption:
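    # AES as provided by PyCrypto (assumed from the AES.new call).
    from Crypto.Cipher import AES

    def _cipher(cipher_key):
        return AES.new(cipher_key, AES.MODE_CBC)

    # Encrypt the padded plaintext, then re-encrypt the result,
    # cipher_repeats times in total.
    ciphertext = _repeat(lambda v: _cipher(cipher_key).encrypt(v),
                         _pad(plaintext), cipher_repeats)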
This part of the computational expense is key-specific so trading off memory to pre-calculate the mapped key-space won't help you with this step.
Embedding repeat parameters in the key packet
The current implementation hardwires 100,000 repeats of the hash, and another 100,000 repeats of the cipher.
This makes searching through the passphrase-space about 200,000 times more expensive. On my workstation it takes 0.5 seconds to encrypt or decrypt a key (per-core).
I'm not sure these are the ideal parameters, but they are in the ballpark of how much you can increase the computational expense before usability suffers.
That's not to say you couldn't go higher, but there's a practical upper boundary to that too. If you're willing to wait about a minute for key generation/decryption you could increase the computational expense about 100 times over and that would give you 100 times better protection or allow you to use a password that is 100 times weaker with the same protection.
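Since each bit of search space doubles the attacker's expense, a cost multiplier can be restated as an equivalent number of extra bits. Here's a quick sketch of that conversion (the function name is mine, for illustration only):

```python
import math

def equivalent_extra_bits(cost_multiplier):
    """Extra bits of search space that would cost the attacker as much as
    making every passphrase test `cost_multiplier` times more expensive."""
    return math.log2(cost_multiplier)

print(round(equivalent_extra_bits(200_000), 1))  # 17.6
print(round(equivalent_extra_bits(100), 1))      # 6.6
```

So the current parameters are worth a bit under 18 bits, and waiting 100 times longer would buy less than 7 more.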
Just in case, to allow the number of repeats to change or be user configurable in the future, the key formatter routine embeds the repeat parameters into the unencrypted portion of the key packet. This allows the key parsing routine to extract these parameters from the key itself, so it can just as easily parse a 0.5 second key (i.e., the current default) as a 5 second or 50 second key.
Embedding a version id
Just to make sure the key format is future proof I'm also embedding a version id into it.
Embedding a version costs almost nothing (an extra byte) and makes it easier to support incompatible changes to the key format should the need arise (e.g., changing of cipher/hash, changing the format, etc.).
Worst case scenario, we increment the version and implement a new incompatible key format. Old clients won't be able to understand the new key format but will at least fail reliably, and new clients will be able to support both new and old key formats.
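To illustrate the idea, here's a minimal sketch of a key packet with an unencrypted header carrying a version id and the two repeat counts. The layout (1 version byte followed by two 32-bit big-endian integers), the version value and the function names are assumptions made up for this example. They are not TKLBAM's actual key format.

```python
import struct

KEY_VERSION = 1                 # hypothetical version id
HEADER = struct.Struct(">BII")  # version, hash repeats, cipher repeats

def format_key(hash_repeats, cipher_repeats, ciphertext):
    """Prepend the unencrypted header to the encrypted payload."""
    return HEADER.pack(KEY_VERSION, hash_repeats, cipher_repeats) + ciphertext

def parse_key(packet):
    """Read the parameters back out of the packet, failing loudly on a
    version this client doesn't understand."""
    version, hash_repeats, cipher_repeats = HEADER.unpack_from(packet)
    if version != KEY_VERSION:
        raise ValueError("unsupported key version: %d" % version)
    return hash_repeats, cipher_repeats, packet[HEADER.size:]

packet = format_key(100000, 100000, b"\x00" * 16)
print(parse_key(packet)[:2])  # (100000, 100000)
```

A newer client would extend parse_key to dispatch on the version byte, while an older client hits the ValueError instead of silently misreading a format it doesn't know.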