BusyBeaver key derivation function

June 27, 2015:

BusyBeaver is now available in your browser!



I present BusyBeaver, a password-based key derivation function (PBKDF) which I believe to be original and new. BusyBeaver attempts to improve on lessons of the past, which I would summarize thusly:
  • Recklessly foolish systems store passwords in plaintext. Anyone who peeks at database has everyone's passwords.
  • Naive systems try to protect passwords by storing their one-way hashes. It turns out it's cheap and easy to mount dictionary attacks.
  • Less naive systems store salted one-way hashes of passwords. It turns out low-iteration hashes are easy to brute-force.
  • PKCS#5 specifies PBKDF2, a salted one-way hash, repeated many times. Along comes Bitcoin, proving just how efficiently standard hashes can be brute-forced with GPUs, FPGAs, and ASICs.
  • Colin Percival creates scrypt, a PBKDF that attempts to increase the cost of brute-forcing by being memory-hungry. Along comes Litecoin, showing how GPU, FPGA, and ASIC optimizations are still quite feasible, and their improvements still measure in orders of magnitude.
BusyBeaver attempts to improve on this by taking advantage of what CPUs do well - serial execution of arbitrary code, 64-bit instructions, cache prefetch - in order to thwart optimizations that are made easier by the order of operations being static. BusyBeaver does not try to be as memory hungry as scrypt; top-end GPUs that might be used in brute-forcing now include so much memory per core as to make topping this expensive for legitimate users.

BusyBeaver operates as follows:
  1. Takes as input a salt and a password.
  2. Uses provided inputs to generate two 64 kB blocks of data using SHA-512 and HMAC-SHA512.
  3. Interprets the two blocks as byte code for a virtual machine where all inputs are valid as instructions and arguments. To avoid side channel attacks based on cache timing, all memory read/write offsets are controlled by the first block, which is generated from salt information only. Instructions perform calculations and changes on the second block, using operations that preserve entropy (jumps, additions, rotations, substitutions, XOR, and multiplication/modulo).
  4. The executed algorithm is unpredictable, but deterministically dependent on input.
  5. The algorithm is exceedingly unlikely to run into infinite loops, so much so that if it does, it's not a problem. Any infinite loops are local to very rare combinations of salt and password, and the attacker's cost of detecting them is greater than the maximum possible benefit.
  6. Processing stops when the requested number of operations have been executed.
  7. The final digest is produced as HMAC-SHA512 dependent on salt, password, and the final processed 64kB block.
BusyBeaver's C++ source code can be downloaded here:

BusyBeaver-20150703.zip (31 kB)

It is platform-agnostic, includes a test program, and comes with a permissive, free software license.

My implementation is parameterized by the number of operations to be executed. I find that a good value is on the order of 50,000 - enough to ensure that execution is not dominated by SHA-512, and nearly all bytes of the second 64 kB block have been replaced with different values.

The full KDF, parameterized with 50,000 operations, executes in about 6 ms on my machine as 64-bit code (of which 20% is SHA-512), and 18 ms as 32-bit code (of which 25% is SHA-512).



June 25, 2015:

We have verified that, at least on AMD R9 290X, BusyBeaver appears to resist attempts at GPU optimization. A colleague implemented and tested BusyBeaver with OpenCL, attempting to see if the insights that allow an efficient GPU implementation of scrypt, as used in LiteCoin mining, would also work on BusyBeaver. In the time allotted to this task, he was unable to implement a GPU version of BusyBeaver that would perform substantially better - or even significantly better - than what is achievable on CPU.

The crucial requirement to efficiently implement an algorithm on the GPU is to make calculations fit within the 32 kB of local memory available to each Compute Unit. In the scrypt case, normally more than 32 kB of memory would be required. However, scrypt has weaknesses (or features, if that's what you prefer) that allow the memory requirement to be reduced in exchange for computing more information on the fly. The parameters for scrypt, as used in LiteCoin, further require a low memory footprint, and the combination makes it possible to run the brunt of the algorithm within 32 kB.

With BusyBeaver, such optimizations are more difficult. Whereas scrypt only reads from the large memory buffer, BusyBeaver also writes to it. Whereas scrypt reads from the memory in large fixed chunks, BusyBeaver reads and writes in unpredictable sizes - often as small as one byte. In the time allotted to the task, my colleague has not been able to figure out how to meaningfully reduce the memory requirement. This means that each Compute Unit has to use global memory, which with this pattern of memory access is quite slow.

I expect that BusyBeaver would run orders of magnitude faster on a GPU with 128 kB, or at least 96 kB, of memory local to each Compute Unit. However, if we are to draw conclusions from the history of level 1 cache sizes for the CPU, we can suspect that there are tradeoffs involved with making this memory bigger. If similar pressures exist on the GPU that are keeping level 1 caches small on the CPU, major increases in local memory for general purpose GPUs may be unlikely. However, if GPUs were to introduce an intermediate category of memory, similar to level 2 cache on the CPU, that might also make them much more effective at running BusyBeaver.

I would love to see research into, and would welcome feedback from, any attempts to implement BusyBeaver efficiently on an FPGA. For the time being, I have confidence that BusyBeaver does favor the CPU more than alternative algorithms. We are therefore going ahead, and will be using it in the next version of Bitvise SSH Server.

July 2-3, 2015:

A parallel implementation, ParaBeaver, is now implemented separately in ParaBeaver.h/.cpp. It uses C++11 threading in order to run a number of concurrent threads processing BusyBeaver instances. ParaBeaver is useful on clients that can expect to have most cores idle, and can benefit from parallelization to gain security at no cost to the user.

The implementation of MulModOp64 now uses a right shift by 4 bits between multiplication and modulo in order to use a portion of multiplication result that preserves more entropy. The result of MulModOp64 is now used via rotBits in all instructions, to prevent the CPU from detecting it's not used and optimizing it away.

An additional parameter, swapBlockThreshold, is now supported to provide a tradeoff between side channel resistance and preventing an attacker from rearranging the order in which instructions are executed. BusyBeaver will now operate in side-channel-free mode the first 1/4 of the time (all memory offsets are exclusively salt-based), and will swap blocks the remainder of the time (memory offsets become also password based, and both blocks are read from and written to).

ParaBeaver supports swapBlockThreshold as well, and implements it sensibly for multithreaded execution.

Comments

Popular posts from this blog

When monospace fonts aren't: The Unicode character width nightmare

"Unreachable" beauty standards

VS 2015 projects: "One or more errors occurred"