Technical: OctAlloc improved

I wrote previously about my attempts to engineer a concurrency-friendly memory allocation algorithm for Windows platforms that would be efficient for the most demanding multi-threaded and multi-core use.

I made available in the previous post an implementation, OctAlloc, which performs way better than the default heap allocator in Windows XP or 2003, and fares well against the Vista allocator in highly concurrent situations, but is less performant than the Vista allocator for single-threaded multi-core use.

I have since made two optimizations to OctAlloc which improve its performance beyond the Vista allocator in all respects. The optimizations are as follows:
  • The previously posted version of OctAlloc uses thread-local storage to store the most likely CPU number that the current thread is running on. The TlsGetValue() and TlsSetValue() operations this requires are however somewhat expensive, and do not pay off when there's only a single thread. The new version uses a less reliable, more heuristic, but also highly more efficient strategy by using GetCurrentThreadId(), an inexpensive call, to choose the allocation buckets for the executing thread. This improves performance by some 25%.
  • The previous version of OctAlloc always pops a memory chunk and returns it back to an atomic list after allocating a unit for it. This requires two synchronized operations for every call. The new version keeps handy an additional list of preallocated units which do not require popping and returning a memory chunk. This requires only one synchronized operation for most allocations.
My measurements for the new version follow at the bottom of this post below. According to my measurements, OctAlloc outperforms the Vista allocator in performance as well as tightness in managing memory.

Note that the performance figures are not comparable directly to the measurements I posted in the previous OctAlloc article. Previously, the test program kept the load-per-thread constant regardless of the number of test threads, causing 8- and 16-thread measurements to run very long. The new version of the test program varies the load-per-thread depending on the number of threads and the number of available CPUs. An algorithm with no contention should now complete in the same time regardless of the number of threads; if it takes longer to complete with multiple threads than it does with one, the difference is attributable directly to contention.

Download the new test binaries and the source code:

OctAlloc2.zip (51 kB)

Comments

Anonymous said…
I have been looking for a replacement for the default heap manager for VC and came across your library. It seems to work quite well and avoids the contention issues I was experiencing with the Microsoft libaries. But there seems to be a rather obvious bug. The following snippet:

void* __fastcall OctAllocFlex(OctSizeT* size)
{
if (*size > c_bucketSizes[0])
{
*size = (*size + 0xFFFF) / 0x10000;
return VirtualAlloc(0, *size, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
}
else
{

Is surely wrong, and indeed crashes my code when the VirtualAlloc is invoked as the size is divided by 0x10000. Sureley the code should be:

void* __fastcall OctAllocFlex(OctSizeT* size)
{
if (*size > c_bucketSizes[0])
{
*size = ((*size + 0xFFFF) / 0x10000)* 0x10000;
return VirtualAlloc(0, *size, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
}
else
{

So the size is restored in the correct block size.
Is this right or I am going mad?

Thanks,
Gareth.
denis bider said…
Gareth,

yes, there's that obvious bug. Your fix looks right.

I stumbled upon this bug after publishing OctAlloc, but I didn't get around to updating the published archive because it's somewhat trivial and I wasn't sure that anyone is even interested.

I will upload a fixed version now though. Thanks for reminding me.
denis bider said…
The published OctAlloc2.zip is now updated with this bug fixed.
Anonymous said…
Hello Mr. Bider!

I'm a researcher from a Federal University from Brazil and I'm studing about memory allocators in general. In this moment, I'm researching about the Windows memory allocator: HeapAlloc.
Do you know how does it work? What kind of data structure is used (say, double linked list, queue, ...)? Where can I find more details about the source of the function HeapAlloc?

Thank you!
denis bider said…
I'm not intimately familiar with the internal workings of HeapAlloc, no. You should seek this information from sources closer to Microsoft.
Anonymous said…
I recently got a new Dell quad-core With Windows 7 and have had problems since day 1 (had to hard-boot after about an hour...). My main problem isn't with Dell, though, but with Windows 7 (I already fixed Dell's issues). The programs hang constantly - a typical problem apparently. I think it's a cpu allocation problem. You were comparing your OctAlloc to Vista's HeapAlloc. Do you know how it compares to Windows 7's allocator? (I'm assuming they are different, but I don't know for certain)
denis bider said…
Anonymous,

there is a better than 99% probability that this article is completely unrelated to why you're having trouble with your computer.

Popular posts from this blog

When monospace fonts aren't: The Unicode character width nightmare

VS 2015 projects: "One or more errors occurred"

Is the internet ready for DMARC with p=reject?