Generating Unique IDs Fast

This article examines the cost of atomic operations in a unique 64-bit ID generator that uses a global counter and produces unique IDs by incrementing that counter atomically. It then offers a faster approach that uses thread-local storage as an ID range cache to minimize accesses to the global variable.

Global Counter

Let's start with the most straightforward approach: a single global counter incremented with atomic operations:

uint64_t generateIdGlobal() {
  static std::atomic<uint64_t> globalCounter(0);
  return ++globalCounter;
}

This code always compiles to lock-free atomics on X86, which has the CMPXCHG8B instruction; that alone is enough to implement any kind of 64-bit atomic operation, even in 32-bit mode. However, targets that don't provide 64-bit atomic operations offer no such guarantee: there std::atomic<uint64_t> has to fall back to a lock, which would slow down the generator considerably.
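
If you want to detect that situation early, C++17 lets you assert at compile time that 64-bit atomics are lock-free on the target. A minimal check (illustrative, not part of the generator itself) could look like this:

#include <atomic>
#include <stdint.h>

// Fails the build on targets where std::atomic<uint64_t> is not lock-free
// and would silently fall back to an internal lock (C++17 and later).
static_assert(std::atomic<uint64_t>::is_always_lock_free,
              "64-bit atomics are not lock-free on this target");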

Global & Thread-Local Counter

If the only requirement for the ID generator is to return unique numbers, which do not have to be monotonically increasing across threads, we can use thread-local storage as a local cache and only touch the global counter atomically once a thread has exhausted its locally reserved range of IDs:

uint64_t generateIdLocal() {
  static std::atomic<uint64_t> globalCounter(0);

  static constexpr uint32_t cacheSize = 4096;
  static thread_local uint64_t localIdx = 0;

  // Reserve a fresh range of `cacheSize` IDs once the current range is
  // exhausted (cacheSize is a power of 2, so masking detects the boundary).
  if ((localIdx & (cacheSize - 1)) == 0)
    localIdx = globalCounter.fetch_add(uint64_t(cacheSize));

  return ++localIdx;
}

This approach requires more code, but it minimizes the use of atomic operations on the global counter. And since atomics are now needed only rarely, we can also consider guarding access to globalCounter with a mutex, in case we run on hardware that doesn't provide 64-bit atomics:

static std::mutex globalMutex;

uint64_t generateIdLocalMutex() {
  static uint64_t globalCounter = 0;

  static constexpr uint32_t cacheSize = 4096;
  static thread_local uint64_t localIdx = 0;

  // Same range reservation as before, but guarded by a mutex instead of
  // an atomic fetch_add.
  if ((localIdx & (cacheSize - 1)) == 0) {
    std::lock_guard<std::mutex> guard(globalMutex);
    localIdx = globalCounter;
    globalCounter += cacheSize;
  }

  return ++localIdx;
}

Performance

So what would be your guess regarding the performance of each approach? I have written the following code to benchmark various implementations of the ID generator:

#include <stdint.h>
#include <stdio.h>   // printf
#include <atomic>
#include <mutex>
#include <thread>
#include <chrono>

typedef uint64_t (*GenerateIdFunc)(void);

static std::mutex globalMutex;

static uint64_t generateIdGlobal() {
  static std::atomic<uint64_t> globalCounter(0);
  return ++globalCounter;
}

static uint64_t generateIdGlobalMutex() {
  static uint64_t globalCounter = 0;
  std::lock_guard<std::mutex> guard(globalMutex);
  return ++globalCounter;
}

static uint64_t generateIdLocal() {
  static std::atomic<uint64_t> globalCounter(0);

  static constexpr uint32_t cacheSize = 4096;
  static thread_local uint64_t localIdx = 0;

  if ((localIdx & (cacheSize - 1)) == 0)
    localIdx = globalCounter.fetch_add(uint64_t(cacheSize));

  return ++localIdx;
}

static uint64_t generateIdLocalMutex() {
  static uint64_t globalCounter = 0;

  static constexpr uint32_t cacheSize = 4096;
  static thread_local uint64_t localIdx = 0;

  if ((localIdx & (cacheSize - 1)) == 0) {
    std::lock_guard<std::mutex> guard(globalMutex);
    localIdx = globalCounter;
    globalCounter += cacheSize;
  }

  return ++localIdx;
}

static void testFunction(GenerateIdFunc func, const char* name) {
  std::atomic<uint64_t> result {};
  constexpr size_t numThreads = 8; // varied from 1 to 16 to produce the results below
  constexpr size_t numIterations = 1024 * 1024 * 16;

  printf("Testing %s:\n", name);
  auto start = std::chrono::high_resolution_clock::now();

  std::thread threads[numThreads];
  for (size_t i = 0; i < numThreads; i++)
    threads[i] = std::thread([&]() {
      uint64_t localResult = 0;
      for (size_t j = 0; j < numIterations; j++)
        localResult += func();
      result.fetch_add(localResult);
    });

  for (size_t i = 0; i < numThreads; i++)
    threads[i].join();

  auto end = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double> elapsed = end - start;

  printf("  Time: %0.3g s\n", elapsed.count());
  printf("  Result: %llu\n", (unsigned long long)result.load());
}

int main(int argc, char** argv) {
  testFunction(generateIdGlobal, "Global");
  testFunction(generateIdGlobalMutex, "GlobalMutex");
  testFunction(generateIdLocal, "Local");
  testFunction(generateIdLocalMutex, "LocalMutex");
  return 0;
}

Results on AMD Ryzen 3950x (Linux):

Approach          1 Thread   2 Threads   4 Threads   8 Threads   16 Threads
Global - Atomic   0.081s     0.292s      0.580s      1.070s      1.980s
Global - Mutex    0.164s     0.782s      4.440s      9.970s      19.00s
Local - Atomic    0.030s     0.039s      0.039s      0.038s      0.041s
Local - Mutex     0.038s     0.039s      0.037s      0.037s      0.056s

Conclusion

The results should not be surprising. Atomic operations are cheaper than synchronization primitives guarding a shared resource, because those primitives are themselves built on atomics. Our benchmark shows this clearly: with no contention, the cost of acquiring and releasing a mutex is roughly 2 atomic operations (one for each), which is why the single-threaded mutex run takes about twice as long as the atomic one. Accessing thread-local storage is much cheaper still, and it clearly pays off in our case.
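
To illustrate where those two atomic operations come from, here is a minimal test-and-set spinlock. This is only a sketch for intuition, not how std::mutex is actually implemented; real mutexes add backoff, futex waits, and fairness logic:

#include <atomic>

class SpinLock {
public:
  void lock() noexcept {
    // One atomic read-modify-write to acquire (spins while contended).
    while (flag.exchange(true, std::memory_order_acquire)) {
      // Busy-wait; a production lock would back off or yield here.
    }
  }

  void unlock() noexcept {
    // One atomic store to release.
    flag.store(false, std::memory_order_release);
  }

private:
  std::atomic<bool> flag {false};
};

In the uncontended case, lock() performs exactly one atomic exchange and unlock() one atomic store, which matches the roughly 2x gap between the Global - Atomic and Global - Mutex rows at one thread.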

I would like to note that this is a microbenchmark that stresses access to a single shared resource. Such high contention is unlikely in common situations, and the speedup offered by a different approach may be totally negligible in many real-world scenarios.

Additionally, this article implements a 64-bit counter, which would take about 136 years to overflow even if the process generated 4 billion unique IDs per second (2^64 IDs / 2^32 IDs per second = 2^32 seconds, roughly 136 years), which is unlikely. However, the cached ranges are not free: each thread abandons the unused part of its range, so if wasting IDs is a concern, the thread-local approach with a mutex can be extended with more complex logic that returns and reuses cached ranges.
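
As a sketch of what such logic could look like (the names and structure here are my own, not from the code above), a thread can return the unused part of its cached range to a global free list when it exits, so no IDs are ever lost:

#include <stdint.h>
#include <mutex>
#include <utility>
#include <vector>

static std::mutex rangeMutex;
static uint64_t rangeCounter = 0;
static std::vector<std::pair<uint64_t, uint64_t>> freeRanges; // [first, last) ranges

struct IdRange {
  uint64_t next = 0;
  uint64_t end = 0;

  ~IdRange() {
    // Hand the unused remainder back so another thread can reuse it.
    std::lock_guard<std::mutex> guard(rangeMutex);
    if (next < end)
      freeRanges.emplace_back(next, end);
  }
};

uint64_t generateIdReclaiming() {
  static constexpr uint64_t cacheSize = 4096;
  static thread_local IdRange range;

  if (range.next == range.end) {
    std::lock_guard<std::mutex> guard(rangeMutex);
    if (!freeRanges.empty()) {
      // Prefer a previously abandoned range over fresh IDs.
      std::pair<uint64_t, uint64_t> r = freeRanges.back();
      freeRanges.pop_back();
      range.next = r.first;
      range.end = r.second;
    }
    else {
      range.next = rangeCounter;
      range.end = rangeCounter + cacheSize;
      rangeCounter += cacheSize;
    }
  }

  return range.next++;
}

Note that this variant hands out IDs starting at 0 and keeps them unique but not ordered; the free list trades a slightly larger critical section for a leak-free ID space.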