Generating Unique IDs Fast

This article explores the cost of atomic operations in a 64-bit ID generator that relies on a global counter to generate unique IDs incrementally. It also presents a quicker alternative that uses thread-local storage as an ID cache to minimize access to the global counter.

Global Counter

Let's start with the most straightforward approach: a single global counter incremented with atomic operations:

uint64_t generateIdGlobalAtomic() {
  static std::atomic<uint64_t> globalCounter(0);
  return ++globalCounter;
}

This code is always lock-free on x86 hardware, because even 32-bit x86 provides the CMPXCHG8B instruction, which is enough to implement any kind of 64-bit atomic operation. However, lock-free operation is not guaranteed on targets that don't provide 64-bit atomic instructions. In that case the implementation falls back to a lock such as a mutex, which would slow down the generator considerably.
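
Whether this fallback applies can be checked directly. The following is a minimal sketch of such a check; the static_assert uses is_always_lock_free, which requires C++17, while the runtime is_lock_free() query has been available since C++11:

#include <stdint.h>
#include <stdio.h>
#include <atomic>

int main() {
  // Compile-time check (C++17): the build fails if 64-bit atomics
  // would fall back to a lock on this target.
  static_assert(std::atomic<uint64_t>::is_always_lock_free,
                "std::atomic<uint64_t> is not lock-free on this target");

  // Runtime check (C++11).
  std::atomic<uint64_t> counter(0);
  printf("64-bit atomics lock-free: %s\n", counter.is_lock_free() ? "yes" : "no");

  return 0;
}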

Global & Thread-Local Counter

If the only requirement is that the generator returns unique numbers, which do not have to be strictly increasing, we can use thread-local storage as a local cache and touch the global counter atomically only when a thread has exhausted its locally reserved range of IDs:

uint64_t generateIdLocalAtomic() {
  static std::atomic<uint64_t> globalCounter(0);

  // Each thread reserves a block of `cacheSize` IDs at a time
  // (cacheSize must be a power of two for the mask below to work).
  static constexpr uint32_t cacheSize = 4096;
  static thread_local uint64_t localIdx = 0;

  // On first use, or when the reserved block is exhausted,
  // atomically grab the next block from the global counter.
  if ((localIdx & (cacheSize - 1)) == 0)
    localIdx = globalCounter.fetch_add(uint64_t(cacheSize));

  return ++localIdx;
}

This approach is of course longer, but it reduces atomic manipulation of the global counter to one operation per cacheSize generated IDs. And since the global counter is now touched so rarely, we can also consider guarding it with a mutex for hardware that lacks 64-bit atomics:

static std::mutex globalMutex;

uint64_t generateIdLocalMutex() {
  static uint64_t globalCounter = 0;

  static constexpr uint32_t cacheSize = 4096;
  static thread_local uint64_t localIdx = 0;

  // Reserve the next block under the lock; unlike fetch_add, reading
  // and advancing the counter must both happen inside the critical section.
  if ((localIdx & (cacheSize - 1)) == 0) {
    std::lock_guard<std::mutex> guard(globalMutex);
    localIdx = globalCounter;
    globalCounter += cacheSize;
  }

  return ++localIdx;
}

Performance

So, what would you guess about the performance of each approach? The following code benchmarks the implementations above, together with a fourth variant, generateIdGlobalMutex, which guards a plain global counter with a mutex:

#include <stdint.h>
#include <stdio.h>
#include <atomic>
#include <mutex>
#include <thread>
#include <chrono>

typedef uint64_t (*GenerateIdFunc)(void);

static std::mutex globalMutex;

static uint64_t generateIdGlobalAtomic() {
  static std::atomic<uint64_t> globalCounter(0);
  return ++globalCounter;
}

static uint64_t generateIdGlobalMutex() {
  static uint64_t globalCounter = 0;
  std::lock_guard<std::mutex> guard(globalMutex);
  return ++globalCounter;
}

static uint64_t generateIdLocalAtomic() {
  static std::atomic<uint64_t> globalCounter(0);

  static constexpr uint32_t cacheSize = 4096;
  static thread_local uint64_t localIdx = 0;

  if ((localIdx & (cacheSize - 1)) == 0)
    localIdx = globalCounter.fetch_add(uint64_t(cacheSize));

  return ++localIdx;
}

static uint64_t generateIdLocalMutex() {
  static uint64_t globalCounter = 0;

  static constexpr uint32_t cacheSize = 4096;
  static thread_local uint64_t localIdx = 0;

  if ((localIdx & (cacheSize - 1)) == 0) {
    std::lock_guard<std::mutex> guard(globalMutex);
    localIdx = globalCounter;
    globalCounter += cacheSize;
  }

  return ++localIdx;
}

static void testFunction(GenerateIdFunc func, const char* name, uint32_t numThreads) {
  constexpr size_t numIterations = 1024 * 1024 * 16;

  // Sum the generated IDs so the compiler cannot optimize the calls away.
  std::atomic<uint64_t> result {};
  std::thread* threads = new std::thread[numThreads];

  auto start = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < numThreads; i++)
    threads[i] = std::thread([&]() {
      uint64_t localResult = 0;
      for (size_t j = 0; j < numIterations; j++)
        localResult += func();
      result.fetch_add(localResult);
    });

  for (size_t i = 0; i < numThreads; i++)
    threads[i].join();

  auto end = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double> elapsed = end - start;

  delete[] threads;

  printf("  %-13s: %0.3g s\n", name, elapsed.count());
}

int main(int argc, char** argv) {
  for (uint32_t numThreads = 1; numThreads <= 32; numThreads <<= 1) {
    printf("Thread Count %u\n", numThreads);
    testFunction(generateIdGlobalAtomic, "GlobalAtomic", numThreads);
    testFunction(generateIdGlobalMutex , "GlobalMutex" , numThreads);
    testFunction(generateIdLocalAtomic , "LocalAtomic" , numThreads);
    testFunction(generateIdLocalMutex  , "LocalMutex"  , numThreads);
  }
  return 0;
}

Results on an AMD Ryzen 9 7950X (Linux):

Approach         1 Thread  2 Threads  4 Threads  8 Threads  16 Threads  32 Threads
Global - Atomic  0.024s    0.142s     0.508s     1.330s     2.180s      7.640s
Global - Mutex   0.089s    0.569s     1.660s     3.080s     6.350s      13.60s
Local - Atomic   0.019s    0.021s     0.029s     0.029s     0.034s      0.049s
Local - Mutex    0.019s    0.021s     0.029s     0.030s     0.037s      0.056s

Conclusion

The results should not be surprising. A lone atomic operation is faster than a synchronization primitive guarding a shared resource, because synchronization primitives are themselves built on atomics. The benchmark shows clearly that even without contention, acquiring and releasing a mutex costs at least two atomic operations plus the overhead of calling a library function. Accessing thread-local storage is much cheaper, and in our case it clearly pays off.
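
As an illustration of where those two atomic operations come from, here is a minimal spinlock sketch. Real mutex implementations are more involved and typically cooperate with the kernel when contended, but the uncontended fast path is similar: one atomic exchange to acquire and one atomic store to release:

#include <atomic>

// Minimal illustrative spinlock: an uncontended lock/unlock pair
// costs one atomic exchange (acquire) plus one atomic store (release).
class SpinLock {
  std::atomic<bool> locked {false};
public:
  void lock() {
    // Spin until the exchange observes `false` (the lock was free).
    while (locked.exchange(true, std::memory_order_acquire))
      ;
  }
  void unlock() {
    locked.store(false, std::memory_order_release);
  }
};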

I would like to note that this is a micro-benchmark that stresses access to a single shared resource. Such high contention is unlikely in common situations, and the speedup offered by the thread-local approach may be negligible in many real-world scenarios.

Additionally, this article implements a 64-bit counter, which would take around 136 years to overflow even if the process generated 4 billion unique IDs per second, which is unlikely. However, if a wider counter is required, it's possible to use the thread-local approach with a mutex and add slightly more complex logic to retrieve the cached range.
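
A minimal sketch of that idea, assuming 128-bit IDs represented as a pair of 64-bit words (the Id128 type and the carry handling below are illustrative additions, not code from the article):

#include <stdint.h>
#include <mutex>

// Hypothetical 128-bit ID as a pair of 64-bit words.
struct Id128 { uint64_t hi, lo; };

static std::mutex globalMutex128;

Id128 generateId128LocalMutex() {
  static uint64_t globalHi = 0;
  static uint64_t globalLo = 0;

  static constexpr uint32_t cacheSize = 4096;
  static thread_local Id128 localId {0, 0};

  // Reserve the next block of `cacheSize` IDs under the lock,
  // propagating the carry into the high word on 64-bit wrap-around.
  if ((localId.lo & (cacheSize - 1)) == 0) {
    std::lock_guard<std::mutex> guard(globalMutex128);
    localId = Id128{globalHi, globalLo};
    globalLo += cacheSize;
    if (globalLo < cacheSize) // low word wrapped
      globalHi++;
  }

  // 128-bit increment with carry.
  if (++localId.lo == 0)
    localId.hi++;
  return localId;
}

As with the 64-bit generator, the resulting IDs are unique but not globally ordered, since each thread consumes its reserved block independently.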