Programming Multi-threaded Architectures: Interlocked Operations
Veteran game programmer Philippe Paquet discusses the advantages and implementation of interlocked operations within a multi-threaded environment, in this exclusive Gamasutra technical feature.
1. Introduction
Interlocked operations provide a simple mechanism for synchronizing access to variables that are shared by multiple threads. Interlocked operations don't supplant classic synchronization functions like critical sections or mutex objects but complement them as they answer a very specific need.
2. What are interlocked operations?
Interlocked operations are a high-performance way of updating integer-sized or pointer-sized variables in an atomic manner.
Interlocked operations are high-performance as they use hardware-based synchronization techniques rather than operating system level mechanisms. Interlocked operations are implemented as intrinsic functions and are even translated in some cases to a single processor instruction.
Interlocked operations are atomic. An operation is atomic if it is indivisible. During an atomic write, you can't have another thread reading the value that you are writing half way through the write. Similarly, during an atomic read, you can't have another thread changing the value that you are reading half way through the read. If the operations were not atomic, there would be a possibility that a thread will read part of an old value and part of a new value from a variable that another thread is writing to. Additionally, the atomicity of interlocked operations is guaranteed across CPUs.
The table below shows the interlocked operations available in the Windows API. Keep in mind that other APIs have more or less operations available.
Windows API function | Description |
---|---|
_InterlockedAnd | Perform a Boolean And operation on the specified variable |
_InterlockedCompareExchange | Conditionally set a variable on the outcome of a comparison with the current value |
_InterlockedDecrement | Decrements the specified variable |
_InterlockedExchange | Set a variable and returns its prior value |
_InterlockedExchangeAdd | Add to a variable and returns its prior value |
_InterlockedIncrement | Increments the specified variable |
_InterlockedOr | Perform a Boolean Or operation on the specified variable |
_InterlockedXor | Perform a Boolean Exclusive Or operation on the specified variable |
Because some of the architectures capable of running Linux don't provide the necessary hardware support, interlocked operations are not available in the user side of the Linux API. They are available only on the kernel side and only on specific architectures.
3. The pros of interlocked operations
Compared to classic lock mechanisms, interlocked operations are the most lightweight and efficient synchronization primitives. The advantages are many.
Interlocked operations are directly supported by the hardware in the form of one or many processor instructions. Compiler support is provided in the form of intrinsic functions. That configuration is very cache friendly. The flow in the instruction cache will not be disturbed by using an interlocked operation while the data cache may take a minimal hit, depending on the platform memory model.
There is no wait state with interlocked operations. When you try to acquire a mutex object or a critical section, your thread may have to wait for them to be released while interlocked operations are performed immediately. That specific property is the base of lockless programming. Lockless programming is a model where synchronization is achieved using non locking primitives. Alternative names are or lock-free, non-blocking and wait-free programming. As lockless programming is notoriously difficult, don't expect to make your application totally lockless. Instead, concentrate your efforts on trying to make lockless some of the algorithms you are using. Good candidates are linked lists [Val95] [Har01] and deque [ST04].
There is no user mode to kernel mode transition as for mutex objects. This is a significant advantage as user-mode to kernel-mode transitions are particularly slow operations on most platforms (about 100 times slower than a normal function call on a Pentium 4 [Ric05]).
It is possible and relatively easy to create complex synchronization primitives using interlocked operations. Below is a rudimentary implementation of a critical section API using _InterlockedCompareExchange and _InterlockedExchange. Keep in mind that the following implementation is only an example. You shouldn't write a replacement for the standard critical section API unless you have a very specific need.
#define CS long
#define CS_FREE 0
#define CS_USED 1
void InitializeCS(CS* pCS)
{
// Set the free flag
*pCS = CS_FREE;
}
void EnterCS(CS* pCS)
{
while (_InterlockedCompareExchange(pCS, CS_USED, CS_FREE)
== CS_USED)
{
// Spin until we see the free flag then acquire
// the critical section by setting the used flag
}
}
void LeaveCS(CS* pCS)
{
// Set the free flag
_InterlockedExchange(pCS, CS_FREE);
}
4. The cons of interlocked operations
While interlocked operations are great when you want performance, they are limited in functionality. Boolean operations or bit manipulation are usually not available. However, if you are ready to slightly compromise the non-locking aspect of interlocked operations, it is usually possible to build the missing operations from already available operations. Below is an example of an _InterlockedAnd based on the very useful _InterlockedCompareExchange. That example can easily be transformed in a variety of Boolean or bit manipulation interlocked operations by replacing the calculation of the desired value.
void _InterlockedAnd( int * piDestination, int iAnd)
{
int iOriginalValue
int iCurrentValue;
// Get the destination and don't access it
// anymore except through the interlocked operation
iCurrentValue= *piDestination;
do
{
iOriginalValue = iCurrentValue;
// Calculate the desired value
int iDesiredValue = iOriginalValue & iAnd;
// Try to perform an exchange
iCurrentValue = _InterlockedCompareExchange(piDestination,
iDesiredValue, iOriginalValue);
// The value prior to potential exchange is returned
// If the value has changed, repeat the operation
}
while (iOriginalValue != iCurrentValue);
}
Because of their limitation to integer-sized or pointer-sized variables, there are cases where you won't be able to avoid critical sections or mutex objects. When you have to update several members of a data structure, you will have to use in most case a more complex synchronization mechanism.
Hardware support is required for interlocked operations. Most of the modern processors implement the required support but that is not a given, especially in the embedded and mobile market segment. In that segment, adding support for such complex operations would push both the cost and the power consumption beyond what is currently acceptable.
The processor instructions used to implement interlocked operations are highly complex instructions. Not only the processor architecture has a performance impact, bus design and cache structure have significant performance implications too. Depending on the conditions, hundreds to thousands of cycles will be necessary to complete an interlocked operation. Remember, the fastest synchronization primitive is the one that you are not using.
A last limitation of interlock operations lies with inter-process synchronization. As processes have separate memory spaces, inter-process synchronization can only be achieved with mutex objects.
5. Debugging tips
5.1 Don't mix synchronization mechanisms
Mixing interlocked operations with any other synchronization mechanism is a fast track to synchronization bugs. Synchronization mechanisms have not been designed to be mixed. If you are using a critical section to synchronize a data structure, don't use an interlocked operation to access part of your data structure.
5.2 Don't confuse interlocked operations with volatile variables.
Confusing interlocked operations with volatile variables can lead to hard to track bugs. The volatile keyword only ensures that variables are not kept in registers. It doesn't guarantee atomic interactions across threads and the C++ standard even allows compliant compilers to completely ignore the volatile keyword. Additionally, if you access a volatile variable through another variable (a pointer for example) that is not qualified as well as volatile, the program behavior is undefined.
5.3 Beware of memory consistency
Depending of the hardware and the compiler, interlocked operations may or may not guarantee the memory consistency. If no guarantee is given, both the compiler and the hardware can reorder instructions around an interlocked operation and strict program order will have to be explicitly enforced.
If your program writes to memory and then uses what has been written in an interlocked operation, you need to commit outstanding memory accesses before performing the interlocked operation. This prevents both the hardware and the compiler from reordering accesses around the interlocked operation, causing a synchronization bug. This process is called a release memory barrier or release memory fence.
If your program uses an interlocked operation to read a value and subsequently access it thru memory, you need to commit outstanding memory accesses after performing the interlocked operation. Again, this prevents both the hardware and the compiler from reordering accesses around the interlocked operation, causing a synchronization bug. This process is called an acquire memory barrier or acquire memory fence.
The table below shows the memory barrier functions available in the Windows API.
Windows API function | Description |
---|---|
_ReadBarrier | Forces memory reads to complete |
_WriteBarrier | Forces memory writes to complete |
_ReadWriteBarrier | Block the optimization of reads and writes to global memory |
You should note that the x86 implementation of the Windows API provides a full fence / full memory barrier on interlocked operations.
6. References
[Ric05] Jeffrey Richter. Concurrent Affairs - Performance-Conscious Thread Synchronization. In MSDN Magazine, October 2005.
[Val95] John D. Valois. Lock-Free Linked Lists Using Compare-and-Swap. In Proceedings of the 14th ACM Symposium on Principles of Distributed Computing, August 1995.
[Har01] Timothy L. Harris. A Pragmatic Implementation of Non-Blocking Linked Lists. In Proceedings of the 15th International Symposium of Distributed Computing, October 2001.
[ST04] Håkan Sundell, Philippas Tsigas.Lock-Free and Practical Deques using Single-Word Compare-And-Swap. Technical Report no. 2004-02.
Read more about:
FeaturesAbout the Author
You May Also Like