A treatise on threads and threading
One way to achieve low-latency computation is to match binary patterns directly. Data packets are routed to a listener that compares the sequence of bits in each packet against pre-defined bit patterns, then decides what to do as a result of a match. The whole process takes place in the L1 cache; in other words, nothing is stored in main memory.
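A minimal sketch of the idea in Java (the mask and pattern values are illustrative): each incoming packet word is ANDed with a mask and compared against a pre-defined pattern, a mask-and-compare that fits in registers and, with a small hot table, stays resident in L1.

```java
public class PatternMatcher {
    // Hypothetical pre-defined bit pattern: match when (packet & mask) == pattern.
    private final long mask;
    private final long pattern;

    public PatternMatcher(long mask, long pattern) {
        this.mask = mask;
        this.pattern = pattern;
    }

    // Mask out the irrelevant bits, then compare against the pattern.
    public boolean matches(long packetWord) {
        return (packetWord & mask) == pattern;
    }

    public static void main(String[] args) {
        // Match any word whose high byte (of the low 16 bits) is 0xAB.
        PatternMatcher m = new PatternMatcher(0xFF00L, 0xAB00L);
        System.out.println(m.matches(0xAB42L)); // true
        System.out.println(m.matches(0xCD42L)); // false
    }
}
```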
This means that a good place to start looking at thread cost and performance is with how CPU registers, execution units, memory buffers, the L1, L2 and L3 (or other future) CPU caches, main memory, and the sockets of clustered CPUs relate to one another. What are the primitives used to manage threads? Take the volatile keyword: its purpose is a synchronised read. It tells the CPU to wait for however many clock cycles the execution core's memory buffers take to empty to the shared L3 cache, so that all CPUs can see the latest copy of the data. It matters when several threads run on several cores; in other words, CPU 1's data is not yet synchronised with CPU 2's data, and so on. This is all about cache coherence.
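A common illustration of the visibility guarantee (the class and field names here are mine): a volatile flag ensures that a writer thread's updates become visible to a reader thread spinning on another core.

```java
public class VolatileFlag {
    // Without volatile, the reader below could spin forever on a stale cached value.
    private volatile boolean ready = false;
    private int payload = 0;

    public void publish(int value) {
        payload = value;   // ordinary write
        ready = true;      // volatile write: the store fence publishes payload too
    }

    public int awaitPayload() {
        while (!ready) {   // volatile read: the load fence sees the latest value
            Thread.onSpinWait();
        }
        return payload;
    }

    public static void main(String[] args) throws InterruptedException {
        VolatileFlag f = new VolatileFlag();
        Thread writer = new Thread(() -> f.publish(42));
        writer.start();
        System.out.println(f.awaitPayload()); // 42
        writer.join();
    }
}
```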
Sticking with volatile for a while, the way that an L3 cache flush actually takes place is through the use of memory fences (or barriers). A store fence waits for all store (write) buffers to empty so the data becomes visible to all CPU cores. A load fence waits for all load (read) buffers to drain. A full memory barrier is a composite of store and load fences, i.e. it ensures all buffers drain before further CPU instructions execute. In Java, volatile data orders a load fence before a read, and a store fence after a write.
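Since Java 9 these fences can also be requested explicitly through the static fence methods on java.lang.invoke.VarHandle. A sketch of release/acquire publication (a teaching sketch only; everyday code should just use volatile or java.util.concurrent, and the spin loop here relies on the fences also acting as compiler barriers):

```java
import java.lang.invoke.VarHandle;

public class FenceDemo {
    static int data = 0;
    static boolean ready = false;

    // Writer: releaseFence() acts like a store fence, so the write to
    // data drains before the write to ready can become visible.
    static void publish() {
        data = 7;
        VarHandle.releaseFence();
        ready = true;
    }

    // Reader: acquireFence() acts like a load fence, so once ready is
    // seen as true, the earlier write to data is visible as well.
    static int consume() {
        while (true) {
            VarHandle.acquireFence();
            if (ready) break;
            Thread.onSpinWait();
        }
        return data;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(FenceDemo::publish);
        writer.start();
        System.out.println(consume()); // 7
        writer.join();
    }
}
```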
In the Java Memory Model, a final field requests a store fence after initialisation.
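A sketch of what that buys you (the class is illustrative): once the constructor finishes, any thread that obtains a reference to the object is guaranteed to see the final field's initialised value, never its default.

```java
public class Config {
    // The JMM "freezes" final fields at the end of the constructor:
    // the store fence means no thread can observe the default value (0)
    // through a reference obtained after construction.
    private final int port;

    public Config(int port) {
        this.port = port;
    }

    public int port() {
        return port;
    }

    public static void main(String[] args) {
        Config c = new Config(8080);
        System.out.println(c.port()); // 8080
    }
}
```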
When a lock is placed on an object, a full barrier is employed around the lock, blocking the memory subsystem across cores and even across CPUs. Thus visibility and program order are preserved. The cost, however, is waiting for each execution core's memory buffers to drain. Maximum performance is obtained by modelling the algorithm so that the necessary memory fences occur only at the boundaries, before and after each unit of work completes. According to Norvig, locking or unlocking a mutex takes a modern CPU approximately 25 ns.
Given that CPU use requires memory fences whatever happens, probably the simplest way to do thread programming is to use only locks on objects, a bit like the days of C coding and database locks. Consider that thread primitives like volatile are architecture dependent and hard to model or debug, making for very difficult software maintenance. To minimise long-term maintenance costs it is easier to just use locks in multi-threaded code.
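A minimal sketch of that lock-only style in Java (the counter is illustrative): every access to shared state goes through the object's intrinsic lock, and the JVM emits the required fences for us.

```java
public class LockedCounter {
    private long count = 0;

    // synchronized places a full barrier at lock acquire and release,
    // so visibility and program order are handled by the JVM.
    public synchronized void increment() {
        count++;
    }

    public synchronized long get() {
        return count;
    }

    public static void main(String[] args) throws InterruptedException {
        LockedCounter c = new LockedCounter();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < 10_000; n++) c.increment();
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        System.out.println(c.get()); // 40000
    }
}
```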
But even then, a lock on an object does not prevent the OS from scheduling other threads to run on the same core. If your estimated wait time is under a millisecond, a spinlock prevents this: the spinning thread never yields the core back to the kernel scheduler, and so avoids a context switch (where all CPU registers are flushed out to the L3 cache, as opposed to a lock touching just one shared memory location). For more investigation, see: Stack Overflow on spinlocks.
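A textbook-style spinlock can be sketched in Java with an AtomicBoolean (a teaching sketch, not production code; real code would reach for java.util.concurrent.locks):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    // Busy-wait until the CAS from false to true succeeds. The thread
    // never yields the core, so this only pays off for very short waits.
    public void lock() {
        while (!locked.compareAndSet(false, true)) {
            Thread.onSpinWait(); // hint to the CPU that we are spinning
        }
    }

    public void unlock() {
        locked.set(false);
    }
}
```

Use it exactly as you would a ReentrantLock, guarding the critical section with lock()/unlock() in a try/finally.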
Considering instruction cycles and latency, a diagram from Mechanical Sympathy gives estimated timings to access CPU registers, the caches, the socket-to-socket interconnect on the motherboard, and main memory, showing that the QPI bus between sockets is faster than the bus to main memory.
Finally, since memory fetch times increase at each step out through the CPU caches to main memory, then, as with the binary pattern matching above, the most efficient use of the CPU is through the L1 cache. Doing this is tricky because "CPU cache is not addressable". Several techniques exist to achieve the same end, including spatial locality (see CPU cache coding) and pre-fetching (see GCC's __builtin_prefetch).
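Spatial locality can be demonstrated from Java too (the sizes are illustrative): traversing a 2D array row by row walks memory sequentially, so every cache line pulled into L1 is fully used, whereas column order touches a new line on almost every access and defeats the hardware pre-fetcher.

```java
public class Traversal {
    static final int N = 1024;

    // Row-major order: consecutive accesses hit the same cache line.
    static long sumRowMajor(int[][] m) {
        long sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];
        return sum;
    }

    // Column-major order: same result, but the strided accesses
    // defeat spatial locality, so this typically runs much slower.
    static long sumColMajor(int[][] m) {
        long sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[i][j];
        return sum;
    }

    public static void main(String[] args) {
        int[][] m = new int[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                m[i][j] = 1;
        System.out.println(sumRowMajor(m)); // 1048576
        System.out.println(sumColMajor(m)); // 1048576
    }
}
```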
References: Malcolm Brown, Mechanical Sympathy
Tell me what you think; please get in touch!