The best way to design, implement, and tune for Hyper-Threading Technology (HT Technology) enabled processors is to start with components or libraries that are thread-safe and designed for use with this technology. The operating system and threading libraries are likely to already be optimized for various processors. Use operating system and/or threading synchronization libraries instead of implementing application-specific mechanisms such as spin-waits. Existing applications can take advantage of enhanced code modules by re-linking or through the use of dynamic link libraries.
The Intel® compiler enables threading by supporting both OpenMP* and auto-parallelization. OpenMP* is an industry standard for portable threaded application development, and is effective for threading both loop-level and function-level parallelism.
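For example, a loop whose iterations are independent can be threaded with a single OpenMP* pragma. The following minimal sketch uses illustrative arrays a, b, and c of assumed size N; the exact option that enables OpenMP* (for example, /Qopenmp on Windows*) depends on the compiler version.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static float a[N], b[N], c[N];
        int i;

        /* Each iteration is independent, so OpenMP* can divide the
           loop among the available threads. */
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[0] = %f\n", c[0]);
        return 0;
    }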
See Parallelism Overview for more information about OpenMP* support in the Intel® compiler.
Processors equipped with HT Technology have multiple logical CPUs per physical package. The state information necessary to support each logical processor is replicated while sharing and/or partitioning the underlying physical processor resources. Multiple threads running in parallel can achieve higher processor utilization and increased throughput.
When multiple applications are running on a system, HT Technology helps reduce stalls and task-switching delays caused by the interaction of two or more independent programs. With HT Technology enabled, the multitasking capability of the system is much improved; a background task is less likely to be pre-empted by other programs, and multitasking allows continued PC use for other activities.
HT Technology does not guarantee your application will run faster. To benefit from HT Technology, applications must have executable sections that can run in parallel. Threading improves the granularity of an application so that operations can be broken up into smaller units whose execution is scheduled and controlled by the operating system. Two threads can run independently of each other without requiring task switches to get at the resources of the processor.
Multitasking occurs at the user interface level every time a user runs multiple programs simultaneously. Some applications also perform multitasking internally by creating multiple processes. Each new process is given a time-slice during which it executes. Process creation involves creating an address space, the application's image in memory, which includes a code section, a data section, and a stack. Parallel programming using processes requires the creation of two or more processes and an inter-process communication mechanism to coordinate the parallel work.
Threads are tasks that run independently of one another within the context of a process. A thread shares code and data with the parent process but has its own unique stack and architectural state that includes an instruction pointer. Threads require fewer system resources than processes. Intra-process communication is significantly cheaper in CPU cycles than inter-process communication.
The life cycle of a thread begins when the application assigns a thread pool and creates a thread from the pool. When invoked, the thread is scheduled by the Windows* operating system according to a round-robin mechanism: the next available thread with the highest priority runs.
When the thread is scheduled, the operating system checks to see which virtual processors are available, then allocates resources needed to execute the thread. Each time a thread is dispatched, resources are replicated, divided, or shared to execute the additional threads. When a thread finishes, the operating system idles the unused processor, freeing the resources associated with that thread.
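The following minimal sketch illustrates this life cycle with the Win32* API: a worker thread is created, runs independently of the creating thread while sharing its code and data, and is waited on and released when it finishes. The Worker routine and its argument are placeholders.

    #include <windows.h>
    #include <stdio.h>

    /* Worker routine: runs independently of the creating thread but
       shares the process's code and data. */
    static DWORD WINAPI Worker(LPVOID arg)
    {
        int id = (int)(INT_PTR)arg;
        printf("worker %d running\n", id);
        return 0;
    }

    int main(void)
    {
        /* Create the thread; the OS allocates its stack and its own
           architectural state (instruction pointer, registers). */
        HANDLE h = CreateThread(NULL, 0, Worker, (LPVOID)(INT_PTR)1, 0, NULL);
        if (h == NULL)
            return 1;

        /* Block until the worker finishes, then release its handle. */
        WaitForSingleObject(h, INFINITE);
        CloseHandle(h);
        return 0;
    }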
In a processor with HT Technology, the architectural state is the only resource that is replicated. All other resources are either shared or partitioned between logical processors. This behavior introduces the issue of resource contention, which can degrade performance, or in the extreme case, cause an application to fail. Synchronization between threads is another area where problems can arise.
This section provides brief discussions of some of the most common issues in multi-threaded software design.
Synchronization is used in threaded programs to prevent race conditions (for example, multiple threads simultaneously updating the same global variable). A spin-wait loop is a common technique used to wait for the availability of a variable or I/O resource.
Consider the case of a master thread that needs to determine when a disk write has completed. The master thread and the disk-write thread share a synchronization variable in memory. When the disk-write thread writes this variable while the master thread is spinning on it, the write can cause a memory order violation that forces a performance penalty. Inserting a PAUSE instruction in the master thread's read loop can greatly reduce memory order violations.
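A minimal sketch of such a loop, using the _mm_pause intrinsic to issue the PAUSE instruction (the flag variable name is illustrative):

    #include <emmintrin.h>   /* _mm_pause */

    /* Flag set by the disk-write thread when the write completes. */
    volatile int write_done = 0;

    void wait_for_write(void)
    {
        /* Spin until the flag is set; PAUSE hints to the processor
           that this is a spin-wait loop and reduces the chance of a
           memory order violation when the flag is finally written. */
        while (write_done == 0)
            _mm_pause();
    }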
Spin-wait loops consume execution resources while they are cycling. One solution, when other tasks are waiting to run, is to have the thread performing the spin lock insert a call to Sleep(0), which releases the CPU. If no tasks are waiting, this thread immediately continues execution. Another alternative to long spin-wait loops is to replace the loop with a thread-blocking API, such as WaitForMultipleObjects. Using this system call ensures that the thread will not consume resources until all of the listed objects are signaled as ready and have been acquired by the thread.
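A minimal sketch of replacing a long spin-wait with a blocking wait on event objects; the worker threads that would signal the events are omitted from this sketch.

    #include <windows.h>

    int main(void)
    {
        /* Two auto-reset events, initially unsignaled; each worker
           thread would call SetEvent on its event when it finishes. */
        HANDLE done[2];
        done[0] = CreateEvent(NULL, FALSE, FALSE, NULL);
        done[1] = CreateEvent(NULL, FALSE, FALSE, NULL);

        /* ... create the worker threads and hand each one its event ... */

        /* The calling thread consumes no execution resources until
           both events have been signaled. */
        WaitForMultipleObjects(2, done, TRUE, INFINITE);

        CloseHandle(done[0]);
        CloseHandle(done[1]);
        return 0;
    }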
The first-level data cache (L1) is a shared resource on HT Technology processors. Cache lines are mapped on 64 KB boundaries, so if two virtual memory addresses are a multiple of 64 KB apart, they will conflict for the same L1 cache line. Under Windows*, thread stacks are created on megabyte boundaries, and 64K aliasing can occur when these threads access local variables on their stacks. A simple solution is to offset the starting stack address of each thread by a variable amount using the _alloca function.
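A minimal sketch of this technique, using an offset derived from an illustrative per-thread ID (the offset calculation itself is an assumption; any scheme that gives each thread a different adjustment works):

    #include <malloc.h>     /* _alloca */
    #include <windows.h>

    static DWORD WINAPI Worker(LPVOID arg)
    {
        int id = (int)(INT_PTR)arg;

        /* Offset this thread's stack by a different amount per thread
           so that local variables in different threads do not land
           exactly a multiple of 64 KB apart. */
        volatile char *pad = (volatile char *)_alloca((id + 1) * 1024);
        pad[0] = 0;   /* touch the allocation so it is not optimized away */

        /* ... thread work using stack locals goes here ... */
        return 0;
    }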
A cache line in the Pentium® 4 processor consists of 64 bytes for write operations, and 128 bytes for reads. False sharing occurs when two threads access different data elements in the same cache line. When one of those threads performs a write operation, the cache line is invalidated, causing the second thread to have to fetch the cache line (128 bytes) again from memory. If this occurs frequently, false sharing can seriously degrade the performance of an application.
You can diagnose false sharing using the Intel VTune™ Analyzer to monitor the memory order machine-clear events caused by the other thread. Techniques to avoid false sharing include partitioning data structures, creating a local copy of the data structure for each thread, or padding data structures so they are twice the size of a read cache line.
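A minimal sketch of the padding technique, with each per-thread counter padded and aligned to a full 128-byte read line so two threads never write the same line (the structure and counter names are illustrative):

    #include <windows.h>

    /* Each thread increments its own counter. Without padding, the two
       counters could fall in the same cache line, and every write by
       one thread would invalidate the line used by the other. */
    struct __declspec(align(128)) PaddedCounter
    {
        volatile LONG count;
        char pad[128 - sizeof(LONG)];   /* fill out one 128-byte read line */
    };

    static struct PaddedCounter counters[2];   /* one element per thread */

    void increment(int thread_id)
    {
        InterlockedIncrement(&counters[thread_id].count);
    }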
The Pentium® 4 processor with Intel NetBurst® microarchitecture supports six (6) Write Combine (WC) store buffers, each buffering one cache line. The WC store buffers allow code execution to proceed by combining multiple write operations before they are written back to memory through the L1 or L2 caches. If an application is writing to more than 4 cache lines at about the same time, the WC store buffers will begin to be flushed to the second-level cache. This is done to help ensure that a WC store buffer is ready to combine data for writes to a new cache line.
To take advantage of the WC buffers, an application should write to no more than 4 distinct addresses or arrays inside an inner loop. On HT Technology enabled processors, the WC store buffers are a shared resource; therefore, you must consider the total number of simultaneous writes by both threads running on the two logical processors. If data is being written inside of a loop, it is best to split inner loop code into multiple inner loops, each of which writes no more than two regions of memory.
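A minimal sketch of splitting a loop that writes four arrays into two loops that each write two (the array names and loop bound are illustrative):

    #define N 1024

    void fill_split(float *a, float *b, float *c, float *d,
                    const float *src)
    {
        int i;

        /* Write only two distinct regions per inner loop so that each
           thread stays within its share of the WC store buffers. */
        for (i = 0; i < N; i++)
        {
            a[i] = src[i];
            b[i] = src[i] * 2.0f;
        }
        for (i = 0; i < N; i++)
        {
            c[i] = src[i] * 3.0f;
            d[i] = src[i] * 4.0f;
        }
    }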
In some applications there are background activities that run continuously, but have little impact on the responsiveness of the system. In these cases, consider adjusting the task or thread priority downward so that this code only runs when resources become available from higher priority tasks.
Conversely, if an application requires real-time response, it can increase task priority so that it runs ahead of other normal-priority tasks. Use this technique cautiously; it can degrade the responsiveness of the user interface and may affect the performance of other applications running on the system.
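A minimal sketch of both adjustments using the Win32* SetThreadPriority call; the thread handles are assumed to come from CreateThread.

    #include <windows.h>

    /* Lower the priority of a continuously running background thread
       so it runs only when higher-priority work has released the
       processor. */
    void make_background(HANDLE worker)
    {
        SetThreadPriority(worker, THREAD_PRIORITY_BELOW_NORMAL);
    }

    /* Raise priority for a thread that needs near-real-time response;
       use sparingly, since it can starve the user interface and other
       applications. */
    void make_time_critical(HANDLE worker)
    {
        SetThreadPriority(worker, THREAD_PRIORITY_HIGHEST);
    }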
On a multi-processor or HT Technology enabled system, load balancing is normally handled by the operating system, which allocates workload to the next available resource. In some cases one virtual CPU becomes idle while the other is overloaded. In these cases, you can address the load imbalance by setting processor affinity.
Processor affinity allows a thread to specify exactly which processor (or processors) the operating system may select when it schedules the thread for execution. When an application specifies the processor affinity for all of its active threads, it ensures that load imbalance will not occur among its threads and eliminates thread migration from one virtual processor to another.
With HT Technology, there are several arithmetic-logic units (ALUs) used for integer logic, but only one shared floating-point unit. If your application uses floating-point calculations, it may be beneficial to isolate those threads and set their processor affinity to minimize contention for processor resources.
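A minimal sketch using the Win32* SetThreadAffinityMask call; the mask values assume a system exposing two logical processors, and the mapping of mask bits to physical packages (and hence to shared floating-point units) varies by system topology.

    #include <windows.h>

    /* Bind each worker to one logical processor so the operating
       system cannot migrate it; choose the masks so that threads
       contending for the same shared resource are kept apart. */
    void pin_workers(HANDLE worker0, HANDLE worker1)
    {
        SetThreadAffinityMask(worker0, 0x1);   /* logical processor 0 */
        SetThreadAffinityMask(worker1, 0x2);   /* logical processor 1 */
    }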
Relying on execution timing between threads as a synchronization technique is unreliable because of speed differences between host systems. Delay loops, which are sometimes used during initialization, should be avoided for the same reason.
Most of the issues and rules discussed above apply to multitasking; however, there are a few additional considerations. Task switches are much slower than thread context switches because each task operates in its own address space: the state of the previously running task must be saved, and data residing in the cache is invalidated and must be reloaded.
HT Technology enhances multitasking because the state information for each task is stored on a separate virtual processor. Cache invalidation will still occur, but the need for a task switch is eliminated since both tasks can run at once. Since cache is a shared resource on HT Technology enabled processors, all of the above rules regarding data alignment still apply.
Contention for resources can be a problem when multitasking. Resource contention can occur in memory, on the system busses, or on I/O devices. Consider the case of video capture while creating an MP3 file. Both applications use the hard disk intensively, but video capture has to occur in real-time. The result of contention is that the video drops frames, and the MP3 file skips.
Applications should check the status of an I/O device before attempting to pass data to it. If necessary, lock peripherals to avoid access by other applications. Locking makes sense for a CD or DVD writer, which is essentially a single-use device. Locking the hard drive is not recommended, since it is a critical OS resource.
Task and thread priority can have a dramatic effect in a multitasking environment. If priority is raised in a task that runs continuously, other tasks will not have sufficient resources until the high-priority task releases the processor. Lowering priority on such a task may be the best choice. Consider the case of a video encoder that would normally consume all available processor time. If you lower its priority, the user will be able to use the computer on demand, and the video encoder will still run whenever the CPU is otherwise available.
Load balancing within applications can actually degrade multi-tasking performance. If one application behaves as if it has full control of both processors, resource contention might occur when a second application attempts to load. This highlights a fundamental issue with multitasking programs; the application cannot know what other applications are running concurrently. Do not lock resources that other programs might need to function.