Intel continues to improve the performance and functionality of its microprocessors. One such feature is Hyper-Threading (HT) Technology, which provides more than one logical processor so that separate threads can execute concurrently. HT Technology is itself evolving toward dual processor cores in a single physical package, which gives an even greater performance boost than the original HT Technology.
This paper discusses how to detect HT Technology, the number of processor cores in a single package, and the number of logical processors per processor core and per physical package.
HT Technology
This technology allows the computer to execute two or more threads (depending on the specific processor) in parallel. While it does not necessarily double the performance of applications that use multiple threads, it can significantly increase their performance. Because a single thread usually keeps only about 30% of a processor's resources busy, HT Technology presents two logical processors on a single physical processor so that much of the remaining 70% of the CPU resources can be put to use.
For more information on HT Technology please visit Intel's website at:

Figure 1. Two CPUs in one physical package
Dual Core Technology
"Dual core" refers to two microprocessors (CPUs) in one physical package (i.e., a single chip). It is the next step for HT Technology beyond multiple logical processors on a single-core package. The CPUs share the same packaging and the same bus interface to the chipset and memory. They operate as distinct CPUs, except that certain products may share the higher-level cache. Keep in mind that Intel's microprocessors are constantly evolving, so future processors may behave differently than described here.
Usage Models
There are several reasons for detecting the presence of HT Technology and Dual Core Technology. The first is to detect more than one processor so that an application can use multiple threads for performance. The second is to detect each processor's Advanced Programmable Interrupt Controller (APIC) ID in order to manage thread-to-processor affinity. The third is to determine the cache partition size available to each thread, so that multiple threads do not thrash the caches.
Multi-Threading
While there are several reasons for threading an application, this section deals with threading for increased performance. HT Technology adds performance by providing more than one logical processor, so that an application can divide its workload among the available processors and thereby decrease the time the calculations take.
Managing Threads
Managing threads can benefit an application on HT Technology even though the operating system is responsible for scheduling threads on the available processors. Because the logical processors in HT Technology share resources, some algorithms may contend with each other if they rely heavily on the same execution units (e.g., two Streaming SIMD Extensions 2 (SSE2) intensive algorithms that make little use of the other instruction sets). In this case the performance gain from threading would be lower than expected, and it would be beneficial to tell the operating system which threads to run on which processor in order to maximize CPU resource usage.
The following shows how you might want to assign threads to a processor to decrease any potential resource contention:
SSE2IntensiveFunction1(); // Assign to logical processor 0
SSE2IntensiveFunction2(); // Assign to logical processor 0
IA32IntensiveFunction1(); // Assign to logical processor 1
IA32IntensiveFunction2(); // Assign to logical processor 1
Once the two SSE2 intensive functions are assigned to the same logical processor, they no longer compete with each other for resources; they execute one after another as scheduled by the OS. The IA32 intensive functions behave the same way on their own logical processor. The SSE2 and IA32 functions now execute simultaneously, and because they do not use the same execution units, the HT Technology capable processor can maximize the resource usage of both threads.
Future processors may have different combinations of physical and logical processors per package. This makes it necessary to map the processors reported by the OS to the physical and logical processors in a CPU. Determining this mapping requires obtaining the APIC ID of each processor, which means both issuing the CPUID instruction on that processor and querying the OS through its API functions.
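The grouping described above can be written down as affinity masks, with one bit per logical processor. The following is only a sketch: `task_plan`, `affinity_mask_for`, `example_plan`, and `plan_mask` are hypothetical names, the plan assumes both IA32 intensive functions are grouped on logical processor 1, and the mask would still have to be passed to an OS call (for example `SetThreadAffinityMask` on Windows) to have any effect.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one unit of work and the logical
   processor we would like the OS to run it on. */
struct task_plan {
    const char *name;
    unsigned    logical_processor;
};

/* Build a one-bit affinity mask for a logical processor index (0..63). */
uint64_t affinity_mask_for(unsigned logical_processor)
{
    assert(logical_processor < 64);
    return (uint64_t)1 << logical_processor;
}

/* Assumed plan: SSE2-heavy work shares logical processor 0,
   IA32-heavy work shares logical processor 1. */
const struct task_plan example_plan[4] = {
    { "SSE2IntensiveFunction1", 0 },
    { "SSE2IntensiveFunction2", 0 },
    { "IA32IntensiveFunction1", 1 },
    { "IA32IntensiveFunction2", 1 },
};

/* Combined mask of every logical processor that the plan uses. */
uint64_t plan_mask(const struct task_plan *plan, unsigned count)
{
    uint64_t mask = 0;
    for (unsigned i = 0; i < count; ++i)
        mask |= affinity_mask_for(plan[i].logical_processor);
    return mask;
}
```

Each thread would receive the mask for its own entry, while `plan_mask` gives the set of logical processors the whole plan expects to be present.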
Sharing the Cache
The size of the cache can play an important role in the performance of data-intensive applications. The cache in a processor holds the most recently used data so that it does not have to be fetched from system memory, which is much slower than the cache; this reduces processing time. Data-intensive applications are programs that work on more data than can fit inside the processor's cache and therefore cause frequent fetches from memory.
Fortunately these memory fetches can be reduced by only operating on the amount of data that will fit in the cache. Once all the calculations on the data are done the algorithm can then move onto the next batch of data that fits in the cache.
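As a rough illustration of working on one cache-sized batch at a time, the sketch below runs two passes over each block before moving to the next one. The function name `scale_and_sum_blocked` and the two passes (scale, then accumulate) are made up for this example; `cache_bytes` is assumed to be a cache size the caller has already determined.

```c
#include <assert.h>
#include <stddef.h>

/* Run two passes (scale, then accumulate) over `data`, one cache-sized
   block at a time, so that the second pass over a block can find the
   block's data still in the cache. */
double scale_and_sum_blocked(double *data, size_t count,
                             double factor, size_t cache_bytes)
{
    size_t block = cache_bytes / sizeof(double);
    double total = 0.0;

    assert(data != NULL || count == 0);
    if (block == 0)
        block = 1;                      /* always make progress */

    for (size_t start = 0; start < count; start += block) {
        size_t end = (count - start < block) ? count : start + block;

        for (size_t i = start; i < end; ++i)    /* pass 1: scale */
            data[i] *= factor;
        for (size_t i = start; i < end; ++i)    /* pass 2: accumulate */
            total += data[i];
    }
    return total;
}
```

When a cache is shared between logical processors, a caller might pass only its share of the cache (for example, half of it for two threads) rather than the full size.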
In the case of HT Technology, several levels of the cache may be shared between the logical processors of a physical processor. The CPUID instruction can be used to determine the cache sizes and shared cache levels for the logical processors.
Detection and the CPUID Instruction
Detecting HT Technology requires the use of the CPUID instruction. This instruction returns information about the CPU, such as which processor it is and which features it supports. For our purposes we will be using the CPUID functions that tell us:
- Support for HT Technology within a processor
- The number of logical processors in a package
- The APIC ID of a processor
- The cache information of a processor
The CPUID instruction is not visible from high-level languages such as C/C++, C#, or Java. Calling CPUID from these languages requires either support for inline assembly or the ability to call an assembly language routine.
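One way to reach CPUID from C is a compiler-provided wrapper around the instruction. The sketch below assumes GCC or Clang on an x86 target, where the `<cpuid.h>` header supplies `__get_cpuid_max` and the `__cpuid_count` macro; on any other compiler or architecture it simply reports failure. The helper name `read_cpuid` and the `SKETCH_HAS_CPUID` macro are hypothetical, and the helper only handles basic (non-extended) input values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define SKETCH_HAS_CPUID 1
#else
#define SKETCH_HAS_CPUID 0
#endif

/* Hypothetical helper: issue CPUID with EAX = leaf and ECX = subleaf.
   On success returns 1 and stores EAX, EBX, ECX, EDX in regs[0..3].
   Returns 0 (and zeroes regs) if CPUID or the leaf is unavailable. */
int read_cpuid(uint32_t leaf, uint32_t subleaf, uint32_t regs[4])
{
    assert(regs != NULL);
    regs[0] = regs[1] = regs[2] = regs[3] = 0;
#if SKETCH_HAS_CPUID
    {
        unsigned int a, b, c, d;
        unsigned int max_leaf = __get_cpuid_max(0, 0);

        if (max_leaf == 0 || max_leaf < leaf)
            return 0;               /* leaf not supported */
        __cpuid_count(leaf, subleaf, a, b, c, d);
        regs[0] = a;
        regs[1] = b;
        regs[2] = c;
        regs[3] = d;
        return 1;
    }
#else
    (void)leaf;
    (void)subleaf;
    return 0;                       /* no CPUID wrapper on this target */
#endif
}
```

Other toolchains expose the instruction differently, so this wrapper should be read as one possible route rather than the only one.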
For more information on the CPUID instruction please refer to the application note entitled "Intel Processor Identification and the CPUID Instruction" (Order Number: 241618-023) located on Intel's website at developer.intel.com.
Detecting an Intel Processor
Detecting the presence of an Intel processor requires calling CPUID with an input value of zero (CPUID.0). This is done by placing a value of zero into the EAX register and then issuing the CPUID instruction. The following assembly code shows how to do this:
mov eax, 0
cpuid
The results are stored in the registers EAX, EBX, ECX, and EDX. The EAX register holds the highest input value that CPUID accepts. We need this value to determine whether we can perform the later steps, which require CPUID input values up to 4. The EBX, EDX, and ECX registers, read in that order, hold the identifier string for an Intel processor, which is 'GenuineIntel'.
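The vendor check can be sketched in C as follows. The register values are assumed to have been captured from CPUID.0 already; `cpuid0_vendor` and `is_genuine_intel` are hypothetical helper names. The 12 characters are assembled from EBX, then EDX, then ECX, lowest byte first.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assemble the 12-character vendor string from CPUID.0 output.
   The bytes are taken in the order EBX, EDX, ECX, lowest byte first. */
void cpuid0_vendor(uint32_t ebx, uint32_t ecx, uint32_t edx, char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };

    assert(out != NULL);
    for (int r = 0; r < 3; ++r)
        for (int b = 0; b < 4; ++b)
            out[r * 4 + b] = (char)((regs[r] >> (8 * b)) & 0xFFu);
    out[12] = '\0';
}

/* Returns 1 if the CPUID.0 registers spell "GenuineIntel", else 0. */
int is_genuine_intel(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
    char vendor[13];

    cpuid0_vendor(ebx, ecx, edx, vendor);
    return strcmp(vendor, "GenuineIntel") == 0;
}
```

For an Intel processor the three registers hold EBX = 0x756e6547 ("Genu"), EDX = 0x49656e69 ("ineI"), and ECX = 0x6c65746e ("ntel").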
The following is the output of the CPUID instruction with an input value of zero:

Figure 2. CPUID.0 Output
Detecting HT Technology
Checking for HT Technology support requires the use of the CPUID instruction with an input value of one (CPUID.1). This is done by placing a value of one into the EAX register and then issuing the CPUID instruction. However, before CPUID.1 is issued we need to make sure that it is supported. This is done by checking the highest CPUID input value, which was returned in EAX when we issued the CPUID.0 instruction, to verify that it is greater than or equal to one.
There are two fields of interest to us, located in registers EBX and EDX. The return values in the other two registers (i.e., EAX and ECX) are not needed for this purpose, as they do not pertain to HT Technology or dual core. Bit 28 of EDX signifies whether HT Technology is supported by the processor. Bits 16 to 23 of register EBX identify the number of logical processors in the physical package.
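A minimal sketch of this decoding in C is shown below. It assumes the highest CPUID input value (EAX from CPUID.0) and the EBX and EDX values from CPUID.1 have already been captured; `ht_info` and `decode_ht_info` are hypothetical names. Treating the logical processor count as one when the HT bit is clear is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of the HT-related fields of CPUID.1. */
struct ht_info {
    int      cpuid1_available;     /* highest CPUID input value >= 1 */
    int      ht_supported;         /* EDX bit 28                     */
    unsigned logical_per_package;  /* EBX bits 16 to 23              */
};

void decode_ht_info(uint32_t max_input, uint32_t ebx, uint32_t edx,
                    struct ht_info *out)
{
    assert(out != NULL);
    out->cpuid1_available    = (max_input >= 1u);
    out->ht_supported        = 0;
    out->logical_per_package = 1;

    if (!out->cpuid1_available)
        return;                     /* CPUID.1 must not be trusted */

    out->ht_supported = (int)((edx >> 28) & 1u);
    if (out->ht_supported) {
        unsigned count = (ebx >> 16) & 0xFFu;
        if (count != 0)
            out->logical_per_package = count;
    }
}
```

For example, EDX = 0x10000000 and EBX = 0x00020800 decode to HT Technology supported with two logical processors in the package.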
Please take note that while HT Technology may be supported by a processor it may not be utilized by the OS. The following is the output, pertinent to HT Technology detection, of the CPUID instruction with an input value of one:

Figure 3. CPUID.1 Output (HT Technology Specific)
Detecting Processor Mappings
Detecting the processor mapping for each processor in the system uses the CPUID instruction with an input value of one (CPUID.1) and needs to be done in conjunction with the OS, because the OS controls which processor a thread executes on. Note that this is the same CPUID input used for detecting HT Technology, so there is no need to issue CPUID.1 more than once per processor.
There are two fields of interest to us, both located in register EBX. The return values in the other three registers (i.e., EAX, ECX, and EDX) are not needed for this purpose, as they do not pertain to the APIC ID. Bits 24 to 31 of register EBX specify the local APIC ID that is assigned to the processor when the system is reset. This field contains both the physical and logical IDs of the processor. However, the IDs cannot be read directly, because their positions depend on how many logical processors the physical processor has: the physical ID starts above the bits needed to give every logical processor a unique logical ID. To separate them, we need bits 16 to 23 of register EBX, which identify the number of logical processors in the physical package.
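The split of the APIC ID into physical and logical parts might look like the sketch below. It assumes the EBX value from CPUID.1 has been read on the processor in question and that the logical ID occupies the smallest number of low bits able to number every logical processor; `apic_ids`, `logical_id_bits`, and `split_apic_id` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

struct apic_ids {
    unsigned physical_id;   /* upper part of the APIC ID */
    unsigned logical_id;    /* lower part of the APIC ID */
};

/* Smallest number of bits that can give `count` logical processors
   distinct IDs, i.e. ceil(log2(count)). */
unsigned logical_id_bits(unsigned count)
{
    unsigned bits = 0;

    assert(count >= 1);
    while ((1u << bits) < count)
        ++bits;
    return bits;
}

/* Split the APIC ID (EBX bits 24 to 31) using the logical processor
   count (EBX bits 16 to 23) from CPUID.1. */
struct apic_ids split_apic_id(uint32_t ebx)
{
    unsigned apic_id = (ebx >> 24) & 0xFFu;
    unsigned count   = (ebx >> 16) & 0xFFu;
    unsigned bits;
    struct apic_ids ids;

    if (count == 0)
        count = 1;                  /* treat "not reported" as one */
    bits = logical_id_bits(count);
    ids.logical_id  = apic_id & ((1u << bits) - 1u);
    ids.physical_id = apic_id >> bits;
    return ids;
}
```

With two logical processors per package, an APIC ID of 3 splits into physical ID 1 and logical ID 1; with four, an APIC ID of 6 splits into physical ID 1 and logical ID 2.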
The following is the output, pertinent to the APIC information, of the CPUID instruction with an input value of 1:

Figure 4. CPUID.1 Output (APICID Specific)
Detecting Cache Information
Cache sharing information can be retrieved by executing the CPUID instruction with an input value of 4 (CPUID.4). This is done by placing a value of four into the EAX register and then issuing the CPUID instruction. Since there is more than one cache in a processor, CPUID.4 is an iterative instruction. Iterating through CPUID.4 is done by using the ECX register. A value of zero is placed into ECX before issuing the CPUID instruction and then ECX is incremented for each succeeding CPUID issue until a value of zero is returned in the cache type field.
There are three fields of interest to us located in register EAX, three fields in the EBX register, and the entire ECX register. The return value in the remaining register (i.e., EDX) is not needed for this purpose. Bits 0 to 4 of EAX signify the cache type being reported (i.e., data, instruction, or unified). Bits 5 to 7 of register EAX identify the level of the cache being reported, with level 1 being the cache level closest to the processor (i.e., the one that takes the fewest clocks to get data from). Bits 14 to 25 of register EAX indicate the number of threads sharing this cache. This value is one less than the actual count, so one needs to be added to get the correct number of threads.
The cache size can be calculated from the values in the EBX and ECX registers. The following are the bit fields of register EBX: bits 0 to 11 contain the system coherency line size (L), bits 12 to 21 contain the physical line partitions (P), and bits 22 to 31 contain the ways of associativity (W). The ECX register contains the number of sets (S). The values in all of these fields are one less than the actual value, so one needs to be added to each of them to get the correct value. While the definition of each of these fields is beyond the scope of this paper, we will be using them to calculate the cache size. The cache size is calculated by multiplying all of the fields together (after one has been added to each):
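The EAX fields of CPUID.4 can be decoded as in the sketch below. It works on EAX values that are assumed to have been captured already, one per ECX index; `cache_desc`, `decode_cpuid4_eax`, and `count_caches` are hypothetical names. The type codes used are 1 for data, 2 for instruction, and 3 for unified, with 0 marking the end of the list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum cache_type {
    CACHE_NONE        = 0,  /* no more caches to report */
    CACHE_DATA        = 1,
    CACHE_INSTRUCTION = 2,
    CACHE_UNIFIED     = 3
};

struct cache_desc {
    unsigned type;             /* EAX bits 0 to 4               */
    unsigned level;            /* EAX bits 5 to 7               */
    unsigned threads_sharing;  /* EAX bits 14 to 25, plus one   */
};

void decode_cpuid4_eax(uint32_t eax, struct cache_desc *out)
{
    assert(out != NULL);
    out->type            = eax & 0x1Fu;
    out->level           = (eax >> 5) & 0x7u;
    out->threads_sharing = ((eax >> 14) & 0xFFFu) + 1u;
}

/* Count the caches described by `n` captured EAX values (index i is
   the result for ECX = i), stopping at the first cache type of zero. */
unsigned count_caches(const uint32_t *eax_values, unsigned n)
{
    unsigned count = 0;
    struct cache_desc d;

    assert(eax_values != NULL || n == 0);
    while (count < n) {
        decode_cpuid4_eax(eax_values[count], &d);
        if (d.type == CACHE_NONE)
            break;
        ++count;
    }
    return count;
}
```

For example, an EAX value of 0x4021 describes a level 1 data cache shared by two threads.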
Cache Size = L × W × P × S
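The formula can be applied directly to the raw EBX and ECX values, as in the sketch below. `cache_size_bytes` and `per_thread_cache_bytes` are hypothetical names; dividing the cache evenly among the threads that share it is an assumption made here to estimate a per-thread partition, not something CPUID reports.

```c
#include <assert.h>
#include <stdint.h>

/* Cache Size = L x W x P x S, adding one to each raw CPUID.4 field. */
uint64_t cache_size_bytes(uint32_t ebx, uint32_t ecx)
{
    uint64_t line_size  = (uint64_t)(ebx & 0xFFFu) + 1u;          /* L */
    uint64_t partitions = (uint64_t)((ebx >> 12) & 0x3FFu) + 1u;  /* P */
    uint64_t ways       = (uint64_t)((ebx >> 22) & 0x3FFu) + 1u;  /* W */
    uint64_t sets       = (uint64_t)ecx + 1u;                     /* S */

    return line_size * ways * partitions * sets;
}

/* Assumed even split of a cache among the threads that share it. */
uint64_t per_thread_cache_bytes(uint64_t size_bytes, unsigned threads_sharing)
{
    assert(threads_sharing >= 1);
    return size_bytes / threads_sharing;
}
```

As a worked example, EBX = 0x01C0003F and ECX = 63 give L = 64, P = 1, W = 8, and S = 64, so the cache size is 64 × 8 × 1 × 64 = 32768 bytes; shared by two threads, that leaves 16384 bytes each under the even-split assumption.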
The following is the output, pertinent to cache information, of the CPUID instruction with an input value of 4:

Figure 5. CPUID.4 Output (Cache Information)
Appendix A: Putting it Together with Sample Code
The sample code in this section shows how to issue the CPUID instruction in conjunction with the Windows APIs to get all the information discussed in this paper.
Header file for the code
Source file for the code