The Theory of Constraints (TOC)[1] is a philosophy of continuous improvement which is defined as a framework that can be applicable to many disciplines. The theory is based on the idea that in any complex system, there is usually one aspect of that system that limits its ability to achieve its goal or optimal functioning. In order to achieve any significant improvement of the system, the constraint must be identified and resolved.
Software optimization, in the generic sense, is the process of improving the software by eliminating bottlenecks so that it operates more efficiently on a given piece of hardware and uses resources optimally. In keeping with Knuth’s advice, TOC can help developers to identify the bottlenecks of their code for optimization that have the most return on investment (ROI). Identifying the bottlenecks in the target application and eliminating them appropriately is the key to efficient optimization.
There are many optimization methodologies, which help developers to answer the questions of why to optimize, what to optimize and how to optimize, and these methods aid developers to reach desired performance requirements. These questions are not any different than the questions asked to improve an organization, streamline the supply chain, or increase manufacturing output.
This paper proposes that TOC, which is a philosophy of management and improvement, can be also used and applied for software optimization. The focus of this paper is how to identify the bottlenecks and eliminate them with the help of the Intel® VTune™ Performance Analyzer[2] while following the framework that is described by TOC.
Introduction to Software Optimization: Hotspots and more?
Performance tuning usually focuses on reducing the time it takes to complete a well-defined workload. Performance events can be used to measure the elapsed time and therefore reducing the elapsed time of completing a workload is equivalent to reducing measured processor cycles (clockticks).
Intel® VTune™ Performance Analyzer is one of the most powerful tools available for software developers who are interested in performance analysis. The easiest way to identify the hotspots in a given application is to sample the application with processor cycles with VTune™ analyzer helping the developer identify where most of the CPU cycles are spent, in addition to many other processor events1, by utilizing two profiling techniques; sampling and call graph. The sampling can be in two forms: processor event based and time based sampling. Event based sampling (EBS) relies on the performance monitoring unit (PMU) supported by the processors. From this point forward event based sampling (EBS) will be referred to as sampling.
When sampling, the VTune™ analyzer by default uses processor cycles and instructions retired2 to analyze the application. The count of cycles, also known as clockticks, forms the fundamental basis for measuring how long a program takes to execute. The total cycle measurement is the start to finish view of total number of cycles to complete the application of interest. In typical performance tuning situations, the metric Total cycles can be measured by the event CPU_CLK_UNHALTED.CORE.
The other event used by default sampling activity is instructions retired. This event indicates the number of instructions that retired or executed completely. This does not include partially processed instructions executed due to branch mis-predictions. The ratio of clockticks (non-halted) and instructions retired is called clocks per instruction retired (CPI) and it can be good indicator of performance problems (indicator of the efficiency of instructions generated by the compiler and/or low CPU utilization)3.
The approach used in this paper is explained in greater details in the Intel® 64 and IA-32 Intel® Architecture Optimization Reference Manual. For more complete analysis please refer to the optimization manual.
In this approach, it is accurate to say that the total number of cycles an application takes is the sum of the cycles issuing μops4 and the cycles not issuing μops (stalls). This formula is given with the processor event names in Formula 2.
Total Cycles = Cycles issuing μops + Cycles not issuing μops
Formula 1
It is worthwhile to note here that this decomposition is approximate in its nature so it should be taken as estimation.
Therefore, our focus will be on three basic concepts:
- minimizing “stalls” by reducing or optimizing long latency operations such as memory accesses, floating point operations, etc.,
- minimizing “non-retired”5 instructions by reducing the branch mis-predictions,
- minimizing “retired” instructions.
In the Intel® Core™ microarchitecture (Figure 1), up to four μops may be retired per cycle and 6 μops6 can be executed. The results of Intel 64 and IA-32 instructions must be committed in original program order before they are retired. Reducing the number of retired instructions will improve the execution time as will reducing the stalls.
Cycles wasted due to stalls indicate sub-optimal execution; therefore identifying them will be the main focus. Stalls can be categorized as
- Execution stalls
- Front End stalls
- Mispredicted branch pipeline flushing
- Cycles wasted executing instructions that are not retired
With the introduction of Core™ architecture, the approach to identify and root-cause stall cycles became easier and more systematic. Before going into deeper analysis and explanation of the events, let’s introduce the Theory of Constraints.
Applications Will Perform as Fast as Their Bottlenecks: Gentle Introduction to the Theory of Constraints
TOC[1,6] is a philosophy of management and improvement developed by Eliyahu M. Goldratt. As mentioned earlier, it is a process of continuous improvement which focuses on the idea that in any complex system, there is usually one aspect of that system that limits its ability to achieve its goal.
The TOC processes defined by Goldratt are used to improve the health of an organization and these processes are very similar to the ones used by software optimization methodologies.
1 “Performance monitoring events provide facilities to characterize the interaction between programmed sequences of instructions and microarchitectural subsystems.”[3]
2 Instructions Retired: Recent generations of Intel 64 and IA-32 processors feature microarchitectures using an out-of-order execution engine. They are also accompanied by an in-order front end and retirement logic that enforces program order. Instructions executed to completion are referred as instructions retired [4].
3 The Intel Core™ microarchitecture is capable of reaching CPI as low as 0.25 in ideal situations. The greater value of CPI for a given workload indicates that there are more opportunities for code tuning to improve performance [4].
4 Micro-operations, also known as a
micro-ops or
μops, are simple, RISC-like microprocessor instructions used by some CISC processors to implement more complex instructions [5].
5 "In an OOO engine, speculative execution is an important part of making forward progress of the program. But speculative execution of μops in the shadow of mispredicted code path represent un-productive work that consumes execution resources and execution bandwidth."[4]
6 Figure 1 shows only one store dispatch port but in actual implementation, there are two store ports: one for load and another one for store.
7 Disclaimer: This block diagram is for example purposes only. Significant hardware blocks have been arranged or omitted for clarity. Some resources (Bus Unit, L2 Cache, etc…) are shared between cores.