Intel Go Parallel
Intel
Getting Started Concurrent Programming Community And Opinion Tools and Tips Advanced Concepts Go Parallel RSS Feed
 Print Print
How Can Theory of Constraints Help in Software Optimization? (cont'd)

The Five Focusing Steps of TOC
In this theory, five steps are defined. Before these steps are applied, a goal must be articulated. In our case, this is improving the performance of our application or meeting a performance requirement. Having articulated the goal[5], the steps can be given as:

1. Identify the constraint (the root-cause that prevents achievement of the goal)
2. Decide how to exploit the constraint (make sure the constraint identified is doing what it is designed to do. i.e: execution unit is executing but not stalling)
3. Subordinate and synchronize everything else to the above decisions
4. Elevate the performance of the constraint (increase capacity of the constraint. i.e: taking advantage of multiple cores, parallelism)
5. If in any of the above steps the constraint has shifted, go back to Step 1

The most critical piece in these steps is identifying the constraints. The questions to ask are:

  • “What to change?”,
  • “To What to Change?” and
  • “How to Change?”[1,6].
The TOC defines processes called thinking processes to answer these questions and provides a problem-solving framework based on cause and effect. The basic idea while trying to reach our goal (desired performance) is to identify the hotspots, understand the underlying cause and effect relation and eliminate the causes. If we take the four major stall categories given earlier as an example, we can say that most of the time execution stalls are the cause of performance problems (effect) and resource related stalls. While sub-optimal algorithms are the cause of execution stalls (effect).

In the rest of the article we’ll show how one can identify the constraints preventing optimal application performance by using the TOC methodology as a guideline for software optimization on Core™ architecture.

Core™ Architecture Cycle Analysis in Details
Figure 2 which is from the Intel® 64 and IA-32 Optimization Manual can give an idea of how to approach a cycle decomposition and performance analysis exercise. Figure 3 is one interpretation of Figure 2 for Core™ architecture with some of the microarchitectural event names given. This is clearly not the complete breakdown of all the events, but nonetheless a good starting point for further analysis [4, 7].

The idea behind Figure 3 is to identify cycles with the help of Intel® VTune™ Performance Analyzer where 1) no μops are dispatched for execution and 2) cycles which are executed but are not retired (due to speculative nature of the processor). The non-retired cycles are basically non-productive cycles and cause ineffective usage of the execution unit.

Cycles dispatching μops can be counted with the RS_UOPS_DISPATCHED.CYCLES_ANY event while cycles where no μops were dispatched (stalls) can be counted with the RS_UOPS_DISPATCHED.CYCLES_NONE event. Therefore the equation given earlier in Formula 1 can be re-written as given in Formula 2. The ratio of RS_UOPS_DISPATCHED.CYCLES_NONE to CPU_CLK_UNHALTED.CORE will tell you the percentage of cycles wasted due to stalls. These very stalls can turn the execution unit of a processor into a major bottleneck. The execution unit by definition is always the bottleneck because it defines the throughput and an application will perform as fast as its bottleneck. Therefore it is extremely critical to identify the causes for the stall cycles and remove them if possible.

CPU_CLK_UNHALTED.CORE ~ RS_UOPS_DISPATCHED.CYCLES_ANY + RS_UOPS_DISPATCHED.CYCLES_NONE
Formula 2

Our goal is to determine how we can minimize the causes for the stalls and let the “bottleneck” (i.e: execution unit due to stalls) do to what it is designed to do. In sum, the execution unit should not sit idle and wait for whatever reason.

There are many contributing factors to the stall cycles and sub-optimal usage of the execution unit. Memory accesses (e.g: cache misses), Branch mis-predictions (pipeline flushes as a result), Floating-point (FP) operations (ops) (e.g: long latency operations such as division, fp control word change etc) and μops not retiring due to the out of order (OOO) engine can be given as some of them.

Some of the key events that are used in breaking down the stalls cycles in Figure 3 are given below. VTune™ analyzer can help to sample these events not only on your application but also on the entire system.

  • MEM_LOAD_RETIRED.L2_LINE_MISS, MEM_LOAD_RETIRED.L1_LINE_MISS: Number of retired load operations that missed the L2/L1 (respectively) cache. When the event count is multiplied with penalty in cycles, one can estimate the impact on the stalls cycles. However, this approach oversimplifies the fact that the OOO engine can handle multiple outstanding load misses8.
  • DTLB_MISSES.ANY: Number of Data Table Lookaside Buffer (DTLB) misses. The count includes misses detected as a result of speculative accesses. Typically a high count for this event indicates that the code accesses a large number of data pages. The cost of a DTLB lookup miss is about 10 cycles. The event MEM_LOAD_RETIRED.DTLB_MISS measures the number of load micro-ops that experienced a DTLB miss.
  • RESOURCE_STALLS.BR_MISS_CLEAR: Number of cycles after a branch misprediction is detected at execution until the branch and all older micro-ops retire. During this time new micro-ops cannot enter the out-of-order pipeline.
    • BR_INST_RETIRED.MISPRED and its ratio to BR_INST_RETIRED.ANY can also give idea on how many of the overall branch instructions were mispredicted.
  • RAT_STALLS.ANY: Number of stall cycles due to various reasons, the breakdown of the stalls counted by this event can also be counted separately, please see VTune™ analyzer help).
  • Type conversions and long latency operation will also impact the performance of the execution unit. For example:
    • Division which is a long latency operation and its impact on the execution unit can be measured with DIV (counts the number of divide ops) and IDLE_DURING_DIV (counts the number of cycles the divider is busy and no other execution unit or load operation is in progress) respectively.
    • RESOURCE_STALLS.FPCW: counts the number of cycles while execution was stalled due to writing the floating-point unit (FPU) control word.
  • DELAYED_BYPASS.[FP/LOAD/SIMD]: Number of times floating point/load/SIMD operations use data immediately after the data was generated by a non-floating point/non-SIMD unit. Such cases result in one penalty cycle due to data bypass between the units. Normally small penalties such as this one are hidden by out of order (OOO) execution.
  • ILD_STALLS: Number of Instruction Length Decoder stall cycles due to a length changing prefix (LCP). The event ILD_STALLS measures the number of times the slow decoder was triggered, the cost of each instance is 6 cycles. This event is front-end related.

Figure 2: Performance Events Drill-Down and Software Tuning Feedback Loop [3]

Even though OOO engine takes care of small stall penalties (usually anything less 10 cycles), it can be a good exercise to identify the locations of these events to find a correlation with the un-dispatched cycles.

As mentioned above, non-productive cycles (non-retired instruction) utilize the execution unit unnecessarily; thus, identifying and eliminating those instructions is crucial. Although there isn’t a direct event to measure the cycles associated with non-retiring μops, the following formulas (Formula 2 and 3) can be used to estimate non-retired cycles by using already performance events.

RS_UOPS_DISPATCHED - (UOPS_RETIRED.ANY + UOPS_RETIRED.FUSED – UOPS_RETIRED.MACRO_FUSION)
Formula 3

RS_UOPS_DISPATCHED - (UOPS_RETIRED.ANY + UOPS_RETIRED. LD_IND_BR + UOPS_RETIRED.STD_STA)
Formula 4


Figure 3: One interpretation of Figure 2 with some of the microarchitectural event names.


8 More accurate estimation can be given as:
L2 Data Load Miss Retired = (MEM_LOAD_RETIRED.L2_LINE_MISS * Stalls per Bus Request) / CPU_CLK_UNHALTED.CORE
Average Read Latency (full cache line) = (BUS_REQUEST_OUTSTANDING.SELF / (BUS_TRANS_BRD.SELF - BUS_TRANS_IFETCH.SELF)
Request Queue Depth = BUS_REQUEST_OUTSTANDING.SELF / (TSC – BUSQ_EMTPY.SELF)
Stalls per Bus Request = Average Read Latency / Request Queue Depth [4]


Previous Page: Introduction to Software Optimization Next Page: Applying TOC: Cause and Effect Relation and Focusing Steps
Page 1: Introduction to Software OptimizationPage 3: Applying TOC: Cause and Effect Relation and Focusing Steps
Page 2: The Five Focusing Steps of TOC 
Submit article to:
Ever wonder why we don't hear more from threading practitioners about how they managed to grok concurrency? Perhaps it's because they're too busy enjoying the performance increases. They won't say it's easy, but the Vegas Pro developers at Sony Creative Software are understandably proud of their growing expertise in threading and OpenMP. »
While threading can be a challenge, new software development tools help simplify the process by identifying thread correctness issues and performance opportunities. We present a methodology that has been used to successfully thread many applications and discuss tools that can assist in developing multi-threaded applications. »
This paper describes the performance analysis phase of the threading methodology we presented in our previous paper, "Best Practices for Developing and Optimizing Threaded Applications." »
Understanding Dual Processors, Hyper-Threading Technology, and Multi-Core Systems
Multi-Threading in a Java Environment
» More Personalized Content
Getting Started (98)
Concurrent Programming (114)
Community and Opinion (52)
Tools and Tips (90)
Advanced Concepts (62)
What concurrency info do you need right now?
(Choose your top answer.)
An introduction
Threading basics
Advanced parallelism concepts
Optimization tools and techniques

View Results
Past Votes