A good strategy is to start by looking carefully at the code for the inner loops, identifying performance-reducing attributes such as excessive memory references and poor use of registers.

We can often determine the time (or at least a lower bound on the time) required to execute a loop by identifying critical paths: chains of data dependencies that form during repeated executions of a loop.

The case where two pointers may designate the same memory location is known as memory aliasing.

```c
x = 1000; y = 3000;
*q = y;    /* 3000 */
*p = x;    /* 1000 */
t1 = *q;   /* 1000 or 3000 */
```

The value computed for t1 depends on whether or not pointers p and q are aliased—if not, it will equal 3000, but if so it will equal 1000. If a compiler cannot determine whether or not two pointers may be aliased, it must assume that either case is possible, limiting the set of possible optimizations.
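A small runnable illustration of why aliasing blocks optimization. The `twiddle1`/`twiddle2` functions below are a sketch: a compiler may not rewrite the two statements of `twiddle1` as the single statement of `twiddle2`, because the two versions produce different results when `xp` and `yp` are aliased.

```c
#include <assert.h>

/* Add twice the value at yp into xp, written two ways. */
void twiddle1(long *xp, long *yp) {
    *xp += *yp;   /* if xp == yp, this write changes the next read of *yp */
    *xp += *yp;
}

void twiddle2(long *xp, long *yp) {
    *xp += 2 * *yp;   /* single read of *yp, single write of *xp */
}
```

With `xp != yp` the two behave identically, but when both point to the same location holding 1000, `twiddle1` yields 4000 (the value doubles twice) while `twiddle2` yields 3000.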

The sequencing of activities by a processor is controlled by a clock providing a regular signal of some frequency, usually expressed in gigahertz (GHz), billions of cycles per second. For example, when product literature characterizes a system as a “4 GHz” processor, it means that the processor clock runs at 4.0 × 10⁹ cycles per second. The period of a 4 GHz clock can be expressed as either 0.25 nanoseconds or 250 picoseconds.

Using a least squares fit, we find that the run times (in clock cycles) for psum1 and psum2 can be approximated by the equations 496 + 10.0n and 500 + 6.5n, respectively. These equations indicate an overhead of 496 to 500 cycles due to the timing code and to initiate the procedure, set up the loop, and complete the procedure, plus a linear factor of 10.0 or 6.5 cycles per element. We refer to the coefficients in these linear terms as the effective number of cycles per element, abbreviated CPE. By this measure, psum2, with a CPE of 6.50, is superior to psum1, with a CPE of 10.0.
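The two procedures being measured are prefix-sum routines along these lines (a sketch: psum1 computes one element per iteration, psum2 computes two):

```c
#include <assert.h>

/* Compute prefix sums p[i] = a[0] + ... + a[i]. */
void psum1(float a[], float p[], long n) {
    long i;
    p[0] = a[0];
    for (i = 1; i < n; i++)
        p[i] = p[i-1] + a[i];
}

/* Two elements per iteration; the second sum reuses mid_val. */
void psum2(float a[], float p[], long n) {
    long i;
    p[0] = a[0];
    for (i = 1; i < n - 1; i += 2) {
        float mid_val = p[i-1] + a[i];
        p[i] = mid_val;
        p[i+1] = mid_val + a[i+1];
    }
    if (i < n)   /* for even n there is one element left over */
        p[i] = p[i-1] + a[i];
}
```

Both compute the same result; psum2 halves the per-element loop overhead, which shows up as its lower CPE.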

**Eliminating Loop Inefficiencies**

Code motion is a general class of optimizations. It involves identifying a computation that is performed multiple times (e.g., within a loop) but whose result never changes, and moving that computation to an earlier section of the code that does not get evaluated as often.

lower2 is better than lower1: strlen runs in O(n) time, so calling it on every loop test makes lower1 O(n²) overall, while lower2 moves the call out of the loop into a local variable, making the whole function O(n).
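A sketch of the two versions (following the book's lower1/lower2 string-lowercasing example):

```c
#include <string.h>
#include <assert.h>

/* lower1: strlen is re-evaluated on every loop test.  Since strlen
   itself is O(n), the whole function is O(n^2). */
void lower1(char *s) {
    long i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* lower2: the strlen call is hoisted out of the loop into a local
   variable, so the function runs in O(n). */
void lower2(char *s) {
    long i;
    long len = strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```

The hoisting is safe because the loop never changes the string's length, even though it writes to the string: a fact the programmer knows but the compiler generally cannot prove.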

**Reducing Procedure Calls**

The original code performs a bounds check on every call to get_vec_element. We change it to call get_vec_start once, obtaining a pointer to the underlying array, so that no function calls remain inside the inner loop.
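A minimal sketch of this transformation, assuming a vector abstraction along the book's lines (the struct layout here is an illustrative guess):

```c
#include <assert.h>

typedef long data_t;
typedef struct {
    long len;
    data_t *data;
} vec_rec, *vec_ptr;

long vec_length(vec_ptr v) { return v->len; }

/* Return a pointer to the start of the underlying array, so the
   loop can index it directly with no per-element bounds check. */
data_t *get_vec_start(vec_ptr v) { return v->data; }

/* Combining loop with no function calls in the loop body. */
void combine3(vec_ptr v, data_t *dest) {
    long i;
    long length = vec_length(v);
    data_t *data = get_vec_start(v);
    *dest = 0;   /* identity element for addition */
    for (i = 0; i < length; i++)
        *dest = *dest + data[i];
}
```

Exposing the raw array trades away the abstraction's safety guarantees, which is exactly the modularity-for-speed compromise discussed below.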

The resulting improvement is surprisingly modest, only improving the performance for integer sum. For applications in which performance is a significant issue, one must often compromise modularity and abstraction for speed. It is wise to include documentation on the transformations applied, as well as the assumptions that led to them, in case the code needs to be modified later.

**Eliminating Unneeded Memory References**

Examining the assembly code, we see that on each iteration the value at dest is read, updated, and written back, only to be read again on the next iteration. By introducing a local variable acc to accumulate the result and storing it to dest just once, after the loop, we eliminate this traffic. Compared to the loop in combine3, we have reduced the memory operations per iteration from two reads and one write to just a single read.
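A side-by-side sketch of the two loop bodies, with the vector abstraction elided and the `_body` names invented here for illustration:

```c
#include <assert.h>

/* combine3-style loop: reads and writes *dest on every iteration. */
void combine3_body(long *data, long length, long *dest) {
    long i;
    *dest = 0;
    for (i = 0; i < length; i++)
        *dest = *dest + data[i];   /* read *dest, add, write *dest */
}

/* combine4-style loop: accumulate in a local variable, which the
   compiler can keep in a register; store to *dest once at the end. */
void combine4_body(long *data, long length, long *dest) {
    long i;
    long acc = 0;
    for (i = 0; i < length; i++)
        acc = acc + data[i];       /* single memory read: data[i] */
    *dest = acc;
}
```

The compiler cannot make this transformation itself because dest might alias an element of data, in which case the two versions would behave differently.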

The best way to express a performance improvement is as a ratio of the form T_old/T_new, where T_old is the time required for the original version and T_new is the time required by the modified version.

The latency bound is encountered when a series of operations must be performed in strict sequence, because the result of one operation is required before the next one can begin.

The throughput bound characterizes the raw computing capacity of the processor’s functional units.

Modern processors are superscalar, which means they can perform multiple operations on every clock cycle, and out-of-order, meaning that the order in which instructions execute need not correspond to their ordering in the machine-level program.

With speculative execution, the processor begins fetching and decoding instructions at where it predicts the branch will go, and even begins executing these operations before it has been determined whether or not the branch prediction was correct. If it later determines that the branch was predicted incorrectly, it resets the state to that at the branch point and begins fetching and executing instructions in the other direction.

Each functional unit is characterized by its latency, meaning the total time required to perform the operation, and its issue time, meaning the minimum number of clock cycles between two successive operations of the same type. Functional units with issue times of 1 cycle are said to be fully pipelined: they can start a new operation every clock cycle.

For a code segment forming a loop, we can classify the registers that are accessed into four categories:

Read-only: These are used as source values, either as data or to compute memory addresses, but they are not modified within the loop. The read-only registers for the loop in combine4 are %rax and %rcx.

Write-only: These are used as the destinations of data-movement operations. There are no such registers in this loop.

Local: These are updated and used within the loop, but there is no dependency from one iteration to another. The condition code registers are examples for this loop: they are updated by the cmp operation and used by the jl operation, but this dependency is contained within individual iterations.

Loop: These are used both as source values and as destinations for the loop, with the value generated in one iteration being used in another. We can see that %rdx and %xmm0 are loop registers for combine4, corresponding to program values i and acc.

We can picture the data flow of n iterations by simply replicating the template shown on the right-hand side of Figure 5.14 n times. The program has two chains of data dependencies, corresponding to the updating of program values acc and i with operations mul and add, respectively. Given that single-precision multiplication has a latency of 4 cycles while integer addition has latency 1, the chain on the left forms a critical path, requiring 4n cycles to execute.

**Loop Unrolling**: We must make sure the first loop does not overrun the array bounds. For a vector of length n, we set the loop limit to be n − 1, so the maximum array index i + 1 will satisfy i + 1 < (n − 1) + 1 = n. Generalizing to any factor k, we set the upper limit to be n − k + 1 and within the loop apply the combining operation to elements i through i + k − 1; a second loop then finishes the remaining elements one at a time.
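A sketch of 2 × 1 unrolling under these bounds (the function name is an invention here, and the loop operates on a plain array rather than the book's vector type):

```c
#include <assert.h>

/* 2 x 1 loop unrolling of a sum.  The limit n - 1 is n - k + 1
   with k = 2, guaranteeing that index i + 1 stays below n inside
   the first loop. */
long sum_unroll2x1(long *data, long n) {
    long i;
    long acc = 0;
    long limit = n - 1;
    /* Combine two elements per iteration */
    for (i = 0; i < limit; i += 2)
        acc = (acc + data[i]) + data[i+1];
    /* Finish any remaining elements */
    for (; i < n; i++)
        acc = acc + data[i];
    return acc;
}
```

For odd n the first loop stops one element short and the second loop picks it up, so every element is combined exactly once.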

Floating-point multiplication and addition are not associative. Thus, combine5 and combine6 could produce different results due to rounding or overflow. For example, if the odd-indexed elements were all very large while the even-indexed elements were all very small (near 0.0), summing them in separate accumulators could produce a different result from sequential summation.
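For concreteness, a sketch of a two-accumulator (2 × 2) sum in the style of combine6 (the name and plain-array interface are assumptions here):

```c
#include <assert.h>

/* 2 x 2 unrolling: two parallel accumulators.  acc0 sums the
   even-indexed elements, acc1 the odd-indexed ones; combining
   them at the end reassociates the floating-point sum. */
float sum_2x2(float *data, long n) {
    long i;
    float acc0 = 0.0f;
    float acc1 = 0.0f;
    long limit = n - 1;
    for (i = 0; i < limit; i += 2) {
        acc0 = acc0 + data[i];
        acc1 = acc1 + data[i+1];
    }
    for (; i < n; i++)   /* leftover element, if n is odd */
        acc0 = acc0 + data[i];
    return acc0 + acc1;
}
```

The two accumulators break the single dependency chain into two that can proceed in parallel, at the cost of a possibly different rounding behavior.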

In combine5, the combining is performed by the statement

acc = (acc OP data[i]) OP data[i+1];

while in combine7 it is performed by the statement

acc = acc OP (data[i] OP data[i+1]);

differing only in how the two parentheses are placed. We call this a reassociation transformation, because the parentheses shift the order in which the vector elements are combined with the accumulated value acc.

Some **Limiting Factors**: a processor has only a small number of registers to hold the values being accumulated. If the degree of parallelism p exceeds the number of available registers, the compiler will resort to **spilling**, storing some of the temporary values on the stack. Once this happens, the performance can drop significantly.

To avoid branch-misprediction penalties, write code suitable for implementation with conditional moves: the basic idea is to compute the values along both branches of a conditional expression or statement, and then use conditional moves to select the desired value.
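A sketch in the spirit of the book's min/max example (function names assumed): the first version uses a hard-to-predict branch, while the second computes both outcomes and selects, a pattern compilers can translate into conditional moves.

```c
#include <assert.h>

/* Branchy version: the if test is unpredictable for random data. */
void minmax1(long a[], long b[], long n) {
    long i;
    for (i = 0; i < n; i++) {
        if (a[i] > b[i]) {
            long t = a[i];
            a[i] = b[i];
            b[i] = t;
        }
    }
}

/* Functional style: compute both values, then select.  The two
   ternary expressions are candidates for conditional moves. */
void minmax2(long a[], long b[], long n) {
    long i;
    for (i = 0; i < n; i++) {
        long min = a[i] < b[i] ? a[i] : b[i];
        long max = a[i] < b[i] ? b[i] : a[i];
        a[i] = min;
        b[i] = max;
    }
}
```

Both leave the element-wise minimum in a and maximum in b; the functional version does slightly more arithmetic per element but suffers no misprediction penalty.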

Since the load unit can only initiate one load operation every clock cycle, the CPE cannot be less than 1.00. For applications where we must load k values for every element computed, we can never achieve a CPE lower than k.

**Basic strategies for optimizing program performance:**

1. High-level design. Choose appropriate algorithms and data structures for the problem at hand. Be especially vigilant to avoid algorithms or coding techniques that yield asymptotically poor performance.

2. Basic coding principles. Avoid optimization blockers so that a compiler can generate efficient code.

– Eliminate excessive function calls. Move computations out of loops when possible. Consider selective compromises of program modularity to gain greater efficiency.

– Eliminate unnecessary memory references. Introduce temporary variables to hold intermediate results. Store a result in an array or global variable only when the final value has been computed.

3. Low-level optimizations.

– Unroll loops to reduce overhead and to enable further optimizations.

– Find ways to increase instruction-level parallelism by techniques such as multiple accumulators and reassociation.

– Rewrite conditional operations in a functional style to enable compilation via conditional data transfers.

**Amdahl’s law** (basic idea): when we speed up one part of a system, the effect on the overall system performance depends on both how significant this part was and how much it sped up.
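Quantitatively: if a fraction α of the original execution time T_old is sped up by a factor of k, the new time and overall speedup are

```latex
T_{\text{new}} = (1-\alpha)\,T_{\text{old}} + \frac{\alpha\,T_{\text{old}}}{k},
\qquad
S = \frac{T_{\text{old}}}{T_{\text{new}}} = \frac{1}{(1-\alpha) + \alpha/k}.
```

For example, speeding up a part that accounts for 60% of the time (α = 0.6) by a factor of k = 3 gives S = 1/(0.4 + 0.2) ≈ 1.67, far less than 3: the untouched 40% dominates.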