An instruction is encoded in binary form as a sequence of 1 or more bytes. The instructions supported by a particular processor and their byte-level encodings are known as its instruction-set architecture (ISA).
By executing different parts of multiple instructions simultaneously, the processor can achieve higher performance than if it executed just one instruction at a time.
Defining an ISA includes defining the different state elements, the set of instructions and their encodings, a set of programming conventions, and the handling of exceptional events.
Each instruction in a Y86 program can read and modify some part of the processor state. This is referred to as the programmer-visible state, where the “programmer” in this case is either someone writing programs in assembly code or a compiler generating machine-level code.
Register %esp is used as a stack pointer by the push, pop, call, and return instructions. There are three single-bit condition codes, ZF, SF, and OF, storing information about the effect of the most recent arithmetic or logical instruction.
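As a rough illustration of what the three condition codes record, the sketch below models a 32-bit addition. The function name add32_with_cc and the dictionary layout are hypothetical, and the sketch assumes both operands are already 32-bit unsigned bit patterns.

```python
MASK32 = 0xFFFFFFFF

def add32_with_cc(a: int, b: int):
    """Add two 32-bit patterns and return (result, condition codes).

    Hypothetical helper for illustration; not part of any Y86 tool.
    """
    def sign(x: int) -> int:
        return (x >> 31) & 1  # bit 31 is the two's-complement sign bit

    result = (a + b) & MASK32
    cc = {
        "ZF": int(result == 0),  # the result was zero
        "SF": sign(result),      # the result was negative
        # Signed overflow: both operands share a sign that the result lacks.
        "OF": int(sign(a) == sign(b) and sign(result) != sign(a)),
    }
    return result, cc

print(add32_with_cc(0x7FFFFFFF, 1))
```

Adding 1 to the largest positive value wraps to a negative result, so this call sets SF and OF and leaves ZF clear.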
The status code Stat indicates the overall state of program execution. It will indicate either normal operation, or that some sort of exception has occurred, such as when an instruction attempts to read from an invalid memory address.
Branch and call destinations are given as absolute addresses, rather than using the PC-relative addressing seen in IA32.
As an example, let us generate the byte encoding of the instruction rmmovl %esp,0x12345(%edx) in hexadecimal. From Figure 4.2, we can see that rmmovl has initial byte 40. We can also see that source register %esp should be encoded in the rA field, and base register %edx should be encoded in the rB field. Using the register numbers in Figure 4.4, we get a register specifier byte of 42. Finally, the displacement is encoded in the 4-byte constant word. We first pad 0x12345 with leading zeros to fill out 4 bytes, giving a byte sequence of 00 01 23 45. We write this in byte-reversed order as 45 23 01 00. Combining these, we get an instruction encoding of 404245230100.
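The worked encoding above can be reproduced with a short sketch. The register numbers for %esp and %edx come from the example itself; the rest of the REG_ID table follows the usual Y86 numbering and is an assumption here, as is the helper name encode_rmmovl.

```python
# %esp = 4 and %edx = 2 match the worked example; the other IDs are assumed.
REG_ID = {"%eax": 0, "%ecx": 1, "%edx": 2, "%ebx": 3,
          "%esp": 4, "%ebp": 5, "%esi": 6, "%edi": 7}

def encode_rmmovl(ra: str, disp: int, rb: str) -> str:
    """Return the 6-byte encoding of `rmmovl ra,disp(rb)` as a hex string."""
    opcode = bytes([0x40])                               # initial byte for rmmovl
    reg_byte = bytes([(REG_ID[ra] << 4) | REG_ID[rb]])   # rA in high nibble, rB in low
    disp_bytes = (disp & 0xFFFFFFFF).to_bytes(4, "little")  # byte-reversed constant
    return (opcode + reg_byte + disp_bytes).hex()

print(encode_rmmovl("%esp", 0x12345, "%edx"))  # 404245230100
```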
Three major components are required to implement a digital system: combinational logic to compute functions on the bits, memory elements to store bits, and clock signals to regulate the updating of the memory elements. HCL (for “hardware control language”) is the language that we use to describe the control logic of the different processor designs.
In HCL, we will declare any word-level signal as an int, without specifying the word size. This is done for simplicity. In a full-featured hardware description language, every word can be declared to have a specific number of bits.
The second selection expression is simply 1, indicating that this case should be selected if no prior one has been. This is the way to specify a default case in HCL. Nearly all case expressions end in this manner.
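The first-match behavior of a case expression, with a final selector of 1 acting as the default, can be modeled as follows. This is only a software analogy: hcl_case and mux2 are hypothetical names, and the two-way multiplexor with select signal s and inputs a and b is an assumed example.

```python
def hcl_case(cases):
    """Return the value paired with the first true selection expression."""
    for select, value in cases:
        if select:
            return value
    raise ValueError("no case selected; end the list with a default selector of 1")

def mux2(s: int, a: int, b: int) -> int:
    # Roughly equivalent to the case expression [ s : a; 1 : b; ]
    return hcl_case([(s, a), (1, b)])

print(mux2(1, 10, 20), mux2(0, 10, 20))  # 10 20
```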
By using valE as the address for the write operation, we adhere to the Y86 (and IA32) convention that pushl should decrement the stack pointer before writing, even though the actual updating of the stack pointer does not occur until after the memory operation has completed. Similarly, popl should first read memory and then increment the stack pointer.
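A minimal sketch of this ordering, assuming a register file and a word-addressed memory modeled as Python dictionaries. The function names and the dictionary representation are hypothetical; the local variables mirror the signal names valA, valB, valE, and valM.

```python
def pushl(regs: dict, mem: dict, ra: str) -> None:
    val_a = regs[ra]        # decode: value to push
    val_b = regs["%esp"]    # decode: current stack pointer
    val_e = val_b - 4       # execute: decremented stack pointer
    mem[val_e] = val_a      # memory: write at the decremented address valE
    regs["%esp"] = val_e    # write back: %esp is updated only now

def popl(regs: dict, mem: dict, ra: str) -> None:
    val_a = regs["%esp"]    # decode: address to read (old stack pointer)
    val_b = regs["%esp"]    # decode: stack pointer to increment
    val_e = val_b + 4       # execute: incremented stack pointer
    val_m = mem[val_a]      # memory: read at the old address valA
    regs["%esp"] = val_e    # write back: new stack pointer
    regs[ra] = val_m        # write back: value read from memory

regs = {"%eax": 0xABCD, "%ebx": 0, "%esp": 0x100}
mem = {}
pushl(regs, mem, "%eax")
popl(regs, mem, "%ebx")
print(hex(regs["%ebx"]), hex(regs["%esp"]))  # 0xabcd 0x100
```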
Fetch: Using the program counter register as an address, the instruction memory reads the bytes of an instruction. The PC incrementer computes valP, the incremented program counter.
Decode: The register file has two read ports, A and B, via which register values valA and valB are read simultaneously.
Execute: The execute stage uses the arithmetic/logic unit (ALU) for different purposes according to the instruction type. For integer operations, it performs the specified operation. For other instructions, it serves as an adder to compute an incremented or decremented stack pointer, to compute an effective address, or simply to pass one of its inputs to its outputs by adding zero. The condition code register (CC) holds the three condition-code bits. New values for the condition codes are computed by the ALU. When executing a jump instruction, the branch signal Cnd is computed based on the condition codes and the jump type.
Memory: The data memory reads or writes a word of memory when executing a memory instruction. The instruction and data memories access the same memory locations, but for different purposes.
Write back: The register file has two write ports. Port E is used to write values computed by the ALU, while port M is used to write values read from the data memory.
Principle: The processor never needs to read back the state updated by an instruction in
order to complete the processing of this instruction.
The use of a clock to control the updating of the state elements, combined with the propagation of values through combinational logic, suffices to control the computations performed for each instruction in our implementation of SEQ. Every time the clock transitions from low to high, the processor begins executing a new instruction.
In contemporary logic design, we measure circuit delays in units of picoseconds (abbreviated “ps”), or 10^-12 seconds.
We express throughput in units of giga-instructions per second (abbreviated
GIPS), or billions of instructions per second. The total time required to perform
a single instruction from beginning to end is known as the latency. In this system,
the latency is 320 ps, the reciprocal of the throughput.
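The reciprocal relationship can be checked with a one-line conversion. The helper name throughput_gips is hypothetical; the only input taken from the text is the 320 ps latency.

```python
def throughput_gips(time_per_instruction_ps: float) -> float:
    """Convert time per instruction (ps) to giga-instructions per second.

    One instruction per T picoseconds is 1000/T instructions per nanosecond,
    and instructions per nanosecond is the same quantity as GIPS.
    """
    return 1000.0 / time_per_instruction_ps

print(throughput_gips(320))  # 3.125
```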
Limitations of pipelining: (1) Nonuniform partitioning: the clock period is set by the delay of the slowest stage, so the faster stages sit idle for part of every cycle. (2) Diminishing returns of deep pipelining: as the computation in each stage shrinks, the fixed delay added by the pipeline registers becomes a large fraction of the clock period.
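Both limitations show up in a small timing model. The stage delays (50, 150, and 100 ps), the 20 ps register delay, and the 300 ps of total logic used below are illustrative assumptions, not figures from these notes, and pipeline_timing is a hypothetical helper.

```python
def pipeline_timing(stage_delays_ps, register_delay_ps):
    """Return (cycle time in ps, latency in ps, throughput in GIPS)."""
    # The clock must accommodate the slowest stage plus the register overhead.
    cycle = max(stage_delays_ps) + register_delay_ps
    latency = cycle * len(stage_delays_ps)
    throughput = 1000.0 / cycle
    return cycle, latency, throughput

# Nonuniform partitioning: the 150 ps stage dictates the clock period.
print(pipeline_timing([50, 150, 100], 20))

# Deeper pipelining: split an assumed 300 ps of logic into k equal stages.
for k in (1, 3, 6, 12):
    cycle, latency, gips = pipeline_timing([300 / k] * k, 20)
    print(k, cycle, latency, round(gips, 2))
```

Under these assumptions, going from 3 to 6 stages raises throughput from about 8.33 to about 14.29 GIPS rather than doubling it, while latency grows from 360 ps to 420 ps.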
irmovl $50,%eax
addl %eax,%ebx
The outcome of the conditional test determines whether the next instruction to execute will be the irmovl instruction (line 4) or the halt instruction (line 7).
The PC logic is shifted from the top, where it was active at the end of the clock cycle, to the bottom, where it is active at the beginning.
This technique of guessing the branch direction and then initiating the fetching of instructions according to our guess is known as branch prediction. It is used in some form by virtually all processors.
When such dependencies have the potential to cause an erroneous computation by the
pipeline, they are called hazards. Like dependencies, hazards can be classified as either data hazards or control hazards.
1. Stalling: insert bubbles to hold an instruction in the decode stage until the instructions that produce its source operands have passed through the write-back stage. Throughput suffers, since an instruction may have to wait for up to three cycles.
2. This technique of passing a result value directly from one pipeline stage to an earlier one is commonly known as data forwarding (or simply forwarding, and sometimes bypassing). It allows the instructions of prog2 to proceed through the pipeline without any stalling. Data forwarding requires adding additional data connections and control logic to the basic hardware structure.
One class of data hazards cannot be handled purely by forwarding, because memory reads occur late in the pipeline. Figure 4.53 illustrates an example of a load/use hazard, where one instruction (the mrmovl at address 0x018) reads a value from memory for register %eax while the next instruction (the addl at address 0x01e) needs this value as a source operand. The pipeline avoids the hazard by stalling the addl instruction for one cycle, after which the loaded value can be forwarded to it. This use of a stall to handle a load/use hazard is called a load interlock.
Exceptions may be triggered by multiple instructions simultaneously. For example, during one cycle of pipeline operation, we could have a halt instruction in the fetch stage, and the data memory could report an out-of-bounds data address for the instruction in the memory stage. We must determine which of these exceptions the processor should report to the operating system. The basic rule is to put priority on the exception triggered by the instruction that is furthest along the pipeline.
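The priority rule can be sketched as a scan from the last pipeline stage back to the first. The single-letter stage labels, the status names "HLT", "ADR", and "AOK", and the function exception_to_report are all assumptions made for this illustration.

```python
STAGE_ORDER = ["F", "D", "E", "M", "W"]  # fetch first, write back last

def exception_to_report(pending: dict) -> str:
    """pending maps a stage letter to the exception status detected there.

    Returns the status from the stage furthest along the pipeline,
    or "AOK" (assumed name for normal operation) if nothing is pending.
    """
    for stage in reversed(STAGE_ORDER):
        if stage in pending:
            return pending[stage]
    return "AOK"

# A halt in fetch and an invalid data address in the memory stage:
print(exception_to_report({"F": "HLT", "M": "ADR"}))  # ADR
```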
A pipelined implementation should always give priority to the forwarding source in the earliest pipeline stage, since it holds the latest instruction in the program sequence setting the register.
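One way to picture this priority is a first-match search over the forwarding sources, listed from the earliest pipeline stage to the latest. The function select_operand, the (destination register, value) pairs, and the use of 0xF as the "no register" ID are assumptions of this sketch rather than the actual control logic.

```python
RNONE = 0xF  # assumed ID meaning "no register"

def select_operand(src_reg: int, regfile_value: int, forwarding_sources) -> int:
    """Pick the value for a source operand in the decode stage.

    forwarding_sources is a list of (dst_reg, value) pairs ordered from the
    earliest pipeline stage (execute) to the latest (write back). The first
    match wins, so the most recent writer of the register takes priority.
    """
    if src_reg != RNONE:
        for dst_reg, value in forwarding_sources:
            if dst_reg == src_reg:
                return value
    return regfile_value  # no pending write: use the register file output

# Three in-flight instructions all write register 0; execute is the newest.
sources = [(0, 30), (0, 20), (0, 10)]  # execute, memory, write back
print(select_operand(0, 5, sources))   # 30
```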
If the stage processes a total of Ci instructions and Cb bubbles, then the processor has required around Ci + Cb total clock cycles to execute Ci instructions.
CPI = (Ci + Cb)/Ci = 1.0 + Cb/Ci
CPI = 1.0 + lp + mp + rp
where lp (for “load penalty”) is the average frequency with which bubbles are injected while stalling for load/use hazards, mp (for “mispredicted branch penalty”) is the average frequency with which bubbles are injected when canceling instructions due to mispredicted branches, and rp (for “return penalty”) is the average frequency with which bubbles are injected while stalling for ret instructions.
Our goal was to design a pipeline that can issue one instruction per cycle, giving a CPI of 1.0. Any effort to reduce the CPI further should focus on mispredicted branches.
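To see how the three penalties add up, the sketch below computes each one as instruction frequency times condition frequency times bubbles injected. All of the frequencies and bubble counts are illustrative assumptions, not measurements from these notes, and the helper name penalty is hypothetical.

```python
def penalty(instr_freq: float, cond_freq: float, bubbles: int) -> float:
    """Average bubbles injected per instruction for one cause."""
    return instr_freq * cond_freq * bubbles

# Assumed workload: 25% loads, 20% of which cause a load/use stall (1 bubble);
# 20% conditional jumps, 40% of them mispredicted (2 bubbles);
# 2% ret instructions, each of which stalls (3 bubbles).
lp = penalty(0.25, 0.20, 1)
mp = penalty(0.20, 0.40, 2)
rp = penalty(0.02, 1.00, 3)

cpi = 1.0 + lp + mp + rp
print(f"lp={lp:.2f} mp={mp:.2f} rp={rp:.2f} CPI={cpi:.2f}")
```

With these assumed numbers, mp is the largest of the three terms, which is consistent with directing further effort at mispredicted branches.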
A typical processor has two first-level caches—one for reading instructions and one for reading and writing data. Another type of cache memory, known as a translation look-aside buffer, or TLB, provides a fast translation from virtual to physical addresses.
In some cases, the memory location being referenced is actually stored in the disk memory. When this occurs, the hardware signals a page fault exception.
From the perspective of the processor, the combination of stalling to handle short-duration cache misses and exception handling to handle long-duration page faults takes care of any unpredictability in memory access times due to the structure of the memory hierarchy.
More recent processors support superscalar operation, meaning that they can achieve a CPI less than 1.0 by fetching, decoding, and executing multiple instructions in parallel. The accepted performance measure has shifted from CPI to its reciprocal, the average number of instructions executed per cycle, or IPC. It can exceed 1.0 for superscalar processors.