Central points:
* => Will not be covered in exams.
Chap 5: The Processor: Datapath and Control
5.1 Introduction
- We will be designing an implementation of a processor that supports the core MIPS instruction set (lw, sw, add, beq, j)
- We will rely on the use of combinational and sequential logic elements in constructing our implementations
- We will assume an edge triggered clocking methodology
5.2 Building a Datapath
- The basic components of our design include instruction and data memory, a program counter, adders, an ALU, and the register file.
- We can look at each instruction individually and see which functional units it uses and the path of data flow between them
5.3 A simple implementation scheme
- Using multiplexors, we can tie together all of the required datapaths for all of the instructions
- The value of the control signals is based only on the instruction being executed and thus can be implemented using combinational logic
- The cycle time of the single cycle implementation is determined by the worst case path delay over all the instructions
- Because the clock must fit the slowest instruction, faster instructions waste time every cycle; a machine with more powerful (longer-running) instructions would make this imbalance worse, and thus the single cycle implementation is not very practical
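The worst case rule for the single cycle clock can be sketched as a small calculation. The delays below (in ns) and the unit names are illustrative assumptions, not values from these notes; the point is only that the clock must cover the longest path, which here is lw.

```python
# Assumed functional-unit delays in nanoseconds (illustrative only).
DELAY_NS = {"imem": 2, "reg_read": 1, "alu": 2, "dmem": 2, "reg_write": 1}

# Functional units each instruction passes through, in order.
PATHS = {
    "add": ["imem", "reg_read", "alu", "reg_write"],
    "lw":  ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "sw":  ["imem", "reg_read", "alu", "dmem"],
    "beq": ["imem", "reg_read", "alu"],
    "j":   ["imem"],
}


def path_delay(instr):
    """Total delay along the datapath used by one instruction."""
    return sum(DELAY_NS[unit] for unit in PATHS[instr])


def single_cycle_time():
    """The single cycle clock must be as long as the slowest instruction."""
    return max(path_delay(instr) for instr in PATHS)


if __name__ == "__main__":
    for instr in PATHS:
        print(instr, path_delay(instr), "ns")
    print("cycle time:", single_cycle_time(), "ns")
```

With these assumed delays every instruction, even j, is charged the full lw path delay each cycle.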
5.4 A multicycle implementation
- If we break each instruction into a series of steps and perform one step every cycle we can create a multicycle implementation
- In doing so, we can reuse functional units such as memory and the ALU
- Registers are needed to store values between cycles
- A finite state machine must be used for control because the values of the control signals now depend upon which step we are performing
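A minimal sketch of the finite state machine idea, assuming a hypothetical set of state names (fetch, decode, mem_addr, and so on) that are not taken from these notes. Each state stands for one clock cycle, so the number of states an instruction visits is its cycle count.

```python
def next_state(state, op):
    """Next control state, given the current state and the instruction type."""
    if state == "fetch":
        return "decode"
    if state == "decode":
        return {"lw": "mem_addr", "sw": "mem_addr", "add": "execute",
                "beq": "branch", "j": "jump"}[op]
    if state == "mem_addr":
        return "mem_read" if op == "lw" else "mem_write"
    if state == "mem_read":
        return "load_write_back"
    if state == "execute":
        return "alu_write_back"
    # mem_write, load_write_back, alu_write_back, branch and jump
    # all finish the instruction, so control returns to fetch.
    return "fetch"


def states_for(op):
    """List the states (one per cycle) visited by a single instruction."""
    states = ["fetch"]
    while True:
        nxt = next_state(states[-1], op)
        if nxt == "fetch":
            return states
        states.append(nxt)


if __name__ == "__main__":
    for op in ("lw", "sw", "add", "beq", "j"):
        print(op, len(states_for(op)), "cycles:", " -> ".join(states_for(op)))
```

In this sketch lw takes five cycles while beq and j take three, which is the imbalance the multicycle design exploits.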
Check whether you know all the key terms (see key terms at the end of chapter)
============================================================================
Chap 6: Pipelining
6.1: An overview of pipelining
- Pipelining improves performance by increasing instruction throughput as opposed to decreasing the execution time of an individual instruction
- Instruction sets can be designed for ease of pipeline construction
- Theoretically the speedup obtained by pipelining is equal to the number of stages in the pipeline
- Complications arise from pipeline hazards (structural, control, data) which often have clever solutions (forwarding, prediction)
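The theoretical speedup claim can be checked with a short calculation. This sketch assumes perfectly balanced stages and no hazards; the function names are made up for illustration. It compares n instructions run one at a time against the same n instructions overlapped in a k-stage pipeline.

```python
def unpipelined_time(n, stages, stage_time=1):
    """Each instruction runs all stages before the next one starts."""
    return n * stages * stage_time


def pipelined_time(n, stages, stage_time=1):
    """The first instruction fills the pipeline; each later one finishes one stage-time after."""
    return (stages + n - 1) * stage_time


def speedup(n, stages):
    """Ideal speedup under the balanced, hazard-free assumption."""
    return unpipelined_time(n, stages) / pipelined_time(n, stages)


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, "instructions:", round(speedup(n, 5), 2))
```

The speedup only approaches the number of stages when many instructions are executed; for a single instruction there is no gain at all.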
6.2 A pipelined datapath
- The division of an instruction into five stages leads to a five stage pipeline for our MIPS instruction set
- Pipeline registers are introduced to separate the stages
- Pipelines can be represented graphically to help in understanding how instructions are executed
6.3 Pipelined control
- To specify control for the pipeline we need only set the control values during each pipeline stage
- We can generate the control values and pass them down the pipeline along with the data
6.4 Data Hazards and Forwarding
- Instructions with data dependencies will not execute properly unless the data hazard is resolved
- Data hazards can be resolved efficiently by detecting the dependency and forwarding data directly from the appropriate pipeline registers to the ALU
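A minimal sketch of the forwarding decision for one ALU operand. The dictionary keys reg_write and rd and the returned labels are assumptions made for this example; real hardware produces a multiplexor select signal instead of a string.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Choose where the ALU operand for src_reg should come from."""
    # Most recent result first: the instruction one ahead (in EX/MEM).
    # Register 0 is hard-wired to zero in MIPS, so it is never forwarded.
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src_reg:
        return "EX/MEM"
    # Otherwise the instruction two ahead (in MEM/WB).
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src_reg:
        return "MEM/WB"
    # No dependency detected: use the value read from the register file.
    return "REGFILE"


if __name__ == "__main__":
    ex_mem = {"reg_write": True, "rd": 2}
    mem_wb = {"reg_write": True, "rd": 4}
    for reg in (2, 4, 5):
        print("register", reg, "comes from", forward_select(reg, ex_mem, mem_wb))
```

Checking EX/MEM before MEM/WB matters when both stages write the same register, since the EX/MEM value is the newer one.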
6.5 Data hazards and stalls
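- Forwarding alone cannot solve the problem when an instruction tries to read a register that the immediately preceding load instruction writes, because the loaded data is not available until after the memory access stage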
- A hazard detection unit capable of stalling the pipeline is needed to properly handle this hazard
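The load-use check made by a hazard detection unit can be sketched as follows. The parameter names are hypothetical labels for pipeline register fields: the first two describe the instruction in the execute stage, the last two the source registers of the instruction being decoded.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the decoding instruction needs the result of a load just ahead of it."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)


if __name__ == "__main__":
    # lw $2, 20($1) followed directly by and $4, $2, $5: one stall cycle is needed.
    print(must_stall(True, 2, 2, 5))
    # Same registers, but the earlier instruction is not a load: forwarding suffices.
    print(must_stall(False, 2, 2, 5))
```

After the one-cycle stall, the loaded value can be forwarded from the MEM/WB pipeline register in the usual way.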
6.6 Branch Hazards
- By the time the outcome of a branch is known, the instructions following it have already been fetched into the pipeline. This is a branch (control) hazard.
- We can initially assume branch not taken and flush the instructions if we are wrong
- Other techniques such as dynamic branch prediction may help resolve branch hazards for more complex processors
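One common form of dynamic branch prediction is a 2-bit saturating counter; the notes do not name a specific scheme, so this choice is an assumption for illustration. The sketch tracks a single branch and counts how often the prediction is wrong.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor):
    """Run the predictor over a list of actual outcomes (True = taken)."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


if __name__ == "__main__":
    loop_branch = [True] * 9 + [False]  # taken 9 times, then falls through
    print(count_mispredictions(loop_branch, TwoBitPredictor(state=3)))
```

Because the counter needs two wrong guesses in a row to change its prediction, a loop branch that is almost always taken is mispredicted only at the loop exit once the counter is warmed up.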
6.7 Exceptions
- Exceptions may occur when there are already instructions in the pipeline
- We can flush the instructions and handle the exception in a manner similar to that used for branch hazards
- This is a difficult problem for more complex architectures, and sometimes machines handle exceptions imprecisely to improve performance
6.8 Superscalar and Dynamic pipelining
- To make processors faster, we can try to start off more than one instruction every clock cycle (superscalar)
- Code can be scheduled to improve pipelined performance
- Dynamic pipeline scheduling performs these operations in hardware and may use branch prediction to perform speculative execution
6.9 Real stuff: PowerPC 604 and Pentium Pro Pipelines*
- Modern processors use both superscalar and dynamic scheduling techniques to achieve high performance at the cost of very complex pipeline organizations
6.10 Fallacies and pitfalls:
- Fallacy: Pipelining is easy
- Fallacy: Pipelining ideas can be implemented independent of technology
- Pitfall: Failure to consider instruction set design can adversely impact pipelining
- Fallacy: Increasing the depth of pipelining always increases performance
Other tips:
- Without pipelining, much of the processor is idle (wasted) most of the time. Pipelining puts the idle functional units to work by overlapping the execution of several instructions.
- Instruction set design needs to be kept simple to allow pipelining
- The pipeline clock cycle is limited by the slowest stage
- Depending upon which design you start from, pipelining reduces the clock cycle time (compared with single cycle) or the average CPI (compared with multicycle)
- Deeper pipelines exacerbate the problem of control hazards by raising the cost of a misprediction, and they make it harder for compilers to fill delayed-branch slots
- Compiler is critical in improving performance through automatic code optimization for pipelining
- MIPS = "Microprocessor without Interlocked Pipelined Stages"
- VLIW: "Very long instruction word" design is similar to superscalar, but it requires compilers to guarantee that there are no dependencies between instructions that issue at the same time and that sufficient hardware is available. This simplifies the hardware, but it cannot stay binary compatible with older programs.
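The tip about misprediction cost in longer pipelines can be made concrete with a rough CPI estimate. All numbers below are made-up assumptions, and the model ignores every stall other than branch flushes.

```python
def effective_cpi(branch_fraction, mispredict_rate, flush_cycles, base_cpi=1.0):
    """Base CPI plus the average number of flush cycles paid per instruction."""
    return base_cpi + branch_fraction * mispredict_rate * flush_cycles


if __name__ == "__main__":
    # Assumed workload: 20% branches, 10% of them mispredicted.
    for flush_cycles in (1, 3, 10):
        print(flush_cycles, "cycles flushed:", round(effective_cpi(0.2, 0.1, flush_cycles), 3))
```

With the same predictor accuracy, a pipeline that resolves branches later flushes more instructions per misprediction and so loses more of its ideal CPI.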
Architecture implications of pipelining:
- Keep architecture simple.
- E.g.: Constant length instructions => fetch requires fixed time
- Fewer & similar formats => allows register fetch during decode
- Simpler addressing modes (Memory operands only in lw and sw) => execute stage can be uniformly used to calculate memory addresses.
- Operands aligned in memory => data fetch takes constant amount of time.
- MIPS writes a single result for each instruction, and this is done at the end of the instruction => forwarding of results is easier. It is harder if there are multiple results to forward, or if a write is needed before the end of the instruction
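The point about similar formats can be illustrated with the standard 32-bit MIPS field layout: the opcode, rs and rt fields sit in the same bit positions in every format, so the register file can be read before the instruction type is known. The helper name below is an assumption for this sketch.

```python
def decode_fields(instr):
    """Extract the fields that occupy fixed positions in every MIPS format."""
    return {
        "opcode": (instr >> 26) & 0x3F,  # bits 31-26
        "rs": (instr >> 21) & 0x1F,      # bits 25-21
        "rt": (instr >> 16) & 0x1F,      # bits 20-16
    }


if __name__ == "__main__":
    # add $t0, $s1, $s2 (R-format) and lw $t0, 32($s3) (I-format)
    for word in (0x02324020, 0x8E680020):
        print(hex(word), decode_fields(word))
```

The same shifts and masks work for both the R-format add and the I-format lw, which is what lets register fetch overlap with decode.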
Check whether you know all the key terms (see key terms at the end of chapter)
============================================================================