Speeding Up Computation: The Role of Pipelining in Modern CPUs
The Stages of Pipelining: A Deep Dive into Instruction Flow
In a traditional (non-pipelined) processor, instructions are executed sequentially: each instruction completes fully before the next one begins. As a result, some hardware resources sit idle during different phases. For example, the hardware for the Fetch stage is idle while an instruction is being decoded, and during a memory access most of the datapath is idle.
We want more concurrency to get higher instruction throughput.
Pipelining:
Pipelining is a microarchitectural technique in which the execution of multiple instructions is overlapped. It divides instruction processing into stages and provides enough hardware resources so that a different instruction can occupy each stage at the same time.
Instructions that are consecutive in program order are processed in consecutive stages.
This increases instruction processing throughput (instructions per cycle, i.e., 1/CPI).
The slowest stage determines the throughput.
One issue with pipelining arises when stages take unequal amounts of time: the clock period must be long enough for the slowest stage, so the faster stages are given that same longer cycle and sit partially idle, rather than stalling the pipeline.
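As a back-of-the-envelope sketch of this throughput argument (the stage latencies below are hypothetical, not from the article), the following compares total execution time with and without pipelining; the pipelined clock period is set by the slowest stage:

```python
# Hypothetical stage latencies in nanoseconds (illustrative numbers only).
stage_latencies = {"Fetch": 2, "Decode": 1, "Execute": 3, "Memory": 4, "Writeback": 1}

def non_pipelined_time(n_instructions, latencies):
    # Without pipelining, each instruction passes through every stage serially.
    return n_instructions * sum(latencies.values())

def pipelined_time(n_instructions, latencies):
    # The clock period must accommodate the slowest stage; once the pipeline
    # is full, one instruction completes per cycle.
    period = max(latencies.values())
    n_stages = len(latencies)
    return (n_stages + n_instructions - 1) * period

print(non_pipelined_time(1000, stage_latencies))  # 11000 ns
print(pipelined_time(1000, stage_latencies))      # 4016 ns
```

Even with one stage (Memory, 4 ns) much slower than the rest, overlapping instructions cuts total time by more than half in this example; balancing the stage latencies would improve it further.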
Consider an assembly line in a factory where different workers perform different tasks to assemble a product. In a similar way, each stage of the pipeline handles a part of the instruction, and multiple instructions move through the pipeline like products on an assembly line.
For example:
While instruction 1 is being executed, instruction 2 can be decoded, and instruction 3 can be fetched.
As a result, the processor can complete one instruction per clock cycle after the pipeline is full, greatly improving performance.
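The overlap described above can be sketched as a small simulation (a hypothetical three-stage pipeline, with stage names chosen to match the example):

```python
STAGES = ["Fetch", "Decode", "Execute"]

def pipeline_schedule(n_instructions):
    # Returns, per clock cycle, which instruction occupies each stage.
    # Instruction i enters Fetch at cycle i and advances one stage per cycle.
    schedule = []
    total_cycles = n_instructions + len(STAGES) - 1
    for cycle in range(total_cycles):
        row = {}
        for s, stage in enumerate(STAGES):
            instr = cycle - s
            if 0 <= instr < n_instructions:
                row[stage] = f"I{instr + 1}"
        schedule.append(row)
    return schedule

for cycle, row in enumerate(pipeline_schedule(3)):
    print(cycle, row)
```

At cycle 2 the schedule shows exactly the situation in the text: I3 in Fetch, I2 in Decode, and I1 in Execute, all at once.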
Types of Pipelines
There are different types of pipelines used in processors, including:
Instruction Pipeline: Focuses on improving the execution of instructions.
Arithmetic Pipeline: Used for executing arithmetic operations in parallel, particularly in floating-point calculations.
Superpipeline: Increases the number of pipeline stages, allowing for even more overlap between instruction stages.
Superscalar Pipeline: Allows multiple instructions to be issued in a single clock cycle, further increasing throughput.
Pipeline Hazards
Pipelining is effective, but it comes with its own problems. Pipelining increases throughput but can also increase the latency of individual instructions. It also introduces the following hazards:
Data Hazards: Occur when an instruction depends on the result of a previous instruction. For example, if instruction 2 needs the result of instruction 1, but instruction 1 has not yet completed, instruction 2 must wait, causing a stall in the pipeline.
Control Hazards: Arise from branch instructions (e.g., if-else statements). If the processor predicts the branch incorrectly, it may fetch the wrong instructions, leading to wasted cycles and the need to discard those instructions.
Structural Hazards: Occur when hardware resources (e.g., memory or execution units) are not available for an instruction. This can happen when multiple instructions need to use the same resource simultaneously.
Data dependences:
Flow dependence: Occurs when an instruction reads a value produced by a previous instruction. Violating a flow dependence leads to a read-after-write (RAW) hazard.
Output dependence: Two instructions write the same register; violating it leads to a write-after-write (WAW) hazard.
Anti-dependence: An instruction writes a register that an earlier instruction reads; violating it leads to a write-after-read (WAR) hazard.
Of these, only the flow dependence is called a true dependence. The other two are easier to handle, since they arise only from the reuse of register names rather than from an actual flow of data.
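A minimal sketch of classifying these three dependence types, assuming each instruction is described by a destination register and a set of source registers (the helper and register names are illustrative):

```python
def classify_dependences(first, second):
    # Each instruction is (destination_register, set_of_source_registers).
    # Returns the hazards that can arise when `second` follows `first`
    # in program order.
    d1, s1 = first
    d2, s2 = second
    hazards = []
    if d1 in s2:
        hazards.append("RAW")   # flow (true) dependence
    if d1 == d2:
        hazards.append("WAW")   # output dependence
    if d2 in s1:
        hazards.append("WAR")   # anti-dependence
    return hazards

# add r1, r2, r3  followed by  sub r4, r1, r5  -> RAW on r1
print(classify_dependences(("r1", {"r2", "r3"}), ("r4", {"r1", "r5"})))  # ['RAW']
```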
Methods to handle flow dependences:
Flow dependences are hard to eliminate. There are five fundamental ways to handle them:
Detect and wait until the value is available in the register file.
Detect and forward/bypass the data to the dependent instruction.
Detect and eliminate the dependence at the software (compiler) level.
Predict the needed values, execute speculatively, and verify (we will cover this in detail in Branch Prediction).
Do something else in the meantime, with no need to detect anything (we will cover this under fine-grained multithreading).
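The second option, detect and forward, can be sketched as follows (a simplified model with a single bypass path from the Execute stage; the register names and values are hypothetical):

```python
def read_operand(reg, register_file, execute_result):
    # execute_result is (destination_register, value) for the instruction
    # currently finishing Execute; forwarding its result lets the dependent
    # instruction proceed without waiting for writeback.
    if execute_result is not None and execute_result[0] == reg:
        return execute_result[1]        # bypass path
    return register_file[reg]           # normal register-file read

regs = {"r1": 0, "r2": 7}
# The instruction in Execute has just produced r1 = 42 but not written it back.
print(read_operand("r1", regs, ("r1", 42)))  # 42 (forwarded)
print(read_operand("r2", regs, ("r1", 42)))  # 7  (from the register file)
```

A real bypass network compares against every in-flight destination register, not just one, but the principle is the same: use the freshest value wherever it currently lives.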
Interlocking: The detection of dependences between instructions in a pipelined processor in order to guarantee correct execution.
Interlocking can be done in software or in hardware; for higher efficiency it is done in hardware.
Approaches to interlocking:
Each register in the register file has a valid bit associated with it; an instruction that will write to a register clears that register's valid bit when it issues and sets it again on writeback.
An instruction in the decode stage checks whether all of its source and destination registers are valid.
If YES, there is no need to stall the pipeline.
If NO, stall the instruction until the registers become valid.
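The valid-bit scheme can be sketched as follows (the register names and the `issue`/`writeback` helpers are hypothetical):

```python
# Valid bit per register: True means the register-file value is up to date.
valid = {"r1": True, "r2": True, "r3": True, "r4": True}

def issue(dest, sources):
    # Decode stage: stall unless every source and the destination are valid.
    if not all(valid[r] for r in sources) or not valid[dest]:
        return "stall"
    valid[dest] = False   # the in-flight write clears the destination's bit
    return "issue"

def writeback(dest):
    valid[dest] = True    # writeback sets the bit again

print(issue("r1", ["r2", "r3"]))  # issue (clears r1's valid bit)
print(issue("r4", ["r1"]))        # stall (r1 not yet written back)
writeback("r1")
print(issue("r4", ["r1"]))        # issue
```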
Another approach uses special comparison logic that checks whether an instruction in a later pipeline stage is going to write to any source register of the instruction currently being decoded.
YES? Stall the pipeline.
NO? No need to stall.
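The comparison-based interlock can be sketched in a few lines (the `must_stall` helper and register names are hypothetical):

```python
def must_stall(decode_sources, inflight_destinations):
    # Compare the decode-stage instruction's source registers against the
    # destination registers of instructions in later pipeline stages.
    return any(src in inflight_destinations for src in decode_sources)

# Instructions in Execute and Memory are writing r1 and r6.
print(must_stall({"r1", "r5"}, {"r1", "r6"}))  # True  -> stall
print(must_stall({"r2", "r5"}, {"r1", "r6"}))  # False -> no stall
```

In hardware this is a set of register-number comparators, one per (source, in-flight destination) pair, whose outputs are ORed together into the stall signal.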
Connect with Me:
GitHub: ranaumarnadeem/HDL
Medium: @ranaumarnadeem
Substack: We Talk Chips
LinkedIn: Rana Umar Nadeem
Tags: #DigitalLogic #CombinationalLogic #Decoders #Verilog #HDL #DigitalDesign #FPGA #ComputerEngineering #TechLearning #Electronics #ASIC #RTL #Intel #AMD #Nvidia #pipelining #interlocking #hazards #data_dependency #branchprediction #fine_grained_multi_threading #computer_organization