I would like to go more into detail about each stage of the pipeline, what they do and how they work. I know I said in the previous installment that I would not go into the details of hazards, but it looks like I will need it to document my choices in the processor architecture. So let's start with that.
Inside our pipeline, the instructions interact with each other.
- An instruction may need a resource currently used by another instruction.
→ structural hazard
- An instruction may depend on the result value of a previous instruction.
→ data hazard
- An instruction may depend on the calculation of a new address for the Program Counter (jumps and branches).
→ control hazard
The most prominent structural hazard in our implementation is the access to the memory. To resolve it, whenever an instruction wants to load or store data during the Mem stage, we have to tell the Fetch stage that it cannot get the next instruction from memory during this cycle and has to stall.
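As a toy illustration (a hypothetical Python model, not our actual hardware), the arbitration rule boils down to: the Mem stage always wins the memory port, and Fetch stalls on every cycle where Mem performs a load or store.

```python
# Hypothetical cycle-by-cycle model of the single-ported memory
# arbitration: the Mem stage has priority on the memory port, so the
# Fetch stage stalls whenever the instruction in Mem is a load or store.

def fetch_stall_cycles(mem_uses_port):
    """Given, per cycle, whether the Mem stage uses the memory port,
    return how many cycles the Fetch stage had to stall."""
    return sum(1 for busy in mem_uses_port if busy)

# Three of five cycles have a load/store in Mem, so Fetch stalls three times.
assert fetch_stall_cycles([True, False, True, True, False]) == 3
```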
This type of hazard could also be solved by using dual-ported RAM, so that both Mem and Fetch stages have concurrent access to the memory, but the price for several megabytes of dual-ported RAM is simply prohibitive.
Another way to resolve this hazard, often used in microcontrollers, is to have different, dedicated memory spaces: one for the instructions and another one for the data. This is perfectly fine when the program to run is well defined beforehand, but not really appropriate in the case of a general purpose processor that needs to load programs into memory on demand.
Interestingly enough, modern processors actually use a similar approach, with distinct caches for instructions and data.
Data hazards occur when an instruction needs, as one of its inputs, the result of a previous instruction that has not gone through the Writeback stage yet.
    addi r4, r0, 30
    addi r5, r4, -12
    addi r6, r4, 25
When the second instruction enters the Decode stage, where it will also read the values of the source registers, the value of the register r4 has not been updated yet inside the register file. Two solutions exist to resolve this type of hazard:
- Wait (stall) until the register file has been updated, or
- Bypass the register file: if the result is available from one of the pipeline registers, get it directly from there.
In our example above, as soon as the first instruction has gone through the Execute stage, the resulting value for r4 is available in the next pipeline register. One clock cycle later it is present in the Writeback register, then after half a clock (more on this later) it gets written to the register file. If we were to stall the pipeline, that would induce a penalty of two clock cycles.
Similarly, the third instruction also depends on the result of the first one. If we were to wait for the correct value of r4 to be present in the register file, only one clock cycle would be needed, though, because the other one was spent executing the middle instruction. But just as we did for the second instruction, we could use a bypass and forward the correct value directly from the Memory stage.

For any instruction after that one, the value for r4 would be present in the register file and there would be no hazard.
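The bypass selection can be sketched as a small priority check, assuming the two forwarding paths described above (the pipeline register after Execute, and the Writeback register). This is an illustrative Python sketch; the names are hypothetical, not our real signals.

```python
# Hypothetical sketch of the forwarding mux control for one source
# register `rs`. `ex_mem_rd` is the destination of the instruction that
# just left Execute; `mem_wb_rd` is the destination of the instruction
# one stage further along. Register r0 is never forwarded (hardwired).

def forward_source(rs, ex_mem_rd, mem_wb_rd):
    """Decide where the value of register `rs` should come from."""
    if rs != 0 and rs == ex_mem_rd:
        return "EX/MEM"    # result just left Execute
    if rs != 0 and rs == mem_wb_rd:
        return "MEM/WB"    # result is in the Writeback register
    return "REGFILE"       # no hazard: the register file is up to date

# addi r5, r4, -12 right after addi r4, r0, 30: forward from EX/MEM.
assert forward_source(4, ex_mem_rd=4, mem_wb_rd=0) == "EX/MEM"
# One instruction later (addi r6, r4, 25): r4 is now in MEM/WB.
assert forward_source(4, ex_mem_rd=5, mem_wb_rd=4) == "MEM/WB"
# An unrelated register is simply read from the register file.
assert forward_source(7, ex_mem_rd=5, mem_wb_rd=4) == "REGFILE"
```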
But does that mean all data hazards are covered with these two bypasses? Now, consider the following example:
    lb   r7, 10(r8)
    addi r8, r7, 42
    xori r9, r7, -1
In this situation, the data for r7 will be available in one of the pipeline registers only after the first instruction has completed the Memory stage. When we have a register dependency on a load instruction, we have no choice but to stall the next instruction (addi in this case) for one clock cycle. After that, the correct value for r7 will be present in the Writeback register, from which it can be forwarded.
The third instruction does not need to be stalled, and can make use of the bypass from the Writeback register.
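The load-use check itself is small. Here is a hedged Python sketch of it, with hypothetical signal names: stall when the instruction currently in Execute is a load whose destination matches one of the source registers of the instruction in Decode.

```python
# Hypothetical sketch of the load-use hazard detection. `ex_is_load`
# and `ex_rd` describe the instruction currently in the Execute stage;
# `rs1`/`rs2` are the source registers of the instruction in Decode.
# Register r0 never causes a hazard (hardwired to zero).

def must_stall(ex_is_load, ex_rd, rs1, rs2):
    """True when Decode must stall one cycle behind a load."""
    return ex_is_load and ex_rd != 0 and ex_rd in (rs1, rs2)

# lb r7, 10(r8) followed by addi r8, r7, 42: stall for one cycle.
assert must_stall(ex_is_load=True, ex_rd=7, rs1=7, rs2=0) is True
# One cycle later, xori r9, r7, -1 can use the Writeback bypass instead.
assert must_stall(ex_is_load=False, ex_rd=8, rs1=7, rs2=0) is False
```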
Control hazards happen for jump and branch instructions because in either case, we need the result of the Execute stage in order to compute the new value of the program counter.
For conditional branches, the ALU has to do the comparison to know whether the branch is taken or not. This means that, in all cases, the target of a branch instruction will only be known after the instruction has gone through the Execute stage.
For jump (jal) and jump register (jalr), the ALU computes either the sum of the PC and the offset, or the sum of the PC and the register, respectively.
Note: if the cycle time permits, and the hardware for an adder is fast enough, we could consider the use of a dedicated adder inside the Decode stage of the pipeline. Remember however that our register file is updated during the first half of our clock cycle, which means the register values are guaranteed to be correct only during the second half of the cycle. For now, let's consider that we don't have the time to do it, and that the jump target is indeed only known at the end of the Execute stage.
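To make this write-then-read timing concrete, here is a hypothetical Python model (not our HDL) of a register file that is written during the first half of the clock cycle and read during the second half, so a value written back in a given cycle is already visible to Decode in that same cycle:

```python
# Hypothetical model of the register-file timing described above:
# the Writeback stage writes in the first half of the cycle, the
# Decode stage reads in the second half of the same cycle.

class RegisterFile:
    def __init__(self, nregs=16):
        self.regs = [0] * nregs

    def cycle(self, write_reg=None, write_val=0, read_reg=None):
        # First half of the cycle: Writeback updates the register file
        # (r0 stays hardwired to zero).
        if write_reg is not None and write_reg != 0:
            self.regs[write_reg] = write_val
        # Second half of the cycle: Decode reads its source register,
        # seeing the value written moments earlier.
        if read_reg is not None:
            return self.regs[read_reg]

rf = RegisterFile()
# Writeback stores r4 = 30 and Decode reads r4 within the same cycle:
assert rf.cycle(write_reg=4, write_val=30, read_reg=4) == 30
```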
This means that for both jumps and taken branches, the two instructions that were fed into the pipeline behind them now need to be killed: they no longer correspond to the correct code path.
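The flush can be modeled as replacing the two younger instructions with bubbles (no-ops). Again, this is a hypothetical sketch, not our actual control logic:

```python
# Toy model of the branch/jump flush: when Execute resolves a taken
# branch or a jump, the two younger instructions already sitting in the
# Fetch and Decode stages are turned into bubbles (no-ops).

def flush_on_taken_branch(pipeline):
    """pipeline: [Fetch, Decode, Execute, Mem, Writeback] slots,
    youngest instruction first."""
    flushed = list(pipeline)
    flushed[0] = "bubble"  # instruction in Fetch is killed
    flushed[1] = "bubble"  # instruction in Decode is killed
    return flushed

stages = ["i4", "i3", "beq-taken", "i1", "i0"]
assert flush_on_taken_branch(stages) == ["bubble", "bubble", "beq-taken", "i1", "i0"]
```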
I don't know yet which topic will be covered next time: actual component selection, building blocks, details of some pipeline stages, ...