I would like to go more into detail about each stage of the pipeline, what they do and how they work. I know I said in the previous installment that I would not go into the details of hazards, but it looks like I will need it to document my choices in the processor architecture. So let's start with that.

Pipeline hazards

Inside our pipeline, the instructions interact with each other.

An instruction may need a resource currently used by another instruction.
→ structural hazard
An instruction may depend on the result value of a previous instruction.
→ data hazard
An instruction may depend on the calculation of a new address for the Program Counter (jumps and branches),
→ control hazard

Structural hazards

The most prominent structural hazard in our implementation is the access to the memory. To resolve it, whenever an instruction wants to load or store data into memory during the Mem stage, we have to tell the Fetch stage that it can not get the next instruction from memory during this cycle and has to stall.

This type of hazard could also be solved by using dual-ported RAM, so that both Mem and Fetch stages have concurrent access to the memory, but the price for several megabytes of dual-ported RAM is simply prohibitive.

Another way to resolve this hazard, often used in microcontrollers, is to have different, dedicated memory spaces: one for the instructions and another one for the data. This is perfect fine when the program running is well defined beforehands, but not really appropriate in the case of a general purpose processor that needs to load the programs into memory on demand.
Interestingly enough, modern processors do actually use a similar approach, by using distinct caches for instructions and data.

Data hazards

Data hazards occur when an instruction needs as one of its input the result of a previous instruction that has not gone through the Writeback stage yet.

  addi r4, r0, 30
  addi r5, r4, -12
  addi r6, r4, 25

When the second instruction enters the Decode stage, where it will also read the values of the source registers, the value of the register r4 has not been updated yet inside the register file. Two solutions exist to resolve this type of hazard:

Wait (stall) until the register file has been updated, or
Bypass the register file: if the result is available from one of the pipeline registers, get it directly from there.

In our example above, as soon as the first instruction has gone through the Execute stage, the resulting value for r4 is available in the next pipeline register. One clock cycle later it is present in the Writeback register, then after half a clock (more on this later) it gets written to the register file. If we were to stall the pipeline, that would induce a penalty of two clock cycles.

Similarly, the third instruction also depends on the result of the first one. If we were to wait for the correct value to be present in the register file for r4, only one clock cycle would be needed, though, because the other one was spent executing the middle instruction. But just as we did for the second instruction, we could use a bypass and forward the correct value directly from the Memory stage.

For any instruction after that one, the value for r4 would be present in the register file and there would be no hazard.

But does that mean all data hazards are covered with these two bypasses ? Now, consider the following example:

  lb   r7, 10(r8)
  addi r8, r7, 42
  xori r9, r7, -1

In this situation, the data for r7 will be available in one of the pipeline registers only after the first instruction has completed the Memory stage. When we have a register dependency on a load instruction, we have no choice but stall the next instruction (addi in this case) for one clock cycle. After that, the correct value for r7 will be present in the Writeback register, from which it can be forwarded.

The third instruction does not need to be stalled, and can make use of the bypass from the Writeback register.

Control hazards

Control hazards happen for jump and branch instructions because in either case, we need the result of the Execute stage in order to compute the new value of the program counter.

For conditional branches, the ALU has to do the comparison to know wether the branch is taken or not. It means that in all situations, the target of a branch instruction will only be known after the instruction has gone through the Execute stage.

For jump (jal) and jump register (jalr), the ALU computes either the sum of the PC and the offset or the sum of the PC and the register, respectively.

Note: if the cycle time permits, and the hardware for an adder is fast enough, we could consider the use of a dedicated adder inside the Decode stage of the pipeline. Remember however that our register file is updated during the first half of our clock cycle, which means the register values are guaranteed to be correct only during the second half of the cycle. For now, let's consider that we don't have the time to do it, and that the jump target is indeed only known at the end of the Execute stage.

This means that in both jumps and taken branches, the two instructions that were feeded into the pipeline now need to be killed, they do not correspond to the correct codepath anymore.

Next time

I don't know yet what topic will be covered next time. Actual components selection, building blocks, details of some pipeline stages, ...

alrj.org

Astorisc: The pipeline stages and hazards

Pipeline hazards

Structural hazards

Data hazards

Control hazards

Next time