The S/390 pipeline has three main phases: "instruction fetch", "I-unit" processing and "E-unit" processing. Each of these phases is further broken down into smaller steps.
Instruction fetch phase
- AA/AI cycle
- IF1 cycle
- IF2 cycle (concurrent with the I-unit IBR cycle)
I-unit phase
- IBR cycle
- DEC cycle
- AA cycle
- OF1 cycle
- OF2 cycle (concurrent with the E-unit OBR cycle)
E-unit phase
- OBR cycle
- EX cycle
- PA cycle
- EG cycle
- CK cycle
- CMP cycle
The instruction fetch phase begins with an AA/AI cycle, which adds to or increments the address to be fetched (based on the outcome of the previous instruction) and then initiates the fetch request. The instruction is not fetched directly into the CPU; instead, it is fetched into a dedicated instruction cache. In the following IF1 cycle, the fetched instruction is retrieved from the instruction cache, and in the IF2 cycle additional instruction data are retrieved from the instruction cache. Note that the instruction fetch phase can occur far in advance of the I-unit phase (to optimize the pipeline).
The I-unit phase begins with loading the instruction into the I-register (the IBR cycle). Next, the instruction is decoded (DEC cycle); during decoding, operand address modifiers are also retrieved. The AA (address add) cycle then adds the operand addresses and modifiers. Finally, operands are fetched during the OF1 cycle and buffered for the E-unit phase in the OF2 cycle.
The execution (E-unit) phase begins with the OBR cycle, in which a prepared instruction is sent from the I-unit queue to the E-unit. The EX cycle performs the actual execution of the instruction and sets the condition codes. Note that the S/390 has dual I- and E-units that operate in parallel to ensure high reliability. After execution, the PA cycle passes the results of the operation to other areas of the processor that need them for subsequent processing. One place the result is sent is the R-unit, a system register that is not programmer accessible. Next, during the EG cycle, an error correction code (ECC) is generated from the value in the R-unit (i.e. the output of the instruction). The check (CK) cycle then performs a parity check between the computed ECC value and each of the outputs of the E-units to determine whether the outputs match. The final cycle, the CMP cycle, takes the results of the prior instruction and marks them as eligible for forwarding to the L2 cache. This cycle does not happen if a hardware exception or failure occurs during any of the previous cycles.
A visual depiction of the complete pipeline process (in sequence) is shown below. Note that every cycle of the pipeline is the same length and that the same system clock signals all cycles.
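The cycle sequence described above can also be sketched in code. The following Python fragment is only an illustration under simplifying assumptions that are not stated in the text: each instruction enters the pipeline one cycle after the previous one, nothing ever stalls, and the fetch phase runs immediately ahead of the I-unit phase (in practice it can run far ahead). The names `SLOTS` and `timeline` are hypothetical.

```python
# Stage order as listed in the text. IF2 overlaps IBR and OF2 overlaps OBR,
# so each overlapping pair shares a single clock cycle in this model.
SLOTS = [
    ("AA/AI",), ("IF1",), ("IF2", "IBR"),            # instruction fetch phase
    ("DEC",), ("AA",), ("OF1",), ("OF2", "OBR"),     # I-unit phase
    ("EX",), ("PA",), ("EG",), ("CK",), ("CMP",),    # E-unit phase
]


def timeline(n_instructions):
    """Return {cycle: [(instruction index, stage label), ...]}.

    Assumption (not from the text): instruction i enters the pipeline at
    cycle i and advances one slot per cycle without ever stalling.
    """
    table = {}
    for i in range(n_instructions):
        for offset, stages in enumerate(SLOTS):
            table.setdefault(i + offset, []).append((i, "+".join(stages)))
    return table


if __name__ == "__main__":
    # Print which stage each of three instructions occupies in every cycle.
    for cycle, work in sorted(timeline(3).items()):
        cells = ", ".join(f"i{idx}:{stage}" for idx, stage in work)
        print(f"cycle {cycle:2d}: {cells}")
```

Because every cycle has the same length, the table advances uniformly: in this idealized model an instruction occupies twelve clock cycles from AA/AI to CMP, and a new instruction can complete every cycle once the pipeline is full.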
On branch prediction
The S/390, like many other processors, was required to support a legacy instruction set. This condition places some constraints on pipeline design. For example, the S/390 does not perform multiple branch target execution. Instead, it relies heavily on correct branch prediction and calculation of branch targets.
The reason for this design approach is the scarcity of condition code registers. The S/390 has a two-bit condition code register (CC) that is set by many different opcodes, and all conditional branches test the status of the condition code flags. In other words, in practical use, it is difficult to preserve the state of the condition code flags because they are a highly contested resource. Many designs begin loading branch targets early in the stages of a conditional branch instruction. However, the S/390 does not do this because doing so would require preserving the condition code flags for longer. Instead, the S/390 loads conditional branch targets into the pipeline almost immediately after it begins loading the instruction that will set the condition codes. This saves design effort by eliminating the need to preserve the state of the condition code flags for a significant period of time. Since branch targets are not pre-fetched, this increases the cost of branch misprediction. To mitigate this problem, the S/390 has powerful facilities not only for branch prediction, but also for the calculation of the branch target. This lowers the frequency of missed predictions and pipeline flushes.
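To make the contention for the condition code concrete, the sketch below models the single two-bit CC and a conditional branch that tests it. It relies on one detail not given in the text above: in the ESA/390 BRANCH ON CONDITION instruction, a four-bit mask selects which CC values cause the branch, with the mask bits corresponding left to right to CC 0, 1, 2 and 3. The function names `compare` and `branch_taken` are hypothetical, and the fragment is a simplified illustration rather than a model of the actual hardware.

```python
def compare(a, b):
    """Set the condition code as a signed compare does:
    0 = equal, 1 = first operand low, 2 = first operand high."""
    if a == b:
        return 0
    return 1 if a < b else 2


def branch_taken(mask, cc):
    """Return True if a conditional branch with this 4-bit mask is taken.

    Mask bit values 8, 4, 2 and 1 select CC 0, 1, 2 and 3 respectively.
    """
    if not 0 <= cc <= 3:
        raise ValueError("condition code must be in 0..3")
    if not 0 <= mask <= 15:
        raise ValueError("mask must be in 0..15")
    return bool(mask & (0b1000 >> cc))


if __name__ == "__main__":
    cc = compare(3, 7)                 # first operand low
    print(branch_taken(0b0100, cc))    # "branch if low" tests that CC value
    cc = compare(9, 9)                 # a later instruction overwrites the CC
    print(branch_taken(0b0100, cc))    # the earlier outcome is no longer visible
```

Since there is only one CC, any later CC-setting instruction destroys the value an earlier branch would have tested, which is the contention the paragraph above describes.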
In the control unit, the key components are the R-unit (an internal register), dual I- and E-units (for instruction decode and execution), and the buffer control element (BCE).
The R-unit provides an 8-bit register address space that covers nearly all control registers, access registers, floating point registers and buffered instruction address registers. In total, there are 128 32-bit and 128 64-bit registers in the R-unit. The R-unit is interconnected using a quadword (128-bit) bus.
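The figures in this paragraph are mutually consistent, which a few lines of arithmetic can confirm. The sketch below only checks the numbers quoted in the text; the constant and function names are hypothetical, and nothing here reflects how register addresses are actually assigned within the R-unit.

```python
ADDRESS_BITS = 8      # width of the R-unit register address (from the text)
REGISTERS_32 = 128    # number of 32-bit registers (from the text)
REGISTERS_64 = 128    # number of 64-bit registers (from the text)
BUS_BITS = 128        # quadword bus width (from the text)


def addressable_entries(address_bits):
    """Number of distinct registers an address of this width can select."""
    return 2 ** address_bits


def total_storage_bits():
    """Total register storage implied by the quoted register counts."""
    return REGISTERS_32 * 32 + REGISTERS_64 * 64


if __name__ == "__main__":
    print(addressable_entries(ADDRESS_BITS))   # entries selectable by 8 bits
    print(REGISTERS_32 + REGISTERS_64)         # registers quoted in the text
    print(total_storage_bits() // 8)           # total storage in bytes
```

An 8-bit address selects 256 entries, matching the 128 + 128 registers quoted, for 1536 bytes of register storage in total. The 128-bit bus is wide enough to carry two 64-bit or four 32-bit values at once, although the text does not say whether transfers are actually packed that way.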
The line between the BCE and all other internal control unit components is 128 bits wide. The BCE provides access to other external storage or I/O devices (memory, disk, etc.) through a 64-kilobyte cache; the BCE is externally connected via a 128-byte bi-directional bus. All memory caches and main storage units interoperate via this 128-byte bus.
The I/E-units receive input from both the BCE and the R-unit, while outputs from the I/E-units are directed to the comparison unit (for ECC checking) and subsequently onto a central bus that can update many registers in various components of the control unit. This is because many elements of the control unit rely on the output of the I/E-units (for effective pipelining, etc.), and a central bus to carry I/E-unit output is the most efficient approach.
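The idea of running duplicated units and checking their outputs before any state is updated can be sketched as follows. This is a deliberately simplified illustration: it replaces the ECC generation and parity check described in the text with a direct equality comparison of the two results, and the `Checkpoint` class, the toy `add32` operation, and the injected fault are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Hypothetical stand-in for the state that checked results update."""
    registers: dict = field(default_factory=dict)


def add32(a, b):
    """Toy execution-unit operation: 32-bit wrapping add."""
    return (a + b) & 0xFFFFFFFF


def faulty_add32(a, b):
    """Same operation with an injected single-bit fault, for demonstration."""
    return add32(a, b) ^ 0x1


def execute_checked(checkpoint, dest, a, b, unit_0=add32, unit_1=add32):
    """Run both units on the same inputs and commit only if they agree.

    Returns True if the result was committed, False on a mismatch.
    Simplification: results are compared directly instead of via ECC.
    """
    r0 = unit_0(a, b)
    r1 = unit_1(a, b)
    if r0 != r1:
        return False                    # mismatch: checkpoint stays untouched
    checkpoint.registers[dest] = r0     # agreement: result becomes visible
    return True


if __name__ == "__main__":
    cp = Checkpoint()
    print(execute_checked(cp, "r1", 0xFFFFFFFF, 2), cp.registers)
    print(execute_checked(cp, "r1", 5, 5, unit_1=faulty_add32), cp.registers)
```

In the second call the units disagree, so the previously committed value is left in place, mirroring the way results are withheld when a failure is detected.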
The 128-byte external bus also integrates with IBM's integrated cluster bus (ICB) technology, allowing 128-byte data exchange (http://www.research.ibm.com/journal/rd/435/rao.html) with externally interfaced systems. Alternatively, up to 32 S/390 machines can be interconnected using ICB technology, sharing a high-speed, 128-byte-wide interconnection link.