Lesson 10 of 48 intermediate

BIU and EU: The Dual-Engine Design

How the 8086 fetches and executes at the same time: an early form of instruction pipelining in a microprocessor

Real-world analogy

Imagine a restaurant with two workers: a waiter (BIU) and a chef (EU). The waiter goes to the market to fetch ingredients and places them in a prep queue (the instruction queue). Meanwhile, the chef grabs ingredients from the queue and cooks dishes. The waiter does not wait for the chef to finish — he keeps fetching ahead. The chef does not wait for the waiter — he grabs whatever is queued. This overlap means meals (instructions) come out faster than if one person did both jobs sequentially.

What is it?

The 8086 processor is internally split into two concurrent units: the Bus Interface Unit (BIU) and the Execution Unit (EU). The BIU handles all communication with external memory and I/O — it fetches instruction bytes into a 6-byte FIFO queue and services data read/write requests. The EU pulls instructions from this queue, decodes them, and executes them using the ALU and registers. By working in parallel, the BIU pre-fetches while the EU executes, creating a primitive instruction pipeline that significantly improves throughput compared to a single fetch-then-execute cycle.
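
The benefit of the overlap can be illustrated with a toy model. The sketch below is not cycle-accurate: the abstract "tick" unit, the one-tick execution times, and the function names sequential_ticks and overlapped_ticks are assumptions made for illustration only. It compares a fetch-then-execute machine with one where a BIU-like producer fills a 6-byte queue while an EU-like consumer drains it.

```python
from collections import deque


def sequential_ticks(instrs, bytes_per_fetch=2):
    """Ticks needed when every instruction is fetched, then executed, in turn."""
    total = 0
    for length, exec_ticks in instrs:
        total += -(-length // bytes_per_fetch)  # ceiling division: fetch ticks
        total += exec_ticks
    return total


def overlapped_ticks(instrs, bytes_per_fetch=2, queue_size=6):
    """Ticks needed when fetching and executing overlap through a byte queue."""
    if any(length > queue_size for length, _ in instrs):
        raise ValueError("an instruction is longer than the queue")
    queue = 0                                   # bytes currently queued
    to_fetch = sum(length for length, _ in instrs)
    pending = deque(instrs)                     # instructions not yet started
    busy = 0                                    # ticks left on the current instruction
    ticks = 0
    while pending or busy:
        ticks += 1
        # EU side: start the next instruction once all its bytes are queued.
        if busy == 0 and pending and queue >= pending[0][0]:
            length, exec_ticks = pending.popleft()
            queue -= length
            busy = exec_ticks
        # BIU side: fetch ahead in the same tick if the queue has room.
        fetched = min(bytes_per_fetch, to_fetch, queue_size - queue)
        queue += fetched
        to_fetch -= fetched
        if busy:
            busy -= 1
    return ticks


# (length in bytes, execution ticks); the lengths match MOV AX,5 / ADD AX,10 /
# SUB AX,3 / MOV BX,AX, while the execution ticks are made up for the sketch.
PROGRAM = [(3, 1), (3, 1), (3, 1), (2, 1)]

if __name__ == "__main__":
    print("sequential:", sequential_ticks(PROGRAM))
    print("overlapped:", overlapped_ticks(PROGRAM))
```

Calling overlapped_ticks(PROGRAM, bytes_per_fetch=1, queue_size=4) gives a rough feel for a narrower bus and a shorter queue under the same simplifications.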

Real-world relevance

The BIU/EU split is a simple form of the overlap that modern CPU pipelines take much further. A laptop CPU has pipeline stages for fetch, decode, rename, schedule, execute, and retire, which apply the same idea at a much finer grain than the 8086's two units. In embedded systems, engineers who program 8086-compatible chips need to understand queue stalls and flushes to write efficient code, especially in real-time applications like motor controllers and sensor systems where every microsecond counts.

Key points

Two Independent Units — The 8086 is internally divided into two functional units that operate concurrently: the Bus Interface Unit (BIU) handles all external bus communication — fetching instructions and reading/writing data. The Execution Unit (EU) decodes and executes instructions using the ALU and registers. They work in parallel, connected by a 6-byte instruction queue.
Bus Interface Unit (BIU) — The BIU manages the 20-bit address bus and 16-bit data bus. It generates physical addresses by combining segment and offset registers, fetches instruction bytes from memory, and handles data read/write requests from the EU. It contains the Instruction Pointer (IP), segment registers (CS, DS, SS, ES), and the instruction queue.
The 6-Byte Instruction Queue — The BIU pre-fetches instruction bytes into a 6-byte FIFO (First-In, First-Out) queue. While the EU is busy executing the current instruction, the BIU fills the queue with upcoming bytes. When the EU needs the next instruction, it grabs it from the queue instead of waiting for a memory fetch — this is the essence of pipelining.
Execution Unit (EU) — The EU contains the ALU (Arithmetic Logic Unit), the Flag register, general-purpose registers (AX, BX, CX, DX), pointer registers (SP, BP), index registers (SI, DI), and the instruction decoder. It pulls instruction bytes from the queue, decodes them, and executes the operation. It has no direct connection to the external bus.
How BIU and EU Cooperate — The BIU and EU share one clock but work independently. The 8086 BIU starts an instruction fetch whenever the bus is free and the queue has room for at least two bytes. The EU executes whenever the queue holds the bytes it needs. When the EU needs to read or write memory or I/O data (as opposed to an instruction fetch), it asks the BIU, which finishes any bus cycle already in progress, performs the data transfer ahead of further pre-fetching, then resumes fetching.
Pipeline Stalls and Flushes — When the EU needs data from memory, the BIU puts pre-fetching on hold and services the data request, so the queue stops filling for that bus cycle. If the queue runs empty, the EU has to wait; this is a pipeline stall. When a control transfer such as JMP, CALL, RET, a taken conditional jump, or an interrupt changes the instruction flow, all pre-fetched bytes become invalid. The queue is flushed and refilled from the new address, and the pre-fetched work is wasted.
BIU Address Generation (Sigma) — The BIU contains an adder (sometimes called Sigma or the address summer) dedicated to computing 20-bit physical addresses. It shifts the selected segment register left by 4 bits and adds the offset. This dedicated adder operates independently of the main ALU in the EU, allowing address computation and arithmetic to happen simultaneously.
8086 vs 8088 Queue Difference — The 8086 has a 6-byte instruction queue and a 16-bit external data bus (fetches 2 bytes per bus cycle). The 8088 has only a 4-byte queue and an 8-bit external bus (fetches 1 byte per cycle). This makes the 8088 slower at pre-fetching, causing the EU to stall more often waiting for instruction bytes.
Why This Design Matters — The BIU/EU split was one of the 8086's key design features and an early microprocessor example of the overlap that modern pipelined, superscalar CPUs rely on. Today's CPUs typically use much deeper pipelines, often more than ten stages, with multiple execution units. The core idea of overlapping fetch and execute was already present in the 8086's two-unit design.
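
Two of the key points above reduce to simple arithmetic, sketched here in Python. The function names physical_address and bus_cycles_to_fill are hypothetical helpers, not part of any 8086 toolchain. The first mirrors what the BIU's address adder computes; the second counts the minimum bus cycles needed to fill an empty queue, assuming every cycle delivers a full bus width of instruction bytes.

```python
def physical_address(segment, offset):
    """20-bit physical address: segment shifted left by 4 bits, plus offset."""
    # The mask models the 20 address lines: a carry out of bit 19 is dropped.
    return ((segment << 4) + offset) & 0xFFFFF


def bus_cycles_to_fill(queue_size, bytes_per_cycle):
    """Minimum bus cycles to fill an empty queue (ceiling division)."""
    return -(-queue_size // bytes_per_cycle)


if __name__ == "__main__":
    # 1234h:0010h and 1235h:0000h name the same physical byte.
    print(hex(physical_address(0x1234, 0x0010)))
    print(hex(physical_address(0x1235, 0x0000)))
    # FFFFh:0010h wraps around to the bottom of the 1 MB address space.
    print(hex(physical_address(0xFFFF, 0x0010)))
    # 8086: 6-byte queue, 2 bytes per cycle. 8088: 4-byte queue, 1 byte per cycle.
    print(bus_cycles_to_fill(6, 2), bus_cycles_to_fill(4, 1))
```

The first two calls show that many segment:offset pairs alias the same physical address, which is a direct consequence of the shift-and-add scheme.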

Code example

; Demonstrating how BIU pre-fetching affects performance
; Sequential arithmetic — BIU stays ahead of EU
.MODEL SMALL
.STACK 100h
.DATA
  val1 DW 1000
  val2 DW 2000
  result DW ?

.CODE
MAIN PROC
  MOV AX, @DATA
  MOV DS, AX

  ; --- Fast: EU works from queue, BIU fetches ahead ---
  MOV AX, 5        ; EU executes, BIU pre-fetches
  ADD AX, 10       ; likely already in queue
  SUB AX, 3        ; likely already in queue
  MOV BX, AX       ; likely already in queue

  ; --- Slower: Memory access forces BIU to pause fetch ---
  MOV AX, [val1]   ; BIU must do data read (stall fetch)
  ADD AX, [val2]   ; another data read (stall again)
  MOV [result], AX  ; data write (stall again)

  ; --- Slowest: Jump flushes the entire queue ---
  JMP continue      ; queue flushed: up to 6 pre-fetched bytes wasted
  NOP               ; may be pre-fetched, but never executed
  NOP
continue:
  MOV AH, 4Ch
  INT 21h
MAIN ENDP
END MAIN

Line-by-line walkthrough

1. MOV AX, @DATA / MOV DS, AX — standard data segment setup. The BIU fetches these instruction bytes while the EU is idle at startup
2. MOV AX, 5 / ADD AX, 10 / SUB AX, 3 / MOV BX, AX — register-only operations. The EU executes each from the queue while the BIU freely pre-fetches ahead. No bus conflicts because these instructions only use internal registers
3. MOV AX, [val1] — the EU asks the BIU to read val1 from memory. The BIU pauses pre-fetching, performs the data read, delivers the value to the EU, then resumes fetching
4. ADD AX, [val2] — another memory read. The BIU again pauses pre-fetching to service this request. If the queue ran low during the previous stall, the EU may have to wait
5. MOV [result], AX — a memory write. The BIU handles the write cycle, again pausing instruction pre-fetch
6. JMP continue — the EU decodes a jump. The instruction queue is completely flushed because all pre-fetched bytes are for the wrong address. The BIU must start fresh from the 'continue' label
7. NOP / NOP — these instructions were sequentially next in memory and may have been in the queue, but the JMP discarded them. They are never executed
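8. MOV AH, 4Ch / INT 21h — program termination via DOS. For INT 21h the BIU pushes the flags, CS, and IP onto the stack and reads the handler's IP and CS from the interrupt vector table entry at 0000:0084h (4 × 21h). The queue is flushed before fetching resumes at the handler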
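
The walkthrough can be condensed into a small bookkeeping table. The sketch below is a hand-written tally, not a measurement: the PROGRAM table and the tally function are illustrative, and the counts assume that every word operand and the stack sit at even addresses, so each word transfer takes a single bus cycle. Only instructions on the executed path are listed, which is why the two NOPs are absent.

```python
# (instruction, data reads, data writes, flushes the queue?)
PROGRAM = [
    ("MOV AX, @DATA",    0, 0, False),
    ("MOV DS, AX",       0, 0, False),
    ("MOV AX, 5",        0, 0, False),
    ("ADD AX, 10",       0, 0, False),
    ("SUB AX, 3",        0, 0, False),
    ("MOV BX, AX",       0, 0, False),
    ("MOV AX, [val1]",   1, 0, False),
    ("ADD AX, [val2]",   1, 0, False),
    ("MOV [result], AX", 0, 1, False),
    ("JMP continue",     0, 0, True),
    ("MOV AH, 4Ch",      0, 0, False),
    # INT pushes FLAGS, CS, IP (3 writes) and reads the new IP and CS (2 reads).
    ("INT 21h",          2, 3, True),
]


def tally(program):
    """Sum the data bus transfers and queue flushes over the executed path."""
    return {
        "data_reads": sum(reads for _, reads, _, _ in program),
        "data_writes": sum(writes for _, _, writes, _ in program),
        "flushes": sum(1 for _, _, _, flush in program if flush),
        "register_only": sum(
            1 for _, reads, writes, flush in program
            if reads == 0 and writes == 0 and not flush
        ),
    }


if __name__ == "__main__":
    for name, count in tally(PROGRAM).items():
        print(f"{name}: {count}")
```

Every data read or write in the table is a bus cycle that the BIU cannot spend on pre-fetching, and every flush discards whatever the queue held at that point.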

Spot the bug

; Programmer expects this to be fast
; because all values fit in registers
MOV AX, [array]     ; read from memory
MOV BX, [array+2]   ; read from memory
ADD AX, BX          ; register add
MOV [result], AX     ; write to memory
JMP done             ; jump forward
done:
  MOV AH, 4Ch
  INT 21h

Hint

Count how many times the BIU must pause pre-fetching. Is this really a 'register-fast' routine?

Answer

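Despite the programmer's comment, this routine is not register-only. It needs three data bus cycles (two memory reads and one memory write), each of which puts pre-fetching on hold, and then a JMP flushes the queue. Only ADD AX, BX runs purely from the queue. The memory accesses are needed to load the operands and store the result, but the JMP is pure waste: done is the very next instruction, so execution would fall through to it anyway. Delete the JMP to avoid the flush. In longer routines, keep values in registers across several computations instead of reloading them from memory.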
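
The cost of a queue flush can be estimated with a little arithmetic. After a jump the queue is empty, so the EU cannot continue until the first instruction at the target has been fetched. The sketch below counts the bus cycles that takes; refill_bus_cycles is a hypothetical helper name and the addresses in the example are made up. It relies on one property of the 16-bit bus: the 8086 fetches words at even addresses, so a jump to an odd address gets only one useful byte from its first fetch.

```python
def refill_bus_cycles(target_address, instr_length, bus_bytes=2):
    """Bus cycles before the first instruction at a jump target is fully queued."""
    if bus_bytes == 2 and target_address % 2 == 1:
        # Odd target on a 16-bit bus: the first cycle yields one useful byte,
        # the remaining bytes arrive two per cycle.
        remaining = instr_length - 1
        return 1 + -(-remaining // 2)
    # Even target, or an 8-bit bus: plain ceiling division.
    return -(-instr_length // bus_bytes)


if __name__ == "__main__":
    # MOV AH, 4Ch is a 2-byte instruction; try it at an even and an odd target.
    print(refill_bus_cycles(0x0100, 2))
    print(refill_bus_cycles(0x0101, 2))
    # The same instruction on an 8-bit bus such as the 8088's.
    print(refill_bus_cycles(0x0101, 2, bus_bytes=1))
```

Each bus cycle takes at least four clock cycles on the 8086, so even the best case adds a visible delay for a jump that does no useful work.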

Explain like I'm 5

Imagine you are building with LEGO bricks. If you had to walk to the shelf, grab a brick, walk back, place it, then walk back for the next brick — it would take forever. Now imagine your friend stands by the shelf and keeps passing you bricks while you build. You never stop building, and your friend never stops grabbing. That is what the BIU and EU do: one keeps fetching instructions while the other keeps executing them.

Fun fact

The 6-byte size of the 8086's instruction queue fits the instruction set well. An 8086 instruction is 1 to 6 bytes long (not counting prefixes), so the queue can hold at least one complete instruction ahead. A commonly cited explanation for not making the queue larger is that the gain would be small: jumps flush the queue, and memory-accessing instructions take bus cycles away from pre-fetching, so the queue rarely gets far ahead.

Hands-on challenge

Draw a timing diagram in which the BIU and EU operate in parallel, using bus cycles as the time unit (each 8086 bus cycle takes at least four clock cycles). Show a scenario with: (1) three register-only instructions where the queue stays full, (2) a MOV AX, [mem] instruction where the BIU pauses pre-fetching to service the data read, and (3) a JMP that flushes the queue. Label which unit is active in each bus cycle and what it is doing.
