Lesson 7 of 48 intermediate

Clock, T-States, and Machine Cycles

How time is measured inside a processor — the rhythm that synchronizes every operation


Real-world analogy

Think of the CPU clock as a metronome in an orchestra. Every musician (register, ALU, bus) must act in sync with the beat. One tick of the metronome is a T-state — the smallest unit of time. Several ticks together form a measure (machine cycle), like reading a note from the sheet music or playing a chord. A complete musical phrase (instruction cycle) might need several measures. The faster the metronome, the faster the orchestra plays — that's clock speed.

What is it?

The CPU clock is a precise square-wave signal that synchronizes all internal operations. A T-state is one clock period, the smallest unit of CPU time. A machine cycle (bus cycle) is a group of at least four T-states, T1 through T4 plus any wait states, that performs one bus operation such as a memory read, memory write, I/O read, or I/O write. An instruction cycle encompasses all the machine cycles needed to completely execute one instruction. The 8086 at 5 MHz has a T-state of 1 / 5 MHz = 200 nanoseconds, and instructions range from 2 to 200+ T-states depending on complexity.
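The T-state duration is simply the reciprocal of the clock frequency. A minimal Python sketch of that conversion follows; the function names `t_state_ns` and `instruction_time_ns` are illustrative only and do not come from any library.

```python
def t_state_ns(clock_hz: float) -> float:
    """Duration of one T-state (one clock period) in nanoseconds."""
    if clock_hz <= 0:
        raise ValueError("clock frequency must be positive")
    return 1e9 / clock_hz


def instruction_time_ns(t_states: int, clock_hz: float) -> float:
    """Time taken by an instruction that needs `t_states` clock periods."""
    return t_states * t_state_ns(clock_hz)


if __name__ == "__main__":
    print(t_state_ns(5_000_000))              # 8086 at 5 MHz -> 200.0
    print(instruction_time_ns(4, 5_000_000))  # a 4 T-state instruction -> 800.0
```

The same two functions work for any clock rate, so they can be reused to budget timing on a faster or slower part.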

Real-world relevance

Clock speed is why you see 'GHz' in CPU specs. A modern 5 GHz processor has T-states of just 0.2 nanoseconds — a thousand times shorter than the 8086's 200 ns. However, clock speed alone does not determine performance — a modern CPU does far more work per T-state through pipelining, superscalar execution, and caching. Understanding T-states and machine cycles is essential for embedded systems where you must guarantee precise timing, such as controlling motors, generating signals, or meeting communication protocol deadlines.

Key points

Code example

; Timing analysis of a complete program on 8086 at 5 MHz
; Each T-state = 200 ns
; Clock counts assume instruction bytes are already in the prefetch queue

; --- Program ---
; Instruction         T-states  Time (ns)  Machine Cycles
; -----------------------------------------------------------
MOV AX, 2000h      ;    4       800       opcode fetch only
MOV DS, AX         ;    2       400       internal
MOV CX, 0005h      ;    4       800       opcode fetch only
MOV BX, 0000h      ;    4       800       opcode fetch only

LOOP_START:
  MOV AX, [BX]     ;   13      2600       fetch + mem read  (8 + EA, EA = 5)
  ADD AX, 0001h    ;    4       800       fetch + internal
  MOV [BX], AX     ;   14      2800       fetch + mem write (9 + EA, EA = 5)
  ADD BX, 0002h    ;    4       800       fetch + internal
  DEC CX           ;    2       400       internal
  JNZ LOOP_START   ;   16      3200       fetch (taken=16, not=4)
                    ;    4       800       (final iteration, not taken)

; Iteration with jump taken:   13+4+14+4+2+16 = 53 T-states
; Final iteration (not taken): 13+4+14+4+2+4  = 41 T-states
; Loop total: 4 iterations * 53 + 1 final * 41 = 253 T-states
; Setup: 4+2+4+4 = 14 T-states
;
; Total: 14 + 253 = 267 T-states
; Total time: 267 * 200 ns = 53,400 ns = 53.4 microseconds

Line-by-line walkthrough

  1. Our timing analysis starts with the setup instructions. MOV AX, 2000h takes 4 T-states: the opcode and the immediate value both arrive as part of the instruction fetch, so no separate memory data read is needed.
  2. MOV DS, AX takes only 2 T-states. It is an internal register-to-segment-register transfer, requiring no bus activity beyond the instruction fetch already in the queue.
  3. Inside the loop, MOV AX, [BX] takes 13 T-states: 8 for the register-from-memory move plus 5 to compute the effective address for [BX]. The effective address is the offset held in BX; the 20-bit physical address is DS*16 + BX. The instruction then runs one memory read machine cycle (T1-T2-T3-T4) to get the actual data.
  4. ADD AX, 0001h takes only 4 T-states because the immediate value is part of the instruction stream (already in the prefetch queue) and the addition happens internally in the ALU.
  5. MOV [BX], AX takes 14 T-states (9 plus 5 for the effective address), one more than the read. It includes a memory write machine cycle where the data bus carries AX's value out to memory.
  6. DEC CX is blazing fast at 2 T-states: entirely internal, the ALU decrements CX and updates FLAGS.
  7. JNZ LOOP_START takes 16 T-states when the branch is taken (CX is not zero) because it must flush the prefetch queue and fetch from the new target address. On the final iteration, when CX=0, the branch is not taken and costs only 4 T-states.
  8. The total execution time of 53.4 microseconds demonstrates how even simple programs can be precisely timed when you know each instruction's T-state count.
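The address arithmetic behind the loop's memory accesses can be sketched in a few lines of Python. This is an illustration only: the helper names `physical_address` and `loop_addresses` are made up for this example. It uses the values from the program above (DS = 2000h, BX starting at 0 and advancing by 2 for five iterations) and models the 8086's 20-bit address bus by wrapping the result to 20 bits.

```python
def physical_address(segment: int, offset: int) -> int:
    """8086 real-mode address: segment * 16 + offset, wrapped to 20 bits."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must be 16-bit values")
    return ((segment << 4) + offset) & 0xFFFFF


def loop_addresses(ds: int, bx_start: int, count: int, step: int = 2) -> list[int]:
    """Physical addresses touched when BX advances by `step` each iteration."""
    return [
        physical_address(ds, (bx_start + i * step) & 0xFFFF)
        for i in range(count)
    ]


if __name__ == "__main__":
    # Five word accesses, starting at 2000h:0000h
    for addr in loop_addresses(ds=0x2000, bx_start=0x0000, count=5):
        print(f"{addr:05X}h")
```

Every address in this run is even, which matters on the 8086 because a word access at an odd address needs an extra bus cycle.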

Spot the bug

; Timing bug: Delay loop designed for 1 ms delay
; Clock: 5 MHz (T-state = 200 ns)
; Need: 1,000,000 ns / 200 ns = 5000 T-states

MOV CX, 1000     ; load counter
DELAY:
  NOP             ; 3 T-states
  NOP             ; 3 T-states
  DEC CX          ; 2 T-states
  JNZ DELAY       ; 16 T-states (taken)
                  ; Total per loop: 3+3+2+16 = 24 T-states
; Expected: 1000 * 24 = 24000 T-states = 4.8 ms
; BUG: The delay is 4.8 ms, not 1 ms!
Hint:
The loop body takes 24 T-states per iteration. If you need 5000 T-states total and each iteration costs 24, how many iterations do you actually need?
Answer:
Bug: The counter value is wrong. With 24 T-states per iteration and a target of 5000 T-states, you need 5000 / 24 = approximately 208 iterations, not 1000. With CX=1000, the delay is about 1000 * 24 * 200 ns = 4,800,000 ns = 4.8 ms, almost 5 times too long. Fix: Change MOV CX, 1000 to MOV CX, 208. This gives about 208 * 24 = 4992 T-states = 998,400 ns, which is approximately 1 ms. Counted exactly, the final JNZ falls through in 4 T-states instead of 16 and MOV CX adds 4, so the total is 4 + 207 * 24 + 12 = 4984 T-states (996,800 ns). For an exact 1 ms, pad the remaining 16 T-states with a few extra instructions after the loop.
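The iteration count can also be checked mechanically. The Python sketch below uses the per-instruction T-state counts from the delay loop listing. It also counts two details that the 24-per-iteration estimate ignores: the final JNZ is not taken (4 T-states instead of 16), and MOV CX costs 4 T-states once. The constant and function names are illustrative only.

```python
T_STATE_NS = 200       # 8086 at 5 MHz
BODY = 3 + 3 + 2       # NOP + NOP + DEC CX
JNZ_TAKEN = 16         # branch back to DELAY
JNZ_NOT_TAKEN = 4      # fall-through on the last pass
SETUP = 4              # MOV CX, imm16


def loop_t_states(iterations: int) -> int:
    """Exact T-states for setup plus `iterations` passes of the delay loop."""
    if iterations < 1:
        raise ValueError("need at least one iteration")
    taken_passes = (iterations - 1) * (BODY + JNZ_TAKEN)
    final_pass = BODY + JNZ_NOT_TAKEN
    return SETUP + taken_passes + final_pass


def iterations_for(target_t_states: int) -> int:
    """Largest iteration count whose exact total does not exceed the target."""
    per_pass = BODY + JNZ_TAKEN
    fixed = SETUP + BODY + JNZ_NOT_TAKEN
    return max(1, (target_t_states - fixed) // per_pass + 1)


if __name__ == "__main__":
    target = 1_000_000 // T_STATE_NS           # 1 ms -> 5000 T-states
    n = iterations_for(target)
    print(n, loop_t_states(n), target - loop_t_states(n))  # 208 4984 16
    print(loop_t_states(1000) * T_STATE_NS)    # buggy version, in ns: 4798400
```

The exact count gives the same answer, CX = 208, and shows that 16 T-states of padding remain if the delay has to be exactly 1 ms.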

Explain like I'm 5

Imagine a grandfather clock going tick-tock, tick-tock. Each tick-tock is one T-state — the smallest beat of time the CPU knows. To do something simple like picking up a toy, you need a few beats (a machine cycle): tick-put out your hand, tock-grab the toy, tick-pull it back, tock-done! To do something bigger like building a LEGO tower (an instruction), you need several of these grab-and-place actions, each taking a few beats. The faster the clock ticks, the faster you build!

Fun fact

The original 8086 ran at 5 MHz, meaning 5 million clock ticks per second. Today's fastest processors reach about 5 GHz, roughly 1000 times the clock speed. But the real performance difference is even larger because modern CPUs execute multiple instructions per clock cycle (IPC > 1), while the 8086 needed many clocks per instruction (IPC < 1). A modern CPU can be over 100,000 times faster than the 8086 in raw throughput!

Hands-on challenge

Write an 8086 assembly program that creates a precise time delay of approximately 100 microseconds, assuming a 5 MHz clock (200 ns per T-state). Calculate the exact number of T-states your delay loop takes per iteration, determine how many iterations are needed, and verify your math. Show the T-state count for each instruction in the loop. Hint: You need 100,000 ns / 200 ns = 500 T-states of delay.
