GTE Pipeline Timings
Overview
Methodology
Result tables
Patterns
Caveats and methodology limits
Overview
Each GTE instruction has a documented total cycle count, but games have
historically treated the period between issuing a cop2 imm25 opcode and
its completion as opaque - the SDK and most documentation only say "do not
modify input registers while a GTE instruction is in flight". In practice
the GTE samples each input register at a deterministic offset within
that execution window. After the latch, the input register file is no
longer read by the in-flight op and is safe to overwrite.
This page documents, for each (instruction, input register), the smallest
number N of nops between the GTE op and an MTC2/CTC2 to that input
register such that the MTC2/CTC2 does not change the GTE's
output. That N is the practical "instruction slots required between
the cop2 op and the next write to this input" - the actionable number a
developer needs to know if they want to overlap CPU work with GTE work.
All numbers are hardware-verified on a single SCPH-5501 console. Other PSX revisions have not been measured; some 1-2 cycle drift is plausible across silicon revisions.
Methodology
The premise: MTC2/CTC2 does not stall while a GTE instruction is in
flight (the CPU continues issuing instructions immediately). The write
reaches the GTE's register file after a small fixed delay. By varying
the number of CPU instructions between the cop2 op and a perturbing
MTC2/CTC2, the read site of any given input register becomes
visible: the smallest N at which the perturbation no longer affects
the output is the latch boundary.
For each (instruction, register) pair, the test runs:
```
scene_setup          ; reset all GTE inputs to a known baseline
cop2 op_imm          ; fire the GTE instruction
nop x N              ; vary N from 0 upward
mtc2 canary, $reg    ; perturb the target register
nop x 60             ; let the GTE complete
mfc2 ...             ; read RGB FIFO, MAC, IR, FLAG, SXY/SZ FIFOs
```
For control registers, the perturbing write is a CTC2 instead of an MTC2.
A canary-pre sanity check also runs: writing the canary BEFORE the cop2 op confirms that the canary value would change the result if it landed in time during execution. This pinpoints cases where the chosen canary value happens to produce the same saturated output as the baseline (which would otherwise look like a false N=0 boundary).
Each sweep runs twice with the first iteration discarded for icache
warmup, and CPU IRQs are masked across the sweep so a stray interrupt
cannot stretch the gap between the cop2 op and the perturbing MTC2.
The full test source is in pcsx-redux at
src/mips/tests/gte-latency*/.
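
The sweep logic described above can be sketched host-side. This is a minimal Python model, not the actual test source: `ToyGte`, its `latch_boundary` field, and the doubling "pipeline" are hypothetical stand-ins for running the probe on hardware and reading back the GTE outputs.

```python
from dataclasses import dataclass


@dataclass
class ToyGte:
    """Hypothetical stand-in for one (instruction, register) probe."""
    baseline: int        # value scene_setup leaves in the target register
    latch_boundary: int  # made-up N from which a late write is ignored

    def _output(self, reg_value: int) -> int:
        # Toy "pipeline": any deterministic function of the latched input.
        return reg_value * 2

    def run(self, canary=None, nops=0, pre=False) -> int:
        latched = self.baseline
        if canary is not None and (pre or nops < self.latch_boundary):
            latched = canary  # the write landed before the register was sampled
        return self._output(latched)


def find_boundary(gte: ToyGte, canary: int, max_nops: int = 60) -> int:
    """Smallest N at which the perturbing write no longer changes the output."""
    reference = gte.run()
    # canary-pre check: written before the op, the canary must change the output.
    if gte.run(canary=canary, pre=True) == reference:
        raise ValueError("canary is indistinguishable from baseline")
    for n in range(max_nops + 1):
        if gte.run(canary=canary, nops=n) == reference:
            return n
    raise RuntimeError("output still perturbed at max_nops")
```

On hardware, `run` would correspond to one execution of the probe sequence with the outputs read back via MFC2; only `find_boundary` reflects the method itself.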
Result tables
Boundary N is the smallest number of nops between cop2 imm25 and an
MTC2/CTC2 to the listed register at which the perturbation no
longer affects any GTE output. N=0 means the register has already been
read by the time an MTC2/CTC2 placed directly after the cop2 op takes
effect, so the very next instruction may clobber it.
A dash (-) means the instruction does not read that register.
Perspective transforms
| Input register | RTPS (15c) | RTPT (23c) |
|---|---|---|
| VXY0 / VZ0 | 0/0 | 0/0 |
| VXY1 / VZ1 | - | 3/0 |
| VXY2 / VZ2 | - | 2/0 |
| RT11RT12 | 0 | 2 |
| RT13RT21 | 0 | 4 |
| RT22RT23 | 0 | 4 |
| RT31RT32 | 0 | 0 |
| RT33 | 0 | 0 |
| TRX / TRY / TRZ | 0/0/0 | 1/4/1 |
| OFX / OFY | 1/0 | 5/4 |
| H | 1 | 5 |
| DQA | 4 | 7 |
| DQB | 3 | 6 |
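
As a rough illustration of how the RTPS column above might be used when scheduling writes after the op, the sketch below computes how many leading nops a sequence of register writes needs. The helper name `leading_nops` and the `margin` parameter are hypothetical. It also assumes that each earlier write occupies one instruction slot exactly like a nop, which the measurements (taken with nops only) do not establish.

```python
# RTPS column of the table above: boundary N per input register.
RTPS_BOUNDARY = {
    "VXY0": 0, "VZ0": 0,
    "RT11RT12": 0, "RT13RT21": 0, "RT22RT23": 0, "RT31RT32": 0, "RT33": 0,
    "TRX": 0, "TRY": 0, "TRZ": 0,
    "OFX": 1, "OFY": 0, "H": 1, "DQA": 4, "DQB": 3,
}


def leading_nops(boundary, writes, margin=0):
    """Nops to insert between the cop2 op and the first write in `writes`.

    `writes` is the ordered list of registers overwritten after the op.
    Assumption: the write in position k already has k instructions in front
    of it, and each of those counts as one slot, like a nop.
    `margin` adds safety slots on top of every boundary.
    """
    pad = 0
    for slot, reg in enumerate(writes):
        pad = max(pad, boundary[reg] + margin - slot)
    return pad


print(leading_nops(RTPS_BOUNDARY, ["VXY0", "VZ0"]))         # 0
print(leading_nops(RTPS_BOUNDARY, ["DQA", "VXY0", "VZ0"]))  # 4
print(leading_nops(RTPS_BOUNDARY, ["VXY0", "VZ0", "DQA"]))  # 2
```

Under these assumptions, ordering the writes so that the late-boundary registers come last reduces the padding, and a `margin` of 1-2 covers the layout and silicon drift mentioned elsewhere on this page.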
Lighting (single-vertex)
| Input register | NCS (14c) | NCCS (17c) | NCDS (19c) |
|---|---|---|---|
| VXY0 / VZ0 | 0/0 | 0/1 | 0/1 |
| RGBC | - | 3 | 3 |
| L11L12 | 0 | 0 | 0 |
| L13L21 | 0 | 0 | 0 |
| L22L23 | 0 | 0 | 0 |
| L31L32 | 0 | 0 | 0 |
| L33 | 0 | 0 | 0 |
| LR1LR2 | 2 | 2 | 2 |
| LR3LG1 | 1 | 1 | 1 |
| LG2LG3 | 1 | 1 | 1 |
| LB1LB2 | 2 | 2 | 2 |
| LB3 | 3 | 3 | 3 |
| RBK / GBK / BBK | 0/2/1 | 0/2/1 | 0/2/1 |
| RFC / GFC / BFC | - | - | 2/3/4 |
Lighting (triple-vertex)
| Input register | NCT (30c) | NCCT (39c) | NCDT (44c) |
|---|---|---|---|
| VXY0 / VZ0 | 0/2 | 0/2 | 0/0 |
| VXY1 / VZ1 | 0/1 | 0/1 | 0/1 |
| VXY2 / VZ2 | 1/3 | 3/3 | 3/4 |
| RGBC | - | 12 | 15 |
| L11L12 | 0 | 1 | 1 |
| L13L21 | 0 | 0 | 0 |
| L22L23 | 3 | 3 | 3 |
| L31L32 | 0 | 1 | 2 |
| L33 | 0 | 1 | 2 |
| LR1LR2 | 6 | 5 | 5 |
| LR3LG1 | 3 | 3 | 4 |
| LG2LG3 | 8 | 8 | 7 |
| LB1LB2 | 8 | 7 | 7 |
| LB3 | 6 | 5 | 5 |
| RBK / GBK / BBK | 8/9/9 | 9/6/9 | 7/7/7 |
| RFC / GFC / BFC | - | - | 13/14/14 |
Color
| Input register | CC (11c) | CDP (13c) |
|---|---|---|
| RGBC | 0 | 1 |
| IR0 | - | 2 |
| IR1 / IR2 / IR3 | 1/2/2 | 2/3/2 |
| LR1LR2 | 0 | 0 |
| LR3LG1 | 0 | 0 |
| LG2LG3 | 0 | 0 |
| LB1LB2 | 1 | 1 |
| LB3 | 0 | 0 |
| RBK / GBK / BBK | 0/0/0 | 0/0/0 |
| RFC / GFC / BFC | - | 0/2/0 |
Depth-cue
| Input register | DPCS (8c) | DPCT (17c) | DCPL (8c) | INTPL (8c) |
|---|---|---|---|---|
| RGBC | 0 | - | 0 | - |
| RGB0 | - | 4 | - | - |
| RGB1 | - | 4 | - | - |
| RGB2 | - | 0 | - | - |
| IR0 | 1 | 4 | 0 | 1 |
| IR1 / IR2 / IR3 | - | - | 1/0/1 | 0/1/0 |
| RFC / GFC / BFC | 0/0/0 | 1/2/3 | 0/0/0 | 0/0/0 |
Math
| Input register | SQR (5c) | OP (6c) | NCLIP (8c) |
|---|---|---|---|
| IR1 / IR2 / IR3 | 0/0/1 | 0/1/0 | - |
| RT11RT12 | - | 0 | - |
| RT22RT23 | - | 0 | - |
| RT33 | - | 0 | - |
| SXY0 | - | - | 0 |
| SXY1 | - | - | 1 |
| SXY2 | - | - | 1 |
Misc
| Input register | AVSZ3 (5c) | AVSZ4 (6c) | GPF (5c) | GPL (5c) |
|---|---|---|---|---|
| SZ0 | - | 0 | - | - |
| SZ1 / SZ2 / SZ3 | 0/0/0 | 0/0/0 | - | - |
| ZSF3 / ZSF4 | 0 / - | - / 0 | - | - |
| IR0 | - | - | 0 | 0 |
| IR1 / IR2 / IR3 | - | - | 0/0/0 | 0/0/0 |
| MAC1/2/3 | - | - | - | 0/0/0 |
MVMVA
MVMVA is parameterized over (mx, v, cv) and takes 8 cycles regardless of parameter selection. Three documented parameter combinations were probed:
| Input register | (RT, V0, TR) | (LL, V0, BK) | (LC, IR, BK) |
|---|---|---|---|
| VXY0 / VZ0 | 0/0 | 0/2 | - |
| IR1 / IR2 / IR3 | - | - | 1/0/1 |
| Selected matrix | all 0 | max 1 | max 1 |
| TR / BK | 0/0/0 | 0/0/0 | 0/0/0 |
Patterns
The GTE snapshots its inputs early
For every instruction other than RTPT and the triple-vertex lighting ops, every input register has a boundary of N <= 4. The triple-vertex variants go later (up to 7 for RTPT, 9 for NCT, 12 for NCCT, 15 for NCDT), but still finish reading well before the documented cycle count. The GTE essentially snapshots its input register file early in execution and works from internal pipeline storage afterward. From the developer's perspective, the documented per-instruction cycle count is misleading as a "do not touch the inputs" window: the window in which the inputs are actually read is much shorter.
Triple-vertex variants extend the boundary by ~3 nops
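NCT, NCCT, NCDT, and RTPT put V2's boundary at 1-4 nops (RTPT's VZ2 is the exception at 0), against 0-1 for V0 in the single-vertex variants. DPCT has no vertex inputs, but its RGB0 and RGB1 inputs sit at 4, against 0 for RGBC in DPCS. This is consistent with the GTE walking V0 -> V1 -> V2 over the first several cycles before the matrix multiplies start in earnest.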
RGBC for NCCT-class triples latches mid-execution
Unlike the light matrix (boundaries 0-3) and the color matrix and BK (boundaries 3-9), RGBC has a boundary of 12 for NCCT and 15 for NCDT. The first sub-pass's color stage is the read site. The sweep also shows a clean two-step transition: at one N value the V0 result reverts to baseline (V0's color stage finished reading RGBC); at the next N value the V1 and V2 results revert together. This suggests the hardware reads RGBC twice during a triple-vertex op - once for V0, once covering V1+V2 - rather than re-reading per sub-pass.
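
One way to make such a two-step transition visible is to compute a boundary per output instead of one for the whole result. The sketch below does that on synthetic data: `toy_triple` is a hypothetical model with two RGBC read points, and the step positions 11 and 12 are illustrative only (12 mirrors NCCT's RGBC boundary in the table; placing the V0 step at 11 is an assumption, not a tabulated figure).

```python
def per_output_boundaries(sweep, baseline):
    """For each output index, the smallest N from which it matches baseline.

    sweep[n] is the tuple of outputs observed with N = n nops. If an output
    still differs at the last sweep entry, its result is len(sweep).
    """
    boundaries = []
    for i, ref in enumerate(baseline):
        boundary = 0
        for n, outputs in enumerate(sweep):
            if outputs[i] != ref:
                boundary = n + 1  # perturbed at n, so safe no earlier than n + 1
        boundaries.append(boundary)
    return boundaries


def toy_triple(n, read_v0=11, read_v12=12, base=0x80, canary=0x10):
    """Hypothetical model: one RGBC read for V0, one shared by V1 and V2."""
    c0 = canary if n < read_v0 else base
    c12 = canary if n < read_v12 else base
    return (c0, c12, c12)


sweep = [toy_triple(n) for n in range(20)]
print(per_output_boundaries(sweep, (0x80, 0x80, 0x80)))  # [11, 12, 12]
```

A per-vertex re-read would instead produce three distinct boundaries, which is what distinguishes the two hypotheses in the sweep data.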
Depth-queue inputs latch latest among RTPS inputs
DQA / DQB have boundaries of 4 / 3 for RTPS and 7 / 6 for RTPT, later than any other input of those instructions. They are used at the end of the projection pipeline to compute IR0 from depth.
Far Color (FC) latching depends on instruction
DPCS, DCPL, INTPL all latch FC by N=0 (depth-cue is the bulk of these 8-cycle ops, and FC is read very early). NCDS / NCDT spread the reads across the per-vertex depth-cue stages and push the boundary out (NCDT BFC = 14). DPCT shows progressive boundaries 1/2/3 for RFC/GFC/BFC, matching its per-channel pipelining.
Caveats and methodology limits
icache layout sensitivity
Boundaries fluctuate by 1-2 nops between code-layout variants of the test (the alignment of the probe block within the icache page matters). The numbers in the result tables reflect a single test build; treat single-nop differences as noise.
Single-console caveat
All measurements are from one SCPH-5501. Other PSX revisions (PAL, late
SCPH, PSone, PS2 backwards-compat) have not been verified. The
short-instruction N=0 results should be very stable; the long
instructions and triple variants are more likely to drift between
silicon revisions.
Saturation collapse
If the canary perturbation drives the output to the same saturated value as the baseline (e.g. canary saturates IR to 0x7fff which is the same as baseline IR), the test would report a false N=0 boundary. The canary-pre sanity check catches this and forces the test to fail loudly. The canary values used here are chosen smaller than the baseline so the perturbed pipeline produces a distinguishable result.
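
The collapse is easy to reproduce in a toy model. In the sketch below, `saturate_ir` clamps to the signed 16-bit IR range, while the `toy_pipeline` gain stage and the input values are hypothetical, chosen only so that the baseline saturates.

```python
IR_MIN, IR_MAX = -0x8000, 0x7FFF


def saturate_ir(value):
    """Clamp to the signed 16-bit range of IR1..IR3."""
    return max(IR_MIN, min(IR_MAX, value))


def toy_pipeline(x):
    # Hypothetical gain stage that overflows IR for large inputs.
    return saturate_ir(x * 16)


def canary_is_distinguishable(pipeline, baseline_in, canary_in):
    """Mirror of the canary-pre check: does the canary change the output?"""
    return pipeline(canary_in) != pipeline(baseline_in)


# Baseline 0x1000 already saturates to 0x7fff in this toy pipeline.
print(canary_is_distinguishable(toy_pipeline, 0x1000, 0x2000))  # False
print(canary_is_distinguishable(toy_pipeline, 0x1000, 0x0100))  # True
```

The larger canary saturates to the same value as the baseline and would read as a false N=0, while the smaller one stays below the clamp and remains visible.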
"Doesn't affect output" vs "internal latch cycle"
The boundary measures when the GTE finished reading the register as
observable through the output. If a register is read into an internal
pipeline that gets multiplied by zero (e.g., RT12 multiplied against
VY0 = 0), the test reports N=0 because no perturbation can change
the output - even if the GTE is internally still reading R12 later.
The methodology measures the practical "safe to clobber" cycle, not
the internal hardware read schedule.
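
A small worked example of this masking, using the commonly documented form of the first row of the rotate-translate step with sf=1 (MAC1 = (TRX * 0x1000 + RT11*VX0 + RT12*VY0 + RT13*VZ0) >> 12). The function name and the input values are illustrative, not taken from the test.

```python
def rtps_mac1(trx, rt11, rt12, rt13, vx0, vy0, vz0):
    """First row of the rotate-translate step, sf=1, matrix in 1.3.12 fixed point."""
    return (trx * 0x1000 + rt11 * vx0 + rt12 * vy0 + rt13 * vz0) >> 12


BASE_RT12, CANARY_RT12 = 0x1000, 0x0123

# VY0 = 0: RT12 is multiplied by zero, so the canary cannot show up in MAC1.
masked_base = rtps_mac1(0, 0x1000, BASE_RT12, 0, 10, 0, 0)
masked_canary = rtps_mac1(0, 0x1000, CANARY_RT12, 0, 10, 0, 0)
print(masked_base, masked_canary)    # 10 10

# VY0 = 100: the same canary now changes the result.
visible_base = rtps_mac1(0, 0x1000, BASE_RT12, 0, 10, 100, 0)
visible_canary = rtps_mac1(0, 0x1000, CANARY_RT12, 0, 10, 100, 0)
print(visible_base, visible_canary)  # 110 17
```

With a baseline scene in which VY0 = 0, a sweep over RT12 would therefore report N=0 whatever the real read cycle is, which is why the baseline needs non-zero values on every path that should be probed.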
LWC2 excluded
LWC2 is excluded from the methodology. The memory-side latency (DRAM
controller, BIU state) would smear the boundary measurement. MTC2 is
the clean probe because the CPU-to-COP2 path is deterministic.
GAS does not auto-insert nops
Verified by disassembling the probe code: cop2 imm25, then a
.rept N / nop / .endr block, then mtc2
emits exactly N nops between the cop2 op and the mtc2. The
psyq SDK's habit of conservatively padding cop2 with surrounding nops
does not apply to inline assembly using .rept.