Optimizing Assembly Code for dZ80 CPUs

The dZ80 family—Zilog Z80-compatible CPUs and microcontrollers used in retrocomputing, embedded projects, and hobbyist systems—offers a compact, efficient instruction set that rewards careful assembly-level optimization. This article covers practical techniques to improve performance, reduce code size, and manage resources on dZ80 targets. It assumes familiarity with Z80 assembly syntax, registers (A, F, B, C, D, E, H, L, IX, IY, SP, PC), addressing modes, and basic assembler directives.


Why optimize for dZ80?

dZ80 designs often run on constrained hardware: limited clock speeds, small RAM/ROM, and simple peripherals. Optimizing assembly code yields:

  • Faster execution: crucial for real-time tasks, games, and signal processing.
  • Smaller code size: leaves room for additional features and data.
  • Lower power consumption: shorter active CPU time reduces energy usage.
  • Predictable timing: important for hardware interfacing and tight loops.

Understand the instruction timings and sizes

Before optimizing, know the cycle counts and byte sizes for instructions on your specific dZ80 variant. While many timings match classic Z80, some implementations differ—check your CPU’s documentation. General rules:

  • Use 8-bit operations when possible—8-bit arithmetic and loads are smaller and faster than 16-bit equivalents.
  • Avoid repeated multi-byte instructions in tight loops.
  • Favor single-byte instructions (e.g., INC A, DEC A, NOP) where they suffice.

Register usage strategies

Efficient register allocation reduces memory access and instruction overhead.

  • Keep frequently accessed variables in registers (A, B, C, D, E, H, L).
  • Use HL (or IX/IY with offsets) as a pointer to data structures in memory.
  • Reserve a pair (e.g., BC or DE) for loop counters; 8-bit counters can be faster.
  • Save/restore registers sparingly—use PUSH/POP only when necessary due to cost (cycles + bytes). If a routine is leaf-only, avoid saving registers at all.

Example: loop counter in B (8-bit) rather than BC (16-bit) when the count fits in a byte (note that DJNZ with an initial B of 0 loops 256 times).


Optimize loops

Loops are where most cycles are spent. Techniques:

  • Use DJNZ for byte-sized loop counts—it’s compact and efficient (2 bytes; 13 cycles when the branch is taken, 8 when it falls through, on classic Z80).
  • Unroll very hot inner loops if it reduces branching and improves throughput, but balance against code size.
  • Combine operations to reduce overhead: compute values in registers before loop entry rather than inside the loop.
  • Use relative jumps (JR) where possible; they are smaller than absolute JP.

Example loop patterns:

    ; Good: DJNZ-based loop for 8-bit count in B
        LD   B, 100
    loop:
        ; body using A, HL, etc.
        DJNZ loop

For counts above 255, use a nested loop: keep the low byte of the count in B for the inner DJNZ, and hold the remaining iterations in another register (or a 16-bit pair such as DE) for the outer loop.
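One common way to structure such a nested loop is the split-count idiom sketched below. The register choices and the count of 1000 are illustrative; any spare 8-bit register works for the outer count.

```
; Loop 1000 times: 1000 = 3*256 + 232, so B starts at the partial
; count (232) and D at pages + 1 (4) because the low byte is nonzero.
        LD   D, 4          ; outer count
        LD   B, 232        ; first (partial) inner count
inner:  ; ... loop body using A, HL, etc. ...
        DJNZ inner         ; 8-bit inner loop
        DEC  D
        JR   NZ, inner     ; B wrapped to 0, so DJNZ now runs 256 times
```

Total iterations: 232 on the first pass plus 3 full passes of 256, i.e. 1000, with only two extra instructions per 256 iterations of overhead.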


Minimize memory accesses

Memory loads/stores are slower than register ops.

  • Use LD A,(HL) and operate on A instead of reloading frequently.
  • For repeated reads from consecutive addresses, INC HL is cheaper than using indexed addressing repeatedly.
  • Use block transfer routines (LDIR/LDDR) for bulk copies; they’re efficient and often faster than manual loops.

Example: copying N bytes from (HL) to (DE):

    LD BC, N    ; byte count
    LDIR        ; efficient block transfer: copies (HL) to (DE), incrementing
                ; both pointers and decrementing BC until it reaches 0

Be aware of side effects (flags, registers) when using block instructions.


Use IX/IY and offsets for structured data

IX and IY with signed 8-bit offsets are ideal for accessing fields inside structures or arrays without recomputing addresses:

  • Load IX once with base address, then use instructions like LD A,(IX+offset).
  • This reduces instructions needed to compute addresses and keeps code readable.

Note: IX/IY-prefixed instructions carry a prefix byte plus a displacement byte, so the indexed memory forms are typically two bytes longer and slower than their HL equivalents—use them when the addressing convenience outweighs the cost.
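A sketch of field access through IX; the record layout and labels here are hypothetical:

```
; Hypothetical 4-byte record: +0 = x, +1 = y, +2..+3 = score (16-bit)
        LD   IX, player        ; load base address once
        LD   A, (IX+0)         ; read x
        ADD  A, 2
        LD   (IX+0), A         ; write x back
        LD   L, (IX+2)         ; low byte of score
        LD   H, (IX+3)         ; high byte of score
```

The base address is computed once; every field access afterwards is a single instruction rather than an address calculation plus a load.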


Arithmetic and logical optimizations

  • Prefer INC/DEC and ADD/SUB with registers rather than working through memory.
  • Multiply/divide are not native—implement efficient routines:
    • Multiplication: use shift-and-add for small factors; lookup tables for fixed multiplies.
    • Division: use restoring/non-restoring division algorithms or reciprocal multiplication when applicable.
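The shift-and-add approach mentioned above can be sketched as a classic 8x8-to-16-bit unsigned multiply; register assignments follow a common convention but are not the only possibility:

```
; Unsigned 8x8 -> 16-bit multiply, shift-and-add.
; In:  H = multiplicand, E = multiplier (D must be 0).
; Out: HL = H * E.  Uses B.
mul8:   LD   D, 0
        LD   L, D          ; HL = multiplicand in H, product builds in L
        LD   B, 8          ; 8 multiplier bits to process
m_loop: ADD  HL, HL        ; shift product; top bit of multiplicand -> carry
        JR   NC, m_skip
        ADD  HL, DE        ; bit was set: add multiplier into product
m_skip: DJNZ m_loop
        RET
```

Eight iterations of shift-test-add replace a missing hardware multiplier; for a fixed small factor, an inline sequence of ADD HL,HL and ADD HL,DE is smaller and faster still.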

Bit operations:

  • Use BIT, SET, RES for single-bit tests and modifications; they are faster and clearer than masking sequences.
  • Use RLA/RRA/RLCA/RRCA for efficient shifts/rotates with carry handling.
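For example (labels illustrative), single-bit work on memory and registers without any masking constants:

```
        BIT  7, (HL)       ; Z flag set if bit 7 of (HL) is clear; A untouched
        JR   Z, not_set
        SET  0, (HL)       ; turn on bit 0 in place
not_set:
        RES  3, B          ; clear bit 3 of B
```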

Branch prediction and conditional sequences

dZ80s have no branch prediction; a conditional jump simply costs more cycles when taken (e.g. JR cc: 12 taken vs 7 not taken on classic Z80). Reduce the number of taken branches by:

  • Rearranging code so the most common path follows the fall-through case (no jump).
  • Using conditional execution patterns that avoid multiple jumps, e.g., compute a mask and AND rather than branching for small decisions.

Example: prefer

    CP #value
    JR Z, equal_case    ; rarely taken
    ; fall-through is the common path
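For very small decisions, the carry flag can be turned directly into a mask, avoiding the jump entirely. This is a common Z80 idiom; the final AND target here is illustrative:

```
; Branch-free select: after CP, carry is set when A < threshold.
; SBC A,A then yields $FF (A was below) or $00 (A was not), with no jump.
        CP   #threshold
        SBC  A, A          ; A = $FF or $00
        AND  C             ; result = C when below threshold, else 0
```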

Inline small routines and use CALL sparingly

CALL/RET have overhead (17 cycles for CALL and 10 for RET on classic Z80—roughly 27 per round trip). For very short code used only in one place, inline it to save that overhead. Use CALL when code reuse justifies the cost.

Tail-call optimization: if a routine ends by CALLing another routine and doesn’t need to return, replace CALL+RET sequence with JP to save stack and cycles.
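The transformation looks like this (routine names hypothetical; cycle counts assume classic Z80 timings):

```
; Before: call followed immediately by return.
draw:   ; ... body ...
        CALL flush         ; 17 cycles, pushes a return address
        RET                ; 10 cycles

; After: jump instead -- flush's own RET returns to draw's caller.
draw2:  ; ... body ...
        JP   flush         ; 10 cycles, no stack traffic
```

Besides saving cycles, this avoids two bytes of stack use during the callee, which matters in deeply nested or interrupt-heavy code.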


Optimize for code density when ROM-limited

  • Use short forms (e.g., XOR A instead of LD A,0; JR instead of JP) when applicable.
  • Fold constants into instructions (e.g., LD A,n) rather than loading from tables.
  • Use conditional assembly to include only needed features.

Consider compressing rarely used routines into a compressed format if runtime decompression is acceptable.


Use assembler macros and conditional assembly wisely

Macros can reduce source duplication and improve maintainability, but be careful: macros expand inline, increasing code size with every use. Use them for clarity in infrequent paths; use CALLs or shared routines for large repeated code.
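A minimal sketch of the trade-off (macro syntax varies by assembler; this uses the MACRO/ENDM style found in several Z80 assemblers, and the macro itself is hypothetical):

```
PORT_OUT MACRO port, value  ; write a constant to an I/O port
        LD   A, value
        OUT  (port), A
        ENDM

        PORT_OUT $FE, 7     ; expands to 4 bytes inline at each use
```

Two instructions per expansion reads clearly and costs little; a macro wrapping twenty instructions would be better off as a CALLed routine.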

Conditional assembly helps target different dZ80 variants or include/exclude features for size/performance trade-offs.


Profile and measure

Optimizations must be guided by measurements:

  • Use cycle-accurate emulators or hardware timers to profile hotspots.
  • Count cycles for candidate sequences; prefer changes that reduce cycles in hot paths even if they slightly increase size.
  • Verify behavior across edge cases—timing changes can alter hardware interactions.

Example: optimize an inner pixel loop (illustrative)

Unoptimized version (conceptual):

    loop:
        LD A,(HL)      ; load pixel
        AND #mask
        LD (HL),A      ; store back
        INC HL
        DJNZ loop

Optimized:

  • Load multiple pixels into registers if possible.
  • Use LDI/LDIR if copying/transformation applies to blocks.
  • Keep mask in a register and use BIT/RES where appropriate.
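Applying just the last point to the loop above saves cycles in every iteration (a sketch; the improvement figure assumes classic Z80 timings):

```
; Mask held in C, loaded once outside the loop.
        LD   C, #mask      ; constant mask in a register
loop:   LD   A, (HL)
        AND  C             ; register AND: 4 cycles vs 7 for AND with immediate
        LD   (HL), A
        INC  HL
        DJNZ loop
```

Three cycles per pixel is small, but over a screen-sized buffer it adds up, and the loop body also shrinks by one byte.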

Hardware interfacing and timing-sensitive I/O

When toggling ports or waiting for hardware:

  • Use precise instruction timing to produce required pulse widths.
  • Replace NOP chains with tight loops using JR to reduce code size while preserving timing.
  • Disable interrupts only for the shortest critical sections; use EI/DI sparingly.
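A calibrated busy-wait illustrates counting cycles for a pulse width (timings assume classic Z80; the count of 50 is illustrative and must be tuned to your clock and target width):

```
; Delay routine: total ~ 7 + 13*49 + 8 + 10 = 662 cycles
; (plus 17 for the CALL that reaches it).
delay:  LD   B, 50         ; 7 cycles
d_loop: DJNZ d_loop        ; 13 cycles per pass, 8 on the final pass
        RET                ; 10 cycles
```

Because every instruction has a fixed cycle count, the delay is exact and repeatable—provided interrupts are disabled or accounted for during the wait.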

Portability and maintenance

Document assumptions (timings, register usage), and isolate hardware-specific code. Keep a portable core where possible, and add optimized assembly per dZ80 variant as separate modules.


Checklist for dZ80 assembly optimization

  • Profile to find hotspots.
  • Keep hot data in registers.
  • Use DJNZ and relative jumps for compact loops.
  • Prefer block instructions for bulk memory ops.
  • Minimize PUSH/POP and CALL/RET in inner loops.
  • Use IX/IY for structured data access, mindful of overhead.
  • Inline tiny routines where beneficial; reuse larger ones.
  • Test timing-sensitive code on real hardware/emulator.

Optimizing for dZ80 is a balance: speed vs. size vs. clarity. Measure, apply focused changes to hot paths, and keep code readable where possible so future maintenance is feasible.
