Hit the Ceiling – Going Virtual

At the end of April i hit the ceiling. I'm very tall, but that's not the reason – the code size for the z80 system reached 32 kBytes.
I was working on the file system and when it was in a state where it compiled – just half way done – the resulting rom size was only a tiny amount below 32 kB.

So what could i do?

I could remove all test code, and then i could use z88dk which allegedly creates slightly smaller code, but that would probably not really help: I'd just die later.

I could write everything in assembler.

Or i could finish the z80 backend for Vcc, my 'virtual code compiler'.

I couldn't decide on whether to create real z80 code or virtual code for a Forth-style interpreter. So i implemented them both, mostly. While programming i made some measurements.

Compare virtual code with native z80 code, running the opcode test program:

                virtual code    z80 code
total rom size  10935           11455 bytes
= code blob     5811            4428 bytes  
+ test code     5128            7027 bytes
time            3.630s          2.376s

In this non-representative program the z80 test code is 37% bigger than the virtual code and runs in 35% less time. (The 'code blob' is the support library; 'test code' is what grows.)

I have also compiled my serial driver in various versions.

sdcc                1974 bytes
Vcc z80 code        1520 bytes
Vcc virtual code    1169 bytes

The z80 code generated by Vcc is 23% shorter than that of sdcc, and the virtual code is even 40% shorter than sdcc z80 code or 23% shorter than Vcc z80 code.

When i worked on the z80 backend, i was a little bit frustrated by how small a code size reduction i could achieve, even though i used all the 'illegal' tricks, e.g. i use the RST opcodes for the most frequent building blocks to reduce code size.

The code shrink of approx. 25% is simply not enough, because it does not take into account the size of the static support code blob. This is currently at 4428 bytes and i expect a final size of around 8 kB, after adding all int32 code and if i leave out floating point. That is 25% of the rom size of 32 kB. So before i actually save space, the code size must be reduced by at least 25%. And this looks like the maximum i can achieve with my z80 backend. (though 'nothing saved' is only true for code in rom. Any program loaded into ram will see the full size reduction. And i neglect that sdcc pulls in some library code as well…)
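The break-even argument can be put into numbers. What follows is a minimal sketch in C under a simple model that is not taken from the Vcc sources: the rom holds the support blob plus the compiled code, and the compiled code is a fixed fraction of what sdcc would need. The function name and the per-mille parameter are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of Vcc: how many bytes of "sdcc-sized" code
   fit into a rom once the support blob is subtracted.
   shrink_permille = code size reduction in 1/1000 (250 = 25% smaller). */
uint32_t sdcc_equivalent_capacity(uint32_t rom_bytes, uint32_t blob_bytes,
                                  uint32_t shrink_permille)
{
    assert(blob_bytes <= rom_bytes);
    assert(shrink_permille < 1000);
    uint32_t free_bytes = rom_bytes - blob_bytes;
    /* free_bytes of compact code replace free_bytes / (1 - shrink) of sdcc code */
    return free_bytes * 1000u / (1000u - shrink_permille);
}
```

With rom_bytes = 32768, blob_bytes = 8192 and shrink_permille = 250 this returns 32768, exactly what sdcc could fit without any blob, so nothing is gained. With shrink_permille = 400 (the virtual code figure, ignoring its somewhat larger blob) it returns 40960.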

The code shrink of 40% of the virtual code version looks much better, though it will have a slightly larger support code blob. And it will come at a price: Speed…

Before i go into details, here is a comparison of the compiler outputs for a simple function:

Vcc, followed by the same function in C for sdcc:

uint8 avail_out(SerialDevice¢ channel) 
{ 
    return obusz - (channel.obuwi-channel.oburi); 
}

uint8 sio_avail_out(SerialDevice* channel) 
{ 
    return obusz - (channel->obuwi - channel->oburi); 
}

This function determines how much free space is left in a sio output buffer. The Vcc function is a member function. 'channel' is a struct, 'obuwi' = output buffer write index, 'oburi' = output buffer read index, 'obusz' = output buffer size. I hope you get it.
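The subtraction only works if the two indexes are allowed to wrap. A minimal sketch in C, assuming free-running uint8 indexes (an inference from the 8-bit arithmetic, not something stated here) and a struct reduced to the two fields that matter; the names SerialDeviceSketch and sketch_avail_out are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

enum { obusz = 64 };   /* output buffer size, as in the listings below */

/* Assumed layout for illustration only; the real struct has more fields
   (the listings show obuwi at offset 15 and oburi at offset 16). */
typedef struct {
    uint8_t obuwi;   /* output buffer write index, assumed free-running */
    uint8_t oburi;   /* output buffer read index, assumed free-running  */
} SerialDeviceSketch;

uint8_t sketch_avail_out(const SerialDeviceSketch *channel)
{
    /* uint8 arithmetic wraps modulo 256, so the difference stays correct
       even when obuwi has already overflowed and oburi has not. */
    uint8_t used = (uint8_t)(channel->obuwi - channel->oburi);
    assert(used <= obusz);
    return (uint8_t)(obusz - used);
}
```

For obuwi = 10 and oburi = 4 the result is 58. For obuwi = 3 and oburi = 250 the write index has wrapped, 9 bytes are in use and the result is 55.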

sdcc: In the case of such a short function, sdcc creates very good code. But don't be fooled: if it can no longer keep everything in registers, the code becomes ugly… So this is actually not a representative example for sdcc. [25 bytes total]

_sio_avail_out::
    pop    de            ; return address
    pop    bc            ; 'channel'
    push   bc            ; everything back:
    push   de            ;     caller is responsible for cleaning up the stack…
    push   bc
    pop    iy            ; iy = 'channel'
    ld     e,15 (iy)
    ld     l, c          ; superfluous
    ld     h, b          ; superfluous
    ld     bc, #0x0010   ; load into hl instead
    add    hl, bc
    ld     c,(hl)
    ld     a,e
    sub    a, c
    ld     c,a
    ld     a,#0x40
    sub    a, c
    ld     l,a
    ret

hand-coded assembler: This is for the Vcc memory model with 'handles', so i must dereference a pointer to a pointer to the struct data. And as i see by the last instruction, it's for the virtual code machine: [19 bytes total]

sio_avail_out::          ; in: de -> -> channel    
    ex     hl,de         ; hl -> -> channel
    ld     e,(hl)
    inc    hl
    ld     d,(hl)        ; de -> channel
    ld     a,obusz       ; a=obusz
    ld     hl,obuwi
    add    hl,de         ; hl -> channel.obuwi
    sub    a,(hl)        ; a=obusz-obuwi
    inc    hl            ; hl -> oburi
    add    a,(hl)        ; a=obusz-obuwi+oburi
    ld     e,a
    ld     d,0           ; out: de = return value
    jp     next          ; jump to next opcode

Z80 code created by Vcc. It's an early state and there are some optimizations left. It looks poor when compared with the sdcc generated code, but as already said, things become different for functions with more than one line of code. Then this code is still representative but sdcc looks poor too. The first line is a program label, though a little bit longish. :-) But if you compare it with the function's signature then it hopefully makes sense. [total 36 bytes]

SerialDevice.avail_out__12SerialDeviceC_5uint8:
    pop     hl           ; move the return address to the VM's return stack
    call    pushr_hl    
    rst     ivalu8       ; push obusz: 'ivalu8' = immediate uint8 value
    db      64
    push    de
    ld      l,2+2        ; get local variable 'channel'
    rst     lget         ;    'lget' = get local variable
    ld      hl,15        ; get item 'obuwi' at offset 15
    rst     igetu8       ;    'igetu8' = get uint8 struct item
    push    de        
    ld      l,4+2        ; get local variable 'channel'
    rst     lget    
    ld      hl,16        ; get item 'oburi' at offset 16
    rst     igetu8
    pop     hl        
    and     a            ; subtract obuwi - oburi
    sbc     hl,de
    ex      hl,de
    pop     hl
    and     a            ; subtract obusz - (obuwi - oburi)
    sbc     hl,de
    ex      hl,de    
    pop     af           ; discard 2nd value on stack (the 'channel') 
    jp      return       ; get back the return address and return

Virtual code created by Vcc with minimum optimization: [29 bytes total]

SerialDevice.avail_out__12SerialDeviceC_5uint8:
    rst  p_enter         ; the proc is entered in z80 code: switch to virtual code
    dw   IVAL, 64        ; push obusz
    dw   LGET            ; get local variable 'channel'
    db   2
    dw   IGETu8          ; get item 'obuwi' at offset 15
    db   15
    dw   LGET            ; get local variable 'channel'
    db   4
    dw   IGETu8          ; get item 'oburi' at offset 16
    db   16
    dw   SUB             ; subtract obuwi - oburi
    dw   SUB             ; subtract obusz - (obuwi - oburi)
    dw   TOR             ; nip 2nd value on stack (the 'channel') 
    dw   DROP            ;     by temporarily moving the top value to the return stack
    dw   FROMR           ;    and dropping the 'channel'
    dw   RETURN

Virtual code created by Vcc after proper optimization: [20 bytes total]

SerialDevice.avail_out__12SerialDeviceC_5uint8:
    rst  p_enter
    dw   IVALu8          ; uint8 opcode with 1-byte argument
    db   64
    dw   OVER            ; instead of LGET 2
    dw   IGETu8        
    db   15
    dw   OVER2           ; instead of LGET 4
    dw   IGETu8
    db   16
    dw   SUB
    dw   SUB
    dw   NIP0RETURN - 1  ; nip one value (the 'channel') and return
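
To make the optimized listing above concrete, here is a minimal stack interpreter in C that executes the same opcode sequence. It is a sketch, not the Vcc virtual machine: the opcode numbers are invented, opcodes are one byte instead of 2-byte code addresses, OVER2 is read as "copy the third item", and the 'channel' handle is modelled as an index into a table of objects. It assumes well-formed code and does no stack checking beyond one assert.

```c
#include <assert.h>
#include <stdint.h>

/* Opcode numbers are invented for this sketch. */
enum { OP_IVALU8, OP_OVER, OP_OVER2, OP_IGETU8, OP_SUB, OP_NIP_RETURN };

/* The optimized listing, re-encoded with the sketch's opcodes. */
const uint8_t avail_out_code[] = {
    OP_IVALU8, 64, OP_OVER, OP_IGETU8, 15, OP_OVER2, OP_IGETU8, 16,
    OP_SUB, OP_SUB, OP_NIP_RETURN
};

/* Run one procedure. 'arg' is an index into 'objects' standing in for the
   'channel' handle; returns the value left on top of the stack. */
uint16_t run_virtual(const uint8_t *ip, const uint8_t *const *objects, uint16_t arg)
{
    uint16_t stack[16];
    int sp = 0;                          /* number of items on the stack */
    stack[sp++] = arg;

    for (;;) {
        uint8_t op = *ip++;              /* the opcode fetch that costs time */
        assert(sp >= 1 && sp < 16);
        switch (op) {
        case OP_IVALU8:                  /* push the inline byte */
            stack[sp++] = *ip++;
            break;
        case OP_OVER:                    /* copy the 2nd item to the top */
            stack[sp] = stack[sp - 2];
            sp++;
            break;
        case OP_OVER2:                   /* copy the 3rd item to the top */
            stack[sp] = stack[sp - 3];
            sp++;
            break;
        case OP_IGETU8:                  /* replace object by its byte at inline offset */
            stack[sp - 1] = objects[stack[sp - 1]][*ip++];
            break;
        case OP_SUB:                     /* 2nd item minus top item */
            stack[sp - 2] = (uint16_t)(stack[sp - 2] - stack[sp - 1]);
            sp--;
            break;
        case OP_NIP_RETURN:              /* discard the item below the top, return the top */
            return stack[sp - 1];
        default:
            assert(0);                   /* unknown opcode */
            return 0;
        }
    }
}
```

Called with an object whose byte at offset 15 is 10 and whose byte at offset 16 is 4, run_virtual(avail_out_code, objects, 0) returns 58. Every trip around the loop is one opcode fetch, which is where the speed penalty of virtual code comes from.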

One astonishing difference between z80 code and virtual code is: optimization.

When you optimize z80 code, the following equation is true:

codesize = speed

The bigger your code, the higher the speed. Every effort to increase speed results in bigger code.

When you optimize virtual code, this equation is true:

codesize = 1 / speed

Whenever you reduce code size, the speed goes up. This is because the standard method to optimize virtual code is to create 'combi opcodes' for frequently occurring opcode pairs, which eliminates one opcode fetch. As a result it is much more fun to optimize virtual code because you are rewarded twice. :-) Though caveat: the size of the support code blob grows! :-(
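
The 'combi opcode' idea can be sketched as a small peephole pass in C. This is an illustration under assumptions, not Vcc's optimizer: the opcode set is invented, the opcodes are one byte each and take no inline arguments, and the pass ignores jump targets, which a real compiler would have to adjust after shrinking the code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Invented opcode set for this sketch; none of these take inline arguments. */
enum { OP_DUP, OP_SUB, OP_NIP, OP_RETURN,
       OP_NIP_RETURN /* combi opcode for NIP followed by RETURN */ };

/* Rewrite 'code' in place, replacing each NIP RETURN pair by NIP_RETURN.
   Returns the new length. */
size_t fuse_nip_return(uint8_t *code, size_t len)
{
    size_t out = 0;
    for (size_t in = 0; in < len; ) {
        if (in + 1 < len && code[in] == OP_NIP && code[in + 1] == OP_RETURN) {
            code[out++] = OP_NIP_RETURN;   /* one opcode fetch instead of two */
            in += 2;
        } else {
            code[out++] = code[in++];      /* copy everything else unchanged */
        }
    }
    assert(out <= len);
    return out;
}
```

Applied to DUP, SUB, NIP, RETURN the pass returns a length of 3 with NIP_RETURN as the last opcode: the code is one opcode shorter and the interpreter does one fetch less. The price is the extra handler for NIP_RETURN in the support code blob.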

1 comment:

Unknown, 02 June 2016 01:42
A little off-topic here but do you happen to have some source code made public? I would be interested in testing code size generation with z88dk and perhaps looking at ways the generated code could be improved.

You may not be aware but z88dk has its own version of sdcc that currently produces better code than sdcc. I did note in a past blog entry that you found that finding "z88dk_callee" and "z88dk_fastcall" linkages in sdcc was a little odd - well now you know that the two projects have been working together for the past year to give sdcc access to z88dk's library and crts :)

The code improvements over plain sdcc derive mainly from bugfixes that have not yet been applied to sdcc, a much larger peephole rule set, and most importantly, a more complete c library written in assembly language. Another contributor that may affect ROM size in particular is that z88dk can produce an lz77-compressed data section that is stored in ROM, rather than a block of raw bytes that are simply ldir'ed into RAM at startup.

Kio's Hardware Projects

2016-05-29