Hit the Ceiling – Going Virtual
At the end of April i hit the ceiling. I'm very tall, but that's not the reason – the code size for the z80 system reached 32 kBytes.
I was working on the file system and when it was in a state where it compiled – just half way done – the resulting rom size was only a tiny amount below 32 kB.
So what could i do?
I could remove all test code, and then i could use z88dk which allegedly creates slightly smaller code, but that would probably not really help: I'd just die later.
I could write everything in assembler.
Or i could finish the z80 backend for Vcc, my 'virtual code compiler'.
I couldn't decide on whether to create real z80 code or virtual code for a Forth-style interpreter. So i implemented them both, mostly. While programming i made some measurements.
Compare virtual code with native z80 code, running the opcode test program:
                      virtual code   z80 code
    total rom size       10935        11455    bytes
    = code blob           5811         4428    bytes
    + test code           5128         7027    bytes
    time                  3.630s       2.376s
In this non-representative program the z80 test code is 37% bigger than the virtual code and runs in about 35% less time. (The 'code blob' is the support library; 'test code' is what grows.)
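The two percentages can be rechecked from the table figures. This is only a small sketch: the variable names are made up for illustration, and it reads the speed figure as a reduction in run time rather than as a throughput ratio.

```python
# Figures copied from the table above (bytes and seconds).
virtual = {"test_code": 5128, "time": 3.630}
z80 = {"test_code": 7027, "time": 2.376}

size_growth = z80["test_code"] / virtual["test_code"] - 1  # z80 test code vs virtual
time_saved = 1 - z80["time"] / virtual["time"]             # run time saved by z80 code
speedup = virtual["time"] / z80["time"]                    # same data as a throughput ratio

print(f"z80 test code is {size_growth:.0%} bigger")
print(f"z80 run time is {time_saved:.0%} shorter ({speedup:.2f}x throughput)")
```

Expressed as throughput instead of run time, the same measurement is a speedup of roughly 1.5x.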
I have also compiled my serial driver in various versions.
    sdcc                1974 bytes
    Vcc z80 code        1520 bytes
    Vcc virtual code    1169 bytes
The z80 code generated by Vcc is 23% shorter than that of sdcc, and the virtual code is even 40% shorter than sdcc z80 code or 23% shorter than Vcc z80 code.
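For anyone who wants to redo the arithmetic, here is a minimal sketch. The helper name `shorter_by` and the dictionary keys are illustrative only; the byte counts are the ones from the serial driver table.

```python
def shorter_by(new, old):
    """Fraction by which `new` is smaller than `old`."""
    return 1 - new / old

# Serial driver sizes in bytes, from the table above.
sizes = {"sdcc": 1974, "vcc_z80": 1520, "vcc_virtual": 1169}

z80_vs_sdcc = shorter_by(sizes["vcc_z80"], sizes["sdcc"])
virtual_vs_sdcc = shorter_by(sizes["vcc_virtual"], sizes["sdcc"])
virtual_vs_z80 = shorter_by(sizes["vcc_virtual"], sizes["vcc_z80"])

print(f"Vcc z80 vs sdcc:         {z80_vs_sdcc:.1%}")
print(f"Vcc virtual vs sdcc:     {virtual_vs_sdcc:.1%}")
print(f"Vcc virtual vs Vcc z80:  {virtual_vs_z80:.1%}")
```

The middle figure comes out slightly above 40%, so 'about 40%' is a fair rounding.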
When i worked on the z80 backend, i was a little bit frustrated about how little code size reduction i could achieve, though i used all 'illegal' tricks, e.g. i use the RST opcodes for the most frequent building blocks to reduce code size.
The code shrink of approx. 25% is simply not enough, because it does not take into account the size of the static support code blob. This is currently at 4428 bytes and i expect a final size of around 8 kB, after adding all int32 code and if i leave out floating point. That is 25% of the rom size of 32 kB. So before i actually save space, the code size must be reduced by at least 25%. And this looks like the maximum i can achieve with my z80 backend. (though 'nothing saved' is only true for code in rom. Any program loaded into ram will see the full size reduction. And i neglect that sdcc pulls in some library code as well…)
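The break-even argument can be put into numbers. The sketch below is a simplified model, not a measurement: it assumes a fixed 8 kB support blob for both backends (the virtual code blob is expected to be somewhat larger), ignores the library code sdcc pulls in, and the function name `sdcc_equivalent` is made up for illustration.

```python
ROM = 32 * 1024   # total rom size in bytes
BLOB = 8 * 1024   # expected final size of the support code blob (estimate from the text)

def sdcc_equivalent(shrink, blob=BLOB, rom=ROM):
    """Bytes of sdcc-sized code that fit into the rom next to the support blob."""
    return (rom - blob) / (1 - shrink)

for shrink in (0.25, 0.40):
    kb = sdcc_equivalent(shrink) / 1024
    print(f"{shrink:.0%} shrink: room for {kb:.1f} kB of sdcc-sized code")
```

With a 25% shrink the rom holds exactly what plain sdcc would fit into 32 kB, so nothing is gained; with a 40% shrink the same rom holds the equivalent of about 40 kB.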
The code shrink of 40% of the virtual code version looks much better, though it will have a slightly larger support code blob. And it will come at a price: Speed…
Before i go into details, here is a comparison of the compiler outputs for a simple function:
Vcc:
uint8 avail_out(SerialDevice¢ channel) { return obusz - (channel.obuwi-channel.oburi); }
C:
uint8 sio_avail_out(SerialDevice* channel) { return obusz - (channel->obuwi - channel->oburi); }
This function determines how much free space is left in a sio output buffer. The Vcc function is a member function. 'channel' is a struct, 'obuwi' = output buffer write index, 'oburi' = output buffer read index, 'obusz' = output buffer size. I hope you get it.
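To make the index arithmetic concrete, here is a small model of the function in Python. It rests on two assumptions that the text does not state outright: the indexes are free-running uint8 counters that wrap at 256, and obusz is 64 (the value that shows up as 0x40 and 64 in the listings).

```python
OBUSZ = 64  # output buffer size, assumed from the constant in the listings

def avail_out(obuwi, oburi, obusz=OBUSZ):
    """Free bytes in the output buffer, using uint8 arithmetic."""
    used = (obuwi - oburi) & 0xFF   # bytes written but not yet read, wraps at 256
    return (obusz - used) & 0xFF

print(avail_out(10, 4))    # write index 10, read index 4
print(avail_out(2, 250))   # write index has wrapped around past 255
```

Because the subtraction wraps, the result stays correct after the write index has overflowed while the read index has not.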
sdcc: In the case of such a short function, sdcc creates very good code. But don't be fooled: if it can no longer keep everything in registers, the code becomes ugly… So this is actually not a representative example for sdcc. [25 bytes total]
    _sio_avail_out::
        pop  de             ; return address
        pop  bc             ; 'channel'
        push bc             ; everything back:
        push de             ; caller is responsible for cleaning up the stack…
        push bc
        pop  iy             ; iy = 'channel'
        ld   e,15 (iy)
        ld   l, c           ; superfluous
        ld   h, b           ; superfluous
        ld   bc, #0x0010    ; load into hl instead
        add  hl, bc
        ld   c,(hl)
        ld   a,e
        sub  a, c
        ld   c,a
        ld   a,#0x40
        sub  a, c
        ld   l,a
        ret
hand-coded assembler: This is for the Vcc memory model with 'handles', so i must dereference a pointer to a pointer to the struct data. And as i see by the last instruction, it's for the virtual code machine: [19 bytes total]
    sio_avail_out::         ; in: de -> -> channel
        ex   hl,de          ; hl -> -> channel
        ld   e,(hl)
        inc  hl
        ld   d,(hl)         ; de -> channel
        ld   a,obusz        ; a=obusz
        ld   hl,obuwi
        add  hl,de          ; hl -> channel.obuwi
        sub  a,(hl)         ; a=obusz-obuwi
        inc  hl             ; hl -> oburi
        add  a,(hl)         ; a=obusz-obuwi+oburi
        ld   e,a
        ld   d,0            ; out: de = return value
        jp   next           ; jump to next opcode
Z80 code created by Vcc. It's an early state and there are some optimizations left. It looks poor when compared with the sdcc generated code, but as already said, things become different for functions with more than one line of code. Then this code is still representative but sdcc looks poor too. The first line is a program label, though a little bit longish. :-) But if you compare it with the function's signature then it hopefully makes sense. [total 36 bytes]
    SerialDevice.avail_out__12SerialDeviceC_5uint8:
        pop  hl             ; move the return address to the VM's return stack
        call pushr_hl
        rst  ivalu8         ; push obusz: 'ivalu8' = immediate uint8 value
        db   64
        push de
        ld   l,2+2          ; get local variable 'channel'
        rst  lget           ; 'lget' = get local variable
        ld   hl,15          ; get item 'obuwi' at offset 15
        rst  igetu8         ; 'igetu8' = get uint8 struct item
        push de
        ld   l,4+2          ; get local variable 'channel'
        rst  lget
        ld   hl,16          ; get item 'oburi' at offset 16
        rst  igetu8
        pop  hl
        and  a              ; subtract obuwi - oburi
        sbc  hl,de
        ex   hl,de
        pop  hl
        and  a              ; subtract obusz - (obuwi - oburi)
        sbc  hl,de
        ex   hl,de
        pop  af             ; discard 2nd value on stack (the 'channel')
        jp   return         ; get back the return address and return
Virtual code created by Vcc with minimum optimization: [29 bytes total]
    SerialDevice.avail_out__12SerialDeviceC_5uint8:
        rst  p_enter        ; the proc is entered in z80 code: switch to virtual code
        dw   IVAL, 64       ; push obusz
        dw   LGET           ; get local variable 'channel'
        db   2
        dw   IGETu8         ; get item 'obuwi' at offset 15
        db   15
        dw   LGET           ; get local variable 'channel'
        db   4
        dw   IGETu8         ; get item 'oburi' at offset 16
        db   16
        dw   SUB            ; subtract obuwi - oburi
        dw   SUB            ; subtract obusz - (obuwi - oburi)
        dw   TOR            ; nip 2nd value on stack (the 'channel')
        dw   DROP           ; by temporarily moving the top value to the return stack
        dw   FROMR          ; and dropping the 'channel'
        dw   RETURN
Virtual code created by Vcc after proper optimization: [20 bytes total]
    SerialDevice.avail_out__12SerialDeviceC_5uint8:
        rst  p_enter
        dw   IVALu8         ; uint8 opcode with 1-byte argument
        db   64
        dw   OVER           ; instead of LGET 2
        dw   IGETu8
        db   15
        dw   OVER2          ; instead of LGET 4
        dw   IGETu8
        db   16
        dw   SUB
        dw   SUB
        dw   NIP0RETURN - 1 ; nip one value (the 'channel') and return
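To see what the two virtual code versions do, here is a toy stack machine in Python that runs both opcode sequences. It is a rough model and not how the real interpreter works: the opcode semantics are inferred from the comments in the listings (e.g. LGET 2 behaving like OVER), `NIP_RETURN` stands in for the nip-and-return opcode, and the memory layout and struct address are invented for the demonstration.

```python
def run(program, memory, stack):
    """Run a list of (opcode, argument) pairs; return (result, dispatch count)."""
    rstack = []        # return stack, used here only as scratch space
    dispatches = 0
    for op, arg in program:
        dispatches += 1
        if op in ("IVAL", "IVALU8"):    # push an immediate value
            stack.append(arg)
        elif op == "LGET":              # copy the cell 'arg' bytes below the top
            stack.append(stack[-1 - arg // 2])
        elif op == "OVER":              # same effect as LGET 2
            stack.append(stack[-2])
        elif op == "OVER2":             # same effect as LGET 4
            stack.append(stack[-3])
        elif op == "IGETU8":            # replace an address by the byte at address+arg
            stack.append(memory[stack.pop() + arg])
        elif op == "SUB":               # second minus top, truncated to 16 bits
            top = stack.pop()
            stack.append((stack.pop() - top) & 0xFFFF)
        elif op == "TOR":               # move top of stack to the return stack
            rstack.append(stack.pop())
        elif op == "DROP":
            stack.pop()
        elif op == "FROMR":             # move it back from the return stack
            stack.append(rstack.pop())
        elif op == "RETURN":
            return stack.pop(), dispatches
        elif op == "NIP_RETURN":        # drop the cell under the top, then return
            top = stack.pop()
            stack.pop()
            return top, dispatches
        else:
            raise ValueError(f"unknown opcode {op}")
    raise RuntimeError("program ran off its end")

PLAIN = [("IVAL", 64), ("LGET", 2), ("IGETU8", 15), ("LGET", 4), ("IGETU8", 16),
         ("SUB", None), ("SUB", None), ("TOR", None), ("DROP", None),
         ("FROMR", None), ("RETURN", None)]
OPTIMIZED = [("IVALU8", 64), ("OVER", None), ("IGETU8", 15), ("OVER2", None),
             ("IGETU8", 16), ("SUB", None), ("SUB", None), ("NIP_RETURN", None)]

CHANNEL = 32                   # invented address of the struct
memory = bytearray(64)
memory[CHANNEL + 15] = 10      # obuwi
memory[CHANNEL + 16] = 4       # oburi

print(run(PLAIN, memory, [CHANNEL]))
print(run(OPTIMIZED, memory, [CHANNEL]))
```

Both sequences return the same value, but the optimized one needs 8 opcode dispatches instead of 11, on top of being 9 bytes shorter.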
One astonishing difference between z80 code and virtual code is: optimization.
When you optimize z80 code, the following equation is true:
codesize = speed
The bigger your code, the higher the speed. Every effort to increase speed results in bigger code.
When you optimize virtual code, this equation is true:
codesize = 1 / speed
Whenever you reduce code size, the speed goes up. This is because the standard method to optimize virtual code is to create 'combi opcodes' for frequently occurring opcode pairs, which eliminates one opcode fetch. As a result it is much more fun to optimize virtual code because you are rewarded twice. :-) Though caveat: the size of the support code blob grows! :-(
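The 'combi opcode' idea can be sketched as a tiny peephole pass. This is a simplified illustration and not Vcc's optimizer: it looks only at opcode names, ignores their inline arguments, and the fused opcode name is simply the two names joined with '+'.

```python
from collections import Counter

def most_common_pair(ops):
    """Return the adjacent opcode pair that occurs most often."""
    return Counter(zip(ops, ops[1:])).most_common(1)[0][0]

def fuse(ops, pair):
    """Replace every occurrence of `pair` by one combined opcode."""
    fused, i = [], 0
    while i < len(ops):
        if tuple(ops[i:i + 2]) == pair:
            fused.append("+".join(pair))
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused

ops = ["LGET", "IGETu8", "LGET", "IGETu8", "SUB", "SUB",
       "TOR", "DROP", "FROMR", "RETURN"]
pair = most_common_pair(ops)
fused = fuse(ops, pair)

print(pair)
print(len(ops), "opcode fetches before,", len(fused), "after")
```

Each fused pair saves one 2-byte opcode word and one opcode fetch, which is why size and speed improve together; the price is one more handler in the support code blob.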
A little off-topic here but do you happen to have some source code made public? I would be interested in testing code size generation with z88dk and perhaps looking at ways the generated code could be improved.
You may not be aware, but z88dk has its own version of sdcc that currently produces better code than sdcc. I noticed in a past blog entry that you found the "z88dk_callee" and "z88dk_fastcall" linkages in sdcc a little odd. Well, now you know that the two projects have been working together for the past year to give sdcc access to z88dk's library and crts :)
The code improvements over plain sdcc derive mainly from bugfixes that have not yet been applied to sdcc, a much larger peephole rule set, and most importantly, a more complete c library written in assembly language. Another contributor that may affect ROM size in particular is that z88dk can produce an lz77-compressed data section that is stored in ROM, rather than a block of raw bytes that are simply ldir'ed into RAM at startup.