2016-05-29

Hit the Ceiling – Going Virtual

At the end of April i hit the ceiling. I'm very tall, but that's not the reason – the code size of the z80 system reached 32 kBytes.
I was working on the file system, and when it was in a state where it compiled – just half way done – the resulting rom size was only a tiny bit below 32 kB.

So what could i do?

I could remove all test code, and then i could use z88dk which allegedly creates slightly smaller code, but that would probably not really help: I'd just die later.

I could write everything in assembler.

Or i could finish the z80 backend for Vcc, my 'virtual code compiler'.

I couldn't decide whether to create real z80 code or virtual code for a Forth-style interpreter, so i implemented them both, mostly. While programming i made some measurements.

A comparison of virtual code with native z80 code, running the opcode test program:

                virtual code    z80 code
total rom size  10935 bytes     11455 bytes
= code blob      5811 bytes      4428 bytes
+ test code      5128 bytes      7027 bytes
time            3.630 s         2.376 s

In this non-representative program the z80 code is 37% bigger and 35% faster than the virtual code. (The 'code blob' is the support library; the 'test code' is what grows.)

I have also compiled my serial driver in various versions.

sdcc                1974 bytes
Vcc z80 code        1520 bytes
Vcc virtual code    1169 bytes

The z80 code generated by Vcc is 23% shorter than that of sdcc, and the virtual code is even 40% shorter than the sdcc z80 code, or 23% shorter than the Vcc z80 code.

When i worked on the z80 backend, i was a little bit frustrated by how little code size reduction i could achieve, even though i used every 'illegal' trick, e.g. the RST opcodes for the most frequent building blocks.

The code shrink of approx. 25% is simply not enough, because it does not take the size of the static support code blob into account. This blob is currently 4428 bytes, and i expect a final size of around 8 kB after adding all the int32 code, even if i leave out floating point. That is 25% of the 32 kB rom. So before i actually save any space, the code size must shrink by at least 25% – and that looks like the maximum i can achieve with my z80 backend. (Though 'nothing saved' is only true for code in rom. Any program loaded into ram will see the full size reduction. And i neglect that sdcc pulls in some library code as well…)

The code shrink of 40% of the virtual code version looks much better, though it will have a slightly larger support code blob. And it will come at a price: Speed…

Before i go into details, here is a comparison of the compiler outputs for a simple function:

Vcc:

uint8 avail_out(SerialDevice¢ channel) 
{ 
    return obusz - (channel.obuwi-channel.oburi); 
}

C:

uint8 sio_avail_out(SerialDevice* channel) 
{ 
    return obusz - (channel->obuwi - channel->oburi); 
}

This function determines how much free space is left in a sio output buffer. The Vcc function is a member function. 'channel' is a struct, 'obuwi' = output buffer write index, 'oburi' = output buffer read index, 'obusz' = output buffer size. I hope you get it.

sdcc: In the case of such a short function, sdcc creates very good code. But don't be fooled: as soon as it can no longer keep everything in registers, the code becomes ugly… So this is actually not a representative example of sdcc. [25 bytes total]

_sio_avail_out::
    pop    de            ; return address
    pop    bc            ; 'channel'
    push   bc            ; everything back:
    push   de            ;     caller is responsible for cleaning up the stack…
    push   bc
    pop    iy            ; iy = 'channel'
    ld     e,15 (iy)
    ld     l, c          ; superfluous
    ld     h, b          ; superfluous
    ld     bc, #0x0010   ; load into hl instead
    add    hl, bc
    ld     c,(hl)
    ld     a,e
    sub    a, c
    ld     c,a
    ld     a,#0x40
    sub    a, c
    ld     l,a
    ret

hand-coded assembler: This is for the Vcc memory model with 'handles', so i must dereference a pointer to a pointer to the struct data. And as i can see from the last instruction, it's for the virtual code machine: [19 bytes total]

sio_avail_out::          ; in: de -> -> channel    
    ex     hl,de         ; hl -> -> channel
    ld     e,(hl)
    inc    hl
    ld     d,(hl)        ; de -> channel
    ld     a,obusz       ; a=obusz
    ld     hl,obuwi
    add    hl,de         ; hl -> channel.obuwi
    sub    a,(hl)        ; a=obusz-obuwi
    inc    hl            ; hl -> oburi
    add    a,(hl)        ; a=obusz-obuwi+oburi
    ld     e,a
    ld     d,0           ; out: de = return value
    jp     next          ; jump to next opcode

Z80 code created by Vcc. It's at an early stage and some optimizations are still missing. It looks poor compared with the sdcc-generated code, but as already said, things look different for functions with more than one line of code: then this code is still representative, but sdcc looks poor too. The first line is a program label, though a little bit longish. :-) But if you compare it with the function's signature it hopefully makes sense. [total 36 bytes]

SerialDevice.avail_out__12SerialDeviceC_5uint8:
    pop     hl           ; move the return address to the VM's return stack
    call    pushr_hl    
    rst     ivalu8       ; push obusz: 'ivalu8' = immediate uint8 value
    db      64
    push    de
    ld      l,2+2        ; get local variable 'channel'
    rst     lget         ;    'lget' = get local variable
    ld      hl,15        ; get item 'obuwi' at offset 15
    rst     igetu8       ;    'igetu8' = get uint8 struct item
    push    de        
    ld      l,4+2        ; get local variable 'channel'
    rst     lget    
    ld      hl,16        ; get item 'oburi' at offset 16
    rst     igetu8
    pop     hl        
    and     a            ; subtract obuwi - oburi
    sbc     hl,de
    ex      hl,de
    pop     hl
    and     a            ; subtract obusz - (obuwi - oburi)
    sbc     hl,de
    ex      hl,de    
    pop     af           ; discard 2nd value on stack (the 'channel') 
    jp      return       ; get back the return address and return

Virtual code created by Vcc with minimum optimization: [29 bytes total]

SerialDevice.avail_out__12SerialDeviceC_5uint8:
    rst  p_enter         ; the proc is entered in z80 code: switch to virtual code
    dw   IVAL, 64        ; push obusz
    dw   LGET            ; get local variable 'channel'
    db   2
    dw   IGETu8          ; get item 'obuwi' at offset 15
    db   15
    dw   LGET            ; get local variable 'channel'
    db   4
    dw   IGETu8          ; get item 'oburi' at offset 16
    db   16
    dw   SUB             ; subtract obuwi - oburi
    dw   SUB             ; subtract obusz - (obuwi - oburi)
    dw   TOR             ; nip 2nd value on stack (the 'channel') 
    dw   DROP            ;     by temporarily moving the top value to the return stack
    dw   FROMR           ;    and dropping the 'channel'
    dw   RETURN

Virtual code created by Vcc after proper optimization: [20 bytes total]

SerialDevice.avail_out__12SerialDeviceC_5uint8:
    rst  p_enter
    dw   IVALu8          ; uint8 opcode with 1-byte argument
    db   64
    dw   OVER            ; instead of LGET 2
    dw   IGETu8        
    db   15
    dw   OVER2           ; instead of LGET 4
    dw   IGETu8
    db   16
    dw   SUB
    dw   SUB
    dw   NIP0RETURN - 1  ; nip one value (the 'channel') and return

One astonishing difference between z80 code and virtual code is: optimization.

When you optimize z80 code, the following equation is true:

codesize = speed

The bigger your code, the higher the speed. Every effort to increase speed results in bigger code.

When you optimize virtual code, this equation is true:

codesize = 1 / speed

Whenever you reduce code size, the speed goes up. This is because the standard method to optimize virtual code is to create 'combi opcodes' for frequently occurring opcode pairs, which eliminates one opcode fetch. As a result it is much more fun to optimize virtual code because you are rewarded twice. :-) Though caveat: the size of the support code blob grows! :-(

2016-04-27

One of the most useless features in C

While writing code for a "file descriptor", which contains an array of function pointers, i stumbled over a "problem" which i first thought was an error in sdcc. To report the error i simplified the source until it consisted of only these 4 lines:

typedef int (*MyFPtr)(struct Data*);
struct Data { int a; };
extern int bar(struct Data* f);
MyFPtr foo = bar;

➜ I make a typedef for the function pointer, because function pointers are so awkward in c (though they are pretty compared to function pointers in c++ … which resulted in the invention of the data type 'auto' …)
Then at some point i actually define the struct.
Later i declare a function which matches the typedef.
Finally i try to assign this function to a function pointer variable.

Compiling this source resulted in an error for the last line:

/foo-1.c:4: error 78: incompatible types
from type 'unsigned-int function ( struct Data generic* fixed) __reentrant fixed'
  to type 'unsigned-int function ( struct Data generic* fixed) __reentrant fixed'
/foo-1.c:5: error 78: incompatible types
from type 'unsigned-int function ( struct Data generic* fixed) __reentrant fixed'
  to type 'unsigned-int function ( struct Data generic* fixed) __reentrant fixed'

btw.: ignore the double error. Error messages in sdcc are always a little bit suboptimal.

This looked as if the compiler had a problem seeing that two identical types are identical.

And indeed they aren't.

As i learned from my bug report, the first line implicitly declares a local data type. Local to – well, i don't know exactly what. But it's local. And so it's different from the later, globally defined struct.

One suggested solution was:

typedef unsigned int (*T)(struct Data*);
extern unsigned int foo(struct Data* f);
T bar = foo;

which compiled without error. But this actually was an error in sdcc: this source is wrong too:
Lines 1 and 2 both declare local data types, which are therefore different. Line 3 shouldn't work. That this was an error could be proved by actually trying to use the function pointer typedef:

typedef unsigned int (*T)(struct Data*); // local data type
extern unsigned int foo(struct Data* f); // local data type
T bar = foo;                             // works in sdcc but shouldn't
struct Data { unsigned int a; };
int main() { struct Data d = {0}; foo(&d); } // rejected

This led me to the question:

What is the implicit declaration of a local data type in a function's argument list good for, anyway? I can't think of a real use case. It's nearly impossible to call such a function: you have to cast the function to one which accepts the data type you actually have, because you cannot even cast your data to the local data type…

Second, it's just a pitfall: if you declared the data type before the typedef and before the function declaration or definition, then the global data type is used. If you didn't, a local data type is used. Imagine if a local variable a in a function body were local only when no global variable a had been defined before…

2016-04-20

Firmware Download and Access to IDE Board

Two topics in this post:

  • Firmware download
  • Access IDE board, IDE and CF devices

Firmware download

After my explorations into CRC generation, i worked on the firmware download code. This code has to run in RAM, because the eeprom can't be read while it is busy writing a block of data into its cells.

I tried to keep things simple, and so the program flow looks like this:
write_eeprom.s

1. wait for SIO output to become empty
2. disable interrupts
3. receive magic header bytes
   if they are wrong: bail out
4. copy code from rom into ram
   jump to 6.

in ram:
5. Retry: receive bytes until magic header detected
6. in a loop:
7.    receive 64 bytes of data (last block may be shorter)
      update the crc after each byte
8.    write block of data into eeprom
9.    wait while eeprom busy
10. loop
11. receive and check crc:
    error: print a message, flush input, wait for a key and retry at 5
    ok:    print a message, flush input, wait for a key and reset

Already complicated enough.

ad 1: I wait for the SIO output to become empty, because there may be (and typically are) some bytes left in the output buffer, and as soon as i disable interrupts they will never be sent. This resulted in truncated "last messages".

ad 7: The program receives blocks of 64 bytes, which is the eeprom's block size, writes them into the eeprom and reuses the receive buffer for the next block. It does not keep the bytes around until all bytes are received, so the whole eeprom can be reprogrammed even though i have only 32k of RAM, minus approx. 1k for code and buffers, available.

The program actively polls the UART and retrieves the bytes as soon as they pop up in the UART queue. Then it immediately updates the CRC, so no extra loop is needed for CRC calculation.

The data is written into the eeprom even though the CRC has not yet been checked – that can only be done at the end of the transmission.

ad 9: Waiting for the eeprom takes up to 10 ms (acc. to Atmel docs). During this time i do not poll the UART, for simplicity. And i do not use any flow control. Currently the UART is programmed to 9600 Baud, which means 960 characters per second or 9.6 characters per 10 ms. The UART has an input queue of 16 bytes: The UART is doing my job! :-)

ad 11: Up to now i never had a CRC error.

Overall workflow is now like this:

1. Work on the source, compile & assemble the rom.
2. Launch the new rom in the emulator
   The emulator silently creates a download file from the bare rom file.
3. Connect to the board (most times CoolTerm _is_ connected…)   
4. Select "download firmware" from the menu presented by the board
5. Upload a "Textfile" in CoolTerm
6. Type a letter to reboot

The upload including writing to the eeprom now takes exactly as long as uploading the rom itself: 1 second per 960 bytes. Currently roughly 20 seconds.

Kio: "See Stager, that's how it works!"

Next i made a modification to the rom, uploaded it, and found that it no longer booted.

Stager: "See Kio, you still need me…" :-(

Access the IDE board

Now that turn-around cycles are much faster and less painful, i started first tests to access the IDE board.

I talked to the i²c eeprom on the board... and it answered! :-)

Then i collected all my knowledge about IDE, which mostly centers around the DivIDE emulation in zxsp, and made up my mind on what to send to and read from the IDE device at first.

The test program looks roughly like this:

1. disable interrupts (in c: you remember the __critical bug?)
2. select the IDE board
3. read status and error register (from master, which is selected after reset)
4. print something
5. wait for ready and !busy in the status register
6. check that the device is not unexpectedly waiting for data
7. finally: issue command "IDENTIFY"
8. wait for data request in the status register, bail out on unexpected state
9. read 256 words into a sector buffer
10. read status and error register
11. print the sector data in hex
12. inspect and print various fields in the sector data

ad 1: Currently i'm back on the sdcc 3.6 nightly build, though it produces much slower code than version 3.4.

ad 2: As you may know if you have followed this blog for the last years B-) boards attached to the K1 bus must be "selected", and from then on all i/o goes to that board. This really is a nice method and i'm really happy with it. I just must make sure that i'm not interrupted while working with a device, as the interrupt (sic!) will probably leave some other board selected…

ad 9: I really read 256 words, not 512 bytes. (Though, in essence, of course i do read 512 bytes.) The K1 bus is a 16-bit bus, and the Z80 board contains two 8-bit buffers for sending and receiving the high byte. Reading 16-bit values then works like this:

Implementation of a function for c:

;  uint16 in_w( uint8 addr ) __z88dk_fastcall;
;
_in_w::
    ld    a,l            ; a = register address
    or    a,k1_rd_data   ; add bits to access the bus
    ld    c,a            
    in     l,(c)          ; read the low byte
    ld    c,k1_rd_hi     ; access the high-byte register
    in    h,(c)          ; read the high byte from the high-byte register
    ret

The function is marked as __z88dk_fastcall, which is really funny: z88dk is the (only?) competitor of sdcc in the Z80 field. __z88dk_fastcall means that the argument of the function – which must have exactly one argument – is passed in l, hl or hlde, depending on its size, and not on the stack. In my opinion this should be the default.

PQI DOM, CF cards and Seagate ST1

For some reason it didn't work at first, but then, all of a sudden, i got good-looking data from a device. The only thing i did to make it happen was to scrutinize the circuit diagram for errors. And as soon as i could prove there was no error, reality was modified to match my expectations. Check.

Of course the ascii texts for the model name etc. were byte-swapped in the first version. Probably most people fall into this pitfall. The 4-byte values were calculated wrongly in one place but correctly in another. After a few iterations the output from the "built-in" PQI DiskOnModule looked like this:

$00: 045A 02EE 0000 0008 0000 0210 0020 0002 EE00 0000 2020 2020 2020 2020 2020 2020
$10: 2020 2020 2020 2020 0002 0002 0004 6462 3031 2E32 3061 5051 4920 4944 4520 4469
$20: 736B 4F6E 4D6F 6475 6C65 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 0001
$30: 0000 0200 0000 0200 0000 0001 02EE 0008 0020 EE00 0002 0100 EE00 0002 0000 0000
$40: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
     ...
$F0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

model name = PQI IDE DiskOnModule                    
serial number =                     
firmware revision = db01.20a
fixed disk
ATA version = 0
LBA supported
Default capacity (sectors) = 192000
default C/H/S = 750/8/32
default capacity (sectors) = 192000
current C/H/S = 750/8/32
current capacity (sectors) = 192000
device supports PIO mode 3 or DMA mode 1 or above

Next, reading from the slave – a Seagate ST1 hard disk in the CF card slot – did not work. The ST1 always reported !ready.

I thought there could be a severe problem with the pin assignment of the CF card slot, but after double checking, it was ok. Next i suspected a missing pull-up on the /CS line, which discriminates between master and slave, but these inputs have an internal pull-up.
Then i checked the ST1 itself: with the help of a USB CF card adapter i attached it to my Mac and it spun up. I watched some very cool short videos which i had saved on the drive a few years ago.

Do ya know any of them?

Then i rummaged around for some Compact "Flash" cards and came up with a 256 MB and a 16 MB one. (The latter i actually don't own. Hi Axel, do you miss your 16 MB card? I just found it… :-))

I tested them.

And they worked:

$00: 848A 02B7 0000 000F 0000 0200 0030 0007 A2B0 0000 5830 3130 3220 3230 3033 3130
$10: 3237 3032 3536 3137 0002 0002 0004 5265 7620 332E 3030 4869 7461 6368 6920 5858
$20: 4D32 2E33 2E30 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 0001
$30: 0000 0200 0000 0100 0000 0001 02B7 000F 0030 A2B0 0007 0100 A2B0 0007 0000 0000
     ...
$F0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

model name = Hitachi XXM2.3.0                        
serial number = X0102 20031027025617
firmware revision = Rev 3.00
removable medium
ATA version = 0
LBA supported
Default capacity (sectors) = 500400
default C/H/S = 695/15/48
default capacity (sectors) = 500400
current C/H/S = 695/15/48
current capacity (sectors) = 500400
device supports PIO mode 3 or DMA mode 1 or above

and

$00: 848A 00F4 0000 0004 4000 0200 0020 0000 7A00 0000 3932 3130 3336 3230 3130 3938
$10: 3939 3039 3132 3831 0002 0002 0004 5631 2E30 3220 2020 4C45 5841 5220 4154 4120
$20: 464C 4153 4820 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 0001
$30: 0000 0200 0000 0200 0000 0003 00F4 0004 0020 7A00 0000 0100 7A00 0000 0000 0000
     ...
$70: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
$80: 0000 75D0 0075 A800 75B8 0078 0174 55F6 08F4 B800 FA08 F4F5 F0E6 B5F0 10F4 F5F0
$90: 08B8 00F5 7801 B455 E674 0001 A700 7455 01A7 90C0 5974 01F0 E4F0 90C0 2D8D E6A6
$A0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
     ...
$F0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

model name = LEXAR ATA FLASH                         
serial number = 92103620109899091281
firmware revision = V1.02   
removable medium
ATA version = 0
LBA supported
Default capacity (sectors) = 31232
default C/H/S = 244/4/32
default capacity (sectors) = 31232
current C/H/S = 244/4/32
current capacity (sectors) = 31232
device supports PIO mode 3 or DMA mode 1 or above
device supports Ultra DMA

But the ST1 refused to become ready.

I played around with the master/slave setting and found that the "jumper" on the PQI module probably forced the device to master. This had to be taken into account when changing the master/slave jumper on the IDE board. (I first thought it had something to do with the power supply, because this is an IDE module and the IDE bus normally provides no power, but i was wrong.)

Finally i jumpered the CF card adapter as master and pulled the "master" jumper on the PQI module. The CF card and the PQI module still answered, with roles swapped, and, unbelievably, the Seagate ST1 answered too! So, to make the ST1 work, i need to set the CF card adapter – the "removable" medium – to master and the "fixed" PQI module to slave? I tried with an empty CF card slot and the PQI module still answered – as slave. Is this IDE standard? (Actually i know very little about CF, IDE and so on; the official documents cost money, for download or membership, so i get what can be found with the help of aunt Google.)

$00: 848A 12ED 0000 0010 7E00 0200 003F 004A 8530 0000 2020 2020 2020 2020 2020 344D
$10: 4430 3433 4B53 2020 0003 0100 0004 332E 3034 2020 2020 5354 3632 3532 3131 4346
$20: 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 8010
$30: 0000 0B00 0000 0200 0000 0007 12ED 0010 003F 8530 004A 0100 8530 004A 0000 0407
$40: 0003 0078 0078 0078 0078 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
$50: 0000 0000 7069 500C 4000 7069 100C 4000 0007 0000 0000 4040 0000 400D 8080 0000
$60: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
     ...
$90: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
$A0: 814A 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
$B0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
     ...
$F0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

model name = ST625211CF                              
serial number =           4MD043KS  
firmware revision = 3.04    
removable medium
ATA version = 0
command sets supported = $7069 500C
command sets enabled   = $7069 100C
LBA supported
Default capacity (sectors) = 4883760
default C/H/S = 4845/16/63
default capacity (sectors) = 4883760
current C/H/S = 4845/16/63
current capacity (sectors) = 4883760
device supports PIO mode 3 or DMA mode 1 or above
device supports Ultra DMA
max sectors per R/W MULTIPLE = 16
CFA power mode = 33098

Data transfer to and from the IDE disks is not really fast with a Z80 processor; the best i can get is:

    ld      c,k1_wr_data
    ...
    ; send 1 word / 2 bytes in a loop
    ; or unroll as long as you like
    ld      e,(hl)          ; get the low byte
    inc     hl
    ld      a,(hl)          ; get the high byte
    inc     hl
    out     (k1_wr_hi),a    ; store byte in the high-byte register
    out     (c),e           ; send 1 word / 2 bytes to the device

This takes 25 cc per byte or, if inc hl can be replaced with just inc l, 21 cc, plus loop and setup overhead. The system has a clock frequency of 6 MHz, which means i can transfer 32 kB (the whole RAM) in 32k * 25 cc = 819200 cc or, at 6 MHz, in roughly 1/7 second. Ok, the CPU is slow, but the RAM is small as well. :-)

A single sector can be transferred in 512 * 25 cc = 12800 cc or, at 6 MHz, in roughly 1/450 second. This is also fine because it means that i can disable interrupts for a whole sector i/o without losing timer or SIO interrupts, as the timer interrupt currently runs at 100 Hz. I could also increase the timer frequency to 300 or 400 Hz to allow a SIO speed of up to 38.4 kBaud if i desired, but CPU time consumption would go up as well, and at 100 Hz it's already around 2% with an idle SIO. And with very intelligent use of INI or OUTI a few cycles could be saved for at least one IN or OUT.

It should be noted that the CF cards can be operated in a byte-wide mode. Then INIR and OTIR should be usable and one byte transferred in 16 cc.

Conclusion

All boards work, and now all that's left to do is software.

p.s.: I still haven't written to the IDE devices. Surprise ahead? (Shiver…)

2016-04-15

sdcc, crc and queues

Hello,

this week i was busy working on my Z80 project. Though it always looks like moving in small circles, i made some progress.

In this post:

  • sdcc created broken code
  • Stager Electrics' programmer failed to program an eeprom
  • There is crc-16, crc-16 and crc-16
  • Stager Electrics' programmer failed to program an eeprom
  • Simple design of queues and how long can it take to spot an error
  • Stager Electrics' programmer failed to program an eeprom
  • sdcc could produce really fast code. could…

Stager Electrics

The code that detects whether i'm running on an eprom or an eeprom also write-protects the eeprom (SDP = software data protection) so that it cannot be overwritten when the program crashes. For that you just write certain bytes to certain addresses. Next i wrote a test message into the eeprom. This is done by writing the SDP sequence and then the bytes to program. It crashed at first because the destination buffer was calculated too small `:-) but worked on the second try.

Next i tried to overwrite an eeprom with the Stager Electrics programmer. Of course this did not work. It took 11 minutes to write the eeprom, and after that the verify failed. The programmer knows the eeprom by name and manufacturer but cannot deactivate the software data protection in the eeprom. And it can't erase the device as a whole. Luckily i have more than one of these eeproms.

In a later iteration of the rom i added code which, before doing anything else, tests whether an eeprom is inserted in the ram socket (you remember: they are pin-compatible); and if so, it disables software data protection and happily bails out with a blink code. Even later i added an option to disable SDP in the current eeprom. Now that Stager thing can program the eeproms again.

sdcc

I probably spent one full day (after work) tracking down a not-so-reproducible crash when trying to read from the i2c eeprom on the SIO board. Finally i could prove it's an error in the C compiler. In a __critical function, which means it is executed with interrupts disabled, the state of the interrupt enable flip-flop is pushed on the stack on entry so that the interrupts can be re-enabled – or not – on return. The generated code ignored this additional word on the stack and read everything from the wrong local variables. By chance the address of the destination buffer was falsely taken from the i2c eeprom read start address, which was 0, and so reading the eeprom overwrote ram from address 0x0000 onwards. Clearly not a good idea.

Stager Electrics

I forgot to disable SDP in the eeprom and had to add another 11 minutes after eeprom verification failed…

crc-16

I want to download new rom images to the Z80 system so that the system can reprogram itself, which takes 5 seconds (at most) and not 11 minutes. The current speed on the serial port is 9600 Baud, which means 960 bytes can be transmitted per second: 16 kB (the current rom size) are transmitted in roughly 17 seconds, and overwriting the whole 32 kB of the eeprom takes at most 34 seconds. For the "protocol" i decided, after some pros and cons, to just wrap the rom image with a 2-byte start and stop prefix/postfix and to add a crc checksum for error detection.

I already had a CCITT crc-16 implementation in C at hand and googled for a Z80 version, which was quickly found. Then i did some tests to compare the results and found … nothing in common.

Ok, there are crc-16 and crc-16 and crc-16 and they are all different.

Let's look at the c implementation:

uint crc16_ccitt( uint8 const* q, uint count, uint crc )
{
   while(count--)
   {
      for( uint c = 0x0100/*stopper*/ + *q++; c>1; c >>= 1 )
      {
         crc = (crc^c) & 1  ?  (crc >> 1) ^ 0x8408  :  (crc >> 1);
      }
   }
   return crc;
}

And the Z80 version converted to a c function for easy understanding:

uint crc16_z80( uint8 const* q, uint count, uint hl )
{
   while(count--)
   {
      uint8 b;
      hl ^= *q++ << 8;
      for(b=0; b<8; b++)
      {
         if((signed int)hl < 0) hl = (hl<<1) ^ 0x1021;
         else                   hl = (hl<<1);
      }
   }
   return hl;
}

The first chance for a difference is the input value for the crc. This must be 0xffff for the CCITT version, and then the function may be called repeatedly to update the CRC as bytes arrive. Of course i called them both with the same starting value. Check.

Next you see that both functions use different polynomials: 0x8408 and 0x1021. Of course they must be the same to produce the same result – and they _ARE_ the same: the c function shifts bits from left to right, the z80 version from right to left, so they just work bit-reversed. Check.

Ok, they work bit-reversed compared to each other, so the result must be bit-reversed. But even with one result reversed, the CRCs were completely different.

So what's the difference?

The bytes read from the data buffer must be bit-reversed as well (in either one of the functions) to make all data bit-reversed; then the result (of either function) can be bit-reversed, and then they are actually identical!

The fully bit-reversed version of the first function looked like this:

#define  R1(N) ((N<<7)&0x80)+((N<<5)&0x40)+((N<<3)&0x20)+((N<<1)&0x10) + \
               ((N>>7)&0x01)+((N>>5)&0x02)+((N>>3)&0x04)+((N>>1)&0x08)
#define R4(N)  R1(N),R1((N+1)),R1((N+2)),R1((N+3))
#define R16(N) R4(N),R4((N+4)),R4((N+8)),R4((N+12))
#define R64(N) R16(N),R16((N+16)),R16((N+32)),R16((N+48))
uint8 rev[256] = { R64(0), R64(0x40), R64(0x80), R64(0xC0) };

uint16 crc16r( uint8 const* q, uint count )
{
  uint crc = 0xffff;
  while(count--)
  {
    for( uint c = 0x0100 + rev[*q++]; c>1; c >>= 1 )
    {
      crc = (crc^c) & 1 ? (crc >> 1) ^ 0x8408 : (crc >> 1);
    }
  }
  return rev[crc>>8] + (rev[crc&0xff]<<8);
}

Now i have a C and a Z80 implementation of a CRC-16 checksum which work identically. `:-)

Note: To calculate the CCITT CRC-16 checksum with the first function, the calculation must be started with CRC = 0xFFFF and the final CRC must be complemented. All sources then say that you must swap the low and high byte. But that's not true – or rather, that's not the point: whether you must swap the bytes depends on how you read the CRC from the data stream and what byte order your computer uses. I believe the low byte is transmitted first. (to be tested somehow & somewhen…)

The Z80 version calculates the CRC-16 used in the XMODEM file transmission protocol. Here the CRC must be initialized with 0x0000, the final CRC must not be complemented and the high byte is sent first.

Stager Electrics

I forgot to disable SDP in the eeprom and after programming eeprom verification failed and i thought it was defective now…

Queues

I use a nice design for queues (in the sio implementation) which avoids the need for locks (or mutexes).

#define busize 64           // 2^N
#define bumask (busize-1)   // parenthesized so it is safe inside expressions

uint8 bu[busize];
uint  ri;          // read_index
uint  wi;          // write_index

Normally writing to a queue works like this:
(I'll only describe writing, reading is similar.)

bu[wi++] = mybyte;
wi &= bumask;

Drawback:

You cannot distinguish between a full and an empty buffer, so you fill it up to at most busize-1 bytes.

This can be helped:

bu[wi++ & bumask] = mybyte;

Now the buffer is empty if wi==ri and full if (wi-ri)==busize.
ri and wi will at some time overflow but the integer arithmetics remain valid.

Though not obvious, this implementation still needs locking: wi is incremented before the byte is written, and the buffer reader could interrupt between wi++ and writing the byte into the buffer, and read the not-yet-written byte. But this can be remedied like this:

bu[wi & bumask] = mybyte; wi++;

Now the byte is stored first and then the write pointer is incremented, "releasing the semaphore".
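Putting the pieces together, this is a sketch of the complete lock-free queue (my own reconstruction, not the sio source; it assumes exactly one writer and one reader, e.g. interrupt handler and main loop – on a single Z80 the store-then-increment ordering is all that is needed, on a multi-core host you would additionally need memory barriers):

```c
#include <stdint.h>

#define busize 64                /* must be a power of two (2^N) */
#define bumask (busize-1)

static uint8_t bu[busize];
static volatile unsigned ri;     /* read index:  advanced only by the reader */
static volatile unsigned wi;     /* write index: advanced only by the writer */

unsigned qu_avail(void) { return wi - ri; }             /* bytes in the queue */
unsigned qu_free(void)  { return busize - (wi - ri); }  /* free space         */

int qu_put(uint8_t byte)         /* returns 0 if the queue is full */
{
    if (wi - ri == busize) return 0;
    bu[wi & bumask] = byte;      /* store the byte first ...                       */
    wi++;                        /* ... then publish it, "releasing the semaphore" */
    return 1;
}

int qu_get(uint8_t* byte)        /* returns 0 if the queue is empty */
{
    if (wi == ri) return 0;
    *byte = bu[ri & bumask];     /* read the byte first ...       */
    ri++;                        /* ... then free the buffer slot */
    return 1;
}
```

Note that the indices run freely and are only masked when used: empty is wi==ri, full is (wi-ri)==busize, and unsigned overflow of wi and ri does no harm.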

How long can it take to spot an error?

These are the data structs containing the data for each channel:

struct SioData 
{ 
  bool  hw_handshake; 
  uint8 sw_handshake;   // bit.0: enabled  
  uint8 clk_handshake;  // bit.0: emit TX clock
  uint8 device;         // select mask
  uint8 channel;        // 0 = channel A; 1 = channel B
  uint8 baudrate;       // baudrate / 2400

  uint8 ibuwi;          // input  buffer write index
  uint8 iburi;          // input  buffer read index
  uint8 obuwi;          // output buffer write index
  uint8 oburi;          // output buffer read index

  uint8 ibu[ibusz];     // input  buffer
  uint8 obu[obusz];     // output buffer
};

These are two actual implementations in my sio source:

uint sio_avail_in(struct SioData* channel)  
{ 
  return channel->ibuwi - channel->iburi; 
}
uint sio_avail_out(struct SioData* channel) 
{
  return obusz - (channel->obuwi - channel->oburi); 
}

Nice! :-)

And both wrong. :-?

When i tested transmission of data from my Mac to the Z80 system, i only got transmission errors. The Z80 system received all data when CoolTerm was at 50 .. 80%. I suspected CoolTerm. I suspected the USB-RS232 driver software. (Which actually _IS_ pretty buggy.) I suspected sdcc. I scrutinized the Z80 assembler interrupt routine. I examined the test routine itself. (A common place. Actually i started here… ;-)) I examined gets(…), which receives all available data into a buffer and which is written in C. I examined sio_avail_in(…). Not only once … My source and what sdcc compiled. And sio_avail_in(…) was buggy. But it took me hours to see the error. Do you spot the error? C'mon, it's only one line of code. A single subtraction of two values…

sdcc could produce really fast code. could…

I have written several versions of the CRC routine, two similar versions in Z80 assembler and some in C. I timed them and got interesting results.

CRC-16 ZMODEM of rom (asm1) dt=1180 ms
CRC-16 ZMODEM of rom (asm2) dt=1430 ms
CRC-16 CCITT  of rom (c)    dt=9500 ms
CRC-16 ZMODEM of rom (c)    dt=3020 ms

The C function to calculate the XMODEM CRC is much faster than the function to calculate the CCITT CRC, though both contain equivalent source.

That was with sdcc 3.4.

Due to the __critical error mentioned at the beginning of this post i looked for the latest version of sdcc. I thought, if i send in a bug report they'll surely complain that it's for version 3.4, which is 2 years old.

So i looked for the latest version: Version 3.5, which is 10 months old. (sigh).

It still had the __critical bug but i found the bug tracker and an entry for this bug: Fixed in 9'2015. Version 3.5 is from 6'2015. sigh…

So i searched and found the beta versions (more like nightly builds) and the latest OSX version was 13 minutes old. :-) It no longer has the __critical bug (tested), needs some other includes (copied) and produces slightly larger code. And i ran the CRC test again: (rom now slightly bigger)

CRC-16 ZMODEM of rom (asm1) dt=1200 ms
CRC-16 ZMODEM of rom (c)    dt=8500 ms

The C routine is now nearly 3 times slower?

So i reverted to sdcc 3.4 and reinstalled my workaround for the __critical bug…

    ... Kio !

p.s.: @ Google: The editor is crap. could you please fix it?

---- SPOILER WARNING ----

p.p.s.: the read and write indexes in the sio struct are (unsigned) bytes.
When they are subtracted in sio_avail_*(…) they are extended to 2-byte values.
If the write index has already overflowed and the read index not, then the difference is not limited to 8 bits as expected but the high byte of the result is 0xFF.
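The effect can be reproduced on any machine (a reconstruction by me, not the original sio code; on a 32-bit host the garbage value differs from the Z80's 0xFF.., but the bug is the same): both uint8 operands are promoted to int before the subtraction, so once wi has wrapped around and ri has not, the difference goes negative instead of wrapping within 8 bits. The fix is a cast back to 8 bits:

```c
#include <stdint.h>

/* wi has wrapped past 0xFF (258 writes so far), ri has not: */
static uint8_t wi = 2;      /* 258 & 0xFF */
static uint8_t ri = 250;

/* Both operands are promoted to int: 2 - 250 = -248, huge as unsigned. */
unsigned buggy_avail(void) { return wi - ri; }

/* Truncating back to 8 bits restores the modulo-256 arithmetic: 8. */
unsigned fixed_avail(void) { return (uint8_t)(wi - ri); }
```

The same cast (or assigning the difference to a uint8 variable) repairs sio_avail_in(…) and sio_avail_out(…).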


2016-04-10

Z80 Microcomputer with SRAM and K1-Bus

After suspending the project for a while, i'm now back to it. This is an update to the current state.

Hardware

There were some errors in the circuit, which i could fix. The V2.0 Eagle file on my website already contains these fixes.

2016-04-08 fixed board
Blue wires: A6 and A7 are used (beside A5) to select the target of an i/o operation. One of these is the access to the i2c bus on the k1 bus, where also my debugging LEDs are attached. When i2c is selected in an i/o operation, then A6 and A7 are used to set the i2c data and clock lines. – But wait, A6/7 are used to select i2c operation AND to select something within the i2c operation? Merde… So i rerouted the i2c lines to use A3 and A4 instead.
Yellow wires: As you can see by a look at the ram/rom address decoder in the last post, ram and rom selection is exchanged. First i fixed this by inserting eeprom and ram into each other's socket, which is possible with the eeprom, but not with the eprom.
For my (e)eproms i use a programmer from Stager Electric, Shenzhen, China. If you ever see something made by Stager Electric: run as fast as you can! It takes ~11 minutes to write a few bytes into an eeprom. The application has an option to "disinterest blanck", and eventually, in the next version (which i never saw), it even worked… So it always programs the full 32k. Even that could be done in less than 32768/64*10ms = 5.12 seconds (writing 64-byte pages), but it actually takes 11 minutes, more than 100 times longer, so i presume the eeprom is programmed byte by byte, making sure that its write endurance of 10000 cycles is within reachable distance... So i wanted to use eproms instead and fixed the circuit. Programming eproms is faster as well.
Component side: A minor glitch is on the component side: I have carefully engraved "E" and "EE" for the eprom/eeprom selection pin header into the copper layer, and again, did it wrong: exchanged, as always…

Software

I was playing a little with my c-style compiler to add a Z80 target, and found: the Z80 is really badly suited to implement anything a compiler might try to create. Too few registers, which frequently have special roles. Deploying the second register set is nearly impossible. Using the index registers is painfully slow. (you already knew that) Local variables on the stack are a pain to access.

Basically you have the choice to generate real machine code, which is not only slow but bloated as well, and some kind of Forth-style virtual code, which is short but even slower.

I finally came to the "fastest possible Forth-style" code model, which i will pursue later: It uses a jump table and opcodes which are 1-byte indexes into this table; this is faster (and shorter) than using 2-byte addresses in the program as Forth implementations typically do. Drawback: i need the table, and the table can contain only ~256/3 entries. So there must be "prefix" opcodes, which then are slower.

The jump table looks like this:

vector: macro $NAME
$NAME:: equ $ - vtable 
jp _$NAME 
endm                          

; ------------------------------------

vector RESET  ;( -- ) 
vector SHELL  ;( -- ) 
vector NATIVE ;( -- ) 
vector ABORT  ;( uint -- )   

vector MODs   ;( n n -- n ) 
vector DIVs   ;( n n -- n ) 
vector MODu   ;( n n -- n ) 
vector DIVu   ;( n n -- n ) 
vector MUL    ;( n n -- n )  

vector JP1    ;( n $dest -- )
vector JP0    ;( n $dest -- )
vector JP     ;( $dest -- )  

and so on. You see, each entry is a JP opcode (by virtue of the macro), but "inline" code in the table is sometimes possible as well, e.g. if a variant of an opcode just needs a short mockup of the arguments, its code can be put directly in the table before the other opcode, which it simply runs into. It's a trade-off between space used and speed gained.

A typical "word" looks like this:

_SUB: ;( n1 n2 -- n )           

pop hl     ; hl=n1 (n2 is already in de, the result register)
and a      ; clear carry for sbc
sbc hl,de  ; hl = n1-n2
ex de,hl   ; de = result
next                      

where next is a macro:

; fetch next virtual opcode and jump to handler
; 
next: MACRO 
ld h,hi(vtable) 
ld a,(bc) 
inc bc 
ld l,a 
jp hl 
ENDM                 

An alternative is to jump to a single instance of the next macro, which is slightly slower (10 cc for the jump) but also shorter (just 3 bytes). If it can be done with a relative jump, then it's even shorter (2 bytes) – and slower still…

As you can see i use register pair BC for the virtual program counter and DE as result register, which frees HL so that machine coded sub routines can pop the return address into HL, do some work, e.g. pop arguments, and finally return via JP HL, which is not possible if you use HL as result register.

If an opcode implementation does not modify the h register, then it does not need to reload h with the high byte of the vtable address. There are actually some (few) opcodes which can exploit this additional speed boost. :-)

As you can see, the interpreter reads just one byte from the program and jumps into the vtable which contains jumps to the actual implementation of the virtual opcodes. This is faster than reading 2 bytes from the program, the program is shorter, but i need the tables and implementations for all opcodes.
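The dispatch scheme can also be modeled in C – a toy sketch with made-up opcodes (PUSH/ADD/HALT are mine, not the real instruction set), just to illustrate the 1-byte-index idea:

```c
#include <stdint.h>

typedef void (*handler)(void);

static int16_t stack[16];
static int sp;                        /* data stack pointer                      */
static const uint8_t* pc;             /* virtual program counter (BC in the Z80) */
static int running;

enum { PUSH, ADD, HALT };             /* opcodes are plain table indices */

static void op_push(void) { stack[sp++] = *pc++; }   /* 1-byte literal follows */
static void op_add(void)  { sp--; stack[sp-1] = (int16_t)(stack[sp-1] + stack[sp]); }
static void op_halt(void) { running = 0; }

static handler vtable[] = { op_push, op_add, op_halt };

int16_t run(const uint8_t* program)   /* returns the top of stack */
{
    pc = program; sp = 0; running = 1;
    while (running) vtable[*pc++]();  /* the entire interpreter loop */
    return stack[sp-1];
}
```

In the real Z80 version the table holds 3-byte `jp` opcodes instead of function pointers, which is exactly why only ~256/3 entries fit in the 256-byte index range.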

The alternative i'm currently working with – because the Z80 backend of my compiler is not yet completed – is sdcc, the "Small Devices C Compiler", which has a Z80 backend. I can really tell that the generated code is bloated, and sometimes suboptimal, the syntax of the generated code is "unusual" and sometimes the compiler even crashes for me. Especially when i use the "<<" operator.

Here is an example of what sdcc generates:

;/Firmware-Sdcc/sio.c:394: if(this->clk_handshake)
8520: DD7EFE   ld  a,-2 (ix)
8523: DD77FB   ld  -5 (ix),a
8526: DD7EFF   ld  a,-1 (ix)
8529: DD77FC   ld  -4 (ix),a
852C: DD6EFB   ld  l,-5 (ix)
852F: DD66FC   ld  h,-4 (ix)
8532: 23       inc hl       
8533: 23       inc hl       
8534: 6E       ld  l,(hl)   
8535: 7D       ld  a,l      
8536: B7       or  a, a     
8537: 2814     jr  Z,00106$ 

The first line (the comment) is the compiled source line. As you can see, the compiled code reads a word from (ix-2), which seems to be 'this' (a valid variable name in C ;-)), stores it at (ix-5), which seems to be a scratch cell, and immediately reads it back into HL. Then it reads the desired value into l and immediately moves it into a for testing. A wonder of elegance. (note: the scratch value is not used anywhere later; l is used later, but at a point where the value in a is still valid too.) 

Current State of the Project

Current setup
Last and this weekend i refitted all hardware, which is the CPU board, a SIO board and a (not yet tested) IDE board, as can be seen to the left, and hooked it up to a regulated power supply. Current consumption is pretty low, as it's all CMOS: only 50 to 80 mA (depending on how many LEDs are lit) for all three boards, including a 96MB IDE flash rom (hiding between the IDE and the SIO board) and a 2.5GB compact-flash-size hard drive (sticking out from the rear side, so you can't see it either).
Slowly iterating from one broken software step to the next, regularly erasing and reusing my eproms and finally even testing some steps in the emulator (erm, yes, i have written an emulator for the system too, using my Z80 emulation from zxsp and the SIO and a LCD display emulation from my K1 CPU project) i finally got the first text message from the board. I have attached the SIO port A to a RS232-to-USB converter and use CoolTerm on my Mac to receive the messages. I stepped back to an old version of CoolTerm, as the current versions very quickly use 100% of one CPU core. 
For the SIO software i use a simplified approach: The SIO ports are polled on the system timer interrupt (which is generated by the UART as well), which is currently 100 Hz. The UART has 16-byte fifo queues, so 100 Hz is more than enough for 9600 baud. But i'll probably go up to 200 Hz for 19200 baud at least. As sending data works, the system interrupt works as well. 
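The arithmetic behind the polling rate, assuming 8N1 framing (10 bits per byte on the wire):

```c
/* Bytes arriving per timer tick, assuming 8N1 framing: 10 bits per byte.
   Integer division truncates, which is fine for a rough upper bound. */
unsigned bytes_per_tick(unsigned baud, unsigned tick_hz)
{
    return baud / 10 / tick_hz;
}
```

At 9600 baud and 100 Hz about 9.6 bytes arrive per tick, safely below the 16-byte FIFO; 19200 baud at 100 Hz gives 19.2 bytes, which would overflow it; at 200 Hz it drops back to 9.6.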
Idle CPU usage for this interrupt is approx. 2% (calculated), and will be ~4% with 200 Hz, if i don't find a better solution. On the photo above you can see that the red LED in front is lit. This LED indicates WAIT state and currently the CPU waits approx. 98% of the time. (Or 96%, as it's also sending some text through some ugly compiled c code…) So i can say, this LED works as well. :-)
Today i have tested writing of data into the eeprom. (actually only detecting whether it's an eeprom or not, but that's quite similar.) This works too. 

Next steps: 
  • Actually write some bytes into the eeprom    ➞ done 2016-04-11
  • test reading of the i2c eeprom on the SIO board
  • test writing to the i2c eeprom
  • receive data from the SIO port
  • receive program data from the sio port and write it into eeprom.
  • lock away the Stager Electrics programmer. ;-)
Final question is: what should i do with the board? hm hm…

Stay tuned. 

p.s.: Today i wrote a test message into the eeprom. Of course it did not work right from the start – it crashed because the destination space was too short, and behind it in the eeprom was the sio interrupt handler, which was then partly overwritten.
I also tried to overwrite the eeprom with the Stager Electrics programmer – which could not overwrite it. It took 11 minutes to write, and after that verify failed. I had expected this: Of course the programmer cannot deactivate the software data protection in the eeprom. And it can't erase the device as a whole. Luckily i have more than one of these eeproms. And i already have a plan to make them writable again (else they'd be nice ceramic bricks): I can insert them into the ram socket and burn a short eprom with a program which does the job…