The 8-bit inspiration vol 3: Plus/4 assembly XORSHIFT generator for dummies

Greetings for those who eventually happen to read this blog. I've been taciturn for quite a time as I had the impression that nobody, not even any of my friends tend to read this blog. Alright, I didn't expect a too high visibility and I was writing these essentially for fun, as I had a lot of more important things to do, I lost my initial enthusiasm.

However, right now I had been playing around with something I had to write down to understand anyway. The fact is, I'm involved in a research project deailing with uncertaintiy. I'm thus tampering with pseudo random generators, and I have particular tasks with linear feedback shift registers and similar beasts.

Meanwhile I ran into this book:

which I used to have back in the day, and now I have it again. For those whose Hungarian is a bit rusty: this one is about assembly programming of the Plus/4. In fact, I've never been doing any assembly: as a kid Basic was clear enough but Assembly was a bit too abstract, and later I had encountered high-level languages, too. So I thoutht it would be an interesting challenge to get into some acquaintance with assembly. After writing a Hello, World! program following this nice little tutorial, I thought the next thing to go for is a pseudorandom generator. My reasons were:

I have to understand these anyway, to implement one in Assembly definitely adds to the understanding, so I'm not completely wasting time.
The chosen generator is a Marsaglia Xorshift generator (c.f. G. Marsaglia: Xorshift RNGs, J. Stat. Softw. vol. 8 issue 14, (2003) DOI: 10.18637/jss.v008.i14 for details) is based on bit shifts and XOR operations so it suits pretty much for developing some basic Assembly skills.

Of course the nicest would have been to sit down with the article, the book, and a sheet of paper next to the Plus/4 and do it. However, I ran into a webpage where there is a Z80 Code for this generator, and someone has put an assembly code there, too. In addition the article starts with a simple C code that I understand…

So I said O.K., I follow the lazy approach, but at least I should understand what is going on. This blog is a side effect of this understanding. Note that I can be wrong here anywhere, it is written for myself, but if you eventually find some bug, I'm grateful if you share it with me.

The interesting part is that actually the I've found that the code they gave in assembly was wrong, so here I correct it and give it a try. Let's go for it.

1 C version

The C code (from here) reads:

/* 16-bit xorshift PRNG */

unsigned xs = 1;

unsigned xorshift( )
{
    xs ^= xs << 7;
    xs ^= xs >> 9;
    xs ^= xs << 8;
    return xs;
}

Let's see how the steps of this algorithm look like. It is a 16 bit generator, so we have two bytes; h stands for "high", l stands for low. The operator "^=" is an XOR operation; << and >> are left and right bit shifts, respectively. So the steps are as follows:

   x x x x x x x x |h1h2h3h4h5h6h7h8|l1l2l3l4l5l6l7l8|x x x x x x x x
   x h1h2h3h4h5h6h7|h8l1l2l3l4l5l6l7|l80 0 0 0 0 0 0 |x x x x x x x x
XOR------------------------------------------------------------------
                    i1i2i3i4i5i6i7i8|m1l2l3l4l5l6l7l8

The i-s form a new byte, and m1 is the first bit of the new low byte; note that XOR-ing with 0-s is the identity operation, so apart from the first bit, the rest of the lower byte is untouched.

   x x x x x x x x |i1i2i3i4i5i6i7i8|m1l2l3l4l5l6l7l8|x x x x x x x x
   x x x x x x x x |0 0 0 0 0 0 0 0 |0 m1l2l3l4l5l6l7|l8x x x x x x x
XOR------------------------------------------------------------------
                    i1i2i3i4i5i6i7i8|m1n2n3n4n5n6n7n8

In this step we leave the high byte and the first bit of the low byte untouched; the nontrivial part is the rest of the low byte.

   x x x x x x x x |i1i2i3i4i5i6i7i8|m1n2n3n4n5n6n7n8|x x x x x x x x
   i1i2i3i4i5i6i7i8|m1n2n3n4n5n6n7n8|0 0 0 0 0 0 0 0 |x x x x x x x x
XOR------------------------------------------------------------------
                    o1o2o3o4o5o6o7o9|m1n2n3n4n5n6n7n8

In this step the low byte is not touched anymore.

For the 8 bit implementation one should note that in each step the XOR is applied to single bytes only; this makes is suitable for the 8-bit implementation.

2 Plus/4 Assembly version

We have the two bytes in two memory addressess: $H and $L. We can operate on the AC register and we have an additional carry bit, we can do XOR's between the AC and memory addresses; this instruction is called EOR. In addition we can shift single bits. For this we shall use 2 instructions which work in this way:

LSR

    AC              |carry  
    a1a2a3a4a5a6a7a8|c
LSR 0 a1a2a3a4a5a6a7|a8

ROR

    AC              |carry  
    a1a2a3a4a5a6a7a8|c
ROR c a1a2a3a4a5a6a7|a8

First I quote the code I had found here, it is said that this should be doing the same as the C code on the 16 bits represented by rng_zp_high and rng_zp_low:

random
LDA rng_zp_high
LSR
LDA rng_zp_low
ROR
EOR rng_zp_high
STA rng_zp_high ; high part of x ^= x << 7 done
ROR ; A has now x >> 9 and high bit comes from low byte
EOR rng_zp_low
STA rng_zp_low ; x ^= x >> 9 and the low part of x ^= x << 7 done
EOR rng_zp_high
STA rng_zp_high ; x ^= x << 8 done
RTS

Let's understand how the algorithm works step by step. At a point we shall depart from it though.

Read $H to the AC

       AC              |carry  |$H              |$L      
       x x x x x x x x |x      |h1h2h3h4h5h6h7h8|l1l2l3l4l5l6l7l8
LDA $H-----------------------------------------------------------
       h1h2h3h4h5h6h7h8|x      |h1h2h3h4h5h6h7h8|l1l2l3l4l5l6l7l8

Apply LSR to AC

       AC              |carry  |$H              |$L      
       h1h2h3h4h5h6h7h8|x      |h1h2h3h4h5h6h7h8|l1l2l3l4l5l6l7l8
LSR    ----------------------------------------------------------
       0 h1h2h3h4h5h6h7|h8     |h1h2h3h4h5h6h7h8|l1l2l3l4l5l6l7l8

The first two steps were done to realize a single goal: to have h8 in the carry; its use will be clear in step 4.

Load $L to AC

       AC              |carry  |$H              |$L      
       0 h1h2h3h4h5h6h7|h8     |h1h2h3h4h5h6h7h8|l1l2l3l4l5l6l7l8
LDA $L ----------------------------------------------------------
       l1l2l3l4l5l6l7l8|h8     |h1h2h3h4h5h6h7h8|l1l2l3l4l5l6l7l8

Apply ROR

       AC              |carry  |$H              |$L      
       l1l2l3l4l5l6l7l8|h8     |h1h2h3h4h5h6h7h8|l1l2l3l4l5l6l7l8
ROR    ----------------------------------------------------------
       h8l1l2l3l4l5l6l7|l8     |h1h2h3h4h5h6h7h8|l1l2l3l4l5l6l7l8

As you can see now we have in AC the very thing the high byte has to be XOR-ed with in Step 1 of the C code. So let's XOR the AC with $H:

Apply EOR

       AC              |carry  |$H              |$L      
       h8l1l2l3l4l5l6l7|l8     |h1h2h3h4h5h6h7h8|l1l2l3l4l5l6l7l8
EOR $H ----------------------------------------------------------
       i1i2i3i4i5i6i7i8|l8     |h1h2h3h4h5h6h7h8|l1l2l3l4l5l6l7l8

Store the high byte after the frist step.

       AC              |carry  |$H              |$L      
       i1i2i3i4i5i6i7i8|l8     |h1h2h3h4h5h6h7h8|l1l2l3l4l5l6l7l8
STA $H ----------------------------------------------------------
       i1i2i3i4i5i6i7i8|l8     |i1i2i3i4i5i6i7i8|l1l2l3l4l5l6l7l8

It is not a problem that we've lost h1… h8 from now on; we do not need them anymore. Flippin'eck, we are at the 6th step, and we have done only the first half of the first step of the 3 of the algorithm. This illustrates why I prefer C over assembly… Anyway, let's proceed.

7? Apply ROR (WRONG)

       AC              |carry  |$H              |$L      
       i1i2i3i4i5i6i7i8|l8     |i1i2i3i4i5i6i7i8|l1l2l3l4l5l6l7l8
ROR    ----------------------------------------------------------
       l8i1i2i3i4i5i6i7|i8     |i1i2i3i4i5i6i7i8|l1l2l3l4l5l6l7l8

This would be the next step according to the code from the forum. But now what? Note that the first bit of AC is l8 which has to be XOR-ed with l1 in the first step of the C algorithm; this can be done with EOR $L to calculate m1=l1 XOR l8. Meanwhile the description says "AC has now x >> 9 and high bit comes from low byte". The second half of the statement we have recognized, but how does AC have x>>9? He must mean that after l8 we should have m1l2l3l4l5l6l7, but come on, we have i1i2i3i4i5i6i7. We will have m1 only after the XOR.

Actually someone in a comment at the page did say afterwards that the Z80 code the guy had ported was incorrect, too, and indeed, it seems that it is. So let's try and fix it…

Looking at the first step of the C code, we have the I-s, but we are in the need of m1, this we have to calculate as the second step needs it as an input.

7: Zero out AC

         AC              |carry  |$H              |$L      
	 i1i2i3i4i5i6i7i8|l8     |i1i2i3i4i5i6i7i8|l1l2l3l4l5l6l7l8
LDA #$00 ----------------------------------------------------------
         0 0 0 0 0 0 0 0 |l8     |i1i2i3i4i5i6i7i8|l1l2l3l4l5l6l7l8

8: Get l8 to the first bit of AC:

         AC              |carry  |$H              |$L      
         0 0 0 0 0 0 0 0 |l8     |i1i2i3i4i5i6i7i8|l1l2l3l4l5l6l7l8
ROR      ----------------------------------------------------------
         l80 0 0 0 0 0 0 |0      |i1i2i3i4i5i6i7i8|l1l2l3l4l5l6l7l8

8: Calculate m1 as the first bit of AC and get the rest of $L to AC

         AC              |carry  |$H              |$L      
         l80 0 0 0 0 0 0 |0      |i1i2i3i4i5i6i7i8|l1l2l3l4l5l6l7l8
EOR $L   ----------------------------------------------------------
         m1l2l3l4l5l6l7l8|0      |i1i2i3i4i5i6i7i8|l1l2l3l4l5l6l7l8

Note that the EOR with 0 is actually the identity:

0 EOR 0 = 0
0 EOR 1 = 1

thus we got the copy of l2 at the right place. So the first step of the C code is almost done; we just need to store the result:

         AC              |carry  |$H              |$L      
         m1l2l3l4l5l6l7l8|0      |i1i2i3i4i5i6i7i8|l1l2l3l4l5l6l7l8
STA $L   ----------------------------------------------------------
         m1l2l3l4l5l6l7l8|0      |i1i2i3i4i5i6i7i8|m1l2l3l4l5l6l7l8

Hurray, ready for step 2 of the C code.

Shift AC

         AC              |carry  |$H              |$L      
         m1l2l3l4l5l6l7l8|0      |i1i2i3i4i5i6i7i8|m1l2l3l4l5l6l7l8
LSR      ----------------------------------------------------------
         0 m1l2l3l4l5l6l7|l8     |i1i2i3i4i5i6i7i8|m1l2l3l4l5l6l7l8

Note that in the present situation ROR would have done the same.

Do step 2 of the C code

         AC              |carry  |$H              |$L      
         0 m1l2l3l4l5l6l7|l8     |i1i2i3i4i5i6i7i8|m1l2l3l4l5l6l7l8
EOR $L   ----------------------------------------------------------
         m1n2n3n4n5n6n7n8|l8     |i1i2i3i4i5i6i7i8|m1l2l3l4l5l6l7l8

Store the final lower byte

         AC              |carry  |$H              |$L      
         m1n2n3n4n5n6n7n8|l8     |i1i2i3i4i5i6i7i8|m1l2l3l4l5l6l7l8
STA $L   ----------------------------------------------------------
         m1n2n3n4n5n6n7n8|l8     |i1i2i3i4i5i6i7i8|m1n2n3n4n5n6n7n8

Do the last step of the C code

Luckily our AC is just ready for this:

         AC              |carry  |$H              |$L      
         m1n2n3n4n5n6n7n8|l8     |i1i2i3i4i5i6i7i8|m1l2l3l4l5l6l7l8
EOR $H   ----------------------------------------------------------
         o1o2o3o4o5o6o7o9|l8     |i1i2i3i4i5i6i7i8|m1l2l3l4l5l6l7l8

Store the final result's high byte

         AC              |carry  |$H              |$L      
         o1o2o3o4o5o6o7o9|l8     |i1i2i3i4i5i6i7i8|m1l2l3l4l5l6l7l8
STA $H   ----------------------------------------------------------
         o1o2o3o4o5o6o7o9|l8     |o1o2o3o4o5o6o7o9|m1l2l3l4l5l6l7l8

And it is done. When iterating, AC and the carry shall be initialized anyway so it is not a problem that we leave some rubbish there.

Our final code thus reads, in a TEDMON-compatible fasion, assuming that the low bit is at $3000 and the high bit is at $3001:

A 2000 LDA $3001
       LSR
       LDA $3000
       ROR
       EOR $3001
       STA $3001
       LDA #$00
       ROR
       EOR $3000
       STA $3000
       LSR
       EOR $3000
       STA $3000
       EOR $3001
       STA $3001
       RTS
       JSR $2000
       BRK

Here it is how it looks like as I type it in TEDMON (in the emulator for the time being):

and here is the complete listing:

We thus have our iteration subroutine at 2000, so after seeding we can do iterations with G 2025, which is the last but one line that is a small main program calling the iteration once.

The command M 3000 3001 can be used for looking at the two bytes as well as seeding; here it is how it looks like:

It seems to work indeed. I hope that it is correct now.

3 Conclusions and outlook

What next? I shall surely sit to my real Plus/4 to try it there, too. After all, this is my first nontrivial assembly code for the Plus/4 some 32 years after my first attempt. I don't easily give up my dreams. It would be nice now to have some plot in the screen on the basis of these random bytes, that will be the next project probably.

But frankly even though it was fun to implement this and there were lessons to learn, I wouldn't say I became a great fan of Assembly. I appreciate highly that we do have high-level languages and one can write efficient codes in them. If I consider the real-life problems I'm solving nowadays, if I were to solve them in Assembly, well… It is nice to think back to the good old days but I'd not go back.