O2 N64 emulator

Well done!
I periodically rethink "is the O2 the same as the Nintendo 64" and "can we unlock the VICE to do something new" - and here you are! It's so cool you have posted this amazing research and gotten this far. Thank you!
As I was reading I was thinking: MIPS Linux patches from the early 2000s - check. The docs - check. Ancient binutils - check. Got running code - that's the breakthrough: game, set and match!

Could you put a bit of this up on GitHub? Binutils and the triangles on desktop.
 
Thanks ! It has certainly been a fun ride so far :) The VICE docs definitely changed the course of the project, from a rather painful reverse-engineering exercise to a more focused effort on the emulator side of things. But I'm still discovering a lot by experimenting, as the docs are sometimes vague or incomplete.

I also believe that the availability of most (if not all ?) of the hardware-related docs opens up a lot of possibilities for doing "something new" with this machine. The kernel sources floating around also proved to be a valuable source of information, especially for kernel driver development, which is not that well documented in some respects.

Regarding sharing my work : for sure, that's the plan. Lately I've been investigating the BSP a lot and now have a better understanding of this peculiar little chip, which will be the topic of my next post. I had to fix some bugs and add some missing features to the original patchset, which was a work in progress, so I'll take some time to iron things out and host the code somewhere. Please note that so far I haven't tested or investigated the binutils MSP patch since, for this emulation project, it's not really useful (for now).
The "push triangles" example would not be very useful as is, since it requires the kernel driver to map the CRIME memory-mapped registers. But if you are interested in experimenting with it, feel free to ping me on the Discord server :)
 
Memory access interception

Now that we have various bits and bobs of experimentation code, it's time to put the pieces together. To use these pieces in some kind of emulation, we need to be able to call various routines when the emulated code writes (or reads) memory regions of interest, specifically memory-mapped control registers.

As I mentioned in the first post, there are various options to handle this.

The first one that comes to mind is write-protecting the memory segments we're interested in, then handling the page fault, calling our emulation routine, un-protecting the page, writing the data, re-protecting the page, and continuing execution. I didn't test this scenario, both because it feels terribly slow and a bit painful, and because it can't intercept read access, which is also required (unless you disable read access, not sure if it's possible). It would also require protecting full memory segments, whereas we want finer-grained control to catch accesses to specific registers.

The second possibility would be to patch the code to jump to an emulation routine for each memory access we're interested in. This is probably the best scenario in terms of performance, but also the most complex, because it means detecting accesses to specific memory regions by statically analyzing the assembly code, which is not trivial at all. More on the challenges of static patching later.

The third one involves memory watchpoints and debugging tooling. I've explored various possibilities kernel side, but the most obvious and simple way seems to be the procfs filesystem. This pseudo filesystem provides userland access to information about running processes. More interestingly, it also provides some debugging facilities, especially the memory watchpoints we're interested in. Let's dig into that option, and put some code together to test it.

First, we need to set up a memory region to watch, then install a SIGTRAP signal handler to catch the signal generated by the kernel when a watchpoint is hit, and finally set up the specific watchpoints we want to catch :
C:
    // The memory we're interested in
    unsigned char test[32];

    // Open a file descriptor to our own process in the procfs
    char proc_file[32];
    mypid = getpid();
    snprintf(proc_file, 32, "/proc/%010d", mypid);
    proc_fd = open(proc_file, O_RDWR);

    // Set a SIGTRAP handler, and get some infos in our handler
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigtrap_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGTRAP, &sa, 0);

    // Setup 2 watchpoints, 1 byte each, one for write access and one for read access
    prwatch_t prw;
    prw.pr_vaddr = test;
    prw.pr_size = 1;
    prw.pr_wflags = MA_WRITE;
    int res = ioctl(proc_fd, PIOCSWATCH, &prw);

    prw.pr_vaddr = test + 1;
    prw.pr_size = 1;
    prw.pr_wflags = MA_READ;
    res = ioctl(proc_fd, PIOCSWATCH, &prw);

The signal handler could look like this :

C:
void sigtrap_handler(int sig, siginfo_t* siginfo, ucontext_t* ucontext)
{
    printf("Got SIGTRAP SIGNO %d SICODE %d ERRNO %d ADDR %X EPC %X\n",
        siginfo->si_signo, siginfo->si_code, siginfo->si_errno, siginfo->si_addr, ucontext->uc_mcontext.__gregs[CTX_EPC]);
<snip>
}

Now, the next step is to get the memory address that generated the SIGTRAP so we can call the right routine. According to the documentation, we are supposed to get the accessed memory address in the si_addr member of the siginfo_t structure. Unfortunately my experiments showed that we don't get anything there :(.

Fortunately, we can get valuable info from the ucontext parameter of the handler. This ucontext_t structure contains the full user context (general registers, floating point registers, exception PC and a few other things) at the time the exception was raised.
The first useful piece of info is the exception PC, because it lets us fetch the specific instruction that raised the exception. Then, the user context provides everything else we need to complete the puzzle, that is all the registers at the time of the exception.
So, by decoding the instruction we can tell whether the access was a read or a write, and with the registers we can compute the exact address of the memory access. Now our SIGTRAP handler could look like :

C:
void sigtrap_handler(int sig, siginfo_t* siginfo, ucontext_t* ucontext)
{
    printf("Got SIGTRAP SIGNO %d SICODE %d ERRNO %d ADDR %X EPC %X\n",
        siginfo->si_signo, siginfo->si_code, siginfo->si_errno, siginfo->si_addr, ucontext->uc_mcontext.__gregs[CTX_EPC]);
    unsigned int insn = *(unsigned int *)(ucontext->uc_mcontext.__gregs[CTX_EPC]);

    unsigned char base, rt;
    short offset;
    switch (insn >> 26)
    {
        // Read
        case 0b100100: // LBU
            base = (insn >> 21) & 0b11111;
            rt = (insn >> 16) & 0b11111;
            offset = insn & 0xFFFF;
            printf("LBU $%d, %d($%d)\n", rt, offset, base);
            printf("%d <- %X\n", ucontext->uc_mcontext.__gregs[rt], ucontext->uc_mcontext.__gregs[base] + offset);
            // Now we know it's read access into rt register, at ucontext->uc_mcontext.__gregs[base] + offset location
        break;
        // Write
        case 0b101000: // SB
            base = (insn >> 21) & 0b11111;
            rt = (insn >> 16) & 0b11111;
            offset = insn & 0xFFFF;
            printf("SB $%d, %d($%d)\n", rt, offset, base);
            printf("%X <- %d\n", ucontext->uc_mcontext.__gregs[base] + offset, ucontext->uc_mcontext.__gregs[rt]);
            // Now we know it's write access to ucontext->uc_mcontext.__gregs[base] + offset location, from rt register
        break;
    }
}
(printf and syscalls are very very bad in signal handlers, I promise not to do it anymore !)
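The handler above only knows LBU and SB. As a sketch of how the same decoding could be generalized, the helper below classifies the common MIPS byte / halfword / word loads and stores and computes the effective address from a register array. The type and function names (`mem_access`, `decode_mem_access`, `effective_address`) are made up for this example, and the registers are assumed to be 64-bit values; adjust this to whatever the real `__gregs` element type is.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { ACCESS_NONE, ACCESS_READ, ACCESS_WRITE } access_kind;

typedef struct {
    access_kind kind;   /* read, write, or not a recognised load/store */
    unsigned size;      /* access width in bytes */
    unsigned base;      /* index of the base register */
    unsigned rt;        /* index of the source/destination register */
    int16_t offset;     /* sign-extended 16-bit displacement */
} mem_access;

/* Decode an I-type MIPS load/store: opcode[31:26] base[25:21] rt[20:16] offset[15:0]. */
static mem_access decode_mem_access(uint32_t insn)
{
    mem_access a;

    a.kind = ACCESS_NONE;
    a.size = 0;
    a.base = (insn >> 21) & 0x1F;
    a.rt = (insn >> 16) & 0x1F;
    a.offset = (int16_t)(uint16_t)(insn & 0xFFFF);

    switch (insn >> 26) {
    case 0x20: /* LB  */
    case 0x24: /* LBU */ a.kind = ACCESS_READ;  a.size = 1; break;
    case 0x21: /* LH  */
    case 0x25: /* LHU */ a.kind = ACCESS_READ;  a.size = 2; break;
    case 0x23: /* LW  */ a.kind = ACCESS_READ;  a.size = 4; break;
    case 0x28: /* SB  */ a.kind = ACCESS_WRITE; a.size = 1; break;
    case 0x29: /* SH  */ a.kind = ACCESS_WRITE; a.size = 2; break;
    case 0x2B: /* SW  */ a.kind = ACCESS_WRITE; a.size = 4; break;
    default: break;      /* anything else is left as ACCESS_NONE */
    }
    return a;
}

/* Effective address = base register + sign-extended offset. */
static uint64_t effective_address(const mem_access *a, const uint64_t *gregs)
{
    return gregs[a->base] + (uint64_t)(int64_t)a->offset;
}
```

In the handler, `decode_mem_access(insn)` would replace the switch, and `effective_address()` would give the address to hand over to the emulation routine. Unaligned and doubleword accesses (LWL/LWR, LD/SD and friends) are not covered here.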

Progress ! But unfortunately we still have an issue on our hands. When the signal handler returns control to the kernel, the process execution resumes where the exception occurred, so we'll get an infinite loop...
We need to find a way to resume execution at the next instruction. I experimented with various things in inline assembly, from jumping back to EPC + 4 to messing with the stack to change the return address, but all this proved to be a dead end. Finally I settled on :
C:
ucontext->uc_mcontext.__gregs[CTX_EPC] += 4;
And it does the trick :)
When the user process returns control to the kernel, the kernel will restore user context from the ucontext_t structure. More interestingly, it will restore PC from the CTX_EPC "register" of the ucontext_t structure.

This is all fine : we caught read/write access at a specific memory location, and were able to get the specific instruction, registers and memory address at access time, but we have one last problem.
If we want our read/write instruction to *effectively* read/write the memory, we can't simply read/write memory from our handler, because it will trigger the watchpoint again and end up in an infinite loop !

Fortunately, the procfs gives us a solution to this issue : we can read() or write() on our procfs file descriptor at an offset equal to the process virtual address, and the memory will be read/written without triggering our watchpoints :)
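A minimal sketch of that read()/write() at an offset could look like the helpers below. The names `proc_open_self`, `proc_peek` and `proc_poke` are made up for this example; the path format is the one used in the setup code earlier, and using lseek() to position the descriptor is an assumption about how "at an offset" is achieved.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Open our own procfs entry, using the same path format as the setup code. */
static int proc_open_self(void)
{
    char path[32];

    snprintf(path, sizeof(path), "/proc/%010d", (int)getpid());
    return open(path, O_RDWR);
}

/* Read len bytes located at virtual address addr through the descriptor.
 * Returns 0 on success, -1 on error or short read. */
static int proc_peek(int fd, unsigned long addr, void *buf, size_t len)
{
    if (lseek(fd, (off_t)addr, SEEK_SET) == (off_t)-1)
        return -1;
    return read(fd, buf, len) == (ssize_t)len ? 0 : -1;
}

/* Write len bytes to virtual address addr through the descriptor.
 * Returns 0 on success, -1 on error or short write. */
static int proc_poke(int fd, unsigned long addr, const void *buf, size_t len)
{
    if (lseek(fd, (off_t)addr, SEEK_SET) == (off_t)-1)
        return -1;
    return write(fd, buf, len) == (ssize_t)len ? 0 : -1;
}
```

In the SIGTRAP handler, a decoded SB would then become a `proc_poke()` of one byte at the effective address, and an LBU a `proc_peek()` followed by storing the result into the destination register of the saved context.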

All in all, this gives us all we need to intercept reads and writes and trigger the right emulation handlers. This is probably *very* slow because of the context switches, but that will do for now !
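To give an idea of the "trigger the right emulation handlers" part, here is a small sketch of an address-to-handler dispatch table. Everything in it is hypothetical: the region base `0x1000`, the `demo_*` callbacks and the 32-bit register width are placeholders, not actual N64 or O2 register addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One emulated register block: an address range plus its callbacks. */
typedef struct {
    uint32_t base;
    uint32_t size;
    uint32_t (*on_read)(uint32_t offset);
    void (*on_write)(uint32_t offset, uint32_t value);
} reg_region;

/* Placeholder callbacks standing in for real emulation routines. */
static uint32_t demo_last_offset, demo_last_value;

static uint32_t demo_read(uint32_t offset)
{
    return 0xABCD0000u | offset;
}

static void demo_write(uint32_t offset, uint32_t value)
{
    demo_last_offset = offset;
    demo_last_value = value;
}

static const reg_region demo_regions[] = {
    { 0x1000u, 0x10u, demo_read, demo_write },   /* hypothetical block */
};

/* Find the region containing addr, or NULL. The unsigned subtraction wraps
 * around when addr < base, so a single comparison covers both bounds. */
static const reg_region *find_region(const reg_region *table, size_t count,
                                     uint32_t addr)
{
    for (size_t i = 0; i < count; i++)
        if (addr - table[i].base < table[i].size)
            return &table[i];
    return NULL;
}

/* Return 1 if a handler took the access, 0 if the address is not emulated. */
static int dispatch_read(const reg_region *table, size_t count,
                         uint32_t addr, uint32_t *value)
{
    const reg_region *r = find_region(table, count, addr);

    if (r == NULL || r->on_read == NULL)
        return 0;
    *value = r->on_read(addr - r->base);
    return 1;
}

static int dispatch_write(const reg_region *table, size_t count,
                          uint32_t addr, uint32_t value)
{
    const reg_region *r = find_region(table, count, addr);

    if (r == NULL || r->on_write == NULL)
        return 0;
    r->on_write(addr - r->base, value);
    return 1;
}
```

The signal handler would call `dispatch_read()` or `dispatch_write()` with the decoded address, and fall back to a plain procfs read/write when the function returns 0.
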
I almost have a build of binutils 2.14 with the BSP patch applied (hand fixed), but the compiler doesn't like the lvalue use, so I will need more time. I tried MIPSpro and the ancient patch, but gave in to modern Linux tools and then hand fixing. My goal is to print an asm listing from a BEX file from vicetre, as you demoed midway into your post. My second goal is to put the VICE into a loop as you did in later posts.

Ultimate goal is to do a vector operation.
 
Down the rabbit hole 1/2

This is a long overdue post regarding my findings about the BSP "BitStream Processor".

As its name implies, this chip is tailored for processing bitstreams of data, and provides a few interesting features for doing so. First peculiarity : this is a 16-bit processor, with 16-bit instructions, registers and address space.
Besides a few classic MIPS-style Load/Store, Arithmetic/Logical and Branch instructions, the instruction set also provides many instructions dedicated to stream processing and encoding / decoding :

  • getBits(q) rD, N : Fetch N bits from the bitstream registers and put these into a general purpose register
  • probeBits rD, N : Same as getBits but do not shift the input stream
  • ShiftStream(N.q) : Shift the input stream N bits but do not copy in GPR (discard)
  • leaf_run_level_parse(q) : Performs the decoding of VLC data in the bitstream (H.261 and MPEG1)
  • block_run_level_parse(q) : Similar to leaf but process 8x8 blocks
  • generic_leaf_parse : Similar to run_level_parse
  • block_run_size_parse(q) : This instruction supports the JPEG compression standards' technique of encoding the 8x8 block of pixel data.
  • code_search(q) : Searches the bitstream until it matches the content of the CMP register
  • load_code_pack(q,p) offset : Used for Huffman encoding of an 8x8 block of data
  • load_code_packH261 : Similar to load_code_pack, H.261 specific
  • generic_lookup_pack rT : Used to perform Huffman encoding of information other than the 8x8 blocks of DCT coefficients
  • pack_bitstream(q) L, rT : Used for encoding bitstream, append to the end of the outgoing bitstream
  • byte_align : Forces byte alignment of the bitstream
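
To make the first few entries more concrete, here is a small software model of the getBits / probeBits / ShiftStream / byte_align semantics as described in the list above. It is only a sketch: the MSB-first bit order, the 16-bit result and the `bit_reader` structure are assumptions for illustration, not something taken from the VICE docs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *data;  /* backing bytes of the stream */
    size_t size;          /* number of bytes in data */
    size_t bit_pos;       /* next unread bit, counted from the stream start */
} bit_reader;

/* probeBits: return the next n bits (n <= 16), MSB first, without consuming
 * them. Bits past the end of the buffer read as 0 in this model. */
static uint16_t probe_bits(const bit_reader *br, unsigned n)
{
    uint32_t value = 0;

    for (unsigned i = 0; i < n; i++) {
        size_t pos = br->bit_pos + i;
        unsigned bit = 0;

        if (pos / 8 < br->size)
            bit = (br->data[pos / 8] >> (7 - pos % 8)) & 1u;
        value = (value << 1) | bit;
    }
    return (uint16_t)value;
}

/* ShiftStream: discard n bits. */
static void shift_stream(bit_reader *br, unsigned n)
{
    br->bit_pos += n;
}

/* getBits: probe, then shift. */
static uint16_t get_bits(bit_reader *br, unsigned n)
{
    uint16_t value = probe_bits(br, n);

    shift_stream(br, n);
    return value;
}

/* byte_align: skip to the next byte boundary if not already on one. */
static void byte_align(bit_reader *br)
{
    br->bit_pos = (br->bit_pos + 7) & ~(size_t)7;
}
```

One deliberate simplification: this model returns zero bits when it runs past the end of the buffer, whereas a hardware FIFO has no "end" and behaves differently when it runs dry.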

I don't know much about media codecs, and won't have any use for most of these instructions, but some of them may be interesting for my planned use case, which is parsing the Display Lists produced by the MSP running the N64 microcode. Speaking of microcodes, they are usually provided in 2 different flavors : the regular DMEM variant and the FIFO variant.

The regular variant will output RDP commands into a DMEM buffer which can hold up to 6 commands. When the buffer is full the RSP will stall until there is space available in the buffer. For our display list parsing application, this would mean fetching from DMEM with the BSP Load instructions.
The FIFO variant (the preferred one, if I'm not wrong) will output its RDP commands to RDRAM (system memory) into a user-sized FIFO buffer, and the RDP will fetch DPs from there. This case would be trickier for the BSP because its Load/Store instructions can only access VICE's DMEM, and any access to system memory must be done by the DMA engine, but the BSP has no access to it :( .

This is where our stream processing instructions come in handy ! The doc mentions that the bitstream searching / consuming / providing instructions operate on a FIFO buffer that is illustrated in this diagram :

[Image: 1740767143456.png (diagram of the bitstream FIFO buffer)]


There are two of these buffers, one for input and one for output, and they are supposed to be filled / emptied by the VICE DMA. Unfortunately the doc does not really explain how to set up the DMA engine for FIFO bitstream processing, so it's time to experiment with the BSP :sneaky:

For this, we will need :
Some BSP Code
Code:
.macro LOAD reg, value
    lih \reg \value >> 8
    lil \reg \value & 0xff
.endm

.text
begin:
LOAD r6 0 ;iterator
LOAD r4 0x9000 ;DRAM_C

loop:
getbitsi puke r3 0xf
copyto rpage r4
sh r3 0
addi r4 2
addi r6 1
cmpi r6 112 ; 0xE0 / 2
bne loop
nop

done:
break
This code will fetch 224 bytes of data from the bitstream and simply copy it to VICE DRAM bank C (0x9000)

And, as usual, a support C Host program that will perform the following tasks :
  1. Open the VICE device, reset the BSP chip
  2. Clear BSP IRAM and VICE DRAM to 0, to make sure the read values are from the current run
  3. Load the BSP code binary to BSP IRAM
  4. Set up buffers for DMA and load some known data pattern in the buffers
  5. Set up the VICE DMA engine for a transfer from the data buffers to the BSP input FIFO
  6. Trigger BSP execution
  7. Dump DMA engine status and some control registers to compare before / after
For the first run I'll intentionally set up the DMA engine to transfer more data than the BSP will process, and see what happens :
Code:
$ ./run_bsp test_bsp_fifo /tmp/testdata 0 0x100
regs 4000000
BSP reset
loop halt
loop halt reset
BSP fill IRAM
VICE fill DRAM
Read bsp code size 42
BSP data size 412
Read Offset 0 datasize 100
000000 17 38 17 39 18 30 18 31 18 32 18 33 18 34 18 35  .8.9.0.1.2.3.4.5
000010 18 36 18 37 18 38 18 39 19 30 19 31 19 32 19 33  .6.7.8.9.0.1.2.3
000020 19 34 19 35 19 36 19 37 19 38 19 39 20 30 20 31  .4.5.6.7.8.9 0 1
000030 20 32 20 33 20 34 20 35 20 36 20 37 20 38 20 39   2 3 4 5 6 7 8 9
VICE DMA_MEM_PT_CH1 00000000
VICE DMA_VICE_PT_CH1 8D82BE3F
VICE DMA_COUNT_CH1 0000
DMA RUN
BSP_IN_COUNT 1180
BSP_AVALID_BITS 0
BSP_FVALID_BITS 1048592
BSP run
PC 0000
PC 000A
PC 001E
PC 001E
...
EPC 18 Cause 4 BREAK
DMA Done : DMA Not complete
DMA Error : No DMA error has occurred
DMA Active : DMA is running
DMA R/W : Read descriptor
DMA Descriptor : DMA working on or pointing to First Descriptor Set
DMA Status Code : DMA moving data on internal VICE bus
VICE DMA_MEM_PT_CH1 00810000
VICE DMA_VICE_PT_CH1 7800
VICE DMA_COUNT_CH1 0001
BSP_IN_COUNT 1236
BSP_AVALID_BITS 0
BSP_FVALID_BITS 1048592
Halt 2 4
000000 00 00 00 00 00 01 02 03 04 05 06 07 08 09 10 11  ................
000010 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27  ........ !"#$%&'
000020 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43  ()0123456789@ABC
000030 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59  DEFGHIPQRSTUVWXY
000040 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75  `abcdefghipqrstu
000050 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91  vwxy............
000060 92 93 94 95 96 97 98 99 10 30 10 31 10 32 10 33  .........0.1.2.3
000070 10 34 10 35 10 36 10 37 10 38 10 39 11 30 11 31  .4.5.6.7.8.9.0.1
000080 11 32 11 33 11 34 11 35 11 36 11 37 11 38 11 39  .2.3.4.5.6.7.8.9
000090 12 30 12 31 12 32 12 33 12 34 12 35 12 36 12 37  .0.1.2.3.4.5.6.7
0000a0 12 38 12 39 13 30 13 31 13 32 13 33 13 34 13 35  .8.9.0.1.2.3.4.5
0000b0 13 36 13 37 13 38 13 39 14 30 14 31 14 32 14 33  .6.7.8.9.0.1.2.3
0000c0 14 34 14 35 14 36 14 37 14 38 14 39 15 30 15 31  .4.5.6.7.8.9.0.1
0000d0 15 32 15 33 15 34 15 35 15 36 15 37 15 38 00 00  .2.3.4.5.6.7.8..
0000e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

A few interesting things to note from the output above :
  • The BSP program went to completion (cause BREAK)
  • The memory dumps of our input data buffers and of the output data (VICE DRAM) are identical for the first 224 bytes, so the copy loop went OK
  • The status register BSP_IN_COUNT went from 1180 to 1236 double words, that is 1236 - 1180 = 56 double words, or 56 * 4 = 224 bytes processed, as expected
  • The most interesting here is the DMA engine status :
    • DMA Active : DMA is running
    • DMA Status Code : DMA moving data on internal VICE bus
Which means that the DMA engine is still running and processing our descriptor.

Now for the 2nd run, I'll set up the DMA engine to transfer less data than the BSP will process :
Code:
$ ./run_bsp test_bsp_fifo /tmp/testdata 0 40
regs 4000000
BSP reset
loop halt
loop halt reset
BSP fill IRAM
VICE fill DRAM
Read bsp code size 42
BSP data size 412
Read Offset 0 datasize 40
000000 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15  ................
000010 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31  .... !"#$%&'()01
000020 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47  23456789@ABCDEFG
000030 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63  HIPQRSTUVWXY`abc
VICE DMA_MEM_PT_CH1 00000000
VICE DMA_VICE_PT_CH1 8D82BE3F
VICE DMA_COUNT_CH1 0000
DMA RUN
BSP_IN_COUNT 2132
BSP_AVALID_BITS 0
BSP_FVALID_BITS 1048592
BSP run
PC 0000
PC 000E
PC 000C
PC 000C
PC 000C
...
EPC 8 Cause 0 Unknown
DMA Done : DMA Complete
DMA Error : No DMA error has occurred
DMA Active : DMA is not running
DMA R/W : Read descriptor
DMA Descriptor : DMA working on or pointing to First Descriptor Set
DMA Status Code : DMA Halted from DMA Halt bit in Descriptor
VICE DMA_MEM_PT_CH1 00810000
VICE DMA_VICE_PT_CH1 7800
VICE DMA_COUNT_CH1 0001
BSP_IN_COUNT 2148
BSP_AVALID_BITS 0
BSP_FVALID_BITS 0
Halt 2 0
000000 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15  ................
000010 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31  .... !"#$%&'()01
000020 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47  23456789@ABCDEFG
000030 48 49 50 51 52 53 54 55 56 57 00 00 00 00 00 00  HIPQRSTUVW......
000040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
000060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
000070 00 00 00 00 00 00 00 00 00 00 00 6e 00 00 00 00  ...........n....
000080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0000a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0000c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0000d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0000e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

This time, we can see that the DMA engine is Done processing its descriptor (DMA Done : DMA Complete, DMA Status Code : DMA Halted from DMA Halt bit in Descriptor), but the BSP is hanging at the GetBits instruction.

This is a very cool feature of the BSP working in tandem with the DMA engine : the DMA engine waits for the FIFO to have room available before inserting more data and, conversely, the GetBits instruction hangs until there is data available. Hopefully we can expect a similar behavior for the output FIFO to transfer data from VICE DRAM to system memory.
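
The handshake observed in the two runs can be mimicked with a tiny software model: a bounded FIFO where the producer (standing in for the DMA engine) stalls when the FIFO is full, and the consumer (standing in for GetBits) stalls when it is empty. The FIFO depth, the one-word-per-step granularity and all names below are assumptions made for the sake of the example, not hardware facts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_CAPACITY 8   /* hypothetical depth, in 16-bit words */

typedef struct {
    uint16_t slots[FIFO_CAPACITY];
    size_t head;    /* index of the oldest word */
    size_t count;   /* number of words currently stored */
} fifo;

/* Returns 1 if the word was queued, 0 if the producer has to stall. */
static int fifo_push(fifo *f, uint16_t word)
{
    if (f->count == FIFO_CAPACITY)
        return 0;
    f->slots[(f->head + f->count) % FIFO_CAPACITY] = word;
    f->count++;
    return 1;
}

/* Returns 1 if a word was fetched, 0 if the consumer has to stall. */
static int fifo_pop(fifo *f, uint16_t *word)
{
    if (f->count == 0)
        return 0;
    *word = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count--;
    return 1;
}

/* Run producer and consumer until neither can make progress.
 * dma_words: words the "DMA" has to deliver; wanted: words the "BSP" reads.
 * Returns the number of words consumed. */
static size_t simulate(size_t dma_words, size_t wanted,
                       int *consumer_starved, size_t *dma_left)
{
    fifo f = {{0}, 0, 0};
    size_t produced = 0, consumed = 0;

    for (;;) {
        int progress = 0;
        uint16_t word;

        /* Producer side: fill the FIFO while there is room and data left. */
        while (produced < dma_words && fifo_push(&f, (uint16_t)produced)) {
            produced++;
            progress = 1;
        }
        /* Consumer side: one word per step, as long as it still wants data. */
        if (consumed < wanted && fifo_pop(&f, &word)) {
            consumed++;
            progress = 1;
        }
        if (!progress)
            break;
    }
    *consumer_starved = consumed < wanted;
    *dma_left = dma_words - produced;
    return consumed;
}
```

With more data queued than the consumer wants, the model ends with the consumer done and the producer stuck with words still pending, as in the first run; with less data, the producer finishes and the consumer is left waiting, as in the second run.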

What I did not mention in this post is that being able to build a proper BSP binary executable was not really straightforward, to say the least, and that will be covered in the next post.
 
Down the rabbit hole 2/2

In one of my previous posts I mentioned both the binutils patch adding BSP support, and my inability to do more than a simple "break" BSP program.
Well, I should have paid more attention to this little note from the original authors :rolleyes:
currently including patches to binutils to add instructions support for the VICE Media Streaming Processor and Bitstream Streaming Processor, and a library with sample code. Just started and very EXPERIMENTAL.

To be able to produce proper BSP code as illustrated in my previous post, I had to :
  • Port the patch from an unknown CVS revision to release 2.14
  • Fix a few issues for the ADDI, LLH and LSH instructions
  • Fix a mask issue for the f-RD-CD operands instructions, that affected the bitstreams instructions I was interested in (GetBits & co) and a few others
  • Add support for PC relative branch instructions (e.g. conditional branch to label)
For the sake of having a "complete" VICE binutils, I also ported the (much simpler) MSP patch from binutils 2.13 to binutils 2.14.
As a sanity check for the MSP patch, I compared a 2002 binary provided by the libvice folks to the output of the newly compiled binutils, with the following result :
Code:
$ diff -u /tmp/msp_code.ref /tmp/msp_code.test
--- /tmp/msp_code.ref    2025-02-28 14:36:38.512631023 +0100
+++ /tmp/msp_code.test    2025-02-28 14:36:46.763711170 +0100
@@ -8,7 +8,7 @@
 0000070 814c 0008 a000 2108 2130 ffff 814c 0010
 0000080 884c 0020 894c 0018 8a4c 0028 864c 0030
 0000090 874c 0038 0124 1100 c144 0080 4144 0088
-00000a0 2130 0100 2010 feff 4144 0088 00ca 0020
+00000a0 2130 0000 2010 feff 4144 0088 00ca 0020
 00000b0 01ca 0120 02ca 0220 03ca 0320 04ca 0420
 00000c0 05ca 0520 06ca 0620 07ca 0720 08ca 0820
 00000d0 09ca 0920 0aca 0a20 0bca 0b20 0cca 0c20
@@ -34,9 +34,9 @@
 0000210 814c 0008 a000 2108 2130 ffff 814c 0010
 0000220 884c 0020 894c 0018 8a4c 0028 864c 0030
 0000230 874c 0038 0124 1100 c144 0080 4144 0088
-0000240 2130 0100 2010 feff 4144 0088 7322 1000
+0000240 2130 0000 2010 feff 4144 0088 7322 1000
 0000250 ad21 0100 b115 7cff 0000 0000 7402 2298
 0000260 8002 2160 0c00 0061 6c02 2098 ce21 0100
 0000270 d215 73ff 0000 0000 0124 0100 c144 00b8
-0000280 0100 0d00 0000 0000 0000 0000 0000 0000
-0000290
+0000280 0100 0d00                            
+0000284

The only difference (apart from the padding) is an ADDI instruction that was obviously wrong in the 2.13 version and fixed in 2.14, so nothing directly related to the MSP patch.

All of this is still a work in progress, PRs are welcome on the GitHub repo :)

If you're interested in building this, please make sure to specify the correct target at configure time :
Code:
$ ./configure --target=bsp
For the BSP build, and
Code:
$ ./configure --target=mips
For the MSP build.

To compile MSP stuff, you'll also have to add the "-mdmx" flag on "as" command line invocation, or add a ".set mdmx" in the ASM listing.
 
Oh wow. This makes me want to get an O2 running.

Awesome stuff Bplaa. I look forward to more developments!
 
