What I've been up to

onre

SYS 64738
Feb 8, 2019
121
56
28
Toijala, Finland
Hi, my name is Erno and I like to make computers do stuff. I figured that I might as well make a journal-like thread of my ventures in the barren lands of almost-POSIX operating systems. I try to update this thing often enough.

"Why?", you may ask. Well, why does anyone climb a mountain or run a marathon? Because they want to.

Currently I'm playing with an old version of Firefox, 17 ESR. I started this sometime last summer already, choosing the version by simple mathematics: the last supported version was from 2006, now is (was) 2019, so let's take something from the middle and see how painful that is. FF has some dependencies, but the only one that required more work was GLib 2. Now I have GTK+ 2, Cairo, Pango and other bits and pieces in place. I also compiled Xrender from the latest Xorg because FF refuses to build if Cairo doesn't have the Xrender backend enabled. I already got the build to go through once, only to be greeted by all the binaries linked against libxul.so dumping core.

I had my suspicions about this and started fresh with the excellent 9.2.0 + hammy's toolchain. Because my last 8.2.0 used setjmp/longjmp exceptions + unwinding, this meant recompiling all the dependencies to make sure everything works. After that was done, I came across an interesting bug where using both -pthread and -lpthread made binaries dump core - our best guess was something like constructors being executed twice. Someone with ungodly amounts of persistence could take a look here - meanwhile I just make sure only one of the flags is passed :D

There's a couple of other things in the queue, too. Next Thursday I'm visiting hacklab and bringing home my old Octane CPU module, which should finally allow me to set up an SGI workstation at home. This would mean that I'd finally be able to run graphical SGI software without either a 100-kilometer drive or an X11 forward. Thus, GAMES. SDL2 and all that. We'll see.
 

onre

As SGI X servers don't have Xrender, I was thinking of this: one could write an Xrender-like library that emulates it in OpenGL. o_O
 

Elf

Storybook
Feb 4, 2019
253
57
28
I am continually impressed by what you manage to do on an SGI remotely! Very excellent work, and I can count at least a few supposedly "impossible" things on that list :)
 

onre

Thanks for your kind words. My actual strong points as a programmer tend to be not believing what others say combined with extreme stubbornness.

Anyway, had some really fun times with the Firefox build. As you may or may not know, back when GCC still had official IRIX support, the common way of running it was with the IRIX linker and assembler instead of the GNU equivalents. However, when I started with this stuff, my only feasible option was crossbuilding, thus I hacked together a somewhat-working toolchain with GNU binutils as I couldn't run IRIX bits on a PC.

So, because of all this, GNU binutils was never tested very extensively on IRIX. The version now in sgug-rse is rather solid - with one exception. Basically: if you link with libstdc++ you must make sure its init code is run before libpthread is loaded, or kaboom - a core dump, possibly with a corrupt stack and thus backtrace, when loading the binary. After figuring this out I just took care of having the libraries load in correct order, and got stuff like plugin-container to no longer dump core. If you're ever experiencing something similar, use readelf to examine library load order in the binary. sgug-rse binutils will eventually be revisited, but for now manual intervention is the way to go.
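A rough sketch of that check in Python, assuming the usual `readelf -d` text format for NEEDED entries. The sample text and the library names in it are made up for illustration, and the order of NEEDED entries is only taken here as a proxy for the load order:

```python
import re

# Matches lines such as:
#  0x00000001 (NEEDED)   Shared library: [libstdc++.so.6]
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library:\s+\[([^\]]+)\]")

def needed_libraries(readelf_output):
    """Return the NEEDED library names in the order readelf lists them."""
    return NEEDED_RE.findall(readelf_output)

def index_of(libs, prefix):
    """Position of the first library whose name starts with prefix, or None."""
    for position, name in enumerate(libs):
        if name.startswith(prefix):
            return position
    return None

def listed_before(libs, first_prefix, second_prefix):
    """True if first_prefix is listed before second_prefix (or either is absent)."""
    first = index_of(libs, first_prefix)
    second = index_of(libs, second_prefix)
    if first is None or second is None:
        return True  # nothing to compare
    return first < second

# Hypothetical sample; real input would be the output of `readelf -d <binary>`.
SAMPLE = """\
 0x00000001 (NEEDED)                     Shared library: [libstdc++.so.6]
 0x00000001 (NEEDED)                     Shared library: [libpthread.so]
 0x00000001 (NEEDED)                     Shared library: [libc.so.1]
"""

libs = needed_libraries(SAMPLE)
print(libs)
print(listed_before(libs, "libstdc++", "libpthread"))
```

A binary where `listed_before` comes back False would be a candidate for the manual reordering described above.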

Once I get this particular ordeal to a point where I can think of something else, I think I'll take a stab at CMake. Or possibly something else.
 

onre

During the past couple of days, a lot of fun has been had.

Disclaimers: this post contains a lot of guessing and assuming, which I won't state separately each time, so if you know something better, please do tell. Educated guesses are welcome, too! Also, most of the C++ terminology is guessed as my knowledge of the language is pretty much helloworld level. We could also say that I've never had to dive this deep before.

Firefox 17 consists mostly of a huge library called libxul.so, which is linked against just about every dependency on the system. I had trouble with the library loading - a test executable can either link against it or dlopen() it, access all the symbols in it and everything works fine. However, when I try this with the actual browser binary, no cigar but instead a core dump. gdb can't see anything, we just end up in a certain memory address without anything resembling a backtrace. All rld.debug can tell me is that the last thing to happen is libxul.so initialization code being run. So, let's get to business and start to debug for real.

First, to give an idea about the sheer size of the thing, here are some selected bits of readelf output for your entertainment.

Code:
  Start of program headers:          52 (bytes into file)
  Start of section headers:          529194088 (bytes into file)
...
Relocation section '.rel.dyn' at offset 0x32124 contains 312286 entries:
Linking the thing takes somewhere around 3-4 minutes on the machine I'm using.

There was an initial educated guess from @hammy about this probably being a class constructor somewhere acting up. Single-stepping through the firefox binary revealed this to be rather likely, as the crash happens right after the dlopen() call. The problem is the backtrace looking like this:

Code:
Program received signal SIGSEGV, Segmentation fault.
0x61c2e980 in ?? ()
(gdb) bt
#0  0x61c2e980 in ?? ()
#1  0x060fea44 in ?? ()
warning: GDB can't find the start of the function at 0x60fea42.
No symbols, no nothing. I just dismissed this as a corrupted stack or something similar. Turns out it wasn't that - read on.

At this point it was pretty much just shots in the dark as I had no real idea on where to look next. First idea was to confirm that the C++ constructors are actually executed - so let's add a debug print in one of them. Try it out with the test executable and, sure enough, they are executed. What's up with that? Try adding some thread code to the test executable and execute it before dlopen() just to make sure it's not some init order thing. No, it's not, everything works fine.

Having spent a considerable amount of time looking at gdb output with the test executable I became rather familiar with the address range of functions in the library. At that moment a chain of flashbulbs was lit in some dark chamber of my brain - the crash location when dlopening the library from firefox is within the same range, but there's no symbol associated with it. I'm not sure how exactly I came to this additional conclusion, but I also noticed that with the test binary the library gets loaded at the exact same addresses where it is "naturally", as in, where the initial GOT values point etc. Fire up gdb, load the library in it, type x/20i 0x61c2e980 and be greeted with this:

Code:
   0x61c2e980 <__do_global_ctors_aux>:  addiu   sp,sp,-32
   0x61c2e984 <__do_global_ctors_aux+4>:        sd      gp,16(sp)
   0x61c2e988 <__do_global_ctors_aux+8>:        lui     gp,0x3d
   0x61c2e98c <__do_global_ctors_aux+12>:       addu    gp,gp,t9
   0x61c2e990 <__do_global_ctors_aux+16>:       addiu   gp,gp,-3444
   0x61c2e994 <__do_global_ctors_aux+20>:       lw      v1,13608(gp)
   0x61c2e998 <__do_global_ctors_aux+24>:       li      v0,-1
   0x61c2e99c <__do_global_ctors_aux+28>:       sd      ra,24(sp)
   0x61c2e9a0 <__do_global_ctors_aux+32>:       lw      t9,476(v1)
   0x61c2e9a4 <__do_global_ctors_aux+36>:       sd      s1,8(sp)
   0x61c2e9a8 <__do_global_ctors_aux+40>:       beq     t9,v0,0x61c2e9cc <__do_global_ctors_aux+76>
...
Ta-fucking-duh! The __do_global_ctors_aux function is a part of GCC, found in libgcc/crtstuff.c. During the GCC build process, this file is compiled with different preprocessor flags to produce object files to be linked as parts of the final executable. They are mostly responsible for very early startup and very late cleanup before and after the actual executable code - a good example of such a task would be handling class constructors and destructors.

So, let's try the same disassembly with our crashed firefox in gdb.

Code:
(gdb) x/20i 0x61c2e980
=> 0x61c2e980:  Cannot access memory at address 0x61c2e980
Aha, the library must've been relocated by the runtime linker, then. Sure enough:
Code:
(gdb) i sh
From        To          Syms Read   Shared Object Library
...
0x04746e80  0x060fe9f0  No          /usr/people/esp/src/ff17build.sgug/dist/bin/libxul.so
...
Nowhere near the 0x60000000-ish range where it was built. Very first idea: make sure it is compiled as position-independent code - yes it is. So, does this mean that moving the library somehow breaks something and thus we end up trying to jump into darkness? Let's compile the test binary forcing its data section address somewhere so that it's guaranteed to overlap with libxul, thus forcing rld to move it. Ha! The failure mode is exactly the same.

But... given the amount of stuff compiled with this toolchain working just fine, this can't affect everything, can it? Just to make sure and maintain sanity, let's write a very small test library whose constructor sets a property to a value and the only accessible function prints that value, and a small program using that library. Compile, run. Works. Compile with special flags to force the library to be moved by the runtime linker, confirm this with gdb, run. Works.

So what the actual fuck is up with libxul? There's only one way to find out. Something somewhere makes that jump to the wrong place. Let's set a breakpoint at the constructor function address and see what calls it.
Code:
(gdb) break *0x61c2e980
Breakpoint 1 at 0x61c2e980
(gdb) run
Starting program: /usr/people/esp/src/a.out

Breakpoint 1, 0x61c2e980 in __do_global_ctors_aux () from ff17build.sgug/dist/lib/libxul.so
(gdb) bt
#0  0x61c2e980 in __do_global_ctors_aux () from ff17build.sgug/dist/lib/libxul.so
#1  0x61c2ea44 in ?? () from ff17build.sgug/dist/lib/libxul.so
The caller has no symbol information? Interesting. However, it does have an address. Let's revisit what readelf told us.
Code:
  [11] .gcc_init         PROGBITS        61c2e9f0 1c4e9f0 000064 00  AX  0   0  1
That address is in that section. Let's just disassemble the damn init code and see what it does.
Code:
(gdb) x/20i 0x61c2e9f0
   0x61c2e9f0:  daddiu  sp,sp,-16
   0x61c2e9f4:  sd      ra,0(sp)
   0x61c2e9f8:  sd      gp,8(sp)
   0x61c2e9fc:  bal     0x61c2ea04
   0x61c2ea00:  nop
   0x61c2ea04:  move    v0,gp
   0x61c2ea08:  lui     gp,0x3c
   0x61c2ea0c:  addiu   gp,gp,-3556
   0x61c2ea10:  addu    gp,gp,ra
   0x61c2ea14:  lw      t9,-32716(gp)
   0x61c2ea18:  jalr    t9
   0x61c2ea1c:  nop
   0x61c2ea20:  bal     0x61c2ea28
   0x61c2ea24:  nop
   0x61c2ea28:  move    v0,gp
   0x61c2ea2c:  lui     gp,0x3d
   0x61c2ea30:  addiu   gp,gp,-3612
   0x61c2ea34:  addu    gp,gp,ra
   0x61c2ea38:  lw      t9,13612(gp)
   0x61c2ea3c:  jalr    t9
   0x61c2ea40:  nop
Store stuff on stack, pointer math, jump, more pointer math, jump. We end up in the constructor thing from the latter jump.
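The pointer math can be replayed offline. The sketch below only models the four instructions that matter in each half of the listing (bal, lui, addiu, addu) and assumes 32-bit wraparound; the helper names are made up for illustration. The second result is the value `p/x $gp` reports further down.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed immediate, as addiu does."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def gp_after_bal(bal_address, lui_imm, addiu_imm):
    """Replay bal / lui gp / addiu gp / addu gp,gp,ra from the init code."""
    ra = bal_address + 8                      # bal links to the instruction after its delay slot
    gp = (lui_imm << 16) & 0xFFFFFFFF         # lui gp,imm
    gp = (gp + sign_extend16(addiu_imm)) & 0xFFFFFFFF  # addiu gp,gp,imm
    return (gp + ra) & 0xFFFFFFFF             # addu gp,gp,ra

first_gp = gp_after_bal(0x61c2e9fc, 0x3c, -3556)
second_gp = gp_after_bal(0x61c2ea20, 0x3d, -3612)

print(hex(first_gp))   # 0x61fedc20
print(hex(second_gp))  # 0x61ffdc0c
```

So the two calls in the init code use two different base addresses, which is the detail the rest of this post chases down.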

So, $gp is the global pointer. Let's take a look at where it's pointing when we're just about to jump.
Code:
(gdb) p/x $gp
$1 = 0x61ffdc0c
So what's that address? Let's start by figuring out which section this is, once again from the ELF section listing.
Code:
Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
...
  [20] .got              PROGBITS        61fe5c30 2005c30 03d17c 04 WAp  0   0 16
Global offset table? Right. On MIPS that's accessed by indexed memory operations, such as the lw instruction in our code at 0x61c2ea38. So, this means that we should be able to see this also on readelf output, as it prints out a nicely indexed listing of GOT contents. Let's take a look at index 13612 then.
Code:
  61ff114c  13612(gp) 60849be0
Errm what? This doesn't look like our address. Also, it points to something completely different:
Code:
   0x60849be0 <nsDOMEventTargetHelper::GetListenerManager(bool)>:       addiu   sp,sp,-48
What the actual flying fuck is going on? Hint: look at $gp value and the address on the readelf output row. I have no formal education in mathematics, but according to my understanding x + y > x is universally true. Except not here. $gp + 13612 can't possibly be less than $gp, can it? Let's take another look at what readelf had to say about the global offset table.
Code:
Primary GOT:
 Canonical gp value: 61fedc20
That's not what our $gp is pointing to, not at all... apply brain on full, figure out that the maximum size of a GOT must be around 16 k entries, as one entry takes four bytes and indexed memory access can only reach +- 32 kilobytes from the base register. However, let's revisit what readelf told us about the GOT section. The size of the GOT section is 0x3d17c bytes, or roughly 250 kilobytes in decimal - almost four times that limit. But... the pointer access by the code from the init section sure does look like it's accessing a GOT-like memory area, just using a different base address 0x61ffdc0c for it! Let's poke around a bit.
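The numbers can be checked with a few lines of Python. All constants are copied from the readelf and gdb output above; the only assumptions are 4-byte GOT entries and a signed 16-bit offset in lw, and the region count at the end is just a lower bound, not a statement about how the linker actually splits things.

```python
ENTRY_SIZE = 4                          # bytes per GOT entry (32-bit pointers)
OFFSET_MIN, OFFSET_MAX = -0x8000, 0x7FFF  # signed 16-bit offset of lw

max_entries = (OFFSET_MAX - OFFSET_MIN + 1) // ENTRY_SIZE

got_start = 0x61fe5c30     # .got address from the section headers
got_size = 0x3d17c         # .got size from the section headers
canonical_gp = 0x61fedc20  # "Canonical gp value" from readelf
runtime_gp = 0x61ffdc0c    # $gp seen in gdb before the second jalr

got_entries = got_size // ENTRY_SIZE
min_regions = -(-got_entries // max_entries)  # ceiling division

slot = runtime_gp + 13612              # address read by lw t9,13612(gp)
offset_from_canonical = slot - canonical_gp

print(max_entries)                     # 16384
print(got_size, got_entries)           # 250236 62559
print(min_regions)                     # 4
print(hex(canonical_gp - got_start))   # 0x7ff0
print(hex(slot))                       # 0x62001138
print(got_start <= slot < got_start + got_size)  # True
print(offset_from_canonical > OFFSET_MAX)        # True
```

The slot is inside the .got section but 79128 bytes past the canonical gp, so it can't be reached with a 16-bit offset from that base.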
Code:
(gdb) x/i *(0x61ffdc0c+(13612))
   0x61c2e980 <__do_global_ctors_aux>:  addiu   sp,sp,-32
(gdb) x/i *(0x61ffdc0c+(8192))
   0x61a92fcc <js::Debugger::ScriptQuery::consider(JSScript*, js::GlobalObject*, JS::AutoScriptVector*)>:       addiu   sp,sp,-112
(gdb) x/i *(0x61ffdc0c+(-8192))
   0x61741fec
     <std::_Rb_tree<unsigned long long, std::pair<unsigned long long const, mozilla::layers::LayerTreeState>, std::_Select1st<std::pair<unsigned long long const, mozilla::layers::LayerTreeState> >, std::less<unsigned long long>, std::allocator<std::pair<unsigned long long const, mozilla::layers::LayerTreeState> > >::_M_get_insert_hint_unique_pos(std::_Rb_tree_const_iterator<std::pair<unsigned long long const, mozilla::layers::LayerTreeState> >, unsigned long long const&)>:   addiu   sp,sp,-80
(gdb) x/i *(0x61ffdc0c+(-16384))
   0x61652c18 <evhttp_remove_header>:   addiu   sp,sp,-48
Right, so all of them look like function entry points, so it looks like our toolchain has built another GOT right behind the usual GOT as that got (heh) full. After some Internet digging I found this particularly helpful post from LLVM folks:


I really wish that MIPS wiki link was still available somewhere.

Let's conclude - I have no idea whether this is the right conclusion, but it's a start. Apparently this other GOT and its existence is somehow not understood by the runtime linker and thus we only get relocations for some of the symbols - namely, those that fit in the first, actual GOT.
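The addresses quoted earlier in this post are consistent with that reading. The sketch below assumes that frame #1 in the crashed firefox (0x060fea44) and frame #1 in the test binary (0x61c2ea44) are the same return address inside .gcc_init, once relocated and once not; that pairing is an assumption, the rest is plain arithmetic on values from the gdb output.

```python
LINK_TIME_CALLER = 0x61c2ea44        # frame #1 in the test binary (library not moved)
RUN_TIME_CALLER = 0x060fea44         # frame #1 in the crashed firefox (library moved)
STALE_TARGET = 0x61c2e980            # crash address, __do_global_ctors_aux at link time
TEXT_RANGE = (0x04746e80, 0x060fe9f0)  # libxul.so range from "i sh"

# How far rld moved the library, assuming both callers are the same instruction.
slide = RUN_TIME_CALLER - LINK_TIME_CALLER

# Where the constructor helper would live after the move.
expected_target = STALE_TARGET + slide

print(hex(slide))            # -0x5bb30000
print(hex(expected_target))  # 0x60fe980
print(TEXT_RANGE[0] <= expected_target < TEXT_RANGE[1])  # True
```

In other words, the jump went to the link-time address of the function, as if the GOT slot it was read from had never been adjusted for the move.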

So, what to do about it? That's what I intend to find out during the weekend. I also have to come up with test cases for this with the MIPSpro toolchain to see if they have some way of circumventing this. It isn't impossible that some bit somewhere in the GOT is different and this tells the runtime linker to process additional GOT areas, or maybe we're supposed to have multiple areas marked as PLTGOT in the DYNAMIC section instead of just one (and make sure they're 64 k max), or possibly something else entirely.

Sometimes porting is fixing a mmap call there and adding a putenv there. Sometimes it's this.

TL;DR: figuring stuff out is fun.
 

onre

Instant update!

Code:
esp@arpakuutio ~/src/got-test $ /opt/local/bin/python generate.py > test.c && cc -shared -o libtest.so test.c
ld32: INFO    171: Multigot invoked. Gp relative region broken up into 2 separate regions.
ld32: ERROR   97 : GOT overflow in test.o.
See the dso(5) manpage.
ld32: INFO    152: Output file removed because of error.
Fascinating.
 

onre

It's figured out now.

I used MIPSpro to compile some Python-generated source files guaranteed to overflow a single GOT. It turned out that if the overflow happens within a single function, the IRIX linker can't cope with the situation, so I had to spread the references across multiple functions. Then, to compare, I created a smaller version of the same library, guaranteed to not overflow the GOT. So now I know how this is done!

Here are the relevant parts of the section list as reported by readelf:

Code:
  [ 3] .dynamic          DYNAMIC         00400170 000170 000138 08   A  6   0  4
  [ 4] .dynamic_2        DYNAMIC         004002a8 0002a8 000048 08   A  6   0  4
...
  [14] .got              PROGBITS        0059c000 19c000 00bbfc 04 WAp  0   0  4
  [15] .got_2            PROGBITS        005a7bfc 1a7bfc 005e30 04 WAp  0   0  4
...
Note how we have two DYNAMIC sections instead of one. First one contains, among other things, these:

Code:
 0x70000025 (MIPS_LOCALPAGE_GOTIDX)      0x1
 0x70000025 (MIPS_LOCALPAGE_GOTIDX)      0xf
 0x70000026 (MIPS_LOCAL_GOTIDX)          0x1c
 0x70000027 (MIPS_HIDDEN_GOTIDX)         0x1c
 0x70000028 (MIPS_PROTECTED_GOTIDX)      0x1c
 0x7000000a (MIPS_LOCAL_GOTNO)           28
...
 0x70000030 (MIPS_GP_VALUE)              0x5a3ff0
...
 0x00000003 (PLTGOT)                     0x59c000
...
 0x70000031 (MIPS_AUX_DYNAMIC)           0x4002a8
Of particular interest are the GP value for the .got section access, its definition as a PLTGOT area and a MIPS_AUX_DYNAMIC tag with an address, informing us that there's another DYNAMIC section in there too. Let's look at that other section - I just dumped it as hex because I couldn't figure out how to make readelf parse it.

Code:
Hex dump of section '.dynamic_2':
  0x004002a8 70000030 005afbec 00000003 005a7bfc p..0.Z.......Z{.
  0x004002b8 70000025 00000000 70000025 0000000d p..%....p..%....
  0x004002c8 70000026 0000001a 70000027 0000001a p..&....p..'....
  0x004002d8 70000028 0000001a 7000000a 0000001a p..(....p.......
  0x004002e8 00000000 00000000                   ........
First is the GP value for the .got_2 section, then its definition as another PLTGOT area. We can also see the same tags as in the other section, but with different values. This'll require either documentation or reverse-engineering to figure out what those values are, but the actual mystery is now solved. readelf actually seems to understand multi-GOT to some degree, as here's the switch from first to second GOT in the GOT content listing:

Code:
  005abfe0  32752(gp) 0055b01c 0055b2d4 FUNC    UND fun4348AkxUz
  005abfe4  32756(gp) 0055b034 0055b2ec FUNC    UND fun4349AkxUz
  005abfe8  32760(gp) 0055b04c 0055b304 FUNC    UND fun4350AkxUz
  005abfec  32764(gp) 0055b064 0055b31c FUNC    UND fun4351AkxUz
  005abff0            0055b07c 0055b334 FUNC    UND fun4352AkxUz
  005abff4            0055b094 0055b34c FUNC    UND fun4353AkxUz
  005abff8            0055b0ac 0055b364 FUNC    UND fun4354AkxUz
  005abffc            0055b0c4 0055b37c FUNC    UND fun4355AkxUz
Just had to document this quickly before I forget how it works.
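Since readelf wouldn't parse .dynamic_2, here is a small sketch that decodes the hex dump above by hand. It assumes the section is a plain array of big-endian 32-bit (tag, value) pairs like a regular DYNAMIC section, and the tag names are simply the ones readelf printed for the first section.

```python
import struct

# The words of the '.dynamic_2' hex dump, in file order.
HEX_WORDS = """
70000030 005afbec 00000003 005a7bfc
70000025 00000000 70000025 0000000d
70000026 0000001a 70000027 0000001a
70000028 0000001a 7000000a 0000001a
00000000 00000000
"""

TAG_NAMES = {
    0x00000000: "NULL",
    0x00000003: "PLTGOT",
    0x7000000a: "MIPS_LOCAL_GOTNO",
    0x70000025: "MIPS_LOCALPAGE_GOTIDX",
    0x70000026: "MIPS_LOCAL_GOTIDX",
    0x70000027: "MIPS_HIDDEN_GOTIDX",
    0x70000028: "MIPS_PROTECTED_GOTIDX",
    0x70000030: "MIPS_GP_VALUE",
}

def parse_dynamic(raw):
    """Decode big-endian (tag, value) pairs up to and including the NULL tag."""
    entries = []
    for offset in range(0, len(raw) - 7, 8):
        tag, value = struct.unpack_from(">II", raw, offset)
        entries.append((tag, value))
        if tag == 0:
            break
    return entries

raw = bytes.fromhex("".join(HEX_WORDS.split()))
entries = parse_dynamic(raw)
for tag, value in entries:
    print(f"0x{tag:08x} ({TAG_NAMES.get(tag, 'unknown')}) 0x{value:x}")

# Both GOTs follow the same pattern: gp value = PLTGOT address + 0x7ff0.
values = dict(entries)
print(hex(values[0x70000030] - values[0x00000003]))  # 0x7ff0
print(hex(0x5a3ff0 - 0x59c000))                      # 0x7ff0 (first DYNAMIC section)
```

The 72 bytes decode into nine entries, ending in a NULL tag, which matches the 0x48 section size in the section list.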
 

onre

Not quite there yet. GNU binutils create a different, but still working, GOT without the auxiliary dynamic sections and I've confirmed IRIX rld can relocate this just fine and everything still works.

If anyone ever needs to debug anything like this, have a handy Python script:


It generates a library1 with n functions and n global variables in it, a library2 that acts as a proxy accessing those functions and variables and finally a set of files that can be used to generate a test binary. To avoid GOT overflow for a single object, the generated code has at most 5000 function and variable references per source file. When run, it prints out the commands necessary to build the test libraries and the test application. The test application runs and verifies every function in library1 by calling functions in library2 and making sure the returned values match - this is a sanity check to ensure that the GOT is in sync and the functions called are what the toolchain thinks they are.
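As an illustration of the idea only (this is not the script referred to above), a generator along these lines could look like the sketch below. The names `fun{i}`, `var{i}`, `proxy{i}` and the file names are hypothetical, and it only produces the C sources for library1 and the proxy library2, not the test binary or the build commands.

```python
def chunks(count, size):
    """Yield ranges covering 0..count-1 in pieces of at most size items."""
    for start in range(0, count, size):
        yield range(start, min(start + size, count))

def library1_source(n):
    """C source defining n global variables and n functions."""
    lines = []
    for i in range(n):
        lines.append(f"int var{i} = {i};")
        lines.append(f"int fun{i}(void) {{ return {i}; }}")
    return "\n".join(lines) + "\n"

def library2_sources(n, max_refs_per_file=5000):
    """C sources for the proxy library, split to cap references per file."""
    per_file = max_refs_per_file // 2  # each proxy references one function and one variable
    files = {}
    for part, indices in enumerate(chunks(n, per_file)):
        lines = []
        for i in indices:
            lines.append(f"extern int var{i};")
            lines.append(f"extern int fun{i}(void);")
            # A correct GOT makes proxy{i}() return 2 * i.
            lines.append(f"int proxy{i}(void) {{ return fun{i}() + var{i}; }}")
        files[f"lib2_part{part}.c"] = "\n".join(lines) + "\n"
    return files

lib1 = library1_source(3)
lib2 = library2_sources(12000)
print(lib1)
print(sorted(lib2))
```

A test program would then call every `proxy{i}()` and compare the result against `2 * i`, which is the same kind of sanity check as described above.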

I can't break the toolchain with this, no matter how hard I try. The only test case successfully breaking it is the libxul itself, and it's starting to look like the dynamic symbol count is estimated incorrectly and we end up with a dynamic symbol table missing a lot of entries.

So, in the end the GOT may have been a red herring, but stuff was learned which is always nice. Stay tuned.
 

Elf

"So, in the end the GOT may have been a red herring, but stuff was learned which is always nice. Stay tuned."
Personally I have been enjoying this exercise because it has been "learn something new every day" about ELF (no pun intended). It is to the point where I am quite curious now and want to sit down with a reference text to understand what all the tables are for and how code relocation happens.
 

Jacques

New member
Dec 21, 2019
18
2
3
Somerset, United Kingdom
This all sounds awesome, much respect! It's straight over my head and sounds a bit like somebody dropped me halfway into the 68k assembly coding tutorial by Photon / Scoopex!
 

onre

Spent a week skiing in Lapland.

On Saturday I finally took the Octane home and for the first time I'm able to use the workstation locally instead of over ssh + X11 forwarding. I did compile a couple of things - SDL 1.x latest and Dosbox latest. Oddly enough it seems to work fine over forwarding, but on Xsgi key events end up producing completely wrong results. For example, f outputs y, etc. Have to look deeper into this.
 

onre

Also, my quick hack of Python 3.8.1 seems to work well enough to run the SCons build system. I did also get CMake 3.7.2 to a stage where the bootstrap is done and I'm supposed to run make now. Once I get home and boot the system up, I'll do exactly that.
 

onre

Got CMake 3.7.2 going. Luckily I had got libuv into a somewhat working state last summer, so now I got to backport those pieces into CMake's bundled older libuv. If anyone's interested, I can post the WIP patch here.

Also, been working on all kinds of things with certain themes for later release. I intend to keep an air of mystery around this stuff until the time is right. Because I can.
 

onre

Got CMake 3.14.6 going, too, thanks to a patch by mormyrid and some further patching by yours truly. Also, I've been itching to play Diablo for a bit over 20 years now. Looks like I can finally get to it.



Regarding game code, the most common problems are endianness and alignment. The modern-day source ports usually try to take these things into account at least to some degree, making porting easier.
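A tiny illustration of both problems, using Python's struct module rather than actual game code. The byte values are made up; the point is only that data written by a little-endian PC has to be decoded explicitly on a big-endian machine such as an SGI.

```python
import struct

# 0x12345678 as a little-endian PC would have written it to a data file.
raw = bytes([0x78, 0x56, 0x34, 0x12])

as_little = struct.unpack("<I", raw)[0]  # decoded with the file's byte order
as_big = struct.unpack(">I", raw)[0]     # what a naive big-endian read would see

print(hex(as_little))  # 0x12345678
print(hex(as_big))     # 0x78563412

# Alignment: the same value stored at an odd offset inside a packed record.
# Decoding byte by byte works anywhere, whereas C code that casts a pointer
# into the buffer and dereferences it can fault on MIPS.
packed = b"\x01" + raw
value = struct.unpack_from("<I", packed, 1)[0]
print(hex(value))      # 0x12345678
```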

Note - this is not the air of mystery thing I was referring to earlier. Or is it?
 

onre

I've had GTK+ 2.24.32 built for quite some time. However, I've not been particularly keen on porting any GTK+ 2 applications because it has had a really show-stopping bug. Anything involving a file chooser has just hung. Today I managed to finally fix that bug.

A bit of background: apparently GDB on IRIX hasn't ever supported debugging threads. Other OSes have thread debugging support in the code, but there's none for IRIX. The native debugger dbx can do threads, but our 9.2.0 toolchain can't currently produce debugging information understood by dbx. In the longer term, the goal is to create debugging information that dbx can understand and get full thread debugging capacity. However, for now I figured I'll just try to get this working somehow.

So, off we go. Create a minimal application which displays a window with a button. Pushing that button should open a file dialog. Compile said program, run it and have it not display anything. Try gdb, find out a thread is created and gdb hangs. Try dbx, find out multiple threads are created over time and nothing happens - and because symbol display is not possible, figure out nothing really. Feel frustrated. Try a bit of quick bisecting with different glib and gtk versions to no avail.

Sleep. Feel refreshed and confident. Abandon previous plan and decide to printf() the hell out of it. Keep on printfing until you finally find out what's going on; when you create a button that opens a file chooser, you actually instantiate the dialog itself, too. As a part of its initialization the dialog instantiates a filesystem backend - in our case it is of type "unix". As a part of this filesystem backend there is a volume manager component coming from the GIO module of the GLib library. Turns out the hang happens in the initialization of that component. Let's investigate.

The volume manager is, unsurprisingly enough, aware of mount points and the device names associated with those. Apparently some Linux of the old days has had a /dev/root which has been some sort of alias for another device. The system-specific code branch in the volume manager that we end up in has special-case handling for this: if the mounted device name is /dev/root, the code tries to resolve that as an alias for a device, and it simply can't handle the device name actually being /dev/root but instead just hangs. #ifdef out the special case for IRIX, run the test app and feel immense satisfaction when a file chooser dialog opens.

 

callahan

Member
Jul 20, 2019
33
22
8
This is awesome. Way above my capabilities, but enjoyable reads and great to see such amazing progress.

Kudos!
 
