During the past couple of days, a lot of fun has been had.
Disclaimers: this post contains a lot of guessing and assuming, which I won't flag separately each time, so if you know better, please do tell. Educated guesses are welcome, too! Also, most of the C++ terminology is guessed, as my knowledge of the language is pretty much at hello-world level. We could also say that I've never had to dive this deep before.
Firefox 17 consists mostly of a huge library called libxul.so, which is linked against just about every dependency on the system. I had trouble with the library loading - a test executable can either link against it or dlopen() it, access all the symbols in it and everything works fine. However, when I try this with the actual browser binary, no cigar but instead a core dump.
gdb can't see anything; we just end up at some memory address without anything resembling a backtrace. All rld.debug can tell me is that the last thing to happen is the libxul.so initialization code being run. So, let's get down to business and start debugging for real.
First, to give an idea about the sheer size of the thing, here are some selected bits of readelf output for your entertainment.
Code:
Start of program headers: 52 (bytes into file)
Start of section headers: 529194088 (bytes into file)
...
Relocation section '.rel.dyn' at offset 0x32124 contains 312286 entries:
Linking the thing takes somewhere around 3-4 minutes on the machine I'm using.
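To put those readelf numbers in perspective, here is a quick back-of-the-envelope calculation. It assumes a 32-bit ELF file (the program headers starting at byte 52 fits the size of an ELF32 header) and therefore 8-byte Elf32_Rel entries in .rel.dyn. The variable names are made up for this sketch.

```python
# Numbers copied from the readelf output above.
SECTION_HEADERS_OFFSET = 529194088   # bytes into the file
REL_DYN_ENTRIES = 312286

# Assumption: ELF32, so each .rel.dyn entry (Elf32_Rel) takes 8 bytes.
ELF32_REL_SIZE = 8

file_size_mib = SECTION_HEADERS_OFFSET / (1024 * 1024)
rel_dyn_bytes = REL_DYN_ENTRIES * ELF32_REL_SIZE

print(f"section headers start about {file_size_mib:.1f} MiB into the file")
print(f".rel.dyn occupies {rel_dyn_bytes} bytes")
```

So the file is a bit over half a gigabyte (debug info presumably included), and the runtime linker has a few megabytes of dynamic relocation records to chew through.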
There was an initial educated guess from @hammy that this was probably a class constructor somewhere acting up. Single-stepping through the firefox binary revealed this to be rather likely, as the crash happens right after the dlopen() call. The problem is that the backtrace looks like this:
Code:
Program received signal SIGSEGV, Segmentation fault.
0x61c2e980 in ?? ()
(gdb) bt
#0 0x61c2e980 in ?? ()
#1 0x060fea44 in ?? ()
warning: GDB can't find the start of the function at 0x60fea42.
No symbols, no nothing. I just dismissed this as a corrupted stack or something similar. Turns out it wasn't that - read on.
At this point it was pretty much just shots in the dark as I had no real idea on where to look next. First idea was to confirm that the C++ constructors are actually executed - so let's add a debug print in one of them. Try it out with the test executable and, sure enough, they are executed. What's up with that? Try adding some thread code to the test executable and execute it before dlopen() just to make sure it's not some init order thing. No, it's not, everything works fine.
Having spent a considerable amount of time looking at gdb output with the test executable, I became rather familiar with the address range of functions in the library. At that moment a chain of flashbulbs was lit in some dark chamber of my brain - the crash location when dlopening the library from firefox is within the same range, but there's no symbol associated with it. I'm not sure how exactly I came to this additional conclusion, but I also noticed that with the test binary the library gets loaded at the exact same addresses where it "naturally" lives, as in, its link-time addresses, where the initial GOT values point etc. Fire up gdb, load the library in it, type x/20i 0x61c2e980 and be greeted with this:
Code:
0x61c2e980 <__do_global_ctors_aux>: addiu sp,sp,-32
0x61c2e984 <__do_global_ctors_aux+4>: sd gp,16(sp)
0x61c2e988 <__do_global_ctors_aux+8>: lui gp,0x3d
0x61c2e98c <__do_global_ctors_aux+12>: addu gp,gp,t9
0x61c2e990 <__do_global_ctors_aux+16>: addiu gp,gp,-3444
0x61c2e994 <__do_global_ctors_aux+20>: lw v1,13608(gp)
0x61c2e998 <__do_global_ctors_aux+24>: li v0,-1
0x61c2e99c <__do_global_ctors_aux+28>: sd ra,24(sp)
0x61c2e9a0 <__do_global_ctors_aux+32>: lw t9,476(v1)
0x61c2e9a4 <__do_global_ctors_aux+36>: sd s1,8(sp)
0x61c2e9a8 <__do_global_ctors_aux+40>: beq t9,v0,0x61c2e9cc <__do_global_ctors_aux+76>
...
Ta-fucking-duh! The __do_global_ctors_aux function is a part of GCC, found in libgcc/crtstuff.c. During the GCC build process, this file is compiled with different preprocessor flags to produce object files that get linked into the final executable or library. They are mostly responsible for very early startup and very late cleanup before and after the actual code - a good example of such a task would be running the constructors and destructors of global objects.
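As a rough mental model, this helper walks a table of constructor pointers backwards until it hits a -1 sentinel, which is presumably what the `li v0,-1` and `beq t9,v0` pair in the disassembly is checking. The sketch below simulates that loop in Python; the table layout and all names are simplified assumptions, not the actual crtstuff.c code.

```python
SENTINEL = -1  # marks the start of the constructor table


def run_global_ctors(ctor_table):
    """Call entries from the end of the table backwards until the sentinel."""
    called = []
    index = len(ctor_table) - 1
    while ctor_table[index] != SENTINEL:
        ctor = ctor_table[index]
        called.append(ctor())
        index -= 1
    return called


# Hypothetical constructors standing in for C++ static initializers.
def init_a():
    return "a"


def init_b():
    return "b"


table = [SENTINEL, init_a, init_b]
order = run_global_ctors(table)
print(order)
```

The point for this story is that every entry in such a table is an absolute address, so the table, and the pointer used to reach it, has to be right for the place where the library actually got loaded.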
So, let's try the same disassembly with our crashed firefox in gdb.
Code:
(gdb) x/20i 0x61c2e980
=> 0x61c2e980: Cannot access memory at address 0x61c2e980
Aha, the library must've been relocated by the runtime linker, then. Sure enough:
Code:
(gdb) i sh
From To Syms Read Shared Object Library
...
0x04746e80 0x060fe9f0 No /usr/people/esp/src/ff17build.sgug/dist/bin/libxul.so
...
Nowhere near the 0x60000000-ish range where it was built. Very first idea: make sure it is compiled as position-independent code - yes it is. So, does this mean that moving the library somehow breaks something and thus we end up trying to jump into darkness? Let's compile the test binary forcing its data section address somewhere so that it's guaranteed to overlap with libxul, thus forcing rld to move it. Ha! The failure mode is exactly the same.
But... given the amount of stuff compiled with this toolchain working just fine, this can't affect everything, can it? Just to make sure and maintain sanity, let's write a very small test library whose constructor sets a property to a value and the only accessible function prints that value, and a small program using that library. Compile, run. Works. Compile with special flags to force the library to be moved by the runtime linker, confirm this with gdb, run. Works.
So what the actual fuck is up with libxul? There's only one way to find out. Something somewhere makes that jump to the wrong place. Let's set a breakpoint at the constructor function address and see what calls it.
Code:
(gdb) break *0x61c2e980
Breakpoint 1 at 0x61c2e980
(gdb) run
Starting program: /usr/people/esp/src/a.out
Breakpoint 1, 0x61c2e980 in __do_global_ctors_aux () from ff17build.sgug/dist/lib/libxul.so
(gdb) bt
#0 0x61c2e980 in __do_global_ctors_aux () from ff17build.sgug/dist/lib/libxul.so
#1 0x61c2ea44 in ?? () from ff17build.sgug/dist/lib/libxul.so
The caller has no symbol information? Interesting. However, it does have an address. Let's revisit what readelf told us.
Code:
[11] .gcc_init PROGBITS 61c2e9f0 1c4e9f0 000064 00 AX 0 0 1
That address is in that section. Let's just disassemble the damn init code and see what it does.
Code:
(gdb) x/20i 0x61c2e9f0
0x61c2e9f0: daddiu sp,sp,-16
0x61c2e9f4: sd ra,0(sp)
0x61c2e9f8: sd gp,8(sp)
0x61c2e9fc: bal 0x61c2ea04
0x61c2ea00: nop
0x61c2ea04: move v0,gp
0x61c2ea08: lui gp,0x3c
0x61c2ea0c: addiu gp,gp,-3556
0x61c2ea10: addu gp,gp,ra
0x61c2ea14: lw t9,-32716(gp)
0x61c2ea18: jalr t9
0x61c2ea1c: nop
0x61c2ea20: bal 0x61c2ea28
0x61c2ea24: nop
0x61c2ea28: move v0,gp
0x61c2ea2c: lui gp,0x3d
0x61c2ea30: addiu gp,gp,-3612
0x61c2ea34: addu gp,gp,ra
0x61c2ea38: lw t9,13612(gp)
0x61c2ea3c: jalr t9
0x61c2ea40: nop
Store stuff on stack, pointer math, jump, more pointer math, jump. We end up in the constructor thing from the latter jump.
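The "pointer math" can be replayed by hand. The sketch below assumes the usual MIPS semantics: bal and jalr store the address of the instruction after their delay slot (PC + 8) in ra, lui loads the upper 16 bits, and addiu adds a sign-extended immediate. The helper name is made up for illustration.

```python
MASK32 = 0xFFFFFFFF


def gp_after_sequence(bal_address, lui_imm, addiu_imm):
    """Replay 'bal; lui gp,imm; addiu gp,gp,imm; addu gp,gp,ra'."""
    ra = bal_address + 8              # bal links past its delay slot
    gp = (lui_imm << 16) + addiu_imm  # lui followed by addiu
    return (gp + ra) & MASK32         # addu gp,gp,ra


# First call: bal at 0x61c2e9fc, lui gp,0x3c, addiu gp,gp,-3556
gp_first = gp_after_sequence(0x61C2E9FC, 0x3C, -3556)
# Second call: bal at 0x61c2ea20, lui gp,0x3d, addiu gp,gp,-3612
gp_second = gp_after_sequence(0x61C2EA20, 0x3D, -3612)

# jalr also links past its delay slot, so the second call returns here:
return_address = 0x61C2EA3C + 8

print(hex(gp_first), hex(gp_second), hex(return_address))
# prints: 0x61fedc20 0x61ffdc0c 0x61c2ea44
```

The return address matches frame #1 of the backtrace, which confirms that the constructor code is reached from the second jalr. Note also that the two calls set up two different gp values.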
So, $gp is the global pointer. Let's take a look at where it's pointing when we're just about to jump.
Code:
(gdb) p/x $gp
$1 = 0x61ffdc0c
So what's that address? Let's start by figuring out which section this is, once again from the ELF section listing.
Code:
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
...
[20] .got PROGBITS 61fe5c30 2005c30 03d17c 04 WAp 0 0 16
Global offset table? Right. On MIPS that's accessed by indexed memory operations, such as the lw instruction in our code at 0x61c2ea38. So, this means that we should be able to see this also in readelf output, as it prints out a nicely indexed listing of GOT contents. Let's take a look at the entry at offset 13612 then.
Code:
61ff114c 13612(gp) 60849be0
Errm what? This doesn't look like our address. Also, it points to something completely different:
Code:
0x60849be0 <nsDOMEventTargetHelper::GetListenerManager(bool)>: addiu sp,sp,-48
What the actual flying fuck is going on? Hint: look at $gp value and the address on the readelf output row. I have no formal education in mathematics, but according to my understanding x + y > x is universally true. Except not here. $gp + 13612 can't possibly be less than $gp, can it? Let's take another look at what readelf had to say about the global offset table.
Code:
Primary GOT:
Canonical gp value: 61fedc20
That's not what our $gp is pointing to, not at all... Apply brain at full power and figure out that the maximum size of a GOT must be around 16 k entries, as one entry takes four bytes and indexed memory access can only reach ±32 kilobytes. However, let's revisit what readelf told us about the GOT section. The size of the GOT section is 0x3D17C bytes, or in decimal about 250 kilobytes. But... the pointer access by the code from the init section sure does look like it's accessing a GOT-like memory area, just using a different base address, 0x61ffdc0c, for it! Let's poke around a bit.
Code:
(gdb) x/i *(0x61ffdc0c+(13612))
0x61c2e980 <__do_global_ctors_aux>: addiu sp,sp,-32
(gdb) x/i *(0x61ffdc0c+(8192))
0x61a92fcc <js::Debugger::ScriptQuery::consider(JSScript*, js::GlobalObject*, JS::AutoScriptVector*)>: addiu sp,sp,-112
(gdb) x/i *(0x61ffdc0c+(-8192))
0x61741fec <std::_Rb_tree<unsigned long long, std::pair<unsigned long long const, mozilla::layers::LayerTreeState>, std::_Select1st<std::pair<unsigned long long const, mozilla::layers::LayerTreeState> >, std::less<unsigned long long>, std::allocator<std::pair<unsigned long long const, mozilla::layers::LayerTreeState> > >::_M_get_insert_hint_unique_pos(std::_Rb_tree_const_iterator<std::pair<unsigned long long const, mozilla::layers::LayerTreeState> >, unsigned long long const&)>: addiu sp,sp,-80
(gdb) x/i *(0x61ffdc0c+(-16384))
0x61652c18 <evhttp_remove_header>: addiu sp,sp,-48
Right, so all of them look like function entry points, so it looks like our toolchain has built another GOT right behind the usual GOT as that got (heh) full. After some Internet digging I found this particularly helpful post from LLVM folks: reviews.llvm.org
I really wish that MIPS wiki link was still available somewhere.
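The numbers collected so far can be put side by side. This is only arithmetic on the values quoted from readelf and gdb above; the one assumption is that every gp value can reach at most a signed 16-bit byte offset in either direction. Variable names are made up for this sketch.

```python
GOT_ENTRY_SIZE = 4            # bytes per entry, as stated above
GOT_START = 0x61FE5C30        # .got address from the section headers
GOT_SIZE = 0x3D17C            # .got size from the section headers
CANONICAL_GP = 0x61FEDC20     # "Canonical gp value" from readelf
RUNTIME_GP = 0x61FFDC0C       # $gp printed by gdb before the second jalr
OFFSET = 13612                # offset used by the lw in the init code

# A signed 16-bit offset reaches -32768..32767 bytes around gp.
entries_per_gp = (1 << 16) // GOT_ENTRY_SIZE
total_entries = GOT_SIZE // GOT_ENTRY_SIZE
# Ceiling division: minimum number of gp windows needed to cover .got.
windows_needed = -(-total_entries // entries_per_gp)

readelf_slot = CANONICAL_GP + OFFSET   # slot readelf lists as 13612(gp)
runtime_slot = RUNTIME_GP + OFFSET     # slot the init code actually reads
in_got = GOT_START <= runtime_slot < GOT_START + GOT_SIZE

print(entries_per_gp, total_entries, windows_needed)
# prints: 16384 62559 4
print(hex(readelf_slot), hex(runtime_slot), in_got)
# prints: 0x61ff114c 0x62001138 True
print(hex(RUNTIME_GP - CANONICAL_GP))
# prints: 0xffec
```

So the slot the init code reads still lies inside the .got section, just not within reach of the canonical gp, and a section of this size would need at least four such windows.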
Let's conclude - I have no idea whether this is the right conclusion, but it's a start. Apparently this other GOT and its existence is somehow not understood by the runtime linker and thus we only get relocations for some of the symbols - namely, those that fit in the first, actual GOT.
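One more sanity check supports this. The sketch below compares the two backtraces, under the assumption that frame #1 is the same return address in the init code in both runs, once at its link-time address and once relocated. Variable names are made up for this sketch.

```python
# Addresses taken from the gdb sessions above.
LINKED_RETURN = 0x61C2EA44   # frame #1 in the working test executable
CRASH_RETURN = 0x060FEA44    # frame #1 in the crashing firefox
LINKED_CTORS = 0x61C2E980    # __do_global_ctors_aux at its link-time address
CRASH_PC = 0x61C2E980        # frame #0 in the crashing firefox
MAPPED_FROM, MAPPED_TO = 0x04746E80, 0x060FE9F0  # libxul.so range from "i sh"

# Assumption: both frame #1 values are the same instruction, just moved.
load_delta = CRASH_RETURN - LINKED_RETURN
relocated_ctors = LINKED_CTORS + load_delta

print(hex(-load_delta))
# prints: 0x5bb30000
print(hex(relocated_ctors), MAPPED_FROM <= relocated_ctors < MAPPED_TO)
# prints: 0x60fe980 True
print(CRASH_PC == LINKED_CTORS)
# prints: True
```

Under that assumption the constructor helper should have been at 0x060fe980 in the crashing process, inside the range gdb reports for libxul.so. The jump went to the link-time address 0x61c2e980 instead, which is what you would expect from a GOT entry that was never adjusted for the move.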
So, what to do about it? That's what I intend to find out during the weekend. I also have to come up with test cases for this with the MIPSpro toolchain to see if it has some way of circumventing this. It isn't impossible that some bit somewhere in the GOT is different and this tells the runtime linker to process additional GOT areas, or maybe we're supposed to have multiple areas marked as GOTPLT in the DYNAMIC section instead of just one (and make sure they're 64 k max), or possibly something else entirely.
Sometimes porting is fixing a mmap call there and adding a putenv there. Sometimes it's this.
TL;DR: figuring stuff out is fun.