Advanced Namespace Tools blog

24 January 2018

A dive into the memory allocation pool

I wake up in the morning and see in my “faces” program that I have a new email message. I click on it and the message opens, but I see an error print tear through my subrio. After reading the message, I exit the subrio and find this:

   mem user overflow
   pool sbrkmem block 55da98
   hdr 0a110c09 00003f40 0020ddb6 00209358 faf0f1fe deadfa11
   tail 00001bde 252b4a04 00001393 5a687d4d cafebabe ffffffff | ef2800be 00003f40
   user data d0 24 00 00  d8 6f 68 5a | de 1b 00 00  04 4a 2b 25
   panic: pool panic
   upas/fs: user: glenda; note: sys: trap: fault read addr=0x0 pc=0x220ca6
   fs 64945: suicide: sys: trap: fault read addr=0x0 pc=0x220ca6

We try a check with ps -a |grep Broke reveals fs upas/fs plumbing in the Broken state. We attach acid and check lstk() and find - not much:

   access(mode=0x41f248)+0x50 /sys/src/libc/9sys/access.c:27
          db=0x7ffffeffe8a8
   0x7ffffeffe8a8 ?file?:0

So we have a mystery to solve: what exactly does the “mem user overflow” error mean, and what code triggered it?

Where does “mem user overflow” come from?

The first step is always to dig into /sys/src and see what we can find there. I used grep to search the kernel files in /sys/src/9/port and /sys/src/9/pc64 for “overflow” but no luck there. Probably something in the libc then? A search in /sys/src/libc found what I was looking for:

   term% cd /sys/src/libc
   term% grep overflow */*
   [...]
   port/pool.c: * marked with a magic value, which is checked to detect user overflow.
   port/pool.c:  printblock(p, b, "mem user overflow (magic0)");
   port/pool.c:  printblock(p, b, "mem user overflow");
   port/pool.c:  panicblock(p, b, "mem user overflow");
   [...]

Bingo. Looking at the code in pool.c was not immediately revelatory, so I performed the next standard step in debugging: call for help in irc. This produced a lot of output, much of which flattered me by presuming that I understood things better than I actually did.

At this point, a small digression is in order. It can be hard, when given obviously informative answers which are somewhat over your head, to not just say “ok, I’ll check that out” and then abandon the whole thing and hope that nobody asks about the results of your investigation. It can also be hard not to say “I have no understanding of this; you figure it out, if you care about it.”

Neither of these is the right thing to do. Treat these moments as golden opportunities for in-depth learning. If you are willing to use your brain and engage with the code and make your best effort to figure things out, people will almost certainly provide you with pointers to the pieces you are missing. You have to demonstrate a willingness to make the effort required, to show that you are putting in the effort to learn so that next time, you will be able to figure it out on your own. It is a case of “teach a man to fish” - but, to change aphorisms, you do not want to be the horse that is led to water but doesn’t bother to drink.

So, I was informed that the error I saw indicated that memory had been corrupted. The code in pool.c was checking the validity of the allocated memory, and discovered an error.

The high-level overview

The goal of this debugging process, which I have not yet completed, is to find the code that corrupted memory. I was informed that this information can be obtained from the corrupted memory itself, because somewhere in the data structures created by the memory allocation process, the address of the allocating function is stored. What I need to do is:

understand the datastructures created by the memory allocation process
understand how to use acid to find the function address within those structures

What I have learned so far is that there is both a “pool header” and a “malloc header”. The malloc header is what has the function address I am looking for. To find it, I need to use the information in the pool header to determine its location.

Where I’m at now

I’ve been told there are useful functions in /sys/lib/acid/pool to help view these structures. A first attempt to use them met with an error, because /sys/src/libc/port/pool.amd64.acid did not exist. To create it:

   term% cd /sys/src/lib/port
   term% 8c -a pool.c >pool.amd64.acid

Now it worked to start acid with acid -l pool attached to the process. I can do things like:

   acid: checkblock(0x55da98)
   0x0055da98 size 16152 user has overwritten boundary

Well, we knew that already.

   acid: print(Block(0x55da98))
          alloc
          data 0x0055daa0
          datasize 16152
          magic  168889353
          size   16192
          0x005619d0
          magic0 0xbe
          datadiff      40
          magic1 0xef
          size   0x00003f40
          hdr    0x0055da98
   {}

I was told whopointsat() was a very useful function, but unfortunately

   acid: print(whopointsat(0x55da98))
   {}

A very helpful email link

Someone in irc provided a link to a very useful email from many years ago on 9fans: https://marc.info/?l=9fans&m=111558947828485&w=2

This link contains a good explanation of how to decode the block hdr:

0a110c09 - magic number for an allocated block

00000020 - size of entire block, header and tail included

000023b5 - pc of malloc for block

00001f3e - pc of realloc for block

0023ff24 - begin actual data space

caf0f1fe ef1500be - this is four bytes: be 00 15 ef. the be and ef

are magic numbers, the 00 15 is the amount of header

and tail in the block, not counting the two pc words.

since the block is 0x20, the data size is 0x20 - 0x15 - 2*4 = 3.

00000020 - size of entire block, header and tail included

So, applying this to the data i got, this tells me that the pc of malloc for my troublesome block is 0020ddb6. Let’s see what code lives there using acid:

(Digression on assumptions - I did this the hard way. Acid has, and I knew it had, a convenient src(address) function to print the source code that corresponds to a given program address. However, I didn’t try to use this function at this point. Why not? Because I assumed that the ?file? I saw in the uninformative lstk() output meant that due to the process forking, acid had “lost” the information about the originating file and source code. Because of this assumption, I skipped using src() which would have directed me right to the c source code, rather than having to work badwards as I did. I keep trying to teach myself not to make assumptions about things I don’t fully understand, but they keep creeping in - partly because things like this debugging process are already very time-intensive, and testing every assumption seems like it extends it even further. The thing is, inaccurate assumptions will often cost you more time in the long run than you can ever save with mental shortcuts. I retain the following material for this post because it isn’t wrong, even though there was an easier method available.)

   acid: asm(0x0020ddb6)
   erealloc+0xb 0x000000000020ddb6    ANDB   $0x28,AL
   erealloc+0x10 0x000000000020ddbb   JNE    erealloc+0x17(SB)
   erealloc+0x16 0x000000000020ddc1   ADDB   CL,0xffffff8b(AX)
   erealloc+0x18 0x000000000020ddc3   MOVL   p+0x0(FP),BP
   erealloc+0x1d 0x000000000020ddc8   ANDB   $0x30,AL
   erealloc+0x21 0x000000000020ddcc   ANDB   $0x8,AL
   erealloc+0x25 0x000000000020ddd0   JMP    emalloc+0x1c(SB)
   erealloc+0x2a 0x000000000020ddd5   MOVL   AX,p+0x0(FP)
   erealloc+0x2f 0x000000000020ddda   CMPL   p+0x0(FP),$0x0
   erealloc+0x34 0x000000000020dddf   JNE    erealloc+0x48(SB)
   erealloc+0x39 0x000000000020dde4   ADDB   CL,0x8930247c(BX)
   erealloc+0x3e 0x000000000020dde9   XORB   CL,0xe808247c(CX)
   erealloc+0x43 0x000000000020ddee   CALL   sysfatal(SB)
   erealloc+0x48 0x000000000020ddf3   LEAQ   p+0x0(FP),BP
   erealloc+0x4d 0x000000000020ddf8   CALL   getcallerpc(SB)
   erealloc+0x52 0x000000000020ddfd   MOVQ   AX,.safe+0x18(SP)
   erealloc+0x56 0x000000000020de01   SBBB   CL,0xffffff8b(AX)
   erealloc+0x57 0x000000000020de02   MOVQ   p+0x0(FP),BP
   erealloc+0x5b 0x000000000020de06   SUBB   CL,0xffffff8b(AX)
   erealloc+0x5f 0x000000000020de0a   ANDB   $0x18,AL
   erealloc+0x64 0x000000000020de0f   ANDB   $0x8,AL
   erealloc+0x69 0x000000000020de14   ADDB   AL,0x0(AX)
   erealloc+0x6e 0x000000000020de19   ANDB   $0x28,AL
   erealloc+0x73 0x000000000020de1e   ANDB   AL,BL

This looks very promising. Now we need to find out what code in upas generated this assembly. We have nice signposts to search for with the “sysfatal” and “getcallerpc” calls.

Searching upas for the culprit

So, grepping for those in upas/fs rapidly leads us to:

   void *
   erealloc(void *p, ulong n)
   {
          if(n == 0)
                 n = 1;
          p = realloc(p, n);
          if(!p)
                 sysfatal("realloc %lud: %r", n);
          setrealloctag(p, getcallerpc(&p));
          return p;
   }

This looks right on target. We need to dig deeper though, and find out what is invoking this. Let’s check the next location in our block hdr:

   acid: asm(0x00209358)
   imap4read+0x55 0x0000000000209358  ADDB   AL,0x0(AX)
   imap4read+0x5a 0x000000000020935d  MOVL   0x30(CX),SI
   imap4read+0x5e 0x0000000000209361  JOC    imap4read+0xa4(SB)
   imap4read+0x61 0x0000000000209364  MOVL   0x38(CX),BP
   imap4read+0x64 0x0000000000209367  MOVL   0x30(CX),SI
   imap4read+0x66 0x0000000000209369  XORB   AL,CL
   imap4read+0x69 0x000000000020936c  ADDB   $0x89,AL
   imap4read+0x6e 0x0000000000209371  CALL   erealloc(SB)
   imap4read+0x73 0x0000000000209376  MOVQ   imap+0x0(FP),CX
   imap4read+0x78 0x000000000020937b  MOVQ   AX,0x38(CX)
   imap4read+0x7d 0x0000000000209380  JPL    imap4read+0xaf(SB)
   imap4read+0x82 0x0000000000209385  JLE    imap4read+0xc8(SB)
   imap4read+0x87 0x000000000020938a  MOVL   $.string+0x496(SB),DI
   imap4read+0x8c 0x000000000020938f  MOVQ   DI,0x8(SP)
   imap4read+0x91 0x0000000000209394  CALL   imap4cmd(SB)
   imap4read+0x96 0x0000000000209399  MOVQ   imap+0x0(FP),BP
   imap4read+0x9b 0x000000000020939e  CALL   imap4resp(SB)
   imap4read+0x9e 0x00000000002093a1  <stdin>:11: (error) indir: Ȩ�

Now we are really getting somewhere. Again we grep for the CALL targets and hit paydirt in upas/fs/imap.c:

   static char*
   imap4read(Imap *imap, Mailbox *mb)
   {
          char *s;
          int i, n, c;
          Fetchi *f;
          Message *m, **ll;

          imap4cmd(imap, "status %Z (messages uidvalidity)", imap->mbox);
          if(!isokay(s = imap4resp(imap)))
                 return s;

          imap->nuid = 0;
          imap->muid = imap->nmsg;
          imap->f = erealloc(imap->f, imap->nmsg*sizeof imap->f[0]);
          if(imap->nmsg > 0){
                 imap4cmd(imap, "uid fetch 1:* (uid rfc822.size internaldate)");
                 if(!isokay(s = imap4resp(imap)))
                        return s;
          }

Even with only the vaguest understanding of assembly, we can see we have found what we are looking for. We have found the CALLs to erealloc, imap4cmd, and imap4resp that we can see in the asm. Checking the manpage for the malloc(2) functions notes that “User errors can corrupt the storage arena.” Evidently some bad data wormed its way into the erealloc call in imap4read. I will return to irc with my findings and perhaps will be given some more hints on how to proceed from this knowledge.

Follow-up: I wasn’t quite accurate in saying that bad data was in the erealloc call - the bad data was apparently a layer lower, within the imap structure, and was fixed by this commit which checks for imap->nuid being greater than imap->muid and adjusts the value accordingly.