Advanced Namespace Tools blog
14 December 2016
Hubfs multi-client debugging
While using hubfs to keep a persistent IRC session available for multiple clients, I experienced some intermittent failures. Periodically, the client I was not actively using would stop receiving messages, although it could still send data successfully. I am working on debugging this issue.
The first step is recreating the failure. The test I am using is to create a hubfs on server A, which clients B and C import. A shell running on server A writes an increasing sequence of integers, one per second, which is displayed on all three machines. At unpredictable intervals, the output on one or more of the clients stops updating.
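For reference, the writer half of the test is trivial. A minimal sketch in Plan 9 C, where the program name and the hub path in the comment are my own illustrative choices:

#include <u.h>
#include <libc.h>

/* ticker: print an increasing integer once per second.
   Run on server A with output redirected into a hub file,
   e.g. ticker > /n/testhub/hub0 (path is hypothetical). */
void
main(void)
{
	int i;

	for(i = 1; ; i++){
		print("%d\n", i);
		sleep(1000);	/* milliseconds */
	}
}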
A first attempt at debugging was to attach acid to the hubfs process on the server and trace execution as messages were sent. This ran into a problem, because acid produced the following error:
<stdin>:43: (error) windir: can't write address 0x72c1: bad process or channel control request
Oddly, this error only seems to occur when things are working correctly; when the bug manifests, acid does not produce it. This is probably related somehow, but understanding what is going on is beyond my current expertise.
My current plan is to create a debugging version of hubfs with copious output of exactly what it is doing as it sends and receives messages. Hubfs's predecessor, Iosrv, had extensive optional debugging output because of its more complex architecture.
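The mechanism will presumably be the usual conditional print idiom. A minimal sketch, assuming a hypothetical debugmode flag:

#include <u.h>
#include <libc.h>

int debugmode = 1;	/* imagine this set by a command line flag */

/* Plan 9 idiom: when debugmode is 0 the call is skipped entirely;
   printing to fd 2 keeps debug chatter out of the served data */
#define DPRINT	if(debugmode)fprint

void
main(void)
{
	DPRINT(2, "--msgsend-- starting debug-instrumented run\n");
	exits(nil);
}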
Debugging prints added, bug identified
I added extensive debugging prints to the msgsend() function. I ran into a weird issue trying to use the debug hubfs in acme, which I will have to come back to later. It took a few attempts to capture an instance of the bug, but I did. The problem is in the handling of the queue of unanswered 9p read requests. Hubfs uses ring-buffer style data structures, and the logic for handling “wrap around” is tricky. The bug occurs when multiple readers' requests are stored at the top end of the queue: wraparound happens, and then their requests are forgotten. The offending variables are the qrnum and qrans counters. Here is a segment of the logging showing the bug:
--msgsend 20138--
in read loop, counter == 771, h->qrans == 771, h->qrnum == 775
req found, r = h->qreqs[771], mq = r->fid->aux = e1f68
reader has read all data, mq->nxt 20a81 == h->inbuckp 20a81
in read loop, counter == 772, h->qrans == 771, h->qrnum == 775
req found, r = h->qreqs[772], mq = r->fid->aux = ef668
reader has read all data, mq->nxt 20a81 == h->inbuckp 20a81
in read loop, counter == 773, h->qrans == 771, h->qrnum == 775
req found, r = h->qreqs[773], mq = r->fid->aux = e9348
reader has read all data, mq->nxt 20a81 == h->inbuckp 20a81
in read loop, counter == 774, h->qrans == 771, h->qrnum == 775
req found, r = h->qreqs[774], mq = r->fid->aux = e68e8
reader has read all data, mq->nxt 20a81 == h->inbuckp 20a81
in read loop, counter == 775, h->qrans == 771, h->qrnum == 775
h->rstatus[0] !=WAIT
--msgsend 20138--
in read loop, counter == 1, h->qrans == 1, h->qrnum == 1
req found, r = h->qreqs[1], mq = r->fid->aux = e4408
reader has read all data, mq->nxt 20a81 == h->inbuckp 20a81
--msgsend 20138--
in read loop, counter == 1, h->qrans == 1, h->qrnum == 1
req found, r = h->qreqs[1], mq = r->fid->aux = e4408
about to respond, h->rstatus[1] = DONE
As you can see, in this test with 5 simultaneous readers, 4 of them were abandoned, with their pending requests stored in slots 771-774. (I think there is also a small and irrelevant bug involving slot 775.)
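To make the failure mode concrete, here is a toy model of the queue walk, not actual hubfs source: the names qreqs, qrnum, and qrans mirror the log, while everything else (MAXQ, the int fids, the answering logic) is simplified for illustration.

#include <u.h>
#include <libc.h>

enum{ MAXQ = 776 };	/* illustrative queue size */

int qreqs[MAXQ];	/* nonzero = a reader's request parked in this slot */
int qrnum;		/* slot where the next incoming request is stored */
int qrans;		/* slot where answering left off last time */

void
queuereq(int fid)
{
	qreqs[qrnum++] = fid;
	if(qrnum >= MAXQ)
		qrnum = 0;	/* wrap around to the start of the ring */
}

void
msgsend(void)
{
	int i;

	/* buggy walk: if qrnum has wrapped below qrans, the loop
	   condition is false immediately and the pending high
	   slots are never visited */
	for(i = qrans; i < qrnum; i++)
		if(qreqs[i] != 0){
			print("answering fid %d in slot %d\n", qreqs[i], i);
			qreqs[i] = 0;
		}
	qrans = qrnum;	/* and here they are forgotten for good */
}

void
main(void)
{
	int fid;

	qrans = qrnum = 771;
	for(fid = 1; fid <= 5; fid++)
		queuereq(fid);	/* fills slots 771-775, qrnum wraps to 0 */
	msgsend();		/* prints nothing: 771 < 0 is false */
	print("qrans == %d, but slots 771-775 still hold requests\n", qrans);
	exits(nil);
}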
I can think of some quick and ugly ways to fix this, but I'd like to work out a good solution with clear logic. Probably what I should do is create a separate counter for unanswered requests, so that if wraparound happens while that counter is greater than zero, msgsend will know to go back and take care of them.
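Reusing the declarations from the toy model above, and assuming queuereq() also increments it, a hypothetical qunanswered counter would let msgsend sweep the stranded tail of the ring:

int qunanswered;	/* bumped in queuereq, decremented when a slot is answered */

void
answerslot(int i)
{
	if(qreqs[i] != 0){
		print("answering fid %d in slot %d\n", qreqs[i], i);
		qreqs[i] = 0;
		qunanswered--;
	}
}

void
msgsend(void)
{
	int i;

	/* wraparound with work still outstanding: finish the
	   stranded high slots before starting over at 0 */
	if(qrnum < qrans && qunanswered > 0){
		for(i = qrans; i < MAXQ; i++)
			answerslot(i);
		qrans = 0;
	}
	for(i = qrans; i < qrnum; i++)
		answerslot(i);
	qrans = qrnum;
}

In real hubfs, answering a read also depends on whether new data has arrived for that reader's mq, so this sketch captures only the counter logic, not the data flow.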