Advanced Namespace Tools blog
5 January 2017
Weird Behavior of CFS ctl file and 9p requests for it
Background and Initial Symptoms
The Cache File Server has an option to collect statistics, which are presented via a file /cfsctl at the root of the tree. While trying to read these stats, I experienced buggy behavior:
- The data did not seem to be updating properly on successive reads
- The data display started showing corruption, with fragments of new data appearing at the end of the output
- Eventually, attempts to read from the file would produce an EOF error with no data output at all
I took a look at the code and added some diagnostic prints to the cfsctl sections of the genstats and rread functions in cfs.c to investigate what was going on. Here is an example of the resulting debug output; each cycle beginning with a “statlen” line comes from a successive cat of the file after the data has changed:
statlen 1054 = p 4df8e - statbuf 4db70
cnt 8192 > statlen 1054 - off 0 setting c.rhdr.count to 1054
cnt 8192 > statlen 1054 - off 1054 setting c.rhdr.count to 0
statlen 1055 = p 4df8f - statbuf 4db70
cnt 7138 > statlen 1055 - off 1054 setting c.rhdr.count to 1
cnt 8192 > statlen 1055 - off 1055 setting c.rhdr.count to 0
statlen 1057 = p 4df91 - statbuf 4db70
cnt 7137 > statlen 1057 - off 1055 setting c.rhdr.count to 2
cnt 8192 > statlen 1057 - off 1057 setting c.rhdr.count to 0
statlen 1281 = p 4e071 - statbuf 4db70
cnt 7135 > statlen 1281 - off 1057 setting c.rhdr.count to 224
cnt 8192 > statlen 1281 - off 1281 setting c.rhdr.count to 0
statlen 1275 = p 4e06b - statbuf 4db70
cnt 6911 > statlen 1275 - off 1281 setting c.rhdr.count to -6
(int)c.rhdr.count -6 < 0, sending eof
Contrary to my expectations, statlen seemed to be behaving correctly, but the count and offset taken from the 9p request structure c.thdr (c.rhdr is the reply) looked strange to me. Why was the request corresponding to each new cat operation showing a steadily shrinking count and increasing offset?
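For context on those debug lines, the “statlen” print corresponds to a diagnostic at the end of genstats(), where the stats text is regenerated into the static statbuf and its length computed from the fill pointer. The following is a reconstruction from the debug output for illustration, not the actual cfs.c source; the single seprint stands in for the real sequence of formatting calls:

#include <u.h>
#include <libc.h>

static char statbuf[2048];	/* static stats buffer, 2048 bytes as noted below */
static int statlen;		/* bytes of valid stats text currently in statbuf */

static void
genstats(void)
{
	char *p;

	p = statbuf;
	/* one seprint per statistic in the real code; one line stands in for them all */
	p = seprint(p, statbuf+sizeof statbuf, "%-12s %-12s\n", "Client", "Server");
	statlen = p - statbuf;
	/* diagnostic print that produces the "statlen ... = p ... - statbuf ..." lines */
	fprint(2, "statlen %d = p %p - statbuf %p\n", statlen, p, statbuf);
}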
A Partial Fix with Very Strange Results
The failure to show the data correctly was mostly due to the offset parameter, because the data is sent in the reply by:
c.rhdr.data = statbuf + off;
Because the offset doesn’t return to 0 on new reads, only the final few bytes of the changed buffer are shown - with the stale data appearing before them. I decided to hack in some manipulation of the offset, and then things got really weird. Here is my hacked-up debugging version of the code for the cfsctl file in rread:
off = c.thdr.offset;
cnt = c.thdr.count;
if(statson && ctltest(mf)){
	/* statsend is a variable I added to help control the output behavior */
	fprint(2, "rread cfsctl:\n");
	if(statsend == 0)
		off = 0;
	if(statsend == 1)
		off = statlen;
	statsend++;
	if(statsend == 2)
		statsend = 0;
	/*
	 * The idea is that we send all the data from 0 to statlen on the
	 * first request, then set the offset equal to the amount read
	 * previously and send no data on the second read - the rest of the
	 * logic is unchanged save for debugging prints.
	 */
	if(cnt > statlen-off){
		c.rhdr.count = statlen-off;
		fprint(2, "cnt %d > (statlen %d - off %lld) setting c.rhdr.count to %d\n", cnt, statlen, off, c.rhdr.count);
	}
	else{
		c.rhdr.count = cnt;
		fprint(2, "cnt %d <= statlen %d - off %lld, c.rhdr.count set to cnt %d\n", cnt, statlen, off, c.rhdr.count);
	}
	if((int)c.rhdr.count < 0){
		fprint(2, "(int)c.rhdr.count %d < 0, sendreply(eof)\n", (int)c.rhdr.count);
		sendreply("eof");
		return;
	}
	c.rhdr.data = statbuf + off;
	fprint(2, "c.rhdr.data %p set from statbuf %p + off %lld\n", c.rhdr.data, statbuf, off);
	sendreply(0);
	return;
}
This did fix the issue partially - I now received correctly updated stats data from every read of the file, with no corruption and no EOF errors. However, something even stranger (to me at least!) started happening:
All of the previous reads from the file were also printed, with the new data appended at the end
So, as the fs was used and I read from the ctl file repeatedly, the output would be like this:
Client Server
#calls Δ ms/call Δ #calls Δ ms/call Δ
1 1 0.750 0.750 1 1 0.743 0.743 Tversion
7 7 0.575 0.575 7 7 0.569 0.569 Tauth
7 7 0.893 0.893 7 7 0.888 0.888 Tattach
325 325 0.490 0.490 324 324 0.486 0.486 Twalk
147 147 0.474 0.474 146 146 0.470 0.470 Topen
764 764 0.239 0.239 18 18 3.644 3.644 Tread
16 16 5.071 5.071 16 16 5.065 5.065 Twrite
135 135 0.586 0.586 135 135 0.581 0.581 Tclunk
169 169 0.468 0.468 169 169 0.455 0.455 Tstat
11 11 ndirread
7 7 ndelegateread
0 0 ninsert
0 0 ndelete
5 5 nupdate
1716594 1716594 bytesread
4668 4668 byteswritten
0 0 bytesfromserver
3769 3769 bytesfromdirs
1712825 1712825 bytesfromcache
0 0 bytestocache
Client Server
#calls Δ ms/call Δ #calls Δ ms/call Δ
1 0 0.750 1 0 0.743 Tversion
7 0 0.575 7 0 0.569 Tauth
7 0 0.893 7 0 0.888 Tattach
327 2 0.490 0.407 325 1 0.487 0.799 Twalk
149 2 0.499 2.373 147 1 0.470 0.442 Topen
766 2 0.275 14.317 18 0 3.644 Tread
16 0 5.071 16 0 5.065 Twrite
137 2 0.620 2.962 137 2 0.616 2.955 Tclunk
169 0 0.468 169 0 0.455 Tstat
11 0 ndirread
7 0 ndelegateread
0 0 ninsert
0 0 ndelete
5 0 nupdate
1716594 0 bytesread
4668 0 byteswritten
0 0 bytesfromserver
3769 0 bytesfromdirs
1712825 0 bytesfromcache
0 0 bytestocache
Client Server
#calls Δ ms/call Δ #calls Δ ms/call Δ
1 0 0.750 1 0 0.743 Tversion
7 0 0.575 7 0 0.569 Tauth
7 0 0.893 7 0 0.888 Tattach
388 61 0.496 0.529 385 60 0.494 0.532 Twalk
152 3 0.520 1.527 149 2 0.471 0.527 Topen
773 7 0.311 4.267 20 2 3.330 0.507 Tread
17 1 5.111 5.746 17 1 4.822 0.940 Twrite
140 3 0.638 1.451 140 3 0.634 1.442 Tclunk
227 58 0.481 0.517 227 58 0.468 0.507 Tstat
13 2 ndirread
7 0 ndelegateread
0 0 ninsert
0 0 ndelete
5 0 nupdate
1736279 19685 bytesread
4693 25 byteswritten
0 0 bytesfromserver
11431 7662 bytesfromdirs
1724848 12023 bytesfromcache
25 25 bytestocache
Every time I read from the /cfsctl file after the data had changed, the new data would be appended at the end, and this process continues arbitrarily. This seems paradoxical to me because statbuf is a static array of 2048 bytes, and there are only 2 9p requests being fulfilled, one for a bit over 1000 bytes and one for 0. Here is what the debugging prints look like with this version of the code (note that these offsets are faked by my hack; the actual offsets in the c.thdr requests are huge):
rread cfsctl:
cnt 8192 > (statlen 1127 - off 1127) setting c.rhdr.count to 0
c.rhdr.data 4e067 set from statbuf 4dc00 + off 1127
genstats: statlen 1127 = p 4e067 - statbuf 4dc00
rread cfsctl:
cnt 4811 > (statlen 1127 - off 0) setting c.rhdr.count to 1127
c.rhdr.data 4dc00 set from statbuf 4dc00 + off 0
rread cfsctl:
cnt 8192 > (statlen 1127 - off 1127) setting c.rhdr.count to 0
c.rhdr.data 4e067 set from statbuf 4dc00 + off 1127
genstats: statlen 1127 = p 4e067 - statbuf 4dc00
rread cfsctl:
cnt 3684 > (statlen 1127 - off 0) setting c.rhdr.count to 1127
c.rhdr.data 4dc00 set from statbuf 4dc00 + off 0
rread cfsctl:
cnt 8192 > (statlen 1127 - off 1127) setting c.rhdr.count to 0
c.rhdr.data 4e067 set from statbuf 4dc00 + off 1127
genstats: statlen 1127 = p 4e067 - statbuf 4dc00
rread cfsctl:
cnt 2557 > (statlen 1127 - off 0) setting c.rhdr.count to 1127
c.rhdr.data 4dc00 set from statbuf 4dc00 + off 0
This data was collected in a test where I was trying to see just how far the extra buffering/appending would go - and so far I have not found a limit. At the moment, I am receiving over a megabyte of data from reads of /cfsctl:
cpu% cat /cfsctl |wc
24184 130469 1200544
None of this data is being stored by the cfs process itself; its memory footprint is far too small:
cpu% ps -a |grep cfs
glenda 211 0:00 0:56 312K Pread cfs
Here is some of the data from the raw 9p requests printed after the convM2S in rcvmsg:
rcvmsg: count is 1430 offset is 1194602
rcvmsg: count is 8192 offset is 1195729
rcvmsg: count is 8192 offset is 1195729
rcvmsg: count is 20 offset is 323088
rcvmsg: count is 8192 offset is 1195729
rcvmsg: count is 303 offset is 1195729
rcvmsg: count is 8192 offset is 1196032
rcvmsg: count is 8192 offset is 1196032
rcvmsg: count is 20 offset is 323088
rcvmsg: count is 8192 offset is 1196032
rcvmsg: count is 8192 offset is 1196032
I have some logic to turn those prints on and off depending on whether or not the /cfsctl file is being read, so I’m not sure if the “count is 20” messages are part of the cfsctl read transaction. The complete debug code I’m running is at http://files.9gridchan.org/cfsdebug.c and the raw logs from which I’m extracting these debug samples are at http://files.9gridchan.org/cfsdebug.log
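For readers unfamiliar with the 9p library calls involved, the rcvmsg prints come from a diagnostic along these lines, placed just after the raw message is decoded. This is only a sketch: printreadreq is a made-up name for illustration and the on/off gating logic mentioned above is omitted; the real code is in cfsdebug.c linked above.

#include <u.h>
#include <libc.h>
#include <fcall.h>

/* sketch: decode a raw 9p message and report the count/offset of a Tread */
static void
printreadreq(uchar *msg, uint n)
{
	Fcall t;

	if(convM2S(msg, n, &t) == 0)
		return;		/* malformed or truncated message */
	if(t.type == Tread)
		fprint(2, "rcvmsg: count is %ud offset is %lld\n", t.count, t.offset);
}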
So What/Why/How?
What seems to be happening is that the kernel is caching responses to reads of the /cfsctl file and maintaining the offset between successive invocations of cat. Somewhere in the communication chain between cat, the mount device (devmnt.c in the kernel) and the actual cfs program, things are getting confused, and the kernel must be keeping a large buffer of data which it is replaying to each new cat.
The cause of the progressively larger offsets to cat is unclear to me. I don’t understand how whatever cfs is doing wrong is causing cat/devmnt to behave in this way.
The Lightbulb Goes On
After writing up all the above, I was ready to consult the lead 9front dev, Cinap. As usual, he was able to diagnose and fix the issue in about five minutes of total irc discussion. Ironically enough for me as a namespace fanatic, I was forgetting something about namespaces in the standard /lib/namespace file. The very first line:
mount -aC #s/boot /root $rootspec
The -C flag is the key here. As we know from man 1 mount:
-C (Only in mount.) By default, file contents are
always retrieved from the server. With this
option, the kernel may instead use a local cache
to satisfy read(5) requests for files accessible
through this mount point.
So now things fall into place: we are seeing an interaction between a caching mechanism intended for use with fses serving static files and the synthetic cfsctl statistics file. In addition to pointing out the cause, Cinap also had a fast and easy fix: increment the version field of the qid (the unique 9p protocol file identifier) at the end of the genstats() routine:
ctlqid.vers++;
Along with a minor adjustment to when genstats() is invoked, that was all that was necessary to fix the bad behavior of cfsctl in combination with the -C flag to mount.
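To show where that one line lands, here is a sketch of genstats() with the fix in place; the body is a reconstruction, not the patched cfs.c, and ctlqid is taken to be the qid cfs hands out for /cfsctl. The kernel's -C cache validates its cached contents against the file's qid, so bumping the version whenever the stats are regenerated means the next open of /cfsctl presents a changed qid and the stale cached data is thrown away:

static Qid ctlqid;	/* the qid cfs reports for /cfsctl */

static void
genstats(void)
{
	char *p;

	p = statbuf;
	/* ...regenerate the statistics text into statbuf as before... */
	statlen = p - statbuf;
	ctlqid.vers++;	/* mark /cfsctl as changed for the kernel's -C read cache */
}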
Lessons Learned
One of the hardest challenges in debugging is remembering everything you know - I certainly knew the -C flag existed in the sense that I had read the manpage for mount many times, and had seen it used in the mount of root in the standard namespace file. Despite this, when my debugging led me to conclude that the kernel was supplying previously cached data to a read request, the existence and relevance of the -C flag completely slipped my mind.
You might say it was a “can’t see the forest for the trees” issue - I was bogged down in the details of how cfs was handling 9p messages, and I didn’t manage to take a step back from those specifics and notice that the kernel was being specifically told to keep a read cache for the root filesystem which it was providing.