From mboxrd@z Thu Jan 1 00:00:00 1970 Message-ID: Date: Wed, 12 Mar 2008 20:33:26 -0400 From: "Russ Cox" To: "Fans of the OS Plan 9 from Bell Labs" <9fans@9fans.net> In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline References: Subject: Re: [9fans] infrequent panics Topicbox-Message-UUID: 7720263e-ead3-11e9-9d60-3106f5b1d025 This is a cool bug; thanks for the report. I love a good concurrency bug. > Wed Mar 12 13:20:01: hdr 0a110c09 00000020 f018c7ce cafebabe 20737023 7064755b > Wed Mar 12 13:20:01: tail 0a110c09 00000020 f018c7ce cafebabe 20737023 7064755b | 72657320 20726576 This says "#ps [udp server]" (I'm guessing about the last few bytes), which is the name of the dns proc as it shows up in ps. The stack trace showed this buffer being corrupt in devattach, which executes: buf = smalloc(4+strlen(spec)+1); sprint(buf, "#%C%s", tc, spec); c->path = newpath(buf); free(buf); There's already something wrong here -- if spec contains bad UTF then buf will overflow all the time. Really it should be snprint or sprint(buf, "#%C") followed by strcat. Let's ignore that, since there's no bad UTF in your string. If you poke around at how this gets called on a file like #p/1/blah, you'll find that at this point in the code, spec == up->genbuf+2. So if something else was overwriting up->genbuf at just the wrong time (between the strlen and the execution of %s) with the process name as it shows up in ps, then you'd see the corruption in the stack trace. Since up->genbuf is private to up (i.e., the current process) I wondered at first if somehow the scheduler was broken and was trying to run the same Proc on multiple CPUs, but Erik assured me the scheduler had not been touched. But aha! If you poke around in the source code you will find that there is one place where one proc uses another's genbuf: % g genbuf | grep -v 'up->genbuf' chan.c:1630: /* place final element in genbuf for e.g. exec */ devproc.c:717: j = procargs(p, p->genbuf, sizeof p->genbuf); devproc.c:723: memmove(a, &p->genbuf[offset], n); portdat.h:701: char genbuf[128]; /* buffer used e.g. for last name element from namec */ % And look, it's the code for reading #p/1/args, which is how ps figures out what to print: /sys/src/9/port/devproc.c:/^procread case Qargs: qlock(&p->debug); j = procargs(p, p->genbuf, sizeof p->genbuf); qunlock(&p->debug); if(offset >= j) return 0; if(offset+n > j) n = j-offset; memmove(a, &p->genbuf[offset], n); return n; So if you ran ps at exactly the right time (while one of the procs on the machine was opening a #-ed file name and was between strlen and sprint), you'd win the race. You might improve your chances if the machine were close to out of memory and the smalloc had to wait for more memory. Oops. Changing those lines to use up->genbuf instead of p->genbuf should eliminate the race. Russ