From mboxrd@z Thu Jan 1 00:00:00 1970 X-Msuck: nntp://news.gmane.io/gmane.comp.sysutils.supervision.general/1937 Path: news.gmane.org!not-for-mail From: Charlie Brady Newsgroups: gmane.comp.sysutils.supervision.general Subject: runsv spinning 100% CPU (was Re: runsv and EAGAIN) Date: Tue, 20 Jan 2009 18:21:10 -0500 (EST) Message-ID: References: <20080712160716.GA30532@fly.srk.fer.hr> NNTP-Posting-Host: lo.gmane.org Mime-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463811723-1451312741-1232493670=:6370" X-Trace: ger.gmane.org 1232493686 20946 80.91.229.12 (20 Jan 2009 23:21:26 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Tue, 20 Jan 2009 23:21:26 +0000 (UTC) To: supervision@list.skarnet.org Original-X-From: supervision-return-2172-gcsg-supervision=m.gmane.org@list.skarnet.org Wed Jan 21 00:22:38 2009 Return-path: Envelope-to: gcsg-supervision@gmane.org Original-Received: from antah.skarnet.org ([212.85.147.14]) by lo.gmane.org with smtp (Exim 4.50) id 1LPPvZ-0000aH-5R for gcsg-supervision@gmane.org; Wed, 21 Jan 2009 00:22:37 +0100 Original-Received: (qmail 27313 invoked by uid 76); 20 Jan 2009 23:21:44 -0000 Mailing-List: contact supervision-help@list.skarnet.org; run by ezmlm List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-Archive: Original-Received: (qmail 27305 invoked from network); 20 Jan 2009 23:21:44 -0000 X-X-Sender: charlieb@e-smith.charlieb.ott.istop.com In-Reply-To: Xref: news.gmane.org gmane.comp.sysutils.supervision.general:1937 Archived-At: ---1463811723-1451312741-1232493670=:6370 Content-Type: TEXT/PLAIN; charset=X-UNKNOWN; format=flowed Content-Transfer-Encoding: 8BIT I've finally found time to look at this issue again, and have found a way to replicate the issue at will. To replicate, create a service directory containing an empty file supervise/control in place of the fifo. Create a symlink so that runsvdir spawns a new runsv process. runsv will then behave as shown in the strace plus commentary shown below. The unexpected behaviour comes from a combination of two factors - supervise/control is a file and not a fifo, and linux poll() sets POLLIN for a regular file at EOF: http://www.greenend.org.uk/rjk/2001/06/poll.html IMO this is an error condition which runsv should handle more gracefully. The question is - how should the error be handled or corrected? One way would be for runsv to replace any existing supervise/control at startup unless it is a fifo. On Sat, 12 Jul 2008, Charlie Brady wrote: > On Sat, 12 Jul 2008, Dražen Kačar wrote: > >> Charlie Brady wrote: >> > > Here's a sample strace: >> > > >> > > rt_sigprocmask(SIG_UNBLOCK, [TERM], NULL, 8) = 0 >> > > rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0 >> > > poll([{fd=3, events=POLLIN}, {fd=9, events=POLLIN, revents=POLLIN}, >> > > {fd=11, events=POLLIN}], 3, 1000020) = 1 >> >> You only have POLLIN in revents for fd 9. Also, the return value is 1, >> meaning that only one file descriptor has readable data. >> >> > > rt_sigprocmask(SIG_BLOCK, [TERM], NULL, 8) = 0 >> > > rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0 >> > > read(3, 0xbfa256c3, 1) = -1 EAGAIN (Resource >> > > temporarily >> > > unavailable) >> > > waitpid(-1, 0xbfa256bc, WNOHANG) = 0 >> > > read(9, "", 1) = 0 >> > > read(11, 0xbfa256c3, 1) = -1 EAGAIN (Resource >> > > temporarily >> > > unavailable) >> > > gettimeofday({1215792640, 591768}, NULL) = 0 >> > ... >> >> > So poll i saying that data is available, and read is saying that it >> > isn't. >> > Is anyone else confused? >> >> poll() is saying that the data is available on fd 9 and read() on fd 9 >> does not return an error. The other two file descriptors should have not >> been read. > > Thanks. I didn't read the strace at all carefully, did I? > > I see in iopause() that poll() is called without checking the return status. > > fd 3 is runsv's selfpipe. fd 9 is the control fifo and fd 11 is the log > control fifo. Poll tells us that fd 9 had data to read, but then read > returned 0. > > I'd better check (on Monday) to see whether I somehow have multiple runsv > processes in the same directory... > > Here's the relevant part of runsv.c: > > ... > sig_unblock(sig_term); > sig_unblock(sig_child); > iopause(x, 2 +haslog, &deadline, &now); > sig_block(sig_term); > sig_block(sig_child); > > while (read(selfpipe[0], &ch, 1) == 1) > ; > for (;;) { > int child; > int wstat; > > child =wait_nohang(&wstat); > if (!child) break; > if ((child == -1) && (errno != error_intr)) break; > if (child == svd[0].pid) { > svd[0].pid =0; > pidchanged =1; > svd[0].wstat =wstat; > svd[0].ctrl &=~C_TERM; > if (svd[0].state != S_FINISH) > if ((fd =open_read("finish")) != -1) { > close(fd); > svd[0].state =S_FINISH; > update_status(&svd[0]); > continue; > } > svd[0].state =S_DOWN; > taia_uint(&deadline, 1); > taia_add(&deadline, &svd[0].start, &deadline); > taia_now(&svd[0].start); > update_status(&svd[0]); > if (taia_less(&svd[0].start, &deadline)) sleep(1); > } > if (haslog) { > if (child == svd[1].pid) { > svd[1].pid =0; > pidchanged =1; > svd[1].state =S_DOWN; > svd[1].ctrl &=~C_TERM; > taia_uint(&deadline, 1); > taia_add(&deadline, &svd[1].start, &deadline); > taia_now(&svd[1].start); > update_status(&svd[1]); > if (taia_less(&svd[1].start, &deadline)) sleep(1); > } > } > } > if (read(svd[0].fdcontrol, &ch, 1) == 1) ctrl(&svd[0], ch); > if (haslog) > if (read(svd[1].fdcontrol, &ch, 1) == 1) ctrl(&svd[1], ch); > > ... > ---1463811723-1451312741-1232493670=:6370--