From mboxrd@z Thu Jan 1 00:00:00 1970
From: arisawa
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Message-Id: <8FB7CBFD-7334-4F9F-8C71-571DEF9FAD31@ar.aichi-u.ac.jp>
Date: Wed, 17 Feb 2016 00:52:37 +0900
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2104\))
Subject: [9fans] file descriptor leak
Topicbox-Message-UUID: 851bad0c-ead9-11e9-9d60-3106f5b1d025

Hello,

I have observed warning messages from dns server:
dns 30792: warning process exceeded 100 file descriptors
dns 30888: warning process exceeded 200 file descriptors
…

probably the file descriptor leak comes from dnresolve.c

udpquery(Query *qp, char *mntpt, int depth, int patient, int inns)
{
	…
			msg = system(open("/dev/null", ORDWR), "outside");
	…
}

char *
system(int fd, char *cmd)
{
	int pid, p, i;
	static Waitmsg msg;

	if((pid = fork()) == -1)
		sysfatal("fork failed: %r");
	else if(pid == 0){
		dup(fd, 0);
		close(fd);
		for (i = 3; i < 200; i++)
			close(i);	/* don't leak fds */
		execl("/bin/rc", "rc", "-c", cmd, nil);
		sysfatal("exec rc: %r");
	}
	for(p = waitpid(); p >= 0; p = waitpid())
		if(p == pid)
			return msg.msg;
	return "lost child";
}

fd is lost if pid > 0

my server is running on 9front. however both 9atom and bell-labs use same routine.

Kenji Arisawa

From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <8FB7CBFD-7334-4F9F-8C71-571DEF9FAD31@ar.aichi-u.ac.jp>
References: <8FB7CBFD-7334-4F9F-8C71-571DEF9FAD31@ar.aichi-u.ac.jp>
Date: Tue, 16 Feb 2016 10:56:55 -0500
Message-ID:
From: Jacob Todd
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: multipart/alternative; boundary=001a11348e36c5754c052be52e31
Subject: Re: [9fans] file descriptor leak
Topicbox-Message-UUID: 852045ec-ead9-11e9-9d60-3106f5b1d025

--001a11348e36c5754c052be52e31
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

I've had this problem too, I have yet to resolve it.

On Feb 16, 2016 10:54 AM, "arisawa" wrote:

> Hello,
>
> I have observed warning messages from dns server:
> dns 30792: warning process exceeded 100 file descriptors
> dns 30888: warning process exceeded 200 file descriptors
> …
>
> probably the file descriptor leak comes from dnresolve.c
>
> udpquery(Query *qp, char *mntpt, int depth, int patient, int inns)
> {
>         …
>                         msg = system(open("/dev/null", ORDWR), "outside");
>         …
> }
>
> char *
> system(int fd, char *cmd)
> {
>         int pid, p, i;
>         static Waitmsg msg;
>
>         if((pid = fork()) == -1)
>                 sysfatal("fork failed: %r");
>         else if(pid == 0){
>                 dup(fd, 0);
>                 close(fd);
>                 for (i = 3; i < 200; i++)
>                         close(i);       /* don't leak fds */
>                 execl("/bin/rc", "rc", "-c", cmd, nil);
>                 sysfatal("exec rc: %r");
>         }
>         for(p = waitpid(); p >= 0; p = waitpid())
>                 if(p == pid)
>                         return msg.msg;
>         return "lost child";
> }
>
> fd is lost if pid > 0
>
> my server is running on 9front. however both 9atom and bell-labs use same
> routine.
>
> Kenji Arisawa
>

--001a11348e36c5754c052be52e31
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable




--001a11348e36c5754c052be52e31--

From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID:
To: 9fans@9fans.net
Date: Tue, 16 Feb 2016 18:42:02 +0200
From: lucio@proxima.alt.za
In-Reply-To:
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: Re: [9fans] file descriptor leak
Topicbox-Message-UUID: 8524728e-ead9-11e9-9d60-3106f5b1d025

> I've had this problem too, I have yet to resolve it.

You could start by tossing the very first and totally superfluous
"else", to simplify things. Then, it would be tempting to take the

	dup(fd,0); close(fd);

out to before the if(pid==0)...

Well spotted, Kenji. Let's hope my shot in the dark is half as good :-)

Lucio.

From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To:
References:
Date: Tue, 16 Feb 2016 12:05:59 -0500
Message-ID:
From: Jacob Todd
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: multipart/alternative; boundary=089e01176549bfcf5a052be625eb
Subject: Re: [9fans] file descriptor leak
Topicbox-Message-UUID: 85285f0c-ead9-11e9-9d60-3106f5b1d025

--089e01176549bfcf5a052be625eb
Content-Type: text/plain; charset=UTF-8

what's your fucking problem?

--089e01176549bfcf5a052be625eb
Content-Type: text/html; charset=UTF-8


--089e01176549bfcf5a052be625eb--

From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To:
References:
Date: Tue, 16 Feb 2016 17:17:33 +0000
Message-ID:
From: Charles Forsyth
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: multipart/alternative; boundary=047d7b5d869522aa39052be64fdb
Subject: Re: [9fans] file descriptor leak
Topicbox-Message-UUID: 852d323e-ead9-11e9-9d60-3106f5b1d025

--047d7b5d869522aa39052be64fdb
Content-Type: text/plain; charset=UTF-8

On 16 February 2016 at 16:42, wrote:

> Then, it would be tempting to take the
>
>         dup(fd,0); close(fd);
>
> out to before the if(pid==0)...
>

the idea is to have fd (/dev/null in this case) be standard input in the
new process, so it needs to follow the pid==0 test.

the "outside" command is one that wasn't distributed (I assume it bound a
separate #I interface on /net.alt and set it up), so the code currently
probably doesn't do anything useful elsewhere.

> probably the file descriptor leak comes from dnresolve.c

you can cat /proc/$dnspid/fd
where dnspid is the process id of one or more of the active dns processes,
to see which files are open, after the message appears.
if there are many /dev/null open, that suggests your idea was right.

i think you're right that it leaks an fd to /dev/null in that system call,
so it should instead open /dev/null separately and assign fd before the call
and close it afterwards.

even so, i wonder if that's really what's happening in every case of
"more than N fds", because the call to outside is only needed in the case
that the udp under /net.alt is being used and an open there has failed.
still, looking at the /proc/N/fd file should help decide that.

--047d7b5d869522aa39052be64fdb
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

--047d7b5d869522aa39052be64fdb--

From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <21e3f8eee50170d6fcd21c43384eba04@felloff.net>
Date: Tue, 16 Feb 2016 19:01:38 +0100
From: cinap_lenrek@felloff.net
To: 9fans@9fans.net
In-Reply-To: <8FB7CBFD-7334-4F9F-8C71-571DEF9FAD31@ar.aichi-u.ac.jp>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: Re: [9fans] file descriptor leak
Topicbox-Message-UUID: 853366cc-ead9-11e9-9d60-3106f5b1d025

its worse. the static msg.msg is useless, its never set anywhere.

the interface of system() makes no sense. its only used for running
"outside" and the parent proc doesnt need the fd to /dev/null, it could
as well just open it in the child like:

	close(0); open("/dev/null", OREAD);

the caller goes at some lengths queueing processes waiting for the
remount to complete, but does it in two levels. there's a sleep and a
qlock. limitation on the number of processes that can be queued on a
qlock()?

also, when fork() returns -1, it kills the parent.

the whole thing about remounting /net.alt seems wrong. it probably fixed
some issue at the labs with /net.alt being imported from some other
machine that kept rebooting or something.

after spending 5 minutes writing the code fixing all these issues
mentioned above, i'll just throw it all away and delete the whole
remounting logic for /net.alt in 9front.

--
cinap

From mboxrd@z Thu Jan 1 00:00:00 1970
From: erik quanstrom
Date: Tue, 16 Feb 2016 13:05:23 -0800
To: 9fans@9fans.net
Message-ID: <57d7fb61aff4122ba314caf431c07182@lilly.quanstro.net>
In-Reply-To: <21e3f8eee50170d6fcd21c43384eba04@felloff.net>
References: <21e3f8eee50170d6fcd21c43384eba04@felloff.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: Re: [9fans] file descriptor leak
Topicbox-Message-UUID: 8540e59a-ead9-11e9-9d60-3106f5b1d025

> limitation on the number of processes that can be queued on a qlock()?

in user space, yes.

- erik

From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <21e3f8eee50170d6fcd21c43384eba04@felloff.net>
References: <8FB7CBFD-7334-4F9F-8C71-571DEF9FAD31@ar.aichi-u.ac.jp> <21e3f8eee50170d6fcd21c43384eba04@felloff.net>
Date: Tue, 16 Feb 2016 21:16:27 +0000
Message-ID:
From: Charles Forsyth
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: multipart/alternative; boundary=001a11469c6284482e052be9a584
Subject: Re: [9fans] file descriptor leak
Topicbox-Message-UUID: 8544f27a-ead9-11e9-9d60-3106f5b1d025

--001a11469c6284482e052be9a584
Content-Type: text/plain; charset=UTF-8

On 16 February 2016 at 18:01, wrote:

> and the parent proc doesnt need the fd to /dev/null, it could as well just
> open it in the child like:
>
>         close(0); open("/dev/null", OREAD);
>

There's no harm in making and using a more general function, even in a
specific way, so that part's ok.
The caller just needs to play its part properly.

> after spending 5 minutes writing the code fixing all these issues mentioned
> above, i'll just throw it all away and delete the whole remounting logic
> for /net.alt in 9front.

It's often better to use the Erlang fail-fast ("just fail") and restart
approach for persistent services.

More important would be to look at /proc/N/fd on a failing system.
I've a feeling that the system/outside stuff isn't actually the problem,
since I've seen the diagnostic on a system that wasn't using /net.alt.
In that case, the problem (as I remember it) was that an Internet link
further on was down, so no messages got through to remote DNS, and file
descriptors were building up in slave processes waiting for replies on
/net/udp. Once the link was up, it went back to normal.

--001a11469c6284482e052be9a584
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

--001a11469c6284482e052be9a584--

From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <8FB7CBFD-7334-4F9F-8C71-571DEF9FAD31@ar.aichi-u.ac.jp>
References: <8FB7CBFD-7334-4F9F-8C71-571DEF9FAD31@ar.aichi-u.ac.jp>
Date: Tue, 16 Feb 2016 22:24:08 +0000
Message-ID:
From: Charles Forsyth
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: multipart/alternative; boundary=001a11469c6288066f052bea9704
Subject: Re: [9fans] file descriptor leak
Topicbox-Message-UUID: 8548e718-ead9-11e9-9d60-3106f5b1d025

--001a11469c6288066f052bea9704
Content-Type: text/plain; charset=UTF-8

On 16 February 2016 at 15:52, arisawa wrote:

>
> I have observed warning messages from dns server:
> dns 30792: warning process exceeded 100 file descriptors
> dns 30888: warning process exceeded 200 file descriptors
>

It's worth noting that this message doesn't necessarily mean you've got a
file descriptor leak.
It might, but at the start it just means that a process (or process group
sharing a file descriptor group) has many file descriptors open.
That could happen if a server with many clients has a file descriptor per
client and then client requests open some more.

ndb/dns in particular will try user-level requests in parallel, and those
in turn can lead to several concurrent queries to various name servers at
a given level (root itself has 13). Web browser clients will fetch page
elements concurrently. That's why it's useful to check the /proc/N/fd file
to try to see what they are.
(Not just for ndb/dns, but for other applications that provoke the message.)
Arguably, if you're using the system in a real, Internet-facing
application, the warning message might be obsolete, compared to the time
when even 100 clients was a big network.

--001a11469c6288066f052bea9704
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

--001a11469c6288066f052bea9704--

From mboxrd@z Thu Jan 1 00:00:00 1970
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2104\))
From: arisawa
In-Reply-To:
Date: Wed, 17 Feb 2016 10:13:49 +0900
Content-Transfer-Encoding: quoted-printable
Message-Id:
References: <8FB7CBFD-7334-4F9F-8C71-571DEF9FAD31@ar.aichi-u.ac.jp>
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Subject: Re: [9fans] file descriptor leak
Topicbox-Message-UUID: 854ce7aa-ead9-11e9-9d60-3106f5b1d025

the logic (code? usage?) of system(int fd, char *cmd) is ugly.
thanks for fixing, cinap.

Charles may have a point, because the dns server is running only for my
home use. only my family uses the dns service. my home network is very
simple. I have never used /net.alt for that.

the dns had been running on the file server and sometimes (once a month or
so) ran into trouble.
a few weeks ago, I separated it from the file server so that I can see
whether the trouble really comes from dns.

the contents of /proc/N/fd are shown below.
it seems /dev/null is not the culprit, as Charles suggested.

how to find fd leakage?
thanks in advance

io% ps|grep dns
arisawa         248    0:00   0:00      348K Await    dns
arisawa         249    0:00   0:00    14584K Pread    dns
arisawa         250    0:12   0:12    14576K Pread    dns
io% cat /proc/^(248 249 250)^/fd
/usr/arisawa
  0 r  M    8 (0000000000000005 0 00)      8192        1 /dev/cons
  1 w  c    0 (0000000000000002 0 00)         0        1 #c/cons
  2 w  c    0 (0000000000000002 0 00)         0      193 #c/cons
  3 r  c    0 (000000000000000f 0 00)         0        4 /dev/random
  4 w  M   22 (000000000000ce7e 54050 40)  8192  1098023 /sys/log/dns
  5 r  c    0 (0000000000000001 0 00)         0  6873616 /dev/bintime
  6 r  M   22 (000000000000929e 3098 00)   8192    10949 /lib/ndb/local
  7 r  M   22 (000000000000929b 47 00)     8192    10242 /lib/ndb/common
  8 r  I    0 (0000000000000004 3 00)         0      127 /net/ndb
  9 rw |    0 (0000000000000fc1 0 00)     65536     3954 #|/data
 11 w  s    0 (0000000000000009 0 00)         0        2 #s/dns
/usr/arisawa
  0 r  M    8 (0000000000000005 0 00)      8192        1 /dev/cons
  1 w  c    0 (0000000000000002 0 00)         0        1 #c/cons
  2 w  c    0 (0000000000000002 0 00)         0      193 #c/cons
  3 r  c    0 (000000000000000f 0 00)         0        4 /dev/random
  4 w  M   22 (000000000000ce7e 54050 40)  8192  1098023 /sys/log/dns
  5 r  c    0 (0000000000000001 0 00)         0  6873816 /dev/bintime
  6 r  M   22 (000000000000929e 3098 00)   8192    10949 /lib/ndb/local
  7 r  M   22 (000000000000929b 47 00)     8192    10242 /lib/ndb/common
  8 r  I    0 (0000000000000004 3 00)         0      127 /net/ndb
  9 rw |    0 (0000000000000fc1 0 00)     65536     3954 #|/data
 11 w  s    0 (0000000000000009 0 00)         0        2 #s/dns
 12 rw I    0 (000000000002002d 0 00)         0     3362 /net/udp/1/data
/usr/arisawa
  0 r  M    8 (0000000000000005 0 00)      8192        1 /dev/cons
  1 w  c    0 (0000000000000002 0 00)         0        1 #c/cons
  2 w  c    0 (0000000000000002 0 00)         0      193 #c/cons
  3 r  c    0 (000000000000000f 0 00)         0        4 /dev/random
  4 w  M   22 (000000000000ce7e 54050 40)  8192  1098023 /sys/log/dns
  5 r  c    0 (0000000000000001 0 00)         0  6873824 /dev/bintime
  6 r  M   22 (000000000000929e 3098 00)   8192    10949 /lib/ndb/local
  7 r  M   22 (000000000000929b 47 00)     8192    10242 /lib/ndb/common
  8 r  I    0 (0000000000000004 3 00)         0      127 /net/ndb
  9 rw |    0 (0000000000000fc1 0 00)     65536     3954 #|/data
 11 w  s    0 (0000000000000009 0 00)         0        2 #s/dns
 12 rw I    0 (000000000002002d 0 00)         0     3362 /net/udp/1/data
io%

> On 2016/02/17 7:24, Charles Forsyth wrote:
>
>
> On 16 February 2016 at 15:52, arisawa wrote:
>
> I have observed warning messages from dns server:
> dns 30792: warning process exceeded 100 file descriptors
> dns 30888: warning process exceeded 200 file descriptors
>
> It's worth noting that this message doesn't necessarily mean you've got a file descriptor leak.
> It might, but at the start it just means that a process (or process group sharing a file descriptor group) has many file descriptors open.
> That could happen if a server with many clients has a file descriptor per client and then client requests open some more.
>
> ndb/dns in particular will try user-level requests in parallel, and those in turn can lead to several concurrent
> queries to various name servers at a given level (root itself has 13). Web browser clients will fetch page
> elements concurrently. That's why it's useful to check the /proc/N/fd file to try to see what they are.
> (Not just for ndb/dns, but for other applications that provoke the message.)
> Arguably, if you're using the system in a real, Internet-facing application, the warning message might
> be obsolete, compared to the time when even 100 clients was a big network.

From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <6b117ce996cbffe421ed47e6ddfdc965@felloff.net>
Date: Wed, 17 Feb 2016 02:22:06 +0100
From: cinap_lenrek@felloff.net
To: 9fans@9fans.net
In-Reply-To:
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: Re: [9fans] file descriptor leak
Topicbox-Message-UUID: 8550e2d8-ead9-11e9-9d60-3106f5b1d025

looks good. thanks for reporting mr. arisawa.

--
cinap

From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <8607db9c49d8dc4c1073ac90653f6466@felloff.net>
Date: Wed, 17 Feb 2016 03:19:19 +0100
From: cinap_lenrek@felloff.net
To: 9fans@9fans.net
In-Reply-To:
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: Re: [9fans] file descriptor leak
Topicbox-Message-UUID: 8554de42-ead9-11e9-9d60-3106f5b1d025

> It's often better to use the Erlang fail-fast ("just fail") and restart
> approach for persistent services.

indeed. ndb/dns also has this "restart" ctl message where it restarts
the server part but keeps the 9p pipe posted. tho that kills all the
fids and new fids start as a "dummy" user owned file, not a directory,
so after restart /net/dns vanishes.

doing some further experiments, it appears that when i make new fids
start out as directories it works. :-)

ndb/cs also has some logic where it remounts /srv/dns when opening
/net/dns fails.

--
cinap

From mboxrd@z Thu Jan 1 00:00:00 1970
From: erik quanstrom
Date: Sun, 21 Feb 2016 21:18:26 -0800
To: 9fans@9fans.net
Message-ID:
In-Reply-To:
References: <8FB7CBFD-7334-4F9F-8C71-571DEF9FAD31@ar.aichi-u.ac.jp> <21e3f8eee50170d6fcd21c43384eba04@felloff.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: Re: [9fans] file descriptor leak
Topicbox-Message-UUID: 87074216-ead9-11e9-9d60-3106f5b1d025

On Tue Feb 16 13:19:29 PST 2016, charles.forsyth@gmail.com wrote:

> On 16 February 2016 at 18:01, wrote:
>
> > and the parent proc doesnt need the fd to /dev/null, it could as well just
> > open it in the child like:
> >
> >         close(0); open("/dev/null", OREAD);
> >
> >
> There's no harm in making and using a more general function, even in a
> specific way, so that part's ok.
> The caller just needs to play its part properly.
>
> > after spending 5 minutes writing the code fixing all these issues mentioned
> > above, i'll just throw it all away and delete the whole remounting logic
> > for /net.alt in 9front.
>
>
> It's often better to use the Erlang fail-fast ("just fail") and restart
> approach for persistent services.
>
> More important would be to look at /proc/N/fd on a failing system.
> I've a feeling that the system/outside stuff isn't actually the problem,
> since I've seen the diagnostic on a system that wasn't using /net.alt.
> In that case, the problem (as I remember it) was that an Internet link
> further on was down,
> so no messages got through to remote DNS, and file descriptors were
> building up in slave processes
> waiting for replies on /net/udp. Once the link was up, it went back to
> normal.

we saw this a lot at coraid, but never did catch a smoking-gun process.
i don't recall a perfect correlation to internet down.

- erik
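
For readers on a POSIX system, the leak Kenji describes and the caller-side fix Charles suggests (open /dev/null separately, pass the fd, close it afterwards) can be sketched roughly as below. This is only an illustration under assumptions. It is not the ndb/dns code and not the change that went into 9front. It uses POSIX calls rather than Plan 9's, /bin/sh and `true` stand in for /bin/rc and the undistributed "outside" command, and the names run_with_stdin, leaky_call, fixed_call, count_open_fds and leaked_after are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Count the descriptors currently open in this process (fds 0..1023). */
int
count_open_fds(void)
{
	int fd, n = 0;

	for(fd = 0; fd < 1024; fd++)
		if(fcntl(fd, F_GETFD) != -1)
			n++;
	return n;
}

/*
 * Run cmd under /bin/sh with fd as its standard input.
 * Like the system() quoted in the thread, it never closes the
 * parent's copy of fd; that is left to the caller.
 * Returns the child's wait status, or -1 on failure.
 */
int
run_with_stdin(int fd, const char *cmd)
{
	pid_t pid;
	int status;

	pid = fork();
	if(pid == -1)
		return -1;
	if(pid == 0){
		/* child: make fd standard input, then exec the shell */
		dup2(fd, 0);
		if(fd != 0)
			close(fd);
		execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
		_exit(127);
	}
	if(waitpid(pid, &status, 0) == -1)
		return -1;
	return status;
}

/* The pattern from udpquery(): the fd is opened inline and then lost. */
void
leaky_call(void)
{
	run_with_stdin(open("/dev/null", O_RDWR), "true");
}

/* The caller keeps the fd and closes it once the child is done. */
void
fixed_call(void)
{
	int fd;

	fd = open("/dev/null", O_RDWR);
	if(fd < 0)
		return;
	run_with_stdin(fd, "true");
	close(fd);
}

/* How many descriptors remain open after calling call() rounds times. */
int
leaked_after(void (*call)(void), int rounds)
{
	int i, before;

	assert(rounds >= 0);
	before = count_open_fds();
	for(i = 0; i < rounds; i++)
		call();
	return count_open_fds() - before;
}
```

Calling leaked_after(leaky_call, 5) from a small main should report 5 descriptors left open, and leaked_after(fixed_call, 5) should report 0. That is the same kind of evidence a growing run of /dev/null entries in /proc/N/fd would give on Plan 9. Opening /dev/null inside the child instead, as cinap suggests, would avoid the parent-side descriptor altogether.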