From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <5d58567057803b04e6d8309c8e97e385@rei2.9hal>
Date: Sat, 4 Aug 2012 06:00:40 +0200
From: cinap_lenrek@gmx.de
To: 9fans@9fans.net
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: [9fans] bind /net
Topicbox-Message-UUID: a7c6afde-ead7-11e9-9d60-3106f5b1d025

Recently helped debug a strange plan9 server problem. The machine
is a cpu/auth/file server basically doing everything: serving http
with rc-httpd, accepting mail, serving dns and running a bunch of
cronjobs doing various things. The machine is quite busy.

It worked quite well for some time. Then it would stop accepting
cpu logins. The client's cpu process would just hang there. Http
would continue to serve fine for a while until that stopped working
too, and finally the machine would lock up and reboot. This happened
about every 2 days.

After some time, we were able to get a picture of what seemed to be
going on. There would be many processes blocked opening
/mnt/factotum/rpc. Trying to ls /mnt would hang the ls... The
machine would slowly accumulate locked up processes until it reached
the 2k process limit...

The problem was that factotum seemed busy in some auth protocol.
(this really sucks. factotum is mounted directly on /mnt instead of
/mnt/factotum and is single threaded, so when it's doing some auth
business, no one can walk /mnt... this can even cause a deadlock
with authsrv, which tries to access /mnt/keys on the same machine...
but that's a different thing...)

But there were no tcp567 or authsrv processes around (the machine is
itself an auth server). Netstat showed 2 established port 567
(ticket) connections: an outgoing one (to itself) and an incoming
one (from itself). So where was that authsrv process?
We grepped for these 2 tcp connections in /proc/*/fd and it turned
out that the incoming one was showing up in the file descriptor
table of *exportfs* processes that were used to import /net from
that machine, instead of any authsrv.

How was this possible? A terminal that was importing /net from this
machine used to run aux/listen1 -t to start some local service
before importing /net in the same namespace.

Why is this a problem? Well, the -t option causes listen1 to not
fork its namespace, so it will notice when we later overmount /net.
On startup, it succeeds in announcing the port on the original /net
and starts listening. Then the parent process changes the /net under
its feet. When a new connection comes in and listen1 tries to accept
it and open its data file, it grabs some random connection on the
*server's* /net instead of the one it was originally listening on!

We grepped for the mysterious ticket connection path on the terminal
and found it as the stdin of the completely unrelated "local"
service on that terminal. And its /proc/xxx/ns file confirmed it was
using the remote /net. Killing that process immediately made the
server unblock itself and continue normal operation.

So don't do this at home, kids. Use /net.alt or face the
consequences.

--
cinap
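
The failure described above comes down to a path being resolved at
open time rather than at announce time. Below is a toy model of that
in Python. It is not Plan 9 code and not how listen1 is implemented:
the NetStack and Namespace classes, the conversation number 7 and the
peer descriptions are all made up for illustration. It only assumes
that the listener learns a conversation number from the stack it
announced on and then opens the data file by path.

```python
class NetStack:
    """Toy stand-in for one machine's network stack (hypothetical)."""

    def __init__(self, host):
        self.host = host
        self.conv = {}  # conversation number -> who is on the other end

    def incoming(self, num, peer):
        """Record a new connection under the given conversation number."""
        self.conv[num] = peer
        return num


class Namespace:
    """Toy shared namespace: mount point -> NetStack (hypothetical)."""

    def __init__(self):
        self.mounts = {}

    def bind(self, stack, mountpoint):
        # Binding onto an existing mount point hides what was there before.
        self.mounts[mountpoint] = stack

    def open(self, path):
        # The path is resolved when open is called, not when it was built.
        root, _proto, num, _fname = path.strip("/").split("/")
        stack = self.mounts["/" + root]
        return f"{stack.host}: {stack.conv[int(num)]}"


def demo(import_mountpoint):
    terminal = NetStack("terminal")
    server = NetStack("server")
    # The server happens to have its own conversation 7.
    server.incoming(7, "ticket request on port 567")

    ns = Namespace()
    ns.bind(terminal, "/net")  # the listener announces on the local /net

    # Later the parent imports the server's /net into the same namespace.
    ns.bind(server, import_mountpoint)

    # A call arrives on the terminal; the listener opens its data file by path.
    num = terminal.incoming(7, "client of the local service")
    return ns.open(f"/net/tcp/{num}/data")


if __name__ == "__main__":
    print(demo("/net"))      # overmounting /net: the server's connection
    print(demo("/net.alt"))  # importing to /net.alt: the local connection
```

With the import bound over /net, the listener ends up holding the
server's conversation. With the import on /net.alt, the same open
still reaches the terminal's own stack.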