From mboxrd@z Thu Jan  1 00:00:00 1970
From: erik quanstrom <quanstro@quanstro.net>
Date: Sun, 17 Nov 2013 18:40:57 -0500
To: 9fans@9fans.net
Message-ID: <dc3acb0e8b875c14223d5e5bd72262c1@mikro>
In-Reply-To: <71F713A4-13CE-424C-B148-7F0238DB9E57@corpus-callosum.com>
References: <71F713A4-13CE-424C-B148-7F0238DB9E57@corpus-callosum.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Subject: Re: [9fans] arm & httpd
Topicbox-Message-UUID: 8ab1d670-ead8-11e9-9d60-3106f5b1d025

On Sun Nov 17 17:32:22 EST 2013, jas@corpus-callosum.com wrote:
> Has anyone else experienced new builds of the sources arm tree getting hung up with semacquire?
>
>  99:     httpd pc     cac0 dbgpc     cac0  Semacquire (Wakeme) ut 1 st 2 bss 168000 qpc 608157d8 nl 0 nd 0 lpc 608758c4 pri 10
>
>
>
> acid: lstk()
> semacquire()+0xc /sys/src/libc/9syscall/semacquire.s:6
> lock(l=0x31208)+0x20 /sys/src/libc/port/lock.c:10
> plock()+0x8 /sys/src/libc/port/malloc.c:80
> 	pv=0x31208
> poolalloc(p=0x35a24,n=0x2c)+0xc /sys/src/libc/port/pool.c:1223
> 	v=0xd970
> mallocz(size=0x24,clr=0x1)+0x18 /sys/src/libc/port/malloc.c:221
> 	v=0x5ffffd39
> getnetconninfo(fd=0xffffffff,dir=0x5ffffeec)+0x78 /sys/src/libc/9sys/getnetconninfo.c:59
> 	path=0x0
> 	nci=0xb
> 	spec=0x0
> 	d=0x0
> 	netname=0x28
> dolisten(address=0xd16dc)+0x134 /sys/src/cmd/ip/httpd/httpd.c:291
> 	spotchk=0x1
> 	dir=0x74656e2f
> 	ctl=0xa
> 	ndir=0x74656e2f
> 	nctl=0xb
> 	swamped=0x0
> 	nci=0x161c40
> 	data=0x313aa
> 	conn=0x73
> 	scheme=0xd16e6
> 	c=0x38898
> 	t=0x5ffffeb4
> 	ok=0xa284
> main(argc=0x0,argv=0x5fffff9c)+0x1c0 /sys/src/cmd/ip/httpd/httpd.c:138
> 	address=0x38846
> 	_argc=0x0
> 	_args=0x0
> _main+0x28 /sys/src/libc/arm/main9.s:19
>
>
> I see this on the second http request, the first completes successfully, and don’t yet know if it’s a dns configuration error or something else.

this is clearly a case of deadlock.

on each allocation the pool library takes the pool lock, holds it
for the duration, and releases it before returning.  for some reason,
the pool lock already appears locked, so you go to the contended
case, which in the standard distribution calls semacquire, and
wait forever.
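
a rough sketch of the contended path described above.  this is not
the libc source: it is a portable stand-in that uses c11 atomics and
posix semaphores in place of the plan 9 primitives, and the names
SemLock, semlockinit, semcanlock, semlock and semunlock are
hypothetical.  the point is only that a caller who finds the lock
already held sleeps on the semaphore until some unlock posts it; if
that unlock never comes, the caller never wakes.

```c
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <stdatomic.h>

/* hypothetical stand-in for a semaphore-based lock; field names are assumptions */
typedef struct SemLock SemLock;
struct SemLock {
	atomic_int key;	/* holder plus waiters; 0 means free */
	sem_t sem;	/* contended callers sleep here */
};

void
semlockinit(SemLock *l)
{
	atomic_init(&l->key, 0);
	sem_init(&l->sem, 0, 0);
}

/* take the lock only if it is free; returns 1 on success, 0 otherwise */
int
semcanlock(SemLock *l)
{
	int expect = 0;

	return atomic_compare_exchange_strong(&l->key, &expect, 1);
}

void
semlock(SemLock *l)
{
	/* 0 -> 1: uncontended, we own the lock */
	if(atomic_fetch_add(&l->key, 1) == 0)
		return;
	/*
	 * contended: sleep until an unlock posts the semaphore.
	 * if nobody ever unlocks, this waits forever.
	 */
	while(sem_wait(&l->sem) < 0 && errno == EINTR)
		;
}

void
semunlock(SemLock *l)
{
	int old;

	old = atomic_fetch_sub(&l->key, 1);
	assert(old > 0);	/* unlocking a free lock is a bug */
	/* 1 -> 0: nobody is waiting */
	if(old == 1)
		return;
	/* hand the lock to one sleeper */
	sem_post(&l->sem);
}
```

in this sketch a process that calls semlock twice on the same lock
without an intervening semunlock parks itself in sem_wait, which is
one way a process could end up sitting in the semaphore wait
indefinitely.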

so there are just a few possibilities:
1.  the code was always broken, and the old locking scheme
got lucky every time.  (i don't think this is likely.)
2.  there's a bug in the implementation of lock.
3.  an architecture-specific bug has been introduced in locking.

i haven't been using the semaphore-based locks because they are slow.
this is because wakeup() takes about 100-1000x as long as sleep(0),
which is just sched(), and this is hard to make up without doing some
hard thinking that hasn't been done yet.  even better schedulers don't
fully fix this.
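
for contrast, a minimal sketch of a lock whose contended path only
yields the processor, which is the role sleep(0) plays above.  this
is an illustration under assumptions, not the old libc code:
sched_yield() stands in for sleep(0), and YieldLock, yieldlockinit,
yieldcanlock, yieldlock and yieldunlock are hypothetical names.

```c
#include <sched.h>
#include <stdatomic.h>

/* hypothetical test-and-set lock; contended callers yield and retry */
typedef struct YieldLock YieldLock;
struct YieldLock {
	atomic_flag held;
};

void
yieldlockinit(YieldLock *l)
{
	atomic_flag_clear(&l->held);
}

/* take the lock only if it is free; returns 1 on success, 0 otherwise */
int
yieldcanlock(YieldLock *l)
{
	return !atomic_flag_test_and_set(&l->held);
}

void
yieldlock(YieldLock *l)
{
	/* no wakeup is needed: the waiter just gives up the cpu and retries */
	while(atomic_flag_test_and_set(&l->held))
		sched_yield();
}

void
yieldunlock(YieldLock *l)
{
	/* release is a plain store; there is no sleeper to wake */
	atomic_flag_clear(&l->held);
}
```

the unlock side never has to wake anyone, which is why the
semaphore-based version has extra cost to make up on every contended
release.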

but still, were i a betting man, my money would be on door #3.

- erik