From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 9 May 1997 02:53:25 +1000
From: David Hogan <dhog@lore.plan9.cs.su.oz.au>
Subject: calling sleep() while holding lock()
Topicbox-Message-UUID: 5a1d34b0-eac8-11e9-9e20-41e7f4b1d025
Message-ID: <19970508165325.Nq-6UWsyjVLbhs8Z8BK3WMm4fTX3faSlfbiApFof9kw@z>

David Butler wrote:

> It would seem to me that a process should not call sleep() holding
> a spinlock, even though that seems to be happening.

This seems reasonable to me.

> I changed taslock to increment and decrement the hasspin flag instead of
> just setting it and clearing it.

It is reasonable to have many locks.

> (There is also a problem with the lock being dropped before the hasspin
> was modified that I fixed. I also temporarily removed the hasspin clear
> from clock.c)

> I then added a print in sleep to print the pid and hasspin counter if
> hasspin > 0. It happens a lot and pretty early in the boot phase.

Good coding. So now the question remains: why is this behaviour occurring?

One possibility is that we take a fault while holding the lock, and we then have to sleep until the memory gets paged in. Alberto Nava found a place in the kernel where this is happening, and I'm sure there must be others.

It's not good to take a fault while holding a spinlock; at a minimum, there will be a loss of efficiency. In the worst case, the kernel may deadlock or panic. Code which allows this to happen should be tracked down, and changed either to use a local buffer in the critical section, or else to verify that the memory is writable first...

You should add another print to the fault handler, so that you can see which of the sleeps are caused by faults, and which aren't. You might want to record the caller PC of the most recent spinlock, and print that as well. This will enable locating which parts of the kernel are behaving this way. When you've got a list of PC values, use acid to find the file & line number for each, and post them!
I for one would be interested in this data.

> I'm doing this trying to find the cause of my earlier message about
> checksum errors on the ethernet. I am looking for places where spinlocks
> are being held for long times and next where interrupts are masked
> too long.

I've noticed that the Plan 9 kernel does go through some quite long periods at high IPL. During these, it is possible to lose serial characters at a mere 9600 baud :-( Any insight into why this happens would be appreciated.

I was going to add some code to the kernel to keep a journaling buffer of (PC, microsecond) pairs recorded at strategic points in the kernel (such as every call to splhi & co), but I never got around to it. I may yet do this, now that I have a decent machine at home and spare time on weekends...

> Before I go much further, I wanted to check on this behavior.

The less time spent holding spinlocks, the better. Your mission, if you choose to accept it, is to obtain the release of the lost CPU cycles. If you are caught, Dennis will disavow all knowledge of your actions.

> Thanks for any info.

You're welcome.