9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] strange behaviour of ps under load
From: gdiaz @ 2009-07-20  6:59 UTC
  To: 9fans

hello

Today I found the 9grid Plan 9 machine under heavy load: stats reports load ~2000, syscall ~60000, context ~22000. I was trying to find out which proc has gone crazy, but I can't even complete a ps. Other operations work, such as sending this email over drawterm, running stats and netstat, reading the logs, etc., but I can't run ps or any other /proc-related tool, and I can't kill/Kill/slay anything.

I can ls /proc

cpu% ls -l | wc -l
    573

something like

cpu% for(i in `{ls}) {echo -n 'PID ' $i 'has status. . . '; cat $i/status  | wc -c }
[....]
PID  1944693 has status. . .     176
PID  1944698 has status. . .     176
PID  1944699 has status. . .     176
PID  1944700 has status. . .     176
PID  1944707 has status. . .

and there it stops, so I can't find out which process that is, nor kill it.
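
The same scan can be given a per-file timeout, so that one read that never returns does not stall the whole loop. What follows is only a sketch under assumptions: it is written in Python rather than rc, it assumes a /proc-style tree with one numeric directory per process, and the names `probe_status` and `scan` are made up for illustration.

```python
import os
import threading


def probe_status(path, timeout=2.0):
    """Try to read `path` in a helper thread.

    Returns ("ok", size_in_bytes), ("error", message) or ("timeout", None).
    """
    result = {}

    def reader():
        try:
            with open(path, "rb") as f:
                result["value"] = ("ok", len(f.read()))
        except OSError as e:
            result["value"] = ("error", str(e))

    # Daemon thread: if the open or read never returns, the thread is
    # simply abandoned and does not keep the program alive at exit.
    t = threading.Thread(target=reader, daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive():
        return ("timeout", None)
    return result["value"]


def scan(procdir="/proc", timeout=2.0):
    """Print one line per pid; return the pids whose status read timed out."""
    pids = sorted((n for n in os.listdir(procdir) if n.isdigit()), key=int)
    stuck = []
    for pid in pids:
        state, detail = probe_status(os.path.join(procdir, pid, "status"), timeout)
        print("PID", pid, "has status...", state, detail if detail is not None else "")
        if state == "timeout":
            stuck.append(pid)
    return stuck
```

Calling `scan()` would list every pid and return the ones that timed out. The helper thread for a stuck pid stays blocked, so this only identifies the process; it does not free it.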

I can ls it:
cpu% ls -l /proc/1944707/
--rw-rw---- p 0 offending_user bootes   0 Dec  1  2008 /proc/1944707/args
--rw-r----- p 0 offending_user bootes   0 Dec  1  2008 /proc/1944707/ctl
--r--r--r-- p 0 offending_user bootes   0 Dec  1  2008 /proc/1944707/fd
--rw-r----- p 0 offending_user bootes 108 Dec  1  2008 /proc/1944707/fpregs
--r--r----- p 0 offending_user bootes  76 Dec  1  2008 /proc/1944707/kregs
--rw-r----- p 0 offending_user bootes   0 Dec  1  2008 /proc/1944707/mem
--rw-r----- p 0 offending_user bootes   0 Dec  1  2008 /proc/1944707/note
--rw-rw-r-- p 0 offending_user bootes   0 Dec  1  2008 /proc/1944707/noteid
--rw-r----- p 0 offending_user bootes   0 Dec  1  2008 /proc/1944707/notepg
--r--r--r-- p 0 offending_user bootes   0 Dec  1  2008 /proc/1944707/ns
--r--r----- p 0 offending_user bootes   0 Dec  1  2008 /proc/1944707/proc
--r--r----- p 0 offending_user bootes   0 Dec  1  2008 /proc/1944707/profile
--rw-r----- p 0 offending_user bootes  76 Dec  1  2008 /proc/1944707/regs
--r--r--r-- p 0 offending_user bootes   0 Dec  1  2008 /proc/1944707/segment
--r--r--r-- p 0 offending_user bootes 176 Dec  1  2008 /proc/1944707/status
--rw-r----- p 0 offending_user bootes   0 Dec  1  2008 /proc/1944707/text
--r--r----- p 0 offending_user bootes   0 Dec  1  2008 /proc/1944707/wait

I can't chmod those files either. (Is that date normal? Everything in /proc seems to have that date.)

any tip on how to solve this without rebooting?

thanks!!

gabi





* Re: [9fans] strange behaviour of ps under load
From: Sape Mullender @ 2009-07-20  9:23 UTC
  To: 9fans

Very odd. There are no blocking operations in reading the status file in /proc.
The only thing I can think of is that you have something bound (snapfs?) in /proc
and that you're hanging on a stale mount point.

Is the clock set properly on that machine?

	Sape






* Re: [9fans] strange behaviour of ps under load
From: erik quanstrom @ 2009-07-20 12:26 UTC
  To: 9fans

On Mon Jul 20 04:11:39 EDT 2009, gdiaz@9grid.es wrote:
> hello
>
> today i found 9grid plan9 under heavy load, stats reports load ~2000, syscall ~60000, context ~22000, i was trying to discover which proc has gone crazy, but i can't even complete a ps. I can do other operations, such as sending this email over drawterm, run stats, netstat, read the logs, etc. but i can't run ps, or any other /proc related tool, i can't kill/Kill/slay anything.
>
> I can ls /proc
>
> cpu% ls -l | wc -l
>     573

i've seen this problem before, without an unreasonable load.
it happened on a dual-processor machine.  since procopen()
has to take the p->debug qlock, it stands to reason that
somebody is holding p->debug and won't let go.
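
A toy model of that reasoning, as a sketch only. This is not the Plan 9 kernel code: `Proc` and `proc_open` are hypothetical names, a Python `threading.Lock` stands in for the p->debug qlock, and the `timeout` parameter is an addition so that the demonstration terminates instead of hanging.

```python
import threading


class Proc:
    """Toy stand-in for a process entry; `debug` models p->debug."""

    def __init__(self, pid):
        self.pid = pid
        self.debug = threading.Lock()
        self.status = "pid %d: Ready" % pid


def proc_open(p, timeout=None):
    """Model of an open that must take p.debug before doing anything else.

    With timeout=None the call blocks for as long as the lock is held,
    which is the hang described above. With a timeout it returns None
    if the lock could not be obtained in time.
    """
    if timeout is None:
        got = p.debug.acquire()
    else:
        got = p.debug.acquire(timeout=timeout)
    if not got:
        return None
    try:
        return p.status
    finally:
        p.debug.release()


if __name__ == "__main__":
    healthy, wedged = Proc(1), Proc(2)
    wedged.debug.acquire()  # some other code path took the lock and never released it
    print(proc_open(healthy, timeout=0.5))  # prints "pid 1: Ready"
    print(proc_open(wedged, timeout=0.5))   # prints "None" after the timeout
```

Only the process whose lock is held is affected, which matches a scan that reads every other status file and then stops at one pid.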

this is an old problem.  here are russ' thoughts
from 7 years ago.

http://9fans.net/archive/2002/02/350
http://9fans.net/archive/2002/02/360

- erik




* Re: [9fans] strange behaviour of ps under load
From: erik quanstrom @ 2009-07-22 15:53 UTC
  To: 9fans

On Wed Jul 22 11:49:54 EDT 2009, sape@plan9.bell-labs.com wrote:
> Very odd. There are no blocking operations in reading the status file in /proc.
> The only thing I can think of is that you have something bound (snapfs?) in /proc
> and that you're hanging on a stale mount point.
>
> Is the clock set properly on that machine?

but there is a qlock of p->debug in procopen:
/n/sources/plan9/sys/src/9/port/devproc.c:368
	qlock(&p->debug);

- erik




* Re: [9fans] strange behaviour of ps under load
From: gdiaz @ 2009-07-22 16:01 UTC
  To: 9fans

hello

I already rebooted the machine, so I can't check this on a live system anymore, but as you suspected, the clock was 10 minutes out of sync. I have enabled timesync again (/proc still shows the same incorrect date).

Also, I took a snap of a broken proc the day before I found the machine in that state, but at that time the system was behaving properly and I can't remember anything unusual. If it happens again, I'll post to the list.

Also, something important I forgot to mention: this is a VMware machine. vmwarefs isn't running.

Maybe next time I will have enough time to try acid on the kernel. . . .

thanks

gabi


