From mboxrd@z Thu Jan  1 00:00:00 1970
From: bakul@bitblocks.com (Bakul Shah)
Date: Fri, 01 Dec 2017 17:40:32 -0800
Subject: [TUHS] signals and blocked in I/O
In-Reply-To: Your message of "Fri, 01 Dec 2017 16:48:50 -0800."
 <20171202004850.GB24335@mcvoy.com>
References: <CAEoi9W5mFaQjuGAqSQZSusuyj2uyUX4PvsKVq8bOuM5219E1tQ@mail.gmail.com>
 <CAC20D2NcnW1YHdd5+3Y8yQqghGeLMvy8urDV7R5GevcLpizLew@mail.gmail.com>
 <20171201161810.GM3924@mcvoy.com>
 <CANCZdfpYWbLdkx4fNURFnLM5VGt+qDcnnMyHQ+M4iwAziS2a9A@mail.gmail.com>
 <20171201172603.GO3924@mcvoy.com>
 <BADECBD9-D2B4-4788-8308-87EBF93D555A@bitblocks.com>
 <20171201223859.GX3924@mcvoy.com>
 <20171201230302.0DC351FA41@orac.inputplus.co.uk>
 <20171201230934.GA24335@mcvoy.com>
 <20171201234230.F33D4156E523@mail.bitblocks.com>
 <20171202004850.GB24335@mcvoy.com>
Message-ID: <20171202014047.E5E0C156E523@mail.bitblocks.com>

On Fri, 01 Dec 2017 16:48:50 -0800 Larry McVoy <lm at mcvoy.com> wrote:
Larry McVoy writes:
> On Fri, Dec 01, 2017 at 03:42:15PM -0800, Bakul Shah wrote:
> > On Fri, 01 Dec 2017 15:09:34 -0800 Larry McVoy <lm at mcvoy.com> wrote:
> > Larry McVoy writes:
> > > On Fri, Dec 01, 2017 at 11:03:02PM +0000, Ralph Corderoy wrote:
> > > > Hi Larry,
> > > > 
> > > > > > So OOM code kills a (random) process in hopes of freeing up some
> > > > > > pages but if this process is stuck in diskIO, nothing can be freed
> > > > > > and everything grinds to a halt.
> > > > >
> > > > > Yep, exactly.
> > > > 
> > > > Is that because the pages have been dirty for so long they've reached
> > > > the VM-writeback timeout even though there's no pressure to use them fo
> r
> > > > something else?  Or has that been lengthened because you don't fear
> > > > power loss wiping volatile RAM?
> > > 
> > > I'm tinkering with the pageout daemon so I'm trying to apply memory
> > > pressure.  I have 10 25GB processes (25GB malloced) and the processes jus
> t
> > > walk the memory over and over.  This is on a 256GB main memory machine
> > > (2 socket haswell, 28 cpus, 28 1TB SSDs, on loan from Netflix).
> > 
> > How many times do processes walk their memory before this condition
> > occurs? 
> 
> Until free memory goes to ~0.  That's the point, I'm trying to 
> improve things when there is too much pressure on memory.

You said 10x25GB but you have 256GB. So there is still
6GB left...

> 
> > So what may be happening is that a process references a page,
> > it page faults, the kernel finds its phys page has been paged
> > out, so it looks for a free page and once a free page is
> > found, the process will block on page in. Or if there is no
> > free page, it has to wait until some other dirty page is paged
> > out (but this would be a different wait queue).  As more and
> > more processes do this, the system runs out of all free pages.
> 
> Yeah.
> 
> > Can you find out how many processes are waiting under what
> > conditions, how long they wait and how these queue lengths are
> > changing over time?  
> 
> So I have 10 processes, they all run until the system starts to
> thrash, then they are all in wait mode for memory but there isn't
> any (and there is no swap configured).
> 
> The fundamental problem is that they are sleeping waiting for memory to
> be freed.  They are NOT in I/O mode, there is no DMA happening, this is
> main memory, it is not backed by swap, there is no swap.  So they are
> sleeping waiting for the pageout daemon to free some memory.  It's not
> going to free their memory because there is no place to stash (no swap).
> So it's trying to free other memory.

This confuses me. Before I make more false assumptions,
can you show the code?

> The real question is where did they go to sleep and why did they sleep
> without PCATCH on?  If I can find that place where they are trying to
> alloc a page and failed and they go to sleep there, I could either

Can you use kgdb to find out where they sleep?

> a) commit seppuku because we are out of memory and I'm part of the problem
> b) go into a sleep / wakeup / check signals loop
> 
> I am reminded by you all that we ask the process to do it to itself but
> there does seem to be a way to sleep and respect signals, the tty stuff
> does that.  So if I can find this place, determine that I'm just asking
> for memory, not I/O, and sleep with PCATCH on then I might be golden.

Won't kgdb tell you? Or you can insert printfs. You should also print
something in your program as it walks a few pages and see if this happens
at almost the same time (when all pages are used up).

tty code probably assumes memory shortfall is a short term issue
(which is likely). Not the case with your test (unless I misguessed).

> Where "golden" means I can kill the process and the OOM thread could do
> it for me.
> 
> Thoughts?

Now I am starting to think this happens as soon as all the
phys pages are used up. But looking at your program would
help.

[removd cc: TUHS]