From mboxrd@z Thu Jan 1 00:00:00 1970
From: erik quanstrom
Date: Thu, 19 Jun 2014 12:01:10 -0400
To: 9fans@9fans.net
Message-ID: <399b6c5c08b7aeab53f52ce0414714bf@brasstown.quanstro.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: [9fans] cache lines, and 60000 cycles of doom
Topicbox-Message-UUID: fb7a1034-ead8-11e9-9d60-3106f5b1d025

i'm seeing some mighty interesting timing on my intel ivy bridge.
i found a bug in the file server aoe implementation (can't happen
if you're using the uniprocessor x86 version) that happens because
the Srb is freed before wakeup completes.  to solve this, there is
some code that sets the state (this is from ken's ancient scheduler,
by way of sape)

	wakeup(&srb);
	srb->state = Free;

the code that receives it looks like this

	sleep(&srb, srbdone, srb);
	cycles(&t0);
	for(n = 0; srb->state != Free; n++){
		if(srb->wmach == m->machno)
			sched();
		else
			monmwait(&srb->state, Alloc);
	}
	cycles(&t1);
	free(srb);

the astounding thing is that t1-t0 is often ~ 60,000 cycles.
that worst case hits only a small fraction of the time, and the
average is much lower.  but that just blows the mind.  60000 cycles!

(other versions with sched were much worse.)

as far as i can tell, there are no funny bits in the scheduler that
can cause this, and no weird scheduling is going on.

i'm baffled.

- erik

From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <399b6c5c08b7aeab53f52ce0414714bf@brasstown.quanstro.net>
References: <399b6c5c08b7aeab53f52ce0414714bf@brasstown.quanstro.net>
Date: Fri, 20 Jun 2014 01:03:05 -0400
Message-ID:
From: "Devon H. O'Dell"
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: text/plain; charset=UTF-8
Subject: Re: [9fans] cache lines, and 60000 cycles of doom
Topicbox-Message-UUID: fb8e6f84-ead8-11e9-9d60-3106f5b1d025

Weird. I assume cycles is using rdtsc or rdtscp. Perhaps some of it is due
to a combination of contention and rdtsc(p) being serializing instructions?

On Jun 19, 2014 12:04 PM, "erik quanstrom" wrote:

> i'm seeing some mighty interesting timing on my intel ivy bridge.
> i found a bug in the file server aoe implementation (can't happen
> if you're using the uniprocessor x86 version) that happens because
> the Srb is freed before wakeup completes.  to solve this, there is
> some code that sets the state (this is from ken's ancient scheduler,
> by way of sape)
>
>	wakeup(&srb);
>	srb->state = Free;
>
> the code that receives it looks like this
>
>	sleep(&srb, srbdone, srb);
>	cycles(&t0);
>	for(n = 0; srb->state != Free; n++){
>		if(srb->wmach == m->machno)
>			sched();
>		else
>			monmwait(&srb->state, Alloc);
>	}
>	cycles(&t1);
>	free(srb);
>
> the astounding thing is that t1-t0 is often ~ 60,000 cycles.
> that worst case hits only a small fraction of the time, and the
> average is much lower.  but that just blows the mind.  60000 cycles!
>
> (other versions with sched were much worse.)
>
> as far as i can tell, there are no funny bits in the scheduler that
> can cause this, and no weird scheduling is going on.
>
> i'm baffled.
>
> - erik
>
>
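
To be concrete about what I mean by cycles() using rdtsc or rdtscp,
here's a sketch in gcc-style inline asm. Purely illustrative: I don't
know what the fs kernel's cycles() actually does, and the helper names
here are mine, not Plan 9's.

	#include <stdio.h>
	#include <stdint.h>

	/* read the tsc directly; nothing stops the cpu from issuing
	   this before earlier instructions have retired */
	static inline uint64_t
	rdtsc(void)
	{
		uint32_t lo, hi;

		__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
		return (uint64_t)hi<<32 | lo;
	}

	/* rdtscp waits for prior instructions to finish and also hands
	   back IA32_TSC_AUX in *aux (on many systems the core id), so
	   you can notice if you migrated between samples */
	static inline uint64_t
	rdtscp(uint32_t *aux)
	{
		uint32_t lo, hi;

		__asm__ __volatile__("rdtscp" : "=a"(lo), "=d"(hi), "=c"(*aux));
		return (uint64_t)hi<<32 | lo;
	}

	int
	main(void)
	{
		uint32_t aux0, aux1;
		uint64_t t0, t1;

		t0 = rdtscp(&aux0);
		/* the work being timed goes here */
		t1 = rdtscp(&aux1);
		printf("%llu cycles%s\n", (unsigned long long)(t1 - t0),
			aux0 != aux1 ? " (core changed; measurement suspect)" : "");
		return 0;
	}

The point being: if cycles() is the bare rdtsc, the read itself is free
to move around relative to the loop, while rdtscp at least waits for
earlier instructions to finish.
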
From mboxrd@z Thu Jan 1 00:00:00 1970
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
In-reply-to: Your message of "Thu, 19 Jun 2014 12:01:10 EDT."
	<399b6c5c08b7aeab53f52ce0414714bf@brasstown.quanstro.net>
References: <399b6c5c08b7aeab53f52ce0414714bf@brasstown.quanstro.net>
Date: Thu, 19 Jun 2014 22:09:45 -0700
From: Bakul Shah
Message-Id: <20140620050945.6C4E7B827@mail.bitblocks.com>
Subject: Re: [9fans] cache lines, and 60000 cycles of doom
Topicbox-Message-UUID: fb925932-ead8-11e9-9d60-3106f5b1d025

On Thu, 19 Jun 2014 12:01:10 EDT erik quanstrom wrote:
> i'm seeing some mighty interesting timing on my intel ivy bridge.
> i found a bug in the file server aoe implementation (can't happen
> if you're using the uniprocessor x86 version) that happens because
> the Srb is freed before wakeup completes.  to solve this, there is
> some code that sets the state (this is from ken's ancient scheduler,
> by way of sape)
>
>	wakeup(&srb);
>	srb->state = Free;
>
> the code that receives it looks like this
>
>	sleep(&srb, srbdone, srb);
>	cycles(&t0);
>	for(n = 0; srb->state != Free; n++){
>		if(srb->wmach == m->machno)
>			sched();
>		else
>			monmwait(&srb->state, Alloc);
>	}
>	cycles(&t1);
>	free(srb);
>
> the astounding thing is that t1-t0 is often ~ 60,000 cycles.
> that worst case hits only a small fraction of the time, and the
> average is much lower.  but that just blows the mind.  60000 cycles!
>
> (other versions with sched were much worse.)
>
> as far as i can tell, there are no funny bits in the scheduler that
> can cause this, and no weird scheduling is going on.
>
> i'm baffled.

Could there've been a context switch?

From mboxrd@z Thu Jan 1 00:00:00 1970
From: erik quanstrom
Date: Fri, 20 Jun 2014 07:45:30 -0400
To: 9fans@9fans.net
Message-ID: <4a81bb5db0670710c6126ad8f7f22d4b@brasstown.quanstro.net>
In-Reply-To: <20140620050945.6C4E7B827@mail.bitblocks.com>
References: <399b6c5c08b7aeab53f52ce0414714bf@brasstown.quanstro.net> <20140620050945.6C4E7B827@mail.bitblocks.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: Re: [9fans] cache lines, and 60000 cycles of doom
Topicbox-Message-UUID: fc131b26-ead8-11e9-9d60-3106f5b1d025

> > the astounding thing is that t1-t0 is often ~ 60,000 cycles.
> > that worst case hits only a small fraction of the time, and the
> > average is much lower.  but that just blows the mind.  60000 cycles!
> >
> > (other versions with sched were much worse.)
> >
> > as far as i can tell, there are no funny bits in the scheduler that
> > can cause this, and no weird scheduling is going on.
> >
> > i'm baffled.
>
> Could there've been a context switch?

the file server does not have context switches.

- erik

From mboxrd@z Thu Jan 1 00:00:00 1970
From: erik quanstrom
Date: Fri, 20 Jun 2014 07:50:40 -0400
To: 9fans@9fans.net
Message-ID: <1b3bfeff9117336767b881178d1f8b6a@brasstown.quanstro.net>
In-Reply-To:
References: <399b6c5c08b7aeab53f52ce0414714bf@brasstown.quanstro.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: Re: [9fans] cache lines, and 60000 cycles of doom
Topicbox-Message-UUID: fc1e587e-ead8-11e9-9d60-3106f5b1d025

On Fri Jun 20 01:04:20 EDT 2014, devon.odell@gmail.com wrote:
> Weird. I assume cycles is using rdtsc or rdtscp. Perhaps some of it is due
> to a combination of contention and rdtsc(p) being serializing instructions?
> On Jun 19, 2014 12:04 PM, "erik quanstrom" wrote:

other than the code i posted, nobody else is touching the Srb,
and it's bigger than a cacheline.

why would serialization cause a big issue?

- erik

From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <1b3bfeff9117336767b881178d1f8b6a@brasstown.quanstro.net>
References: <399b6c5c08b7aeab53f52ce0414714bf@brasstown.quanstro.net> <1b3bfeff9117336767b881178d1f8b6a@brasstown.quanstro.net>
Date: Fri, 20 Jun 2014 08:47:15 -0400
Message-ID:
From: "Devon H. O'Dell"
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: text/plain; charset=UTF-8
Subject: Re: [9fans] cache lines, and 60000 cycles of doom
Topicbox-Message-UUID: fc291066-ead8-11e9-9d60-3106f5b1d025

2014-06-20 7:50 GMT-04:00 erik quanstrom :
> On Fri Jun 20 01:04:20 EDT 2014, devon.odell@gmail.com wrote:
>
>> Weird. I assume cycles is using rdtsc or rdtscp. Perhaps some of it is due
>> to a combination of contention and rdtsc(p) being serializing instructions?

I forget that rdtsc isn't serializing, and that one uses cpuid to get
that behavior.

>> On Jun 19, 2014 12:04 PM, "erik quanstrom" wrote:
>
> other than the code i posted, nobody else is touching the Srb,
> and it's bigger than a cacheline.
>
> why would serialization cause a big issue?

It disables out-of-order execution by the processor, so there's a
pipeline stall. There's overhead to calling the tsc instructions, but
not that much.

Does `srb->wmach != m->machno` imply that t0 and t1 could be run on
different CPUs? TSC is synchronized between cores (unless someone does
wrmsr), but if you bounce to another processor, there's no guarantee.
Perhaps the difference between when the CPUs came online was on the
order of 60k cycles. No clue how cheap sched() is these days.

I should probably start reading the code again before I reply to
these things. Sorry.

--dho

> - erik
>

From mboxrd@z Thu Jan 1 00:00:00 1970
From: erik quanstrom
Date: Fri, 20 Jun 2014 09:07:11 -0400
To: 9fans@9fans.net
Message-ID:
In-Reply-To:
References: <399b6c5c08b7aeab53f52ce0414714bf@brasstown.quanstro.net> <1b3bfeff9117336767b881178d1f8b6a@brasstown.quanstro.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: Re: [9fans] cache lines, and 60000 cycles of doom
Topicbox-Message-UUID: fc2ec3bc-ead8-11e9-9d60-3106f5b1d025

> It disables out-of-order execution by the processor, so there's a
> pipeline stall.

we know there's going to be a stall already, since we can't get the
cacheline we're looking for.

> There's overhead to calling the tsc instructions, but not that much.
>
> Does `srb->wmach != m->machno` imply that t0 and t1 could be run on
> different CPUs? TSC is synchronized between cores (unless someone does
> wrmsr), but if you bounce to another processor, there's no guarantee.
> Perhaps the difference between when the CPUs came online was on the
> order of 60k cycles. No clue how cheap sched() is these days.

srb->wmach is the wakeup mach; m->machno is us.

and i just realized that that deadlock prevention is not necessary: if
srb->wmach == m->machno, then wakeup has already cleared the hazard.
i've rewritten the function like so

	static void
	srbfree(Srb *srb)
	{
		while(monmwait(&srb->state, Alloc) == Alloc)
			{}
		mbfree(srb->msgbuf);
	}

by the way, to head off the next speculation, i tried wrapping the wakeup
in splhi(), but that made no difference.  the waker is not being scheduled.
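
if anybody wants to poke at the handshake without the file server,
here's a rough user-space model of it.  c11 atomics and pthreads stand
in for wakeup and monmwait; this is a sketch of the shape of the thing,
not the fs code.  compile with -pthread.

	#include <stdatomic.h>
	#include <stdlib.h>
	#include <pthread.h>

	enum { Alloc, Free };

	typedef struct Srb Srb;
	struct Srb {
		atomic_int	state;
		int		data;
	};

	/* waker: the store of Free must be its last touch of the srb */
	static void*
	waker(void *a)
	{
		Srb *srb = a;

		srb->data = 1;				/* stands in for the wakeup work */
		atomic_store(&srb->state, Free);	/* after this the srb belongs to the freer */
		return NULL;
	}

	/* freer: may not free the srb until it observes Free */
	static void*
	freer(void *a)
	{
		Srb *srb = a;

		while(atomic_load(&srb->state) != Free)
			;				/* monmwait() in the kernel; dumb spin here */
		free(srb);
		return NULL;
	}

	int
	main(void)
	{
		pthread_t w, f;
		Srb *srb;

		srb = malloc(sizeof *srb);
		atomic_init(&srb->state, Alloc);
		srb->data = 0;
		pthread_create(&w, NULL, waker, srb);
		pthread_create(&f, NULL, freer, srb);
		pthread_join(w, NULL);
		pthread_join(f, NULL);
		return 0;
	}

free the srb before the waker's store and you have exactly the
use-after-free i started this thread about.
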
by the way, this is ken's kernel on amd64: /n/atom/plan9/sys/src/fs
the development version has drifted a bit, but i can give you access
if you're interested.

- erik