From mboxrd@z Thu Jan 1 00:00:00 1970
Message-Id: <30A0D4B5-1AAB-4D95-9B9F-FD09CB796E6D@bitblocks.com>
From: Bakul Shah
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
In-Reply-To:
Content-Type: text/plain; charset=us-ascii; format=flowed; delsp=yes
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0 (iPhone Mail 7E18)
Date: Sat, 7 May 2011 12:33:50 -0700
References:
Subject: Re: [9fans] _xinc vs ainc
Topicbox-Message-UUID: e0c9cc22-ead6-11e9-9d60-3106f5b1d025

On May 7, 2011, at 6:05 AM, erik quanstrom wrote:

> i'm confused by the recent change to the thread library.
> the old code was simply a locked incl.  the new code
> does a locked compare-exchange /within a loop/ until it's seen that
> nobody else has updated the value at the same time, thus
> ensuring that the value has indeed been updated.
>
> since the expensive operation is the MESI(F) negotiation
> behind the scenes to get exclusive access to the cacheline,
> i don't understand what the motivation is for replacing _xinc
> with ainc, since ainc can loop on an expensive lock instruction.
>
> that is, i think the old version was wait free, and the new version
> is not.
>
> can someone explain what i'm missing here?
> thanks!
>
> - erik
>
> ----
>
> TEXT _xinc(SB), $0		/* void _xinc(long *); */
> 	MOVL	l+0(FP), AX
> 	LOCK
> 	INCL	0(AX)
> 	RET
>
> ----
>
> TEXT ainc(SB), $0		/* long ainc(long *); */
> 	MOVL	addr+0(FP), BX
> ainclp:
> 	MOVL	(BX), AX
> 	MOVL	AX, CX
> 	INCL	CX
> 	LOCK
> 	BYTE $0x0F; BYTE $0xB1; BYTE $0x0B	/* CMPXCHGL CX, (BX) */
> 	JNZ	ainclp
> 	MOVL	CX, AX
> 	RET

Just guessing: maybe the new code allows more concurrency? If the value is not in the processor cache, will the old code block other processors for much longer? The new code forces caching with the first read, so maybe there is a higher likelihood that the cmpxchg will finish quickly. I haven't studied x86 cache behavior, so this guess could be completely wrong. Suggest asking on comp.arch, where people like Andy Glew can give you a definitive answer.
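
(A minimal sketch, for illustration only: the same two shapes written in portable C with the GCC/Clang __atomic builtins rather than the Plan 9 assembler.  The names xinc_like and ainc_like are invented here; this is not the libthread code.  The first finishes in a single locked add per caller, so it is wait-free; the second may have to retry its compare-and-swap, so it is only lock-free.)

----

	#include <stdio.h>

	/* wait-free: one locked add per caller; returns nothing,
	 * mirroring void _xinc(long *) */
	static void
	xinc_like(long *p)
	{
		__atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST);
	}

	/* lock-free: read, increment a local copy, then try to publish
	 * it with compare-and-swap; retry if another processor changed
	 * *p in the meantime.  returns the incremented value, mirroring
	 * long ainc(long *) */
	static long
	ainc_like(long *p)
	{
		long old, new;

		do {
			old = __atomic_load_n(p, __ATOMIC_SEQ_CST);
			new = old + 1;
		} while (!__atomic_compare_exchange_n(p, &old, new, 0,
		    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));
		return new;
	}

	int
	main(void)
	{
		long n = 0;

		xinc_like(&n);			/* n == 1 */
		printf("%ld\n", ainc_like(&n));	/* prints 2 */
		return 0;
	}

----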