From: Pirmin Walthert
To: musl@lists.openwall.com
Date: Mon, 9 Mar 2020 19:14:59 +0100
Subject: Re: [musl] Re: FYI: some observations when testing next-gen malloc
Message-ID: <82b69741-72e6-ab53-c523-ce4e1e7dc98e@wwcom.ch>
In-Reply-To: <20200309171227.GY11469@brightrain.aerifal.cx>
References: <41ea935d-39e4-1460-e502-5c82d7dd6a4d@wwcom.ch> <20200309171227.GY11469@brightrain.aerifal.cx>

On 09.03.20 at 18:12, Rich Felker wrote:
> On Mon, Mar 09, 2020 at 05:49:02PM +0100, Pirmin Walthert wrote:
>> Dear Rich,
>>
>> First of all many thanks for your brilliant C library.
>>
>> As I do not know whether the musl mailing list is already the right
>> place to discuss the next-gen malloc module, I decided to send you
>> my observations directly.
> It is, so I'm cc'ing the list now.
>
>> I'd like to mention that I am not yet entirely sure whether the
>> following is a problem with the new malloc code or with asterisk
>> itself, but maybe you can already keep the following in the back of
>> your head if someone else reports similar behavior with a
>> different application:
>>
>> We use asterisk (16.7) in a musl libc based distribution and for
>> some operations asterisk forks (in a thread) the main process to
>> execute a system command. When using libmallocng.so (newest version
>> with "fix race condition in lock-free path of free" applied, but
>> it also happened without that change) some of these forked child
>> processes will hang during a call to pthread_mutex_unlock.
>>
>> Unfortunately the backtrace is not of much help I guess, but the
>> child process always seems to hang on pthread_mutex_unlock. So
>> something seems to happen with the mutex on fork:
>>
>> #0  0x00007f2152a20092 in pthread_mutex_unlock () from /lib/ld-musl-x86_64.so.1
>> No symbol table info available.
>> #1  0x0000000000000008 in ?? ()
>> No symbol table info available.
>> #2  0x0000000000000000 in ?? ()
>> No symbol table info available.
>>
>> I will for sure try to dig into this further. For the moment the
>> only thing I know is that I did not yet observe this on any of the
>> several hundred systems with musl 1.1.23 (same asterisk version),
>> nor on any of the around 5 with 1.2.0 (same asterisk version, old
>> malloc), but quite frequently on the two systems with 1.1.24 and
>> libmallocng.so.
> This is completely expected and should happen with old or new malloc.
> I'm surprised you haven't hit it before. After a multithreaded process
> calls fork, the child inherits a state where locks may be permanently
> held. See https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html
>
>  - A process shall be created with a single thread.
>    If a multi-threaded process calls fork(), the new process shall
>    contain a replica of the calling thread and its entire address
>    space, possibly including the states of mutexes and other
>    resources. Consequently, to avoid errors, the child process may
>    only execute async-signal-safe operations until such time as one
>    of the exec functions is called.
>
> It's not described very rigorously, but effectively it's in an async
> signal context and can only call functions which are AS-safe.
>
> A future version of the standard is expected to drop the requirement
> that fork itself be async-signal-safe, and may thereby add
> requirements to synchronize against some or all internal locks so that
> the child can inherit a working context. But the right solution here is
> always to stop using fork without exec.
>
> Rich

Well, I have now changed the code a bit to make sure that no
async-signal-unsafe function is called before execl. I removed the
calls to cap_from_text, cap_set_proc and cap_free, as well as
sched_setscheduler. Now the only thing executed before execl in the
child process is closefrom().

However, I got a hanging process again:

(gdb) bt full
#0  0x00007f42f649c6da in __syscall_cp_c () from /lib/ld-musl-x86_64.so.1
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Best regards,

Pirmin
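
For reference, here is a minimal sketch of the pattern discussed above: fork from a (possibly multithreaded) process and call only async-signal-safe functions in the child until exec. It is not Asterisk's code. The helper name run_command is hypothetical, and the plain close() loop is only an assumed stand-in for closefrom(), whose implementation is not shown in this thread. The upper bound on descriptors is an arbitrary cap chosen for the example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from Asterisk: run `path` with `argv`
 * and return the child's exit status, or -1 on error.
 * Between fork() and execv() the child calls only close(), execv() and
 * _exit(), all of which POSIX lists as async-signal-safe.
 */
int run_command(const char *path, char *const argv[])
{
    /* Query the descriptor limit in the parent, before fork(), so the
     * child has nothing to look up. The cap is an arbitrary assumption. */
    long max_fd = sysconf(_SC_OPEN_MAX);
    if (max_fd < 0 || max_fd > 65536)
        max_fd = 65536;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: no malloc, no stdio, no opendir, no mutexes. */
        for (int fd = 3; fd < max_fd; fd++)
            close(fd);
        execv(path, argv);
        _exit(127); /* reached only if execv() failed */
    }

    /* Parent: wait for the child and report its exit status. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pass a NULL-terminated argument vector, for example {"sh", "-c", "exit 0", NULL} with path "/bin/sh". Where no per-child setup is needed at all, posix_spawn() avoids running application code between fork and exec entirely.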