Dominique MARTINET wrote on Wed, Jan 25, 2023 at 03:48:37PM +0900:
> I'll add a circular buffer to log things like the active[0] at entry and
> its mask values, then set my board up to reproduce again, which will
> probably bring us to next Monday.

I've reproduced with that, and it seems to confirm that we entered
try_avail() with m->avail == 0 and the next element had freed == 0...

(Log format: '__func__ (__LINE__): ', m is printed with %p, masks with %x.
Line numbers have moved because of the debug statements. I've attached
both the patch and the full log to this mail for history, however ugly
the code is.)

In particular, m->next is logged as identical to m here, but when looking
in gdb "almost immediately" afterwards we can see that m->next isn't m
anymore:
----
alloc_slot (324): 0x2436f40: avail 0, freed 0
try_avail (145): new m: 0x2436f88, avail 3ffffffe, freed 0
try_avail (171): mask 0, mem active_idx: 29, m/m->next 0x2436f88/0x2436f88
try_avail (178): BUGGED
(gdb) p (*pm)
$6 = (struct meta *) 0x2436f88
(gdb) p (*pm)->next
$8 = (struct meta *) 0x2436ee0
----

This is on a single-core ARM board (i.MX6 ULL), so there should be no
room for cache problems, and there aren't any threads, but...
openrc handles SIGCHLD, and I just confirmed it calls free() in its
signal handler...

Since malloc/free aren't async-signal-safe, the handler can run in the
middle of a malloc() call and modify the allocator's metadata underneath
it. That explains everything we've seen, and it's a bug I can now fix in
openrc (it's also quite reassuring to confirm this isn't a musl bug).

Thank you for your help!
-- 
Dominique
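
P.S. For the archive, here is a minimal sketch of the usual way to avoid
this class of bug: the SIGCHLD handler only sets a flag, and the
free() calls happen later from normal (non-signal) context. This is not
the actual openrc fix, only an illustration; struct child, track_child(),
forget_child(), reap_children(), on_sigchld() and
install_sigchld_handler() are all hypothetical names made up for this
example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical bookkeeping for a supervised child (not openrc's own). */
struct child {
    pid_t pid;
    char *name;            /* heap-allocated copy */
    struct child *next;
};

static struct child *children;
static volatile sig_atomic_t got_sigchld;

/* Handler: only touches a sig_atomic_t flag. No malloc(), no free(). */
static void on_sigchld(int sig)
{
    (void)sig;
    got_sigchld = 1;
}

static int install_sigchld_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Normal context: record a child so it can be cleaned up later. */
static int track_child(pid_t pid, const char *name)
{
    assert(name != NULL);
    struct child *c = malloc(sizeof *c);
    if (!c)
        return -1;
    c->name = strdup(name);
    if (!c->name) {
        free(c);
        return -1;
    }
    c->pid = pid;
    c->next = children;
    children = c;
    return 0;
}

/* Normal context: unlink and free the record for pid, if any. */
static void forget_child(pid_t pid)
{
    for (struct child **pp = &children; *pp; pp = &(*pp)->next) {
        if ((*pp)->pid == pid) {
            struct child *c = *pp;
            *pp = c->next;
            free(c->name);
            free(c);
            return;
        }
    }
}

/* Called from the main loop. Returns the number of children reaped. */
static int reap_children(void)
{
    int reaped = 0;
    if (!got_sigchld)
        return 0;
    got_sigchld = 0;       /* clear first so a new SIGCHLD is not lost */
    for (;;) {
        pid_t pid = waitpid(-1, NULL, WNOHANG);
        if (pid <= 0)      /* no more exited children, or none at all */
            break;
        forget_child(pid); /* free() is safe here: not in a handler */
        reaped++;
    }
    return reaped;
}
```

The main loop would call reap_children() on each iteration. Since the
handler never enters the allocator, it can no longer interrupt malloc()
and change the metadata underneath it.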