On Sat, Mar 12, 2022 at 9:11 AM Gavin Howard wrote: > > This is why condition variables necessarily have an associated > > predicate (in your case, involving your flag and possibly other > > state). You can *never* just do pthread_cond_wait. The only correct > > usage is: > > > > while (!predicate(...)) pthread_cond_wait(mtx,cnd); > > > > Correct use of condition variables ensures that you see any relevant > > changes to the state the predicate is evaluated on, without requiring > > you to explicitly work out a way to do the same thing with semaphores > > or other lower-level primitives. It does not matter whatsoever whether > > there are already waiters when you call signal. If not, they'll > > necessarily see the state change *before* they wait. The state change > > cannot happen between evaluation of the predicate and the wait because > > the calling thread holds the mutex. > > Ah, I misunderstood. > > > > What this means is that if Main Thread #2 is blocked on waitpid(), then > > > if another thread creates a child and signals the condition variable, > > > then after Main Thread #2 returns from waitpid(), it will block on the > > ^^^^^^^^^^^^^^^^^^^^ > > > condition variable. If another thread creates another child, sure, it > > ^^^^^^^^^^^^^^^^^^ > > > > No it won't, because it evaluates the predicate that tells it there's > > something to do before calling wait. If you're not doing that, you're > > using cond vars in a fundamentally wrong way. > > I have now implemented the system using the mutex and changing the > semaphore to a condition variable with the flag and two other variables > for how many children there are versus how many children have been > reaped. We'll see if anything shows up while I run it over and over > again. > > > No, this is absolutely not what's happening. Neither the processor nor > > the compiler (in general the latter is at least as much of a concern > > if not more when you have missing/incorrect synchronization) can > > reorder things like that. > > Good to know, thank you. > > > On some level they're like any other external function call to a > > function whose definition the layer optimizing the caller can't see: > > they must be assumed to be able to store to or load from any memory > > whose address has been exposed, and thus cannot be reordered > > whatsoever. > > Okay, I wondered if that might be the case. > > > Your problem is not mysterious reordering. Your problem is in your > > program's logic somewhere. Please use the debugging tools at your > > disposal, especially strace which will probably reveal the problem to > > you right away (by letting you see the exact sequence of events that > > happened and trace through it to figure out where it mismatches your > > expectation). > > I did use strace. It revealed nothing out of the ordinary. That's why I > did not mention it, but I probably should have. I've also used GDB to > inspect the core dumps. I did try. > you might find https://clang.llvm.org/docs/ThreadSanitizer.html useful for bugs like this. > Perhaps while I'm learning and making a fool of myself, I'll mention my > problem with rwlocks. > > The relevant code is: > > ``` > do > { > bool rel = (strchr((char*) cmd->a, '/') != NULL); > > cont = false; > > // We only need to do something when the cmd is not a relative path. > if (!rel) > { > s = y_strucon_rwlock_rdlock(&r->env.lock); > if (y_err(s != y_STATUS_SUCCESS)) goto err; > > exists = y_map_existsStrings_v(&r->env.exec_map, (char*) cmd->a, > &res); > > // We have to hold the lock until we have copied the result > because it > // could be moved by a write to the map. > if (exists) > { > // Just move the value from res to cmd. I can do this because > // the string in res is heap allocated and is not affected by > // edits to the map. > cmd->len = res->len; > cmd->a = res->a; > > // Release the lock as soon as possible. > y_strucon_rwlock_rdunlock(&r->env.lock); > } > else > { > // Release the lock as soon as possible. > y_strucon_rwlock_rdunlock(&r->env.lock); > > > > s = y_strucon_rwlock_wrlock(&r->env.lock); > if (y_err(s != y_STATUS_SUCCESS)) > { > y_str_free(&str); > goto err; > } > > // Make sure someone didn't get there first. > if (!y_map_existsStrings(&r->env.exec_map, (char*) cmd->a)) > { > s = y_map_insertStrings(&r->env.exec_map, (char*) cmd->a, > (char*) str.a); > if (s == y_STATUS_ELEM_EXISTS) > { > y_panica("Element already exists"); > } > } > > y_strucon_rwlock_wrunlock(&r->env.lock); > > // Always free first. > y_str_free(&str); > y_stackpool_free(pool); > > if (y_err(s != y_STATUS_SUCCESS)) goto err; > > cont = true; > } > } > } > while (cont); > > err: > > ``` > > Besides initialization (before any other threads are created) and > destruction (after all other threads are joined), these are the only > references to the map in question. > > `y_strucon_rwlock_*` functions are just wrappers around POSIX rwlocks. > > In this case, I'm not doing anything fancy; it's just rwlocks. However, > that abort about the element already existing will be triggered > consistently within about 5 minutes, both on musl and glibc, so probably > PEBKAC. > > If I change the read lock near the beginning with a write lock, I still > get the same issue. However, if I change the rwlock for a mutex, and > update all locking and unlocking to match, I don't get the issue. > > In this case, the strace shows nothing out of the ordinary that I can > see. > > Yet I can't see how I am doing anything wrong. I've double checked that > y_map_existsStrings{,_v} do not edit the map at all. > > So where's my stupidity on this one? > > Gavin Howard >