On 23 April 2015 at 10:23, Laurent Bercot wrote:

> On 23/04/2015 06:24, Jean-Marc Pigeon wrote:
>
>> Think about this, you write an application working perfectly right,
>> but 1 in 1000000 you reach something not trapped by low level and
>> once in while the application (in production for month) just stop
>> to work because "unexpected" within musl...
>
>  And why do you think the problem exists in the first place ?
> Because other libcs were defensive and failed to fail early, so the
> bug was never discovered until now. Your application is not working
> perfectly right - it is buggy, and it *should* fail. musl is giving
> developers a gift that other libcs do not: it helps them debug.
>
>> (so someone will propose to set a cron to automatically restart this
>> unreliable daemon, hmmm...)
>
>  You want to be defensive, well, yeah, this is the place to be
> defensive. Until the bug is found and fixed, at least the daemon is
> kind of providing service.
>
>  Raphael says this behaviour is wrong for the same reason that
> silently failing is wrong, but I disagree. First, restarting crashing
> daemons is not silent at all, a crash is always a loud warning and
> can hardly be ignored; and second, restarting a process is not
> continuing it. A process can always be restarted from a clean state
> and work in a predictable way until it trips the bug again, whereas
> silently ignoring UB makes the process unpredictable for the rest of
> its lifetime.

Yes and no. Crashes are loud and noisy, and should immediately trigger
alerts, but without intimate knowledge of the application and the cause
of the fault, auto-restarting is risky. In my operational experience,
it's usually been a hack employed by incompetent sysadmins (no names, no
pack drill, but one large government dept comes to mind).

If you have knowledge of your daemon processes, then you could if:-

- you know they are idempotent or do not have persistent state (eg DNS
  caches)
- they're essential system services (definitions might vary, but I'd
  have ssh for geographically remote boxes here)

That said, stuff that has complex state really shouldn't be restarted
without *investigation* - message brokers, relational database titans,
cluster HA set ups, etc. The worst outage of my career was a terracotta
cluster that had suffered from a split brain. Restarting it naively
caused it to _delete_ the only remaining good state.

Is this your 'clean state' caveat above?

>> Far better to return "trouble" status, then it is to the application
>> to decide what must be done in context, as ignore, override, bypass,
>> crash, etc.
>
>  What "trouble" status do you return when a function dereferences a
> NULL pointer ? This is exactly what's happening here. Passing NULL
> to setenv is as incorrect as dereferencing NULL, and should result
> in the same behaviour.
>
>> A sensible policy in case of UB would be for such low level code to
>> swallow the problem, (protect the hardware and keep the program
>> running as much as possible).
>
>  The language you want is Javascript, not C.
>
>> As reported, the crashing application is hwclock, (util-linux-2.26),
>> this a kind of code in the field for a very very long time, so the
>> library (glibc and old libc) used for linux over the years defined an
>> expected behavior to this "UB".
>
>  And this is why musl is so much better. If glibc and uclibc devs
> hadn't been so complacent, the bug wouldn't have lived for so long.
>
>> Crashing is not an option for code pertaining to musl/libc layer.
>
>  It definitely is. You don't want your program to crash ? Don't
> invoke UB.
>  If you want to be "safe", you can ignore SIGSEGV at the start of
> all your applications - it will be the exact same thing as what you
> are asking.
> Your daemons will live longer, I guarantee it.
>
>> (:-} why bother to return an error, just crash for all
>> problems in open, close, write, etc. just bringing the crashing
>> concept to the extreme :-}).
>
>  Straw man. You know as well as we do the difference between a
> programming error and a run-time error.
>
>> My experience (for a long time now) about writing complex daemon
>> running for months/year, it is not that straightforward (may
>> be for a simple application it is)
>
>  And mine is that it is. We're evens, now please let's stop bringing
> up anecdotal evidence.
>
> --
>  Laurent