On 23 April 2015 at 10:23, Laurent Bercot <ska-dietlibc@skarnet.org> wrote:
On 23/04/2015 06:24, Jean-Marc Pigeon wrote:
Think about this: you write an application that works perfectly,
but 1 time in 1000000 you reach something not trapped by the low level,
and once in a while the application (in production for months) just stops
working because of something "unexpected" within musl...

 And why do you think the problem exists in the first place ?
Because other libcs were defensive and failed to fail early, so the
bug was never discovered until now. Your application is not working
perfectly right - it is buggy, and it *should* fail. musl is giving
developers a gift that other libcs do not: it helps them debug.


(so someone will propose to set a cron to automatically restart this
unreliable daemon, hmmm...)

 You want to be defensive, well, yeah, this is the place to be
defensive. Until the bug is found and fixed, at least the daemon is
kind of providing service.

 Raphael says this behaviour is wrong for the same reason that
silently failing is wrong, but I disagree. First, restarting crashing
daemons is not silent at all, a crash is always a loud warning and
can hardly be ignored; and second, restarting a process is not
continuing it. A process can always be restarted from a clean state
and work in a predictable way until it trips the bug again, whereas
silently ignoring UB makes the process unpredictable for the rest of
its lifetime.  
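
As a rough illustration of "restarting is not continuing", here is a minimal C sketch of a supervisor loop. The names `supervise`, `buggy_service` and `healthy_service` are hypothetical, and the restart limit is an assumption made for the example. Each run happens in a fresh child process, so nothing left over from a crashed run carries into the next one, and every crash is reported on stderr.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical service bodies, for illustration only. */
static int buggy_service(void)   { abort(); }   /* trips its bug on every run */
static int healthy_service(void) { return 0; }  /* exits normally */

/*
 * Run `service` in a fresh child process. If the child is killed by a
 * signal, report it loudly and start a new child, allowing at most
 * `max_restarts` restarts. Returns the number of crashes observed,
 * or -1 if fork() or waitpid() fails.
 */
static int supervise(int (*service)(void), int max_restarts)
{
    int crashes = 0;

    for (;;) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0)
            _exit(service());           /* child: clean state on every start */

        int status;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        if (!WIFSIGNALED(status))
            return crashes;             /* normal exit: nothing to restart */

        crashes++;
        fprintf(stderr, "service killed by signal %d (crash %d)\n",
                WTERMSIG(status), crashes);
        if (crashes > max_restarts)
            return crashes;             /* give up; a human should look */
    }
}
```

With these definitions, `supervise(buggy_service, 2)` runs the service three times (one start plus two restarts) and logs each crash, while `supervise(healthy_service, 3)` returns after a single clean run.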

Yes and no. Crashes are loud and noisy, and should immediately trigger alerts, but without intimate knowledge of the application and the cause of the fault, auto-restarting is risky. In my operational experience, it has usually been a hack employed by incompetent sysadmins (no names, no pack drill, but one large government dept comes to mind). If you know your daemon processes well, then auto-restarting can be acceptable if:
- you know they are idempotent or do not have persistent state (e.g. DNS caches)
- they're essential system services (definitions might vary, but I'd put ssh for geographically remote boxes here)
That said, stuff that has complex state really shouldn't be restarted without *investigation*: message brokers, relational database titans, cluster HA setups, etc. The worst outage of my career was a Terracotta cluster that had suffered a split brain. Restarting it naively caused it to _delete_ the only remaining good state. Is this your 'clean state' caveat above?


Far better to return a "trouble" status; then it is up to the application
to decide what must be done in context: ignore, override, bypass,
crash, etc.

 What "trouble" status do you return when a function dereferences a
NULL pointer ? This is exactly what's happening here. Passing NULL
to setenv is as incorrect as dereferencing NULL, and should result
in the same behaviour.
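
The fix therefore belongs on the calling side. A minimal C sketch of that, under stated assumptions: `set_or_unset_env` is a hypothetical helper, not part of any libc or of hwclock, and treating a NULL value as "unset the variable" is an illustrative policy choice, not something the standard defines.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical caller-side helper (not a libc function).
 * A NULL name is a programming error, so it fails early via assert().
 * A NULL value is mapped to unsetenv() by this helper's own convention,
 * so setenv() itself never sees a NULL argument.
 */
static int set_or_unset_env(const char *name, const char *value)
{
    assert(name != NULL);
    if (value == NULL)
        return unsetenv(name);
    return setenv(name, value, 1);      /* 1: overwrite an existing value */
}
```

A caller that may legitimately have no value to set would go through this helper instead of handing the NULL to setenv directly.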


A sensible policy in case of UB would be for such low-level code to
swallow the problem (protect the hardware and keep the program
running as much as possible).

 The language you want is JavaScript, not C.


As reported, the crashing application is hwclock (util-linux-2.26);
this is the kind of code that has been in the field for a very, very long
time, so the libraries (glibc and the old libc) used on Linux over the
years have defined an expected behavior for this "UB".

 And this is why musl is so much better. If glibc and uClibc devs
hadn't been so complacent, the bug wouldn't have lived for so long.


Crashing is not an option for code pertaining to musl/libc layer.

 It definitely is. You don't want your program to crash ? Don't
invoke UB.
 If you want to be "safe", you can ignore SIGSEGV at the start of
all your applications - it will be the exact same thing as what you
are asking. Your daemons will live longer, I guarantee it.


(:-} why bother to return an error, just crash for all
problems in open, close, write, etc. just bringing the crashing
concept to the extreme :-}).

 Straw man. You know as well as we do the difference between a
programming error and a run-time error.
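
A minimal C sketch of that distinction, assuming a hypothetical helper `first_byte` that exists only for this illustration: a missing or unreadable file is a run-time condition and is reported through the return value, while a NULL path is a caller bug and trips an assertion.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the first byte of a file (0..255),
 * or -1 if the file cannot be opened or read, or is empty.
 */
static int first_byte(const char *path)
{
    /* Programming error: a correct caller never passes NULL. Fail early. */
    assert(path != NULL);

    /* Run-time error: the file may legitimately be missing or unreadable. */
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    unsigned char c;
    ssize_t n = read(fd, &c, 1);
    close(fd);
    return n == 1 ? (int)c : -1;
}
```

Calling `first_byte` on a path that does not exist returns -1 and the caller decides what to do next; calling `first_byte(NULL)` aborts, because no return value can repair the caller's bug.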


My experience (over a long time now) of writing complex daemons
running for months or years is that it is not that straightforward
(maybe for a simple application it is).

 And mine is that it is. We're even, now please let's stop bringing
up anecdotal evidence.

--
 Laurent