On 23 April 2015 at 10:23, Laurent Bercot <ska-dietlibc@skarnet.org> wrote:

> On 23/04/2015 06:24, Jean-Marc Pigeon wrote:
>
>> Think about this: you write an application that works perfectly well,
>> but 1 time in 1000000 you reach something not trapped by the low level,
>> and once in a while the application (in production for months) just
>> stops working because of something "unexpected" within musl...
>>
>
>  And why do you think the problem exists in the first place ?
> Because other libcs were defensive and failed to fail early, so the
> bug was never discovered until now. Your application is not working
> perfectly right - it is buggy, and it *should* fail. musl is giving
> developers a gift that other libcs do not: it helps them debug.
>
>
>> (so someone will propose to set a cron to automatically restart this
>> unreliable daemon, hmmm...)
>>
>
>  You want to be defensive, well, yeah, this is the place to be
> defensive. Until the bug is found and fixed, at least the daemon is
> kind of providing service.
>
>  Raphael says this behaviour is wrong for the same reason that
> silently failing is wrong, but I disagree. First, restarting crashing
> daemons is not silent at all, a crash is always a loud warning and
> can hardly be ignored; and second, restarting a process is not
> continuing it. A process can always be restarted from a clean state
> and work in a predictable way until it trips the bug again, whereas
> silently ignoring UB makes the process unpredictable for the rest of
> its lifetime.


Yes and no. Crashes are loud and noisy, and should immediately trigger
alerts, but without intimate knowledge of the application and the cause of
the fault, auto-restarting is risky. In my operational experience, it's
usually been a hack employed by incompetent sysadmins (no names, no pack
drill, but one large government dept comes to mind). If you know your
daemon processes well, then you could auto-restart them if:
- you know they are idempotent or do not have persistent state (e.g. DNS
caches)
- they're essential system services (definitions might vary, but I'd have
ssh for geographically remote boxes here)
That said, anything with complex state really shouldn't be restarted
without *investigation*: message brokers, relational database titans,
cluster HA setups, etc. The worst outage of my career was a Terracotta
cluster that had suffered a split brain. Restarting it naively caused
it to _delete_ the only remaining good state. Is this your 'clean state'
caveat above?

>
>
>> Far better to return a "trouble" status; then it is up to the
>> application to decide what must be done in context: ignore, override,
>> bypass, crash, etc.
>>
>
>  What "trouble" status do you return when a function dereferences a
> NULL pointer ? This is exactly what's happening here. Passing NULL
> to setenv is as incorrect as dereferencing NULL, and should result
> in the same behaviour.
>
>
>> A sensible policy in case of UB would be for such low-level code to
>> swallow the problem (protect the hardware and keep the program
>> running as much as possible).
>>
>
>  The language you want is JavaScript, not C.
>
>
>> As reported, the crashing application is hwclock (util-linux-2.26).
>> This is the kind of code that has been in the field for a very, very
>> long time, so the libraries (glibc and old libc) used for Linux over
>> the years defined an expected behavior for this "UB".
>>
>
>  And this is why musl is so much better. If glibc and uclibc devs
> hadn't been so complacent, the bug wouldn't have lived for so long.
>
>
>> Crashing is not an option for code pertaining to the musl/libc layer.
>>
>
>  It definitely is. You don't want your program to crash ? Don't
> invoke UB.
>  If you want to be "safe", you can ignore SIGSEGV at the start of
> all your applications - it will be the exact same thing as what you
> are asking. Your daemons will live longer, I guarantee it.
>
>
>> (:-} why bother to return an error, just crash for all
>> problems in open, close, write, etc. just bringing the crashing
>> concept to the extreme :-}).
>>
>
>  Straw man. You know as well as we do the difference between a
> programming error and a run-time error.
>
>
>> My experience (for a long time now) of writing complex daemons
>> running for months/years is that it is not that straightforward
>> (maybe for a simple application it is)
>>
>
>  And mine is that it is. We're even; now please let's stop bringing
> up anecdotal evidence.
>
> --
>  Laurent
>
>