From mboxrd@z Thu Jan 1 00:00:00 1970
From: Alon Zakai
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Bug report on iswalpha
Date: Tue, 5 Aug 2014 14:10:25 -0700
References: <20140805210238.GL1674@brightrain.aerifal.cx>
In-Reply-To: <20140805210238.GL1674@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

I see what you mean, yes, this does seem like undefined behavior then, as
it's invalid in that locale. Thanks for the quick response!

And thanks for musl in general! We are very happy with it in the emscripten
project.

- Alon

On Tue, Aug 5, 2014 at 2:02 PM, Rich Felker <dalias@libc.org> wrote:

> On Tue, Aug 05, 2014 at 01:35:27PM -0700, Alon Zakai wrote:
> > I think we have encountered a bug in iswalpha, as shown by the following
> > program:
>
> At least an inconsistency with glibc. Not necessarily a bug.
>
> > ====
> > #include <locale.h>
> > #include <stdio.h>
> > #include <wctype.h>
> >
> > int
> > main(const int argc, const char * const * const argv)
> > {
> >   const char * const locale = (argc > 1 ? argv[1] : "C");
> >   const char * const actual = setlocale(LC_ALL, locale);
> >   if(actual == NULL) {
> >     printf("%s locale not supported; skipped locale-dependent code\n",
> >            locale);
> >     return 0;
> >   }
> >   printf("locale set to %s: %s\n", locale, actual);
> >
> >   const int result = iswalpha(0xf4); // ô
> >   printf("iswalpha(\"\xc3\xb4\") = %d\n", result);
> >   return 0;
> > }
> > ====
> >
> > It returns 1 in the final printf, saying that that char is an walpha
> > char, when I believe it is not. For comparison, glibc reports 0.
> >
> > Tested on musl 1.0.3 (used in emscripten) and musl trunk on git, same
> > result.
>
> Expecting iswalpha(0xf4) to return 0 in the C locale is wrong, since
> 0xf4 has not been established to be a valid wchar_t value in the current
> locale, and the behavior of iswalpha is _undefined_ unless the
> argument is either WEOF or a valid wchar_t in the current locale.
>
> As documented, musl's C locale contains all of Unicode, and
> additionally classifies all Unicode characters into the C classes like
> "alpha", etc. based on their Unicode identities. This behavior is
> definitely conforming to the requirements of ISO C and likely (though
> the specification is not entirely clear) conforming to the current
> requirements of POSIX, but is expected to be forbidden in future
> issues of POSIX.
>
> This is actually a topic of current discussion and possible change
> (depending on what happens in POSIX), but I don't think the behavior
> of iswalpha is likely to change in any case. If the C locale in musl
> is changed not to include all of Unicode, then iswalpha(0xf4) would
> just be undefined behavior in the C locale, and there would be no
> reason to make it check the locale and return false. If the above code
> is part of a test, I think it's an invalid test. With a better idea of
> what it's trying to test, I could possibly suggest a fix that avoids
> the UB.
>
> Rich
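
For anyone adapting the test: one possible way to avoid the undefined
behavior described above is to never hand iswalpha a raw code point, and
instead decode the multibyte form with mbrtowc in the current locale first,
calling iswalpha only if that succeeds. This is only a sketch under that
assumption, not something proposed in the thread; the helper name
mb_is_alpha and its -1/0/1 return convention are made up for illustration.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (not from the thread).
 * Returns  1 if the first multibyte character of mb is alphabetic,
 *          0 if it decodes but is not alphabetic,
 *         -1 if mb does not start with a complete, valid multibyte
 *            character in the current locale; iswalpha is not called
 *            in that case, so no undefined behavior is triggered. */
int mb_is_alpha(const char *mb)
{
    assert(mb != NULL);

    mbstate_t st;
    memset(&st, 0, sizeof st);      /* start in the initial shift state */

    wchar_t wc;
    size_t n = mbrtowc(&wc, mb, strlen(mb), &st);
    if (n == (size_t)-1 || n == (size_t)-2)
        return -1;                  /* invalid or incomplete sequence */

    /* wc came out of mbrtowc, so it is a valid wchar_t in this locale. */
    return iswalpha((wint_t)wc) ? 1 : 0;
}
```

A test could then call setlocale(LC_ALL, locale) as in the original program
and check mb_is_alpha("\xc3\xb4"), treating -1 as "not a character in this
locale" rather than expecting a particular iswalpha result. What mbrtowc does
with bytes above 0x7f in the C locale differs between libc implementations and
versions, so the sketch makes no claim about the value returned there.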