From mboxrd@z Thu Jan 1 00:00:00 1970 X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/13229 Path: news.gmane.org!.POSTED!not-for-mail From: Rich Felker Newsgroups: gmane.linux.lib.musl.general Subject: Re: string-backed FILEs mess Date: Wed, 12 Sep 2018 11:43:06 -0400 Message-ID: <20180912154306.GW1878@brightrain.aerifal.cx> References: <20180912140239.GV1878@brightrain.aerifal.cx> <20180912150941.GB13976@voyager> Reply-To: musl@lists.openwall.com NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Trace: blaine.gmane.org 1536766877 17439 195.159.176.226 (12 Sep 2018 15:41:17 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Wed, 12 Sep 2018 15:41:17 +0000 (UTC) User-Agent: Mutt/1.5.21 (2010-09-15) To: musl@lists.openwall.com Original-X-From: musl-return-13245-gllmg-musl=m.gmane.org@lists.openwall.com Wed Sep 12 17:41:13 2018 Return-path: Envelope-to: gllmg-musl@m.gmane.org Original-Received: from mother.openwall.net ([195.42.179.200]) by blaine.gmane.org with smtp (Exim 4.84_2) (envelope-from ) id 1g07GR-0004PL-0b for gllmg-musl@m.gmane.org; Wed, 12 Sep 2018 17:41:11 +0200 Original-Received: (qmail 26022 invoked by uid 550); 12 Sep 2018 15:43:19 -0000 Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm Precedence: bulk List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-ID: Original-Received: (qmail 26004 invoked from network); 12 Sep 2018 15:43:18 -0000 Content-Disposition: inline In-Reply-To: <20180912150941.GB13976@voyager> Original-Sender: Rich Felker Xref: news.gmane.org gmane.linux.lib.musl.general:13229 Archived-At: On Wed, Sep 12, 2018 at 05:09:41PM +0200, Markus Wichmann wrote: > On Wed, Sep 12, 2018 at 10:02:39AM -0400, Rich Felker wrote: > > While working on the headers/declarations/hidden stuff, I've run > > across some issues that should be documented as needing fix. One big > > one is this: > > > > strto* is still setting up a fake FILE with buffer end pointer as > > (void*)-1 as a hack to read directly from a string of unknown length. > > This is of course nonsensical UB and might lead to actual problems > > (signed overflow) in places where rend-rpos is computed. strtol.c > > handles that aspect already by "if ((size_t)s > (size_t)-1/2)" but > > it's still horribly wrong. There's a __string_read function sscanf > > uses that avoids the whole problem by incrementally measuring string > > length and copying it to a real stdio buffer, which also makes ungetc > > safe, and this is the obvious *right* thing to do, but it might make > > strto* unacceptably slower. I haven't done any measurement. > > > > Well, first of all, I might set my foot wrong here very badly, but I > generally don't care about C standard UB as long as the behavior is > defined elsewhere. Like where? In order for it to be defined, the *compiler* has to define it, since otherwise it can make transformations that assume the behavior is undefined. So what you're asking for here is basically amounting to only supporting certain compilers (with certain flags), and notably *not supporting* UBSan, which is a really valuable tool for catching bugs. > In this case, the behavior is undefined because the C > standard only defines a partial mapping from pointer to integer and back > (namely null pointer <-> 0 and non-null pointer <-> nonzero, but nothing > else), and that is why most operations on pointers are undefined. > However, Linux defines that it only works on a linear address space, and > so completes the mapping from pointer to integer, and so avoids the UB > by defining it. > > We have done similar things in the past, like assuming long to be large > enough for a pointer, as Linux depends on that as well. > > As for the overflow: That would be fixed if you set rend to rpos + > PTRDIFF_MAX unless that overflows. Technically still UB, but at least > the calculations work now. That's what strtol.c is doing right now. The corresponding fix in strtod.c (all the float functions) is missing. > The __string_read approach shouldn't be all that much slower, though. It > only adds overhead when the read buffer is empty, which is going to be > on the first call and then every 256 characters. And the overhead is > mostly a memchr()... on second thought, that might take a while. Ah, I forgot that __string_read is still doing the hack of having buffer pointers point directly to the string, not copying it, so that ungetc is unsafe (in general; the intscan/floatscan/shgetc framework has a way to handle it). So it's not a copy, but it is a memchr of N to N+(-N&255) bytes, where N is the length of the input item. Even if N is short, the input item might lie in a longer string, and if you have a sequence of short numbers (worst-case, most single-digit) in a long string of numbers, you'll end up repeatedly scanning for null up to 255 bytes past the end of the number. Just lowering the scanahead from 256 to something like 16-24 should make this largely irrelevant, though. Such tuning should probably be specific to caller (strto* vs vsscanf, since the latter is likely to be reading multiple fields and possibly non-numeric fields). With that said, I'd kind of like to make it work the way I was mistakenly thinking __string_read works, doing actual copying into a stdio buffer. Then there's no risk of bugs from other stdio code assuming it has a real buffer (e.g. ungetc is always safe). But I'm not sure if it would incur noticable cost. > > The other "mostly right" (modulo ungetc not being available then) > > approach would be getting rid of the whole current buffer design with > > start/end pointers and using indices instead. This would solve a lot > > of stupid gratuitous UB in stdio, like (T*)0-(T*)0 being undefined. > > It's not clear to me whether it would be more or less efficient. It > > would "break" glibc ABI-compat for stdio -- the original reason I used > > the pointer-based design -- but that could be fixed by putting > > "must-be-null" fields in place of the buffer pointers so that any > > glibc code using getc_unlocked/putc_unlocked macros would hit the > > "no buffer space" code path and call an actual function. In many ways > > that's desirable anyway. > > > > I don't know how important the glibc-compat still is for people. I don't It's important enough that Alpine has done a good deal of work making it run better with additional libraries. People use it to run various proprietary software that doesn't ship a musl-linked version. > need it, but then, I have glibc available if necessary. And with Steam > moving to Linux, closed-source software is certainly going to be on the > rise for Linux/PC (i.e. Linux/i386 and Linux/amd64). Which is going to > be a problem for musl-based distributions, if they want to support that. Ideally they would be using musl for Steam and similar things, since it's much less dependant on the host library ecosystem. It would also be more license-friendly to them. However the awful dlopen-based architecture of video drivers is a stopper right now. > > Probably the right next step here is measuring whether just using > > __string_read would make anything measurably slower. > > Probably. Although the memchr() in that function might access past the > string, but only if the string isn't terminated, so we can invoke the > GIGO principle, right? Input to strto* has to be a string (null-terminated), but it doesn't need to end after the input item (number). That's why the endptr argument exists -- for parsing numbers out of a stream of data. Rich