mailing list of musl libc
 help / color / mirror / code / Atom feed
* stdio glitch & questions
@ 2018-11-30 10:51 Xan Phung
  2018-11-30 16:09 ` Rich Felker
  0 siblings, 1 reply; 7+ messages in thread
From: Xan Phung @ 2018-11-30 10:51 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2043 bytes --]

Hi,

A few questions about stdio:

(1) I notice __toread.c uses angular quotes for <stdio_impl.h> whereas all
other source files use "stdio_impl.h".  I assume the latter is correct and
__toread.c's use of angular quotes was a glitch & it should really be
double quotes... is that correct?

(2) I notice vfprintf first tries to call printf_core with f=0 (line 667)
then calls printf_core again with f set to the actual file to receive
output (line 682).  Why is printf_core called twice?  I struggle to
understand the purpose of the first call with f=0.

(3) When I do a step thru the __fwritex function to understand how printf
works, I notice the resulting writev system calls pass on the output data
as a two element iovec array, with the 1st element comprising all line
buffered text up to & including the last variable data item, and then the
2nd element comprising the residual format string trailing the last
variable data item (more often than not just a single '\n').

For example, printf("error: %s\n", msg) would put all text up to &
including %s text in first iovec and the second iovec contains only '\n'.
I understand the rationale of this is to avoid copying the final '\n' to
the buffer at f->wpos.  (There is actually guaranteed space in the buffer
itself due to a check at line 10 of fwrite.c).  The use the array of 2x
iovec's presumably then relies on Linux kernel scatter-gather I/O to then
optimally handle the iovec array, ie: that the writev() of 2x iovec is more
efficient than avoiding the copy of a few additional bytes (often a single
'\n' byte) into f->wpos, and then using a single write() syscall.

Isn't this a big assumption?  With Linux itself, can we really know that
Linux device drivers are smart enough to do writev() optimally?  Also,
there is a lot of interest in porting musl to non-Linux os's, many of which
do not have writev().  (I am porting musl to WebAssembly and to Plan 9).

I can prepare a patch of a version using write() instead of writev() if
there is interest in this...

regards
Xan Phung

[-- Attachment #2: Type: text/html, Size: 2420 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: stdio glitch & questions
  2018-11-30 10:51 stdio glitch & questions Xan Phung
@ 2018-11-30 16:09 ` Rich Felker
  2018-11-30 22:15   ` Xan Phung
  0 siblings, 1 reply; 7+ messages in thread
From: Rich Felker @ 2018-11-30 16:09 UTC (permalink / raw)
  To: musl

On Fri, Nov 30, 2018 at 09:51:39PM +1100, Xan Phung wrote:
> Hi,
> 
> A few questions about stdio:
> 
> (1) I notice __toread.c uses angular quotes for <stdio_impl.h> whereas all
> other source files use "stdio_impl.h".  I assume the latter is correct and
> __toread.c's use of angular quotes was a glitch & it should really be
> double quotes... is that correct?

Yes, this doesn't make any difference but it's a style mistake.

> (2) I notice vfprintf first tries to call printf_core with f=0 (line 667)
> then calls printf_core again with f set to the actual file to receive
> output (line 682).  Why is printf_core called twice?  I struggle to
> understand the purpose of the first call with f=0.

To understand this you need to look inside printf_core. When called
with !f, it attempts to collect the %N$-form arguments if they're
used, or bails out early if it detects that normal % arguments are
used. Two passes are needed here because random access to a va_list is
not possible.

> (3) When I do a step thru the __fwritex function to understand how printf
> works, I notice the resulting writev system calls pass on the output data
> as a two element iovec array, with the 1st element comprising all line
> buffered text up to & including the last variable data item, and then the
> 2nd element comprising the residual format string trailing the last
> variable data item (more often than not just a single '\n').
> 
> For example, printf("error: %s\n", msg) would put all text up to &
> including %s text in first iovec and the second iovec contains only '\n'.
> I understand the rationale of this is to avoid copying the final '\n' to
> the buffer at f->wpos.  (There is actually guaranteed space in the buffer
> itself due to a check at line 10 of fwrite.c).  The use the array of 2x
> iovec's presumably then relies on Linux kernel scatter-gather I/O to then
> optimally handle the iovec array, ie: that the writev() of 2x iovec is more
> efficient than avoiding the copy of a few additional bytes (often a single
> '\n' byte) into f->wpos, and then using a single write() syscall.

Indeed, in the case where the new data is very short, it's almost
certainly faster to just copy it to the buffer and perform a single
write syscall. Likewise, for reading a single character it's almost
surely faster to perform a single read syscall then pull it out of the
buffer.

However, conversely, it's possible to see a call to the stdio write
backend (f->write) where the new data is too large to fit in the
buffer. In this case, a writev syscall is almost certainly faster
(fewer trips back and forth between user and kernel space, which are
the dominant cost), and moving data into the buffer is not helpful
because it can't reduce the number of syscalls. Prior to commit
e3cd6c5c265cd481db6e0c5b529855d99f0bda30, fwrite contained heuristic
logic for individual cases, but it couldn't necessarily be optimal
under all usage patterns. After the change, the number of syscalls is
always minimized.

> Isn't this a big assumption?  With Linux itself, can we really know that
> Linux device drivers are smart enough to do writev() optimally?  Also,
> there is a lot of interest in porting musl to non-Linux os's, many of which
> do not have writev().  (I am porting musl to WebAssembly and to Plan 9).
> 
> I can prepare a patch of a version using write() instead of writev() if
> there is interest in this...

You can emulate readv and writev using the property that short reads
and writes are permissible, copying data through a fixed-size
intermediate buffer on the stack. This is of course suboptimal but
easy to do.

Emulation of readv is really sensitive, because breaking it up into
multiple reads can cause inapprpriate blocking. Linux actually has a
bug where this can happen anyway -- see commit
2cff36a84f268c09f4c9dc5a1340652c8e298dc0 -- so musl's __stdio_read
already reads the last character of the first (caller-requested) part
through the buffer, and collapses the readv to a read if this makes
the first iov empty.

It would probably be welcome to make __stdio_write make use of
SYS_write when it would be expected to be faster (len very small), but
I'm not sure what the exact cutoff should be. Switching away from
writev/readv would not be a welcome change though; use of them is very
intentional and it's how musl avoids some pathological slowness under
certain stdio usage patterns.

If you're porting to a system that lacks the underlying syscalls, I
think it probably makes sense to emulate them at the syscall() level
using a strategy like I described above. It's necessary for making the
public readv()/writev() functions work anyway.

Rich


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: stdio glitch & questions
  2018-11-30 16:09 ` Rich Felker
@ 2018-11-30 22:15   ` Xan Phung
  2018-12-01  0:02     ` Rich Felker
  0 siblings, 1 reply; 7+ messages in thread
From: Xan Phung @ 2018-11-30 22:15 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2144 bytes --]

Thanks for the quick answer, and I've taken a look at the pre-2011 fwrite.c
code, and using SYS_writev is indeed much cleaner code!

See below on my proposed answer to your question about what the "cutoff"
should be for copying.  (SYS_writev is fully retained, but the 2nd iovec
element will very often be zero length in this proposal, which makes
emulation of SYS_writev much more efficient).

On Sat, 1 Dec 2018 at 03:10, Rich Felker <dalias@libc.org> wrote:

>
> It would probably be welcome to make __stdio_write make use of
> SYS_write when it would be expected to be faster (len very small), but
> I'm not sure what the exact cutoff should be.
>
>
My proposal is the cutoff be 5-8 bytes (on 32 bit CPUs) , and on 64 bit
CPUs, 9-16 bytes.

The cutoffs are selected in such a way that the "no copy" loop (searching
for '\n') always ends on a word aligned position (opening the door to
future optimisations by using word-at-a-time search for '\n' instead of
byte-at-a-time).  The "copy" branch is also guaranteed to only be a double
word at most, but a minimum of a single word (allowing a two word memcpy to
be done with just a 2x load/mask/store word code sequence).  Some example
code is shown to give a general idea of word-aligning the cutoff amount
(but not yet doing word-at-a-time searching of '\n', or optimised two word
memcpy).

*CURRENT __fwritex.c CODE (lines 12-20)*:


if (f->lbf >= 0) {

/* Match /^(.*\n|)/ */

for (i=l; i && s[i-1] != '\n'; i--);

if (i) {

size_t n = f->write(f, s, i);

if (n < i) return n;

s += i;

l -= i;

}

}



*PROPOSED*:

	size_t i, len;
	if (f->lbf >= 0) {
		const unsigned char *t = ALIGN(s+sizeof(size_t)*2);
		for (i = l+s-t; ; i--) {
			if (i <= 0) {   /* SHORT LINE - copy up to 16 bytes into f->wpos
buffer and then flush line */
				for (j = t-s; j && s[j-1] != '\n'; j--);
				if (j) {
			  		memcpy(f->wpos, s, j);  f->wpos += j;
					size_t n = f->write(f, t, 0);
					if (n < 0) return n;
					s += j;
					l -= j;
				}				break;
			}
			if (t[i-1] == '\n') {
				size_t n = f->write(f, s, len = i+t-s);
				if (n < len) return n;
				s += len;
				l -= len;
				break;
			}
		}
	}

[-- Attachment #2: Type: text/html, Size: 6864 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: stdio glitch & questions
  2018-11-30 22:15   ` Xan Phung
@ 2018-12-01  0:02     ` Rich Felker
  2018-12-01  2:42       ` Xan Phung
  0 siblings, 1 reply; 7+ messages in thread
From: Rich Felker @ 2018-12-01  0:02 UTC (permalink / raw)
  To: musl

On Sat, Dec 01, 2018 at 09:15:56AM +1100, Xan Phung wrote:
> Thanks for the quick answer, and I've taken a look at the pre-2011 fwrite.c
> code, and using SYS_writev is indeed much cleaner code!
> 
> See below on my proposed answer to your question about what the "cutoff"
> should be for copying.  (SYS_writev is fully retained, but the 2nd iovec
> element will very often be zero length in this proposal, which makes
> emulation of SYS_writev much more efficient).
> 
> On Sat, 1 Dec 2018 at 03:10, Rich Felker <dalias@libc.org> wrote:
> 
> >
> > It would probably be welcome to make __stdio_write make use of
> > SYS_write when it would be expected to be faster (len very small), but
> > I'm not sure what the exact cutoff should be.
> >
> >
> My proposal is the cutoff be 5-8 bytes (on 32 bit CPUs) , and on 64 bit
> CPUs, 9-16 bytes.
> 
> The cutoffs are selected in such a way that the "no copy" loop (searching
> for '\n') always ends on a word aligned position (opening the door to
> future optimisations by using word-at-a-time search for '\n' instead of
> byte-at-a-time).  The "copy" branch is also guaranteed to only be a double
> word at most, but a minimum of a single word (allowing a two word memcpy to
> be done with just a 2x load/mask/store word code sequence).  Some example
> code is shown to give a general idea of word-aligning the cutoff amount
> (but not yet doing word-at-a-time searching of '\n', or optimised two word
> memcpy).
> 
> *CURRENT __fwritex.c CODE (lines 12-20)*:
> 
> 
> if (f->lbf >= 0) {
> 
> /* Match /^(.*\n|)/ */
> 
> for (i=l; i && s[i-1] != '\n'; i--);
> 
> if (i) {
> 
> size_t n = f->write(f, s, i);
> 
> if (n < i) return n;
> 
> s += i;
> 
> l -= i;
> 
> }
> 
> }
> 
> 
> 
> *PROPOSED*:
> 
> 	size_t i, len;
> 	if (f->lbf >= 0) {
> 		const unsigned char *t = ALIGN(s+sizeof(size_t)*2);
> 		for (i = l+s-t; ; i--) {
> 			if (i <= 0) {   /* SHORT LINE - copy up to 16 bytes into f->wpos
> buffer and then flush line */
> 				for (j = t-s; j && s[j-1] != '\n'; j--);
> 				if (j) {
> 			  		memcpy(f->wpos, s, j);  f->wpos += j;
> 					size_t n = f->write(f, t, 0);
> 					if (n < 0) return n;
> 					s += j;
> 					l -= j;
> 				}				break;
> 			}
> 			if (t[i-1] == '\n') {
> 				size_t n = f->write(f, s, len = i+t-s);
> 				if (n < len) return n;
> 				s += len;
> 				l -= len;
> 				break;
> 			}
> 		}
> 	}

I've been trying to understand what you're trying to do. It seems you
chose to work at the point of line-buffered flush logic, since that
happens to be the only case where f->write is called with an argument
that might fit in the remaining buffer space.

As written the alignment logic and pointer arithmetic is invalid; the
sums/differences are out of bounds of the array, and i<=0 is not
meaningful since i has an unsigned type (and so does l+s-t). But even
if it could be made correct, it's all completely unnecessary and just
making the code slower and less readable.

If __fwritex were the right place for this code, all you would need to
do is check whether i<16 (or whatever threshold) before calling
f->write, and if so, memcpy'ing it to the buffer then calling f->write
with a length of 0. However, then you could not use the return value
of f->write to determine if it succeeded (see how fflush and fseek
have to deal with this case). Contrary to what your code assumes,
f->write does not (and cannot, since the type is unsigned) return a
negative value on error.

Instead, I think it probably makes more sense to put the logic in
__stdio_write, but this will also be somewhat nontrivial to work in.
At least the "iovcnt == 2 ? ..." logic needs to be adapted to
something like "rem > len ? ...". Before the loop should probably be
something like "if (len < f->wend-f->wpos && len <= 16) ..." to
conditionally copy the new data into the buffer.

Do you see any reason to prefer doing it in __fwritex?

Rich


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: stdio glitch & questions
  2018-12-01  0:02     ` Rich Felker
@ 2018-12-01  2:42       ` Xan Phung
  2018-12-01  3:17         ` Rich Felker
  0 siblings, 1 reply; 7+ messages in thread
From: Xan Phung @ 2018-12-01  2:42 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 3079 bytes --]

On Sat, 1 Dec 2018 at 11:02, Rich Felker <dalias@libc.org> wrote:

>
> I've been trying to understand what you're trying to do. It seems you
> chose to work at the point of line-buffered flush logic, since that
> happens to be the only case where f->write is called with an argument
> that might fit in the remaining buffer space.
>
>
Yes that's correct... Also, the other reason I chose __fwrite.c is it's the
only place where the search for '\n' is done.

An additional objective I had was to split the loop searching for '\n' into
two stages:
(i) Stage 1: search for '\n' word by word ie: 8 bytes at a time ... if '\n'
found at >~16 bytes before start of "s" array then go straight to f->write
without copying bytes
(I don't show it in my code, but the algorithm to do stage 1 would be
similar to memchr.c)
(ii) Stage 2: in final 9~16 bytes, drop down into byte-by-byte searching
for  '\n' (also doing copy of residual bytes into buffer when '\n' found)


> As written the alignment logic and pointer arithmetic is invalid; the
> sums/differences are out of bounds of the array, and i<=0 is not
> meaningful since i has an unsigned type (and so does l+s-t). But even
> if it could be made correct, it's all completely unnecessary and just
> making the code slower and less readable. If __fwritex were the right
> place for this code, all you would need to
>
do is check whether i<16 (or whatever threshold) before calling
> f->write, and if so, memcpy'ing it to the buffer then calling f->write
> with a length of 0.


I agree, the check for i <= X is a simpler way of expressing the algorithm.
[X is a value between 9 and 16 that guarantees (s + X) will be word aligned]

In my version, I introduced a new array base "t" and rebase i to index into
"t" because "t" is a guaranteed word aligned pointer.
However, this word alignment of "t" is not exploited in the current
byte-at-a-time searching for '\n', so yes at present it's unnecessary.


> However, then you could not use the return value
> of f->write to determine if it succeeded (see how fflush and fseek
> have to deal with this case). Contrary to what your code assumes,
> f->write does not (and cannot, since the type is unsigned) return a
> negative value on error.
>
>
Ok, noted, the expression to check for error should be (!f->wpos) and not n
< 0

Instead, I think it probably makes more sense to put the logic in
> __stdio_write, but this will also be somewhat nontrivial to work in.
> At least the "iovcnt == 2 ? ..." logic needs to be adapted to
> something like "rem > len ? ...". Before the loop should probably be
> something like "if (len < f->wend-f->wpos && len <= 16) ..." to
> conditionally copy the new data into the buffer.
>
> Do you see any reason to prefer doing it in __fwritex?
>
> Wouldn't putting check for i < X in __fwrite.c better because it:
(a) facilitates word-by-word searching objective outlined above
(b) check for space in buffer is already done in line 10 of __fwrite.c,
hence avoids need for any more buffer length checks
(c) ?? speeds up writes which doesn't use __stdio_write

[-- Attachment #2: Type: text/html, Size: 4598 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: stdio glitch & questions
  2018-12-01  2:42       ` Xan Phung
@ 2018-12-01  3:17         ` Rich Felker
  2018-12-01  8:02           ` Xan Phung
  0 siblings, 1 reply; 7+ messages in thread
From: Rich Felker @ 2018-12-01  3:17 UTC (permalink / raw)
  To: musl

On Sat, Dec 01, 2018 at 01:42:30PM +1100, Xan Phung wrote:
> On Sat, 1 Dec 2018 at 11:02, Rich Felker <dalias@libc.org> wrote:
> 
> >
> > I've been trying to understand what you're trying to do. It seems you
> > chose to work at the point of line-buffered flush logic, since that
> > happens to be the only case where f->write is called with an argument
> > that might fit in the remaining buffer space.
> >
> >
> Yes that's correct... Also, the other reason I chose __fwrite.c is it's the
> only place where the search for '\n' is done.
> 
> An additional objective I had was to split the loop searching for '\n' into
> two stages:
> (i) Stage 1: search for '\n' word by word ie: 8 bytes at a time ... if '\n'
> found at >~16 bytes before start of "s" array then go straight to f->write
> without copying bytes
> (I don't show it in my code, but the algorithm to do stage 1 would be
> similar to memchr.c)
> (ii) Stage 2: in final 9~16 bytes, drop down into byte-by-byte searching
> for  '\n' (also doing copy of residual bytes into buffer when '\n' found)
> 
> 
> > As written the alignment logic and pointer arithmetic is invalid; the
> > sums/differences are out of bounds of the array, and i<=0 is not
> > meaningful since i has an unsigned type (and so does l+s-t). But even
> > if it could be made correct, it's all completely unnecessary and just
> > making the code slower and less readable. If __fwritex were the right
> > place for this code, all you would need to
> >
> do is check whether i<16 (or whatever threshold) before calling
> > f->write, and if so, memcpy'ing it to the buffer then calling f->write
> > with a length of 0.
> 
> 
> I agree, the check for i <= X is a simpler way of expressing the algorithm.
> [X is a value between 9 and 16 that guarantees (s + X) will be word aligned]
> 
> In my version, I introduced a new array base "t" and rebase i to index into
> "t" because "t" is a guaranteed word aligned pointer.
> However, this word alignment of "t" is not exploited in the current
> byte-at-a-time searching for '\n', so yes at present it's unnecessary.
> 
> 
> > However, then you could not use the return value
> > of f->write to determine if it succeeded (see how fflush and fseek
> > have to deal with this case). Contrary to what your code assumes,
> > f->write does not (and cannot, since the type is unsigned) return a
> > negative value on error.
> >
> >
> Ok, noted, the expression to check for error should be (!f->wpos) and not n
> < 0
> 
> Instead, I think it probably makes more sense to put the logic in
> > __stdio_write, but this will also be somewhat nontrivial to work in.
> > At least the "iovcnt == 2 ? ..." logic needs to be adapted to
> > something like "rem > len ? ...". Before the loop should probably be
> > something like "if (len < f->wend-f->wpos && len <= 16) ..." to
> > conditionally copy the new data into the buffer.
> >
> > Do you see any reason to prefer doing it in __fwritex?
> >
> Wouldn't putting check for i < X in __fwrite.c better because it:
> (a) facilitates word-by-word searching objective outlined above

Whether a faster search for '\n' is possible or makes sense is
completely independent of where or whether you copy into the buffer.
The search is necessary either way, and the scope of the search has to
be the entire input array.

There is a namespace-safe __memrchr function that could be used here
if it were optimized, but I don't think open-coding an optimized
version of a string function inside a stdio function makes sense.

> (b) check for space in buffer is already done in line 10 of __fwrite.c,
> hence avoids need for any more buffer length checks

Indeed, that is potentially a reason for doing it here.

> (c) ?? speeds up writes which doesn't use __stdio_write

You mean other backends like fmemopen, open_memstream, fopencookie,
etc.? This is possibly also a good reason, but in order for it to
help, some of their write functions might need improvements too. For
example I just noticed that fopencookie's cookiewrite() function
unnecessarily calls the application's write function a second time
with length 0 when length 0 was passed to it. This should probably be
avoided for other reasons, since it's not clear that the callback
needs to accept 0 as a valid length.

Note that __stdio_write also needs some changes to avoid unnecessary
writev syscalls when write would suffice.

So anyway, I think there are probably some good reasons to do the
merging in __fwritex rather than in the __stdio_write backend. But it
should be a lot simpler than what you initially proposed.

Rich


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: stdio glitch & questions
  2018-12-01  3:17         ` Rich Felker
@ 2018-12-01  8:02           ` Xan Phung
  0 siblings, 0 replies; 7+ messages in thread
From: Xan Phung @ 2018-12-01  8:02 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1019 bytes --]

>
> So anyway, I think there are probably some good reasons to do the
> merging in __fwritex rather than in the __stdio_write backend. But it
> should be a lot simpler than what you initially proposed.
>
> Rich
>

Thanks for your feedback!  In interest of keeping the proposal as simple as
possible, I've gone with just one extra line of code (see below).  It
simply does copying for  i <= 8.

It works on my initial testing and I'm getting zero lengths in the 2nd
writev iovec (as expected), but I'll test it more in coming week.

size_t __fwritex(const unsigned char *restrict s, size_t l, FILE *restrict
f)

{

size_t i=0;


if (!f->wend && __towrite(f)) return 0;


if (l > f->wend - f->wpos) return f->write(f, s, l);


if (f->lbf >= 0) {

/* Match /^(.*\n|)/ */

for (i=l; i && s[i-1] != '\n'; i--);

if (i) {

*if (i <= 8) for (; i; i--, l--) *(f->wpos)++ = *s++;*

size_t n = f->write(f, s, i);

if (*!f->wpos ||* n < i) return n;

l -= i;

s += i;

}

}


memcpy(f->wpos, s, l);

f->wpos += l;

return l+i;

}

[-- Attachment #2: Type: text/html, Size: 9327 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2018-12-01  8:02 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-30 10:51 stdio glitch & questions Xan Phung
2018-11-30 16:09 ` Rich Felker
2018-11-30 22:15   ` Xan Phung
2018-12-01  0:02     ` Rich Felker
2018-12-01  2:42       ` Xan Phung
2018-12-01  3:17         ` Rich Felker
2018-12-01  8:02           ` Xan Phung

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).