The Unix Heritage Society mailing list
* [TUHS] the sin of buffering [offshoot of excise process from a pipeline]
@ 2014-07-15  2:31 Doug McIlroy
  2014-07-15  2:40 ` Larry McVoy
  2014-07-15 18:55 ` scj
  0 siblings, 2 replies; 12+ messages in thread
From: Doug McIlroy @ 2014-07-15  2:31 UTC (permalink / raw)


> Err, why is buffering data in the process a sin? (Or was this just a
> humorous aside?)

Process A spawns process B, which reads stdin with buffering. B gets
all it deserves from stdin and exits. What's left in the buffer,
intended for A, is lost. Sinful.




* [TUHS] the sin of buffering [offshoot of excise process from a pipeline]
  2014-07-15  2:31 [TUHS] the sin of buffering [offshoot of excise process from a pipeline] Doug McIlroy
@ 2014-07-15  2:40 ` Larry McVoy
  2014-07-15 18:55 ` scj
  1 sibling, 0 replies; 12+ messages in thread
From: Larry McVoy @ 2014-07-15  2:40 UTC (permalink / raw)


On Mon, Jul 14, 2014 at 10:31:27PM -0400, Doug McIlroy wrote:
> > Err, why is buffering data in the process a sin? (Or was this just a
> > humorous aside?)
> 
> Process A spawns process B, which reads stdin with buffering. B gets
> all it deserves from stdin and exits. What's left in the buffer,
> intended for A, is lost. Sinful.

It really depends on what you want.  That buffering is a big win for
some use cases.  Even on today's processors reading a byte at a time via
read(2) is costly.  Like 5000x more costly on the laptop I'm typing on:

calvin:~/tmp lmdd opat=1 move=100m of=XXX
104.8576 MB in 0.1093 secs, 959.5578 MB/sec
calvin:~/tmp time a.out fd < XXX

real    0m14.754s
user    0m1.516s
sys     0m13.201s
calvin:~/tmp time a.out stdio < XXX

real    0m0.003s
user    0m0.000s
sys     0m0.000s
calvin:~/tmp bc
14.754/.003
4918.00000000000000000000

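#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>     /* read(2) */
#define unless(x)       if (!(x))
#define streq(a, b)     (!strcmp(a, b))

int
main(int ac, char **av)
{
        int     c;      /* int, not char: a 0xFF data byte held in a
                         * signed char compares equal to EOF */
        char    b;

        unless (ac == 2) exit(1);
        if (streq(av[1], "stdio")) {
                /* buffered: stdio refills with one read(2) per block */
                while ((c = fgetc(stdin)) != EOF)
                        ;
        } else {
                /* unbuffered: one system call per byte */
                while (read(0, &b, 1) == 1)
                        ;
        }
        exit(0);
}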





* [TUHS] the sin of buffering [offshoot of excise process from a pipeline]
  2014-07-15  2:31 [TUHS] the sin of buffering [offshoot of excise process from a pipeline] Doug McIlroy
  2014-07-15  2:40 ` Larry McVoy
@ 2014-07-15 18:55 ` scj
  1 sibling, 0 replies; 12+ messages in thread
From: scj @ 2014-07-15 18:55 UTC (permalink / raw)


Bah!  This is a bug in Unix, IMHO.  We would consider it a bug if a
buffered output file refused to dump its output buffer upon exit.  It
seems to me to be just as much a bug if a buffered input file refuses to
push back its unused input on exit.  Unix should have provided a mechanism
to permit this...

Steve


>> Err, why is buffering data in the process a sin? (Or was this just a
>> humorous aside?)
>
> Process A spawns process B, which reads stdin with buffering. B gets
> all it deserves from stdin and exits. What's left in the buffer,
> intended for A, is lost. Sinful.
> _______________________________________________
> TUHS mailing list
> TUHS at minnie.tuhs.org
> https://minnie.tuhs.org/mailman/listinfo/tuhs
>






* [TUHS] the sin of buffering [offshoot of excise process from a pipeline]
@ 2014-07-16 21:46 Noel Chiappa
  0 siblings, 0 replies; 12+ messages in thread
From: Noel Chiappa @ 2014-07-16 21:46 UTC (permalink / raw)


    > From: Doug McIlroy <doug at cs.dartmouth.edu>

    > Process A spawns process B, which reads stdin with buffering. B gets
    > all it deserves from stdin and exits. What's left in the buffer,
    > intended for A, is lost.

Ah. Got it.

The problem is not with buffering as a generic approach; the problem is that
you're trying to use a buffering package intended for simple,
straightforward situations in one which doesn't fall into that category! :-)

Clearly, either B has to i) be able to put back data which was not for it
('ungets' as a system call), or ii) not read the data that's not for it - but
that may be incompatible with the concept of buffering the input (depending
on the syntax, and thus the ability to predict the approaching of the data B
wants, the only way to avoid the need for ungetc() might be to read a byte at
a time).

If B and its upstream (U) are written together, that could be another way to
deal with it: if U knows where B's syntactical boundaries are, it can give it
advance warning, and B could then use a non-trivial buffering package to do
the right thing. E.g. if U emits 'records' with a header giving the record
length X, B could tell its buffering package 'don't read ahead more than X
bytes until I tell you to go ahead with the next record'.

Of course, that's not a general solution; it only works with prepared U's.
Really, the only general, efficient way to deal with that situation that I can
see is to add 'ungets' to the operating system...

	Noel




* [TUHS] the sin of buffering [offshoot of excise process from a pipeline]
  2014-07-16 14:56           ` Dan Cross
@ 2014-07-16 15:41             ` Larry McVoy
  0 siblings, 0 replies; 12+ messages in thread
From: Larry McVoy @ 2014-07-16 15:41 UTC (permalink / raw)


What is being provided is a generic layering system on top of stdio.
Any sort of conversion you want.  Encryption, CRC, XOR block, compression
(we support gzip and lz4).  Those are just the layers we use right now,
it's easy to imagine others being added.

The point is that there is no one API that is going to pleasantly encode
all of the options to all of those layers and any that may come later.
Are you seriously suggesting that you want to read the freopen(3)
man page and see all of these options explained?  That's the classic
open source way, dump everything in one poorly thought out man page.
It's not the Unix way, people think about it harder.

For the record, I pushed for the single string encoding as well but got
pushed off it as I realized the API wasn't as simple as I imagined.
While you could do it that way you shouldn't do it that way, it's just
not a good API.

I'm very pleased with how it turned out in our code: other than a handful
of fpush() calls, it just looks like stock stdio.

On Wed, Jul 16, 2014 at 10:56:57AM -0400, Dan Cross wrote:
> Why can't those be embedded in the relevant string?  freopen(fp, "rx{128}")
> or something?
> 
> 
> On Wed, Jul 16, 2014 at 10:30 AM, Larry McVoy <lm at mcvoy.com> wrote:
> 
> > On Wed, Jul 16, 2014 at 02:03:58AM -0400, John Cowan wrote:
> > > Larry McVoy scripsit:
> > >
> > > > We tried that but the problem is that you can't encode all the options
> > you
> > > > want in just a character.  Compression doesn't take options, the
> > CRC/XOR
> > > > layer wants to know how big you might think the file is (because we
> > > > support blocksizes from about 256B to 256K and we want to know the
> > > > file size to guess the block size).
> > >
> > > It's a string: you can have as many characters as you want.
> >
> > I understand your desire to have one API.  We tried and it just wasn't
> > practical.  Imagine pushing an encryption layer that wants a key,
> > XOR layer that wants block size, etc.

-- 
---
Larry McVoy            	     lm at mcvoy.com             http://www.mcvoy.com/lm 




* [TUHS] the sin of buffering [offshoot of excise process from a pipeline]
  2014-07-16 14:30         ` Larry McVoy
@ 2014-07-16 14:56           ` Dan Cross
  2014-07-16 15:41             ` Larry McVoy
  0 siblings, 1 reply; 12+ messages in thread
From: Dan Cross @ 2014-07-16 14:56 UTC (permalink / raw)


Why can't those be embedded in the relevant string?  freopen(fp, "rx{128}")
or something?


On Wed, Jul 16, 2014 at 10:30 AM, Larry McVoy <lm at mcvoy.com> wrote:

> On Wed, Jul 16, 2014 at 02:03:58AM -0400, John Cowan wrote:
> > Larry McVoy scripsit:
> >
> > > We tried that but the problem is that you can't encode all the options
> you
> > > want in just a character.  Compression doesn't take options, the
> CRC/XOR
> > > layer wants to know how big you might think the file is (because we
> > > support blocksizes from about 256B to 256K and we want to know the
> > > file size to guess the block size).
> >
> > It's a string: you can have as many characters as you want.
>
> I understand your desire to have one API.  We tried and it just wasn't
> practical.  Imagine pushing an encryption layer that wants a key,
> XOR layer that wants block size, etc.



* [TUHS] the sin of buffering [offshoot of excise process from a pipeline]
  2014-07-16  6:03       ` John Cowan
@ 2014-07-16 14:30         ` Larry McVoy
  2014-07-16 14:56           ` Dan Cross
  0 siblings, 1 reply; 12+ messages in thread
From: Larry McVoy @ 2014-07-16 14:30 UTC (permalink / raw)


On Wed, Jul 16, 2014 at 02:03:58AM -0400, John Cowan wrote:
> Larry McVoy scripsit:
> 
> > We tried that but the problem is that you can't encode all the options you
> > want in just a character.  Compression doesn't take options, the CRC/XOR
> > layer wants to know how big you might think the file is (because we 
> > support blocksizes from about 256B to 256K and we want to know the 
> > file size to guess the block size).
> 
> It's a string: you can have as many characters as you want.

I understand your desire to have one API.  We tried and it just wasn't
practical.  Imagine pushing an encryption layer that wants a key,
XOR layer that wants block size, etc.




* [TUHS] the sin of buffering [offshoot of excise process from a pipeline]
  2014-07-16  4:05     ` Larry McVoy
@ 2014-07-16  6:03       ` John Cowan
  2014-07-16 14:30         ` Larry McVoy
  0 siblings, 1 reply; 12+ messages in thread
From: John Cowan @ 2014-07-16  6:03 UTC (permalink / raw)


Larry McVoy scripsit:

> We tried that but the problem is that you can't encode all the options you
> want in just a character.  Compression doesn't take options, the CRC/XOR
> layer wants to know how big you might think the file is (because we 
> support blocksizes from about 256B to 256K and we want to know the 
> file size to guess the block size).

It's a string: you can have as many characters as you want.

-- 
John Cowan          http://www.ccil.org/~cowan        cowan at ccil.org
Dievas dave dantis; Dievas duos duonos        --Lithuanian proverb
Deus dedit dentes; deus dabit panem           --Latin version thereof
Deity donated dentition;
  deity'll donate doughnuts                   --English version by Muke Tever
God gave gums; God'll give granary            --Version by Mat McVeagh




* [TUHS] the sin of buffering [offshoot of excise process from a pipeline]
  2014-07-16  3:53   ` John Cowan
@ 2014-07-16  4:05     ` Larry McVoy
  2014-07-16  6:03       ` John Cowan
  0 siblings, 1 reply; 12+ messages in thread
From: Larry McVoy @ 2014-07-16  4:05 UTC (permalink / raw)


On Tue, Jul 15, 2014 at 11:53:03PM -0400, John Cowan wrote:
> Larry McVoy scripsit:
> 
> > Want your stream compressed or uncompressed?
> > 
> > 	fpush(&stdin, fopen_vzip(stdin, "r"));
> 
> Me, I would have done it with freopen(stdin, "rv").

We tried that but the problem is that you can't encode all the options you
want in just a character.  Compression doesn't take options, the CRC/XOR
layer wants to know how big you might think the file is (because we 
support blocksizes from about 256B to 256K and we want to know the 
file size to guess the block size).
-- 
---
Larry McVoy            	     lm at mcvoy.com             http://www.mcvoy.com/lm 




* [TUHS] the sin of buffering [offshoot of excise process from a pipeline]
  2014-07-16  0:32 ` Larry McVoy
@ 2014-07-16  3:53   ` John Cowan
  2014-07-16  4:05     ` Larry McVoy
  0 siblings, 1 reply; 12+ messages in thread
From: John Cowan @ 2014-07-16  3:53 UTC (permalink / raw)


Larry McVoy scripsit:

> Want your stream compressed or uncompressed?
> 
> 	fpush(&stdin, fopen_vzip(stdin, "r"));

Me, I would have done it with freopen(stdin, "rv").

-- 
 John Cowan          http://www.ccil.org/~cowan        cowan at ccil.org
               Si hoc legere scis, nimium eruditionis habes.




* [TUHS] the sin of buffering [offshoot of excise process from a pipeline]
  2014-07-15 23:43 Doug McIlroy
@ 2014-07-16  0:32 ` Larry McVoy
  2014-07-16  3:53   ` John Cowan
  0 siblings, 1 reply; 12+ messages in thread
From: Larry McVoy @ 2014-07-16  0:32 UTC (permalink / raw)


I dunno, we have a distributed source management system that uses a lot
of network I/O.  We've carefully layered stdio on top of it because we
had many cases where it was a performance bummer.

Personally, I've come to really love stdio, at least our version of it.
Want your stream compressed or uncompressed?

	fpush(&stdin, fopen_vzip(stdin, "r"));

Want your stream integrity checked with a CRC per block and an XOR block
at the end so you can correct any single block error?

	fpush(&stdout, fopen_crc(stdout, "w", 0, 0));

I'm a performance guy for the most part and while read/write seem like 
the fastest way to move stuff around that's only true for really nicely
formed data, page sized blocks or bigger.  Fine for benchmarking but if
you want to approach that performance with poorly formed data, like 
small blocks, different sized blocks, that buffering layer smooths 
things out.  You pay an extra bcopy() but that's typically lost in 
the noise.

I used to hate the idea of stdio but working in real world applications
where I can't control the size of the data coming at me, yeah, I've 
come to love stdio.  It's pretty darn useful.

On Tue, Jul 15, 2014 at 07:43:49PM -0400, Doug McIlroy wrote:
> Yes, an evil necessary to get things going. 
> The very definition of original sin.
> 
> Doug
> 
> Larry McVoy wrote:
> 
> >>>> For stdio, of course, one would need fsplice(3), which must flush the
> >>>> in-process buffers--penance for stdio's original sin of said buffering.
> 
> >>> Err, why is buffering data in the process a sin? (Or was this just a
> >>> humorous aside?)
>  
> >> Process A spawns process B, which reads stdin with buffering. B gets
> >> all it deserves from stdin and exits. What's left in the buffer,
> >> intended for A, is lost. Sinful.
>  
> > It really depends on what you want.  That buffering is a big win for
> > some use cases.  Even on today's processors reading a byte at a time via
> > read(2) is costly.  Like 5000x more costly on the laptop I'm typing on:

-- 
---
Larry McVoy            	     lm at mcvoy.com             http://www.mcvoy.com/lm 




* [TUHS] the sin of buffering [offshoot of excise process from a pipeline]
@ 2014-07-15 23:43 Doug McIlroy
  2014-07-16  0:32 ` Larry McVoy
  0 siblings, 1 reply; 12+ messages in thread
From: Doug McIlroy @ 2014-07-15 23:43 UTC (permalink / raw)


Yes, an evil necessary to get things going. 
The very definition of original sin.

Doug

Larry McVoy wrote:

>>>> For stdio, of course, one would need fsplice(3), which must flush the
>>>> in-process buffers--penance for stdio's original sin of said buffering.

>>> Err, why is buffering data in the process a sin? (Or was this just a
>>> humorous aside?)
 
>> Process A spawns process B, which reads stdin with buffering. B gets
>> all it deserves from stdin and exits. What's left in the buffer,
>> intended for A, is lost. Sinful.
 
> It really depends on what you want.  That buffering is a big win for
> some use cases.  Even on today's processors reading a byte at a time via
> read(2) is costly.  Like 5000x more costly on the laptop I'm typing on:




Thread overview: 12+ messages
2014-07-15  2:31 [TUHS] the sin of buffering [offshoot of excise process from a pipeline] Doug McIlroy
2014-07-15  2:40 ` Larry McVoy
2014-07-15 18:55 ` scj
2014-07-15 23:43 Doug McIlroy
2014-07-16  0:32 ` Larry McVoy
2014-07-16  3:53   ` John Cowan
2014-07-16  4:05     ` Larry McVoy
2014-07-16  6:03       ` John Cowan
2014-07-16 14:30         ` Larry McVoy
2014-07-16 14:56           ` Dan Cross
2014-07-16 15:41             ` Larry McVoy
2014-07-16 21:46 Noel Chiappa
