* Using file lines as "input files"
From: Dominik Vogt @ 2022-07-08 20:58 UTC (9+ messages in thread)
To: Zsh Users

Okay, there's this script that calculates a checksum on each line of a
file by reading each line and passing it to cksum/md5sum/shasum etc.:

  cat "$INFILE" | while read LINE; do
      echo "$LINE" | cksum
  done

This takes about four minutes on a file with 265,000 lines because of
the per-line program call overhead.

--

Disclaimer: I _know_ this can be done in seconds with perl / python,
but I like to not rely on scripting languages when the shell can do
the job.

--

So, would it be possible to pass each line in "$INFILE" as a file
argument to "cksum", i.e.

  $ cksum Fline1 Fline2 Fline3 ... Fline265000

(Of course without actually splitting the input file - the point is to
get rid of the four-minute wait, not to generate more bottlenecks.)
And there's the open file descriptor limit of 1024, too. :-)

Ciao

Dominik ^_^  ^_^

--

Dominik Vogt
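[For reference, the loop above can be reproduced on toy data like this; the
file name and the three sample lines are stand-ins for the real $INFILE, not
from the original message:]

```shell
# Same shape as the quoted loop: every input line forks a fresh cksum
# process.  Multiply that per-line fork/exec cost by 265,000 lines and
# the four-minute runtime follows.
tmp=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$tmp/infile"
sums=$(while IFS= read -r LINE; do
    printf '%s\n' "$LINE" | cksum
done < "$tmp/infile")
printf '%s\n' "$sums"    # one "CRC bytecount" pair per input line
rm -r "$tmp"
```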
* Re: Using file lines as "input files"
From: Mikael Magnusson @ 2022-07-08 21:58 UTC
To: dominik.vogt, Zsh Users

On 7/8/22, Dominik Vogt <dominik.vogt@gmx.de> wrote:
> So, would it be possible to pass each line in "$INFILE" as a file
> argument to "cksum", i.e.
>
> $ cksum Fline1 Fline2 Fline3 ... Fline265000

Assuming that the above command works, you should be able to do

  % cksum ${(f)"$(<$INFILE)"}

Depending on your kernel this may be too long a command line; you can
use zargs (or xargs) to work around that if needed, or if you're on
Linux you may be able to increase your stack size with ulimit -s, e.g.

  % ulimit -s 132768

(my default is 8192, which is not enough for the given example)

--
Mikael Magnusson
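[As a concrete illustration of the xargs workaround mentioned above — a sketch
on two made-up files; `-d '\n'` is GNU xargs and treats each input line as one
literal argument:]

```shell
# Feed file names one per line to xargs, which batches them into as few
# cksum invocations as the kernel's argument-size limit allows, instead
# of passing all names on a single command line.
tmp=$(mktemp -d)
printf 'a\n' > "$tmp/f1"
printf 'b\n' > "$tmp/f2"
sums=$(printf '%s\n' "$tmp/f1" "$tmp/f2" | xargs -d '\n' cksum)
printf '%s\n' "$sums"    # one output line per file
rm -r "$tmp"
```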
* Re: Using file lines as "input files"
From: Bart Schaefer @ 2022-07-08 22:04 UTC
To: dominik.vogt, Zsh Users

On Fri, Jul 8, 2022 at 1:58 PM Dominik Vogt <dominik.vogt@gmx.de> wrote:
>
> Disclaimer: I _know_ this can be done in seconds with perl / python,
> but I like to not rely on scripting languages when the shell can do
> the job.

This is sort of like saying "I like to not rely on hiking boots when
shoes can do the job."

> $ cksum Fline1 Fline2 Fline3 ... Fline265000
>
> (Of course without actually splitting the input file

If "not actually splitting" means what it seems to mean, and you
literally want to run cksum, the answer is no.  The things on the
cksum command line have to be file names, so you'd have to create a
file for each line of the original input.  The other option would be
to write the CRC algorithm as a shell function.
* Re: Using file lines as "input files"
From: Dominik Vogt @ 2022-07-08 23:17 UTC
To: Zsh Users

On Fri, Jul 08, 2022 at 03:04:31PM -0700, Bart Schaefer wrote:
> On Fri, Jul 8, 2022 at 1:58 PM Dominik Vogt <dominik.vogt@gmx.de> wrote:
> >
> > Disclaimer: I _know_ this can be done in seconds with perl / python,
> > but I like to not rely on scripting languages when the shell can do
> > the job.
>
> This is sort of like saying "I like to not rely on hiking boots when
> shoes can do the job."

Actually, for me, scripting languages are the "shoes" because they
don't interact very well with the command pipeline, unless you spend
an absurd amount of work to make them do so.  Calling commands for
everything can be slower, but most of the time slowness is just a
symptom of bad scripting.  GNU coreutils are faster than anything I'll
ever be willing to code (or any perl or python script or C or C++
library, for that matter).  The trick is keeping the process spawning
overhead low.

> > $ cksum Fline1 Fline2 Fline3 ... Fline265000
> >
> > (Of course without actually splitting the input file
>
> If "not actually splitting" means what it seems to mean, and you
> literally want to run cksum, the answer is no.

Right.

This does the job pretty well, relying entirely on existing Unix
tools:

  ulimit -s 100000
  split -l 1 "$INPUTF" ff
  cksum ff*
  rm ff*

That cuts the runtime down to seven seconds instead of four minutes,
at the cost of a few hundred MB on the RAM disk.  Splitting the
source file and removing the fragments takes about three to four
seconds.

Thanks for the comments, which put me on the right track.

--

(I prefer to have a huge stack size anyway, to be able to do things
like "grep foobar **/*(.)".)

Ciao

Dominik ^_^  ^_^

--

Dominik Vogt
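[The recipe above, run end to end on toy data — a sketch in which "input" and
the three sample lines stand in for $INPUTF, and GNU split's default
two-letter suffixes are assumed:]

```shell
# One temporary file per input line via split, then a single cksum
# invocation over all fragments -- no per-line process spawn left.
tmp=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$tmp/input"
cd "$tmp"
split -l 1 input ff   # creates ffaa, ffab, ffac
sums=$(cksum ff*)     # one process checksums every fragment
printf '%s\n' "$sums"
rm ff*
```

Note that each fragment keeps its line's trailing newline, so the checksums
match what `echo "$LINE" | cksum` produced in the original per-line loop.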
* Re: Using file lines as "input files"
From: Mikael Magnusson @ 2022-07-09 2:21 UTC
To: dominik.vogt, Zsh Users

On 7/9/22, Dominik Vogt <dominik.vogt@gmx.de> wrote:
> Actually, for me, scripting languages are the "shoes" because they
> don't interact very well with the command pipeline, unless you spend
> an absurd amount of work to make them do so.
>
> This does the job pretty well, relying entirely on existing Unix
> tools:
>
>   ulimit -s 100000
>   split -l 1 "$INPUTF" ff
>   cksum ff*
>   rm ff*
>
> That cuts the runtime down to seven seconds instead of four minutes,
> at the cost of a few hundred MB on the RAM disk.
>
> Thanks for the comments, which put me on the right track.
>
> (I prefer to have a huge stack size anyway, to be able to do things
> like "grep foobar **/*(.)".)

I realized I misinterpreted the question originally, and the following
doesn't seem to work 100%, but it was a fun idea:

  % mkfifo apipe
  % foo[265000]=''   # number of lines in the file
  % cksum apipe$^foo # pass "apipe" to cksum 265,000 times

(in another terminal, or with job control etc.)

  % while read; do echo $REPLY > apipe; done < infile

When I tried the above on some test data, I got about 10 broken pipes.
Also, several lines sometimes get passed through the pipe without an
intervening EOF; I'll admit I don't know the finer points of pipe/fifo
behavior when you open and close them rapidly.  That said, this also
seems to take around 4-5 seconds to run.

--
Mikael Magnusson
* Re: Using file lines as "input files"
From: Dominik Vogt @ 2022-07-10 0:42 UTC
To: Zsh Users

On Sat, Jul 09, 2022 at 04:21:37AM +0200, Mikael Magnusson wrote:
> I realized I misinterpreted the question originally, and the following
> doesn't seem to work 100%, but it was a fun idea:
>
>   % mkfifo apipe
>   % foo[265000]=''   # number of lines in the file
>   % cksum apipe$^foo # pass "apipe" to cksum 265000 times

For some mysterious reason that doesn't work with the shwordsplit
option active:

  $ foo[9]=''
  $ setopt shwordsplit
  $ echo x^foo
  x x
  $ unsetopt shwordsplit
  $ echo x^foo
  x x x x x x x x x

> (in another terminal, or with job control etc.)
>
>   % while read; do echo $REPLY > apipe; done < infile
>
> When I tried the above on some test data, I got about 10 broken pipes.
> Also, several lines sometimes get passed through the pipe without an
> intervening EOF; I'll admit I don't know the finer points of pipe/fifo
> behavior when you open and close them rapidly.

Hm, a fifo created with mkfifo is automatically blocking.  So, when
either end is opened while the other is not present, the open blocks
until the other end shows up.  The reader gets an EOF when there's no
more data and no writer has the fifo open; otherwise it waits for
more data.

1) Multiple lines processed by a single reader:

 * writer opens the fifo and blocks for a reader
 * reader opens the fifo and blocks for data
 * writer writes its data and closes the fifo
 * the next writer opens the fifo
 * the reader processes the first writer's data but gets no EOF,
   because the new writer has the fifo open
 * the new writer writes another line to the fifo and terminates
 * the reader reads the next line, gets an EOF because no writer has
   the fifo open, and terminates itself
2) SIGPIPE may be generated in this case:

 * writer opens the fifo and blocks for a reader
 * reader opens the fifo and blocks for data
 * the writer unblocks, writes its data and closes the fifo
 * the reader unblocks, consumes the data and gets an EOF
 * the next writer opens the fifo without blocking, because the reader
   has not yet closed it
 * the reader closes the fifo
 * the writer tries to write data but gets SIGPIPE, because there is
   no reader

Unfortunately, fifos have no notion of an EOF as part of the data
stream.

> That said, this also seems to take around 4-5 seconds to run.

A pity that pipes are so awkward to handle in Unix.  I like that
approach.

Ciao

Dominik ^_^  ^_^

--

Dominik Vogt
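[The single-writer case of the blocking behavior described above can be
demonstrated like this — a sketch in which the fifo name and the "hello"
payload are made up:]

```shell
# cksum blocks in open() on the fifo until the writer's ">" redirection
# opens the other end; when the writer closes, the reader sees EOF,
# prints one checksum line for the fifo, and exits.
tmp=$(mktemp -d)
mkfifo "$tmp/apipe"
cksum "$tmp/apipe" > "$tmp/result" &   # reader: blocks until a writer opens
printf 'hello\n' > "$tmp/apipe"        # writer: opens, writes, closes
wait                                   # reader has seen EOF and finished
out=$(cat "$tmp/result")
printf '%s\n' "$out"
rm -r "$tmp"
```

The racy SIGPIPE scenario above is timing-dependent and not reproduced here;
with exactly one writer and one reader the pairing of opens is deterministic.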
* Re: Using file lines as "input files"
From: Dominik Vogt @ 2022-07-10 0:45 UTC
To: Zsh Users

On Sun, Jul 10, 2022 at 01:42:10AM +0100, Dominik Vogt wrote:
> Hm, a fifo created with mkfifo is automatically blocking.

Um, whether an open blocks is determined by the process opening the
fifo, of course, but redirection with ">" is probably always blocking,
and cksum has no reason to set O_NONBLOCK.

Ciao

Dominik ^_^  ^_^

--

Dominik Vogt
* Re: Using file lines as "input files"
From: Bart Schaefer @ 2022-07-10 3:27 UTC
To: Zsh Users

On Sat, Jul 9, 2022 at 5:42 PM Dominik Vogt <dominik.vogt@gmx.de> wrote:
>
>   $ foo[9]=''
>   $ setopt shwordsplit
>   $ echo x^foo
>   x x

Er, missing a $ there?  Anyway ...

  % foo[9]=z
  Macadamia% print -l x${^=foo}y
  xy
  xzy

The leading "nonexistent" elements become one word, and the remaining
element with a value becomes another.  I'm not sure why that happens.
* Re: Using file lines as "input files"
From: Bart Schaefer @ 2022-07-10 17:49 UTC
To: Zsh Users

On Sat, Jul 9, 2022 at 8:27 PM Bart Schaefer <schaefer@brasslantern.com> wrote:
>
> The leading "nonexistent" elements become one word, and the remaining
> element with a value becomes another.  I'm not sure why that happens.

In fact, any series of "unset" elements becomes an (empty) word.

  % foo[9]=x
  % foo[18]=''
  % unset foo\[18\]
  % print $#foo
  18
  % setopt shwordsplit
  % print a${^foo}b
  ab
  axb
  ab