* Using file lines as "input files"
From: Dominik Vogt @ 2022-07-08 20:58 UTC (9+ messages in thread)
To: Zsh Users

Okay, there's this script that calculates a checksum on each line of a
file by reading each line and passing it to cksum/md5sum/shasum etc.:

  cat "$INFILE" | while read LINE; do
      echo "$LINE" | cksum
  done

This takes about four minutes on a file with 265,000 lines because of
the per-line program call overhead.

--

Disclaimer: I _know_ this can be done in seconds with perl / python,
but I like to not rely on scripting languages when the shell can do
the job.

--

So, would it be possible to pass each line in "$INFILE" as a file
argument to "cksum", i.e.

  $ cksum Fline1 Fline2 Fline3 ... Fline265000

(Of course without actually splitting the input file - the point is to
get rid of the four-minute wait, not to generate more bottlenecks.)
And there's the open file descriptor limit of 1024, too. :-)

Ciao

Dominik ^_^  ^_^

--

Dominik Vogt
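[For reference, the loop above can be reproduced on toy data like this; the
file name and the three sample lines are stand-ins for the real $INFILE, not
from the original message:]

```shell
# Same shape as the quoted loop: every input line forks a fresh cksum
# process.  Multiply that per-line fork/exec cost by 265,000 lines and
# the four-minute runtime follows.
tmp=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$tmp/infile"
sums=$(while IFS= read -r LINE; do
    printf '%s\n' "$LINE" | cksum
done < "$tmp/infile")
printf '%s\n' "$sums"    # one "CRC bytecount" pair per input line
rm -r "$tmp"
```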
* Re: Using file lines as "input files"
From: Mikael Magnusson @ 2022-07-08 21:58 UTC
To: dominik.vogt, Zsh Users

On 7/8/22, Dominik Vogt <dominik.vogt@gmx.de> wrote:
> So, would it be possible to pass each line in "$INFILE" as a file
> argument to "cksum", i.e.
>
> $ cksum Fline1 Fline2 Fline3 ... Fline265000

Assuming that the above command works, you should be able to do

  % cksum ${(f)"$(<$INFILE)"}

Depending on your kernel this may be too long a command line; you can
use zargs (or xargs) to work around that if needed, or if you're on
Linux you may be able to increase your stack size with ulimit -s, e.g.

  % ulimit -s 132768

(my default is 8192, which is not enough for the given example)

--
Mikael Magnusson
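[As a concrete illustration of the xargs workaround mentioned above — a sketch
on two made-up files; `-d '\n'` is GNU xargs and treats each input line as one
literal argument:]

```shell
# Feed file names one per line to xargs, which batches them into as few
# cksum invocations as the kernel's argument-size limit allows, instead
# of passing all names on a single command line.
tmp=$(mktemp -d)
printf 'a\n' > "$tmp/f1"
printf 'b\n' > "$tmp/f2"
sums=$(printf '%s\n' "$tmp/f1" "$tmp/f2" | xargs -d '\n' cksum)
printf '%s\n' "$sums"    # one output line per file
rm -r "$tmp"
```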
* Re: Using file lines as "input files"
From: Bart Schaefer @ 2022-07-08 22:04 UTC
To: dominik.vogt, Zsh Users

On Fri, Jul 8, 2022 at 1:58 PM Dominik Vogt <dominik.vogt@gmx.de> wrote:
>
> Disclaimer: I _know_ this can be done in seconds with perl / python,
> but I like to not rely on scripting languages when the shell can do
> the job.

This is sort of like saying "I like to not rely on hiking boots when
shoes can do the job."

> $ cksum Fline1 Fline2 Fline3 ... Fline265000
>
> (Of course without actually splitting the input file

If "not actually splitting" means what it seems to mean, and you
literally want to run cksum, the answer is no.  The things on the
cksum command line have to be file names, so you'd have to create a
file for each line of the original input.  The other option would be
to write the CRC algorithm as a shell function.
* Re: Using file lines as "input files"
From: Dominik Vogt @ 2022-07-08 23:17 UTC
To: Zsh Users

On Fri, Jul 08, 2022 at 03:04:31PM -0700, Bart Schaefer wrote:
> On Fri, Jul 8, 2022 at 1:58 PM Dominik Vogt <dominik.vogt@gmx.de> wrote:
> >
> > Disclaimer: I _know_ this can be done in seconds with perl / python,
> > but I like to not rely on scripting languages when the shell can do
> > the job.
>
> This is sort of like saying "I like to not rely on hiking boots when
> shoes can do the job."

Actually, for me, scripting languages are the "shoes" because they
don't interact very well with the command pipeline, unless you spend
an absurd amount of work to make them do so.  Calling commands for
everything can be slower, but most of the time slowness is just a
symptom of bad scripting.  GNU coreutils are faster than anything I'll
ever be willing to code (or any perl or python script or C or C++
library, for that matter).  The trick is keeping the process spawning
overhead low.

> > $ cksum Fline1 Fline2 Fline3 ... Fline265000
> >
> > (Of course without actually splitting the input file
>
> If "not actually splitting" means what it seems to mean, and you
> literally want to run cksum, the answer is no.

Right.

This does the job pretty well, relying entirely on existing Unix
tools:

  ulimit -s 100000
  split -l 1 "$INPUTF" ff
  cksum ff*
  rm ff*

That cuts the runtime down to seven seconds instead of four minutes,
at the cost of a few hundred MB on the RAM disk.  Splitting the
source file and removing the fragments takes about three to four
seconds.

Thanks for the comments, which put me on the right track.

--

(I prefer to have a huge stack size anyway, to be able to do things
like "grep foobar **/*(.)".)

Ciao

Dominik ^_^  ^_^

--

Dominik Vogt
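[The recipe above, run end to end on toy data — a sketch in which "input" and
the three sample lines stand in for $INPUTF, and GNU split's default
two-letter suffixes are assumed:]

```shell
# One temporary file per input line via split, then a single cksum
# invocation over all fragments -- no per-line process spawn left.
tmp=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$tmp/input"
cd "$tmp"
split -l 1 input ff   # creates ffaa, ffab, ffac
sums=$(cksum ff*)     # one process checksums every fragment
printf '%s\n' "$sums"
rm ff*
```

Note that each fragment keeps its line's trailing newline, so the checksums
match what `echo "$LINE" | cksum` produced in the original per-line loop.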
* Re: Using file lines as "input files"
From: Mikael Magnusson @ 2022-07-09 2:21 UTC
To: dominik.vogt, Zsh Users

On 7/9/22, Dominik Vogt <dominik.vogt@gmx.de> wrote:
> Actually, for me, scripting languages are the "shoes" because they
> don't interact very well with the command pipeline, unless you spend
> an absurd amount of work to make them do so.
>
> This does the job pretty well, relying entirely on existing Unix
> tools:
>
>   ulimit -s 100000
>   split -l 1 "$INPUTF" ff
>   cksum ff*
>   rm ff*
>
> That cuts the runtime down to seven seconds instead of four minutes,
> at the cost of a few hundred MB on the RAM disk.
>
> Thanks for the comments, which put me on the right track.
>
> (I prefer to have a huge stack size anyway, to be able to do things
> like "grep foobar **/*(.)".)

I realized I misinterpreted the question originally, and the following
doesn't seem to work 100%, but it was a fun idea:

  % mkfifo apipe
  % foo[265000]=''   # number of lines in the file
  % cksum apipe$^foo # pass "apipe" to cksum 265,000 times

(in another terminal, or with job control etc.)

  % while read; do echo $REPLY > apipe; done < infile

When I tried the above on some test data, I got about 10 broken pipes.
Also, several lines sometimes get passed through the pipe without an
intervening EOF; I'll admit I don't know the finer points of pipe/fifo
behavior when you open and close them rapidly.  That said, this also
seems to take around 4-5 seconds to run.

--
Mikael Magnusson
* Re: Using file lines as "input files"
From: Dominik Vogt @ 2022-07-10 0:42 UTC
To: Zsh Users

On Sat, Jul 09, 2022 at 04:21:37AM +0200, Mikael Magnusson wrote:
> I realized I misinterpreted the question originally, and the following
> doesn't seem to work 100%, but it was a fun idea:
>
>   % mkfifo apipe
>   % foo[265000]=''   # number of lines in the file
>   % cksum apipe$^foo # pass "apipe" to cksum 265000 times

For some mysterious reason that doesn't work with the shwordsplit
option active:

  $ foo[9]=''
  $ setopt shwordsplit
  $ echo x^foo
  x x
  $ unsetopt shwordsplit
  $ echo x^foo
  x x x x x x x x x

> (in another terminal, or with job control etc.)
>
>   % while read; do echo $REPLY > apipe; done < infile
>
> When I tried the above on some test data, I got about 10 broken pipes.
> Also, several lines sometimes get passed through the pipe without an
> intervening EOF; I'll admit I don't know the finer points of pipe/fifo
> behavior when you open and close them rapidly.

Hm, a fifo created with mkfifo is automatically blocking.  So, when
either end is opened while the other is not present, the open blocks
until the other end shows up.  The reader gets an EOF when there's no
more data and no writer has the fifo open; otherwise it waits for
more data.

1) Multiple lines processed by a single reader:

 * writer opens the fifo and blocks for a reader
 * reader opens the fifo and blocks for data
 * writer writes its data and closes the fifo
 * the next writer opens the fifo
 * the reader processes the first writer's data but gets no EOF,
   because the new writer has the fifo open
 * the new writer writes another line to the fifo and terminates
 * the reader reads the next line, gets an EOF because no writer has
   the fifo open, and terminates itself
2) SIGPIPE may be generated in this case:

 * writer opens the fifo and blocks for a reader
 * reader opens the fifo and blocks for data
 * the writer unblocks, writes its data and closes the fifo
 * the reader unblocks, consumes the data and gets an EOF
 * the next writer opens the fifo without blocking, because the reader
   has not yet closed it
 * the reader closes the fifo
 * the writer tries to write data but gets SIGPIPE, because there is
   no reader

Unfortunately, fifos have no notion of an EOF as part of the data
stream.

> That said, this also seems to take around 4-5 seconds to run.

A pity that pipes are so awkward to handle in Unix.  I like that
approach.

Ciao

Dominik ^_^  ^_^

--

Dominik Vogt
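[The single-writer case of the blocking behavior described above can be
demonstrated like this — a sketch in which the fifo name and the "hello"
payload are made up:]

```shell
# cksum blocks in open() on the fifo until the writer's ">" redirection
# opens the other end; when the writer closes, the reader sees EOF,
# prints one checksum line for the fifo, and exits.
tmp=$(mktemp -d)
mkfifo "$tmp/apipe"
cksum "$tmp/apipe" > "$tmp/result" &   # reader: blocks until a writer opens
printf 'hello\n' > "$tmp/apipe"        # writer: opens, writes, closes
wait                                   # reader has seen EOF and finished
out=$(cat "$tmp/result")
printf '%s\n' "$out"
rm -r "$tmp"
```

The racy SIGPIPE scenario above is timing-dependent and not reproduced here;
with exactly one writer and one reader the pairing of opens is deterministic.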
* Re: Using file lines as "input files"
From: Dominik Vogt @ 2022-07-10 0:45 UTC
To: Zsh Users

On Sun, Jul 10, 2022 at 01:42:10AM +0100, Dominik Vogt wrote:
> Hm, a fifo created with mkfifo is automatically blocking.

Um, whether an open blocks is determined by the process opening the
fifo, of course, but redirection with ">" is probably always blocking,
and cksum has no reason to set O_NONBLOCK.

Ciao

Dominik ^_^  ^_^

--

Dominik Vogt
* Re: Using file lines as "input files"
From: Bart Schaefer @ 2022-07-10 3:27 UTC
To: Zsh Users

On Sat, Jul 9, 2022 at 5:42 PM Dominik Vogt <dominik.vogt@gmx.de> wrote:
>
>   $ foo[9]=''
>   $ setopt shwordsplit
>   $ echo x^foo
>   x x

Er, missing a $ there?  Anyway ...

  % foo[9]=z
  Macadamia% print -l x${^=foo}y
  xy
  xzy

The leading "nonexistent" elements become one word, and the remaining
element with a value becomes another.  I'm not sure why that happens.
* Re: Using file lines as "input files"
From: Bart Schaefer @ 2022-07-10 17:49 UTC
To: Zsh Users

On Sat, Jul 9, 2022 at 8:27 PM Bart Schaefer <schaefer@brasslantern.com> wrote:
>
> The leading "nonexistent" elements become one word, and the remaining
> element with a value becomes another.  I'm not sure why that happens.

In fact, any series of "unset" elements becomes an (empty) word.

  % foo[9]=x
  % foo[18]=''
  % unset foo\[18\]
  % print $#foo
  18
  % setopt shwordsplit
  % print a${^foo}b
  ab
  axb
  ab