zsh-users
 help / color / mirror / code / Atom feed
From: Roman Perepelitsa <roman.perepelitsa@gmail.com>
To: Bart Schaefer <schaefer@brasslantern.com>
Cc: Zsh Users <zsh-users@zsh.org>
Subject: Re: Slurping a file (was: more spllitting travails)
Date: Sun, 14 Jan 2024 11:34:00 +0100	[thread overview]
Message-ID: <CAN=4vMq=E4s2a0sDFq-Mc8=pVzPnYOM9NaTmesgXQqi+O+mHpw@mail.gmail.com> (raw)
In-Reply-To: <CAH+w=7aT-gbt7PRo=uvPK5=+rR3X-PE7nEssOkh+=fxwdeG_7w@mail.gmail.com>

On Sat, Jan 13, 2024 at 9:02 PM Bart Schaefer <schaefer@brasslantern.com> wrote:
>
> On Fri, Jan 12, 2024 at 9:39 PM Roman Perepelitsa
> <roman.perepelitsa@gmail.com> wrote:
> >
> > The standard trick here is to print an extra character after the
> > content of the file and then remove it. This works when capturing
> > stdout of commands, too.
>
> This actually led me to the best (?) solution:
>
>   IFS= read -rd '' file_content <file
>
> If IFS is not set, newlines are not stripped.  Of course this still
> only works if the file does not contain nul bytes, the -d delimiter
> has to be something that's not in the file.

In addition to being unable to read files with nul bytes, this
solution suffers from additional drawbacks:

- It's impossible to distinguish EOF from I/O error.
- It's slow when reading from non-file file descriptors.
- It's slower than the optimized sysread-based slurp (see below) for
larger files.

Conversely, sysread-based slurp can read the full content of any file
descriptor quickly and report success if and only if it manages to
read until EOF. Its only downside is that it can be up to 2x slower
for tiny files.

> >         sysread 'content[$#content+1]' && continue
>
> You can speed this up a little by using the -c option to sysread to
> get back a count of bytes read, and accumulate that in another var to
> avoid having to re-calculate $#content on every loop.

Indeed, this would be faster but the code would still have quadratic
time complexity. Here's a version with linear time complexity:

    function slurp() {
      emulate -L zsh -o no_multibyte
      zmodload zsh/system || return
      local -a content
      local -i i
      while true; do
        sysread 'content[++i]' && continue
        (( $? == 5 )) || return
        break
      done
      typeset -g REPLY=${(j::)content}
    }

(I am not certain it's linear. I've benchmarked it for files up to
512MB in size, and it is linear in practice.)

I've benchmarked read and slurp for reading files and pipes.

    emulate -L zsh -o pipe_fail -o no_multibyte
    zmodload zsh/datetime || return

    local -i i len

    function bench() {
      local REPLY
      local -F start end
      start=EPOCHREALTIME
      eval $1
      end=EPOCHREALTIME
      (( $#REPLY == len )) || return
      printf ' %10d' '1e6 * (end - start)' || return
    }

    printf '%2s %7s %10s %10s %10s %10s\n' \
           n size read-file slurp-file read-pipe slurp-pipe || return

    for ((i = 1; i != 26; ++i)); do
      len='i == 1 ? 0 : 1 << (i - 2)'
      head -c $len </dev/urandom | tr '\0' x >$i || return
      <$i >/dev/null || return

      printf '%2d %7d' i len || return

      # read-file
      bench 'IFS= read -rd "" <$i' || return

      # slurp-file
      bench 'slurp <$i || return' || return

      # read-pipe
      bench '<$i | IFS= read -rd ""' || return

      # slurp-pipe
      bench '<$i | slurp || return' || return

      print || return
    done

Here's the output (best viewed with a fixed-width font):

     n    size  read-file slurp-file  read-pipe slurp-pipe
     1       0         74        107       1908       2068
     2       1         52        126       2182       1931
     3       2         52        111       1863       2471
     4       4         65        150       2097       2028
     5       8         58        159       1849       2073
     6      16         61        118       1934       2089
     7      32         73        123       1867       2235
     8      64         73        120       2067       2033
     9     128        102        122       1904       2172
    10     256        129        115       2025       2114
    11     512        254        123       2070       2089
    12    1024        372        137       2441       2190
    13    2048        762        156       2624       2132
    14    4096       1306        177       3488       2500
    15    8192       2486        263       4446       2540
    16   16384       4718        390       6565       3140
    17   32768      13919        953      13524       4323
    18   65536      20965       1195      21532       5195
    19  131072      41741       2124     127089      11325
    20  262144      81777       4214     461189      12515
    21  524288     161077       8342    1068388      21149
    22 1048576     312015      16330    2321501      37422
    23 2097152     606270      31752    4773261      67625
    24 4194304    1291121      61298   10253544     154340
    25 8388608    2534093     135694   19551480     264041

The second column is the file size, ranging from 0 to 8MB. After that
we have four columns listing the amount of time it takes to read the
file in various ways, in microseconds.

Observations from the data:

- All routines appear to have linear time complexity.
- For small files, read is up to twice as fast as slurp.
- For files over 256 bytes in size, slurp is faster.
- With slurp, the time it takes to read from a pipe is about 2x
  compared to reading from a file. With read, the penalty is 8x.
- For an 8MB file, slurp is 20 times faster than read when reading
  from a file, and 70 times faster when reading from a pipe.

I am tempted to declare slurp the winner here.

Roman.


  parent reply	other threads:[~2024-01-14 10:35 UTC|newest]

Thread overview: 83+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-12 19:05 more splitting travails Ray Andrews
2024-01-12 19:19 ` Bart Schaefer
2024-01-12 19:56   ` Ray Andrews
2024-01-12 20:07     ` Mark J. Reed
     [not found]   ` <CAA=-s3zc5a+PA7draaA=FmXtwU9K8RrHbb70HbQN8MhmuXTYrQ@mail.gmail.com>
2024-01-12 20:03     ` Fwd: " Bart Schaefer
2024-01-12 20:32       ` Ray Andrews
2024-01-12 20:50         ` Roman Perepelitsa
2024-01-13  2:12           ` Ray Andrews
2024-01-12 20:51         ` Bart Schaefer
2024-01-12 21:57           ` Mark J. Reed
2024-01-12 22:09             ` Bart Schaefer
2024-01-13  3:06               ` Ray Andrews
2024-01-13  3:36                 ` Ray Andrews
2024-01-13  4:07                   ` Bart Schaefer
2024-01-13  5:39               ` Roman Perepelitsa
2024-01-13 20:02                 ` Slurping a file (was: more spllitting travails) Bart Schaefer
2024-01-13 20:07                   ` Slurping a file Ray Andrews
2024-01-14  5:03                     ` zcurses mouse delay (not Re: Slurping a file) Bart Schaefer
2024-01-14  5:35                       ` Ray Andrews
2024-01-14 10:34                   ` Roman Perepelitsa [this message]
2024-01-14 10:57                     ` Slurping a file (was: more spllitting travails) Roman Perepelitsa
2024-01-14 15:36                     ` Slurping a file Ray Andrews
2024-01-14 15:41                       ` Roman Perepelitsa
2024-01-14 20:13                       ` Lawrence Velázquez
2024-01-15  0:03                         ` Ray Andrews
2024-01-15  0:55                           ` Empty element elision and associative arrays (was Re: Slurping a file) Bart Schaefer
2024-01-15  4:09                             ` Ray Andrews
2024-01-15  7:01                               ` Lawrence Velázquez
2024-01-15 14:47                                 ` Ray Andrews
2024-01-18 16:20                                 ` Mark J. Reed
2024-01-18 17:22                                   ` Ray Andrews
2024-01-18 17:36                                     ` Mark J. Reed
2024-01-18 17:55                                       ` Ray Andrews
2024-01-18 22:34                               ` Bart Schaefer
2024-01-18 23:08                                 ` Ray Andrews
2024-01-19  2:46                                   ` Bart Schaefer
2024-01-19  2:58                                     ` Ray Andrews
2024-01-19 10:27                                       ` Stephane Chazelas
2024-01-19 13:45                                         ` Mikael Magnusson
2024-01-19 14:37                                           ` Mark J. Reed
2024-01-19 14:57                                             ` Ray Andrews
2024-01-19 15:46                                               ` Mark J. Reed
2024-01-19 16:01                                                 ` Mikael Magnusson
2024-01-19 17:15                                                   ` Ray Andrews
2024-01-19 17:42                                                     ` Bart Schaefer
2024-01-19 18:45                                                       ` Ray Andrews
2024-01-14 22:09                     ` Slurping a file (was: more spllitting travails) Bart Schaefer
2024-01-15  8:53                       ` Roman Perepelitsa
2024-01-16 19:57                         ` Bart Schaefer
2024-01-16 20:07                           ` Slurping a file Ray Andrews
2024-01-16 20:14                             ` Roman Perepelitsa
2024-01-16 20:38                               ` Ray Andrews
2024-01-16 20:43                                 ` Roman Perepelitsa
2024-01-16 22:27                                   ` Ray Andrews
2024-01-15  2:00                     ` Slurping a file (was: more spllitting travails) Bart Schaefer
2024-01-15  4:24                       ` Slurping a file Ray Andrews
2024-01-15  6:56                         ` Lawrence Velázquez
2024-01-15 14:37                           ` Ray Andrews
2024-01-15 15:10                             ` Marc Chantreux
2024-01-15 15:29                               ` Mark J. Reed
2024-01-15 16:16                                 ` Marc Chantreux
2024-01-15 16:33                                   ` MUAs (was: Re: Slurping a file) zeurkous
2024-01-16  7:23                               ` Slurping a file Lawrence Velázquez
2024-01-16 14:37                                 ` Ray Andrews
2024-01-17  3:50                                   ` Lawrence Velázquez
2024-01-17  5:10                                     ` Ray Andrews
2024-01-15  7:26                       ` Slurping a file (was: more spllitting travails) Lawrence Velázquez
2024-01-15 14:48                         ` Slurping a file Ray Andrews
2024-01-15 13:13                       ` Slurping a file (was: more spllitting travails) Marc Chantreux
2024-02-10 20:48                     ` Stephane Chazelas
2024-02-11  0:59                       ` Mikael Magnusson
2024-02-11  4:49                         ` Bart Schaefer
2024-02-11  5:04                           ` Mikael Magnusson
2024-02-11  4:46                       ` Bart Schaefer
2024-02-11  5:06                         ` Mikael Magnusson
2024-02-11  7:09                         ` Stephane Chazelas
2024-01-13  2:19           ` Fwd: more splitting travails Ray Andrews
2024-01-13  3:59             ` Bart Schaefer
2024-01-13  4:54               ` Ray Andrews
2024-01-13  5:51                 ` Roman Perepelitsa
2024-01-13 16:40                   ` Ray Andrews
2024-01-13 18:22                     ` Bart Schaefer
2024-01-13 19:08                       ` Ray Andrews

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAN=4vMq=E4s2a0sDFq-Mc8=pVzPnYOM9NaTmesgXQqi+O+mHpw@mail.gmail.com' \
    --to=roman.perepelitsa@gmail.com \
    --cc=schaefer@brasslantern.com \
    --cc=zsh-users@zsh.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).