From: Roman Perepelitsa
Date: Sun, 14 Jan 2024 11:34:00 +0100
Subject: Re: Slurping a file (was: more spllitting travails)
To: Bart Schaefer
Cc: Zsh Users
X-Seq: 29472

On Sat, Jan 13, 2024 at 9:02 PM Bart Schaefer wrote:
>
> On Fri, Jan 12, 2024 at 9:39 PM Roman Perepelitsa
> wrote:
> >
> > The standard trick here is to print an extra character after the
> > content of the file and then remove it. This works when capturing
> > stdout of commands, too.
>
> This actually led me to the best (?) solution:
>
>   IFS= read -rd '' file_content
>
> If IFS is not set, newlines are not stripped. Of course this still
> only works if the file does not contain nul bytes, the -d delimiter
> has to be something that's not in the file.

Besides being unable to read files with nul bytes, this solution has
further drawbacks:

- It's impossible to distinguish EOF from I/O error.
- It's slow when reading from non-file file descriptors.
- It's slower than the optimized sysread-based slurp (see below) for
  larger files.

Conversely, sysread-based slurp can read the full content of any file
descriptor quickly and report success if and only if it manages to
read until EOF. Its only downside is that it can be up to 2x slower
for tiny files.

> >         sysread 'content[$#content+1]' && continue
>
> You can speed this up a little by using the -c option to sysread to
> get back a count of bytes read, and accumulate that in another var to
> avoid having to re-calculate $#content on every loop.

Indeed, this would be faster but the code would still have quadratic
time complexity. Here's a version with linear time complexity:

    function slurp() {
      emulate -L zsh -o no_multibyte
      zmodload zsh/system || return
      local -a content
      local -i i
      while true; do
        # Store each chunk in its own array element.
        sysread 'content[++i]' && continue
        # Status 5 means zero bytes were read (EOF); anything else is an error.
        (( $? == 5 )) || return
        break
      done
      # Join the chunks with an empty separator.
      typeset -g REPLY=${(j::)content}
    }

(I am not certain it's linear. I've benchmarked it for files up to
512MB in size, and it is linear in practice.)

I've benchmarked read and slurp for reading files and pipes.

    emulate -L zsh -o pipe_fail -o no_multibyte
    zmodload zsh/datetime || return

    local -i i len

    # Run the command in $1 and print the elapsed time in microseconds.
    function bench() {
      local REPLY
      local -F start end
      start=EPOCHREALTIME
      eval $1
      end=EPOCHREALTIME
      # Fail unless the whole file ended up in REPLY.
      (( $#REPLY == len )) || return
      printf ' %10d' '1e6 * (end - start)' || return
    }

    printf '%2s %7s %10s %10s %10s %10s\n' \
      n size read-file slurp-file read-pipe slurp-pipe || return

    for ((i = 1; i != 26; ++i)); do
      len='i == 1 ? 0 : 1 << (i - 2)'
      head -c $len $i || return
      <$i >/dev/null || return

      printf '%2d %7d' i len || return

      # read-file
      bench 'IFS= read -rd "" <$i' || return
      # slurp-file
      bench 'slurp <$i || return' || return
      # read-pipe
      bench '<$i | IFS= read -rd ""' || return
      # slurp-pipe
      bench '<$i | slurp || return' || return

      print || return
    done

Here's the output (best viewed with a fixed-width font):

 n    size  read-file slurp-file  read-pipe slurp-pipe
 1       0         74        107       1908       2068
 2       1         52        126       2182       1931
 3       2         52        111       1863       2471
 4       4         65        150       2097       2028
 5       8         58        159       1849       2073
 6      16         61        118       1934       2089
 7      32         73        123       1867       2235
 8      64         73        120       2067       2033
 9     128        102        122       1904       2172
10     256        129        115       2025       2114
11     512        254        123       2070       2089
12    1024        372        137       2441       2190
13    2048        762        156       2624       2132
14    4096       1306        177       3488       2500
15    8192       2486        263       4446       2540
16   16384       4718        390       6565       3140
17   32768      13919        953      13524       4323
18   65536      20965       1195      21532       5195
19  131072      41741       2124     127089      11325
20  262144      81777       4214     461189      12515
21  524288     161077       8342    1068388      21149
22 1048576     312015      16330    2321501      37422
23 2097152     606270      31752    4773261      67625
24 4194304    1291121      61298   10253544     154340
25 8388608    2534093     135694   19551480     264041

The second column is the file size, ranging from 0 to 8MB. After that
we have four columns listing the amount of time it takes to read the
file in various ways, in microseconds.

Observations from the data:

- All routines appear to have linear time complexity.
- For small files, read is up to twice as fast as slurp.
- For files over 256 bytes in size, slurp is faster.
- With slurp, reading from a pipe takes about 2x as long as reading
  from a file. With read, the penalty is 8x.
- For an 8MB file, slurp is 20 times faster than read when reading
  from a file, and 70 times faster when reading from a pipe.

I am tempted to declare slurp the winner here.

Roman.
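
For reference, the "extra character" trick quoted at the top of this
message can be sketched roughly as below. This is only an illustration
and not code from the thread: the helper name slurp_cmdsubst and the
temporary test file are made up for the example, and it assumes nothing
beyond cat, printf and mktemp being available. It should behave the same
in zsh and in other POSIX-style shells.

```shell
# Hypothetical helper (not from the thread): capture the exact content
# of the file named in $1 in REPLY, including any trailing newlines.
slurp_cmdsubst() {
  # $(...) strips trailing newlines, so append a sentinel "x" that
  # shields them. If cat fails, the sentinel is skipped and we return
  # cat's non-zero status.
  REPLY=$(cat -- "$1" && printf x) || return
  # Drop the sentinel again.
  REPLY=${REPLY%x}
}

# Demo on a 14-byte file that ends in three newlines.
tmp=$(mktemp)
printf 'line1\nline2\n\n\n' >"$tmp"

plain=$(cat "$tmp")      # plain command substitution loses the newlines
slurp_cmdsubst "$tmp"    # the sentinel version keeps them
exact=$REPLY

printf 'plain: %d bytes, exact: %d bytes\n' "${#plain}" "${#exact}"
rm -f "$tmp"
```

This should print "plain: 11 bytes, exact: 14 bytes". Unlike the
read-based one-liner, a failure of cat is reported through the return
status. The sketch was not part of the benchmark above, so nothing is
claimed here about how its speed compares to read or slurp.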