* An idea for fast "last-N-lines" read
@ 2017-03-21  6:04 Sebastian Gniazdowski
  2017-03-23  3:53 ` Bart Schaefer

From: Sebastian Gniazdowski @ 2017-03-21  6:04 UTC (permalink / raw)
To: zsh-workers

Hello
I read somewhere that to read the "last N lines" of a file it is good
to memory-map it. This cannot be verified with Zsh:

    for (( i=size; i>=1; --i )); do
        if [[ ${${mapfile[input.db]}[i]} = $'\n' ]]; then
            echo Got newline / $SECONDS
    ...

This gives:

    Got newline / 0.1383100000
    Got newline / 16.0876810000
    Got newline / 26.8089250000

for a 2 MB file – apparently because it memory-maps the file on each
newline check.

So the idea is to add such a feature. It would allow running Zsh on
machines where e.g. a periodic random check of the last 1000 lines of
gigabyte-sized logs is needed. It's possible that even Perl doesn't
have this.

I'm thinking about $(<10<filepath) syntax, which would return a buffer
holding the 10 last lines, to be split with (@f). Or maybe
$(<10L<filepath) for lines, and $(<10<filepath) for bytes. Maybe it's
easy to add? Otherwise, an extension to zsh/mapfile could be added, or
a new module written.

BTW, (@f) skips trailing \n\n... That's quite problematic and there's
probably no workaround?

--
Sebastian Gniazdowski
psprint3@fastmail.com
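For comparison, the effect the proposal is after can be sketched with standard tools: seek close to EOF first, and only then look for newlines. This is a hypothetical helper, not anything in zsh; the 256-byte average line length is an arbitrary assumption, as is the function name.

```shell
#!/bin/sh
# Sketch (not zsh-specific code): approximate "last N lines" without
# reading the whole file, by first seeking to a byte-sized tail.
# The 256-byte average line length is an assumed heuristic.
last_n_lines() {
    file=$1 n=$2 avg=256
    # tail -c lseek()s toward EOF, so only about n*avg bytes are read,
    # regardless of total file size; the second tail trims to N lines.
    tail -c $(( n * avg )) "$file" | tail -n "$n"
}
```

If the guess of average line length is too small, fewer than N lines come back; a real implementation would retry with a larger byte count.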
* Re: An idea for fast "last-N-lines" read

From: Bart Schaefer @ 2017-03-23  3:53 UTC (permalink / raw)
To: Sebastian Gniazdowski, zsh-workers

On Mar 20, 11:04pm, Sebastian Gniazdowski wrote:
}
} I read somewhere that to read "last-N-lines" it is good to memory-map
} the file. Cannot check with Zsh [...]
} - apparently because it memory-maps the file on each newline check.

Indeed, the mapfile module doesn't help much after the initial file
read, because zsh has no mechanism for holding a reference to the
mapped block of memory.  It's mostly for quickly copying the entire
file into and out of regular heap.

This could, however, be made a lot better, e.g. by introducing a cache
of mapped files into mapfile.c and causing get_contents() to first use
the cache (and setpmmapfile to update it, unsetpmmapfile to erase an
entry from it) before resorting to remapping the actual file.

} I'm thinking about: $(<10<filepath) syntax

I'm not thrilled about adding new syntax for this, and anyway it
conflicts with multios semantics.

} BTW, (@f) skips trailing \n\n... That's quite problematic and there's
} probably no workaround?

In double quotes, (@f) retains empty elements, which includes making
empty elements out of trailing newlines.  However, there is no way to
get $(<file) [or $(<<<string) etc.] to retain trailing newlines, which
is most likely what's misleading you.
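The conflict Bart mentions can be seen directly: a number in front of `<` already has a meaning in shell redirection, so `$(<10<filepath)` would be ambiguous with it. A minimal demonstration of the existing semantics (paths and names here are just for illustration):

```shell
#!/bin/sh
# A number before "<" already selects a file descriptor, in zsh as in
# other POSIX-ish shells, which is what "<10<filepath" would collide with.
tmp=$(mktemp)
echo "hello" > "$tmp"
exec 10< "$tmp"      # "10<" opens $tmp for reading on fd 10
read -r line <&10    # read a line from fd 10
exec 10<&-           # close fd 10
echo "$line"
```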
[parent not found: <etPan.58d39baa.74b0dc51.10ab3@MacMini.local>]
* Re: An idea for fast "last-N-lines" read

From: Sebastian Gniazdowski @ 2017-03-23 10:17 UTC (permalink / raw)
To: zsh-workers

On 23 March 2017 at 05:05:33, Bart Schaefer (schaefer@brasslantern.com)
wrote:
> This could, however, be made a lot better, e.g. by introducing a cache
> of mapped files into mapfile.c and causing get_contents() to first use
> the cache (and setpmmapfile to update it, unsetpmmapfile to erase an
> entry from it) before resorting to remapping the actual file.

I think this should be done (I might get to it too at some point).

As for the "last-N-lines" use case, one could do
${${mapfile[name]}[-250000,-1]}, assuming the average line length is
computed to be 250, so that the last 250000 characters can be expected
to hold the needed 1000 lines.

Tried this on some log that I aggregate:

    % lines=(); typeset -F SECONDS=0; lines=( "${(@f)${${mapfile[input.db]}[-250000,-1]}}" ); echo $SECONDS, ${#lines}
    0.3925410000, 1536

Doing this the traditional way:

    % lines=(); typeset -F SECONDS=0; lines=( ${"${(@f)"$(<input.db)"}"[-1000,-1]} ); echo $SECONDS, ${#lines}
    0.8828770000, 1000

Without slicing:

    % lines=(); typeset -F SECONDS=0; lines=( "${(@f)"$(<input.db)"}" ); echo $SECONDS, ${#lines}
    0.8219170000, 38707

I would expect mapfile to perform a little better. For an input file
with 12500 lines, the result is 0.1625320000, 1151. So the time rises
considerably, from 162 ms to 400 ms, when the input file is larger,
whereas it could be roughly constant. It looks like [-250000,-1] does
a typical multibyte string iteration over the buffer to establish the
offset. If one would instead just take the last 250000 bytes, and do
this at the mapfile level, it would be constant time. Not sure whether
Unicode could be badly broken for the whole buffer this way; judging
from how Zsh handles Unicode, only the first few characters could end
up broken.
When looking for the last N lines, I also wonder how '\n' should be
handled while reading a Unicode buffer in reverse.

> } BTW, (@f) skips trailing \n\n... That's quite problematic and there's
> } probably no workaround?
>
> In double quotes, (@f) retains empty elements, which includes making
> empty elements out of trailing newlines.  However, there is no way to
> get $(<file) [or $(<<<string) etc.] to retain trailing newlines, which
> is most likely what's misleading you.

Ah, thanks. I wonder how sysread would perform, and what about
metafication when using it.

--
Sebastian Gniazdowski
psprint [at] zdharma.org
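The "take the last K bytes, then repair at the cut" idea discussed above can be sketched with standard tools. A byte-based tail may split a multibyte character (or a line) at the cut point, so one simple repair is to drop the first, possibly partial, line; the helper name is hypothetical.

```shell
#!/bin/sh
# Sketch: grab the last $bytes bytes, then discard the first line of
# the result, since the cut may have landed mid-line or mid-character.
# Any character split at the byte boundary goes away with that line.
tail_bytes_whole_lines() {
    file=$1 bytes=$2
    tail -c "$bytes" "$file" | sed '1d'
}
```

This trades losing at most one line for never emitting a torn line or a broken multibyte sequence at the start of the output.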
* Re: An idea for fast "last-N-lines" read

From: Bart Schaefer @ 2017-03-23 16:46 UTC (permalink / raw)
To: zsh-workers

On Mar 23, 11:17am, Sebastian Gniazdowski wrote:
} Subject: Re: An idea for fast "last-N-lines" read
}
} % lines=(); typeset -F SECONDS=0; lines=( "${(@f)${${mapfile[input.db]}[-250000,-1]}}" ); echo $SECONDS, ${#lines}
} 0.3925410000, 1536
}
} I would expect mapfile to perform a little better.

The whole file has to get metafied by heap copy before subscripting
can be applied ... mapfile was a lot more efficient before we began to
need to store parameters in metafied state.  Other parameter ops are
also going to do pass-by-value even if the base reference is mmap'd,
so there will be some constructs where the whole file is copied over
and over.  Nothing to be done about that without a full rewrite of
subst.c ...

} I wonder how sysread would perform, and what about metafication when
} using it.

Metafication should be OK; it makes a metafied heap copy just as
mapfile does.  If you first "sysseek -w end -25000" and then
"sysread -s 25000" it should be quite fast, but reading an entire
large file may be slower than with mapfile.
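Bart's sysseek/sysread recipe, expressed as a sketch with standard tools rather than the zsh/system builtins (the 25000-byte count is illustrative, and the helper name is made up): compute EOF minus the count, then read from that offset onward.

```shell
#!/bin/sh
# Analogue of 'sysseek -w end -COUNT; sysread -s COUNT': seek to
# size-COUNT and read the remainder.  bs=1 keeps the seek byte-exact;
# it is slow, but fine for a demonstration of the access pattern.
read_tail() {
    file=$1 count=$2
    size=$(wc -c < "$file")
    skip=$(( size > count ? size - count : 0 ))
    dd if="$file" bs=1 skip="$skip" 2>/dev/null
}
```

The key point carries over: only the final `count` bytes are ever read, so the cost is independent of total file size.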
Code repositories for project(s) associated with this public inbox:
https://git.vuxu.org/mirror/zsh/