* An idea for fast "last-N-lines" read
@ 2017-03-21  6:04 Sebastian Gniazdowski
  2017-03-23  3:53 ` Bart Schaefer

From: Sebastian Gniazdowski @ 2017-03-21  6:04 UTC (permalink / raw)
To: zsh-workers

Hello
I read somewhere that to read the "last N lines" of a file it is good
to memory-map it. This cannot be verified with Zsh:

    for (( i=size; i>=1; --i )); do
        if [[ ${${mapfile[input.db]}[i]} = $'\n' ]]; then
            echo Got newline / $SECONDS
    ...

This gives:

    Got newline / 0.1383100000
    Got newline / 16.0876810000
    Got newline / 26.8089250000

for a 2 MB file – apparently because it memory-maps the file on each
newline check.

So the idea is to add such a feature. It would allow running Zsh on
machines where e.g. a periodic random check of the last 1000 lines of
gigabyte-sized logs is needed. It's possible that even Perl doesn't
have this.

I'm thinking about $(<10<filepath) syntax, which would return a buffer
holding the 10 last lines, to be split with (@f). Or maybe
$(<10L<filepath) for lines, and $(<10<filepath) for bytes. Maybe it's
easy to add? Otherwise, an extension to zsh/mapfile could be added, or
a new module written.

BTW, (@f) skips trailing \n\n... That's quite problematic and there's
probably no workaround?

--
Sebastian Gniazdowski
psprint3@fastmail.com
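For comparison, the effect the proposal is after can be sketched with standard tools: seek close to EOF first, and only then look for newlines. This is a hypothetical helper, not anything in zsh; the 256-byte average line length is an arbitrary assumption, as is the function name.

```shell
#!/bin/sh
# Sketch (not zsh-specific code): approximate "last N lines" without
# reading the whole file, by first seeking to a byte-sized tail.
# The 256-byte average line length is an assumed heuristic.
last_n_lines() {
    file=$1 n=$2 avg=256
    # tail -c lseek()s toward EOF, so only about n*avg bytes are read,
    # regardless of total file size; the second tail trims to N lines.
    tail -c $(( n * avg )) "$file" | tail -n "$n"
}
```

If the guess of average line length is too small, fewer than N lines come back; a real implementation would retry with a larger byte count.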
* Re: An idea for fast "last-N-lines" read

From: Bart Schaefer @ 2017-03-23  3:53 UTC (permalink / raw)
To: Sebastian Gniazdowski, zsh-workers

On Mar 20, 11:04pm, Sebastian Gniazdowski wrote:
}
} I read somewhere that to read "last-N-lines" it is good to memory-map
} the file. Cannot check with Zsh [...]
} - apparently because it memory-maps the file on each newline check.

Indeed, the mapfile module doesn't help much after the initial file
read, because zsh has no mechanism for holding a reference to the
mapped block of memory.  It's mostly for quickly copying the entire
file into and out of regular heap.

This could, however, be made a lot better, e.g. by introducing a cache
of mapped files into mapfile.c and causing get_contents() to first use
the cache (and setpmmapfile to update it, unsetpmmapfile to erase an
entry from it) before resorting to remapping the actual file.

} I'm thinking about: $(<10<filepath) syntax

I'm not thrilled about adding new syntax for this, and anyway it
conflicts with multios semantics.

} BTW, (@f) skips trailing \n\n... That's quite problematic and there's
} probably no workaround?

In double quotes, (@f) retains empty elements, which includes making
empty elements out of trailing newlines.  However, there is no way to
get $(<file) [or $(<<<string) etc.] to retain trailing newlines, which
is most likely what's misleading you.
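The conflict Bart mentions can be seen directly: a number in front of `<` already has a meaning in shell redirection, so `$(<10<filepath)` would be ambiguous with it. A minimal demonstration of the existing semantics (paths and names here are just for illustration):

```shell
#!/bin/sh
# A number before "<" already selects a file descriptor, in zsh as in
# other POSIX-ish shells, which is what "<10<filepath" would collide with.
tmp=$(mktemp)
echo "hello" > "$tmp"
exec 10< "$tmp"      # "10<" opens $tmp for reading on fd 10
read -r line <&10    # read a line from fd 10
exec 10<&-           # close fd 10
echo "$line"
```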
[parent not found: <etPan.58d39baa.74b0dc51.10ab3@MacMini.local>]
* Re: An idea for fast "last-N-lines" read

From: Sebastian Gniazdowski @ 2017-03-23 10:17 UTC (permalink / raw)
To: zsh-workers

On 23 March 2017 at 05:05:33, Bart Schaefer (schaefer@brasslantern.com)
wrote:
> This could, however, be made a lot better, e.g. by introducing a cache
> of mapped files into mapfile.c and causing get_contents() to first use
> the cache (and setpmmapfile to update it, unsetpmmapfile to erase an
> entry from it) before resorting to remapping the actual file.

I think this should be done (I might get to it too at some point).

As for the "last-N-lines" use case, one could do
${${mapfile[name]}[-250000,-1]}, assuming the average line length is
computed to be 250, so that the last 250000 characters can be expected
to hold the needed 1000 lines.

Tried this on some log that I aggregate:

    % lines=(); typeset -F SECONDS=0; lines=( "${(@f)${${mapfile[input.db]}[-250000,-1]}}" ); echo $SECONDS, ${#lines}
    0.3925410000, 1536

Doing this the traditional way:

    % lines=(); typeset -F SECONDS=0; lines=( ${"${(@f)"$(<input.db)"}"[-1000,-1]} ); echo $SECONDS, ${#lines}
    0.8828770000, 1000

Without slicing:

    % lines=(); typeset -F SECONDS=0; lines=( "${(@f)"$(<input.db)"}" ); echo $SECONDS, ${#lines}
    0.8219170000, 38707

I would expect mapfile to perform a little better. For an input file
with 12500 lines, the result is 0.1625320000, 1151. So the time rises
considerably, from 162 ms to 400 ms, when the input file is larger,
whereas it could be roughly constant. It looks like [-250000,-1] does
a typical multibyte string iteration over the buffer to establish the
offset. If one would instead just take the last 250000 bytes, and do
this at the mapfile level, it would be constant time. Not sure whether
Unicode could be badly broken for the whole buffer this way; judging
from how Zsh handles Unicode, only the first few characters could end
up broken.
When looking for the last N lines, I also wonder how '\n' should be
handled while reading a Unicode buffer in reverse.

> } BTW, (@f) skips trailing \n\n... That's quite problematic and there's
> } probably no workaround?
>
> In double quotes, (@f) retains empty elements, which includes making
> empty elements out of trailing newlines.  However, there is no way to
> get $(<file) [or $(<<<string) etc.] to retain trailing newlines, which
> is most likely what's misleading you.

Ah, thanks. I wonder how sysread would perform, and what about
metafication when using it.

--
Sebastian Gniazdowski
psprint [at] zdharma.org
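The "take the last K bytes, then repair at the cut" idea discussed above can be sketched with standard tools. A byte-based tail may split a multibyte character (or a line) at the cut point, so one simple repair is to drop the first, possibly partial, line; the helper name is hypothetical.

```shell
#!/bin/sh
# Sketch: grab the last $bytes bytes, then discard the first line of
# the result, since the cut may have landed mid-line or mid-character.
# Any character split at the byte boundary goes away with that line.
tail_bytes_whole_lines() {
    file=$1 bytes=$2
    tail -c "$bytes" "$file" | sed '1d'
}
```

This trades losing at most one line for never emitting a torn line or a broken multibyte sequence at the start of the output.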
* Re: An idea for fast "last-N-lines" read

From: Bart Schaefer @ 2017-03-23 16:46 UTC (permalink / raw)
To: zsh-workers

On Mar 23, 11:17am, Sebastian Gniazdowski wrote:
} Subject: Re: An idea for fast "last-N-lines" read
}
} % lines=(); typeset -F SECONDS=0; lines=( "${(@f)${${mapfile[input.db]}[-250000,-1]}}" ); echo $SECONDS, ${#lines}
} 0.3925410000, 1536
}
} I would expect mapfile to perform a little better.

The whole file has to get metafied by heap copy before subscripting
can be applied ... mapfile was a lot more efficient before we began to
need to store parameters in metafied state.  Other parameter ops are
also going to do pass-by-value even if the base reference is mmap'd,
so there will be some constructs where the whole file is copied over
and over.  Nothing to be done about that without a full rewrite of
subst.c ...

} I wonder how sysread would perform, and what about metafication when
} using it.

Metafication should be OK; it makes a metafied heap copy just as
mapfile does.  If you first "sysseek -w end -25000" and then
"sysread -s 25000" it should be quite fast, but reading an entire
large file may be slower than with mapfile.
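Bart's sysseek/sysread recipe, expressed as a sketch with standard tools rather than the zsh/system builtins (the 25000-byte count is illustrative, and the helper name is made up): compute EOF minus the count, then read from that offset onward.

```shell
#!/bin/sh
# Analogue of 'sysseek -w end -COUNT; sysread -s COUNT': seek to
# size-COUNT and read the remainder.  bs=1 keeps the seek byte-exact;
# it is slow, but fine for a demonstration of the access pattern.
read_tail() {
    file=$1 count=$2
    size=$(wc -c < "$file")
    skip=$(( size > count ? size - count : 0 ))
    dd if="$file" bs=1 skip="$skip" 2>/dev/null
}
```

The key point carries over: only the final `count` bytes are ever read, so the cost is independent of total file size.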
Code repositories for project(s) associated with this public inbox:
https://git.vuxu.org/mirror/zsh/