* line overhead
@ 2022-09-10 20:21 Karl Dahlke
0 siblings, 0 replies; only message in thread
From: Karl Dahlke @ 2022-09-10 20:21 UTC (permalink / raw)
To: edbrowse-dev
One of the driving factors for large or even medium files is the
overhead per line.
Let's step back and do some math, ok that's too rigorous, let's do some
estimating.
Lines are variable length, from one byte to thousands, so they have to
be allocated, and there is probably no way to optimize or customize the
allocation process.
In other words, no special circumstance, and I would be pretty dog gone
arrogant to think I could do better than malloc, which has been refined
over the past 45 years by the best minds in computer science.
So, each line has a malloc overhead.
What is it?
I looked on the internet and there is no clear answer.
I'm gonna guess 16 bytes.
Could be more or less.
Also chunks are 8 bytes aligned, so if your line is 41 bytes long
you're going to get a slot of length 48.
That's an average of another 4 bytes per line.
Then there's my representation in edbrowse.
In current edbrowse, pointers to lines are stored in an array, so a
million line file has an array of a million pointers pointing to the
million lines.
Simple enough.
That adds 4 bytes per line, or 8 bytes per line for 64 bit pointers.
In linklist edbrowse, lines are in a linked list and that means two
pointers, next and previous, you know the drill.
So to compare, each line has 28 bytes overhead in one version of
edbrowse, 36 bytes overhead in the other.
An empty line, one byte, could consume 40 bytes of ram in linklist
edbrowse.
That weighs in favor of linear edbrowse, though not heavily, not a huge
difference.
Performance also has many tradeoffs.
Something like g/re/ .m-2 is *way* more efficient in linklist.
Each matching line: change some pointers and move it two lines back.
That is a quadratic explosion in linear edbrowse.
Try this:
r !seq 400000
g/7$/ .m-2
It's 2 minutes 53 second in linear edbrowse, 1 second in linklist.
and the former is quadratic in time for larger files. Twice as big 4
times as slow etc.
However, if your file has 20 million lines and you ask for line 11382930
there is nothing to do but start at 1 and step through all the links
and count until you find the line.
I do tricks like remembering where dot is, and the last line displayed,
so - just steps back one line, sure, but those are tricks and still
random access can be slow.
The real question is can we reduce overhead, and I have found no
practical way to do so.
Store 16 lines per allocated chunk?
Tempting, but becomes a nightmare when you delete a line or move a line
up or down in the buffer etc.
Point to lines on disk by off_t, don't take them into memory unless you
are changing them.
Tempting, but mini disk reads are a lot of overhead, and it doesn't
really save much space, since we still have all those linklist pointers
and such.
This is a great exercise in memory and performance optimization, and
kinda fun, but I'm not making much progress,
since lines in a file can be just anything.
There's not much to customize or take advantage of here.
Karl Dahlke
^ permalink raw reply [flat|nested] only message in thread
only message in thread, other threads:[~2022-09-10 20:21 UTC | newest]
Thread overview: (only message) (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-10 20:21 line overhead Karl Dahlke
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).