line overhead

edbrowse-dev - development list for edbrowse
 help / color / mirror / Atom feed

* line overhead
@ 2022-09-10 20:21 Karl Dahlke
  0 siblings, 0 replies; only message in thread
From: Karl Dahlke @ 2022-09-10 20:21 UTC (permalink / raw)
  To: edbrowse-dev

One of the driving factors for large or even medium files is the 
overhead per line. 
Let's step back and do some math, ok that's too rigorous, let's do some 
estimating.

Lines are variable length, from one byte to thousands, so they have to 
be allocated, and there is probably no way to optimize or customize the 
allocation process. 
In other words, no special circumstance, and I would be pretty dog gone 
arrogant to think I could do better than malloc, which has been refined 
over the past 45 years by the best minds in computer science. 
So, each line has a malloc overhead. 
What is it? 
I looked on the internet and there is no clear answer. 
I'm gonna guess 16 bytes. 
Could be more or less. 
Also chunks are 8 bytes aligned, so if your line is 41 bytes long 
you're going to get a slot of length 48. 
That's an average of another 4 bytes per line. 
Then there's my representation in edbrowse. 
In current edbrowse, pointers to lines are stored in an array, so a 
million line file has an array of a million pointers pointing to the 
million lines. 
Simple enough. 
That adds 4 bytes per line, or 8 bytes per line for 64 bit pointers. 
In linklist edbrowse, lines are in a linked list and that means two 
pointers, next and previous, you know the drill. 
So to compare, each line has 28 bytes overhead in one version of 
edbrowse, 36 bytes overhead in the other. 
An empty line, one byte, could consume 40 bytes of ram in linklist 
edbrowse. 
That weighs in favor of linear edbrowse, though not heavily, not a huge 
difference.

Performance also has many tradeoffs.
Something like g/re/ .m-2 is *way* more efficient in linklist.
Each matching line: change some pointers and move it two lines back.
That is a quadratic explosion in linear edbrowse.
Try this:

r !seq 400000
g/7$/ .m-2

It's 2 minutes 53 second in linear edbrowse, 1 second in linklist. 
and the former is quadratic in time for larger files. Twice as big 4 
times as slow etc.

However, if your file has 20 million lines and you ask for line 11382930 
there is nothing to do but start at 1 and step through all the links 
and count until you find the line. 
I do tricks like remembering where dot is, and the last line displayed, 
so - just steps back one line, sure, but those are tricks and still 
random access can be slow.

The real question is can we reduce overhead, and I have found no 
practical way to do so. 
Store 16 lines per allocated chunk? 
Tempting, but becomes a nightmare when you delete a line or move a line 
up or down in the buffer etc. 
Point to lines on disk by off_t, don't take them into memory unless you 
are changing them. 
Tempting, but mini disk reads are a lot of overhead, and it doesn't 
really save much space, since we still have all those linklist pointers 
and such.

This is a great exercise in memory and performance optimization, and 
kinda fun, but I'm not making much progress, 
since lines in a file can be just anything. 
There's not much to customize or take advantage of here.

Karl Dahlke

^ permalink raw reply	[flat|nested] only message in thread

only message in thread, other threads:[~2022-09-10 20:21 UTC | newest]

Thread overview: (only message) (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-10 20:21 line overhead Karl Dahlke

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).