line overhead - Karl Dahlke

edbrowse-dev - development list for edbrowse

From: Karl Dahlke <eklhad@comcast.net>
To: edbrowse-dev@edbrowse.org
Subject: line overhead
Date: Sat, 10 Sep 2022 16:21:07 -0400
Message-ID: <20220810162107.eklhad@comcast.net>

One of the driving factors in memory use for large or even medium files 
is the overhead per line. 
Let's step back and do some math; ok, that's too rigorous, let's do some 
estimating.

Lines are variable length, from one byte to thousands, so they have to 
be allocated, and there is probably no way to optimize or customize the 
allocation process. 
In other words, no special circumstance, and I would be pretty doggone 
arrogant to think I could do better than malloc, which has been refined 
over the past 45 years by the best minds in computer science. 
So, each line has a malloc overhead. 
What is it? 
I looked on the internet and there is no clear answer. 
I'm gonna guess 16 bytes. 
Could be more or less. 
Also chunks are 8-byte aligned, so if your line is 41 bytes long 
you're going to get a slot of length 48. 
Rounding up wastes anywhere from 0 to 7 bytes, so that's an average of 
about another 4 bytes per line. 
Then there's my representation in edbrowse. 
In current edbrowse, pointers to lines are stored in an array, so a 
million line file has an array of a million pointers pointing to the 
million lines. 
Simple enough. 
That adds 4 bytes per line, or 8 bytes per line for 64-bit pointers. 
In linklist edbrowse, lines are in a linked list and that means two 
pointers, next and previous, you know the drill, so 16 bytes per line 
at 64 bits. 
So to compare, with 64-bit pointers each line has about 28 bytes of 
overhead (16 + 4 + 8) in linear edbrowse, 36 bytes (16 + 4 + 16) in 
linklist edbrowse. 
An empty line, one byte, could consume 40 bytes of RAM in linklist 
edbrowse: an 8-byte slot, 16 bytes of malloc overhead, and 16 bytes of 
pointers. 
That weighs in favor of linear edbrowse, though not heavily; it's not a 
huge difference.
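
To make the arithmetic above easy to replay with other guesses, here is 
a minimal sketch in C. The constants are the estimates from this message 
(16 bytes of malloc overhead, 8-byte alignment, 64-bit pointers), not 
measured values, and the names slot_size and line_cost are made up for 
illustration; they are not edbrowse functions.

```c
#include <stddef.h>

/* Guessed figures from the estimate above, not measurements. */
enum {
    MALLOC_HEADER = 16, /* assumed allocator bookkeeping per chunk */
    ALIGNMENT = 8,      /* assumed rounding of each chunk */
    POINTER_SIZE = 8    /* 64-bit pointers */
};

/* Bytes the allocator is assumed to hand out for a line of len bytes:
 * len rounded up to the next multiple of ALIGNMENT. */
size_t slot_size(size_t len)
{
    return (len + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
}

/* Estimated RAM for one line: header, rounded slot, per-line pointers.
 * pointers_per_line is 1 for the array version, 2 for the linked list. */
size_t line_cost(size_t len, int pointers_per_line)
{
    return MALLOC_HEADER + slot_size(len)
        + (size_t)pointers_per_line * POINTER_SIZE;
}
```

With these guesses, slot_size(41) is 48, and a one-byte line costs 
line_cost(1, 1) = 32 bytes in the array version and line_cost(1, 2) = 40 
bytes in the linked list version. Change the constants if your 
allocator behaves differently.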

Performance also has many tradeoffs.
Something like g/re/ .m-2 is *way* more efficient in linklist.
Each matching line: change some pointers and move it two lines back.
That is a quadratic explosion in linear edbrowse.
Try this:

r !seq 400000
g/7$/ .m-2

It's 2 minutes 53 seconds in linear edbrowse, 1 second in linklist, 
and the former is quadratic in time for larger files: twice as big, 4 
times as slow, etc.
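
The difference between the two moves can be sketched in C as follows. 
This is only an illustration under assumptions: struct line, 
list_move_after, and array_move are hypothetical names, the list is 
assumed to be circular with a sentinel node so prev and next are never 
NULL, and the array version assumes a move is done as a generic delete 
followed by an insert. It is not claimed to be what edbrowse's code 
does, and it won't reproduce the timings above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical node type; edbrowse's real structures may differ. */
struct line {
    struct line *prev, *next;
    const char *text;
};

/* Linked list: unlink n, then relink it right after dest.
 * A fixed number of pointer writes, whatever the buffer size.
 * Assumes a circular list with a sentinel, so no NULL checks. */
void list_move_after(struct line *n, struct line *dest)
{
    if (n == dest || dest->next == n)
        return; /* already in place */
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = dest;
    n->next = dest->next;
    dest->next->prev = n;
    dest->next = n;
}

/* Array of line pointers: move entry `from` so it lands at index `to`,
 * done as delete plus insert. Each step shifts every pointer after the
 * affected index, so one move costs time proportional to the number of
 * lines below it. Both indexes must be less than count. */
void array_move(const char **lines, size_t count, size_t from, size_t to)
{
    const char *moved = lines[from];
    /* delete: slide the tail down over the hole */
    memmove(&lines[from], &lines[from + 1],
            (count - from - 1) * sizeof *lines);
    /* insert: slide the tail up to open a slot at `to` */
    memmove(&lines[to + 1], &lines[to],
            (count - 1 - to) * sizeof *lines);
    lines[to] = moved;
}
```

Run array_move once per matching line and the shifting adds up to 
roughly matches times lines, which is where the quadratic behavior 
would come from under this assumption. list_move_after does the same 
work per match no matter how long the buffer is.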

However, in linklist, if your file has 20 million lines and you ask for 
line 11382930 there is nothing to do but start at 1 and step through 
all the links and count until you find the line. 
I do tricks like remembering where dot is, and the last line displayed, 
so the - command just steps back one line, sure, but those are tricks 
and random access can still be slow.
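
The "remember where dot is" trick can be sketched like this in C. It is 
a minimal illustration, not edbrowse's code: struct node, seek_from, and 
find_line are hypothetical names, line numbers start at 1, and the list 
is assumed to be NULL-terminated at both ends.

```c
#include <stddef.h>

/* Hypothetical list node; the line text is left out for brevity. */
struct node {
    struct node *prev, *next;
};

/* Walk from a node whose line number is known to line number target.
 * Takes |target - known_no| steps; returns NULL if the list ends first. */
struct node *seek_from(struct node *known, long known_no, long target)
{
    while (known && known_no < target) {
        known = known->next;
        known_no++;
    }
    while (known && known_no > target) {
        known = known->prev;
        known_no--;
    }
    return known;
}

/* Start from whichever anchor is closer: line 1 or the remembered dot. */
struct node *find_line(struct node *first, struct node *dot, long dot_no,
                       long target)
{
    long from_first = target - 1;
    long from_dot = target > dot_no ? target - dot_no : dot_no - target;
    if (dot && from_dot < from_first)
        return seek_from(dot, dot_no, target);
    return seek_from(first, 1, target);
}
```

This makes - and + cheap, since they are one step from dot, but a jump 
to a line far from every remembered anchor still walks the links one 
at a time.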

The real question is can we reduce overhead, and I have found no 
practical way to do so. 
Store 16 lines per allocated chunk? 
Tempting, but becomes a nightmare when you delete a line or move a line 
up or down in the buffer etc. 
Point to lines on disk by off_t, don't take them into memory unless you 
are changing them. 
Tempting, but mini disk reads are a lot of overhead, and it doesn't 
really save much space, since we still have all those linklist pointers 
and such.

This is a great exercise in memory and performance optimization, and 
kinda fun, but I'm not making much progress, 
since lines in a file can be just anything. 
There's not much to customize or take advantage of here.

Karl Dahlke
