From: Karl Dahlke <firstname.lastname@example.org>
Subject: line overhead
Date: Sat, 10 Sep 2022 16:21:07 -0400
Message-ID: <email@example.com>
One of the driving factors for large or even medium files is the
overhead per line.
Let's step back and do some math; ok, that's too rigorous, let's do some estimating.
Lines are variable length, from one byte to thousands, so they have to
be allocated individually, and there is probably no way to optimize or customize the allocator.
In other words, no special circumstance, and I would be pretty doggone
arrogant to think I could do better than malloc, which has been refined
over the past 45 years by the best minds in computer science.
So, each line has a malloc overhead.
What is it?
I looked on the internet and there is no clear answer.
I'm gonna guess 16 bytes.
Could be more or less.
Also, chunks are 8-byte aligned, so if your line is 41 bytes long
you're going to get a slot of length 48.
That's an average of another 4 bytes per line.
Then there's my representation in edbrowse.
In current edbrowse, pointers to lines are stored in an array, so a
million line file has an array of a million pointers pointing to the lines.
That adds 4 bytes per line, or 8 bytes per line for 64-bit pointers.
In linklist edbrowse, lines are in a linked list and that means two
pointers, next and previous, you know the drill.
So to compare, each line has 28 bytes overhead in one version of
edbrowse, 36 bytes overhead in the other.
An empty line, one byte, could consume 40 bytes of ram in linklist edbrowse.
That weighs in favor of linear edbrowse, though not heavily; it's not a huge difference.
Performance also has many tradeoffs.
Something like g/re/ .m-2 is *way* more efficient in linklist.
Each matching line: change some pointers and move it two lines back.
That is a quadratic explosion in linear edbrowse.
r !seq 400000
It's 2 minutes 53 seconds in linear edbrowse, 1 second in linklist,
and the former is quadratic in time for larger files: twice as big, 4
times as slow, etc.
However, if your file has 20 million lines and you ask for line 11382930
there is nothing to do but start at 1 and step through all the links
and count until you find the line.
I do tricks like remembering where dot is, and the last line displayed,
so - just steps back one line, sure, but those are tricks, and
random access can still be slow.
The real question is can we reduce overhead, and I have found no
practical way to do so.
Store 16 lines per allocated chunk?
Tempting, but it becomes a nightmare when you delete a line or move a line
up or down in the buffer, etc.
Point to lines on disk by off_t, don't take them into memory unless you
are changing them.
Tempting, but mini disk reads are a lot of overhead, and it doesn't
really save much space, since we still have all those linklist pointers anyway.
This is a great exercise in memory and performance optimization, and
kinda fun, but I'm not making much progress,
since lines in a file can be just anything.
There's not much to customize or take advantage of here.