To: edbrowse-dev@edbrowse.org
From: Karl Dahlke
Reply-to: Karl Dahlke
User-Agent: edbrowse/3.8.5
Subject: line overhead
Date: Sat, 10 Sep 2022 16:21:07 -0400
Message-ID: <20220810162107.eklhad@comcast.net>
List-Id: Edbrowse Development List

One of the driving factors for large or even medium files is the overhead per line. Let's step back and do some math; ok, that's too rigorous, let's do some estimating.

Lines are variable length, from one byte to thousands, so each one has to be allocated, and there is probably no way to optimize or customize the allocation process. In other words, there is no special circumstance here, and I would be pretty doggone arrogant to think I could do better than malloc, which has been refined over the past 45 years by the best minds in computer science.

So, each line has a malloc overhead. What is it? I looked on the internet and there is no clear answer. I'm gonna guess 16 bytes; could be more or less. Also, chunks are 8-byte aligned, so if your line is 41 bytes long you're going to get a slot of length 48. That averages out to another 4 bytes per line.

Then there's my representation in edbrowse. In current edbrowse, pointers to lines are stored in an array, so a million-line file has an array of a million pointers pointing to the million lines. Simple enough. That adds 4 bytes per line, or 8 bytes per line for 64-bit pointers. In linklist edbrowse, lines are in a linked list, and that means two pointers per line, next and previous, you know the drill.
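To make the bookkeeping concrete, here is a rough sketch of the two layouts in C. The struct names are not exactly what's in the edbrowse sources, and I'm assuming the linklist version keeps the text in the same chunk as the links; this is only meant to show where the per-line bytes go.

/* Illustrative sketch; names and layout are assumptions,
   not the actual edbrowse data structures. */

/* linear edbrowse: one growable array of pointers to the lines */
struct linearBuffer {
	char **map;	/* map[i] points to the malloc'ed text of a line */
	int numLines;
};
/* per line: 8 bytes for the pointer in the map, plus the estimated
   16 bytes of malloc overhead and ~4 bytes of 8-byte rounding */

/* linklist edbrowse: every line is a node in a doubly linked list */
struct lineNode {
	struct lineNode *prev, *next;	/* 16 bytes of links */
	char text[];	/* line text in the same chunk, flexible member */
};
/* per line: 16 bytes of links plus the same malloc overhead and
   rounding, hence roughly 8 more bytes per line than the array */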
So to compare: each line carries about 28 bytes of overhead in linear edbrowse (16 + 4 + 8) and about 36 in linklist edbrowse (16 + 4 + 16). An empty line, one byte of text, could consume 40 bytes of RAM in linklist edbrowse. That weighs in favor of linear edbrowse, though not heavily; it's not a huge difference.

Performance also has many tradeoffs. Something like g/re/ .m-2 is *way* more efficient in linklist. For each matching line: change some pointers and move it two lines back (see the sketch at the end of this note). In linear edbrowse that is a quadratic explosion. Try this:

r !seq 400000
g/7$/ .m-2

It's 2 minutes 53 seconds in linear edbrowse, 1 second in linklist, and the former is quadratic in time for larger files: twice as big, four times as slow, and so on.

However, if your file has 20 million lines and you ask for line 11382930, there is nothing to do but start at line 1, step through the links, and count until you find the line. I do tricks like remembering where dot is, and the last line displayed, so the - command just steps back one line, sure, but those are tricks, and random access can still be slow.

The real question is whether we can reduce the overhead, and I have found no practical way to do so. Store 16 lines per allocated chunk? Tempting, but it becomes a nightmare when you delete a line or move a line up or down in the buffer, etc. Point to lines on disk by off_t, and don't bring them into memory unless you are changing them? Tempting, but mini disk reads are a lot of overhead, and it doesn't really save much space, since we still have all those linklist pointers and such.

This is a great exercise in memory and performance optimization, and kinda fun, but I'm not making much progress, since lines in a file can be just anything. There's not much to customize or take advantage of here.

Karl Dahlke
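P.S. Here is the sketch of the .m-2 pointer surgery mentioned above. It's illustrative only, not the real edbrowse code; it assumes a circular doubly linked list with a header node, so prev and next are never NULL.

/* Illustrative sketch, not the actual edbrowse code. */

struct lineNode {
	struct lineNode *prev, *next;
	char text[];	/* line text stored in the same chunk */
};

/* Unlink node and re-insert it after dest.  A handful of pointer
   assignments, independent of how many lines the buffer holds. */
static void moveAfter(struct lineNode *node, struct lineNode *dest)
{
	/* unlink */
	node->prev->next = node->next;
	node->next->prev = node->prev;
	/* splice in after dest */
	node->next = dest->next;
	node->prev = dest;
	dest->next->prev = node;
	dest->next = node;
}

/* For each line matching g/7$/, the .m-2 amounts to
	moveAfter(dot, dot->prev->prev);
   i.e. re-insert the line after the line two above it: a constant
   number of pointer assignments per match, which is why the linklist
   run above finishes in about a second. */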