The Unix Heritage Society mailing list
 help / color / mirror / Atom feed
From: Dan Cross <crossd@gmail.com>
To: Dibyendu Majumdar <mobile@majumdar.org.uk>
Cc: The TUHS <tuhs@tuhs.org>
Subject: Re: [TUHS] Memory management in Dennis Ritchie's C Compiler
Date: Mon, 24 Aug 2020 11:58:46 -0400	[thread overview]
Message-ID: <CAEoi9W7dsRm_FAsV7mfMMbn7wfc0D=_SZwt8y2TQC2X5hhxKqw@mail.gmail.com> (raw)
In-Reply-To: <CACXZuxeW+FfzNv1me6pUcsXp0WG2HmuMcbEBBg1rX+by3osYoQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 5780 bytes --]

On Mon, Aug 17, 2020 at 4:16 PM Dibyendu Majumdar <mobile@majumdar.org.uk>
wrote:

> On Mon, 17 Aug 2020 at 17:13, Dan Cross <crossd@gmail.com> wrote:
> > From my light skimming of V10 sources, it appears that the various
> components of the default C compiler (that is, not LCC) either use
> malloc/free or call `sbrk` directly.
>
> Yes, it only uses sbrk().


No, that's not true; or at least it's only partially true. Note that I'm
talking specifically about 10th Edition here.

The compiler is implemented as several cooperating processes, some of which
employ different memory management strategies. Notice, for example, that
`ccom` uses `malloc` but not `sbrk`, while `c2` does the opposite and uses
`sbrk` but not `malloc`. On the other hand, `cpp` does no dynamic memory
allocation and instead has fixed-size symbol tables for preprocessor
definitions etc. Finally, the assembler uses `sbrk` but `ld` uses `malloc`.


> One consequence I think is that sbrk()
> expands the process memory without invalidating existing use of memory
> - so the code is able to periodically expand heap while retaining all
> existing allocations.
> A simple workaround I used was to preallocate a heap and just stub out
> sbrk() calls - so that works. So in essence given a single chunk of
> memory (if large enough - which is still quite small by today's
> standards) the compiler manages fine.
>
> However I find this unsatisfactory and would like to improve it. But
> it is a bit difficult to understand how the memory is being used.
>

So in a nutshell, `sbrk` works by setting the "break" pointer: that is, the
highest usable virtual address in the process's data segment. Stacks may be
at the top of the user portion of the address space; but I'd have to double
check the details. When the process starts up and gets going, one can
discover the initial value of that pointer by executing `sbrk(0)`: since
`sbrk` takes a delta and returns the old value of the break, the return
value will be (basically) the current end of the data segment.

By manipulating the break pointer via `sbrk`, one can cause the kernel to
allocate memory to a process. That memory can then be managed internally
(that is, within the process) via e.g. malloc _or_ a hand-rolled allocator
(e.g., a "bump" allocator): when you want to put something in memory, you
know where the next available pointer is (after the last one you
allocated), you know how much space is available on the heap (since you
allocated it using `sbrk`) and you presumably know how big the thing you
want to store is; if the current pointer + alignment + size of next object
is > break, you allocate more memory using `sbrk` and continue.

There are a couple of approaches you can take to modernize this. The first
you've already discovered: preallocate a heap that you manage with a stub
`sbrk` and be done with it. Another might be to try and find some otherwise
unallocated region of the address space and using `mmap()` to allocate
successive pages to that portion of the address space to simulate `sbrk`;
this may fail if you run into something that's already allocated in one of
those regions.

The most robust way might be the hardest, which is to replace the
`sbrk`-based allocator with calls to e.g. `malloc`. However, looking a
little more closely at 10th Ed's `c2`, for example`, this doesn't seem like
it would be all that bad: `sbrk` is mostly wrapped by a function called
`alloc` that takes an integer number of bytes to allocate and returns a
`struct node *`; the return value is cast to whatever type is needed
though. The use of `struct node *` as the return value is almost certainly
because the compiler was written before `void *` had achieved primacy in
the ANSI C world.

Anyway, I suspect if you simply replaced the guts of `alloc` with `return
malloc(an);` you'd get approximately what you want.

Memory can be used for declarations, trees (for expressions) and
> strings as far as I can tell. Strings actually use the tree
> allocation, and just pretend that a node is a string.
>

Not exactly. The return type of `alloc` is `struct node *`, but that's just
a detail; in the dialect of C that e.g. 10th Ed's PCC is written in, the
return type doesn't matter: it's just a pointer. You can cast it to `char
*` and copy string data there if you want to. I don't know why `struct node
*` was chosen for `alloc`, but I suspect simply convenience.

It seems that tree memory is allocated in a stack discipline. But what
> puzzled me is that when a tree starts, about 512 bytes of memory are
> left as gap for declarations to use. I have been trying to think in
> what circumstances would you encounter a declaration while parsing an
> expression - perhaps cast expressions? Anyway - if a declaration
> occurs inside an expression (i.e. tree) then it only has 512 bytes
> available. Of course this could be made bigger ... but at the least I
> would like to have separate heaps for declarations, trees and strings.
>

This sounds like it's something independent of the allocator and more about
the program's strategy around allocation in general. It's a common trick in
C programs to e.g., allocate the space for some structs plus things that
may be pointed to by those structs in a single allocation; not only does
this cut down on overhead with respect to calling into the allocator, it
also gives you locality.

I don't think you want separate _heaps_ for the things you mention, though
you may want to allocate them separately using independent calls to e.g.
`malloc`. That's fine.

I guess no one really dug into this code - as presumably once PCC came
> along the original compiler by Dennis stopped being used.
>

10th Edition `cc` _is_ `pcc`. I haven't looked at the 7th Ed compiler in
detail.

        - Dan C.

[-- Attachment #2: Type: text/html, Size: 7238 bytes --]

  parent reply	other threads:[~2020-08-24 16:00 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-15 21:50 Dibyendu Majumdar
2020-08-16 15:20 ` arnold
2020-08-16 15:27   ` Jim Geist
2020-08-17 16:13 ` Dan Cross
2020-08-17 20:16   ` Dibyendu Majumdar
2020-08-17 20:34     ` Paul Winalski
2020-08-17 20:43       ` Dibyendu Majumdar
2020-08-17 21:05         ` Warner Losh
2020-08-17 22:47         ` Dave Horsfall
2020-08-17 23:06           ` Nemo Nusquam
2020-08-17 21:29     ` Dibyendu Majumdar
2020-08-24 15:58     ` Dan Cross [this message]
2020-08-24 17:08       ` John Cowan
2020-08-24 17:15         ` Clem Cole
2020-08-24 17:24           ` Clem Cole
2020-08-24 17:20         ` Dan Cross
2020-08-25 23:06           ` Arthur Krewat
2020-08-17  2:02 Noel Chiappa
2020-08-17 18:02 ` Paul Winalski
2020-08-17 18:13   ` Jim Geist
2020-08-17 18:48     ` Paul Winalski
2020-08-17 19:08       ` Jim Geist
2020-08-17 19:28         ` Paul Winalski
2020-08-17 19:35           ` Jim Geist
2020-08-17 19:41             ` Richard Salz
2020-08-17 23:40               ` Steffen Nurpmeso
2020-08-17 19:27 Noel Chiappa
2020-08-17 19:30 ` Larry McVoy
2020-08-17 19:44   ` Dan Halbert
2020-08-17 19:50   ` Paul Winalski
2020-08-17 22:05     ` Arthur Krewat
2020-08-18  0:52 ` Rob Gingell
2020-08-17 19:51 Noel Chiappa
2020-08-17 22:32 Norman Wilson
2020-08-17 22:55 ` Bakul Shah
2020-08-17 23:12 ` Chet Ramey
2020-08-18  6:33 ` arnold
2020-08-25  9:36 Steve Simon
2020-08-26 13:24 Doug McIlroy
2020-08-26 15:41 ` John Cowan
2020-08-26 15:59 Noel Chiappa
2020-08-26 16:09 ` Warner Losh
2020-08-26 19:13   ` Clem Cole
2020-08-26 16:14 ` Jim Geist
2020-08-26 17:31 ` John Cowan
2020-08-27  8:37 Paul Ruizendaal

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAEoi9W7dsRm_FAsV7mfMMbn7wfc0D=_SZwt8y2TQC2X5hhxKqw@mail.gmail.com' \
    --to=crossd@gmail.com \
    --cc=mobile@majumdar.org.uk \
    --cc=tuhs@tuhs.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).