caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: Brian Hurt <bhurt@spnz.org>
To: akalin@akalin.cx
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] Array 4 MB size limit
Date: Mon, 15 May 2006 20:48:16 -0400 (EDT)	[thread overview]
Message-ID: <Pine.LNX.4.63.0605152016400.10710@localhost.localdomain> (raw)
In-Reply-To: <20060515141230.ajyupn2z28k0484s@horde.akalin.cx>



On Mon, 15 May 2006, akalin@akalin.cx wrote:

> I'm running into cases where the 4 MB limit on arrays is starting to become a 
> problem.  Lists are much slower and cause seg faults for me on the same data 
> set, and Bigarrays are a partial solution because I'd like to be able to 
> store arbitrary types in arrays (not just numbers).
>
> I was greatly surprised when I found out there was such a low limit on 
> arrays.  Is there a reason for this?  Will this limit ever be increased?

It comes from how arrays are stored in memory.  Every boxed value has a 
"tag word", which, among other uses, is used by the GC to figure out how 
big the object is (i.e. where the next object starts).  Now, the encoding 
scheme of this tag word is fairly complicated, but it works out that for 
arrays, 10 bits of the 32 bits is used for other stuff, leaving 22 bits 
for the size.  This limits the size of an array to 4M-1 elements 
(actually, it limits the size of the array to 16M-4 bytes, as size is 
measured in words- which is why arrays of floats are limited to 2M-1 
entries- each float takes up 8 bytes aka 2 words).

When you move to 64 bits, the tag word doubles in size, but the amount of 
"other" information in the tag word doesn't- this means that suddenly you 
have 52 bits of size, or 4T arrays.  And now floats only take up one word, 
not two, so they too can have 4T arrays.

The array of array idea someone brought up is a good one.  There's a 
slight performance hit, but not much.  I'd keep the top level array short- 
1K entries is enough.  The advantage of this is that if you're heavily 
using the array, the top level array will stay in cache (possibly even L1 
cache), meaning the main cost of an array access only goes up by about 10% 
or so, maybe less (maybe 1% if it stays in L1 cache).  The reasoning goes 
like this: the largest cost by far of an array access on a large array is 
the cache miss.  Doing a memory reference that is satisified out of L1 
cache generally takes 1-4 clock cycles.  If you have to go to L2 cache, 
that's going to take 10-40 clock cycles (approximately).  Going to main 
memory?  100-350+ clock cycles.  With a large array, you're likely going 
to take the 100-350+ clock cycle hit anyways. If the top level array is in 
L1 cache, you're only adding 1-4 clockcycles to the 100-350+ clock cycle 
hit you're going to take anyways.

Personally, I think PCs should have gone 64 bit circa 1997 or so.  Once we 
started getting hard disks large enough that we couldn't mmap the entire 
hard disk, we started hitting problems.  Long long and the large file 
interfaces are examples of these problems.  The longer we go- and the 
larger memories and hard disks get, the bigger the problems get. 
Microsoft seems to be the main impediment, now that AMD has forced Intel 
to get off the stick.

> If I had a record type with 5 floats, 
> will the limit then by Sys.max_array_size / 10?

Within the record, the floats are unboxed (assuming you didn't have any 
non-float elements).  This means the floats are stored directly in the 
record, and that the record doesn't hold pointers to the floats.  But the 
record itself is boxed- which means that an array of these records is 
really an array of pointers to these records, meaning that you're back at 
the 4M-1 limit.

Note that a record of 5 floats costs 44 bytes (40 bytes for the 5 floats + 
4 bytes for the tag word).  Assuming no records are stored in more than 
one location, the total memory cost of an array of 4M-1 of them is 16M for 
the array, plus (4M-1)*44 for the records, for a total of 192M-44 bytes. 
This is almost 10% of your available memory space right there.

> Is there some sort of 
> existing ArrayList module that works around this problem?  Ideally, I'd like 
> to have something like C++'s std::vector<> type, which can be dynamically 
> resized.

Ocaml can't dynamically resize arrays.  This gets tricky to do when arrays 
get large.  At the extreme, if the array takes up 50% + 1 byte of the 
total address space, you can't resize it to be any larger- to resize 
requires you to copy it, which takes more memory than you have.  I've 
seend problems well before then.

> Also, the fact that using lists crashes for the same data set is surprising. 
> Is there a similar hard limit for lists, or would this be a bug?  Should I 
> post a test case?

I'm willing to bet dollars to donuts that the problem you hit with lists 
was stack overflow.  Without signifigantly impacting performance there 
isn't a whole lot Ocaml can do about detecting stack overflow before the 
segfault hits.  My general rule is, if it's going to contain more than a 
few dozen items, it probably shouldn't be a list.  Extlib's library is 
less prone to this error, but you can still run into serious problems with 
long lists.

Brian


  parent reply	other threads:[~2006-05-16  0:48 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-05-15 18:12 akalin
2006-05-15 18:22 ` [Caml-list] " Nicolas Cannasse
2006-05-15 20:32 ` Damien Doligez
2006-05-15 21:27   ` akalin
2006-05-15 22:51 ` Oliver Bandel
2006-05-16  0:48 ` Brian Hurt [this message]
2006-05-16  9:57   ` Damien Doligez
2006-05-16 15:10     ` Markus Mottl
2006-05-16  8:01 ` Xavier Leroy
2006-05-16  8:20   ` Nicolas Cannasse
2006-05-19 17:13     ` Xavier Leroy
2006-05-19  5:57   ` Frederick Akalin
2006-05-19  6:21     ` Oliver Bandel
2006-05-19 12:15     ` Jon Harrop
2006-05-19 19:36       ` akalin
2006-05-19 20:17         ` Oliver Bandel
2006-05-19 16:28     ` Jozef Kosoru
2006-05-19 20:08       ` Oliver Bandel
2006-05-19 21:26       ` Jon Harrop
2006-05-20  1:06         ` Brian Hurt
2006-05-20 18:32           ` brogoff
2006-05-20 21:29             ` immutable strings II ([Caml-list] Array 4 MB size limit) Oliver Bandel
2006-05-22 22:09               ` Aleksey Nogin
2006-05-20 21:11           ` immutable strings (Re: [Caml-list] " Oliver Bandel
2006-05-25  4:32             ` immutable strings (Re: " Stefan Monnier
2006-05-25  5:56               ` [Caml-list] " Martin Jambon
2006-05-25  7:23                 ` j h woodyatt
2006-05-25 10:22                   ` Jon Harrop
2006-05-25 19:28                   ` Oliver Bandel
2006-05-25 11:14                 ` Brian Hurt
2006-05-25 19:42                   ` Oliver Bandel
2006-05-26  6:51                   ` Alain Frisch
2006-05-25 17:31                 ` Aleksey Nogin
2006-05-25 19:54                   ` Martin Jambon
2006-05-25 11:18               ` Brian Hurt
2006-05-25 17:34                 ` Aleksey Nogin
2006-05-25 18:44                   ` Tom
2006-05-25 23:00                     ` Jon Harrop
2006-05-25 23:15                       ` Martin Jambon
2006-05-20  0:57       ` [Caml-list] Array 4 MB size limit Brian Hurt
2006-05-20  1:17         ` Frederick Akalin
2006-05-20  1:52           ` Brian Hurt
2006-05-20  9:08             ` Jozef Kosoru
2006-05-20 10:12               ` skaller
2006-05-20 11:06                 ` Jozef Kosoru
2006-05-20 12:02                   ` skaller
2006-05-20 21:42                 ` Oliver Bandel
2006-05-21  1:24                   ` skaller
2006-05-21 14:10                     ` Oliver Bandel
     [not found]               ` <Pine.LNX.4.63.0605200847530.10710@localhost.localdomain>
2006-05-20 19:52                 ` Jozef Kosoru
2006-05-20 21:45                   ` Oliver Bandel
2006-05-21  9:26           ` Richard Jones
     [not found]             ` <5CE30707-5DCE-4A22-970E-A49C36F9C901@akalin.cx>
2006-05-22 10:40               ` Richard Jones
2006-05-20 10:51         ` Jozef Kosoru
2006-05-20 14:22           ` Brian Hurt
2006-05-20 18:41             ` j h woodyatt
2006-05-20 19:37               ` Jon Harrop
2006-05-20 20:47             ` Jozef Kosoru
2006-05-26 18:34             ` Ken Rose
2006-05-20 22:07           ` Oliver Bandel
2006-05-20 15:15         ` Don Syme
2006-05-20 22:15           ` Oliver Bandel
2006-05-21  1:25             ` skaller

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.63.0605152016400.10710@localhost.localdomain \
    --to=bhurt@spnz.org \
    --cc=akalin@akalin.cx \
    --cc=caml-list@inria.fr \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).