From: Andres Varon <avaron@gmail.com>
To: OCaml List <caml-list@yquem.inria.fr>
Subject: marshal and C structures crash
Date: Wed, 7 Feb 2007 17:05:27 -0500
Message-ID: <5F3F8FC7-2C10-4F4A-A7D3-6268AD6D1E5A@gmail.com>
Hello Everyone,
I would like to ask a question regarding a bug I have been observing
in one program, which I have been unable to fix:
The program in question is a large phylogenetic analysis application
(bioinformatics), which has been written in OCaml and C. It's almost
ready for public beta testing _except_ for this particular bug.
The bulk of the code is in OCaml (~70,000 LOC), with a small fraction
of core functions in C (obviously it's hard to post the code in
question). It runs both in sequential and parallel versions using
MPI, and makes heavy use of polymorphic variants, functors, and
object-oriented features, wherever each best fits our requirements.
I had the parallel version broken for a while, but it used to run
without a problem. A few weeks ago, when I updated the code for
parallel runs (using a master-slave distributed model), I started to
observe slaves segfaulting after a while. I narrowed the problem down
to a marshal-related issue that I can reproduce in the sequential
version by doing the following:
1. load some data in the program and marshal to a file what I would
have sent to a slave
2. run the program in a loop that unmarshals the data from the file
and repeats a short script. The loop usually ends with a crash (after
a few iterations).
The data structure being marshaled is pure OCaml (Sets and Maps of
other OCaml structures), so all C structures (wrapped with a
custom tag) are produced locally. The segfault happens if the
computations are concentrated in either one of the only two C custom
types, which were programmed independently by two of us (extremely
different computations).
If I don't do the unmarshal step, but run the previous loop by just
reading the data from the input files, the program works flawlessly,
and tools such as valgrind, watch points I have set in gdb, and lots
of assertions in our C and OCaml code pass every test. I also have
checks for every array access on our C side to ensure that each
read and write occurs within bounds.
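
For illustration, a bounds check of this kind could look like the
sketch below. The names seq_t, seq_get, and seq_set are hypothetical
stand-ins chosen for this example, not the program's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sequence type; the real structure is not shown here. */
typedef struct {
    size_t len;            /* number of valid entries in bases */
    unsigned char *bases;  /* storage owned by the C side */
} seq_t;

/* Read position i; abort if the structure or the index is invalid. */
static unsigned char seq_get(const seq_t *s, size_t i) {
    assert(s != NULL && s->bases != NULL);
    assert(i < s->len);
    return s->bases[i];
}

/* Write position i under the same checks as seq_get. */
static void seq_set(seq_t *s, size_t i, unsigned char v) {
    assert(s != NULL && s->bases != NULL);
    assert(i < s->len);
    s->bases[i] = v;
}
```

Checks like these catch out-of-bounds indices, but they cannot detect
an access through a pointer whose memory has already been freed.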
However, if the data comes from the marshaled channel, after a few
iterations the program segfaults, and the reason appears to be
(according to valgrind, and all my attempts to detect a failure as
early as possible) that some custom type is freed while still alive
on the OCaml side (what I catch is a double free, or that the
contents of a DNA sequence are invalid because it has been freed
already). Note, again, that I am completely unable to reproduce the
issue (even a single warning or assertion failure) unless I
unmarshal the data to start with. Moreover, the error occurs with two
data structures that were programmed independently by two
experienced OCaml programmers. I believe that OCaml is duplicating
the custom type and therefore I get two OCaml values pointing at the
same C structure. Is that possible? Although one of the C types uses
a pool of arrays to speed up some computations, the other one only has
one pointer, going from the OCaml custom type to the C structure, and
from there to a couple of arrays; that's it. Also note that every
type is treated as an immutable data structure, and we provide no
in-place modifications in our OCaml interface.
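
The suspected failure mode (two wrappers pointing at one C structure,
each finalized once) can be modeled in plain C, without the OCaml
runtime. The sketch below is only a model under assumptions: seq_t,
wrapper_t, wrapper_attach, and wrapper_finalize are hypothetical
names, and wrapper_finalize merely plays the role of a finalizer such
as seq_CAML_free. It shows how a reference count would make a second
finalization harmless; it says nothing about whether unmarshaling
actually duplicates custom values.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical C-side structure with a count of wrappers that use it. */
typedef struct {
    int refcount;
    size_t len;
    unsigned char *bases;
} seq_t;

/* Hypothetical stand-in for a custom block: it only holds a pointer. */
typedef struct {
    seq_t *seq;
} wrapper_t;

/* Counts how many structures were really released (for inspection). */
static int structs_freed = 0;

static seq_t *seq_create(size_t len) {
    seq_t *s = malloc(sizeof *s);
    if (s == NULL) return NULL;
    s->bases = calloc(len ? len : 1, 1);
    if (s->bases == NULL) { free(s); return NULL; }
    s->refcount = 0;
    s->len = len;
    return s;
}

/* Create a wrapper for s and record that one more wrapper uses it. */
static wrapper_t wrapper_attach(seq_t *s) {
    wrapper_t w = { s };
    s->refcount++;
    return w;
}

/* Model of the finalizer: release the structure only when the last
   wrapper goes away, so an aliased wrapper cannot cause a double free. */
static void wrapper_finalize(wrapper_t *w) {
    seq_t *s = w->seq;
    if (s == NULL) return;   /* this wrapper was already finalized */
    w->seq = NULL;
    assert(s->refcount > 0);
    if (--s->refcount == 0) {
        free(s->bases);
        free(s);
        structs_freed++;
    }
}
```

Without the count, finalizing the first of two aliased wrappers would
free the structure while the second wrapper still points at it, which
matches both symptoms described above: invalid sequence contents and
a double free.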
Of course, I have been hunting a bug in my C functions and can't find
anything that could cause the double free (the only way to call
seq_CAML_free is from the garbage collector!), or an out of bounds
write. Is there anything special about marshaling that could be
causing this? Even some particular pattern in the way OCaml allocates
memory for the unmarshaling step? Any ideas about what the problem
could be or where I should look?
As you can see, I'm lost; I just don't see where else I can place a
check in our code.
For those of you who reached this line of my email, thanks for the
effort! I will listen to any ideas that pop up in your minds.
best,
Andrés Varón
American Museum of Natural History