caml-list - the Caml user's mailing list
* marshal and C structures crash
From: Andres Varon @ 2007-02-07 22:05 UTC
  To: OCaml List

Hello Everyone,

I would like to ask a question regarding a bug I have been observing  
in one program, which I have been unable to fix:

The program in question is a large phylogenetic analysis application
(bioinformatics) written in OCaml and C. It is almost ready for public
beta testing _except_ for this particular bug. The bulk of the code is
in OCaml (~70,000 LOC), with a small fraction of core functions in C
(obviously it's hard to post the code in question). It runs in both
sequential and parallel versions, the latter using MPI, and makes heavy
use of polymorphic variants, functors, and object-oriented features,
wherever each best fits our requirements.

The parallel version had been broken for a while, but it used to run
without a problem. A few weeks ago, when I updated the code for
parallel runs (using a master-slave distributed model), I started to
observe slaves segfaulting after a while. I narrowed the problem down
to a marshal-related issue that I can reproduce in the sequential
version by doing the following:

1. load some data in the program and marshal to a file what I would
have sent to a slave
2. run the program in a loop that unmarshals the data from the file
and repeats a short script. The loop usually ends with a crash (after
a few iterations).
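
For readers following along, the two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the program: the payload type (a string Map), `make_payload`, `save`, `load`, and `run_loop` are hypothetical names, and forcing `Gc.full_major` after each iteration is an assumption about how one might make finalizers run sooner so that a failure shows up in fewer iterations.

```ocaml
(* Hypothetical stand-in for the real payload: a pure OCaml Map with no
   custom blocks inside it. *)
module SMap = Map.Make (String)

let make_payload () =
  List.fold_left
    (fun acc (taxon, sequence) -> SMap.add taxon sequence acc)
    SMap.empty
    [ ("taxon_a", "ACGT"); ("taxon_b", "AC-T"); ("taxon_c", "TTGA") ]

(* Step 1: write what would have been sent to a slave into a file. *)
let save path (payload : string SMap.t) =
  let oc = open_out_bin path in
  Fun.protect
    ~finally:(fun () -> close_out oc)
    (fun () -> Marshal.to_channel oc payload [])

(* Step 2a: read the payload back. The annotation matters because
   unmarshaling is not type-checked. *)
let load path : string SMap.t =
  let ic = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in ic)
    (fun () -> Marshal.from_channel ic)

(* Step 2b: repeat a short script on freshly unmarshaled data. The full
   major collection after each iteration is an assumption: it gives
   finalizers of dead values a chance to run early. *)
let run_loop ~iterations path script =
  for i = 1 to iterations do
    let payload = load path in
    script i payload;
    Gc.full_major ()
  done
```

The `script` argument is where the short analysis script would go; in this sketch it receives the iteration number and the unmarshaled payload.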

The data structure being marshaled is pure OCaml (Sets and Maps of
other OCaml structures), so all C structures (wrapped with a
custom tag) are produced locally. The segfault happens if the
computations are concentrated in either of our only two C custom
types, which were programmed independently by two of us (for extremely
different computations).

If I skip the unmarshal step and run the same loop by just
reading the data from the input files, the program works flawlessly,
and tools such as valgrind, watchpoints I have set in gdb, and lots
of assertions in our C and OCaml code pass every test. I also
check every array access on our C side to ensure that each
read and write occurs within bounds.

However, if the data comes from the marshaled channel, the program
segfaults after a few iterations. The reason appears to be
(according to valgrind and all my attempts to detect a failure as
early as possible) that some custom type is freed while still alive
on the OCaml side (what I catch is a double free, or the
contents of a DNA sequence being invalid because it has already been
freed). Note, again, that I am completely unable to reproduce the
issue (not even a single warning or assertion failure) unless I
unmarshal the data to start with. Moreover, the error occurs with two
data structures that were programmed independently by two
experienced OCaml programmers. I believe that OCaml is duplicating
the custom type, so that I get two OCaml values pointing at the
same C structure. Is that possible? Although one of the C types uses
a pool of arrays to speed up some computations, the other has only
one pointer, going from the OCaml custom type to the C structure, and
from there to a couple of arrays; that's it. Also note that every
type is treated as an immutable data structure, and we provide no
in-place modifications in our OCaml interface.

Of course, I have been hunting for a bug in my C functions and can't
find anything that could cause the double free (the only way to call
seq_CAML_free is from the garbage collector!) or an out-of-bounds
write. Is there anything special about marshaling that could be
causing this? Perhaps some particular pattern in the way OCaml allocates
memory during unmarshaling? Any ideas about what the problem
could be or where I should look?

As you see, I'm lost; I just don't see where else I can place a check
in our code.

For those of you who reached this line of my email, thanks for the
effort! I will listen to any ideas that pop up in your minds.

best,

Andrés Varón
American Museum of Natural History


* Re: [Caml-list] marshal and C structures crash
From: Robert Roessler @ 2007-02-07 22:59 UTC
  To: Caml-list

Andres Varon wrote:
> ...
> For those of you who reached this line of my email, thanks for the
> effort! I will listen to any ideas that pop up in your minds.

Hey, I will read the full message just to see what someone is doing 
with 70K lines of OCaml code! :)

The usual comment - you don't mention any version and platform 
details... especially with something that took as long as this 
probably did, those might be of interest (particularly since some 
teams doing a project of this size might have not been keeping up with 
OCaml releases).

It is not crystal clear that you are using "finalize" routines - if 
so, they are an obvious (and easy) place to position check code.  If 
not, why not?  It sounds like you might *need* to wrap some of your 
values created in C-land in smart-but-thin OCaml objects, if for 
nothing else than to more delicately handle lifetime issues.
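
To make the "check code" idea concrete, here is a minimal sketch of a liveness registry. It is a model under assumptions rather than anything from the program in question: an integer id stands in for the malloc'ed C pointer, and `register`, `release`, `check_alive`, and `Double_free` are hypothetical names. In a real binding the same bookkeeping would more likely live on the C side, inside the finalize function of the custom block.

```ocaml
(* Hypothetical model: an int id stands in for the C pointer held by a
   custom block; kind records which of the C types it belongs to. *)
type handle = { id : int; kind : string }

(* Ids of all structures that have been allocated and not yet released. *)
let live : (int, string) Hashtbl.t = Hashtbl.create 64
let next_id = ref 0

(* Called wherever a C structure is allocated and wrapped. *)
let register kind =
  incr next_id;
  Hashtbl.replace live !next_id kind;
  { id = !next_id; kind }

exception Double_free of handle

(* Called from the finalizer: a second release of the same id raises
   immediately instead of silently corrupting memory later. *)
let release h =
  if not (Hashtbl.mem live h.id) then raise (Double_free h);
  Hashtbl.remove live h.id

(* Called before every use of the underlying structure. *)
let check_alive h = Hashtbl.mem live h.id
```

With this in place, two OCaml values that end up sharing one underlying structure are caught at the second `release`, which is usually much closer to the cause than the eventual segfault.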

These "popped up" for me on my initial reading. ;)

Robert Roessler
roessler@rftp.com
http://www.rftp.com



* Re: [Caml-list] marshal and C structures crash
From: Andres Varon @ 2007-02-08  0:16 UTC
  To: Robert Roessler; +Cc: Caml-list


On Feb 7, 2007, at 5:59 PM, Robert Roessler wrote:

> Andres Varon wrote:
>> ...
>> For those of you who reached this line of my email, thanks for the
>> effort! I will listen to any ideas that pop up in your minds.
>
> Hey, I will read the full message just to see what someone is doing  
> with 70K lines of OCaml code! :)
>
Hehe, we detect very complex combinatorial events in DNA sequences,
using different optimality criteria, over an evolutionary tree that
we are searching for. The program was at version 3 and had become painful
to maintain (8 years of many hands working on it and, most importantly,
learning OCaml on it), so it has now been rewritten from scratch.

> The usual comment - you don't mention any version and platform  
> details... especially with something that took as long as this  
> probably did, those might be of interest (particularly since some  
> teams doing a project of this size might have not been keeping up  
> with OCaml releases).
>
I realized that afterwards! In part I didn't mention it because it
happens consistently on all the OCaml versions and platforms I have
tried: 3.08.4, 3.09.2, and 3.09.3, running on Mac OS X (PPC / Intel),
Linux x86, Linux AMD64, and Linux EM64T.

I truly believe it is something I am doing wrong on my C side, but for
the life of me I don't see what it is, and I don't understand why it
shows up only in relation to successive marshals. Note that the
marshalled structure does not include any of my C types wrapped in an
OCaml abstract one. It did at the beginning (that was my first
suspect), but before reworking the representations in pure OCaml to
try to get rid of the problem, I even compared the output of separate
marshals of the same values multiple times, unmarshaling and
marshaling again, and comparing different repetitions, with no errors
detected.
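
The comparison described above can be sketched along these lines. This is an illustration only, assuming pure OCaml data with no closures or custom blocks; `stable_round_trip` is a hypothetical name, not a function from the program.

```ocaml
(* Marshal a value, unmarshal the bytes, marshal the copy again, and
   check that both byte strings are identical. Assumes the value is
   plain data (no closures, no custom blocks). *)
let stable_round_trip (v : 'a) : bool =
  let first = Marshal.to_string v [] in
  let copy : 'a = Marshal.from_string first 0 in
  let second = Marshal.to_string copy [] in
  String.equal first second
```

A check like this only shows that the marshaled bytes are stable across repetitions; it says nothing about what happens to values allocated on the C side afterwards.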

> It is not crystal clear that you are using "finalize" routines - if  
> so, they are an obvious (and easy) place to position check code.   
> If not, why not?  It sounds like you might *need* to wrap some of  
> your values created in C-land in smart-but-thin OCaml objects, if  
> for nothing else than to more delicately handle lifetime issues.
>
> These "popped up" for me on my initial reading. ;)

We malloc the C structures and store the pointer to them in a custom
type for which we provide the functions in OCaml. The registration of
the custom type (using a custom_operations structure) includes a
free function that deallocates the C-allocated memory when the
garbage collector does its job, and we provide one for each type.

AFAIK, having a pointer to an allocated C structure wrapped in a  
custom type is safe, provided the C structure does not point back to  
the OCaml heap, and we don't: the pointers go in only one direction  
to the C side.

>
> Robert Roessler
> roessler@rftp.com
> http://www.rftp.com
>
>

Thanks!

Andres


