Date: Wed, 8 Sep 2010 11:58:39 +0100
From: Mark Shinwell
To: Eray Ozkural
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] Re: Possible ocamlmpi finalization error?

On Wed, Sep 08, 2010 at 01:31:29PM +0300, Eray Ozkural wrote:
> On Wed, Sep 8, 2010 at 10:51 AM, Sylvain Le Gall wrote:
>
> > On 08-09-2010, Eray Ozkural wrote:
> > > I'm recently getting errors that occur after MPI_Finalize.  Since both
> > > init/finalize and communicator allocation are managed by ocamlmpi, is
> > > it possible this is a bug in the library?  Have you ever seen something
> > > like this?
> > >
> > > Using OpenMPI on OS X.  Here is the log message:
> > >
> > > *** An error occurred in MPI_Comm_free
> > > *** after MPI was finalized
> > > *** MPI_ERRORS_ARE_FATAL (goodbye)
> > >
> > > In the code I'm using both point-to-point and collective communication,
> > > and as far as I know the code is correct.  Could this be due to memory
> > > corruption, or should this never happen?
> >
> > Maybe you can give some minimal code to reproduce this error?
>
> Hmm, not really; it's a complex piece of code.  But I just ran the debug
> build in parallel with exactly the same parameters and there is absolutely
> no problem with that.  All communication is synchronous, so timing cannot
> be an issue (the debug build is naturally slower).  As far as I can tell
> it's not a memory problem, because no bounds errors are reported in the
> debug build (is bounds checking on by default?).  I think it's a
> lower-level problem than my code.  This could happen if some of that
> resource allocation is done in different threads, for instance.
>
> Can you give me any ideas to trace the source of this problem?
I know nothing about MPI, but here are some general ideas:

- Read and/or instrument the MPI source to find out more information than
  "An error"...

- Wrap the MPI_Comm_free function and use printf() to display the arguments
  (and maybe thread IDs).  This might catch situations such as attempting to
  free something twice.  If MPI_Comm_free isn't called often, gdb is another
  thing to try.  (See the first sketch below.)

- If you have a function that isn't re-entrant and you worry that it is
  being called in a re-entrant manner, you could try to catch that by
  wrapping it in another function.  In the wrapper, protect the call to the
  original function with a mutex, and lock it using pthread_mutex_trylock().
  If that call tells you the mutex is already locked, use kill() to send
  SIGSTOP to your own pid.  If this triggers (which can be seen by looking
  at the state of the process using "ps"), gdb can be attached to the
  program and you should be able to see what tried to call the function
  before a previous call had finished.  If you are lucky, you might even see
  where the previous call came from.  (See the second sketch below.)

Mark
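
P.S.  Here is a rough, untested sketch of the kind of MPI_Comm_free wrapper
I mean.  It uses the MPI profiling (PMPI) interface, so I'm assuming your
OpenMPI build exposes PMPI_Comm_free; compile this file and link it into the
program (or the ocamlmpi stubs) so its definition takes precedence over the
library's.

  /* Intercept MPI_Comm_free, log every call, then forward to the real
   * implementation via the PMPI profiling entry point. */
  #include <stdio.h>
  #include <pthread.h>
  #include <mpi.h>

  int MPI_Comm_free(MPI_Comm *comm)
  {
      int finalized = 0;
      MPI_Finalized(&finalized);  /* legal even after MPI_Finalize */

      /* Log the handle address, the calling thread and the finalized flag;
       * a double free, or a free after MPI_Finalize, should stand out. */
      fprintf(stderr, "MPI_Comm_free(%p) thread=%lu finalized=%d\n",
              (void *) comm, (unsigned long) pthread_self(), finalized);

      return PMPI_Comm_free(comm);
  }

Running under mpirun, the stderr lines should show which handle is being
freed after finalization, and from which thread.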
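
And a sketch of the trylock/SIGSTOP trick.  "guarded_comm_free" is just a
made-up name: route the suspect calls through it instead of calling
MPI_Comm_free directly, or wrap whichever non-re-entrant function you
actually suspect.

  /* Detect re-entrant (or concurrent) calls to a function that should not
   * be re-entered, by guarding it with a mutex that is only trylock'd. */
  #include <signal.h>
  #include <unistd.h>
  #include <pthread.h>
  #include <mpi.h>

  static pthread_mutex_t guard = PTHREAD_MUTEX_INITIALIZER;

  int guarded_comm_free(MPI_Comm *comm)
  {
      int ret;

      if (pthread_mutex_trylock(&guard) != 0) {
          /* Another call is already in progress: stop the whole process so
           * gdb can be attached ("ps" will show the state as stopped). */
          kill(getpid(), SIGSTOP);
          /* If someone continues the process, fall back to blocking. */
          pthread_mutex_lock(&guard);
      }

      ret = MPI_Comm_free(comm);
      pthread_mutex_unlock(&guard);
      return ret;
  }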