Date: Wed, 8 Sep 2010 11:58:39 +0100
From: Mark Shinwell
To: Eray Ozkural
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] Re: Possible ocamlmpi finalization error?

On Wed, Sep 08, 2010 at 01:31:29PM +0300, Eray Ozkural wrote:
> On Wed, Sep 8, 2010 at 10:51 AM, Sylvain Le Gall wrote:
>
> > On 08-09-2010, Eray Ozkural wrote:
> > > I'm recently getting errors that occur after MPI_Finalize.  Since both
> > > init/finalize and communicator allocation are managed by ocamlmpi, is
> > > it possible this is a bug in the library?  Have you ever seen something
> > > like this?
> > >
> > > Using OpenMPI on OS X.  Here is the log message:
> > >
> > > *** An error occurred in MPI_Comm_free
> > > *** after MPI was finalized
> > > *** MPI_ERRORS_ARE_FATAL (goodbye)
> > >
> > > In the code I'm using both point-to-point and collective communication,
> > > and as far as I know the code is correct.  Could this be due to memory
> > > corruption, or should this never happen?
> >
> > Maybe you can give some minimal code to reproduce this error?
>
> Hmm, not really; it's a complex piece of code.  But I just ran the debug
> build in parallel with exactly the same parameters and there is absolutely
> no problem with that.  All communication is synchronous, so timing cannot
> be an issue (the debug build is naturally slower).  As far as I can tell
> it's not a memory problem, because no bounds errors are reported in the
> debug build (is bounds checking on by default?).  I think it's a
> lower-level problem than my code.  This could happen if some of that
> resource allocation is done in different threads, for instance.
>
> Can you give me any ideas to trace the source of this problem?
I know nothing about MPI, but here are some general ideas:

- Read and/or instrument the MPI source to find out more information than
  "An error"...

- Wrap the MPI_Comm_free function and use printf() to display the arguments
  (and maybe thread IDs).  This might catch situations such as attempting to
  free something twice.  If MPI_Comm_free isn't called often, gdb is another
  thing to try.  (See the first sketch below.)

- If you have a function that isn't re-entrant and you worry that it is
  being called in a re-entrant manner, you could try to catch that by
  wrapping it in another function.  In the wrapper, protect the call to the
  original function with a mutex, and lock it using pthread_mutex_trylock().
  If that call tells you the mutex is already locked, use kill() to send
  SIGSTOP to your own pid.  If this triggers (which can be seen by looking
  at the state of the process using "ps"), gdb can be attached to the
  program and you should be able to see what tried to call the function
  before a previous call had finished.  If you are lucky, you might even see
  where the previous call came from.  (See the second sketch below.)

Mark
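
P.S.  Here is a rough, untested sketch of the kind of MPI_Comm_free wrapper
I mean.  It uses the MPI profiling (PMPI) interface, so I'm assuming your
OpenMPI build exposes PMPI_Comm_free; compile this file and link it into the
program (or the ocamlmpi stubs) so its definition takes precedence over the
library's.

  /* Intercept MPI_Comm_free, log every call, then forward to the real
   * implementation via the PMPI profiling entry point. */
  #include <stdio.h>
  #include <pthread.h>
  #include <mpi.h>

  int MPI_Comm_free(MPI_Comm *comm)
  {
      int finalized = 0;
      MPI_Finalized(&finalized);  /* legal even after MPI_Finalize */

      /* Log the handle address, the calling thread and the finalized flag;
       * a double free, or a free after MPI_Finalize, should stand out. */
      fprintf(stderr, "MPI_Comm_free(%p) thread=%lu finalized=%d\n",
              (void *) comm, (unsigned long) pthread_self(), finalized);

      return PMPI_Comm_free(comm);
  }

Running under mpirun, the stderr lines should show which handle is being
freed after finalization, and from which thread.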
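
And a sketch of the trylock/SIGSTOP trick.  "guarded_comm_free" is just a
made-up name: route the suspect calls through it instead of calling
MPI_Comm_free directly, or wrap whichever non-re-entrant function you
actually suspect.

  /* Detect re-entrant (or concurrent) calls to a function that should not
   * be re-entered, by guarding it with a mutex that is only trylock'd. */
  #include <signal.h>
  #include <unistd.h>
  #include <pthread.h>
  #include <mpi.h>

  static pthread_mutex_t guard = PTHREAD_MUTEX_INITIALIZER;

  int guarded_comm_free(MPI_Comm *comm)
  {
      int ret;

      if (pthread_mutex_trylock(&guard) != 0) {
          /* Another call is already in progress: stop the whole process so
           * gdb can be attached ("ps" will show the state as stopped). */
          kill(getpid(), SIGSTOP);
          /* If someone continues the process, fall back to blocking. */
          pthread_mutex_lock(&guard);
      }

      ret = MPI_Comm_free(comm);
      pthread_mutex_unlock(&guard);
      return ret;
  }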