caml-list - the Caml user's mailing list
* Possible ocamlmpi finalization error?
@ 2010-09-08  5:21 Eray Ozkural
  2010-09-08  7:51 ` Sylvain Le Gall
  0 siblings, 1 reply; 4+ messages in thread
From: Eray Ozkural @ 2010-09-08  5:21 UTC (permalink / raw)
  To: caml-list

I've recently been getting errors that occur after MPI_Finalize. Since both
init/finalize and communicator allocation are managed by ocamlmpi, is it
possible this is a bug in the library? Have you ever seen something like
this?

I'm using Open MPI on OS X. Here is the log message:

*** An error occurred in MPI_Comm_free
*** after MPI was finalized
*** MPI_ERRORS_ARE_FATAL (goodbye)

My code uses both point-to-point and collective communication, and as far as
I know it is correct. Could this be due to memory corruption, or should this
never happen?
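
For reference, here is a purely illustrative C snippet (not taken from my
actual program) showing the kind of call ordering the message complains
about, i.e. a communicator that only gets freed after MPI_Finalize has run:

#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Comm dup;
  MPI_Init(&argc, &argv);
  MPI_Comm_dup(MPI_COMM_WORLD, &dup);  /* allocate a communicator */
  MPI_Finalize();
  MPI_Comm_free(&dup);  /* erroneous: MPI has already been finalized */
  return 0;
}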

Cheers,


-- 
Eray Ozkural, PhD candidate.  Comp. Sci. Dept., Bilkent University, Ankara
http://groups.yahoo.com/group/ai-philosophy
http://myspace.com/arizanesil http://myspace.com/malfunct


* Re: Possible ocamlmpi finalization error?
  2010-09-08  5:21 Possible ocamlmpi finalization error? Eray Ozkural
@ 2010-09-08  7:51 ` Sylvain Le Gall
  2010-09-08 10:31   ` [Caml-list] " Eray Ozkural
  0 siblings, 1 reply; 4+ messages in thread
From: Sylvain Le Gall @ 2010-09-08  7:51 UTC (permalink / raw)
  To: caml-list

On 08-09-2010, Eray Ozkural <examachine@gmail.com> wrote:
> I've recently been getting errors that occur after MPI_Finalize. Since both
> init/finalize and communicator allocation are managed by ocamlmpi, is it
> possible this is a bug in the library? Have you ever seen something like
> this?
>
> I'm using Open MPI on OS X. Here is the log message:
>
> *** An error occurred in MPI_Comm_free
> *** after MPI was finalized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
>
> My code uses both point-to-point and collective communication, and as far as
> I know it is correct. Could this be due to memory corruption, or should this
> never happen?
>

Maybe you can give a minimal code example to reproduce this error?

Regards,
Sylvain Le Gall



* Re: [Caml-list] Re: Possible ocamlmpi finalization error?
  2010-09-08  7:51 ` Sylvain Le Gall
@ 2010-09-08 10:31   ` Eray Ozkural
  2010-09-08 10:58     ` Mark Shinwell
  0 siblings, 1 reply; 4+ messages in thread
From: Eray Ozkural @ 2010-09-08 10:31 UTC (permalink / raw)
  To: Sylvain Le Gall; +Cc: caml-list

On Wed, Sep 8, 2010 at 10:51 AM, Sylvain Le Gall <sylvain@le-gall.net> wrote:

> On 08-09-2010, Eray Ozkural <examachine@gmail.com> wrote:
> > I've recently been getting errors that occur after MPI_Finalize. Since both
> > init/finalize and communicator allocation are managed by ocamlmpi, is it
> > possible this is a bug in the library? Have you ever seen something like
> > this?
> >
> > I'm using Open MPI on OS X. Here is the log message:
> >
> > *** An error occurred in MPI_Comm_free
> > *** after MPI was finalized
> > *** MPI_ERRORS_ARE_FATAL (goodbye)
> >
> > My code uses both point-to-point and collective communication, and as far as
> > I know it is correct. Could this be due to memory corruption, or should this
> > never happen?
> >
>
> Maybe you can give a minimal code example to reproduce this error?
>
>
Hmm, not really; it's a complex code base. However, I just ran the debug
version in parallel with exactly the same parameters, and there is no problem
at all with it. All communication is synchronous, so timing should not be an
issue (even though the debug build is naturally slower). As far as I can tell
it's not a memory problem either, because no bounds errors are reported in the
debug build (is bounds checking on by default?). I think the problem is at a
lower level than my code. It could happen, for instance, if some of that
resource allocation is done in different threads.

Can you give me any ideas for tracing the source of this problem?

Best,


-- 
Eray Ozkural, PhD candidate.  Comp. Sci. Dept., Bilkent University, Ankara
http://groups.yahoo.com/group/ai-philosophy
http://myspace.com/arizanesil http://myspace.com/malfunct


* Re: [Caml-list] Re: Possible ocamlmpi finalization error?
  2010-09-08 10:31   ` [Caml-list] " Eray Ozkural
@ 2010-09-08 10:58     ` Mark Shinwell
  0 siblings, 0 replies; 4+ messages in thread
From: Mark Shinwell @ 2010-09-08 10:58 UTC (permalink / raw)
  To: Eray Ozkural; +Cc: caml-list

On Wed, Sep 08, 2010 at 01:31:29PM +0300, Eray Ozkural wrote:
> On Wed, Sep 8, 2010 at 10:51 AM, Sylvain Le Gall <sylvain@le-gall.net> wrote:
> 
> > On 08-09-2010, Eray Ozkural <examachine@gmail.com> wrote:
> > > I've recently been getting errors that occur after MPI_Finalize. Since both
> > > init/finalize and communicator allocation are managed by ocamlmpi, is it
> > > possible this is a bug in the library? Have you ever seen something like
> > > this?
> > >
> > > I'm using Open MPI on OS X. Here is the log message:
> > >
> > > *** An error occurred in MPI_Comm_free
> > > *** after MPI was finalized
> > > *** MPI_ERRORS_ARE_FATAL (goodbye)
> > >
> > > My code uses both point-to-point and collective communication, and as far as
> > > I know it is correct. Could this be due to memory corruption, or should this
> > > never happen?
> > >
> >
> > Maybe you can give a minimal code example to reproduce this error?
> >
> Hmm, not really; it's a complex code base. However, I just ran the debug
> version in parallel with exactly the same parameters, and there is no problem
> at all with it. All communication is synchronous, so timing should not be an
> issue (even though the debug build is naturally slower). As far as I can tell
> it's not a memory problem either, because no bounds errors are reported in the
> debug build (is bounds checking on by default?). I think the problem is at a
> lower level than my code. It could happen, for instance, if some of that
> resource allocation is done in different threads.
> 
> Can you give me any ideas for tracing the source of this problem?

I know nothing about MPI, but here are some general ideas:

- Read and/or instrument the MPI source to find out more information than
"An error"...

- Wrap the MPI_Comm_free function and use printf() to display the arguments
(and maybe thread IDs).  This might catch situations such as attempting to
free something twice.  If MPI_Comm_free isn't called often, gdb is another
thing to try.
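
A rough, untested sketch of such a wrapper, relying on the MPI profiling
interface (the PMPI_* entry points), so the real call can still be forwarded
after logging:

#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

/* Link this into the program; this definition of MPI_Comm_free takes
   precedence over the library's, which remains reachable as PMPI_Comm_free. */
int MPI_Comm_free(MPI_Comm *comm)
{
  int finalized = 0;
  MPI_Finalized(&finalized);  /* may be called even after MPI_Finalize */
  fprintf(stderr, "MPI_Comm_free(%p) thread=%lu finalized=%d\n",
          (void *) comm, (unsigned long) pthread_self(), finalized);
  return PMPI_Comm_free(comm);  /* forward to the real implementation */
}

If "finalized=1" ever shows up, the preceding log lines should tell you which
communicator is being freed too late, and from which thread.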

- If you have a function that isn't re-entrant and you worry that it might be
being called in a re-entrant manner, maybe you could catch that by wrapping
it in another function.  In the wrapper, protect the call to the original
function with a mutex, locking it using pthread_mutex_trylock().  If that
call reports that the mutex is already locked, use raise() (or kill()) to
send SIGSTOP to your own pid.  If this triggers (which can be seen by looking
at the state of the process using "ps"), gdb can be attached to the program and
you should be able to see what tried to call the function before a previous
call had finished.  If you are lucky, you might even see where the previous
call came from.
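
A rough, untested sketch of that idea, with suspect_function() standing in
for whatever non-re-entrant function is under suspicion:

#include <pthread.h>
#include <signal.h>
#include <stdio.h>

static pthread_mutex_t guard = PTHREAD_MUTEX_INITIALIZER;

void suspect_function(void);  /* the non-re-entrant function in question */

void suspect_function_wrapped(void)
{
  if (pthread_mutex_trylock(&guard) != 0) {
    /* Another call is already in progress: stop the whole process so gdb
       can be attached ("ps" will show it in the stopped state). */
    fprintf(stderr, "re-entrant call detected; stopping for gdb\n");
    raise(SIGSTOP);
    /* After SIGCONT, carry on without touching the mutex we never got. */
    suspect_function();
    return;
  }
  suspect_function();
  pthread_mutex_unlock(&guard);
}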

Mark



end of thread

Thread overview: 4 messages
2010-09-08  5:21 Possible ocamlmpi finalization error? Eray Ozkural
2010-09-08  7:51 ` Sylvain Le Gall
2010-09-08 10:31   ` [Caml-list] " Eray Ozkural
2010-09-08 10:58     ` Mark Shinwell
