Subject: Help finding possible bug in 3.09/ia64
From: skaller
To: caml-list@inria.fr
Date: Mon, 28 Nov 2005 17:25:58 +1100
Message-Id: <1133159159.4593.75.camel@rosella>

I am wondering if anyone has access to an ia64 and is willing to help track down a possible bug in OCaml 3.09 for that platform. Details and background follow. I would prefer confirmation before reporting a bug, and it is also kind of hard to report a bug on an architecture I don't have access to .. :)

Mike Furr has found a problem with one of the Felix regression tests on the ia64 platform. It appears to me this is most likely a bug in the ia64 runtime, and next most likely a bug in the native code generator for ia64. The code in question works fine on i386, amd64, ppc and some other architectures.

The program contains no uses of any C bindings, no use of the Obj module, and no unsafe array accesses.

The error manifests as:

    PATH=bin:"$PATH" LD_LIBRARY_PATH=rtl:"$LD_LIBRARY_PATH" bin/flxg -Ilib tut/examples/mac126
    .. ERROR CODE 0xb
    TESTFILE -- ERROR! tut/examples/mac126

during 'make test'. Error code 0xb is signal 11, a segfault, and it results from what appears to be a runaway loop in the garbage collector:

    (gdb) bt
    #0  0x40000000002a04e0 in caml_oldify_local_roots ()
    #1  0x40000000002a5100 in caml_empty_minor_heap ()
    #2  0x40000000002a5360 in caml_minor_collection ()
    #3  0x40000000002a1b50 in caml_garbage_collection ()
    #4  0x40000000002c5ca0 in caml_call_gc ()
    #5  0x40000000002a5100 in caml_empty_minor_heap ()
    #6  0x40000000002c5ca0 in caml_call_gc ()
    #7  0x40000000002a5100 in caml_empty_minor_heap ()
    #8  0x40000000002c5ca0 in caml_call_gc ()
    #9  0x40000000002a5100 in caml_empty_minor_heap ()
    #10 0x40000000002c5ca0 in caml_call_gc ()
    #11 0x40000000002a5100 in caml_empty_minor_heap ()

I don't have access to an ia64, so I am unable to do much about this.
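
As far as I understand it, caml_oldify_local_roots is the part of a minor collection that walks the roots held in stack frames, so a small standalone program that forces minor collections while the stack is deep might help someone with ia64 access tell a runtime problem from a Felix problem. The following is only a sketch of such a test: it is not taken from Felix, the names build, check and run are made up for illustration, and the depth is an arbitrary guess that may need raising to provoke anything.

```ocaml
(* Hypothetical stress test, NOT part of Felix: allocate at every level of a
   deep, non-tail recursion so that minor collections run while many stack
   frames still hold live roots. *)

(* Build a list of [depth] pairs with non-tail recursion. Each level
   allocates a string, a tuple and a cons cell, and keeps [x] live across
   the recursive call, so the collector must find it via the stack. *)
let rec build depth =
  if depth = 0 then []
  else
    let x = (depth, string_of_int depth) in
    let rest = build (depth - 1) in
    x :: rest

(* Verify that every pair survived the collections intact. *)
let check depth l =
  List.length l = depth
  && List.for_all (fun (n, s) -> string_of_int n = s) l

(* Run one round: build the data, force a minor collection, then check. *)
let run depth =
  let l = build depth in
  Gc.minor ();
  check depth l

let () =
  let depth = 10000 in (* assumed value; adjust as needed *)
  if run depth then print_endline "ok" else print_endline "CORRUPT"
```

Compiled with ocamlopt on both amd64 and ia64, a difference in behaviour between the two would point at the ia64 runtime or code generator rather than at the Felix sources. A clean run on ia64 would not prove much either way.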

The fault occurs in a (not uploaded) Debian packaging for Felix 1.1.1. The original tarball is located here:

http://felix.sourceforge.net/flx_1.1.0_src.tgz

It should build on Unix (or Windows XP64 if OCaml supports that, though I haven't tried it).

Yes, it IS possible there is a bug in the source algorithm. In fact, there definitely used to be an unchecked overrun. However, the test is deterministic, so it should fail on all architectures with the same word size at least, and it works fine on amd64. [The algorithm DOES contain a potentially infinite recursion which is supposed to be limited.]

The actual algorithm is probably part of the flx_macro module, since the test is exercising the macro processor. It is (just) possible a deep recursion is overflowing the stack, corrupting memory, and causing the gc to get stuck. Exactly how this could happen I don't know (since it doesn't on other platforms). The test has been around for a long time (over a year, I think).

-- 
John Skaller
Felix, successor to C++: http://felix.sf.net