From: Gerd Stolpmann
To: Andries Hekstra
Cc: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] OCaml program crashes after computing fine for 2 days during grep on multiMB output file
Date: Wed, 01 Mar 2006 14:45:09 +0100
Message-Id: <1141220710.10329.80.camel@localhost.localdomain>

On Wednesday, 1 March 2006, at 12:03 +0100, Andries Hekstra wrote:
>
> Dear OCaml-list,
>
> I use OCaml under 64-bit Linux to do signal processing simulations of
> next generation optical storage devices. So far, I have really enjoyed
> programming in OCaml, e.g. as program texts are considerably shorter
> than in C++ for computations that involve many arrays. My computations
> run for many days if not a week, and produce output files of ca. 20
> MB. I run them in a job queue.
>
> Recently I have been plagued by programs that crash when I do a "grep"
> on the output file (opened with open_out). E.g. the program has been
> running successfully for a few days. I do a "grep @ *.out" in the
> directory to monitor progress as important lines in the output file
> start with a "@". A few minutes later I receive mails from the queuing
> system saying that everything crashed.
>
> What is the cause of these crashes? Can somebody give me a clue?

A stale NFS file handle normally means that the file disappeared on the
NFS server. (The server does not keep files open while clients have them
open in order to support proper POSIX semantics; it just re-opens them
whenever clients access the files.) As you are grepping the file, this
cannot be the case here.

Stale handles may also result if the NFS server is rebooted and
something goes wrong. Normally, the server keeps file handles across
reboots, but there are many reports that this does not work for some
users. Maybe these NFS servers are just buggy.
(For example, some operating systems do not guarantee stable device
numbers, so the disks get new numbers every time the system boots, and
all file handles become stale.)

You should also ensure that you are hard-mounting (option "-o hard" in
the mount command). Use NFS version 3 if possible.

In general, I would advise against using NFS for long-running processes.
Write the file to /var/tmp and move it to its final location when it is
fully written.

Gerd

> ------------------------------------------------------------
> # LSBATCH: User input
> qtb -par Exp107.txt > Exp107.txt.log -codes
> gallager_10b_1023l_1048576w.txt
> ------------------------------------------------------------
>
> Exited with exit code 2.
>
> Resource usage summary:
>
> CPU time : 163606.88 sec.
> Max Memory : 3014 MB
> Max Swap : 3044 MB
>
> Max Processes : 3
>
> The output (if any) follows:
>
> Fatal error: exception Sys_error("Stale NFS file handle")
>
> ------------------------------------------------------------------------
> Dr. Ir. Andries P. Hekstra
> Philips Research
> High Tech Campus 27 (WL-1-4.15)
> 5656 AG Eindhoven
> Tel./Fax/Secr. +31 40 27 42048/42566/44051

-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany
gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de
Phone: +49-6151-153855 Fax: +49-6151-997714
------------------------------------------------------------
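
To illustrate the advice above about writing to local disk and moving the
file only when it is complete, here is a minimal OCaml sketch. The names
copy_file and with_local_output, the ".part" suffix, and the chunk size
are made up for this example; they are not from the original program. It
assumes the scratch file lives on a local file system (such as /var/tmp)
and the final location is on the NFS mount, so a plain Sys.rename between
the two would likely fail and the data has to be copied instead.

```ocaml
(* Sketch only: function names, the ".part" suffix and the buffer size
   are assumptions made for illustration. *)

(* Copy [src] to [dst] in fixed-size chunks. *)
let copy_file src dst =
  let ic = open_in_bin src in
  let oc =
    try open_out_bin dst
    with e -> close_in_noerr ic; raise e
  in
  let buf = Bytes.create 65536 in
  let rec loop () =
    let n = input ic buf 0 (Bytes.length buf) in
    if n > 0 then begin
      output oc buf 0 n;
      loop ()
    end
  in
  (* Flush explicitly so that write errors surface as Sys_error here. *)
  (try loop (); flush oc
   with e -> close_in_noerr ic; close_out_noerr oc; raise e);
  close_in ic;
  close_out oc

(* Run [write] against a local scratch file, then publish the result
   under its final name. *)
let with_local_output ~scratch ~final write =
  let oc = open_out scratch in
  (try write oc; close_out oc
   with e -> close_out_noerr oc; raise e);
  (* Copy to a temporary name next to the final file and rename it
     afterwards, so that readers do not normally see a half-written
     file under the final name. *)
  let partial = final ^ ".part" in
  copy_file scratch partial;
  Sys.rename partial final;
  Sys.remove scratch
```

A call might look like
with_local_output ~scratch:"/var/tmp/Exp107.out" ~final:"Exp107.out" (fun oc -> output_string oc "@ done\n"),
where both paths are placeholders. The long-running computation then only
touches the local file; NFS is involved for the short final copy. Progress
can still be monitored with grep on the scratch file on the machine that
runs the job.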