From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id EAA07642; Wed, 10 Mar 2004 04:03:09 +0100 (MET) X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id EAA08402 for ; Wed, 10 Mar 2004 04:03:07 +0100 (MET) Received: from smtp1.adl2.internode.on.net (smtp1.adl2.internode.on.net [203.16.214.181]) by concorde.inria.fr (8.12.10/8.12.10) with ESMTP id i2A334Hd016117 for ; Wed, 10 Mar 2004 04:03:05 +0100 Received: from [192.168.1.200] (ppp113-89.lns1.syd3.internode.on.net [150.101.113.89]) by smtp1.adl2.internode.on.net (8.12.9/8.12.9) with ESMTP id i2A32swn049160; Wed, 10 Mar 2004 13:32:55 +1030 (CST) Subject: Re: [Caml-list] Bug with really_input under cygwin From: skaller Reply-To: skaller@users.sourceforge.net To: Eric Dahlman Cc: caml-list@pauillac.inria.fr In-Reply-To: <51FE3429-7219-11D8-8BF5-000393914EAA@atcorp.com> References: <51FE3429-7219-11D8-8BF5-000393914EAA@atcorp.com> Content-Type: text/plain Message-Id: <1078888018.2452.52.camel@pelican.wigram> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 (1.2.2-4) Date: 10 Mar 2004 14:06:59 +1100 Content-Transfer-Encoding: 7bit X-Miltered: at concorde by Joe's j-chkmail ("http://j-chkmail.ensmp.fr")! X-Loop: caml-list@inria.fr X-Spam: no; 0.00; caml-list:01 bug:01 cygwin:01 sourceforge:01 2004:99 howdy:01 retrieval:99 retrieval:99 abstraction:01 char:01 decoding:01 10646:01 latin-:01 scattered:01 9660:01 Sender: owner-caml-list@pauillac.inria.fr Precedence: bulk X-Status: X-Keywords: X-UID: 102 On Wed, 2004-03-10 at 09:30, Eric Dahlman wrote: > Howdy all, > > I have some code which is reads in a whole file in and returns it as a > string. The only correct way to do this is to read a block at a time until you get a partial block. This is so EVEN in 'binary' mode, which is just another ill conceived Unix hack :-) Generally speaking, every output method should specify a retrieval method or two, and you will only get well defined results if you use the specified retrieval method. It is unfortunate that C and Unix do not provide a coherent abstraction in this area. Even binary I/O is ill-conceived: who says the bytes get written in order and read in the same order? What if one channel is opened in 16 bit word mode, and the other 8 bit mode? C has been plagued by extremely ill considered functions. Even the basic IO operation is not correctly defined. In particular the function putc(int) is an invalid specification. What happens if int = char and you have 1's complement encoding? The bottom line is: if you wrote the file yourself, there should be no problem. Just use BASIC I/O operations. Functions like 'in_channel_length' are not properly defined in the Ocaml manual and therefore should not be used. There is no such thing as 'the number of characters in a file'. Perhaps there is a number of bytes in a file. Perhaps, using some decoding technique there is a well defined number of Unicode/ISO-10646 code points. In MS-DOS, files *always* consist of a number of 256 byte blocks. It is impossible to have a file with a non-256 byte multiple size. Of course, text files uses an encoding with a Ctrl-Z at the end. So the length of the file 'in bytes' is not the same as the length of the file 'in Latin-1'. The number of lines in the file isn't well defined: CR/LF marks end of line, but what happens if the CR and LF are scattered randomly? Under Linux, the Standard for text encoding is UTF-8. So 'characters' <> bytes unless the text is in the ASCII subset. Even that is not clear, since if you get a code point 0 (NUL) some C functions will return a false result, for example fgets(). I personally believe the easiest way to work around this quagmire of malspecification is to (a) ONLY use 8 bit binary I/O (b) ALWAYS read and write bytes even if you're processing text. Never depend on the language or OS conversion functions, its very unlikely they'll be right. Do all the conversions needed yourself. At least when you find a problem you're not handling correctly you can fix it. -- John Skaller, mailto:skaller@users.sf.net voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners