caml-list - the Caml user's mailing list
From: skaller <skaller@users.sourceforge.net>
To: Eric Dahlman <edahlman@atcorp.com>
Cc: caml-list@pauillac.inria.fr
Subject: Re: [Caml-list] Bug with really_input under cygwin
Date: 10 Mar 2004 14:06:59 +1100
Message-ID: <1078888018.2452.52.camel@pelican.wigram>
In-Reply-To: <51FE3429-7219-11D8-8BF5-000393914EAA@atcorp.com>

On Wed, 2004-03-10 at 09:30, Eric Dahlman wrote:
> Howdy all,
> 
> I have some code which reads in a whole file and returns it as a 
> string.  

The only correct way to do this is to read a block at a time
until you get a partial block.
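A minimal sketch of block-at-a-time reading in OCaml follows. The name read_all and the 4096-byte block size are illustrative choices, not anything from the original code. One caveat: OCaml's `input` may return fewer bytes than requested before end of file, so this sketch keeps reading until `input` returns 0 rather than stopping at the first partial block.

```ocaml
(* Sketch: read a whole file in fixed-size blocks, in binary mode.
   [read_all] and the block size of 4096 are hypothetical choices. *)
let read_all (path : string) : string =
  let ic = open_in_bin path in
  let block = Bytes.create 4096 in
  let acc = Buffer.create 4096 in
  let rec loop () =
    (* [input] returns 0 only at end of file; a short but non-zero
       count just means fewer bytes were available this time. *)
    let n = input ic block 0 (Bytes.length block) in
    if n > 0 then begin
      Buffer.add_subbytes acc block 0 n;
      loop ()
    end
  in
  (* Close the channel on failure too, then re-raise. *)
  (try loop () with e -> close_in_noerr ic; raise e);
  close_in ic;
  Buffer.contents acc
```

Nothing here depends on `in_channel_length`, so the result is simply whatever bytes the channel delivers.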

This is so EVEN in 'binary' mode, which is just another
ill conceived Unix hack :-)

Generally speaking, every output method should specify
a retrieval method or two, and you will only get well
defined results if you use the specified retrieval method.

It is unfortunate that C and Unix do not provide a coherent
abstraction in this area. Even binary I/O is ill-conceived:
who says the bytes get written in order and read in the
same order? What if one channel is opened in 16 bit word
mode, and the other 8 bit mode?

C has been plagued by extremely ill considered functions.
Even the basic IO operation is not correctly defined.
In particular the function putc(int) is an invalid specification.
What happens if int = char and you have 1's complement encoding?

The bottom line is: if you wrote the file yourself,
there should be no problem. Just use BASIC I/O operations.
Functions like 'in_channel_length' are not properly defined
in the Ocaml manual and therefore should not be used.

There is no such thing as 'the number of characters
in a file'. Perhaps there is a number of bytes in a file.
Perhaps, using some decoding technique there is a well
defined number of Unicode/ISO-10646 code points.

In MS-DOS, files *always* consist of a number of 256
byte blocks. It is impossible to have a file with
a non-256 byte multiple size. Of course, text files
uses an encoding with a Ctrl-Z at the end. So the length
of the file 'in bytes' is not the same as the length
of the file 'in Latin-1'. The number of lines in the
file isn't well defined: CR/LF marks end of line,
but what happens if the CR and LF are scattered randomly?

Under Linux, the Standard for text encoding is UTF-8.
So 'characters' <> bytes unless the text is in the ASCII
subset. Even that is not clear, since if you get a 
code point 0 (NUL) some C functions will return
a false result, for example fgets().
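To make the bytes versus code points distinction concrete, here is a small sketch that counts UTF-8 code points in an OCaml string by skipping continuation bytes. The name utf8_length is hypothetical, and the function assumes the input is well-formed UTF-8; it does no validation.

```ocaml
(* Sketch: count UTF-8 code points by counting every byte that is not
   a continuation byte (bit pattern 10xxxxxx).
   Assumes well-formed UTF-8; [utf8_length] is a hypothetical name. *)
let utf8_length (s : string) : int =
  let n = ref 0 in
  String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) s;
  !n
```

For the string "caf\xC3\xA9", `String.length` gives 5 bytes while `utf8_length` gives 4 code points. An embedded NUL byte is counted like any other byte, since OCaml strings carry their length explicitly.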

I personally believe the easiest way to work around this
quagmire of malspecification is to 

(a) ONLY use 8 bit binary I/O
(b) ALWAYS read and write bytes

even if you're processing text. Never depend on the
language or OS conversion functions, it's very unlikely
they'll be right. Do all the conversions needed yourself.
At least when you find a problem you're not handling
correctly you can fix it.
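As one example of doing a conversion yourself, the sketch below turns CR/LF pairs into LF on bytes that were read in binary mode. The name crlf_to_lf is hypothetical, and the policy it encodes (drop a CR only when an LF immediately follows, leave a lone CR untouched) is an assumption; a different policy may suit your files better.

```ocaml
(* Sketch: explicit newline conversion on raw bytes.
   [crlf_to_lf] is a hypothetical name; the rule applied is an assumption:
   a CR is dropped only when the very next byte is an LF. *)
let crlf_to_lf (s : string) : string =
  let len = String.length s in
  let out = Buffer.create len in
  String.iteri
    (fun i c ->
      if not (c = '\r' && i + 1 < len && s.[i + 1] = '\n') then
        Buffer.add_char out c)
    s;
  Buffer.contents out
```

Because the rule is spelled out in your own code, a file that handles stray CRs differently can be accommodated by changing one condition.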

-- 
John Skaller, mailto:skaller@users.sf.net
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Check out the Felix programming language http://felix.sf.net



