caml-list - the Caml user's mailing list
* Windows filenames and Unicode
@ 2010-09-29  5:05 Paul Steckler
  2010-09-29  6:23 ` [Caml-list] " David Allsopp
  0 siblings, 1 reply; 7+ messages in thread
From: Paul Steckler @ 2010-09-29  5:05 UTC
  To: caml-list

In Windows, NTFS filenames are specified in Unicode (UTF-16).  Am I
right in thinking that OCaml file primitives, like open_in, readdir,
etc. cannot handle NTFS filenames containing characters with
codepoints greater than 255?

I'm aware of the Camomile library, which gives the ability to
manipulate UTF-16 strings inside of OCaml.  But it looks like crucial
points of OCaml's I/O, like Sys.argv and file primitives are strictly
limited to 8-bit characters.

Is there a way around this limitation, other than rewriting the file
I/O primitives?

-- Paul



* RE: [Caml-list] Windows filenames and Unicode
  2010-09-29  5:05 Windows filenames and Unicode Paul Steckler
@ 2010-09-29  6:23 ` David Allsopp
  2010-09-29  7:26   ` Paul Steckler
  0 siblings, 1 reply; 7+ messages in thread
From: David Allsopp @ 2010-09-29  6:23 UTC
  To: Paul Steckler, caml-list

Paul Steckler wrote:
> In Windows, NTFS filenames are specified in Unicode (UTF-16).  Am I right
> in thinking that OCaml file primitives, like open_in, readdir, etc. cannot
> handle NTFS filenames containing characters with codepoints greater than
> 255?

The WinAPI "wide" functions use UTF-16, and you can of course fake UTF-16 on top of normal OCaml strings. But I think you'll hit a brick wall: the I/O primitives are built on the underlying C library functions, which ultimately call the ANSI versions of the Windows API functions, not the Unicode ones.
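
As an illustration of "faking" UTF-16 on top of ordinary OCaml strings, here is a minimal sketch. It assumes OCaml 4.06 or later (for Buffer.add_utf_16le_uchar, which postdates this thread), and the helper name utf16le_of_code_points is made up for the example. It only builds the bytes; it does not get them past the ANSI-based I/O primitives.

```ocaml
(* Sketch only: encode Unicode code points as UTF-16LE bytes stored in a
   plain OCaml string. Requires OCaml >= 4.06.
   [utf16le_of_code_points] is a hypothetical helper name. *)
let utf16le_of_code_points (cps : int list) : string =
  let buf = Buffer.create 16 in
  List.iter (fun cp -> Buffer.add_utf_16le_uchar buf (Uchar.of_int cp)) cps;
  Buffer.contents buf

let () =
  (* U+0416 is above 255 but fits in a single 16-bit unit. *)
  assert (utf16le_of_code_points [0x0416] = "\x16\x04");
  (* U+1F600 lies outside the Basic Multilingual Plane, so it takes a
     surrogate pair: two 16-bit units, four bytes. *)
  assert (String.length (utf16le_of_code_points [0x1F600]) = 4)
```

The string is just a byte container here: String.length counts bytes, not characters or 16-bit units.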

> I'm aware of the Camomile library, which gives the ability to manipulate
> UTF-16 strings inside of OCaml.  But it looks like crucial points of
> OCaml's I/O, like Sys.argv and file primitives are strictly limited to 8-
> bit characters.
> 
> Is there a way around this limitation, other than rewriting the file I/O
> primitives?

One way (not foolproof on Windows 7 and Windows 2008 R2, where 8.3 name generation can be disabled) would be to wrap the GetShortPathName Windows API function[1], which converts a pathname to its DOS 8.3 form, which will not contain Unicode characters. Another way might be to wrap the Unicode version of CreateFile (CreateFileW) and convert the result into a handle compatible with the standard library functions, but I reckon that could be tricky!


David

[1] http://msdn.microsoft.com/en-us/library/aa364989(v=VS.85).aspx





* Re: [Caml-list] Windows filenames and Unicode
  2010-09-29  6:23 ` [Caml-list] " David Allsopp
@ 2010-09-29  7:26   ` Paul Steckler
  2010-09-29  7:56     ` Michael Ekstrand
                       ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Paul Steckler @ 2010-09-29  7:26 UTC
  To: David Allsopp; +Cc: caml-list

On Wed, Sep 29, 2010 at 4:23 PM, David Allsopp <dra-news@metastack.com> wrote:
> A way (but not foolproof on Windows 7 and Windows 2008 R2 because you can disable it) would be to wrap the GetShortPathName Windows API function[1] which will convert the pathname to its DOS 8.3 format which will not contain Unicode characters. Another way might be to wrap the Unicode version of CreateFile and convert the result into a handle compatible with the standard library functions but I reckon that could be tricky!

For Linux, I was planning on enforcing the invariant that all strings
inside my program are UTF-8.  For Windows, I could use the same
invariant, and modify the OCaml runtime so that all calls to Windows
file primitives have those strings translated to UTF-16 (and return
values translated back to UTF-8).  That is, I'd have to build a custom
version of OCaml and wrap CreateFile, etc. with such Unicode
translation functions.
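
The translation step described above could look roughly like the following sketch. It assumes OCaml 4.14 or later (for String.get_utf_8_uchar and String.get_utf_16le_uchar, neither of which existed when this thread was written). The names transcode, utf8_to_utf16le and utf16le_to_utf8 are invented for the example, and the actual call into a wide Windows function would still need a C stub, which is not shown.

```ocaml
(* Sketch only: UTF-8 <-> UTF-16LE transcoding on plain OCaml strings.
   Requires OCaml >= 4.14. All function names here are hypothetical. *)

(* Decode [s] with [get] and re-encode it with [add].
   Malformed input is rejected rather than silently replaced. *)
let transcode get add (s : string) : string =
  let buf = Buffer.create (String.length s * 2) in
  let rec go i =
    if i < String.length s then begin
      let d = get s i in
      if not (Uchar.utf_decode_is_valid d) then
        invalid_arg "transcode: malformed input";
      add buf (Uchar.utf_decode_uchar d);
      go (i + Uchar.utf_decode_length d)
    end
  in
  go 0;
  Buffer.contents buf

(* UTF-8 in, UTF-16LE out: the direction needed before calling a wide API. *)
let utf8_to_utf16le =
  transcode String.get_utf_8_uchar Buffer.add_utf_16le_uchar

(* UTF-16LE in, UTF-8 out: the direction needed for returned names. *)
let utf16le_to_utf8 =
  transcode String.get_utf_16le_uchar Buffer.add_utf_8_uchar

let () =
  let name = "\xd0\x96.txt" in  (* U+0416 followed by ".txt", in UTF-8 *)
  assert (utf16le_to_utf8 (utf8_to_utf16le name) = name)
```

A real stub would also have to append the 16-bit NUL terminator that the wide functions expect; the sketch leaves that out.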

All this is made slightly more complicated by the fact that I'm using
the MinGW version of OCaml.

Hmmm, I shouldn't have to do this.  Are there plans afoot to modernize
OCaml's string-handling?

-- Paul



* Re: [Caml-list] Windows filenames and Unicode
  2010-09-29  7:26   ` Paul Steckler
@ 2010-09-29  7:56     ` Michael Ekstrand
  2010-09-29  7:58     ` David Allsopp
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Michael Ekstrand @ 2010-09-29  7:56 UTC
  To: Paul Steckler; +Cc: caml-list

On Wed, 2010-09-29 at 17:26 +1000, Paul Steckler wrote:
> Hmmm, I shouldn't have to do this.  Are there plans afoot to modernize
> OCaml's string-handling?

The Batteries project aims to provide more modernized string handling,
and we already go a long way with the UTF8 module (from Camomile) and
ropes.  That does not, however, affect the file opening routines, as the
current Batteries design requires you to open files using platform
strings for their names.

It may be interesting to look at allowing files to be opened using
Unicode names.  However, this is fraught with difficulties, particularly
in cross-platform situations.  Handling filenames in Unicode is incorrect
on Unix and Linux systems where the locale encoding does not have a
lossless round-trip conversion to and from Unicode.  Therefore, the correct way
to handle filenames in a cross-platform fashion is to always store them
in the system filename encoding (any Unicode encoding on Windows when
the wide functions are supported, the current locale encoding on Unix)
and convert them to Unicode only for display.  IMO any enhanced string
design covering filenames must encourage this.  Fortunately, OCaml's
type system makes such an API fairly natural, it just needs to be
designed and implemented.
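
One possible shape for such an API is sketched below. This is not Batteries code: the module Native_path and its functions are hypothetical, and the sketch requires OCaml 4.14 or later for String.get_utf_8_uchar. The abstract type keeps the native bytes untouched, and the only conversion offered is a one-way, possibly lossy rendering for display.

```ocaml
(* Sketch only: requires OCaml >= 4.14.
   Native_path and its functions are hypothetical names. *)
module Native_path : sig
  type t
  (* A filename exactly as the operating system reported it. *)

  val of_native : string -> t   (* e.g. an entry returned by Sys.readdir *)
  val to_native : t -> string   (* for handing back to open_in and friends *)
  val to_display : t -> string  (* UTF-8 for humans; may be lossy *)
end = struct
  type t = string

  let of_native s = s
  let to_native p = p

  (* ASSUMPTION: the native bytes are interpreted as UTF-8 for display.
     A real implementation would consult the locale encoding instead.
     Malformed sequences become U+FFFD, so the result cannot be used to
     reopen the file. *)
  let to_display p =
    let buf = Buffer.create (String.length p) in
    let rec go i =
      if i < String.length p then begin
        let d = String.get_utf_8_uchar p i in
        (* For an invalid decode, utf_decode_uchar yields Uchar.rep. *)
        Buffer.add_utf_8_uchar buf (Uchar.utf_decode_uchar d);
        go (i + Uchar.utf_decode_length d)
      end
    in
    go 0;
    Buffer.contents buf
end

let () =
  (* A Latin-1 encoded name: the byte 0xE9 is not valid UTF-8. *)
  let p = Native_path.of_native "caf\xe9.txt" in
  assert (Native_path.to_native p = "caf\xe9.txt")
```

Because t is abstract, code outside the module cannot pass a display string where a filename is expected without an explicit of_native call, which is the kind of encouragement the type system can give.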

- Michael

-- 
Web/blog: http://elehack.net/michael
Jabber/Google Talk: this e-mail address
Twitter: http://twitter.com/elehack
mouse, n: a device for pointing at the xterm in which you want to type



* RE: [Caml-list] Windows filenames and Unicode
  2010-09-29  7:26   ` Paul Steckler
  2010-09-29  7:56     ` Michael Ekstrand
@ 2010-09-29  7:58     ` David Allsopp
  2010-09-29  8:14     ` Jerome Vouillon
  2010-09-30 19:27     ` ygrek
  3 siblings, 0 replies; 7+ messages in thread
From: David Allsopp @ 2010-09-29  7:58 UTC
  To: Paul Steckler; +Cc: caml-list

Paul Steckler wrote:
> On Wed, Sep 29, 2010 at 4:23 PM, David Allsopp <dra-news@metastack.com>
> wrote:
> > A way (but not foolproof on Windows 7 and Windows 2008 R2 because you
> > can disable it) would be to wrap the GetShortPathName Windows API
> > function[1] which will convert the pathname to its DOS 8.3 format which
> > will not contain Unicode characters. Another way might be to wrap the
> > Unicode version of CreateFile and convert the result into a handle
> > compatible with the standard library functions but I reckon that could be
> > tricky!
> 
> For Linux, I was planning on enforcing the invariant that all strings
> inside my program are UTF-8.
> For Windows, I could use the same invariant, and modify the OCaml runtime
> so that all calls to Windows file primitives have those strings translated
> to UTF-16 (and return values translated back to UTF-8).  That is, I'd have
> to build a custom version of OCaml and wrap CreateFile, etc. with such
> Unicode translation functions.

Rather than hacking the OCaml runtime (the relevant code is {byte,asm}run/sys.c, btw), I'd personally write a separate module with two implementations: one for Linux, which just uses the built-in primitives, and one for Windows, which calls the WinAPI functions directly. A cursory glance at the runtime code suggests that hacking wide support onto the runtime is not a "one-liner".
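
A rough sketch of that two-implementation layout follows. The names FILE_IO, Posix_io, Windows_io and File_io are invented for illustration, and the Windows side is only a placeholder: a real version would bind C stubs that call the wide WinAPI functions, which cannot be shown in pure OCaml.

```ocaml
(* Sketch only: one interface, two backends, chosen at start-up.
   All module names are hypothetical. *)
module type FILE_IO = sig
  val open_in_bin : string -> in_channel  (* path is UTF-8 by convention *)
  val readdir : string -> string array
end

(* Linux/Unix: the built-in primitives already take byte strings. *)
module Posix_io : FILE_IO = struct
  let open_in_bin = Stdlib.open_in_bin
  let readdir = Sys.readdir
end

(* Windows: PLACEHOLDER. A real backend would declare externals
   implemented by C stubs over the wide WinAPI. *)
module Windows_io : FILE_IO = struct
  let unsupported name =
    failwith (name ^ ": needs a C stub over the wide WinAPI")
  let open_in_bin _ = unsupported "open_in_bin"
  let readdir _ = unsupported "readdir"
end

(* Pick the backend once, based on the platform the program runs on. *)
let file_io : (module FILE_IO) =
  if Sys.win32 then (module Windows_io : FILE_IO)
  else (module Posix_io : FILE_IO)

module File_io = (val file_io : FILE_IO)
```

Client code then calls File_io.open_in_bin and File_io.readdir everywhere, and only the backend module differs between platforms. In a real project the choice would more likely be made by the build system, so that the Windows stubs are never compiled on Linux.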

> All this is made slightly more complicated by the fact that I'm using the
> MinGW version of OCaml.

Shouldn't make it (too much) harder. I use the MinGW build of OCaml for C stubs without issue, after a slightly steep learning curve. The w32api package in Cygwin provides all of the headers and link libraries for the Windows libraries (it's installed by default with gcc-mingw). If you ever have to link with more exotic libraries, dlltool is (sort of) your friend: with a little bit of help, it lets you generate the .a files needed for the DLL you're linking with, since 3rd-party libraries on Windows tend to ship only the MSVC .lib files.

> Hmmm, I shouldn't have to do this.  Are there plans afoot to modernize
> OCaml's string-handling?

Can o' worms, I expect! Windows (by which I mean Windows NT) took the simplest route of 16-bit wchars for Unicode but it's not necessarily the best way of programming in general... the problem with going Unicode is *how* you go Unicode.


David



* Re: [Caml-list] Windows filenames and Unicode
  2010-09-29  7:26   ` Paul Steckler
  2010-09-29  7:56     ` Michael Ekstrand
  2010-09-29  7:58     ` David Allsopp
@ 2010-09-29  8:14     ` Jerome Vouillon
  2010-09-30 19:27     ` ygrek
  3 siblings, 0 replies; 7+ messages in thread
From: Jerome Vouillon @ 2010-09-29  8:14 UTC
  To: Paul Steckler; +Cc: David Allsopp, caml-list

On Wed, Sep 29, 2010 at 05:26:52PM +1000, Paul Steckler wrote:
[...]
> For Linux, I was planning on enforcing the invariant that all
> strings inside my program are UTF-8.  For Windows, I could use the
> same invariant, and modify the OCaml runtime so that all calls to
> Windows file primitives have those strings translated to UTF-16 (and
> return values translated back to UTF-8).  That is, I'd have to build
> a custom version of OCaml and wrap CreateFile, etc. with such
> Unicode translation functions.

The Unison file synchronizer (http://www.cis.upenn.edu/~bcpierce/unison/)
has bindings for the UTF-16 Windows API.  You should have a look at it.
At the moment, this is fairly tied to Unison (in particular because we
want to still be able to access the 8bit API for compatibility with
previous versions of Unison), but it would be great to turn the code
into a standalone library.  One possibility would be to write a
Unicode version of the Unix library.

> All this is made slightly more complicated by the fact that I'm using
> the MinGW version of OCaml.

There is no difficulty accessing the UTF-16 Windows API with the MinGW
version of OCaml.  Actually, I'm even cross-compiling from Linux
without any difficulty.

-- Jerome



* Re: [Caml-list] Windows filenames and Unicode
  2010-09-29  7:26   ` Paul Steckler
                       ` (2 preceding siblings ...)
  2010-09-29  8:14     ` Jerome Vouillon
@ 2010-09-30 19:27     ` ygrek
  3 siblings, 0 replies; 7+ messages in thread
From: ygrek @ 2010-09-30 19:27 UTC
  To: caml-list

On Wed, 29 Sep 2010 17:26:52 +1000
Paul Steckler <steck@stecksoft.com> wrote:

> For Windows, I could use the same invariant, and modify the OCaml
> runtime so that all calls to Windows file primitives have those
> strings translated to UTF-16 (and return values translated back to
> UTF-8).  That is, I'd have to build a custom version of OCaml and
> wrap CreateFile, etc. with such Unicode translation functions.

Have a look at http://savannah.nongnu.org/patch/?4515
(and http://ygrek.org.ua/p/ocaml_unicode.html for ocaml/msvc).

-- 
 ygrek
 http://ygrek.org.ua

