caml-list - the Caml user's mailing list
* [Caml-list] Looking for a windows ocaml UTF-16 encoded filename aware library
From: Matthieu Dubuget @ 2016-02-05 10:54 UTC
  To: Caml-list

Hello,

I'm currently analysing an NTFS file tree with a native Windows OCaml application.

This application uses:
- Unix.{opendir,readdir,closedir}
- and Unix.LargeFile.lstat

The unix library of the OCaml distribution uses the ANSI variants of the Windows system functions. This works fine until it reaches a file or directory whose UTF-16 encoded name cannot be converted to the code page in use.

I'm about to write a small library to solve this problem: it would mimic the corresponding code from the OCaml unix library, but its C stubs would call the wide variants of the Microsoft system functions instead of the ANSI variants.

Before going on: do you know of any existing library I could use that already does this?

Thanks for any link.

Salutations

-- 
Matthieu Dubuget
Reading recommendation: Guide d’autodéfense numérique (http://guide.boum.org)

* Re: [Caml-list] Looking for a windows ocaml UTF-16 encoded filename aware library
From: Alain Frisch @ 2016-02-05 11:01 UTC
  To: matthieu.dubuget, Caml-list

Hello,

The real solution is to fix OCaml so that it can interact properly with 
arbitrary filenames under Windows. See:

https://github.com/ocaml/ocaml/pull/153
http://caml.inria.fr/mantis/view.php?id=3771

The basic idea is that filenames are represented by OCaml strings
holding the UTF-8 encoding of the actual filename.  To reduce code
breakage, a fallback interprets strings that are not valid UTF-8
using the current code page.  But this is still a rather intrusive
change, since filenames returned by readdir are always UTF-8 encoded,
which can break existing code.  (One could imagine providing two
variants of readdir to smooth the migration path.)
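
A minimal sketch of the fallback rule described above, in plain OCaml and not taken from the PR: a filename string is treated as UTF-8 when it is a well-formed UTF-8 byte sequence, and is otherwise left to a code-page interpretation. The names `encoding`, `is_valid_utf8` and `classify` are hypothetical and chosen only for illustration.

```ocaml
(* Hypothetical sketch: decide how a filename string should be interpreted. *)

type encoding = Utf8 | Code_page

(* [is_valid_utf8 s] is true iff [s] is a well-formed UTF-8 byte sequence:
   no overlong forms, no surrogates, nothing above U+10FFFF. *)
let is_valid_utf8 (s : string) : bool =
  let n = String.length s in
  let byte i = Char.code s.[i] in
  (* [cont i] holds when byte [i] exists and has the form 10xxxxxx. *)
  let cont i = i < n && byte i land 0xC0 = 0x80 in
  let rec go i =
    if i >= n then true
    else
      let b = byte i in
      if b < 0x80 then go (i + 1)
      else if b >= 0xC2 && b <= 0xDF then cont (i + 1) && go (i + 2)
      else if b >= 0xE0 && b <= 0xEF then
        cont (i + 1) && cont (i + 2)
        && (b <> 0xE0 || byte (i + 1) >= 0xA0) (* reject overlong forms *)
        && (b <> 0xED || byte (i + 1) < 0xA0)  (* reject surrogates *)
        && go (i + 3)
      else if b >= 0xF0 && b <= 0xF4 then
        cont (i + 1) && cont (i + 2) && cont (i + 3)
        && (b <> 0xF0 || byte (i + 1) >= 0x90) (* reject overlong forms *)
        && (b <> 0xF4 || byte (i + 1) < 0x90)  (* reject > U+10FFFF *)
        && go (i + 4)
      else false
  in
  go 0

(* Valid UTF-8 wins; anything else falls back to the current code page. *)
let classify (name : string) : encoding =
  if is_valid_utf8 name then Utf8 else Code_page
```

For instance, `classify "caf\xC3\xA9"` yields `Utf8`, while `classify "caf\xE9"` (a Latin-1 encoded name) yields `Code_page`. The rule is a heuristic: a code-page string whose bytes happen to form valid UTF-8 would be read as UTF-8.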

Any help reviewing and testing the PR above would be very much appreciated!

Alain


* Re: [Caml-list] Looking for a windows ocaml UTF-16 encoded filename aware library
From: Bob Atkey @ 2016-02-05 11:09 UTC
  To: caml-list, matthieu.dubuget

Hi Matthieu,

I wrote a little C binding to do pretty much what you are asking:

   https://github.com/ContemplateLtd/filesystem-wrapper

My motivation was to be able to support long filenames (longer than the
MAX_PATH limit of 260 characters) on Windows, but this entails using the
wide versions of the filesystem functions.

I based it on the patch that Alain posted a link to, but only supported 
the operations that we needed (openfile, opendir, readdir, closedir and 
is_directory). I also had to use an abstract type for pathnames to be 
able to handle the bizarre way that Windows handles long file names (you
have to prefix the absolute name with '\\?\', as far as I can tell).
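
A minimal sketch of what such an abstract pathname type could look like, assuming drive-letter absolute paths only. This is not the API of filesystem-wrapper; the module `Long_path` and its functions are hypothetical names used for illustration.

```ocaml
(* Hypothetical sketch, not the API of filesystem-wrapper. *)
module Long_path : sig
  type t                                (* abstract: always an absolute path *)
  val of_absolute : string -> t option  (* None unless the input is "X:\..." *)
  val to_win32 : t -> string            (* the "\\?\"-prefixed form *)
end = struct
  type t = string

  (* Accept only drive-letter absolute paths such as "C:\dir\file". *)
  let is_drive_absolute s =
    String.length s >= 3
    && (match s.[0] with 'A' .. 'Z' | 'a' .. 'z' -> true | _ -> false)
    && s.[1] = ':'
    && s.[2] = '\\'

  let of_absolute s =
    if is_drive_absolute s then
      (* With the prefix, Windows no longer rewrites '/' to '\',
         so the separators are normalised here. *)
      Some (String.map (fun c -> if c = '/' then '\\' else c) s)
    else None

  (* "\\\\?\\" is the OCaml literal for the four characters \\?\ *)
  let to_win32 p = "\\\\?\\" ^ p
end
```

Keeping the type abstract means the prefix can only be applied to a path that has already been checked to be absolute, which is the point of not using a bare string.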

There is a little Makefile that assumes you are cross-compiling from 
Linux with the Debian-packaged cross compiler.

I am no expert in Windows programming, so there are almost
certainly bugs in it. It has been reasonably well tested with long
filenames (we were doing static analysis of Java .class files, some of
which are auto-generated from XML Schemas and can have very long names),
but I haven't tested it much on non-ASCII names. It converts back and
forth between UTF-16 for Windows and UTF-8 for OCaml.
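
A minimal sketch of such a conversion done on the OCaml side, assuming OCaml 4.14 or later for the standard library's UTF decoders. It is not the code from the binding (which may well convert in C); `transcode`, `utf8_to_utf16le` and `utf16le_to_utf8` are hypothetical names.

```ocaml
(* Requires OCaml >= 4.14 for String.get_utf_8_uchar and
   String.get_utf_16le_uchar. *)

(* Re-encode [s]: decode one character at a time with [get] and
   append it to a buffer with [add]. Malformed input decodes to
   U+FFFD, the replacement character, so the loop always advances. *)
let transcode get add (s : string) : string =
  let buf = Buffer.create (String.length s * 2) in
  let rec loop i =
    if i < String.length s then begin
      let d = get s i in
      add buf (Uchar.utf_decode_uchar d);
      loop (i + Uchar.utf_decode_length d)
    end
  in
  loop 0;
  Buffer.contents buf

(* UTF-8 (OCaml side) to UTF-16LE (Windows side), and back. *)
let utf8_to_utf16le = transcode String.get_utf_8_uchar Buffer.add_utf_16le_uchar
let utf16le_to_utf8 = transcode String.get_utf_16le_uchar Buffer.add_utf_8_uchar
```

A C stub passing the result to a wide Win32 function would still need to append the terminating 16-bit NUL, which this sketch does not do.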

As Alain says, the full solution would be to fix OCaml itself.

Bob

* Re: [Caml-list] Looking for a windows ocaml UTF-16 encoded filename aware library
From: Matthieu Dubuget @ 2016-02-05 15:14 UTC
  To: caml-list

Thanks to both!

- short term: Bob's code is a good base for my solution;
- long term: I'll have a look at the current PR and try to find out how I can help.

Salutations

-- 
Matthieu Dubuget
Guide d’autodéfense numérique : http://guide.boum.org


* Re: [Caml-list] Looking for a windows ocaml UTF-16 encoded filename aware library
From: Adrien Nader @ 2016-02-09 11:10 UTC
  To: Matthieu Dubuget; +Cc: caml-list

Hi,

I would definitely welcome such improvements! As far as I'm concerned,
my actual usage is with UNC paths (the ones starting with \\?\), which
are needed for network drives. With only the "ANSI" Win32 API,
handling these is completely impossible. Many thanks for your work; I
look forward to using it. :)

-- 
Adrien Nader
