Subject: Re: Unicode, Korean, normalization form, Mac OS X and tab completion
From: Kwon Yeolhyun
To: Daniel Shahaf
Cc: Zsh List Hackers'
Date: Sun, 01 Jun 2014 14:30:03 +0900
List-Id: Zsh Workers List
X-Seq: 32645
In-reply-to: <20140601022527.GD1820@tarsus.local2>
References: <20140531201617.4ca60ab8@pws-pc.ntlworld.com> <140531142926.ZM556@torch.brasslantern.com> <20140601022527.GD1820@tarsus.local2>

On Jun 1, 2014, at 11:25 AM, Daniel Shahaf wrote:

> Bart Schaefer wrote on Sat, May 31, 2014 at 14:29:26 -0700:
>> On May 31, 8:16pm, Peter Stephenson wrote:
>> }
>> } I'm currently wondering if there is scope for normalising keyboard input
>> } really early --- before we feed it back to the shell --- and turning it
>> } back into the usual keyboard form right at the end
>>
>> Per thread with Chet, I think normalizing the filesystem is the easier
>> way to go.  Keyboard input is already as close to normalized as it needs
>> to be, I think, and with only a couple of exceptions all the names we
>> get from the filesystem come through zreaddir().
>
> What about, say, people doing 'ls' and copy-pasting a filename from the
> output into a command line?  Wouldn't that result in NFD keyboard
> input?
>
> FWIW, while OS X always returns NFD filenames, one could also imagine an
> OS that is normalization-aware (forbids creating a file if its
> normalized name is the same as the normalized name of an existing file)
> but octet-sequence-preserving, and on such an OS both the readdir()
> output and the user input would need to be normalized.
>
> Also, other unixes allow you to have both the NFC-form and NFD-form in
> the same directory, e.g., 'touch fooá fooá' works just fine on linux
> ext4 (the first filename is composed, the second decomposed); in such
> cases normalization magic should not be done.
>
> Fun! :-)
>
> Daniel

Fortunately, I think Mac OS X can handle input in either decomposed or composed form.
Here's some code I tested:

================ hangul.c =========================
#include <stdio.h>
#include <dirent.h>

int main(void)
{
    /* Both literals are composed UTF-8, as typed in the editor. */
    const char *fname = "한글/가나다";
    const char *dirname = "한글";
    DIR *dirp = opendir(dirname);
    struct dirent *direntry = NULL;
    FILE *fp = fopen(fname, "r");
    char buf[512];
    size_t n;

    if (dirp == NULL) {
        printf("Failed to read the directory: %s\n", dirname);
        if (fp != NULL)
            fclose(fp);
        return 1;
    }

    /* Print every entry, including "." and "..". */
    while ((direntry = readdir(dirp)) != NULL)
        printf("file name: %s\n", direntry->d_name);
    closedir(dirp);

    if (fp == NULL) {
        printf("Failed to read %s\n", fname);
        return 1;
    }

    /* fread() does not NUL-terminate, so terminate before printing. */
    n = fread(buf, 1, sizeof(buf) - 1, fp);
    buf[n] = '\0';
    printf("%s\n", buf);
    fclose(fp);
    return 0;
}
======= END ========

And the output is

> mkdir 한글
> touch 한글/가나다
> echo "test success!" > 한글/가나다
> clang -g hangul.c
> ./a.out
file name: .
file name: ..
file name: 가나다
test success!

I checked the contents of memory using lldb and confirmed that fname holds
composed UTF-8 characters, while the filename returned by readdir() holds
decomposed UTF-8 characters.

File operations still work perfectly with the composed name (reading, as in
the code above, and writing as well).

So I think we can convert decomposed filenames into composed form after
readdir(). That will work at least for Korean: detecting, composing, and
decomposing Hangul can be done easily.
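To illustrate why Hangul is the easy case, here is a minimal sketch of the syllable arithmetic described in the Unicode Standard (Conjoining Jamo Behavior). The function names hangul_compose() and hangul_decompose() are hypothetical and are not existing zsh code. The sketch works on already-decoded code points, so decoding the UTF-8 bytes of d_name is assumed to happen elsewhere, and it covers only modern conjoining jamo, not full NFC normalization.

```c
#include <stddef.h>
#include <stdint.h>

/* Constants from the Unicode Standard's Hangul syllable arithmetic. */
enum {
    S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7,
    L_COUNT = 19, V_COUNT = 21, T_COUNT = 28,
    N_COUNT = V_COUNT * T_COUNT,   /* 588 */
    S_COUNT = L_COUNT * N_COUNT    /* 11172 */
};

/*
 * Hypothetical helper: compose conjoining jamo into precomposed syllables,
 * in place. Returns the new number of code points in cp[].
 */
size_t hangul_compose(uint32_t *cp, size_t len)
{
    size_t out = 0;

    for (size_t i = 0; i < len; i++) {
        uint32_t c = cp[i];

        if (out > 0) {
            uint32_t last = cp[out - 1];

            /* Leading consonant + vowel -> LV syllable. */
            if (last >= L_BASE && last < L_BASE + L_COUNT &&
                c >= V_BASE && c < V_BASE + V_COUNT) {
                cp[out - 1] = S_BASE +
                    ((last - L_BASE) * V_COUNT + (c - V_BASE)) * T_COUNT;
                continue;
            }
            /* LV syllable + trailing consonant -> LVT syllable. */
            if (last >= S_BASE && last < S_BASE + S_COUNT &&
                (last - S_BASE) % T_COUNT == 0 &&
                c > T_BASE && c < T_BASE + T_COUNT) {
                cp[out - 1] = last + (c - T_BASE);
                continue;
            }
        }
        cp[out++] = c;   /* Anything else is copied through unchanged. */
    }
    return out;
}

/*
 * Hypothetical helper: decompose one precomposed syllable into 2 or 3 jamo.
 * Returns the number of jamo written, or 0 if s is not a Hangul syllable.
 */
size_t hangul_decompose(uint32_t s, uint32_t out[3])
{
    uint32_t idx, t;

    if (s < S_BASE || s >= S_BASE + S_COUNT)
        return 0;
    idx = s - S_BASE;
    out[0] = L_BASE + idx / N_COUNT;
    out[1] = V_BASE + (idx % N_COUNT) / T_COUNT;
    t = idx % T_COUNT;
    if (t == 0)
        return 2;
    out[2] = T_BASE + t;
    return 3;
}
```

For example, the six jamo U+1112 U+1161 U+11AB U+1100 U+1173 U+11AF compose to the two syllables U+D55C U+AE00, which is the directory name used in the test above. Detection is a plain range check, which is why no lookup table is needed for Korean; other scripts would need the general normalization tables.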