Date: Sat, 31 May 2014 20:16:17 +0100
From: Peter Stephenson
To: "Zsh List Hackers'"
Subject: Re: Unicode, Korean, normalization form, Mac OS X and tab completion
Message-ID: <20140531201617.4ca60ab8@pws-pc.ntlworld.com>

On Sat, 31 May 2014 12:56:06 +0900
Kwon Yeolhyun wrote:
> 4) Mac OS X uses normalized string as filename. Assuming there’s a
> file with the name of 가나다, it has the name of
> ㄱㅏㄴㅏㄷㅏ(decomposed into hangul jamos) internally. (Link to hangul
> jamos:
> http://www.utf8-chartable.de/unicode-utf8-table.pl?start=4352&number=1024)
> 5) I guess the reason why the tab completion has failed is that zsh
> compare the user input, 가나다, with the filename, ㄱㅏㄴㅏㄷㅏ.
> 가나다 and ㄱㅏㄴㅏㄷㅏ are canonically equivalent but have different
> binary representations.

You're right, this is a real problem that could do with solving.

The actual conversion between the two is easy enough --- though most of
us here don't use Macs or character sets that show up the problem, so
we'd need a volunteer to help with this (relatively) easy bit.

The difficult bit, about which I suspect only Bart and I are likely to
have detailed opinions, is where to do the conversion.

Doing it at the point where data is read from the keyboard is
problematic, since what we put back onto the command line is quite
intricately tied to what we read from it in the first place, and
arbitrary transformations at this point make it hard to know what to put
back after the completion.

Doing it right down in the guts is even harder --- there are some
incredibly complicated things going on to support features like partial
word completion that currently treat data simply as octet strings, and
upgrading this is a huge job.

So if we can guarantee the keyboard input is in one form (and I'm not
sure we necessarily can) it might be easier to convert file names into
that format. The trouble here is that to be consistent we need to
convert all data passed into the completion system, e.g. from file
contents passed as strings via functions. (In principle it's more
correct to normalise all input anyway.)

I'm currently wondering if there is scope for normalising keyboard input
really early --- before we feed it back to the shell --- and turning it
back into the usual keyboard form right at the end, perhaps not worrying
too much if the original input was in a different form as long as
they're equivalent. But I suspect it's not that easy.

So this will take a certain amount of thought.

pws
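
To make the mismatch above concrete, here is a minimal Python sketch (not
zsh code). It assumes the keyboard delivers precomposed text (NFC) and the
file system hands back decomposed names (NFD). The helpers same_text and
complete are hypothetical names invented for this illustration, and the
strategy shown (normalise both sides only for the comparison, return the
stored name untouched) is just one of the options discussed, not a
statement of how zsh does or should do it.

```python
import unicodedata


def same_text(a, b):
    """Compare by canonical equivalence instead of raw code points."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def complete(prefix, filenames):
    """Hypothetical prefix matcher: compare in NFC, return names as stored."""
    key = unicodedata.normalize("NFC", prefix)
    return [name for name in filenames
            if unicodedata.normalize("NFC", name).startswith(key)]


# Assumed keyboard form: three precomposed Hangul syllables.
typed = "\uac00\ub098\ub2e4"
# Assumed file-system form: the same text decomposed into conjoining jamo
# (the U+1100 block, i.e. start=4352 in the quoted link).
stored = unicodedata.normalize("NFD", typed)

print(typed == stored)                    # raw comparison
print(len(typed), len(stored))            # code point counts differ
print(same_text(typed, stored))           # equivalence-aware comparison
print(complete("\uac00", [stored, "readme"]) == [stored])
```

One caveat with this sketch: matching on NFC works at whole-syllable
granularity, so a typed syllable without a final consonant will not match
a stored syllable that has one, even though the decomposed forms share a
prefix. Whether that is the desired behaviour for partial word completion
is part of the open question.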