From: Tristan Plumb <9p-st@imu.li>
To: 9fans@cse.psu.edu
Date: Thu, 28 Feb 2008 14:06:57 -0500
Subject: [9fans] localization, unicode, regexps (was: awk, not utf aware...)

> erik | Sape * uriel

I have been pondering character sets rather a lot recently (mostly
wishful thinking, by my estimation), so this conversation set me
thinking more...

> how does one deal with a multi-language file.

By not dealing in languages? Unicode (however flawed) solves
multi-script files; why mire ourselves in mutable language rules when
scripts are plenty?

> for example, have a base-character folding switch that allows regexps
> to fold codepoints into base codepoints so that íïìîi -> i.

I would favor decomposing codepoints (í -> i+U+0301, ï -> i+U+0308,
ì -> i+U+0300, î -> i+U+0302) with a switch to ignore combining
characters. That has the disadvantage of lengthening your text by a
byte or a rune at a time, but it does allow you to match accents.

| Yes, and then there is locale: does [a-z] include ĳ when you run it
| in Holland (it should)? Does it include á, è, ô in France (it should)?
| Does it include ø, å in Norway (it should not)? And what happens when
| you evaluate "è" < "o" (it depends)?

Does Spanish [a-c] match the c in ch (depends on when and where you
ask)? More Unicode-centric: does 'a' match (the first byte of) 'à'
(U+0061 U+0300), or all three bytes, or not at all?

I would write [a-z] in a regexp on two occasions: for a letter of the
Latin alphabet (better served by something like [[:latin:]], so I
needn't add a bunch of other things like [þðæœø]), or for the bytes
[61, 7a].

As any sort of public project is stuck with Unicode (not advocating
the hysteria of before, just wishing Unicode had left some of it
behind), regexps reflecting Unicode, not the user's language, make
sense to me. Unicode is at least codified.

* I think the plan9 tools demonstrate that it is not so hard to find a
* 'good enough' solution; and the lunix locale debacle demonstrates that
* if you want to get it 'right' you will end up with a nightmare.

Yet some things that are good enough for one idea spawn nightmares
elsewhere. I'll pick on Unicode: lumping character sets together does
a fine job of letting you write multiple scripts in the same file, yet
ǭ = ǭ = ǭ = ǭ = ǭ (that is, U+01ED = U+01EB+U+0304 = U+014D+U+0328 =
o+U+0328+U+0304 = o+U+0304+U+0328), good enough here being
ill-thought-out. Yet mayhap you mean well-compromised (that seems
right).

To those who were at IWP9 this year: cast your mind back to the
question of whether any plan9 people have a vested interest in RtL
rendering and the like. I should have stood up then and cried out: I!
Imagine either that I did so, or that I do now. If anyone has interest
in playing with this at a character-set level, do tell.

enjoy,
tristan

-- 
All original matter is hereby placed immediately under the public domain.
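
P.S. If it helps make the folding concrete, here is a minimal sketch in
Go, purely illustrative: the golang.org/x/text/unicode/norm package and
the example strings are my own choices, not anything in the plan9 tools
or awk. It decomposes to NFD and drops nonspacing marks (the "ignore
combining characters" switch above), then shows the five spellings of ǭ
collapsing to one form under NFC:

// fold.go: decompose to NFD, then drop nonspacing marks (Mn), i.e. the
// "ignore combining characters" switch sketched above. Illustrative only.
package main

import (
	"fmt"
	"strings"
	"unicode"

	"golang.org/x/text/unicode/norm"
)

// foldMarks returns s with every codepoint decomposed and every
// combining (nonspacing) mark removed, so í ï ì î all become i.
func foldMarks(s string) string {
	var b strings.Builder
	for _, r := range norm.NFD.String(s) {
		if unicode.Is(unicode.Mn, r) {
			continue // drop the accent, keep the base letter
		}
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	fmt.Println(foldMarks("íïìî")) // prints: iiii

	// the ǭ nightmare: five spellings, one form after NFC normalization
	forms := []string{
		"\u01ED",        // ǭ precomposed
		"\u01EB\u0304",  // ǫ + combining macron
		"\u014D\u0328",  // ō + combining ogonek
		"o\u0328\u0304", // o + ogonek + macron
		"o\u0304\u0328", // o + macron + ogonek
	}
	for _, f := range forms {
		fmt.Printf("% x -> % x\n", []byte(f), []byte(norm.NFC.String(f)))
	}
}

Folding both the pattern and the text this way gives the íïìîi -> i
behaviour without any locale tables; whether that is "right" for any
particular language is, of course, exactly the open question.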