From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <7ba10925935da3080b62c7cb6e2649d5@coraid.com>
From: erik quanstrom
Date: Mon, 17 Sep 2007 11:55:04 -0400
To: 9fans@cse.psu.edu
Subject: Re: [9fans] simplicity
In-Reply-To: <46EE9A41.7DD78E60@null.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Topicbox-Message-UUID: c050adde-ead2-11e9-9d60-3106f5b1d025

> erik quanstrom wrote:
> > i think the devolution of gnu grep is quite instructive. ...
> > it gets to the heart of why plan9's invention and use (thanks rob, ken) of
> > utf-8 is so great.
>
> If the problem is that Gnu grep converts any non-8-bit character set
> to wchar_t (the equivalent of Plan 9 "rune"), then it's not really a
> fair criticism of the software. The conversion approach handles a
> wide variety of character encoding schemes, whereas grepping the
> encodings directly (the fast approach) doesn't work well for many
> non-UTF-8 encodings.

performance may suck, but that's just a symptom of a bigger problem.

wchar_t is not the equivalent of Rune. Rune is always utf-8. wchar_t
can be whatever. this is not a feature. it is a bug.

suppose Linux user a and user b grep the same "text" file for the same
string. results will depend on the users' locales.

contrast plan 9. any two users grepping the same file for the same
string will get the same results.

in either case a character set conversion might be necessary to match
the locale. but in the plan 9 case, one conversion will fix things for
any plan 9 user. in the Linux case, there is no conversion that will
fix things for any Linux user.

- erik

p.s. gnu grep does special-case utf-8 and avoids wchar_t conversions
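
to make the "results will depend on the users' locales" point concrete, here is a minimal sketch in python. it is not how gnu grep is implemented. it assumes the locale can be modelled as nothing more than the encoding used to decode the file, and the names grep, DATA, PATTERN, user_a and user_b are made up for the illustration.

```python
import re


def grep(data, pattern, encoding):
    """Return the lines of data that match pattern.

    encoding stands in for the user's locale (an assumption of this
    sketch): the bytes are decoded with it before any matching happens.
    """
    text = data.decode(encoding, errors="replace")
    rx = re.compile(pattern)
    return [line for line in text.splitlines() if rx.search(line)]


# the same bytes on disk for both users: "café" stored as utf-8,
# followed by a plain ascii line
DATA = "caf\u00e9\ncafe\n".encode("utf-8")

# the same ascii-only pattern for both users: "caf" plus exactly one character
PATTERN = r"^caf.$"

user_a = grep(DATA, PATTERN, "utf-8")    # models a utf-8 locale
user_b = grep(DATA, PATTERN, "latin-1")  # models an iso-8859-1 locale

print(user_a)
print(user_b)
```

under the utf-8 decoding the é is one character, so both lines match. under the latin-1 decoding the same two bytes become two characters, "." can no longer cover them, and only the ascii line matches. same file, same pattern, different answers.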