From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <1aad458c439864fdd227ffc52d1cf9fe@granite.cias.osakafu-u.ac.jp>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] awk
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-2022-JP"
Content-Transfer-Encoding: 7bit
From: okamoto@granite.cias.osakafu-u.ac.jp
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="upas-tmthlxvzkzzomvhnwqtpczltyj"
Date: Thu, 7 Nov 2002 18:56:44 +0900
Topicbox-Message-UUID: 17a2154e-eacb-11e9-9e20-41e7f4b1d025

This is a multi-part message in MIME format.
--upas-tmthlxvzkzzomvhnwqtpczltyj
Content-Disposition: inline

I'm not insulting you, but...

As we have seen here recently, we seem to have few developers now.
Furthermore, this is an example of an application bug, and it is closely
related to how consistently an application uses UTF-8.

Taking these facts into consideration, I think you had better report the
fix for it, because I believe you can do it. I suppose this is not a
serious bug; it is probably just in a match function or the like. I have
no idea about the fix myself, though.

just my two cents,

Kenji
--upas-tmthlxvzkzzomvhnwqtpczltyj
Content-Type: message/rfc822
Content-Disposition: inline

Received: from granite.cias.osakafu-u.ac.jp ([192.168.1.3]) by diabase; Thu Nov 7 15:51:17 JST 2002
Received: from elmo.cias.osakafu-u.ac.jp (elmo.cias.osakafu-u.ac.jp [157.16.103.2])
	by granite.cias.osakafu-u.ac.jp (8.9.3/8.9.3) with ESMTP id PAA00935
	for ; Thu, 7 Nov 2002 15:47:15 +0900
Received: from mail.cse.psu.edu (psuvax1.cse.psu.edu [130.203.4.6])
	by elmo.cias.osakafu-u.ac.jp (8.9.3/3.7W-02110515) with ESMTP id PAA28312
	for ; Thu, 7 Nov 2002 15:47:18 +0900 (JST)
Received: from psuvax1.cse.psu.edu (psuvax1.cse.psu.edu [130.203.30.6])
	by mail.cse.psu.edu (CSE Mail Server) with ESMTP
	id D2303199BE; Thu, 7 Nov 2002 01:47:08 -0500 (EST)
Delivered-To: 9fans@cse.psu.edu
Received: from pc.aichi-u.ac.jp (a130035.usr.starcat.ne.jp [61.211.130.35])
	by mail.cse.psu.edu (CSE Mail Server) with SMTP id 4C02B19995
	for <9fans@cse.psu.edu>; Thu, 7 Nov 2002 01:46:32 -0500 (EST)
Message-ID:
From: "Kenji Arisawa"
To: 9fans@cse.psu.edu
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Subject: [9fans] awk
Sender: 9fans-admin@cse.psu.edu
Errors-To: 9fans-admin@cse.psu.edu
X-BeenThere: 9fans@cse.psu.edu
X-Mailman-Version: 2.0.11
Precedence: bulk
Reply-To: 9fans@cse.psu.edu
X-Reply-To: "Kenji Arisawa"
List-Id: Fans of the OS Plan 9 from Bell Labs <9fans.cse.psu.edu>
List-Archive:
Date: Thu, 7 Nov 2002 15:46:29 +0900
Content-Transfer-Encoding: quoted-printable
X-MIME-Autoconverted: from 8bit to quoted-printable by granite.cias.osakafu-u.ac.jp id PAA00935

I tested some awk string functions to see whether they handle UTF-8
text well. Below is my test code:

#!/bin/rc
#
# Can awk functions handle UTF strings?
# echo '=E3=83=99=E3=83=AB:=E7=A0=94=E7=A9=B6=E6=89=80' | awk '{ print $0 # =E3=83=99=E3=83=AB:=E7=A0=94=E7=A9=B6=E6=89=80 print length($0) # 6 print index($0,":") # 3 print match($0,":.*"),RSTART, RLENGTH # 7 7 4 print substr($0,3) # :=E7=A0=94=E7=A9=B6=E6=89=80 a=3D$0; sub(":.+", "alice", a); print a # =E3=83=99=E3=83=ABalice }' Output is commented after `#' in each line. Function `match' returns byte position that is inconsitent with others. I believe this is a bug. Kenji Arisawa --upas-tmthlxvzkzzomvhnwqtpczltyj--