From: Idris Samawi Hamid
Newsgroups: gmane.comp.tex.context
Subject: Arabic utf 2 transcription
Date: Mon, 07 Jun 2004 13:31:10 -0600
Organization: Colorado State University
Reply-To: ntg-context@ntg.nl
Cc: "aleph@ntg.nl"
List-Id: mailing list for ConTeXt users
Hi cohorts,

Thank you to everyone who helped me with this. I now have a working script for converting Arabic UTF-8 to transcription. It is still pretty experimental, but I have tried it on some rather large files. I added a couple of small features to make things like the definite article more readable.

This experiment has also led to some useful improvements in my own transcription scheme (based on Lagally's ArabTeX). For example, Lagally's system incorporates some Arabic orthography rules for hamzah; these are for the most part moot when dealing with UTF-8, so I added a direct transcription for each hamzah-carrier. Of course you will have to adjust what follows to fit your own pet transcription.

For some of the small features to work (like making the definite article more explicit), one should move the entire file being processed over by at least one space. The article rules only match after a space, so every word beginning with the definite article is affected, but mid-word occurrences are ignored.

Persian has not been extensively tested, and Urdu etc. is completely missing. Hope this is useful in any case.

You can download some real-life Unicode Arabic samples here:

http://www.alabrar.info/ar/booklist.aspx

Usage: run perl utf2tex.pl with the name of the input file as its argument (the script reads its input with <>). This creates the file "new.tex".

Thanks again to everyone, especially Thomas A. Schmitz, who got the ball rolling.
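To illustrate the leading-space requirement, here is a minimal sketch; it is not part of the script itself. It copies one definite-article rule from the script and applies it to the same word with and without a preceding space. The helper name article_rule and the sample word are made up for this illustration, and prepending a space to each line is only one way of reading "moving the file over by one space".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One definite-article rule copied from the script:
# space + alif + lam + ba  ->  space + "al-b".
# article_rule is a hypothetical helper name used only in this sketch.
sub article_rule {
    my ($line) = @_;
    $line =~ s/\x20\xD8\xA7\xD9\x84\xD8\xA8/\x20al-b/g;
    return $line;
}

# Sample word as raw UTF-8 bytes: alif, lam, ba, ya, ta.
my $word = "\xD8\xA7\xD9\x84\xD8\xA8\xD9\x8A\xD8\xAA";

my $unpadded = article_rule($word);        # first word on a line, no space before it
my $padded   = article_rule(" " . $word);  # same word after prepending one space

print "without leading space: ",
    ($unpadded eq $word ? "unchanged" : "changed"), "\n";
print "with leading space:    ",
    (index($padded, " al-b") == 0 ? "article rewritten" : "unchanged"), "\n";
```

Only the padded version is rewritten, which is why the first word of every line needs a space in front of it before the script is run.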
Best
Idris

========================================================
#!/usr/bin/perl

use strict;
use warnings;

# Note: there is deliberately no "use utf8" and no input decoding, so both
# the input and the patterns below are treated as raw UTF-8 bytes.

# opens file to print out the result
open(my $out, '>', 'new.tex') or die "Cannot open new.tex: $!";

while (<>) {    # this reads the input file line by line

# Allah
$_ =~ s/\xD8\xA7\xD9\x84\xD9\x84\xD9\x87/al-llaah/g;
$_ =~ s/\xD8\xA7\xD9\x84\xD9\x84\xD9\x91\xD9\x87/al-llaah/g;
$_ =~ s/\xD8\xA7\xD9\x84\xD9\x84\xD9\x91\xD9\xB0\xD9\x87/al-llaah/g;
#$_ =~ s/\xD9\x84\xD9\x84\xD9\x87/li-llaah/g;
#$_ =~ s/\xD9\x84\xD9\x84\xD9\x91\xD9\x87/li-llaah/g;
#$_ =~ s/\xD9\x84\xD9\x84\xD9\x87/la-llaah/g;
#$_ =~ s/\xD9\x84\xD9\x84\xD9\x91\xD9\x87/la-llaah/g;
$_ =~ s/\xD9\x84\xD9\x84\xD9\x87/l-llaah/g;
$_ =~ s/\xD9\x84\xD9\x84\xD9\x91\xD9\x87/l-llaah/g;

#
# begin exceptions
#
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x8A\x20/\x20'ilY\x20/g;
# the next three patterns contain literal Arabic, matched as bytes
$_ =~ s/\x20الي\x20/\x20'ilY\x20/g;
$_ =~ s/\x20الي/\x20'ily/g;
$_ =~ s/\x20و\x20/\x20wa\x20/g;
# $_ =~ s/\x20\x/\x20wa\x20/g;
$_ =~ s/\xD8\xA7\xD9\x8B/aN/g;
# end exceptions

# definite article
# (sun letters carrying a shaddah first, then the plain letters)
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAA\xD9\x91/\x20al-tt/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAB\xD9\x91/\x20al-_t_t/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAF\xD9\x91/\x20al-dd/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB0\xD9\x91/\x20al-_d_d/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB1\xD9\x91/\x20al-rr/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB2\xD9\x91/\x20al-zz/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB3\xD9\x91/\x20al-ss/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB4\xD9\x91/\x20al-^s^s/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB5\xD9\x91/\x20al-.s.s/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB6\xD9\x91/\x20al-.d.d/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB7\xD9\x91/\x20al-.t.t/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB8\xD9\x91/\x20al-.z.z/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x84\xD9\x91/\x20al-ll/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x86\xD9\x91/\x20al-nn/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xA7/\x20al-A/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xA8/\x20al-b/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAA/\x20al-t/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAB/\x20al-_t/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAC/\x20al-j/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAD/\x20al-.h/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAE/\x20al-_h/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAF/\x20al-d/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB0/\x20al-_d/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB1/\x20al-r/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB2/\x20al-z/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB3/\x20al-s/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB4/\x20al-^s/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB5/\x20al-.s/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB6/\x20al-.d/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB7/\x20al-.t/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB8/\x20al-.z/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB9/\x20al-`/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xBA/\x20al-.g/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x81/\x20al-f/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x82/\x20al-q/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x83/\x20al-k/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x84/\x20al-l/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x85/\x20al-m/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x86/\x20al-n/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x87/\x20al-h/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x88/\x20al-w/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x8A/\x20al-y/g;

# 0601--060F
$_ =~ s/\xD8\x8C/,/g;

# 0610--061F
$_ =~ s/\xD8\x9B/;/g;
$_ =~ s/\xD8\x9F/?/g;

# 0620--062F
# (each letter + shaddah rule must come before the rule for the bare letter)
$_ =~ s/\xD8\xA1\xD9\x91/''/g;
$_ =~ s/\xD8\xA1/'/g;
$_ =~ s/\x20\xD8\xA2/ 'aa/g;
$_ =~ s/\xD8\xA2/~A/g;
$_ =~ s/\xD8\xA3\xD9\x91/xx/g;
$_ =~ s/\xD8\xA3/x/g;
$_ =~ s/\xD8\xA4\xD9\x91/oo/g;
$_ =~ s/\xD8\xA4/o/g;
$_ =~ s/\xD8\xA5\xD9\x91/cc/g;
$_ =~ s/\xD8\xA5/c/g;
$_ =~ s/\xD8\xA6\xD9\x91/CC/g;
$_ =~ s/\xD8\xA6/C/g;
$_ =~ s/\xD8\xA7/A/g;
$_ =~ s/\xD8\xA8\xD9\x91/bb/g;
$_ =~ s/\xD8\xA8/b/g;
$_ =~ s/\xD8\xA9/T/g;
$_ =~ s/\xD8\xAA\xD9\x91/tt/g;
$_ =~ s/\xD8\xAA/t/g;
$_ =~ s/\xD8\xAB\xD9\x91/_t_t/g;
$_ =~ s/\xD8\xAB/_t/g;
$_ =~ s/\xD8\xAC\xD9\x91/jj/g;
$_ =~ s/\xD8\xAC/j/g;
$_ =~ s/\xD8\xAD\xD9\x91/.h.h/g;
$_ =~ s/\xD8\xAD/.h/g;
$_ =~ s/\xD8\xAE\xD9\x91/_h_h/g;
$_ =~ s/\xD8\xAE/_h/g;
$_ =~ s/\xD8\xAF\xD9\x91/dd/g;
$_ =~ s/dّ/dd/g;
$_ =~ s/\xD8\xAF/d/g;

# 0630--063F
$_ =~ s/\xD8\xB0\xD9\x91/_d_d/g;
$_ =~ s/\xD8\xB0/_d/g;
$_ =~ s/\xD8\xB1\xD9\x91/rr/g;
$_ =~ s/\xD8\xB1/r/g;
$_ =~ s/\xD8\xB2\xD9\x91/zz/g;
$_ =~ s/\xD8\xB2/z/g;
$_ =~ s/\xD8\xB3\xD9\x91/ss/g;
$_ =~ s/\xD8\xB3/s/g;
$_ =~ s/\xD8\xB4\xD9\x91/^s^s/g;
$_ =~ s/\xD8\xB4/^s/g;
$_ =~ s/\xD8\xB5\xD9\x91/.s.s/g;
$_ =~ s/\xD8\xB5/.s/g;
$_ =~ s/\xD8\xB6\xD9\x91/.d.d/g;
$_ =~ s/\xD8\xB6/.d/g;
$_ =~ s/\xD8\xB7\xD9\x91/.t.t/g;
$_ =~ s/\xD8\xB7/.t/g;
$_ =~ s/\xD8\xB8\xD9\x91/.z.z/g;
$_ =~ s/\xD8\xB8/.z/g;
$_ =~ s/\xD8\xB9\xD9\x91/``/g;
$_ =~ s/\xD8\xB9/`/g;
$_ =~ s/\xD8\xBA\xD9\x91/.g.g/g;
$_ =~ s/\xD8\xBA/.g/g;

# 0640--064F
$_ =~ s/\xD9\x80/--/g;
$_ =~ s/\xD9\x81\xD9\x91/ff/g;
$_ =~ s/\xD9\x81/f/g;
$_ =~ s/\xD9\x82\xD9\x91/qq/g;
$_ =~ s/\xD9\x82/q/g;
$_ =~ s/\xD9\x83\xD9\x91/kk/g;
$_ =~ s/\xD9\x83/k/g;
$_ =~ s/\xD9\x84\xD9\x91/ll/g;
$_ =~ s/\xD9\x84/l/g;
$_ =~ s/\xD9\x85\xD9\x91/mm/g;
$_ =~ s/\xD9\x85/m/g;
$_ =~ s/\xD9\x86\xD9\x91/nn/g;
$_ =~ s/\xD9\x86/n/g;
$_ =~ s/\xD9\x87\xD9\x91/hh/g;
$_ =~ s/\xD9\x87/h/g;
$_ =~ s/\xD9\x88\xD9\x91/ww/g;
$_ =~ s/\xD9\x88/w/g;
$_ =~ s/\xD9\x89\xD9\x91/YY/g;
$_ =~ s/\xD9\x89/Y/g;
$_ =~ s/\xD9\x8A\xD9\x91/yy/g;
$_ =~ s/\xD9\x8A/y/g;
$_ =~ s/\xD9\x8B/aN/g;
$_ =~ s/\xD9\x8C/uN/g;
$_ =~ s/\xD9\x8D/iN/g;
$_ =~ s/\xD9\x8E/a/g;
$_ =~ s/\xD9\x8F/u/g;

# 0650--065F
$_ =~ s/\xD9\x90/i/g;
$_ =~ s/\xD9\x92//g;
# the next rule can never match: the previous one has already removed \xD9\x92
$_ =~ s/\xD9\x92/~/g;

# 0660--066F
$_ =~ s/\xD9\xA0/0/g;
$_ =~ s/\xD9\xA1/1/g;
$_ =~ s/\xD9\xA2/2/g;
$_ =~ s/\xD9\xA3/3/g;
$_ =~ s/\xD9\xA4/4/g;
$_ =~ s/\xD9\xA5/5/g;
$_ =~ s/\xD9\xA6/6/g;
$_ =~ s/\xD9\xA7/7/g;
$_ =~ s/\xD9\xA8/8/g;
$_ =~ s/\xD9\xA9/9/g;
$_ =~ s/\xD9\xAA/%/g;
$_ =~ s/\xD9\xAB/./g;
$_ =~ s/\xD9\xAC/,/g;

# 0670--067F
$_ =~ s/\xD9\xB0/aa/g;
$_ =~ s/\xD9\xB1/WA/g;
$_ =~ s/\xD9\xBE/p/g;

# 0680--068F
$_ =~ s/\xDA\x86/^c/g;

# 0690--069F
$_ =~ s/\xDA\x98/^z/g;

# 06A0--06AF
$_ =~ s/\xDA\xA4/v/g;
$_ =~ s/\xDA\xAF/g/g;

# 06B0--06BF
$_ =~ s/\xDA\xBE/h/g;

# 06C0--06CF
$_ =~ s/\xDB\x80/e/g;
$_ =~ s/\xDB\x81/h/g;
$_ =~ s/\xDB\x82/e/g;
$_ =~ s/\xDB\x83/T/g;
$_ =~ s/\xDB\xAA/ii/g;

# 06D0--06DF
$_ =~ s/\xDB\x93/E/g;
$_ =~ s/\xDB\x92/c/g;

# 06E0--06EF
# $_ =~ s/\xD8\xA7\xDB\xA4/~A/g;
$_ =~ s/\xDB\xA4/~/g;

# weird
# temporary tag because this 0-width non joiner space creates an ambiguity
$_ =~ s/\xE2\x80\x8C/\@/g;

print {$out} $_;    # and this writes the result into file "new.tex"
}

close($out) or die "Cannot close new.tex: $!";
========================================================

-- 
Professor Idris Samawi Hamid
Department of Philosophy
Colorado State University
Fort Collins, CO 80523
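
A note on rule order: the script depends on each letter-plus-shaddah rule running before the rule for the bare letter. Below is a minimal, table-driven sketch of that ordering principle. It is a different arrangement from the script above, not a replacement for it. The names @rules and transcribe are hypothetical, and the table holds only a handful of mappings taken from the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Ordered rules: [ UTF-8 byte sequence, transcription ].
# Order matters: a letter followed by shaddah must be handled
# before the bare letter, or the doubling would be lost.
# @rules and transcribe are hypothetical names for this sketch.
my @rules = (
    [ "\xD8\xA8\xD9\x91", 'bb' ],    # ba + shaddah
    [ "\xD8\xA8",         'b'  ],    # ba
    [ "\xD8\xAA\xD9\x91", 'tt' ],    # ta + shaddah
    [ "\xD8\xAA",         't'  ],    # ta
    [ "\xD9\x8E",         'a'  ],    # fathah
);

# Apply every rule in table order to a string of raw UTF-8 bytes.
sub transcribe {
    my ($text) = @_;
    for my $rule (@rules) {
        my ($from, $to) = @$rule;
        $text =~ s/\Q$from\E/$to/g;
    }
    return $text;
}

# Input bytes: ba, shaddah, fathah, ta.
my $result = transcribe("\xD8\xA8\xD9\x91\xD9\x8E\xD8\xAA");
print "$result\n";    # prints "bbat"
```

Swapping the first two entries of the table would turn the ba into "b" before the shaddah rule could see it, which is the situation the ordering in the script avoids.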