From: Idris Samawi Hamid <ishamid@colostate.edu>
Cc: "aleph@ntg.nl" <aleph@ntg.nl>
Subject: Arabic utf 2 transcription
Date: Mon, 07 Jun 2004 13:31:10 -0600
Message-ID: <opr88oh8ibu9mfh0@lamar.colostate.edu>
Hi cohorts,
Thank you to everyone who helped me with this. I now have a working script
for converting Arabic utf-8 to transcription. This is still pretty
experimental but I've tried it on some rather large files.
I added a couple of small features to make some things like the definite
article more readable. This experiment has also led to some useful
improvements in my own transcription scheme (based on Lagally's ArabTeX).
For example, Lagally's system incorporates some Arabic orthography rules
for hamzah; these are moot when dealing with utf-8 for the most part, so I
added a direct transcription for each hamzah-carrier. Of course you will
have to adjust what follows to fit your own pet transcription.
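A minimal sketch of what the direct hamzah-carrier mapping amounts to. The tags (', x, o, c, C) are the ones used in the script below; the hash %hamzah and the helper transcribe_hamzah are hypothetical names introduced only for this illustration, and the script itself uses one substitution per line instead of a table.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hamzah carriers as UTF-8 byte sequences, each mapped directly to its tag.
# No orthography rules are applied: the carrier is already explicit in the
# input, so a plain lookup is enough.
my %hamzah = (
    "\xD8\xA1" => "'",   # U+0621 hamzah on the line
    "\xD8\xA3" => "x",   # U+0623 alif with hamzah above
    "\xD8\xA4" => "o",   # U+0624 waw with hamzah
    "\xD8\xA5" => "c",   # U+0625 alif with hamzah below
    "\xD8\xA6" => "C",   # U+0626 ya' with hamzah
);

# Hypothetical helper: replace every hamzah carrier in a byte string.
sub transcribe_hamzah {
    my ($bytes) = @_;
    my $pattern = join '|', map { quotemeta($_) } sort keys %hamzah;
    $bytes =~ s/($pattern)/$hamzah{$1}/g;
    return $bytes;
}

print transcribe_hamzah("\xD8\xA3\xD8\xA4\xD8\xA6"), "\n";   # prints "xoC"
```

To fit another transcription scheme, only the right-hand side of each table entry would need to change.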
For some of the small features to work (such as making the definite
article more explicit), every line of the file being processed should be
indented by at least one space. The article patterns require a preceding
space, so every word beginning with the definite article is affected,
including the first word on a line, while mid-word occurrences are ignored.
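The role of the leading space can be seen in a small sketch. Only the substitution is taken from the script below; the helper name article_b and the idea of prepending the space inside the program (rather than editing the input file by hand) are assumptions made for illustration, and they mean every output line starts with a space.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: make "al-" + ba' explicit at the start of words only.
sub article_b {
    my ($line) = @_;
    $line = " $line";   # shift the line over by one space
    # The pattern begins with \x20, so alif-lam-ba' is matched only when it
    # starts a word; the same bytes in the middle of a word are left alone.
    $line =~ s/\x20\xD8\xA7\xD9\x84\xD8\xA8/\x20al-b/g;
    return $line;
}

# Line-initial alif-lam-ba' is converted once the space has been added.
print article_b("\xD8\xA7\xD9\x84\xD8\xA8"), "\n";   # prints " al-b"
```

Without the added space, a word at the very start of a line would not match the pattern and the article would fall through to the letter-by-letter rules.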
Persian has not been extensively tested, and Urdu etc. is completely
missing. I hope this is useful in any case. You can download some real-life
Unicode Arabic samples here:
http://www.alabrar.info/ar/booklist.aspx
Usage: perl utf2tex.pl <your utf file>
creates file "new.tex"
Thanks again to everyone, especially Thomas A. Schmitz who got the ball
rolling.
Best
Idris
========================================================
#!/usr/bin/perl
use strict;
use warnings;
open(NEW, ">", "new.tex") or die "Cannot open new.tex: $!"; # opens the output file
while (<>) { #this opens the file for reading
# Allah
$_ =~ s/\xD8\xA7\xD9\x84\xD9\x84\xD9\x87/al-llaah/g;
$_ =~ s/\xD8\xA7\xD9\x84\xD9\x84\xD9\x91\xD9\x87/al-llaah/g;
$_ =~ s/\xD8\xA7\xD9\x84\xD9\x84\xD9\x91\xD9\xB0\xD9\x87/al-llaah/g;
#$_ =~ s/\xD9\x84\xD9\x84\xD9\x87/li-llaah/g;
#$_ =~ s/\xD9\x84\xD9\x84\xD9\x91\xD9\x87/li-llaah/g;
#$_ =~ s/\xD9\x84\xD9\x84\xD9\x87/la-llaah/g;
#$_ =~ s/\xD9\x84\xD9\x84\xD9\x91\xD9\x87/la-llaah/g;
$_ =~ s/\xD9\x84\xD9\x84\xD9\x87/l-llaah/g;
$_ =~ s/\xD9\x84\xD9\x84\xD9\x91\xD9\x87/l-llaah/g; #
# begin exceptions
# $_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x8A\x20/\x20 'ilY/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x8A\x20/\x20'ilY\x20/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x8A/\x20'ily/g;
$_ =~ s/\x20\xD9\x88\x20/\x20wa\x20/g;
# $_ =~ s/\x20\x/\x20wa\x20/g;
$_ =~ s/\xD8\xA7\xD9\x8B/aN/g;
# end exceptions
# definite article
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAA\xD9\x91/\x20al-tt/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAB\xD9\x91/\x20al-_t_t/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAF\xD9\x91/\x20al-dd/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB0\xD9\x91/\x20al-_d_d/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB1\xD9\x91/\x20al-rr/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB2\xD9\x91/\x20al-zz/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB3\xD9\x91/\x20al-ss/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB4\xD9\x91/\x20al-^s^s/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB5\xD9\x91/\x20al-.s.s/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB6\xD9\x91/\x20al-.d.d/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB7\xD9\x91/\x20al-.t.t/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB8\xD9\x91/\x20al-.z.z/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x84\xD9\x91/\x20al-ll/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x86\xD9\x91/\x20al-nn/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xA7/\x20al-A/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xA8/\x20al-b/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAA/\x20al-t/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAB/\x20al-_t/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAC/\x20al-j/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAD/\x20al-.h/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAE/\x20al-_h/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAF/\x20al-d/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB0/\x20al-_d/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB1/\x20al-r/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB2/\x20al-z/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB3/\x20al-s/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB4/\x20al-^s/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB5/\x20al-.s/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB6/\x20al-.d/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB7/\x20al-.t/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB8/\x20al-.z/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB9/\x20al-`/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xBA/\x20al-.g/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x81/\x20al-f/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x82/\x20al-q/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x83/\x20al-k/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x84/\x20al-l/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x85/\x20al-m/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x86/\x20al-n/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x87/\x20al-h/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x88/\x20al-w/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x8A/\x20al-y/g;
# 0601--060F
$_ =~ s/\xD8\x8C/,/g;
# 0610--061F
$_ =~ s/\xD8\x9B/;/g;
$_ =~ s/\xD8\x9F/?/g;
# 0620--062F
$_ =~ s/\xD8\xA1\xD9\x91/''/g;
$_ =~ s/\xD8\xA1/'/g;
$_ =~ s/\x20\xD8\xA2/ 'aa/g;
$_ =~ s/\xD8\xA2/~A/g;
$_ =~ s/\xD8\xA3\xD9\x91/xx/g;
$_ =~ s/\xD8\xA3/x/g;
$_ =~ s/\xD8\xA4\xD9\x91/oo/g;
$_ =~ s/\xD8\xA4/o/g;
$_ =~ s/\xD8\xA5\xD9\x91/cc/g;
$_ =~ s/\xD8\xA5/c/g;
$_ =~ s/\xD8\xA6\xD9\x91/CC/g;
$_ =~ s/\xD8\xA6/C/g;
$_ =~ s/\xD8\xA7/A/g;
$_ =~ s/\xD8\xA8\xD9\x91/bb/g;
$_ =~ s/\xD8\xA8/b/g;
$_ =~ s/\xD8\xA9/T/g;
$_ =~ s/\xD8\xAA\xD9\x91/tt/g;
$_ =~ s/\xD8\xAA/t/g;
$_ =~ s/\xD8\xAB\xD9\x91/_t_t/g;
$_ =~ s/\xD8\xAB/_t/g;
$_ =~ s/\xD8\xAC\xD9\x91/jj/g;
$_ =~ s/\xD8\xAC/j/g;
$_ =~ s/\xD8\xAD\xD9\x91/.h.h/g;
$_ =~ s/\xD8\xAD/.h/g;
$_ =~ s/\xD8\xAE\xD9\x91/_h_h/g;
$_ =~ s/\xD8\xAE/_h/g;
$_ =~ s/\xD8\xAF\xD9\x91/dd/g;
$_ =~ s/d\xD9\x91/dd/g;
$_ =~ s/\xD8\xAF/d/g;
# 0630--063F
$_ =~ s/\xD8\xB0\xD9\x91/_d_d/g;
$_ =~ s/\xD8\xB0/_d/g;
$_ =~ s/\xD8\xB1\xD9\x91/rr/g;
$_ =~ s/\xD8\xB1/r/g;
$_ =~ s/\xD8\xB2\xD9\x91/zz/g;
$_ =~ s/\xD8\xB2/z/g;
$_ =~ s/\xD8\xB3\xD9\x91/ss/g;
$_ =~ s/\xD8\xB3/s/g;
$_ =~ s/\xD8\xB4\xD9\x91/^s^s/g;
$_ =~ s/\xD8\xB4/^s/g;
$_ =~ s/\xD8\xB5\xD9\x91/.s.s/g;
$_ =~ s/\xD8\xB5/.s/g;
$_ =~ s/\xD8\xB6\xD9\x91/.d.d/g;
$_ =~ s/\xD8\xB6/.d/g;
$_ =~ s/\xD8\xB7\xD9\x91/.t.t/g;
$_ =~ s/\xD8\xB7/.t/g;
$_ =~ s/\xD8\xB8\xD9\x91/.z.z/g;
$_ =~ s/\xD8\xB8/.z/g;
$_ =~ s/\xD8\xB9\xD9\x91/``/g;
$_ =~ s/\xD8\xB9/`/g;
$_ =~ s/\xD8\xBA\xD9\x91/.g.g/g;
$_ =~ s/\xD8\xBA/.g/g;
# 0640--064F
$_ =~ s/\xD9\x80/--/g;
$_ =~ s/\xD9\x81\xD9\x91/ff/g;
$_ =~ s/\xD9\x81/f/g;
$_ =~ s/\xD9\x82\xD9\x91/qq/g;
$_ =~ s/\xD9\x82/q/g;
$_ =~ s/\xD9\x83\xD9\x91/kk/g;
$_ =~ s/\xD9\x83/k/g;
$_ =~ s/\xD9\x84\xD9\x91/ll/g;
$_ =~ s/\xD9\x84/l/g;
$_ =~ s/\xD9\x85\xD9\x91/mm/g;
$_ =~ s/\xD9\x85/m/g;
$_ =~ s/\xD9\x86\xD9\x91/nn/g;
$_ =~ s/\xD9\x86/n/g;
$_ =~ s/\xD9\x87\xD9\x91/hh/g;
$_ =~ s/\xD9\x87/h/g;
$_ =~ s/\xD9\x88\xD9\x91/ww/g;
$_ =~ s/\xD9\x88/w/g;
$_ =~ s/\xD9\x89\xD9\x91/YY/g;
$_ =~ s/\xD9\x89/Y/g;
$_ =~ s/\xD9\x8A\xD9\x91/yy/g;
$_ =~ s/\xD9\x8A/y/g;
$_ =~ s/\xD9\x8B/aN/g;
$_ =~ s/\xD9\x8C/uN/g;
$_ =~ s/\xD9\x8D/iN/g;
$_ =~ s/\xD9\x8E/a/g;
$_ =~ s/\xD9\x8F/u/g;
# 0650--065F
$_ =~ s/\xD9\x90/i/g;
$_ =~ s/\xD9\x92//g;
# $_ =~ s/\xD9\x92/~/g; # never matches: the line above already removes U+0652
# 0660--066F
$_ =~ s/\xD9\xA0/0/g;
$_ =~ s/\xD9\xA1/1/g;
$_ =~ s/\xD9\xA2/2/g;
$_ =~ s/\xD9\xA3/3/g;
$_ =~ s/\xD9\xA4/4/g;
$_ =~ s/\xD9\xA5/5/g;
$_ =~ s/\xD9\xA6/6/g;
$_ =~ s/\xD9\xA7/7/g;
$_ =~ s/\xD9\xA8/8/g;
$_ =~ s/\xD9\xA9/9/g;
$_ =~ s/\xD9\xAA/%/g;
$_ =~ s/\xD9\xAB/./g;
$_ =~ s/\xD9\xAC/,/g;
# 0670--067F
$_ =~ s/\xD9\xB0/aa/g;
$_ =~ s/\xD9\xB1/WA/g;
$_ =~ s/\xD9\xBE/p/g;
# 0680--068F
$_ =~ s/\xDA\x86/^c/g;
# 0690--069F
$_ =~ s/\xDA\x98/^z/g;
# 06A0--06AF
$_ =~ s/\xDA\xA4/v/g;
$_ =~ s/\xDA\xAF/g/g;
# 06B0--06BF
$_ =~ s/\xDA\xBE/h/g;
# 06C0--06CF
$_ =~ s/\xDB\x80/e/g;
$_ =~ s/\xDB\x81/h/g;
$_ =~ s/\xDB\x82/e/g;
$_ =~ s/\xDB\x83/T/g;
$_ =~ s/\xDB\xAA/ii/g;
# 06D0--06DF
$_ =~ s/\xDB\x93/E/g;
$_ =~ s/\xDB\x92/c/g;
# 06E0--06EF
# $_ =~ s/\xD8\xA7\xDB\xA4/~A/g;
$_ =~ s/\xDB\xA4/~/g;
# weird
# temporary tag because this 0-width non-joiner space creates an ambiguity
$_ =~ s/\xE2\x80\x8C/@/g;
print NEW "$_"; #and this writes the result into file "new.tex"
}
close(NEW);
========================================================
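The order of the substitutions in the script above matters: each letter-plus-shaddah rule comes before the rule for the bare letter, otherwise the bare-letter rule would consume the letter and leave the shaddah orphaned. A minimal table-driven sketch of the same idea follows; the names @rules and apply_rules are hypothetical, and only a handful of the mappings are reproduced.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Ordered list of [UTF-8 bytes, transcription] pairs. An array, not a hash,
# because the rules must be applied in exactly this order.
my @rules = (
    [ "\xD8\xA8\xD9\x91", "bb" ],   # ba' + shaddah, before bare ba'
    [ "\xD8\xA8",         "b"  ],   # ba'
    [ "\xD8\xAA\xD9\x91", "tt" ],   # ta' + shaddah, before bare ta'
    [ "\xD8\xAA",         "t"  ],   # ta'
    [ "\xD9\x8E",         "a"  ],   # fathah
);

# Hypothetical helper: apply every rule in turn to a byte string.
sub apply_rules {
    my ($bytes) = @_;
    for my $rule (@rules) {
        my ($from, $to) = @$rule;
        $bytes =~ s/\Q$from\E/$to/g;
    }
    return $bytes;
}

# ba' + shaddah + fathah
print apply_rules("\xD8\xA8\xD9\x91\xD9\x8E"), "\n";   # prints "bba"
```

Adapting the table to another transcription scheme then means editing the second element of each pair, while the ordering constraint stays the same.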
--
Professor Idris Samawi Hamid
Department of Philosophy
Colorado State University
Fort Collins, CO 80523