ntg-context - mailing list for ConTeXt users
* Arabic utf 2 transcription
@ 2004-06-07 19:31 Idris Samawi Hamid
From: Idris Samawi Hamid @ 2004-06-07 19:31 UTC (permalink / raw)
  Cc: aleph

Hi cohorts,

Thank you to everyone who helped me with this. I now have a working script 
for converting Arabic utf-8 to transcription. This is still pretty 
experimental but I've tried it on some rather large files.

I added a couple of small features to make some things like the definite 
article more readable. This experiment has also led to some useful 
improvements in my own transcription scheme (based on Lagally's ArabTeX). 
For example, Lagally's system incorporates some Arabic orthography rules 
for hamzah; these are moot when dealing with utf-8 for the most part, so I 
added a direct transcription for each hamzah-carrier. Of course you will 
have to adjust what follows to fit your own pet transcription.
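
As a rough illustration of the direct hamzah-carrier mapping, here is a minimal sketch that uses the same one-character codes as the script below. The names %hamzah and transcribe_hamzah are hypothetical and not part of utf2tex.pl; the doubled (shaddah) forms and alif with maddah are left out.

```perl
use strict;
use warnings;

# Hamzah carriers as UTF-8 byte sequences, mapped to the codes used in the
# script below. Adjust the right-hand side to your own transcription.
my %hamzah = (
    "\xD8\xA1" => "'",    # U+0621 hamzah on the line
    "\xD8\xA3" => "x",    # U+0623 alif with hamzah above
    "\xD8\xA4" => "o",    # U+0624 waw with hamzah above
    "\xD8\xA5" => "c",    # U+0625 alif with hamzah below
    "\xD8\xA6" => "C",    # U+0626 ya with hamzah above
);

# Hypothetical helper: replace every carrier found in a byte string.
sub transcribe_hamzah {
    my ($bytes) = @_;
    $bytes =~ s/(\xD8[\xA1\xA3-\xA6])/$hamzah{$1}/g;
    return $bytes;
}
```

No orthography rule is consulted here: each carrier is looked up directly, which is the point of the change.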

For some of the small features to work (such as making the definite 
article more explicit), the entire file being processed should be shifted 
right by at least one space. The article rules match only after a space, so 
every word beginning with the definite article is affected, including at 
the start of a line, while mid-word occurrences are ignored.
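
A minimal sketch of that shift, assuming the whole text is held in one string. The helper names indent_lines and article_qaf are hypothetical; the rule inside article_qaf is copied from the script below.

```perl
use strict;
use warnings;

# Hypothetical helper: put one space in front of every line, so that a
# line-initial definite article is preceded by \x20.
sub indent_lines {
    my ($text) = @_;
    $text =~ s/^/ /mg;
    return $text;
}

# One rule from the script below: space + alif + lam + qaf.
sub article_qaf {
    my ($line) = @_;
    $line =~ s/\x20\xD8\xA7\xD9\x84\xD9\x82/\x20al-q/g;
    return $line;
}
```

The same shift can be done from the shell before running the script, for example with perl -pe 's/^/ /' input.txt > shifted.txt (both file names are placeholders).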

Persian has not been extensively tested, and Urdu, etc., is completely 
missing. Hope this is useful in any case. You can download some real-life 
Unicode Arabic samples here:

http://www.alabrar.info/ar/booklist.aspx

Usage: perl utf2tex.pl <your utf file>

creates file "new.tex"

Thanks again to everyone, especially Thomas A. Schmitz who got the ball 
rolling.

Best
Idris

========================================================
#!/usr/bin/perl

use strict;
use warnings;

open(NEW,">new.tex") or die "Cannot open new.tex: $!"; #opens file to print out the result

while (<>) { #this opens the file for reading

# Allah

$_ =~ s/\xD8\xA7\xD9\x84\xD9\x84\xD9\x87/al-llaah/g;
$_ =~ s/\xD8\xA7\xD9\x84\xD9\x84\xD9\x91\xD9\x87/al-llaah/g;
$_ =~ s/\xD8\xA7\xD9\x84\xD9\x84\xD9\x91\xD9\xB0\xD9\x87/al-llaah/g;
#$_ =~ s/\xD9\x84\xD9\x84\xD9\x87/li-llaah/g;
#$_ =~ s/\xD9\x84\xD9\x84\xD9\x91\xD9\x87/li-llaah/g;
#$_ =~ s/\xD9\x84\xD9\x84\xD9\x87/la-llaah/g;
#$_ =~ s/\xD9\x84\xD9\x84\xD9\x91\xD9\x87/la-llaah/g;
$_ =~ s/\xD9\x84\xD9\x84\xD9\x87/l-llaah/g;
$_ =~ s/\xD9\x84\xD9\x84\xD9\x91\xD9\x87/l-llaah/g; #

# begin exceptions

# $_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x8A\x20/\x20 'ilY/g;
$_ =~ s/\x20الي\x20/\x20'ilY\x20/g;
$_ =~ s/\x20الي/\x20'ily/g;
$_ =~ s/\x20\xD9\x88\x20/\x20wa\x20/g;
# $_ =~ s/\x20\x/\x20wa\x20/g;
$_ =~ s/\xD8\xA7\xD9\x8B/aN/g;

# end exceptions

# definite article

$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAA\xD9\x91/\x20al-tt/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAB\xD9\x91/\x20al-_t_t/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAF\xD9\x91/\x20al-dd/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB0\xD9\x91/\x20al-_d_d/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB1\xD9\x91/\x20al-rr/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB2\xD9\x91/\x20al-zz/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB3\xD9\x91/\x20al-ss/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB4\xD9\x91/\x20al-^s^s/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB5\xD9\x91/\x20al-.s.s/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB6\xD9\x91/\x20al-.d.d/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB7\xD9\x91/\x20al-.t.t/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB8\xD9\x91/\x20al-.z.z/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x84\xD9\x91/\x20al-ll/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x86\xD9\x91/\x20al-nn/g;

$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xA7/\x20al-A/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xA8/\x20al-b/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAA/\x20al-t/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAB/\x20al-_t/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAC/\x20al-j/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAD/\x20al-.h/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAE/\x20al-_h/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xAF/\x20al-d/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB0/\x20al-_d/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB1/\x20al-r/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB2/\x20al-z/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB3/\x20al-s/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB4/\x20al-^s/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB5/\x20al-.s/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB6/\x20al-.d/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB7/\x20al-.t/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB8/\x20al-.z/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xB9/\x20al-`/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD8\xBA/\x20al-.g/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x81/\x20al-f/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x82/\x20al-q/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x83/\x20al-k/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x84/\x20al-l/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x85/\x20al-m/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x86/\x20al-n/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x87/\x20al-h/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x88/\x20al-w/g;
$_ =~ s/\x20\xD8\xA7\xD9\x84\xD9\x8A/\x20al-y/g;

# 0601--060F

$_ =~ s/\xD8\x8C/,/g;

# 0610--061F

$_ =~ s/\xD8\x9B/;/g;
$_ =~ s/\xD8\x9F/?/g;

# 0620--062F

$_ =~ s/\xD8\xA1\xD9\x91/''/g;
$_ =~ s/\xD8\xA1/'/g;
$_ =~ s/\x20\xD8\xA2/ 'aa/g;
$_ =~ s/\xD8\xA2/~A/g;
$_ =~ s/\xD8\xA3\xD9\x91/xx/g;
$_ =~ s/\xD8\xA3/x/g;
$_ =~ s/\xD8\xA4\xD9\x91/oo/g;
$_ =~ s/\xD8\xA4/o/g;
$_ =~ s/\xD8\xA5\xD9\x91/cc/g;
$_ =~ s/\xD8\xA5/c/g;
$_ =~ s/\xD8\xA6\xD9\x91/CC/g;
$_ =~ s/\xD8\xA6/C/g;
$_ =~ s/\xD8\xA7/A/g;
$_ =~ s/\xD8\xA8\xD9\x91/bb/g;
$_ =~ s/\xD8\xA8/b/g;
$_ =~ s/\xD8\xA9/T/g;
$_ =~ s/\xD8\xAA\xD9\x91/tt/g;
$_ =~ s/\xD8\xAA/t/g;
$_ =~ s/\xD8\xAB\xD9\x91/_t_t/g;
$_ =~ s/\xD8\xAB/_t/g;
$_ =~ s/\xD8\xAC\xD9\x91/jj/g;
$_ =~ s/\xD8\xAC/j/g;
$_ =~ s/\xD8\xAD\xD9\x91/.h.h/g;
$_ =~ s/\xD8\xAD/.h/g;
$_ =~ s/\xD8\xAE\xD9\x91/_h_h/g;
$_ =~ s/\xD8\xAE/_h/g;
$_ =~ s/\xD8\xAF\xD9\x91/dd/g;
$_ =~ s/d\xD9\x91/dd/g;
$_ =~ s/\xD8\xAF/d/g;

# 0630--063F

$_ =~ s/\xD8\xB0\xD9\x91/_d_d/g;
$_ =~ s/\xD8\xB0/_d/g;
$_ =~ s/\xD8\xB1\xD9\x91/rr/g;
$_ =~ s/\xD8\xB1/r/g;
$_ =~ s/\xD8\xB2\xD9\x91/zz/g;
$_ =~ s/\xD8\xB2/z/g;
$_ =~ s/\xD8\xB3\xD9\x91/ss/g;
$_ =~ s/\xD8\xB3/s/g;
$_ =~ s/\xD8\xB4\xD9\x91/^s^s/g;
$_ =~ s/\xD8\xB4/^s/g;
$_ =~ s/\xD8\xB5\xD9\x91/.s.s/g;
$_ =~ s/\xD8\xB5/.s/g;
$_ =~ s/\xD8\xB6\xD9\x91/.d.d/g;
$_ =~ s/\xD8\xB6/.d/g;
$_ =~ s/\xD8\xB7\xD9\x91/.t.t/g;
$_ =~ s/\xD8\xB7/.t/g;
$_ =~ s/\xD8\xB8\xD9\x91/.z.z/g;
$_ =~ s/\xD8\xB8/.z/g;
$_ =~ s/\xD8\xB9\xD9\x91/``/g;
$_ =~ s/\xD8\xB9/`/g;
$_ =~ s/\xD8\xBA\xD9\x91/.g.g/g;
$_ =~ s/\xD8\xBA/.g/g;

# 0640--064F

$_ =~ s/\xD9\x80/--/g;
$_ =~ s/\xD9\x81\xD9\x91/ff/g;
$_ =~ s/\xD9\x81/f/g;
$_ =~ s/\xD9\x82\xD9\x91/qq/g;
$_ =~ s/\xD9\x82/q/g;
$_ =~ s/\xD9\x83\xD9\x91/kk/g;
$_ =~ s/\xD9\x83/k/g;
$_ =~ s/\xD9\x84\xD9\x91/ll/g;
$_ =~ s/\xD9\x84/l/g;
$_ =~ s/\xD9\x85\xD9\x91/mm/g;
$_ =~ s/\xD9\x85/m/g;
$_ =~ s/\xD9\x86\xD9\x91/nn/g;
$_ =~ s/\xD9\x86/n/g;
$_ =~ s/\xD9\x87\xD9\x91/hh/g;
$_ =~ s/\xD9\x87/h/g;
$_ =~ s/\xD9\x88\xD9\x91/ww/g;
$_ =~ s/\xD9\x88/w/g;
$_ =~ s/\xD9\x89\xD9\x91/YY/g;
$_ =~ s/\xD9\x89/Y/g;
$_ =~ s/\xD9\x8A\xD9\x91/yy/g;
$_ =~ s/\xD9\x8A/y/g;
$_ =~ s/\xD9\x8B/aN/g;
$_ =~ s/\xD9\x8C/uN/g;
$_ =~ s/\xD9\x8D/iN/g;
$_ =~ s/\xD9\x8E/a/g;
$_ =~ s/\xD9\x8F/u/g;

# 0650--065F

$_ =~ s/\xD9\x90/i/g;
$_ =~ s/\xD9\x92//g;
#$_ =~ s/\xD9\x92/~/g; # alternative for sukun; never reached while the line above is active

# 0660--066F

$_ =~ s/\xD9\xA0/0/g;
$_ =~ s/\xD9\xA1/1/g;
$_ =~ s/\xD9\xA2/2/g;
$_ =~ s/\xD9\xA3/3/g;
$_ =~ s/\xD9\xA4/4/g;
$_ =~ s/\xD9\xA5/5/g;
$_ =~ s/\xD9\xA6/6/g;
$_ =~ s/\xD9\xA7/7/g;
$_ =~ s/\xD9\xA8/8/g;
$_ =~ s/\xD9\xA9/9/g;
$_ =~ s/\xD9\xAA/%/g;
$_ =~ s/\xD9\xAB/./g;
$_ =~ s/\xD9\xAC/,/g;

# 0670--067F

$_ =~ s/\xD9\xB0/aa/g;
$_ =~ s/\xD9\xB1/WA/g;
$_ =~ s/\xD9\xBE/p/g;

# 0680--068F

$_ =~ s/\xDA\x86/^c/g;

# 0690--069F

$_ =~ s/\xDA\x98/^z/g;

# 06A0--06AF

$_ =~ s/\xDA\xA4/v/g;
$_ =~ s/\xDA\xAF/g/g;

# 06B0--06BF

$_ =~ s/\xDA\xBE/h/g;

# 06C0--06CF

$_ =~ s/\xDB\x80/e/g;
$_ =~ s/\xDB\x81/h/g;
$_ =~ s/\xDB\x82/e/g;
$_ =~ s/\xDB\x83/T/g;
$_ =~ s/\xDB\xAA/ii/g;

# 06D0--06DF

$_ =~ s/\xDB\x93/E/g;
$_ =~ s/\xDB\x92/c/g;

# 06E0--06EF

# $_ =~ s/\xD8\xA7\xDB\xA4/~A/g;
$_ =~ s/\xDB\xA4/~/g;

# weird

$_ =~ s/\xE2\x80\x8C/@/g; # temporary tag because this 0-width non joiner
# space creates an ambiguity

print NEW "$_"; #and this writes the result into file "new.tex"
}

close(NEW);
========================================================
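
The script above depends on rule order: each consonant-plus-shaddah rule comes before the rule for the bare consonant. Below is a minimal sketch of a table-driven variant that gets the same effect by trying the longest keys first. The names %map and transcribe are hypothetical, and only a handful of rules are copied over. Unlike the sequential script, it makes a single pass, so text that has already been replaced is never rescanned.

```perl
use strict;
use warnings;

# A small excerpt of the rules above, keyed by UTF-8 byte sequence.
my %map = (
    "\xD8\xA8\xD9\x91" => "bb",    # ba + shaddah
    "\xD8\xA8"         => "b",     # ba
    "\xD8\xAA\xD9\x91" => "tt",    # ta + shaddah
    "\xD8\xAA"         => "t",     # ta
    "\xD9\x8E"         => "a",     # fathah
    "\xD9\x90"         => "i",     # kasrah
);

# Longest keys first, so "consonant + shaddah" wins over the bare consonant.
my $alternation = join "|",
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %map;
my $re = qr/($alternation)/;

# Hypothetical helper: transcribe a byte string in one pass.
sub transcribe {
    my ($bytes) = @_;
    $bytes =~ s/$re/$map{$1}/g;
    return $bytes;
}
```

Context-sensitive rules such as the definite article would still need their own patterns ahead of this table.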

-- 
Professor Idris Samawi Hamid
Department of Philosophy
Colorado State University
Fort Collins, CO 80523
