ntg-context - mailing list for ConTeXt users
* Arabic-utf-8 (plus a sample)
@ 2004-06-05 19:32 Idris Samawi Hamid
  2004-06-05 20:41 ` Thomas A. Schmitz
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Idris Samawi Hamid @ 2004-06-05 19:32 UTC (permalink / raw)
  Cc: aleph

Hi gang,

For Arabic we use a Latin transcription in Aleph/(e-)Omega (or even 
ArabTeX) unless one of the encoding filters like utf-8 is used. Even for 
utf-8 files, however, it would be very useful to be able to convert a 
utf-8 file to Latin transcription for further processing by 
Aleph/(e-)Omega. For example, adding diacritics is much easier to do in 
Latin than in an Arabic script editor because Latin transcription is 
one-dimensional and adding diacritics to Arabic is a two-dimensional affair.

The best thing would be a perl script but I don't know perl at all (except 
to run some precreated scripts). If someone out of the kindness of 
their heart could write a short and simple script for just seven 
characters I could do the rest myself and present it back here.

Now all of the Arabic characters in utf-8 can be represented by extended 
ascii. I need something like this, that converts every extended ascii 
representation of Arabic utf-8 into a Latin transcription:

ا => A

ب => b

ج => j

د => d

Ù‡ => h

Ùˆ => w

ز => z

If someone could write a perl script that can accomplish the above 
conversion, I can manually fill in the rest of the script. Basically I use 
a modified version of the ArabTeX transcription.

Here is a "gift" in return: a sample utf-8 Arabic file that can be 
processed by Aleph/(e-)Omega in ConTeXt (you will probably need to dvips 
this, though some dvi-viewers can do the postscript/16-bit thing):

==============================================
\hoffset=0pt % for Omega bug: has this been fixed?

\def\ArabicUTF{\ocp\UTFArUni=inutf8 %% in88596
                %\ocp\UTFArUni=in88596
                \ocp\UniCUni=uni2cuni
                \ocp\CUniArab=cuni2oar
                \ocplist\UTFArOCP=
                \addbeforeocplist 1 \UTFArUni
                \addbeforeocplist 1 \UniCUni
                \addbeforeocplist 1 \CUniArab
                \nullocplist
                \pushocplist\UTFArOCP}

\input m-gamma.tex
\input type-omg.tex
\switchtobodyfont[omarb,12pt] %

\textdir TRT%
\pardir TRT%
\ArabicUTF

\starttext

، ؛ ؟ ء آ أ ؤ إ ئ ا ب ة ت ث ج ح خ د ذ ر ز س
ش ص ض ط ظ ع غ ـ ف ق ك ل م ن ه و ى ي

\blank[big]

%ً  ٌ  ٍ َ  ُ ِ ّ

ْ ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩ ٪ ٫ ٬ ٰ ٱ ٲ ٳ ٴ ٵ ٶ ٷ
ٸ ٹ ٺ ٻ ټ ٽ پ ٿ ڀ ځ ڂ ڃ ڄ څ چ ڇ ڈ ډ ڊ ڋ ڌ ڍ
ڎ ڏ ڐ ڑ ڒ ړ ڔ ڕ ږ ڗ ژ ڙ ښ ڛ ڜ ڝ ڞ ڟ  ڢ ڡ ڢ ڣ
ڤ ڥ ڦ ڧ ڨ ک ڪ ګ ڬ ڭ ڮ گ ڰ ڱ ڲ ڳ ڴ ڵ ڶ ڷ ں ڻ
ڼ  ھ ۀ ہ ۃ ۄ ۅ ۆ ۇ ۈ  ۉ ۊ ۋ ی ې ۑ ے ۓ ۔ ە ۰
۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹

\blank[big]

ـً ـٌ  ـٍ ـَ  ـُ ـِ ـّ  ـْ ـٰ

ا ب ج د ه و ز

\stoptext

==============================================

Best
Idris

-- 
Professor Idris Samawi Hamid
Department of Philosophy
Colorado State University
Fort Collins, CO 80523


* Re: Arabic-utf-8 (plus a sample)
  2004-06-05 19:32 Arabic-utf-8 (plus a sample) Idris Samawi Hamid
@ 2004-06-05 20:41 ` Thomas A. Schmitz
  2004-06-05 21:33   ` Idris Samawi Hamid
  2004-06-05 23:08 ` [SPAM: 3.411] Arabic-utf-8 (plus a sample) Richard MAHONEY
  2004-06-06 13:22 ` Arabic-utf-8 " George N. White III
  2 siblings, 1 reply; 18+ messages in thread
From: Thomas A. Schmitz @ 2004-06-05 20:41 UTC (permalink / raw)


Idris,

I know a bit of perl and would love to help. However, I fear that
sending us your stuff via mail will be a bit difficult because the utf-8
characters get transformed into gibberish. Could you send the hexadecimal
code of the characters you want to convert? Or I could simply give you
the syntax, and you'll know what to do. So here comes a perl script that
works for my Greek stuff; I don't see why it shouldn't work with Arabic:

==================================cut here

#!/usr/bin/perl -w

use strict;
use open ':utf8';

open(NEW,">new.tex"); #opens file to print out the result 

while (<>); { #this opens the file for reading

$_ =~
s/\x{HEXADECIMAL_VALUE_OF_CHARACTER}/\x{HEXADECIMAL_VALUE_OF_NEW_CHARACTER}/esg;
#this is the actual conversion

print NEW "$_";
#and this writes the result into file "new.tex"
}

close(NEW);

==================================and here

Make the script executable and call it with the name of a file as an
argument.

HTH

Thomas


* Re: Arabic-utf-8 (plus a sample)
  2004-06-05 20:41 ` Thomas A. Schmitz
@ 2004-06-05 21:33   ` Idris Samawi Hamid
  2004-06-05 21:48     ` Thomas A. Schmitz
  0 siblings, 1 reply; 18+ messages in thread
From: Idris Samawi Hamid @ 2004-06-05 21:33 UTC (permalink / raw)


On Sat, 05 Jun 2004 22:41:39 +0200, Thomas A. Schmitz 
<thomas.schmitz@uni-bonn.de> wrote:

> Idris,
>
> I know a bit of perl and would love to help. However, I fear that
> sending us your stuff via mail will be a bit difficult because the utf-8
> characters get transformed into gibberish.

Thnx 4 such a speedy reply! I don't think you are getting gibberish 
though; you should be getting the extended ascii representation. So the 
letter alif (hex 0627) should look like this:

ا

Do you get a forward-slashed circle and a section symbol? If so, that's 
the ascii representation I'm trying to convert to the letter `A'.

Here are the codes you want:

ا [0627] => A

ب [0628] => b

ج [062C] => j

د [062F] => d

Ù‡ [0647] => h

Ùˆ [0648] => w

ز [0632] => z

Let me explain my situation more clearly:-)

I have a unicode editor, Unitype Global Writer. I save a unicode document 
as a utf *.txt file. When I open that saved file in my TeX editor 
(WinEdt), it comes out as extended ascii (that's the "gibberish"). So what 
I wanted to do was convert the ascii "gibberish" to my Latin 
transcription. It seems that what you are suggesting is to use the hex 
representation and convert the unicode txt file into a Latin transcription 
file directly and bypass the gibberish.

On your perl file, can you give me an example of how to use it? I tried it 
(in windows, with the script named utf2tex.pl and the unicode text in 
unicode-utf.txt) and get

=========================
> perl utf2tex.pl unicode-utf.txt
Unknown discipline class ':utf8' at C:/Perl/lib/open.pm line 18.
BEGIN failed--compilation aborted at utf2tex.pl line 4.
=========================

From your script I tried, e.g.:

============================
$_ =~
s/\x{0627}/\x{0041}/esg;
# from alif to `A'
============================

Your guidance will be greatly appreciated!

Thnx a million!
Idris

-- 
Professor Idris Samawi Hamid
Department of Philosophy
Colorado State University
Fort Collins, CO 80523


* Re: Arabic-utf-8 (plus a sample)
  2004-06-05 21:33   ` Idris Samawi Hamid
@ 2004-06-05 21:48     ` Thomas A. Schmitz
  2004-06-05 22:51       ` Idris Samawi Hamid
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas A. Schmitz @ 2004-06-05 21:48 UTC (permalink / raw)


Just a quick reply (it's bedtime over here): there may be 2 problems. One
is that the mail program put in an unwanted linebreak after the =~
part; just remove it, it should all be one line. And then: you'll need a
fairly recent version of perl for it to work. What do you get when you
do
perl --version
I guess for utf to work, it should be at least 5.8.0. Your basic idea of
the usage is right (I'm not a windows person, but I assume it should be
the same): save the script as utf2tex.pl, make it executable and call it
as utf2tex.pl FILENAME.txt.

I guess it would be easiest to convert the utf to ascii directly - that
would mean you could later convert it back. I have a set of scripts that
do just that -- convert babel Greek into utf-8 and back.

If you need more help, I'll look into it tomorrow!

Best

Thomas


* Re: Arabic-utf-8 (plus a sample)
  2004-06-05 21:48     ` Thomas A. Schmitz
@ 2004-06-05 22:51       ` Idris Samawi Hamid
  2004-06-05 23:15         ` Re[2]: " Giuseppe Bilotta
  0 siblings, 1 reply; 18+ messages in thread
From: Idris Samawi Hamid @ 2004-06-05 22:51 UTC (permalink / raw)


On Sat, 05 Jun 2004 23:48:18 +0200, Thomas A. Schmitz 
<thomas.schmitz@uni-bonn.de> wrote:

> Just a quick reply (it's bedtime over here): there may be 2 problems.

Ok, get some sleep;-) Anyhow, I fixed the line break (is the space between 
tilde and `s' correct?)

==============================
$_ =~ s/\x{0627}/\x{0041}/esg;
#this is the actual conversion
==============================

did not work though:-(

My perl version is v5.6.1; I went to the ActivePerl website and the only 
version they had is v5.6.1.638, so from perl.org I found Indigoperl and 
switched;-)

This solves part of the problem:-) Now I get

> perl utf2tex.pl unicode-utf.txt
syntax error at utf2tex.pl line 8, near ");"
Execution of utf2tex.pl aborted due to compilation errors.

line 8 is

while (<>); { #this opens the file for reading

Here is the whole file once again:

==================================
#!/usr/bin/perl -w

use strict;
use open ':utf8';

open(NEW,">new.tex"); #opens file to print out the result

while (<>); { #this opens the file for reading

$_ =~ s/\x{0627}/\x{0041}/esg;
#this is the actual conversion

print NEW "$_";
#and this writes the result into file "new.tex"
}

close(NEW);
==================================

> the usage is right (I'm not a windows person,

If WinEdt (and some other things) worked under WINE, I would not be a 
windows person either:-(

And will attempt yet another switch (I've lost count) to Linux-KDE 
sometime this Summer...

Thnx a million
Idris

-- 
Professor Idris Samawi Hamid
Department of Philosophy
Colorado State University
Fort Collins, CO 80523


* Re: [SPAM: 3.411] Arabic-utf-8 (plus a sample)
  2004-06-05 19:32 Arabic-utf-8 (plus a sample) Idris Samawi Hamid
  2004-06-05 20:41 ` Thomas A. Schmitz
@ 2004-06-05 23:08 ` Richard MAHONEY
  2004-06-06  0:19   ` Idris Samawi Hamid
  2004-06-06 13:22 ` Arabic-utf-8 " George N. White III
  2 siblings, 1 reply; 18+ messages in thread
From: Richard MAHONEY @ 2004-06-05 23:08 UTC (permalink / raw)


On Sat, Jun 05, 2004 at 01:32:35PM -0600, Idris Samawi Hamid wrote:
> Hi gang,
> 
> For Arabic we use a Latin transcription in Aleph/(e-)Omega (or even 
> ArabTeX) unless one of the encoding filters like utf-8 is used. Even for 
> utf-8 files, however, it would be very useful to be able to convert a 
> utf-8 file to Latin transcription for further processing by 
> Aleph/(e-)Omega. For example, adding diacritics is much easier to do in 
> Latin than in an Arabic script editor because Latin transcription is 
> one-dimensional and adding diacritics to Arabic is a two-dimensional affair.
> 
> The best thing would be a perl script but I don't know perl at all (except 
> to run some precreated scripts). If someone out of the kindness of 
> their heart could write a short and simple script for just seven 
> characters I could do the rest myself and present it back here.

You might like to look at some of the encoding conversion scripts at:

 http://homepages.comnet.co.nz/~r-mahoney/scripts/scripts.html

N.B. For sorting utf-8 Arabic you might find the perl module
`Sort::ArbBiLex' useful


Best regards,

 Richard Mahoney


-- 
Richard MAHONEY | internet: homepages.comnet.net.nz/~r-mahoney
Littledene      | telephone / telefax (man.): ++64 3 312 1699
Bay Road        | cellular: ++64 25 829 986
OXFORD, NZ      | e-mail: r.mahoney[use"@"]comnet.net.nz


* Re[2]: Arabic-utf-8 (plus a sample)
  2004-06-05 22:51       ` Idris Samawi Hamid
@ 2004-06-05 23:15         ` Giuseppe Bilotta
  2004-06-05 23:31           ` Idris Samawi Hamid
  0 siblings, 1 reply; 18+ messages in thread
From: Giuseppe Bilotta @ 2004-06-05 23:15 UTC (permalink / raw)


Sunday, June 6, 2004 Idris Samawi Hamid wrote:

> Here is the whole file once again:

> ==================================
> #!/usr/bin/perl -w

> use strict;
> use open ':utf8';

> open(NEW,">new.tex"); #opens file to print out the result

> while (<>); { #this opens the file for reading

> $_ =~ s/\x{0627}/\x{0041}/esg;
> #this is the actual conversion

> print NEW "$_";
> #and this writes the result into file "new.tex"
> }

> close(NEW);
> ==================================

My take: try the following (should work even with ActiveState
5.6)

===
#!/usr/bin/perl

use strict;
#D comment the following, I think we can do without
# use open ':utf8';

open(NEW,">new.tex"); #opens file to print out the result

while (<>); { #this opens the file for reading

$_ =~ s/\x06\x27/A/esg; #this is the actual conversion

print NEW "$_"; #and this writes the result into file "new.tex"
}

close(NEW);
===

Save as e.g. idris_conv.pl and issue as


perl idris_conv.pl < filename.txt

where filename.txt is the filename to convert.

-- 
Giuseppe "Oblomov" Bilotta


* Re: Re[2]: Arabic-utf-8 (plus a sample)
  2004-06-05 23:15         ` Re[2]: " Giuseppe Bilotta
@ 2004-06-05 23:31           ` Idris Samawi Hamid
  2004-06-05 23:58             ` Re[4]: " Giuseppe Bilotta
  0 siblings, 1 reply; 18+ messages in thread
From: Idris Samawi Hamid @ 2004-06-05 23:31 UTC (permalink / raw)


On Sun, 6 Jun 2004 01:15:56 +0200, Giuseppe Bilotta <gip.bilotta@iol.it> 
wrote:

> My take: try the following (should work even with ActiveState
> 5.6)
>
> ===
> #!/usr/bin/perl
>
> use strict;
> #D comment the following, I think we can do without
> # use open ':utf8';
>
> open(NEW,">new.tex"); #opens file to print out the result
>
> while (<>); { #this opens the file for reading
>
> $_ =~ s/\x06\x27/A/esg; #this is the actual conversion
>
> print NEW "$_"; #and this writes the result into file "new.tex"
> }
>
> close(NEW);
> ===

Hi Giuseppe (Is it not way past your bedtime;->),

Here's my result:

> perl utf2tex2.pl < unicode-utf.txt
syntax error at utf2tex2.pl line 9, near ");"
Bareword "A" not allowed while "strict subs" in use at utf2tex2.pl line 11.
Execution of utf2tex2.pl aborted due to compilation errors.

please advise;->

best
Idris

-- 
Professor Idris Samawi Hamid
Department of Philosophy
Colorado State University
Fort Collins, CO 80523


* Re[4]: Arabic-utf-8 (plus a sample)
  2004-06-05 23:31           ` Idris Samawi Hamid
@ 2004-06-05 23:58             ` Giuseppe Bilotta
  2004-06-06  0:19               ` Idris Samawi Hamid
  0 siblings, 1 reply; 18+ messages in thread
From: Giuseppe Bilotta @ 2004-06-05 23:58 UTC (permalink / raw)


Sunday, June 6, 2004 Idris Samawi Hamid wrote:

> Hi Giuseppe (Is it not way past your bedtime;->),

Yes it is, and it shows. But since I'm up and not feeling any
particular urge to go to bed at this very moment, here's a
tested alternative that works here:

==
#!/usr/bin/perl

use strict;
use warnings;

open(NEW,">new.tex"); #opens file to print out the result

while (<>) { #this opens the file for reading

$_ =~ s/\xD8\xA7/A/g; # alif (0627)
$_ =~ s/\xD8\xA8/b/g; # ba   (0628)
$_ =~ s/\xD8\xAC/j/g; # jim  (062C)
$_ =~ s/\xD8\xAF/d/g; # dal  (062F)
$_ =~ s/\xD9\x87/h/g; # ha   (0647)
$_ =~ s/\xD9\x88/w/g; # waw  (0648)
$_ =~ s/\xD8\xB2/z/g; # zay  (0632)

print NEW "$_"; #and this writes the result into file "new.tex"
}

close(NEW);
===

to be used as

utf2tex filename

If you want to add more conversions, open your unicode file in
a hex editor and check the actual byte-by-byte hex value of
the utf text for the other characters you want to add. This
should be enough for your needs.
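
For anyone who would rather not read the bytes off a hex editor: the two-byte
values follow from the UTF-8 encoding rule, so they can be computed from the
four-digit code. What follows is only a minimal sketch, assuming every
character lies between 0080 and 07FF (which holds for the basic Arabic
block); `utf8_hex` is a made-up helper name, not part of any module. It uses
plain arithmetic, so it should not depend on the Unicode support of newer
perls. For alif, 0627 shifted right by six bits is 18, and C0 | 18 = D8; the
low six bits are 27, and 80 | 27 = A7, hence D8A7.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the UTF-8 bytes of a code point as an uppercase hex string.
# Only the two-byte range U+0080..U+07FF is handled (an assumption of
# this sketch); that range covers the basic Arabic block.
sub utf8_hex {
    my ($cp) = @_;
    die sprintf("U+%04X is outside the two-byte range\n", $cp)
        if $cp < 0x80 || $cp > 0x7FF;
    my $lead  = 0xC0 | ($cp >> 6);      # 110xxxxx: the top five bits
    my $trail = 0x80 | ($cp & 0x3F);    # 10xxxxxx: the low six bits
    return sprintf("%02X%02X", $lead, $trail);
}

# Print ready-to-paste substitution lines for the seven letters in this thread.
my @letters = ([0x0627, 'A'], [0x0628, 'b'], [0x062C, 'j'], [0x062F, 'd'],
               [0x0647, 'h'], [0x0648, 'w'], [0x0632, 'z']);
for my $pair (@letters) {
    my ($cp, $latin) = @$pair;
    my $hex = utf8_hex($cp);
    printf "s/\\x%s\\x%s/%s/g;  # U+%04X\n",
        substr($hex, 0, 2), substr($hex, 2, 2), $latin, $cp;
}
```

The lines it prints can be pasted into the loop of the script above.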

-- 
Giuseppe "Oblomov" Bilotta


* Re:Arabic-utf-8 (plus a sample)
  2004-06-05 23:08 ` [SPAM: 3.411] Arabic-utf-8 (plus a sample) Richard MAHONEY
@ 2004-06-06  0:19   ` Idris Samawi Hamid
  0 siblings, 0 replies; 18+ messages in thread
From: Idris Samawi Hamid @ 2004-06-06  0:19 UTC (permalink / raw)


On Sun, 6 Jun 2004 11:08:46 +1200, Richard MAHONEY 
<rbm49@ext.canterbury.ac.nz> wrote:

> You might like to look at some of the encoding conversion scripts at:
>
>  http://homepages.comnet.co.nz/~r-mahoney/scripts/scripts.html
>
> N.B. For sorting utf-8 Arabic you might find the perl module
> `Sort::ArbBiLex' useful

Thnx Richard; most of the scripts are beyond my capability;->, but the 
UTF8 to TeX / LaTeX scripts seem very useful for a possible 
transliteration module, which is something else I need to do for my 
journal, where people send in different transliteration conventions that I 
have to convert to the journal's convention. I'm ashamed to say it; I've 
been doing this kind of thing manually up to now;

\startsuperhero

must... learn... scripting... language...

\stopsuperhero

Thnx
Idris

-- 
Professor Idris Samawi Hamid
Department of Philosophy
Colorado State University
Fort Collins, CO 80523


* Re: Re[4]: Arabic-utf-8 (plus a sample)
  2004-06-05 23:58             ` Re[4]: " Giuseppe Bilotta
@ 2004-06-06  0:19               ` Idris Samawi Hamid
  2004-06-06  0:26                 ` Idris Samawi Hamid
  2004-06-06  9:09                 ` Perl scripting (was: Arabic-utf-8) Henning Hraban Ramm
  0 siblings, 2 replies; 18+ messages in thread
From: Idris Samawi Hamid @ 2004-06-06  0:19 UTC (permalink / raw)


On Sun, 6 Jun 2004 01:58:44 +0200, Giuseppe Bilotta <gip.bilotta@iol.it> 
wrote:

> ==
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> open(NEW,">new.tex"); #opens file to print out the result
>
> while (<>) { #this opens the file for reading
>
> $_ =~ s/\xD8\xA7/A/g; #this is the actual conversion
> $_ =~ s/\xD8\xA8/b/g; #this is the actual conversion
> $_ =~ s/\xD8\xAC/j/g; #this is the actual conversion
> $_ =~ s/\xD8\xAF/d/g; #this is the actual conversion
> $_ =~ s/\xD9\x87/h/g; #this is the actual conversion
> $_ =~ s/\xD9\x88/w/g; #this is the actual conversion
> $_ =~ s/\xD8\xB2/z/g; #this is the actual conversion
>
> print NEW "$_"; #and this writes the result into file "new.tex"
> }
>
> close(NEW);
> ===

It works! I'll try to finish a basic script that works for Lagally's 
ArabTeX transcription (that I use) and post it here and on the aleph list.

One question: The hex for e.g. alif is 0627; how did you get D8A7 from 
that for purposes of the script (so I can follow along for the rest)?

Best
Idris

-- 
Professor Idris Samawi Hamid
Department of Philosophy
Colorado State University
Fort Collins, CO 80523


* Re: Re[4]: Arabic-utf-8 (plus a sample)
  2004-06-06  0:19               ` Idris Samawi Hamid
@ 2004-06-06  0:26                 ` Idris Samawi Hamid
  2004-06-06  9:09                 ` Perl scripting (was: Arabic-utf-8) Henning Hraban Ramm
  1 sibling, 0 replies; 18+ messages in thread
From: Idris Samawi Hamid @ 2004-06-06  0:26 UTC (permalink / raw)


On Sat, 05 Jun 2004 18:19:22 -0600, Idris Samawi Hamid 
<ishamid@colostate.edu> wrote:

> One question: The hex for e.g. alif is 0627; how did you get D8A7 from 
> that for purposes of the script (so I can follow along for the rest)?

Ok, I found it:

>> If you want to add more conversions, open your unicode file in
>> an hex editor and check the actual byte-per-byte hex value of
>> the utf text for the other characters you want to add. This
>> should be enough for your needs.

I just downloaded XVI32. Hmm... never heard of or needed a hex editor 
before now...

Best
Idris

-- 
Professor Idris Samawi Hamid
Department of Philosophy
Colorado State University
Fort Collins, CO 80523


* Re: Perl scripting (was: Arabic-utf-8)
  2004-06-06  0:19               ` Idris Samawi Hamid
  2004-06-06  0:26                 ` Idris Samawi Hamid
@ 2004-06-06  9:09                 ` Henning Hraban Ramm
  2004-06-06 21:03                   ` Idris Samawi Hamid
  1 sibling, 1 reply; 18+ messages in thread
From: Henning Hraban Ramm @ 2004-06-06  9:09 UTC (permalink / raw)



On Sunday, 06.06.04, at 02:19 (Europe/Zurich), Idris Samawi Hamid 
wrote:

>> open(NEW,">new.tex"); #opens file to print out the result

better:
open NEW, ">", "new.tex" or die $!;

>> $_ =~ s/\xD8\xA7/A/g; #this is the actual conversion

if you work with $_ you can leave it out, simply:
s/\xD8\xA7/A/g;

But for a series of conversions I'd suggest a hash for a better overview.
The whole script would look like this:

-----

#!/usr/bin/perl -w
use strict;
use warnings;

my ($Source, $Target) = (shift, shift); # gets 2 file names from command line

my %conv = (	# enhance as needed
	"\xD8\xA7" => "A",
	"\xD8\xA8" => "b",
	"\xD8\xAC" => "j",
	"\xD8\xAF" => "d"
);

# "or" (not "||") so that die fires when open itself fails
open SOURCE, "<", $Source or die $!;
open TARGET, ">", $Target or die $!;
# there are ways to read a whole file in one scalar,
# e.g. with File::Slurp, but I don't know them by heart...
while (my $line = <SOURCE>) {
	foreach my $key (keys %conv) {
		$line =~ s/$key/$conv{$key}/g;
	} # foreach
	print TARGET $line;
} # while
close SOURCE;
close TARGET;

-----

BTW: ActiveState has Perl 5.8.4, at least for Windows (I use it at 
work).


Greetings from Hraban!
-- 
http://www.fiee.net/texnique/


* Re: Arabic-utf-8 (plus a sample)
  2004-06-05 19:32 Arabic-utf-8 (plus a sample) Idris Samawi Hamid
  2004-06-05 20:41 ` Thomas A. Schmitz
  2004-06-05 23:08 ` [SPAM: 3.411] Arabic-utf-8 (plus a sample) Richard MAHONEY
@ 2004-06-06 13:22 ` George N. White III
  2 siblings, 0 replies; 18+ messages in thread
From: George N. White III @ 2004-06-06 13:22 UTC (permalink / raw)
  Cc: aleph

On Sat, 5 Jun 2004, Idris Samawi Hamid wrote:

> Hi gang,
>
> For Arabic we use a Latin transcription in Aleph/(e-)Omega (or even ArabTeX) 
> unless one of the encoding filters like utf-8 is used. Even for utf-8 files, 
> however, it would be very useful to be able to convert a utf-8 file to Latin 
> transcription for further processing by Aleph/(e-)Omega. For example, adding 
> diacritics is much easier to do in Latin than in an Arabic script editor 
> because Latin transcription is one-dimensional and adding diacritics to 
> Arabic is a two-dimensional affair.
>
> The best thing would be a perl script but I don't know perl at all (except to 
> run some precreated scripts). If someone out of the kindness of their 
> heart could write a short and simple script for just seven characters I could 
> do the rest myself and present it back here.

Can you use (or extend) GNU recode?  It does include support for
utf-8 and several TeX encodings.

From the manual: "It is easy for a programmer to add a new charset to 
`recode'.  All it requires is making a few functions kept in a single `.c' 
file, adjusting `Makefile.am' and remaking `recode'."

-- 
   George N. White III <aa056@chebucto.ns.ca>
   Head of St. Margarets Bay, Nova Scotia, Canada


* Re: Perl scripting (was: Arabic-utf-8)
  2004-06-06  9:09                 ` Perl scripting (was: Arabic-utf-8) Henning Hraban Ramm
@ 2004-06-06 21:03                   ` Idris Samawi Hamid
  2004-06-06 21:28                     ` Thomas A. Schmitz
  0 siblings, 1 reply; 18+ messages in thread
From: Idris Samawi Hamid @ 2004-06-06 21:03 UTC (permalink / raw)


On Sun, 6 Jun 2004 11:09:32 +0200, Henning Hraban Ramm <hraban@fiee.net> 
wrote:

> -----
>
> #!/usr/bin/perl -w
> use strict;
> use warnings;
>
> my ($Source, $Target) = (shift, shift); # gets 2 file names from command line
>
> my %conv = (	# enhance as needed
> 	"\xD8\xA7" => "A",
> 	"\xD8\xA8" => "b",
> 	"\xD8\xAC" => "j",
> 	"\xD8\xAF" => "d"
> );
>
> open SOURCE, "<", $Source or die $!;
> open TARGET, ">", $Target or die $!;
> # there are ways to read a whole file in one scalar,
> # e.g. with File::Slurp, but I don't know them by heart...
> while (my $line = <SOURCE>) {
> 	foreach my $key (keys %conv) {
> 		$line =~ s/$key/$conv{$key}/g;
> 	} # foreach
> 	print TARGET $line;
> } # while
> close SOURCE;
> close TARGET;
>
> -----

Thnx; I'll play around with this as well. BTW: is there any way to do this 
without the hex editor and just enter the full 4-digit character (a la 
Thomas's original suggestion) e.g.,

"\x0627" => "A"

While the hex editor certainly works it is really slow and tedious work...

> BTW: ActiveState has Perl 5.8.4, at least for Windows (I use it at work).

Ok, I found it:

http://downloads.activestate.com/ActivePerl/Windows/5.8/ActivePerl-5.8.3.809-MSWin32-x86.zip

But the web site (at first glance) sure gives one the impression that 
their latest release is
5.6.1.638

http://www.activestate.com/

http://www.activestate.com/Products/ActivePerl/

Best
Idris



-- 
Professor Idris Samawi Hamid
Department of Philosophy
Colorado State University
Fort Collins, CO 80523


* Re: Perl scripting (was: Arabic-utf-8)
  2004-06-06 21:03                   ` Idris Samawi Hamid
@ 2004-06-06 21:28                     ` Thomas A. Schmitz
  2004-06-07 19:45                       ` Henning Hraban Ramm
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas A. Schmitz @ 2004-06-06 21:28 UTC (permalink / raw)


Well, if you put the 
use open ':utf8';
in the header of your perl script, it should work without the hex editor:
you can then write the characters as \x{0627} and so on
(btw: I would recommend using emacs in hex mode, M-x hexl-find-file). 

And just for the record: to put the entire file in one scalar, use this:
my @lines = <>;
my $text = join "", @lines;

   $text =~ s/PUT_YOUR/SUBSTITUTIONS_HERE/sg;
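
To make the \x{0627} style concrete, here is a minimal sketch, assuming
Perl 5.8 or later with the core Encode module. Rather than relying on
`use open`, it decodes the UTF-8 bytes explicitly, so it does not matter how
the file handle was opened. The names `%translit` and `transliterate` are
made up for illustration, and the table holds only the seven letters from
this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Code point => Latin transcription (extend as needed).
my %translit = (
    "\x{0627}" => 'A',
    "\x{0628}" => 'b',
    "\x{062C}" => 'j',
    "\x{062F}" => 'd',
    "\x{0647}" => 'h',
    "\x{0648}" => 'w',
    "\x{0632}" => 'z',
);

# One character class built from the keys, so the text is scanned once.
# This assumes no key is a regex-special character such as ']' or '-'.
my $class = join '', keys %translit;
my $re    = qr/([$class])/;

# Take raw UTF-8 bytes, return the transliterated character string.
sub transliterate {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);
    $text =~ s/$re/$translit{$1}/g;
    return $text;
}

# The bytes for alif, ba and jim, separated by spaces.
print transliterate("\xD8\xA7 \xD8\xA8 \xD8\xAC"), "\n";   # prints "A b j"
```

If the output may still contain untransliterated Arabic characters, the
output handle would need an encoding layer as well, for instance
binmode(STDOUT, ':encoding(UTF-8)').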

But it looks like you got a working solution now, so have fun playing
around with it. And boy does it make one feel good when you realize that
you windoze people are still working with perl 5.6 -- that's the stone
age, man ;-)

Best

Thomas

On Sun, 2004-06-06 at 23:03, Idris Samawi Hamid wrote:
> On Sun, 6 Jun 2004 11:09:32 +0200, Henning Hraban Ramm <hraban@fiee.net> 
> wrote:
> 
> > -----
> >
> > #!/usr/bin/perl -w
> > use strict;
> > use warnings;
> >
> > my ($Source, $Target) = (shift, shift); # gets 2 file names from command 
> > line
> >
> > my %conv = (	# enhance as needed
> > 	"\xD8xA7" => "A",
> > 	"\xD8xA8" => "b",
> > 	"\xD8xAC" => "j",
> > 	"\xD8xAF" => "d"
> > );
> >
> > open SOURCE, "<", $Source || die $!;
> > open TARGET, ">", $Target || die $!;
> > # there are ways to read a whole file in one scalar,
> > # e.g. with File::Slurp, but I don't know them by heart...
> > while (my $line = <SOURCE>) {
> > 	foreach my $key (keys %conv) {
> > 		$line =~ s/$key/$conv{$key}/g;
> > 	} # foreach
> > 	print TARGET $line;
> > } # while
> > close SOURCE;
> > close TARGET;
> >
> > -----
> 
> Thnx; I'll play around with this as well. BTW: is there any way to do this 
> without the hex editor and just enter the full 4-digit character (a la 
> Thomas's original suggestion) e.g.,
> 
> "\x0627" => "A"
> 
> While the hex editor certainly works it is really slow and tedious work...
> 
> > BTW: ActiveState has Perl 5.8.4, at least for Windows (I use it at work).
> 
> Ok, I found it:
> 
> http://downloads.activestate.com/ActivePerl/Windows/5.8/ActivePerl-5.8.3.809-MSWin32-x86.zip
> 
> But the web site (at first glance) sure gives one the impression that 
> their latest release is
> 5.6.1.638
> 
> http://www.activestate.com/
> 
> http://www.activestate.com/Products/ActivePerl/
> 
> Best
> Idris
> 
> 


* Re: Perl scripting (was: Arabic-utf-8)
  2004-06-06 21:28                     ` Thomas A. Schmitz
@ 2004-06-07 19:45                       ` Henning Hraban Ramm
  2004-06-07 20:53                         ` Thomas A.Schmitz
  0 siblings, 1 reply; 18+ messages in thread
From: Henning Hraban Ramm @ 2004-06-07 19:45 UTC


On Sunday, 06.06.04, at 23:28 (Europe/Zurich), Thomas A. Schmitz wrote:

> Well, if you put the
> use open ':utf8';
> in the header of your perl script, it should work without the hex 
> editor

Not needed with Perl 5.8.x and a proper UTF8 file.

> And just for the record: to put the entire file in one array, use this:
> my @lines = <>;
> my $text = join "", @lines;
>
>    $text =~ s/PUT_YOUR/SUBSTITUTIONS_HERE/esg;

Thank you, I always forget the really simple solutions. ;-)

And with File::Slurp you get it directly into a scalar.
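
A minimal sketch of reading a file into one scalar with core Perl only, for cases where File::Slurp is not installed (its read_file function does the same job). The sub name slurp_utf8 and the temporary sample file are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);   # core module, used only to create sample input

# Hypothetical helper: return the whole file as one decoded string.
sub slurp_utf8 {
    my ($path) = @_;
    open my $fh, "<:encoding(UTF-8)", $path
        or die "Cannot open $path: $!";
    local $/;                  # undef record separator: read everything at once
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Sample input: two lines of Arabic letters written to a temporary file.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode $tmp_fh, ":encoding(UTF-8)";
print {$tmp_fh} "\x{0628}\x{0627}\n\x{0628}\n";
close $tmp_fh;

my $text = slurp_utf8($tmp_name);

# With the whole file in one scalar, a single substitution covers all lines.
(my $latin = $text) =~ s/\x{0628}/b/g;
```

The local $/ only lasts until the end of the sub, so line-by-line reading elsewhere in the script is unaffected.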

> But it looks like you got a working solution now, so have fun playing
> around with it. And boy does it make one feel good when you realize 
> that
> you windoze people are still working with perl 5.6 -- that's the stone
> age, man ;-)

Mac OS X also ships with only 5.6 unless you install a newer one yourself,
and with anything newer than 5.8.0 you get endless trouble...


Greetings from Hraban!
-- 
http://www.fiee.net/texnique/


* Re: Perl scripting (was: Arabic-utf-8)
  2004-06-07 19:45                       ` Henning Hraban Ramm
@ 2004-06-07 20:53                         ` Thomas A.Schmitz
  0 siblings, 0 replies; 18+ messages in thread
From: Thomas A.Schmitz @ 2004-06-07 20:53 UTC


Hey Hraban,

not to be a PITA, but with OS X 10.3, we moved up to perl 5.8.1. And in 
my gentoo installation, I'm now at perl 5.8.4, yessirre! ;-)


Best

Thomas

On Jun 7, 2004, at 9:45 PM, Henning Hraban Ramm wrote:

> Mac OS X also ships with only 5.6 unless you install a newer one yourself,
> and with anything newer than 5.8.0 you get endless trouble...


Thread overview: 18+ messages
2004-06-05 19:32 Arabic-utf-8 (plus a sample) Idris Samawi Hamid
2004-06-05 20:41 ` Thomas A. Schmitz
2004-06-05 21:33   ` Idris Samawi Hamid
2004-06-05 21:48     ` Thomas A. Schmitz
2004-06-05 22:51       ` Idris Samawi Hamid
2004-06-05 23:15         ` Re[2]: " Giuseppe Bilotta
2004-06-05 23:31           ` Idris Samawi Hamid
2004-06-05 23:58             ` Re[4]: " Giuseppe Bilotta
2004-06-06  0:19               ` Idris Samawi Hamid
2004-06-06  0:26                 ` Idris Samawi Hamid
2004-06-06  9:09                 ` Perl scripting (was: Arabic-utf-8) Henning Hraban Ramm
2004-06-06 21:03                   ` Idris Samawi Hamid
2004-06-06 21:28                     ` Thomas A. Schmitz
2004-06-07 19:45                       ` Henning Hraban Ramm
2004-06-07 20:53                         ` Thomas A.Schmitz
2004-06-05 23:08 ` [SPAM: 3.411] Arabic-utf-8 (plus a sample) Richard MAHONEY
2004-06-06  0:19   ` Idris Samawi Hamid
2004-06-06 13:22 ` Arabic-utf-8 " George N. White III
