UTF8 problems with Hangul Syllables

ntg-context - mailing list for ConTeXt users

* UTF8 problems with Hangul Syllables
From: Cho, Jin-Hwan @ 2002-12-18  4:49 UTC (permalink / raw)


Here are a comment and a question about the new feature of ConTeXt that
supports the UTF-8 encoding.

I tested the following short ConTeXt document containing
two Korean characters. In the second line I used the Bitstream Cyberbit
font; the corresponding TFM files were generated by ttf2tfm with Unicode.sfd
(in the same way as for the UTF-8 support in CJK-LaTeX).

\enableregime [utf]
\definefontsynonym [UnicodeRegular] [cyberb]
\chardef\utfunihashmode=1
\starttext
^^eb^^bf^^a1
^^ec^^80^^80
\stoptext

Here, ^^eb^^bf^^a1 = U+BFE1 and ^^ec^^80^^80 = U+C000. 
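
The two byte sequences can be cross-checked outside TeX. The following is a minimal Python sketch, not part of the original post, that turns the ^^xx hex notation into bytes and decodes them as UTF-8. The helper name decode_caret_notation is made up for illustration.

```python
import re

def decode_caret_notation(text):
    """Decode a run of TeX ^^xx escapes (two lowercase hex digits each) as UTF-8.

    Hypothetical helper for illustration; returns the code points as integers.
    """
    hex_pairs = re.findall(r"\^\^([0-9a-f]{2})", text)
    raw = bytes(int(pair, 16) for pair in hex_pairs)
    return [ord(ch) for ch in raw.decode("utf-8")]

for sample in ("^^eb^^bf^^a1", "^^ec^^80^^80"):
    code_points = decode_caret_notation(sample)
    print(sample, "->", " ".join(f"U+{cp:04X}" for cp in code_points))
```

This prints U+BFE1 for the first sequence and U+C000 for the second, in agreement with the values stated above.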

1. Without the third line (\chardef\utfunihashmode=1), I could not see
   any characters. Why?

2. After enabling \utfunihashmode, I could see the first character but
   not the second. The difference is that the value of \unidiv
   was 191 for the first character and 192 for the second.
   In fact, no character with 192 <= \unidiv <= 223
   (U+C000 to U+DFFF; about half of the Hangul Syllables) was shown
   correctly. Why?
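
The numbers in item 2 can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: it assumes that \unidiv and \unimod are the quotient and remainder of the code point divided by 256, which is consistent with the reported values 191 and 192. The Hangul Syllables block limits (U+AC00 to U+D7A3) come from the Unicode standard; the function names are hypothetical.

```python
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3  # Unicode Hangul Syllables block

def split_code_point(cp, base=256):
    """Return (div, mod) of a code point; base 256 is an assumption."""
    return divmod(cp, base)

def affected_range(first_div=192, last_div=223, base=256):
    """Return the first and last code point whose div lies in the given span."""
    return first_div * base, (last_div + 1) * base - 1

def hangul_share(lo, hi):
    """Count the Hangul syllables inside lo..hi and the size of the block."""
    hit = max(0, min(hi, HANGUL_LAST) - max(lo, HANGUL_FIRST) + 1)
    total = HANGUL_LAST - HANGUL_FIRST + 1
    return hit, total

print(split_code_point(0xBFE1))   # (191, 225)
print(split_code_point(0xC000))   # (192, 0)
lo, hi = affected_range()
print(f"U+{lo:04X}..U+{hi:04X}")  # U+C000..U+DFFF
print(hangul_share(lo, hi))       # (6052, 11172)
```

Under these assumptions 6052 of the 11172 Hangul syllables (about 54%) fall in the affected range, which matches the "half of Hangul Syllables" estimate.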
   
Anyway, it is now possible to get a PDF file containing several different
languages with ConTeXt + dvipdfmx. Furthermore, the text in the PDF file
can be searched and extracted. Bookmarks and text annotations work too!

I used the following map entry (usually in cid-x.map) for dvipdfmx.

cyberb@Unicode@ Identity-H :0:cyberbit.ttf
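
To see which TFM file such an entry would pick for a given character, here is a small Python sketch. It assumes the usual layout of Unicode.sfd, in which subfont XX covers U+XX00 to U+XXFF and the TFM name is the prefix followed by two lowercase hex digits; the function name subfont_for is hypothetical, and only the Basic Multilingual Plane is handled.

```python
def subfont_for(cp, tfm_prefix="cyberb"):
    """Return (tfm_name, char_slot) for a BMP code point.

    Assumes a Unicode.sfd-style split: one 256-slot subfont per high byte.
    """
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("only BMP code points are covered by this sketch")
    return f"{tfm_prefix}{cp >> 8:02x}", cp & 0xFF

for cp in (0xBFE1, 0xC0C1):
    name, slot = subfont_for(cp)
    print(f"U+{cp:04X} -> {name}.tfm, slot {slot}")
```

With these assumptions U+BFE1 lands in cyberbbf.tfm at slot 225 and U+C0C1 in cyberbc0.tfm at slot 193.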

Best, ChoF.
-- 
~~~~~~~~~~~~~~~~~~~~~~~~~     ***
| Cho, Jin-Hwan == ChoF |     ^ ^
~~~~~~~~~~~~~~~~~~~~~~~~~      o
| Research Fellow       |     ~~~
| School of Mathematics ~~~~~~~~~~~~~~
| Korea Institute for Advanced Study |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| chofchof@ktug.or.kr                |
| http://free.kaist.ac.kr/ChoF/      |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Re: UTF8 problems with Hangul Syllables
From: Hans Hagen @ 2002-12-18 13:39 UTC (permalink / raw)


At 01:49 PM 12/18/2002 +0900, you wrote:
>Here are a comment and a question about the new feature of ConTeXt that
>supports the UTF-8 encoding.
>
>I tested the following short ConTeXt document containing
>two Korean characters. In the second line I used the Bitstream Cyberbit
>font; the corresponding TFM files were generated by ttf2tfm with Unicode.sfd
>(in the same way as for the UTF-8 support in CJK-LaTeX).
>
>\enableregime [utf]
>\definefontsynonym [UnicodeRegular] [cyberb]
>\chardef\utfunihashmode=1
>\starttext
>^^eb^^bf^^a1
>^^ec^^80^^80
>\stoptext
>
>Here, ^^eb^^bf^^a1 = U+BFE1 and ^^ec^^80^^80 = U+C000.
>
>1. Without the third line (\chardef\utfunihashmode=1), I could not see
>    any characters. Why?
>
>2. After enabling \utfunihashmode, I could see the first character but
>    not the second. The difference is that the value of \unidiv
>    was 191 for the first character and 192 for the second.
>    In fact, no character with 192 <= \unidiv <= 223
>    (U+C000 to U+DFFF; about half of the Hangul Syllables) was shown
>    correctly. Why?

I'll work this out asap; this is what i use as test file (unfortunately
this font does not show chars, so i have to download a proper font first);
i attached a script that i apply to a ttf file ( ttftfmxx.pl htfs.ttf 0 255 )


\chardef\utfunihashmode=1

\pdfmapfile{+htfsxx.map}

\definefontsynonym [TestRegular] [htfs]

\defineunicodefont [SomeFont] [Test]

\SomeFont \enableregime[utf] % todo: autoutf, else problem

\starttekst

%^^eb^^bf^^a1
%^^ec^^80^^80
\utfunifontglyph{\numexpr("BFE1)}
\utfunifontglyph{\numexpr("C000)}

\stoptekst

[-- Attachment #2: TTFTFMXX.PL --]
[-- Type: text/plain, Size: 1149 bytes --]

# author : Hans Hagen / PRAGMA-ADE / www.pragma-ade.com / 07-12-2003
# script : splits a ttf file into a series of pfb's using ttf2pt1 & afm2tfm
# usage  : ttftfmxx filename.ttf [from plane] [to plane] 

unless ($ARGV[0]) 
  { print "provide ttf filename\n" ; exit } 

($filename,$filetype) = split(/\./, $ARGV[0]) ;  

$filetype = 'ttf' unless $filetype ;

unless (lc $filetype eq 'ttf') 
  { print "provide ttf filename\n" ; exit } 

$mapfile = $filename . "xx.map" ;

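# optional plane range: without arguments planes 0..2 are processed,
# with a single number n planes 0..n

$from = $ARGV[1] ;
$to   = $ARGV[2] ;

if ($from eq '')
  { $from = 0 ; $to = 2 }
elsif ($to eq '')
  { $to = $from ; $from = 0 }

# clip both ends of the range to 255

$from = 255 if ($from>255) ;
$to   = 255 if ($to  >255) ;

open(MAP,">$mapfile") or die "cannot open $mapfile\n" ;

# one pfb/tfm pair and one map line per plane; the plane number is
# appended to the file name as two hex digits

for ($i=$from;$i<=$to;$i++)
  { $str = sprintf("%02x",$i) ;
    $splitname = "$filename$str" ;
    system("ttf2pt1 -l plane+0x$str -W 0 -b $filename.$filetype $splitname") ;
    system("afm2tfm $splitname.afm $splitname.tfm") ;
    unlink "$splitname.afm" ;
    print MAP "$splitname <$splitname.pfb\n" }

close(MAP) ;

# report the names that were actually generated (hex suffixes)

print "ttffile : $filename.$filetype\n" ;
printf "tfmfile : %s%02x.tfm .. %s%02x.tfm\n", $filename, $from, $filename, $to ;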
print "mapfile : $mapfile\n" ; 

-------------------------------------------------------------------------
                                   Hans Hagen | PRAGMA ADE | pragma@wxs.nl
                       Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
  tel: +31 (0)38 477 53 69 | fax: +31 (0)38 477 53 74 | www.pragma-ade.com
-------------------------------------------------------------------------
                        information: http://www.pragma-ade.com/roadmap.pdf
                     documentation: http://www.pragma-ade.com/showcase.pdf
-------------------------------------------------------------------------

* Re: UTF8 problems with Hangul Syllables
From: Cho, Jin-Hwan @ 2002-12-19  1:53 UTC (permalink / raw)


Hans Hagen wrote:
> 
> I'll work this out asap; this is what i use as test file (unfortunately
> this font does not show chars, so i have to download a proper font first);
> i attached a script that i apply to a ttf file ( ttftfmxx.pl htfs.ttf 0 255 )
> 
> \chardef\utfunihashmode=1
> \pdfmapfile{+htfsxx.map}
> \definefontsynonym [TestRegular] [htfs]
> \defineunicodefont [SomeFont] [Test]
> \SomeFont \enableregime[utf] % todo: autoutf, else problem
> \starttekst
> %^^eb^^bf^^a1
> %^^ec^^80^^80
> \utfunifontglyph{\numexpr("BFE1)}
> \utfunifontglyph{\numexpr("C000)}
> \stoptekst

Even when I used \utfunifontglyph{\numexpr("C000)} instead of ^^ec^^80^^80,
the result was the same: the character was not shown (the output was
empty).

I forgot to mention one thing. The Bitstream Cyberbit font has no glyph
for the character U+C000, so it is better to test with the character U+C0C1
(= ^^ec^^83^^81).

The Bitstream Cyberbit font (Cyberbit.ZIP) can be downloaded from
http://ftp.netscape.com/pub/communicator/extras/fonts/windows/

Here is the log after turning on \tracingmacros. The difference is
that "BFE1 calls \unicodeglyph, but "C0C1 calls \doutfunihash.

1. Log message for \utfunifontglyph{\numexpr("BFE1)}
   =================================================

\utfunifontglyph #1->\xdef \unidiv {\number \utfdiv {#1}}\xdef \unimod {\number
 \utfmod {#1}}\ifnum #1<\utf@i \char \unimod \else \ifcsname \@@univector \unid
iv \endcsname \csname \doutfunihash {\unidiv }{#1}\endcsname \else \unicodeglyp
h \unidiv \unimod \fi \fi 
#1<-\numexpr ("BFE1)

\utfdiv #1->\number \numexpr ((#1-\utf@g )/\utf@h )
#1<-\numexpr ("BFE1)

\utfmod #1->\number \numexpr (#1-\utf@h *((#1-\utf@g )/\utf@h ))
#1<-\numexpr ("BFE1)

\@@univector ->univ

\unidiv ->191

\unicodeglyph #1#2->\bgroup \getvalue {@@\currentucharmapping \strippedcsname \
uchar }{#1}{#2}\bodyfontsize \unicodescale \bodyfontsize \font \unicodefont =\t
ruefontname {\truefontname \unicodestyle \unicodeone } at \currentfontscale \bo
dyfontsize \unicodestrut \unicodefont \unicodecharcommand {\char \unicodetwo \r
elax }\egroup 
#1<-\unidiv 
#2<-\unimod 

... [REMOVED]

2. Log message for \utfunifontglyph{\numexpr("C0C1)}
   =================================================

\utfunifontglyph #1->\xdef \unidiv {\number \utfdiv {#1}}\xdef \unimod {\number
 \utfmod {#1}}\ifnum #1<\utf@i \char \unimod \else \ifcsname \@@univector \unid
iv \endcsname \csname \doutfunihash {\unidiv }{#1}\endcsname \else \unicodeglyp
h \unidiv \unimod \fi \fi 
#1<-\numexpr ("C0C1)

\utfdiv #1->\number \numexpr ((#1-\utf@g )/\utf@h )
#1<-\numexpr ("C0C1)

\utfmod #1->\number \numexpr (#1-\utf@h *((#1-\utf@g )/\utf@h ))
#1<-\numexpr ("C0C1)

\@@univector ->univ

\unidiv ->192

\doutfunihash #1#2->\ifcsname \@@univector \number #1\endcsname \csname \@@univ
ector #1\endcsname {\utfmod {#2}}\else \@@unknownchar \fi 
#1<-\unidiv 
#2<-\numexpr ("C0C1)

\@@univector ->univ

\unidiv ->192

\@@univector ->univ

\unidiv ->192

\univ192 #1->
#1<-\utfmod {\numexpr ("C0C1)}

[NO MESSAGE FURTHER]
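
The subtraction of \utf@g inside \utfdiv looks odd at first sight. A plausible reason is that e-TeX's \numexpr division rounds to the nearest integer instead of truncating, so an offset of half the divisor turns it back into a floor division. The Python sketch below checks this under two assumptions that are not visible in the trace: \utf@g = 128 and \utf@h = 256. The function names simply mirror the macros.

```python
UTF_G = 128  # assumed value of \utf@g (not shown in the trace)
UTF_H = 256  # assumed value of \utf@h (not shown in the trace)

def etex_divide(a, b):
    # Model of \numexpr division for b > 0: round to nearest,
    # with halves rounded away from zero.
    sign = -1 if a < 0 else 1
    return sign * ((2 * abs(a) + b) // (2 * b))

def utfdiv(n):
    # Mirrors ((#1-\utf@g)/\utf@h) from the trace.
    return etex_divide(n - UTF_G, UTF_H)

def utfmod(n):
    # Mirrors (#1-\utf@h*((#1-\utf@g)/\utf@h)) from the trace.
    return n - UTF_H * utfdiv(n)

for n in (0xBFE1, 0xC0C1):
    print(hex(n), utfdiv(n), utfmod(n))
```

For "BFE1 this gives 191 and 225, for "C0C1 it gives 192 and 193, so the \unidiv values agree with the two traces. With the assumed constants the pair equals the plain quotient and remainder modulo 256 for every code point from 128 up to U+FFFF.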

Best, ChoF.
-- 
  ***  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ***
  ^ ^  |  ChoF  :=  Jin-Hwan Cho  | *^ ^*
   o   |   chofchof@ktug.or.kr    | * o *
  ***  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ***
 *^ ^* |    Project Manager of    |  ^ ^
 * O * | DVIPDFMx and MiKTeX-KTUG |   O
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| Research Fellow, School of Mathematics |
|   Korea Institute for Advanced Study   |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Re: UTF8 problems with Hangul Syllables
From: Hans Hagen @ 2002-12-19  9:16 UTC (permalink / raw)


At 10:53 AM 12/19/2002 +0900, Cho, Jin-Hwan wrote:

>Here is the log after turning on \tracingmacros. The difference is
>that "BFE1 calls \unicodeglyph, but "C0C1 calls \doutfunihash.

can you adapt regi-utf.tex to :

\dostepwiserecurse{192}{223}{1}
   {\expanded{\defineactiveinspector{\recurselevel} % space delimited
      {\noexpand\utftwouniglph{\recurselevel}}}%
    }%\letvalue{\@@univector\recurselevel}\gobbleoneargument}

\dostepwiserecurse{224}{239}{1}
   {\expanded{\defineactiveinspector{\recurselevel} % space delimited
      {\noexpand\utfthreeuniglph{\recurselevel}}}%
    }%\letvalue{\@@univector\recurselevel}\gobbletwoarguments}

\dostepwiserecurse{240}{247}{1}
   {\expanded{\defineactiveinspector{\recurselevel} % space delimited
      {\noexpand\utffouruniglph{\recurselevel}}}%
    }%\letvalue{\@@univector\recurselevel}\gobblethreearguments}

i.e. comment the last lines
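
One reading of the trace and of this change is a name clash: the \letvalue lines define macros named univ192 .. univ223 that swallow their argument, and the glyph lookup tests for a macro with exactly that name when \unidiv is 192 .. 223. The Python sketch below models this with a dictionary. It is a toy model of that interpretation, not the actual regi-utf.tex code; only the first loop (192 .. 223) is modeled, and all function names are hypothetical.

```python
def gobble_one_argument(_arg):
    """Stand-in for a macro that swallows its argument and typesets nothing."""
    return ""

def build_vectors(define_gobblers):
    """Model the macro namespace as a dict keyed by 'univ<number>'."""
    vectors = {}
    if define_gobblers:
        # Models the \letvalue line of the first loop (lead bytes 192..223).
        for lead_byte in range(192, 224):
            vectors[f"univ{lead_byte}"] = gobble_one_argument
    return vectors

def typeset_glyph(cp, vectors):
    """Pick the hash branch if 'univ<div>' exists, else the fallback branch."""
    div, mod = divmod(cp, 256)
    handler = vectors.get(f"univ{div}")
    if handler is not None:
        return handler(mod)        # the gobbler returns an empty result
    return f"<glyph {div}:{mod}>"  # fallback, in the role of \unicodeglyph

before = build_vectors(define_gobblers=True)
after = build_vectors(define_gobblers=False)
for cp in (0xBFE1, 0xC0C1):
    print(f"U+{cp:04X}", repr(typeset_glyph(cp, before)), repr(typeset_glyph(cp, after)))
```

In this model U+BFE1 is typeset in both cases, while U+C0C1 comes out empty as long as the gobblers are defined and appears once they are commented out.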

i do get something now, but somehow ttf2pt1 does not like this font so i 
get invalid pfb's

Hans

* Re: UTF8 problems with Hangul Syllables
From: Cho, Jin-Hwan @ 2002-12-20  8:18 UTC (permalink / raw)


Hans Hagen wrote:
> 
> can you adapt regi-utf.tex to :
> 
> \dostepwiserecurse{192}{223}{1}
>    {\expanded{\defineactiveinspector{\recurselevel} % space delimited
>       {\noexpand\utftwouniglph{\recurselevel}}}%
>     }%\letvalue{\@@univector\recurselevel}\gobbleoneargument}
> 
> (... skipped ...)
> 
> i.e. comment the last lines

Good news. After commenting out the last lines, it worked.

Best, ChoF.