Chinese in utf-8

ntg-context - mailing list for ConTeXt users

* Chinese in utf-8
From: Duncan Hothersall @ 2005-10-10 15:40 UTC (permalink / raw)


Hi all.

I have ConTeXt set up to output Chinese using \usemodule[chinese]; all
fonts, encodings and maps are installed and the sample file works well.

Now I have a whole load of Chinese text in utf-8 encoding. Can ConTeXt
process this, or do I have to convert it to another encoding? I tried
\enableregime[utf] and \useencoding[uc] but it just produced black blobs
instead of Chinese characters.

I hope ConTeXt can do it? :-)

Thanks,

Duncan


* Re: Chinese in utf-8
From: Radhelorn @ 2005-10-10 16:53 UTC (permalink / raw)


Duncan Hothersall wrote:
> Hi all.
> 
> I have ConTeXt set up to output Chinese using usemodule[chinese], all
> fonts, encodings and maps are installed and the sample file works well.
> 
> Now I have a whole load of Chinese text in utf-8 encoding. Can ConTeXt
> process this, or do I have to convert it to another encoding? I tried
> \enableregime[utf] and \useencoding[uc] but it just produced black blobs
> instead of Chinese characters.
> 
> I hope ConTeXt can do it? :-)
> 
> Thanks,
> 
> Duncan

Please post the output of the texexec command. Maybe ConTeXt fails to
find some files?

-- 
Radhelorn <radhelorn@mail.ru>


* Re: Chinese in utf-8
From: Hans Hagen @ 2005-10-10 20:35 UTC (permalink / raw)


Radhelorn wrote:

> Duncan Hothersall wrote:
>
>> Hi all.
>>
>> I have ConTeXt set up to output Chinese using usemodule[chinese], all
>> fonts, encodings and maps are installed and the sample file works well.
>>
>> Now I have a whole load of Chinese text in utf-8 encoding. Can ConTeXt
>> process this, or do I have to convert it to another encoding? I tried
>> \enableregime[utf] and \useencoding[uc] but it just produced black blobs
>> instead of Chinese characters.
>>
>> I hope ConTeXt can do it? :-)
>>
>> Thanks,
>>
>> Duncan
>
>
> Please post output of texexec command. Maybe ConTeXt fails to find 
> some files?

that's tricky. the utf handler assumes named glyphs and no one named the 5000 chinese ones so far

(some day pdftex will be unicode aware, and then these problems will disappear)

in the current utf handling mechanism i can envision something: 

- the utf code results in an expansion of the vector 
- instead of using a named glyph, we use a trick

some variant on: 

\startunicodevector chinese_unicode_page_number_1
  \getglyph{ChineseFont1}{#1}%
\stopunicodevector

or, because of some trickery used internally (untested), probably something like the following (not sure, best make a new command):

\startunicodevector chinese_unicode_page_number_1    
  getglyph\endcsname{ChineseFont1}{#1}\gobbleoneargument 
\stopunicodevector

so, then you only need to define the right fonts i.e. 

\definefont[ChineseFont1][whateverchinesefont_1]

which has the right glyphs in the right slots 

so ... it's actually simple, once you have the fonts split up 

probably the \getglyph needs to be replaced by a more clever one that handles special chinese thingies,

another option is to write another mapper analogous to the ones already there for chinese, i.e. if there is some mapping from utf to big5 or so, hook that into the utf handler.

(beware, the font-chi modules talk about unicode while actually it's about dedicated mappings resembling a unicode approach; this \defineucharmapping stuff)
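
The vector idea above needs each character split into a "page" and a slot within that page, so that one 256-glyph font can serve one page. A minimal Perl sketch of that split follows. It assumes the page is the code point divided by 256 and the slot is the remainder; that is an assumption about how the utf handler indexes its vectors, not something confirmed in this thread. page_and_slot is a hypothetical helper name, not part of ConTeXt.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Split each character of a UTF-8 byte string into [page, slot], where
# page = code point div 256 and slot = code point mod 256.
# page_and_slot is a hypothetical helper, not part of ConTeXt.
sub page_and_slot {
    my ($utf8_bytes) = @_;
    my $text = decode('UTF-8', $utf8_bytes);
    return map { [ ord($_) >> 8, ord($_) & 0xFF ] } split //, $text;
}

# "\xE4\xB8\xAD" is the UTF-8 encoding of U+4E2D.
for my $pair (page_and_slot("\xE4\xB8\xAD")) {
    printf "page %d, slot %d\n", @$pair;   # prints "page 78, slot 45"
}
```

Under that assumption, the character above would be slot 45 of whatever font is defined for page 78, which is why the fonts have to be split up into 256-glyph pieces first.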

Hans 

-----------------------------------------------------------------
                                          Hans Hagen | PRAGMA ADE
              Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
     tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
                                             | www.pragma-pod.nl
-----------------------------------------------------------------


* Re: Chinese in utf-8
From: Lutz Haseloff @ 2005-10-17  4:48 UTC (permalink / raw)


Hi Duncan,



Duncan Hothersall wrote:
> Hi all.
> 
> I have ConTeXt set up to output Chinese using usemodule[chinese], all
> fonts, encodings and maps are installed and the sample file works well.
> 
> Now I have a whole load of Chinese text in utf-8 encoding. Can ConTeXt
> process this, or do I have to convert it to another encoding? I tried
> \enableregime[utf] and \useencoding[uc] but it just produced black blobs
> instead of Chinese characters.
> 
> I hope ConTeXt can do it? :-)
> 
> Thanks,
> 
> Duncan


I prepared a small Perl script to convert Chinese UTF-8 encoded
TeX files to GBK-encoded TeX files. I call it right
before using texexec.pl to create a PDF from the resulting
TeX file. It has the advantage that you can use both simplified
and traditional characters in one file, if you have fully
GBK-enabled font files (all Chinese ht*.ttf).
You can easily see all Chinese characters on the screen with any
Unicode-enabled editor (SciTE).

Here you are:

utf82gbk.pl

-----------------------------

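#!/usr/bin/perl -w

# utf82gbk.pl: convert a UTF-8 encoded TeX file to GBK, then protect
# second bytes that TeX would treat as special characters.

use strict;
use utf8;
use Encode::HanConvert;   # provides simp_to_gb()

our ($filename, $recoded);

die "usage: utf82gbk.pl filename.tex\n" unless @ARGV ;

$filename = $ARGV[0];
$filename =~ s/\.tex$//io ;

# Step 1: read filename.tex as UTF-8 and write filename-gbk.tex as GBK.
if (open(INP,"<:utf8","$filename.tex"))
     {
       print "processing file $filename.tex\n" ;
       undef $/ ;                 # slurp the whole file at once
       $_ = <INP> ;
       close(INP) ;
       simp_to_gb($_) ;           # in void context this converts $_ in place
       use bytes ;                # from here on, treat strings as raw bytes
       if (open(OUT,">","$filename-gbk.tex"))
         { print OUT $_ ;
           close(OUT) ;
           }
       }
   else
     { print "invalid filename\n" }
if (-e "$filename-gbk.tex") {print "created file $filename-gbk.tex\n"}

# Replace a GBK pair whose second byte is a non-alphanumeric ASCII
# character (\, {, }, ...) by \uc{first}{second}; keep all other pairs.
sub unirecode
  { my ($first,$second) = @_ ;
    if ((ord($second)<0x80)&&($second !~ /[a-zA-Z0-9]/))
      { print "$second" ; ++$recoded ;
        return "\\uc\{" . ord($first) . "\}\{". ord($second) . "\}" }
    else
      { return "$first$second" } }

# Step 2: recode the GBK file; the unrecoded version is kept as .tec.
if (open(INP,"$filename-gbk.tex"))
     { $recoded  = 0 ;
       print "processing file $filename-gbk.tex " ;
       undef $/ ;
       $_ = <INP> ;
       close(INP) ;
       s/([\x80-\xFF])(.)/unirecode($1,$2)/mgoe ;
       # write to the same .tmp name that is renamed below
       if (($recoded)&&(open(OUT,">$filename-gbk.tmp")))
         {  print OUT $_ ;
            close(OUT) ;
            unlink "$filename-gbk.tec" ;
            rename "$filename-gbk.tex", "$filename-gbk.tec" ;
            rename "$filename-gbk.tmp", "$filename-gbk.tex" ;
            }
       if ($recoded)
         { print " - $recoded glyphs recoded - original saved as $filename-gbk.tec\n" }
       else
         { print "- no glyphs recoded\n" } }
   else
     { print "invalid filename\n" }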


-----------------------------
usage:
perl utf82gbk.pl filename.tex
texexec filename-gbk.tex

It's a combination of Hans Hagen's tex2uc.pl, which converts
codes containing TeX-related characters (\, {, }, ...) into
\unicodeglyph commands, and a simple UTF-8 to GBK converter.
It needs the module Encode::HanConvert.
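
For readers who want to see the two steps in isolation, here is a minimal sketch using only Perl's core Encode module instead of Encode::HanConvert (a swapped-in technique, not the script's own method). It assumes "cp936", Encode's name for GBK, is an acceptable stand-in for simp_to_gb, and it uses the stricter GBK lead-byte range 0x81-0xFE. to_gbk, protect_pair and protect_pairs are hypothetical helper names.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Encode a Perl Unicode string as GBK bytes ("cp936" is Encode's name for GBK).
sub to_gbk {
    my ($text) = @_;
    return encode('cp936', $text);
}

# Return the pair unchanged unless its second byte is a non-alphanumeric
# ASCII character, in which case emit \uc{first}{second} with decimal bytes.
sub protect_pair {
    my ($first, $second) = @_;
    return $first . $second
        unless ord($second) < 0x80 && $second !~ /[A-Za-z0-9]/;
    return sprintf('\\uc{%d}{%d}', ord($first), ord($second));
}

# Walk a GBK byte string pair by pair and protect the risky pairs.
sub protect_pairs {
    my ($gbk) = @_;
    $gbk =~ s/([\x81-\xFE])(.)/protect_pair($1, $2)/gse;
    return $gbk;
}

print unpack('H*', to_gbk("\x{4E2D}")), "\n";   # prints "d6d0"
print protect_pairs("\x81\x5C"), "\n";          # prints "\uc{129}{92}"
```

The second call shows why the recoding pass is needed at all: the pair 0x81 0x5C ends in a backslash byte, which TeX would otherwise read as the start of a control sequence.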

I created two new menu entries in my SciTE editor:
"Create gbk texfile", which creates filename-gbk.tex, and
"Process gbk texfile", which runs texexec on this new file.
It works very well for me.

I hope this helps a bit until pdftex can handle unicode.

Greetings from Potsdam, Germany

Lutz

P.S. Excuse my bad english


* Re: Chinese in utf-8
From: Duncan Hothersall @ 2005-10-17 14:21 UTC (permalink / raw)


Lutz Haseloff said:

> i prepared a small perl script to convert chinese utf-8 encoded
> tex-files to gbk coded tex-files. 

Thanks so much, I look forward to trying it out next week when I get 
back to work, and will let you know how I get on. Thanks for taking the 
time.

Duncan


* Re: Chinese in utf-8
From: Duncan Hothersall @ 2005-10-14 10:52 UTC (permalink / raw)


>>(beware, the font-chi modules talk about unicode while actually it's
>>about dedicated mappings resembling a unicode approach; this
>>\defineucharmapping stuff)
> 
> Yes indeed, that had me going... :-) Oh well.
> 
> Thanks for the insight, I'll feedback further.

I have to say I'm unable to make any sense of it at the moment. I think
I understand the logic of what is needed but understanding the current
implementation is way beyond my current capacity.

Does anyone else have a need to process Simplified Chinese encoded in
UTF-8? If not, perhaps I should just explore getting my sources changed
into GBK.

If there was someone else with the same need we could perhaps share the
burden...

Thanks again,

Duncan

(PS I'm away for the next week, please don't think I'm ignoring you if I
don't reply over that period.)


* Re: Chinese in utf-8
From: Duncan Hothersall @ 2005-10-12 14:04 UTC (permalink / raw)

Hans wrote:

> that's tricky. the utf handler assumes named glyphs and no one named
> the 5000 chinese ones so far
...
> some variant on:
...

> \startunicodevector chinese_unicode_page_number_1 
> getglyph\endcsname{ChineseFont1}{#1}\gobbleoneargument 
> \stopunicodevector
> 
> so, then you only need to define the right fonts i.e.
> 
> \definefont[ChineseFont1][whateverchinesefont_1]
> 
> which has the right glyphs in the right slots
> 
> so ... it's actually simple, once you have the fonts split up
> 
> probably the \getglyph needs to be replaced by a more clever one that
> handles special chinese thingies,

Wow. At the moment I have no idea how most of the Chinese module or
font-handling works, nor how I would implement something using the
tricks you describe. I guess I would need some hand-holding if I were to
embark on this. I would also need to understand the mechanism
used to re-use a TTF font many times with different encodings to create
multiple 256-char tfms.

> another option is to write another mapper analogous to the ones
> already there for chinese, i.e. if there is some mapping from utf to
> big5 or so, hook that into the utf handler.

This sounds like something I can at least understand a bit better. I
will start here, and see what success I have. Perhaps it will help
eventually with an attempt to do it the "right" way above.

> (beware, the font-chi modules talk about unicode while actually it's
> about dedicated mappings resembling a unicode approach; this
> \defineucharmapping stuff)

Yes indeed, that had me going... :-) Oh well.

Thanks for the insight, I'll feedback further.

Duncan
