ntg-context - mailing list for ConTeXt users
* <mtext> UTF further problems
@ 2005-06-20 13:46 Duncan Hothersall
  2005-06-20 20:32 ` James Cloos
  0 siblings, 1 reply; 11+ messages in thread
From: Duncan Hothersall @ 2005-06-20 13:46 UTC (permalink / raw)


I'm still having trouble with UTF contents in <mtext> tags in MathML
content.

It's difficult to send a sample because UTF isn't preserved well either
by email or when sending files to the ConTeXt Live server, but I'd
really appreciate it if somebody could help. When I replace 'HERE' in
the following snippet with any UTF encoded accented character, it comes
out wrong:

\useXMLfilter[utf]\usemodule[mathml]
\starttext\startXMLdata
<formula><math><mtext>HERE</mtext></math></formula>
\stopXMLdata\stoptext

Can anyone help?

Thanks a million.

Duncan


* Re: <mtext> UTF further problems
  2005-06-20 13:46 <mtext> UTF further problems Duncan Hothersall
@ 2005-06-20 20:32 ` James Cloos
  0 siblings, 0 replies; 11+ messages in thread
From: James Cloos @ 2005-06-20 20:32 UTC (permalink / raw)


>>>>> "Duncan" == Duncan Hothersall <dh@capdm.com> writes:

Duncan> I'm still having trouble with UTF contents in <mtext> tags in
Duncan> MathML content.

For whatever it is worth, I just tried that.  A double-acute u (U+0171)
came through w/o problem.  I'm using a gentoo box w/ tetex 3.0.

I tried both dvi and pdf output.  Both worked.

Are you sure your file is in utf-8 and not, eg, utf-16?

What platform are you on?  

The exact text I ended up with is:

    \useXMLfilter[utf]\usemodule[mathml]
    \starttext\startXMLdata
    <formula><math><mtext>ű</mtext></math></formula>
    \stopXMLdata\stoptext
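
Since editors and mail clients tend to re-encode the accented character, one way to reproduce this file byte for byte is to generate it from an ASCII-only script. A minimal sketch, assuming a POSIX shell; the function name `make_test_file` is made up for illustration, and `\305\261` is the octal form of the UTF-8 bytes 0xC5 0xB1 for U+0171:

```shell
#!/bin/sh
# Write the test document using only ASCII in this script, so nothing
# on the way can re-encode the accented character.
# make_test_file is a hypothetical helper name, not part of ConTeXt.
make_test_file() {
    out=$1
    {
        # '%s' keeps printf from interpreting the TeX backslashes
        printf '%s\n' '\useXMLfilter[utf]\usemodule[mathml]'
        printf '%s\n' '\starttext\startXMLdata'
        printf '%s' '<formula><math><mtext>'
        # U+0171 as raw UTF-8 bytes (0xC5 0xB1), written in octal
        printf '\305\261'
        printf '%s\n' '</mtext></math></formula>'
        printf '%s\n' '\stopXMLdata\stoptext'
    } >"$out"
}
```

Calling `make_test_file utf-test.tex` (the filename is arbitrary) and running texexec on the result takes the editor out of the picture.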

-JimC
-- 
James H. Cloos, Jr. <cloos@jhcloos.com>


* Re: <mtext> UTF further problems
  2005-06-22  2:18   ` Mojca Miklavec
  2005-06-22  8:06     ` Hans Hagen
@ 2005-06-22  8:35     ` Patrick Gundlach
  1 sibling, 0 replies; 11+ messages in thread
From: Patrick Gundlach @ 2005-06-22  8:35 UTC (permalink / raw)



[...]

> An interesting observation: I tested on live.contextgarden.net, on the
> latest ConTeXt in MikTeX distribution and in an old minimal ConTeXt
> distribution for Windows (6.12.2004). The results from MikTeX and
> live.contextgarden.net were equal. \v{c} resulted in something like
> "leftdoubleguillemont" overlapped with "c", \"{a}, \"{o} and \"{u}
> resulted in a dash over the letter a/o/u. 

If there is something I have to change on contextgarden, please tell
me off list (or the dev list). I might miss it here.

Patrick

-- 
ConTeXt wiki and more: http://contextgarden.net


* Re: <mtext> UTF further problems
  2005-06-22  2:18   ` Mojca Miklavec
@ 2005-06-22  8:06     ` Hans Hagen
  2005-06-22  8:35     ` Patrick Gundlach
  1 sibling, 0 replies; 11+ messages in thread
From: Hans Hagen @ 2005-06-22  8:06 UTC (permalink / raw)


Mojca Miklavec wrote:

> \usemodule[mathml]
> \starttext\startXMLdata
> <formula><math><mtext>\"{a}\"{o}\"{u}\v{c}\v{s}\v{z}</mtext></math></formula> 
> \stopXMLdata\stoptext
> 
> fails as well.
> 
> An interesting observation: I tested on live.contextgarden.net, on the 
> latest ConTeXt in MikTeX distribution and in an old minimal ConTeXt 
> distribution for Windows (6.12.2004). The results from MikTeX and 
> live.contextgarden.net were equal. \v{c} resulted in something like
> "leftdoubleguillemont" overlapped with "c", \"{a}, \"{o} and \"{u} 
> resulted in a dash over the letter a/o/u. In the minimal ConTeXt 
> distribution the line
>     \"{a}\"{o}\"{u}\v{c}\v{s}\v{z}
> resulted in
>     "{a}"{o}"{u}v {c}v {s}v {z}
> (literally).

Let me tell you that it's even stranger:

\usemodule[mathml]

\starttext

\chardef\XMLtokensreduction\zerocount

\startXMLdata
<formula><math><mtext>\"{a}\"{o}\"{u}\v{c}\v{s}\v{z}</mtext></math></formula>
\stopXMLdata

\chardef\XMLtokensreduction\plustwo

\startXMLdata
<formula><math><mtext>\"{a}\"{o}\"{u}\v{c}\v{s}\v{z}</mtext></math></formula>
\stopXMLdata

\stoptext

since currently the mml code reenters the tokenizer, the text inside mtext is 
actually seen as tex code

the only way to get this fixed is to rewrite the mml parser (i've partially done 
that already using the normal xml handler instead of the messy mapper); it seems 
that i need to speed up that port.

Hans


-----------------------------------------------------------------
                                           Hans Hagen | PRAGMA ADE
               Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
      tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
                                              | www.pragma-pod.nl
-----------------------------------------------------------------


* Re: <mtext> UTF further problems
       [not found] <20050621234837.018CB127C9@ronja.ntg.nl>
@ 2005-06-22  7:49 ` Duncan Hothersall
  0 siblings, 0 replies; 11+ messages in thread
From: Duncan Hothersall @ 2005-06-22  7:49 UTC (permalink / raw)


Hans wrote:
> I attached a small test file. Some trickery is needed to get utf working in mathml
> 
> - the map patch goes into xtag-map.tex
> - the other one into xtag-mmp
> 
> part of the problem is that the current font must provide the characters

Hans, thank you very much. I've been banging my head off that one for 
quite a while. This fixes it perfectly.

> You're lucky that i have to run some big boring files in the background -)

I'm really glad you had a boring evening :)

Duncan


* Re: <mtext> UTF further problems
  2005-06-21 15:43 ` Duncan Hothersall
  2005-06-21 23:48   ` Hans Hagen
  2005-06-22  0:40   ` James Cloos
@ 2005-06-22  2:18   ` Mojca Miklavec
  2005-06-22  8:06     ` Hans Hagen
  2005-06-22  8:35     ` Patrick Gundlach
  2 siblings, 2 replies; 11+ messages in thread
From: Mojca Miklavec @ 2005-06-22  2:18 UTC (permalink / raw)


Duncan Hothersall wrote:
> I think maybe it does. Is there anyone who is running the *minimal 
> install* (from Hans' zip files) on either windows or linux who could 
> test this for me? I just need you to try out a unicode accented 
> character within an <mtext> element inside MathML. Here's my template 
> again - put a unicode accented char where 'HERE' appears:
> 
> \useXMLfilter[utf]\usemodule[mathml]
> \starttext\startXMLdata
> <formula><math><mtext>HERE</mtext></math></formula>
> \stopXMLdata\stoptext
> 
> It's usually my first port of call, but AFAIK it's not possible to 
> control the way the web browser re-encodes stuff before it is submitted, 
> so the results are not reliable. This is a real shame - TeX-Live is how 
> I usually confirm all my queries.

I'm not sure about it, but my experience is that if I ask for a UTF-8 
encoded page (at least in Mozilla), my input is also submitted as UTF-8.

(When I worked with phpMyAdmin for example, I had to use the same 
encoding as the database which was sometimes annoying - the pages had to 
be shown in the wrong encoding in order to be able to work with the data 
properly.)

The rest of the contextgarden pages already have UTF-8 encoding, but I'm 
not sure if that wouldn't disturb some newbies forgetting to add 
\enableregime[utf] at the beginning of the document.

I tried this out:

\enableregime[utf]
\useXMLfilter[utf]\usemodule[mathml]
\starttext
äöüčćšđž % rendered properly
\startXMLdata
<formula><math><mtext>äöüčćšđž</mtext></math></formula>
\stopXMLdata\stoptext

It doesn't work here either, but the problem doesn't seem to be in 
unicode, but in the rendering of accented characters.

This example:

\usemodule[mathml]
\starttext\startXMLdata
<formula><math><mtext>\"{a}\"{o}\"{u}\v{c}\v{s}\v{z}</mtext></math></formula>
\stopXMLdata\stoptext

fails as well.

An interesting observation: I tested on live.contextgarden.net, on the 
latest ConTeXt in MikTeX distribution and in an old minimal ConTeXt 
distribution for Windows (6.12.2004). The results from MikTeX and 
live.contextgarden.net were equal. \v{c} resulted in something like
"leftdoubleguillemont" overlapped with "c", \"{a}, \"{o} and \"{u} 
resulted in a dash over the letter a/o/u. In the minimal ConTeXt 
distribution the line
     \"{a}\"{o}\"{u}\v{c}\v{s}\v{z}
resulted in
     "{a}"{o}"{u}v {c}v {s}v {z}
(literally).

The example with utf didn't even compile there.

Mojca


* Re: <mtext> UTF further problems
  2005-06-21 15:43 ` Duncan Hothersall
  2005-06-21 23:48   ` Hans Hagen
@ 2005-06-22  0:40   ` James Cloos
  2005-06-22  2:18   ` Mojca Miklavec
  2 siblings, 0 replies; 11+ messages in thread
From: James Cloos @ 2005-06-22  0:40 UTC (permalink / raw)


>>>>> "Duncan" == Duncan Hothersall <dh@capdm.com> writes:

>> You may want to give TeX-Live a test.

Duncan> It's usually my first port of call, but AFAIK it's not
Duncan> possible to control the way the web browser re-encodes stuff
Duncan> before it is submitted, so the results are not reliable.

Is there a collision in the name TeX Live?  I was thinking of the
CD/DVD sets.  They have TeX for a wide range of systems.  And you
can run it directly from the CD for at least windows & linux/x86.
(Hence Live.)

-JimC


* Re: <mtext> UTF further problems
  2005-06-21 15:43 ` Duncan Hothersall
@ 2005-06-21 23:48   ` Hans Hagen
  2005-06-22  0:40   ` James Cloos
  2005-06-22  2:18   ` Mojca Miklavec
  2 siblings, 0 replies; 11+ messages in thread
From: Hans Hagen @ 2005-06-21 23:48 UTC (permalink / raw)


[-- Attachment #1: Type: text/plain, Size: 2519 bytes --]

Duncan Hothersall wrote:

> Thank you for your patience Jim.
> 
>> Try this at a shell prompt:
>>
>>     env LANG=C LC_ALL=C cat --show-all FileName
>>
>> where FileName is the file in question. The non-ascii characters will
>> be output as strings that look M-? where ? is a single ascii character.
>> If you see a single M-? triplet in place of each non-ascii character
>> you do not have utf-8. If you see between two and five such triplets
>> for each non-ascii character in the document it is probably utf-8.
>> (If you see ^@ pairs separating the ascii chars you have utf-16.)
> 
> 
> Okay, this gives me some comfort as it seems to confirm that I do have 
> UTF-8 as I thought. I'm seeing twos, threes and fours of the triplets 
> you describe, and no evidence of high-ascii single chars nor of ^@. So 
> I'm pretty sure it is UTF-8. Thanks for this.
> 
>> I've only tested on tetex-3. That may make a difference....
> 
> 
> I think maybe it does. Is there anyone who is running the *minimal 
> install* (from Hans' zip files) on either windows or linux who could 
> test this for me? I just need you to try out a unicode accented 
> character within an <mtext> element inside MathML. Here's my template 
> again - put a unicode accented char where 'HERE' appears:
> 
> \useXMLfilter[utf]\usemodule[mathml]
> \starttext\startXMLdata
> <formula><math><mtext>HERE</mtext></math></formula>
> \stopXMLdata\stoptext
> 
>> You may want to give TeX-Live a test.
> 
> 
> It's usually my first port of call, but AFAIK it's not possible to 
> control the way the web browser re-encodes stuff before it is submitted, 
> so the results are not reliable. This is a real shame - TeX-Live is how 
> I usually confirm all my queries.
> 
> Thanks again Jim; can anyone running the minimal install help me?
> 
> Duncan

I attached a small test file. Some trickery is needed to get utf working in mathml

- the map patch goes into xtag-map.tex
- the other one into xtag-mmp

part of the problem is that the current font must provide the characters

You're lucky that i have to run some big boring files in the background -)

Hans

-----------------------------------------------------------------
                                           Hans Hagen | PRAGMA ADE
               Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
      tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
                                              | www.pragma-pod.nl
-----------------------------------------------------------------

[-- Attachment #2: test.zip --]
[-- Type: application/x-zip-compressed, Size: 23080 bytes --]



* Re: <mtext> UTF further problems
       [not found] <20050621100002.D406C127BF@ronja.ntg.nl>
  2005-06-21 10:25 ` Duncan Hothersall
@ 2005-06-21 15:43 ` Duncan Hothersall
  2005-06-21 23:48   ` Hans Hagen
                     ` (2 more replies)
  1 sibling, 3 replies; 11+ messages in thread
From: Duncan Hothersall @ 2005-06-21 15:43 UTC (permalink / raw)


Thank you for your patience Jim.

> Try this at a shell prompt:
> 
>     env LANG=C LC_ALL=C cat --show-all FileName
> 
> where FileName is the file in question. The non-ascii characters will
> be output as strings that look M-? where ? is a single ascii character.
> If you see a single M-? triplet in place of each non-ascii character
> you do not have utf-8. If you see between two and five such triplets
> for each non-ascii character in the document it is probably utf-8.
> (If you see ^@ pairs separating the ascii chars you have utf-16.)

Okay, this gives me some comfort as it seems to confirm that I do have 
UTF-8 as I thought. I'm seeing twos, threes and fours of the triplets 
you describe, and no evidence of high-ascii single chars nor of ^@. So 
I'm pretty sure it is UTF-8. Thanks for this.

> I've only tested on tetex-3. That may make a difference....

I think maybe it does. Is there anyone who is running the *minimal 
install* (from Hans' zip files) on either windows or linux who could 
test this for me? I just need you to try out a unicode accented 
character within an <mtext> element inside MathML. Here's my template 
again - put a unicode accented char where 'HERE' appears:

\useXMLfilter[utf]\usemodule[mathml]
\starttext\startXMLdata
<formula><math><mtext>HERE</mtext></math></formula>
\stopXMLdata\stoptext

> You may want to give TeX-Live a test.

It's usually my first port of call, but AFAIK it's not possible to 
control the way the web browser re-encodes stuff before it is submitted, 
so the results are not reliable. This is a real shame - TeX-Live is how 
I usually confirm all my queries.

Thanks again Jim; can anyone running the minimal install help me?

Duncan


* Re: <mtext> UTF further problems
  2005-06-21 10:25 ` Duncan Hothersall
@ 2005-06-21 12:55   ` James Cloos
  0 siblings, 0 replies; 11+ messages in thread
From: James Cloos @ 2005-06-21 12:55 UTC (permalink / raw)


>>>>> "Duncan" == Duncan Hothersall <dh@capdm.com> writes:

>> Are you sure your
>> file is in utf-8 and not, eg, utf-16?

Duncan> I was, but I'm no longer sure of anything. :-) Is there a
Duncan> foolproof way of finding out?

(First, I cannot comment usefully wrt this topic and windows.)

Try this at a shell prompt:

    env LANG=C LC_ALL=C cat --show-all FileName

where FileName is the file in question.  The non-ascii characters will
be output as strings that look M-? where ? is a single ascii character.
If you see a single M-? triplet in place of each non-ascii character
you do not have utf-8.  If you see between two and five such triplets
for each non-ascii character in the document it is probably utf-8.
(If you see ^@ pairs separating the ascii chars you have utf-16.)

Of course, context would not be able to deal with utf16 on linux;
tex would just get confused by the interspersed NULLs (represented
as ^@ in the --show-all output described above) in the initial lines.

So if it is an encoding problem, it is more likely that you are ending
up with a file in one of the iso8859 8-bit encodings.  
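
The three cases above (NULs from utf-16, well-formed utf-8, some other 8-bit encoding) can also be told apart by a script. This is only a rough sketch, assuming a POSIX shell with `iconv`, `tr`, `cmp` and `wc` available; the function name `guess_encoding` and its output labels are made up for illustration, and it cannot tell the individual iso8859 parts from each other:

```shell
#!/bin/sh
# Rough byte-level guess at a file's encoding.
# Prints one of: utf-16?  ascii  utf-8  8-bit?
# guess_encoding is a hypothetical helper name.
guess_encoding() {
    f=$1
    # NUL bytes are the utf-16 symptom (shown as ^@ by cat --show-all):
    # if deleting them changes the file, it contained some.
    if ! tr -d '\000' <"$f" | cmp -s - "$f"; then
        echo 'utf-16?'
    # Nothing left after deleting all 7-bit bytes: plain ascii.
    elif [ "$(LC_ALL=C tr -d '\000-\177' <"$f" | wc -c)" -eq 0 ]; then
        echo 'ascii'
    # iconv exits non-zero when the input is not well-formed utf-8.
    elif iconv -f UTF-8 -t UTF-8 <"$f" >/dev/null 2>&1; then
        echo 'utf-8'
    else
        echo '8-bit?'
    fi
}
```

A result of `8-bit?` is the case where trying the individual iso8859 encodings one by one makes sense.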

A (not-so-?)quick test is this.  Save it w/o the leading blanks
and run it, passing a filename as a single argument.

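  #!/bin/bash
  # usage: pass the file to test as the single argument, e.g. foo.tex
  # each pass treats the input as one iso8859 part and converts it to
  # utf8; texexec only runs when iconv succeeded
  for ij in $(seq 1 15); do
      iconv -f "iso8859-${ij}" -t utf8 <"$1" >"from-${ij}-$1" && \
          texexec "from-${ij}-$1"
  done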

Then test all of the generated dvi files to see whether any worked.

Duncan> I tend to use emacs, which I thought was a pretty safe bet,
Duncan> but maybe I should try something else?

I also use emacs, but from cvs.  (Gentoo has an emacs-cvs ebuild that
makes that easy.)  I also run with LANG=en_US.UTF-8 and several
settings in emacs to prefer utf8.  The emacs-unicode-2 branch in CVS
(what will become emacs-23; CVS HEAD will become emacs-22) is even
better for this since it uses unicode as its internal representation.

Duncan> I'm testing on both Windows and (Redhat) linux, both with the
Duncan> current minimal ConTeXt installations (i.e. mswintex.zip and
Duncan> linuxtex.zip). They exhibit the same behaviour.

I've only tested on tetex-3.  That may make a difference....

You may want to give TeX-Live a test.

-JimC
-- 
James H. Cloos, Jr. <cloos@jhcloos.com>


* Re: <mtext> UTF further problems
       [not found] <20050621100002.D406C127BF@ronja.ntg.nl>
@ 2005-06-21 10:25 ` Duncan Hothersall
  2005-06-21 12:55   ` James Cloos
  2005-06-21 15:43 ` Duncan Hothersall
  1 sibling, 1 reply; 11+ messages in thread
From: Duncan Hothersall @ 2005-06-21 10:25 UTC (permalink / raw)


> Duncan> I'm still having trouble with UTF contents in <mtext> tags in
> Duncan> MathML content.
> 
> For whatever it is worth, I just tried that.  A double-acute u (U+0171)
> came through w/o problem.  I'm using a gentoo box w/ tetex 3.0.

Thanks very much for trying it out.

> I tried both dvi and pdf output.  Both worked.
> 
> Are you sure your file is in utf-8 and not, eg, utf-16?

I was, but I'm no longer sure of anything. :-) Is there a foolproof way 
of finding out?

It seems that lots of editors try to 'help' by doing automatic guessing 
and automatic translations into other encodings, making it very 
difficult to tie things down. (And web browsers do the same when 
submitting things over the web, so I can't do what I usually do and test 
on Live.) I tend to use emacs, which I thought was a pretty safe bet, 
but maybe I should try something else?

> What platform are you on?  

I'm testing on both Windows and (Redhat) linux, both with the current 
minimal ConTeXt installations (i.e. mswintex.zip and linuxtex.zip). They 
exhibit the same behaviour.

Thanks for any advice you can give.

Duncan


end of thread, other threads:[~2005-06-22  8:35 UTC | newest]

Thread overview: 11+ messages
2005-06-20 13:46 <mtext> UTF further problems Duncan Hothersall
2005-06-20 20:32 ` James Cloos
     [not found] <20050621100002.D406C127BF@ronja.ntg.nl>
2005-06-21 10:25 ` Duncan Hothersall
2005-06-21 12:55   ` James Cloos
2005-06-21 15:43 ` Duncan Hothersall
2005-06-21 23:48   ` Hans Hagen
2005-06-22  0:40   ` James Cloos
2005-06-22  2:18   ` Mojca Miklavec
2005-06-22  8:06     ` Hans Hagen
2005-06-22  8:35     ` Patrick Gundlach
     [not found] <20050621234837.018CB127C9@ronja.ntg.nl>
2005-06-22  7:49 ` Duncan Hothersall
