[9fans] [RFC] fonts and unicode/utf [TeX]

9fans - fans of the OS Plan 9 from Bell Labs

* [9fans] [RFC] fonts and unicode/utf [TeX]
@ 2011-06-16 12:17 tlaronde
  2011-06-16 16:49 ` Russ Cox
                   ` (3 more replies)
  0 siblings, 4 replies; 52+ messages in thread
From: tlaronde @ 2011-06-16 12:17 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Hello,

I'm currently exploring, for kerTeX, the area I know the least about
so far: fonts.

It seems that the TeX community has spent a huge amount of time, and
produced a huge number of tricks, trying to use fonts that have glyphs
that Computer Modern lacks, especially accented letters.

In 1990, D.E. Knuth wrote an article explaining the correct (both
simpler and more versatile) solution: virtual fonts.

But since people had already spent a huge amount of time, it seems
that they were reluctant to throw everything away, and we are still
dragging tons of data and struggling with puzzling tricks just because
of human nature...

So, trying to give the easiest solution for now, and trying to think
about what can be done to use the simplest way established in Plan9,
utf, I'm following these tracks.

Adobe has published the AFM for the 35 standard base fonts for
PostScript (the fonts that are resident in a PS printer). Starting from
these AFM, kerTeX will produce the corresponding TFM, plus a virtual
font.

A virtual font can combine several distinct fonts, and can furthermore
remap glyphs. Since TeX (for now) treats its input as a stream of octets,
the task is to map this input encoding to the correct glyphs. One of
the great features of the AFM is that a glyph is described by a literal
ASCII name. The position of the glyph in the font, its index, is not
of great concern: the virtual font can take care of the mapping (whereas
using a TFM directly takes the input code as the index).
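
To make the mapping idea concrete, here is a minimal Python sketch. It
is only an illustration: real virtual fonts are VF/VPL files, not
dictionaries, and the slot numbers in BASE_FONT_SLOTS are invented for
the example rather than taken from any actual font. Only the glyph
names follow the AFM naming style.

```python
# Hypothetical base font layout: glyph name -> slot in that font.
# The slot numbers are made up for illustration.
BASE_FONT_SLOTS = {"A": 65, "ccedilla": 141, "eacute": 142, "egrave": 143}

# Input encoding (a few latin1 code values) -> AFM-style glyph name.
LATIN1_TO_GLYPH = {0x41: "A", 0xE7: "ccedilla", 0xE8: "egrave", 0xE9: "eacute"}


def build_virtual_map(input_to_glyph, glyph_to_slot):
    """Return input code -> base-font slot, skipping glyphs the font lacks."""
    return {
        code: glyph_to_slot[name]
        for code, name in input_to_glyph.items()
        if name in glyph_to_slot
    }


VIRTUAL_MAP = build_virtual_map(LATIN1_TO_GLYPH, BASE_FONT_SLOTS)


def typeset(octets):
    """Translate input octets to base-font slots through the virtual map."""
    return [VIRTUAL_MAP[b] for b in octets]
```

With this table, the latin1 octet 0xE9 selects the glyph named eacute
wherever the base font happens to store it; the base font itself is
left untouched.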

I have therefore extended the encoding used to generate the virtual
fonts so that for the ASCII range it matches what Computer Modern
expects (hence it is fully compatible with plain TeX), and so that
latin1 input gives the correct glyphs. And the cryptic names will be
gone, because a (virtual) font will be loaded by calling
latin1/the_font.

Why latin1? Not only because, being French, I use it, but because it is
compatible with unicode.

First question: any feelings about this?

Second question: I'm trying to find out whether, for western languages,
including ligatures for ae and oe would be good, since they are
generally needed (one can forbid a ligature by inserting "{}" between
the letters), or whether it is wrong to set this by default for fonts
(that have the glyphs), since some western languages commonly use the
ae or oe combinations without knowing or expecting the substitution.

A future step could be made in the following direction:

TeX is not limited to octet characters: for math, it already uses
positive wydes (values below 2^15). The character code is always mapped
to [0..255], but the whole number is used to switch between fonts (to
simplify: see math mode, \fam and so on).
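
As an illustration of how a 15-bit math code packs a font selector
around an 8-bit position, here is a small Python sketch. The field
layout (3 bits of class, 4 bits of family, 8 bits of position) is the
one described in the TeXbook; the function name is only a label chosen
for this example.

```python
def split_math_code(code):
    """Split a 15-bit TeX math code into (class, family, position)."""
    if not 0 <= code < 0x8000:
        raise ValueError("math code must fit in 15 bits")
    math_class = code >> 12      # 3 bits: ordinary, operator, relation, ...
    family = (code >> 8) & 0xF   # 4 bits: which font family to use
    position = code & 0xFF       # 8 bits: slot in [0..255] of that font
    return math_class, family, position
```

For instance, the hexadecimal math code 202B (the one plain TeX assigns
to "+") splits into class 2, family 0, position 0x2B: the position
stays within a 256-slot font while the upper bits choose the font.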

Something like that could be done in the future to use a TeX file
encoded in utf directly, using the rune to select fonts or subfonts.

Cheers,
--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C


* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-16 12:17 [9fans] [RFC] fonts and unicode/utf [TeX] tlaronde
@ 2011-06-16 16:49 ` Russ Cox
  2011-06-16 17:37   ` tlaronde
  2011-06-16 17:43 ` tlaronde
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 52+ messages in thread
From: Russ Cox @ 2011-06-16 16:49 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Virtual fonts tricks can't be the correct solution.
The correct solution is to use a font format that
can handle >256 glyphs, such as OTF.
This is what heirloom troff does.

Failing that, it is not clear how much you want to
hack up tex versus just going along to get along.
For Latin alphabets, the Plan 9 tex iso has an extra
style file called 'unicode.sty' that does some serious
latex heroics to trick latex into interpreting UTF-8
byte sequences as their corresponding Latex
equivalents.

Russ


* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-16 16:49 ` Russ Cox
@ 2011-06-16 17:37   ` tlaronde
  2011-06-16 18:43     ` Bakul Shah
  0 siblings, 1 reply; 52+ messages in thread
From: tlaronde @ 2011-06-16 17:37 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Jun 16, 2011 at 12:49:12PM -0400, Russ Cox wrote:
> Virtual fonts tricks can't be the correct solution.

Virtual fonts are not the whole solution. To accept, naturally, utf as
input, TeX will have to be adapted (and it is perhaps not as deep as
one could think).

But virtual fonts can use fonts whose glyphs are not laid out
according to unicode, leaving the fonts untouched.

That's where the present situation seems suboptimal, since afm2tfm(1)
is even used to re-encode the PostScript fonts.

> The correct solution is to use a font format that
> can handle >256 glyphs, such as OTF.
> This is what heirloom troff does.
> 
> Failing that, it is not clear how much you want to
> hack up tex versus just going along to get along.
> For Latin alphabets, the Plan 9 tex iso has an extra
> style file called 'unicode.sty' that does some serious
> latex heroics to trick latex into interpreting UTF-8
> byte sequences as their corresponding Latex
> equivalents.

See above. But as always, the first step is to simplify things so that
the bottlenecks are clear.

That's what I'm presently doing, and that's why the Bourne shell script
conf/KERTEX_T.post-install generates everything (even though compiled
fonts, for example, are portable, and I could simply provide the result
for download): so that a user can see "how it is done"---even if nobody
cares.

Tracking the current acrobatics between PostScript fonts (or others),
encodings, TeX macros and so on is puzzling, to put it charitably.
And trying to understand how it is supposed to work by scrutinizing
the current state is definitely not the best path (and I suspect that
this is really the "wizardry": deceiving by complexity to hide a simple
reality).

I seem to recall reading (by a cursory look) about subfonts in Plan9,
precisely for fonts not describing the whole unicode range.

Modifying TeX to accept utf as input (I mean the compiler/interpreter by
itself; not macros), converting to rune and then using 16 bits à la math
mode to switch inside a font family to the "correct" 256 vector is
something that, for a first step, seems to me both reasonable and
simple.
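
A minimal Python sketch of that idea, under the assumption that the
high byte of a 16-bit rune picks the 256-entry vector (subfont) and the
low byte picks the slot inside it. The function names are invented for
this illustration and nothing here exists in TeX or kerTeX.

```python
def rune_to_subfont(rune):
    """Split a 16-bit rune into (vector number, slot within the vector)."""
    if not 0 <= rune <= 0xFFFF:
        raise ValueError("only 16-bit runes are handled in this sketch")
    return rune >> 8, rune & 0xFF


def plan_text(text):
    """Return the (vector, slot) pair for every character of text."""
    return [rune_to_subfont(ord(ch)) for ch in text]
```

With this split, ASCII and latin1 runes all land in vector 0, so an
existing 8-bit font could serve unchanged as the first member of the
family.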

I think I have even a solution to handle right to left, top to bottom,
bottom to top (??), and mixing these inside a page... but this is not on
the top of the stack (and TeX by itself would be lightly touched; the
core will be put in the font format and the dvi drivers).

One of the best "features" of the TeX package is... METAFONT. For a
mathematician, a philologist, etc., the ability to create signs is, in
my never humble opinion, a must.

And for example, D.E. Knuth's math fonts have both +/- and -/+ glyphs.
This is where you can see the mathematician's touch. I have old (19th
century) major math textbooks where these are used to express vertical
alternatives in a "linear" equation, and the order matters.

troff(1) combined with eqn(1) etc. already gives a superb formatting
medium. But it does not provide for designing fonts... So the TeX
system has to be adapted.
-- 
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C


* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-16 17:37   ` tlaronde
@ 2011-06-16 18:43     ` Bakul Shah
  2011-06-16 19:20       ` tlaronde
  0 siblings, 1 reply; 52+ messages in thread
From: Bakul Shah @ 2011-06-16 18:43 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> Modifying TeX to accept utf as input (I mean the compiler/interpreter by
> itself; not macros), converting to rune and then using 16 bits à la math
> mode to switch inside a font family to the "correct" 256 vector is
> something that, for a first step, seems to me both reasonable and
> simple.

What about XeTeX?  It is a merge of TeX with Unicode and
modern font tech.  Works with OpenType Fonts.  Included in TeX
Live among others.  I can use XeTeX with TeXShop & TeXWorks. I
am just a user so don't know how hard it would be to port but
seems like it is widely used now.

See
    http://scripts.sil.org/xetex
Some more examples @
    http://nitens.org/taraborelli/latex




* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-16 18:43     ` Bakul Shah
@ 2011-06-16 19:20       ` tlaronde
  0 siblings, 0 replies; 52+ messages in thread
From: tlaronde @ 2011-06-16 19:20 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Jun 16, 2011 at 11:43:28AM -0700, Bakul Shah wrote:
> > Modifying TeX to accept utf as input (I mean the compiler/interpreter by
> > itself; not macros), converting to rune and then using 16 bits à la math
> > mode to switch inside a font family to the "correct" 256 vector is
> > something that, for a first step, seems to me both reasonable and
> > simple.
> 
> What about XeTeX?  It is a merge of TeX with Unicode and
> modern font tech.  Works with OpenType Fonts.

I will give it a look. The decision will depend on:

	1) The licence: if it is GPL, I will not touch it even with a long
	spoon...
	2) If the core modifications are separated enough from the kpathsea
	and so on dance.
	3) The nature of the solution.

There is another program I have to look at: John Hobby has given
me information about an evolution of his MetaPost. It is original
AT&T and LGPL, so for the licence it's OK. For the modifications, I will
have to look.

So, just to say that I'm not discarding existing solutions on principle.
If XeTeX answers the problem correctly---to my taste---why not?

But since there is now almost only the lost needle in kerTeX, I will not
add back hay.

And just for the record once more: LaTeX can work with kerTeX; so even
the unicode.sty hack Russ Cox wrote about can work with kerTeX. The user
has all the rope he can dream of...
-- 
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C





* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-16 12:17 [9fans] [RFC] fonts and unicode/utf [TeX] tlaronde
  2011-06-16 16:49 ` Russ Cox
@ 2011-06-16 17:43 ` tlaronde
  2011-06-17 14:18 ` Joel C. Salomon
  2011-06-19 14:07 ` erik quanstrom
  3 siblings, 0 replies; 52+ messages in thread
From: tlaronde @ 2011-06-16 17:43 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Jun 16, 2011 at 02:17:00PM +0200, tlaronde@polynum.com wrote:
>[...]
> Second question: I'm trying to find if, in western languages, including
> ligatures for ae and oe would be good since it is generally needed (one
> can forbid ligatures by inserting "{}" between the letters), or if it's
> not correct to set this by default for fonts (having the glyphs) since
> some western languages use generally the ae or oe combinations without
> knowing or expecting the substitution.

Answering myself: the "co" prefix---coexist etc.---implies
that the "oe" ligature would be a mistake.

And if "ae" in accented French seems to be correct, it is not the case
in English, for example: "aerial" would definitely not benefit from the
ligature.

So the two shall not be ligatures.
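
The problem can be reproduced with a few lines of Python that stand in
for an unconditional font-level ligature. This is only a sketch of the
failure mode: naive_ligatures is a made-up helper, not how TeX applies
ligature tables.

```python
def naive_ligatures(text):
    """Blindly replace every 'oe' and 'ae' pair, as a default ligature would."""
    return text.replace("oe", "\u0153").replace("ae", "\u00e6")


# The French word benefits; the other two are damaged.
for word in ("coeur", "coexist", "aerial"):
    print(word, "->", naive_ligatures(word))
```

The substitution cannot tell "coeur" from "coexist", which is why the
decision has to be left to the author of the text rather than to the
font.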
--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C





* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-16 12:17 [9fans] [RFC] fonts and unicode/utf [TeX] tlaronde
  2011-06-16 16:49 ` Russ Cox
  2011-06-16 17:43 ` tlaronde
@ 2011-06-17 14:18 ` Joel C. Salomon
  2011-06-17 15:37   ` tlaronde
  2011-06-19 14:07 ` erik quanstrom
  3 siblings, 1 reply; 52+ messages in thread
From: Joel C. Salomon @ 2011-06-17 14:18 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Jun 16, 2011 at 8:17 AM,  <tlaronde@polynum.com> wrote:
> Second question: I'm trying to find if, in western languages, including
> ligatures for ae and oe would be good since it is generally needed (one
> can forbid ligatures by inserting "{}" between the letters), or if it's
> not correct to set this by default for fonts (having the glyphs) since
> some western languages use generally the ae or oe combinations without
> knowing or expecting the substitution.

Unicode has part of the answer:  Some ligatures (e.g., U+FB03 “ﬃ”) are
“Presentation Forms”, i.e., not “real” characters but alternate visual
presentations of the constituent characters.  Those are (usually)
OK to generate automatically.  But “ae”≠“æ” and “oe”≠“œ”, &c.—please
don’t make these substitutions.

Unicode doesn’t have many of these Presentation Forms, and only
includes them for round-trip compatibility with other code sets that
have them.  New ligatures of this sort are *very* unlikely to be
added.
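
The distinction can be checked with Python's standard unicodedata
module: compatibility normalization (NFKD) expands the presentation
form U+FB03 into its component letters, while U+00E6 and U+0153 have no
decomposition and come back unchanged. The helper name below is just
for this sketch.

```python
import unicodedata


def compat_expansion(ch):
    """Return the compatibility decomposition (NFKD) of a character."""
    return unicodedata.normalize("NFKD", ch)


for ch in ("\ufb03", "\u00e6", "\u0153"):
    print("U+%04X" % ord(ch), unicodedata.name(ch), "->", compat_expansion(ch))
```

In other words, the character database itself records the ffi ligature
as a mere presentation of "ffi", and records æ and œ as letters in
their own right.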

A better solution, as Russ suggested, is OpenType.  In that format, the
font designer can include common (e.g., “ff”), historical (e.g., “st”
& “ſs”), and even ad-hoc ligatures.  (There are “fun” fonts with “LOL”
ligatures, for example.)  Different sets of ligatures can be
enabled/disabled by selecting combinations of OTF features.

At which point you’ve reinvented XɘTeX.

—Joel


* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-17 14:18 ` Joel C. Salomon
@ 2011-06-17 15:37   ` tlaronde
  2011-06-17 18:07     ` Joel C. Salomon
  2011-06-19 14:21     ` erik quanstrom
  0 siblings, 2 replies; 52+ messages in thread
From: tlaronde @ 2011-06-17 15:37 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Fri, Jun 17, 2011 at 10:18:20AM -0400, Joel C. Salomon wrote:
>[...]
> OK to generate automatically.  But “ae”≠“æ” and “oe”≠“œ”, &c.---please
> don’t make these substitutions

I have already found (and answered) that "oe" cannot be a ligature
since, even in French, the "oe" sequence appears in words that do not
want the substitution ("coefficient"), and "ae" is rare enough and not
even a regular rule (there are Greek words [in French] that do not want
it; and even from Latin, it is not regular).

>[...] 
> At which point you’ve reinvented XɘTeX.

I have had a look at it. I don't want to start a discussion about
Unicode, since, in addition to the "characters" (alphabetic,
syllabic, ideographic; but no hieroglyphs or Linear B, so it's not
complete ;) there are formatting or rendering commands (the ligature fi
is not a character; but the XeTeX FAQ says the user has to insert
the Unicode codepoint for it directly since there is no ligature),
that I don't think should be there (only the historical ASCII controls
should be there; others should be undefined).

But for XeTeX and Plan9 there is a special point: XeTeX uses some C++.
As I have answered privately to someone, it is not an absolute
obstacle---the files are not very numerous, so a C flavour could be
achieved.

But if people start throwing XeTeX at my legs, I will start crying
for a C++ compiler on Plan9...

No, no: I don't make threats!
-- 
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C


* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-17 15:37   ` tlaronde
@ 2011-06-17 18:07     ` Joel C. Salomon
  2011-06-17 18:37       ` tlaronde
  2011-06-19 14:21     ` erik quanstrom
  1 sibling, 1 reply; 52+ messages in thread
From: Joel C. Salomon @ 2011-06-17 18:07 UTC (permalink / raw)
  To: 9fans

On 06/17/2011 11:37 AM, tlaronde@polynum.com wrote:
> On Fri, Jun 17, 2011 at 10:18:20AM -0400, Joel C. Salomon wrote:
>> At which point you've reinvented XeTeX.
>
> I've given a look at it. I don't want to start a discussion about
> Unicode, since, supplementary to the "characters"
<snip>
> there are formatting commands or rendering
<snip>
> that I don't think should be there (only the historical ASCII controls
> should be there; others should be undefined).

Ignore 'em.  Or map them to TeX control sequences.

> but no hieroglyphs or Linear B, so it's not complete ;)

The fonts may be lacking, but Hieroglyphs & Linear B *are* in Unicode;
see <alanwood.net/unicode/egyptian-hieroglyphs.html> and
<alanwood.net/unicode/linear_b_syllabary.html>.

>                                                        (the ligature fi
> is not a character; but in the XeTeX FAQ it is said user has to insert
> directly the Unicode for this codepoint since there is no ligature),

That's true for TeX's "--" and "---" pseudo-ligatures; the XeTeX way is
to insert the Unicode en- & em-dashes, or to use the "tex-text" font
mapping.  But for "fi" &c., or the more exotic ones, XeTeX will use
whatever ligatures the font's designer has put into the OTF file.

(Also be aware that the XeTeX FAQ on the SIL site is *seriously*
out-of-date.)

> But for XeTeX and Plan9 there is a special point: XeTeX uses some C++.
> As I have answered privately to someone, it is not an absolute
> obstacle---the files are not very numerous so a C flavour could be
> achieved.
>
> But if people start throwing me XeTeX in the legs, I will start crying
> for a C++ compiler on Plan9...

A C version of the PDF library XeTeX uses to translate its "extended"
XDVI format to PDF would be interesting.  C++, though....  No, I'll not
reopen that can of worms today.

--Joel


* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-17 18:07     ` Joel C. Salomon
@ 2011-06-17 18:37       ` tlaronde
  0 siblings, 0 replies; 52+ messages in thread
From: tlaronde @ 2011-06-17 18:37 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Fri, Jun 17, 2011 at 02:07:42PM -0400, Joel C. Salomon wrote:
> On 06/17/2011 11:37 AM, tlaronde@polynum.com wrote:
>[...] 
> > but no hieroglyphs or Linear B, so it's not complete ;)
> 
> The fonts may be lacking, but Hieroglyphs & Linear B *are* in Unicode;
> see <alanwood.net/unicode/egyptian-hieroglyphs.html> and
> <alanwood.net/unicode/linear_b_syllabary.html>.

I stand corrected (and this is why I think METAFONT is great: the
ability to create "easily" what would be too expensive due to small
audience). I do think that if Hilbert had had METAFONT to give to one of
his students at Göttingen, he would not have plagued mathematics with
gothic...

>[...] 
> 
> A C version of the PDF library XeTeX uses to translate its "extended"
> XDVI format to PDF would be interesting.  C++, though....  No, I'll not
> reopen that can of worms today.

I'm definitely not a C++ fan, so it's a pure threat.

For now (I mean kerTeX 1.0) I will go as far as I can with 8-bit
TeX and simplicity, and try to gather enough knowledge about fonts so
that I can decide afterwards where to invest my limited amount of time.

Not in the thread about mouse vs keyboard, I guess.
-- 
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C


* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-17 15:37   ` tlaronde
  2011-06-17 18:07     ` Joel C. Salomon
@ 2011-06-19 14:21     ` erik quanstrom
  1 sibling, 0 replies; 52+ messages in thread
From: erik quanstrom @ 2011-06-19 14:21 UTC (permalink / raw)
  To: 9fans

> I've given a look at it. I don't want to start a discussion about
> Unicode, since, supplementary to the "characters" (alphabetical,
> syllabics, ideographics; but no hieroglyphs or Linear B, so it's not
> complete ;)

not central to my point, but this is not correct

; grep -i 'linear b syllable b008' /lib/unicode
010000	linear b syllable b008 a
; grep -i 'egyptian hieroglyph a001' /lib/unicode
013000	egyptian hieroglyph a001
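
for readers without /lib/unicode at hand, roughly the same lookup can
be sketched with python's standard unicodedata module (the describe
helper is only a name chosen for this example, and the result depends
on the unicode version bundled with the interpreter):

```python
import unicodedata


def describe(codepoint):
    """Return the lower-cased Unicode character name for a codepoint."""
    return unicodedata.name(chr(codepoint)).lower()


for cp in (0x10000, 0x13000):
    print("%06x\t%s" % (cp, describe(cp)))
```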

> there are formatting commands or rendering (the ligature fi
> is not a character; but in the XeTeX FAQ it is said user has to insert
> directly the Unicode for this codepoint since there is no ligature),
> that I don't think should be there (only the historical ASCII controls
> should be there; others should be undefined).

the general idea behind unicode is that it is a sequenced collection
of codepoints, not characters.  this implies that formatting differences
such as ligatures that have no semantic component (typesetting artifacts,
if you will) shouldn't be encoded in the character set.  i realize there are
some exceptions to this, but imho, the unicode committee are not perfect.

it's easy enough to escape non-codepoints or encode them in one of the
private unicode ranges.

- erik




* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-16 12:17 [9fans] [RFC] fonts and unicode/utf [TeX] tlaronde
                   ` (2 preceding siblings ...)
  2011-06-17 14:18 ` Joel C. Salomon
@ 2011-06-19 14:07 ` erik quanstrom
  2011-06-19 16:34   ` tlaronde
  3 siblings, 1 reply; 52+ messages in thread
From: erik quanstrom @ 2011-06-19 14:07 UTC (permalink / raw)
  To: 9fans

> I have so extended the encoding used to generate the virtual fonts so
> that for the ASCII range it matches the Computer Modern expectations
> (hence it is totally compatible with plain TeX), and so that the latin1
> encoding used as input will give the correct glyphs. And the cryptic
> names will be gone, because loading the (virtual) font will be defined
> by calling latin1/the_font.
>
> Why latin1? Not only because, as a French, I use it, but because it is
> compatible with unicode.

perhaps you mean the subset of unicode corresponding to the codepoints
encoded by latin1 encoded in utf-8.  the system character set is utf-8,
and latin1 is not a compatible encoding.  utf-8 is assumed everywhere except
when the data is inbound, and explicitly tagged as having a different
character set.  programs like upas/fs and webfs do the conversion at the
border.
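
a short python sketch of the difference between sharing codepoints and
sharing an encoding (encodings_of is just a name for this example):

```python
def encodings_of(ch):
    """Return (codepoint, latin1 bytes, utf-8 bytes) for one character."""
    return ord(ch), ch.encode("latin-1"), ch.encode("utf-8")


for ch in ("a", "\u00e9"):
    cp, l1, u8 = encodings_of(ch)
    print("U+%04X latin1=%s utf-8=%s" % (cp, l1.hex(), u8.hex()))
```

for ascii the two byte strings are identical; for é the codepoint is
e9 in both character sets, but latin1 stores it as the single byte e9
while utf-8 stores it as the two bytes c3 a9.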

there's really no reason for latin1 in 2011.

- erik


* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-19 14:07 ` erik quanstrom
@ 2011-06-19 16:34   ` tlaronde
  2011-06-19 18:01     ` tlaronde
  2011-06-19 22:38     ` erik quanstrom
  0 siblings, 2 replies; 52+ messages in thread
From: tlaronde @ 2011-06-19 16:34 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Sun, Jun 19, 2011 at 10:07:19AM -0400, erik quanstrom wrote:
> >
> > Why latin1? Not only because, as a French, I use it, but because it is
> > compatible with unicode.
>
> perhaps you mean the subset of unicode corresponding to the codepoints
> encoded by latin1 encoded in utf-8.  the system character set is utf-8,
> and latin1 is not a compatible encoding.  utf-8 is assumed everywhere except
> when the data is inbound, and explicitly tagged as having a different
> character set.  programs like upas/fs and webfs do the conversion at the
> border.
>
> there's really no reason for latin1 in 2011.

There is a reason here: for now, TeX is 8 bits and that's all. So, if
allowing the use of at least all of the 8 bits means something, it shall
be latin1. This does not prevent anybody from using whatever character
set they want; but as a default, and _for now_, it's better than nothing,
and significantly better than some random character set that no tcs(1)
will know how to deal with.

Accepting utf-8 directly as input will not be addressed for the 1.0
release of kerTeX.

And if people think that I'm too slow: be my guest. I claim it is easier
to tackle the task with kerTeX than with TeXlive.
--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C





* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-19 16:34   ` tlaronde
@ 2011-06-19 18:01     ` tlaronde
  2011-06-19 22:38     ` erik quanstrom
  1 sibling, 0 replies; 52+ messages in thread
From: tlaronde @ 2011-06-19 18:01 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Sun, Jun 19, 2011 at 06:34:58PM +0200, tlaronde wrote:
>
> There is a reason here: for now, TeX is 8 bits and that's all. So, if
> allowing to use, at least, all of the 8 bits means something, it shall
> be latin1.

To be more accurate: TeX is 8 bits, and wants ASCII for the lower half
of the range.

The Computer Modern fonts are ASCII (plus, in the control positions,
ligatures and so on).

The PostScript standard fonts all have latin1.

Hence, by default, the fonts built from the PostScript core fonts shall
have a latin1 encoding, since this is the best that can be done with
the glyphs in the font on one side and the 8-bit capabilities of TeX
on the other.

Other encodings for TeX are possible too (within a 256-glyph limit), but
by default I will provide latin1 for the fonts built from the Adobe AFM.
--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C


* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-19 16:34   ` tlaronde
  2011-06-19 18:01     ` tlaronde
@ 2011-06-19 22:38     ` erik quanstrom
  2011-06-20 11:18       ` tlaronde
  1 sibling, 1 reply; 52+ messages in thread
From: erik quanstrom @ 2011-06-19 22:38 UTC (permalink / raw)
  To: 9fans

> > perhaps you mean the subset of unicode corresponding to the codepoints
> > encoded by latin1 encoded in utf-8.  the system character set is utf-8,
> > and latin1 is not a compatible encoding.  utf-8 is assumed everywhere except
> > when the data is inbound, and explicitly tagged as having a different
> > character set.  programs like upas/fs and webfs do the conversion at the
> > border.
> >
> > there's really no reason for latin1 in 2011.
>
> There is a reason here: for now, TeX is 8 bits and that's all. So, if
> allowing to use, at least, all of the 8 bits means something, it shall
> be latin1. This does not prevent somebody to use whatever character set
> one wants; but as a default, and _for now_, it's better than nothing;
> and significantly better than some random character set that no tcs(1)
> will know how to deal with.
>
> To accept directly utf-8 as input will not be addressed for the 1.0
> release of kerTeX.

i think you've missed my point.  latin1 is an encoding,
utf-8 is an encoding.  if tex is so backwards that it can't
accept a character wider than 8 bits, then it would be reasonable,
so as not to be different from the rest of the plan 9 system, to
read utf-8 runes (i.e. not latin1) in and then reject runes
with a codepoint above 255.

then, if tex is fixed to accept larger codepoints, one can
remove this limit.  if latin1 is used, then it cannot be retrofitted
in a way that is compatible with older tex input.

nobody cares what font encoding tex uses internally.  the
real issue is the input to tex.  i sure would be very reluctant
to load anything on my system that will mangle utf-8, especially
for codepoints <256.  that's the path to wchar_t.

- erik


* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-19 22:38     ` erik quanstrom
@ 2011-06-20 11:18       ` tlaronde
  2011-06-20 21:53         ` erik quanstrom
  0 siblings, 1 reply; 52+ messages in thread
From: tlaronde @ 2011-06-20 11:18 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Sun, Jun 19, 2011 at 06:38:59PM -0400, erik quanstrom wrote:
>
> nobody cares what font encoding tex uses internally.  the
> real issue is the input to tex.  i sure would be very reluctant
> to load anything on my system that will mangle utf-8, especially
> for codepoints <256.  that's the path to wchar_t.

That TeX on Plan9 should accept utf-8 is not a question. But TeX has a
present state, and kerTeX has a present state.

For now, TeX only chews bytes (octets); there are apparently some
acrobatics with a LaTeX macro set trying to accommodate utf in input
(according to Russ Cox, if I understood correctly what he wrote).

One can use TeX with utf as long as one uses only ASCII (by
design/definition of utf). That is, one can use TeX in interactive
mode on Plan9 in conformance with the TeXbook, since the TeXbook uses
ASCII, even to create non-ASCII glyphs (accents via escape
sequences).

TeX will do undesired things if it chews non-ASCII encoded in utf
(and this starts already with the Unicode latin1 range).

BUT, since the "codepoints" described in the latin1 subrange are present
(except for /dcroat and /Dcroat) in the 229-glyph PostScript core
fonts, and I can create fonts (tfm) for TeX covering the "ASCII/latin1"
characters, this allows people using this wider (even if limited)
range to enter the text on Plan9, to use tcs(1) to convert this range
to latin1, i.e. an 8-bit encoding, and to feed this file
(non-interactively) to TeX. This adds, for now (and for systems other
than Plan9 that still use chars == octets), some supplementary ability
without removing anything.
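
Both halves of this can be sketched in Python: what an 8-bit TeX with
latin1 fonts would make of raw utf input, and what the conversion step
changes. The two function names are invented for the illustration;
convert_like_tcs only mimics the effect expected from tcs(1) and does
not call it.

```python
def as_seen_by_8bit_tex(data):
    """Read each input octet as one latin1 character, as 8-bit TeX would."""
    return data.decode("latin-1")


def convert_like_tcs(utf8_data):
    """Convert UTF-8 bytes to latin1 bytes; fail outside the latin1 range."""
    return utf8_data.decode("utf-8").encode("latin-1")


raw = "\u00e9".encode("utf-8")                   # the bytes c3 a9
print(as_seen_by_8bit_tex(raw))                   # two wrong glyphs
print(as_seen_by_8bit_tex(convert_like_tcs(raw))) # the intended one
```

Without the conversion the two bytes of the utf sequence select two
unrelated latin1 glyphs; after it, the single octet e9 selects the
accented letter directly.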

I have to make a choice. YES, "latin1" is no less special than
non-ASCII in utf; but the glyphs are there (in the PS core fonts), and
it has the same values as Unicode, so it seems more natural to choose
this than any other _for now_.

Paris was not built in a day. Neither was kerTeX.
--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C


* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-20 11:18       ` tlaronde
@ 2011-06-20 21:53         ` erik quanstrom
  2011-06-21 10:56           ` tlaronde
  0 siblings, 1 reply; 52+ messages in thread
From: erik quanstrom @ 2011-06-20 21:53 UTC (permalink / raw)
  To: 9fans

On Mon Jun 20 07:17:16 EDT 2011, tlaronde@polynum.com wrote:
> On Sun, Jun 19, 2011 at 06:38:59PM -0400, erik quanstrom wrote:
> >
> > nobody cares what font encoding tex uses internally.  the
> > real issue is the input to tex.  i sure would be very reluctant
> > to load anything on my system that will mangle utf-8, especially
> > for codepoints <256.  that's the path to wchar_t.
>
> That TeX on Plan9 should accept utf-8 is not a question. But TeX has a
> present state, and kerTeX has a present state.

i'm not sure what the hard part is.  just front the normal input
function with one that calls chartorune and rejects anything above
codepoint 255.  that can't be more than 10 lines of code.

that way there is no possibility of latin1 nonsense breaking previously-
functional .tex files, and you don't have to change any assumptions
in the code.  (it might be better later on to operate directly on utf-8
rather than some sort of wide character format like a rune, but that
can't break existing .tex files.)
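
A minimal sketch of such a front end, in portable C rather than Plan 9 C.
It assumes a small hand-written decoder standing in for chartorune(), and
the names decode_small and utf8_to_octets are invented for illustration;
neither is taken from kerTeX or libweb.

```c
#include <stddef.h>

/*
 * Decode one UTF-8 sequence from s (n bytes available) into *cp.
 * Returns the number of bytes consumed, or 0 if the sequence is
 * malformed or longer than two bytes.  Stand-in for chartorune():
 * only the 1- and 2-byte forms are accepted, since nothing above
 * codepoint 255 is wanted here anyway.
 */
static size_t
decode_small(const unsigned char *s, size_t n, unsigned *cp)
{
	if (n == 0)
		return 0;
	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	}
	/* 110xxxxx 10xxxxxx; lead bytes 0xC0 and 0xC1 would be overlong */
	if (s[0] >= 0xC2 && s[0] <= 0xDF && n >= 2 && (s[1] & 0xC0) == 0x80) {
		*cp = ((unsigned)(s[0] & 0x1F) << 6) | (unsigned)(s[1] & 0x3F);
		return 2;
	}
	return 0;
}

/*
 * Convert UTF-8 in[0..inlen) to one octet per character in out,
 * which must have room for inlen bytes.  Returns the number of
 * octets written, or -1 if the input is malformed or contains a
 * codepoint above 255.
 */
long
utf8_to_octets(const unsigned char *in, size_t inlen, unsigned char *out)
{
	size_t i = 0, k;
	long o = 0;
	unsigned cp;

	while (i < inlen) {
		k = decode_small(in + i, inlen - i, &cp);
		if (k == 0 || cp > 0xFF)
			return -1;
		out[o++] = (unsigned char)cp;
		i += k;
	}
	return o;
}
```

The rest of the program keeps seeing one octet per character; input
containing, say, a Greek letter is refused instead of being mangled.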

- erik


* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-20 21:53         ` erik quanstrom
@ 2011-06-21 10:56           ` tlaronde
  2011-06-24 23:05             ` Mauricio CA
  0 siblings, 1 reply; 52+ messages in thread
From: tlaronde @ 2011-06-21 10:56 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, Jun 20, 2011 at 05:53:25PM -0400, erik quanstrom wrote:
>
> i'm not sure what the hard part is.  just front the normal input
> function with one that calls chartorune and rejects anything above
> codepoint 255.  that can't be more than 10 lines of code.
>
> that way there is no possibility of latin1 nonsense breaking previously-
> functional .tex files, and you don't have to change any assumptions
> in the code.  (it might be better later on to operate directly on utf-8
> rather than some sort of wide character format like a rune, but that
> can't break existing .tex files.)

Yes, "casting" to byte can do and this is almost trivial since the input
is buffered and handled via libweb (in kerTeX). But this will disallow
use of TeX for non ASCII, non latin1... It seems to me better to
document, and let user convert his files via tcs(1) to feed TeX.

An alternative solution would be to introduce some TEX_ENCODING env
variable to let the input/output layer of TeX do the conversion. But on
Plan9 this seems to me simply ugly... reintroducing through the window
what was thrown out by the door...

Note that at the moment I do not change _anything_ in the TeX
code. The "latin1" is just the "encoding" of the fonts derived from the
PS core ones (the same can be done with Computer Modern via virtual
fonts, to allow the direct use of accented letters).
--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C





* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-21 10:56           ` tlaronde
@ 2011-06-24 23:05             ` Mauricio CA
  2011-06-25  6:50               ` tlaronde
  0 siblings, 1 reply; 52+ messages in thread
From: Mauricio CA @ 2011-06-24 23:05 UTC (permalink / raw)
  To: 9fans

>> i'm not sure what the hard part is.  just front the normal input function
>> with one that calls chartorune and rejects anything above codepoint 255.
>> that can't be more than 10 lines of code. [...]

> Yes, "casting" to byte can do and this is almost trivial since the input
> is buffered and handled via libweb (in kerTeX). But this will disallow
> use of TeX for non ASCII, non latin1... It seems to me better to document,
> and let user convert his files via tcs(1) to feed TeX. [...]

I found this text in TeX by Topic[1] that seems to support Quanstrom's
idea. It describes how TeX reads input, and says it's done one line at
a time (where it follows what the system defines as lines) and then for
each line it first removes trailing spaces; then (possibly) adds a return
to the end of the line; and then, since "computers may also differ in
the character encoding (the most common schemes are ASCII and EBCDIC),
so TeX converts the characters that are read from the file to its own
character codes. These codes are then used exclusively [...]"

So, it seems it's expected that encoding specific transformation is
applied to TeX input. Removing trailing spaces, at least, can't be done
without understanding utf-8.

(I warn, though, that I have no expertise in this subject.)

Best, Maurício

[1] http://eijkhout.net/texbytopic/texbytopic.html. I got a ready to
use PDF at http://tex.loria.fr/general/texbytopic.pdf. What I describe
is found at section 2.2.


* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-24 23:05             ` Mauricio CA
@ 2011-06-25  6:50               ` tlaronde
  2011-06-25 12:19                 ` erik quanstrom
  0 siblings, 1 reply; 52+ messages in thread
From: tlaronde @ 2011-06-25  6:50 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Fri, Jun 24, 2011 at 11:05:23PM +0000, Mauricio CA wrote:
>
> I found this text in TeX by Topic[1] that seems to support Quanstrom's
> idea. It describes how TeX reads input, and says it's done one line at
> a time (where it follows what the system defines as lines) and then for
> each line it first removes trailing spaces; then (possibly) adds a return
> to the end of the line; and then, since "computers may also differ in
> the character encoding (the most common schemes are ASCII and EBCDIC),
> so TeX converts the characters that are read from the file to its own
> character codes. These codes are then used exclusively [...]"

This is simply an extract of what is explained, partly in the
TeXbook and partly in TeX: The Program, two of the five volumes of
D.E. Knuth's series on computer typesetting.

The initial exchange between characters is, shall we say, on the
"system" level. But it is, in the code, limited to the ASCII (7 bits)
range (and even if virtex(1) is almost the bare metal, it can be only
bootstrapped by ASCII macro commands); and furthermore, TeX is "8
bits clean", that is only using, for "text", 8 bits for input...
and as CID for fonts.

The exchange is defined at compilation time, but can also be remapped
via macro-commands.

So casting utf to 8 bits:
	- is useless for ASCII (by definition);
	- will work only for latin1 input.
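
The two cases can be told apart from the UTF-8 lead bytes alone. A small
sketch, assuming well-formed UTF-8 input; the enum and the function name
classify_utf8 are made up for illustration:

```c
#include <stddef.h>

enum { TEXT_ASCII, TEXT_LATIN1, TEXT_WIDER };

/*
 * Classify well-formed UTF-8 text:
 *   TEXT_ASCII   every character is below 128 (already "8 bits");
 *   TEXT_LATIN1  some characters are in 128-255, none above
 *                (a cast or tcs(1) to latin1 loses nothing);
 *   TEXT_WIDER   at least one character is above 255
 *                (no single-octet cast can represent it).
 */
int
classify_utf8(const unsigned char *s, size_t n)
{
	int kind = TEXT_ASCII;
	size_t i;

	for (i = 0; i < n; i++) {
		if (s[i] < 0x80 || (s[i] & 0xC0) == 0x80)
			continue;	/* ASCII byte or continuation byte */
		/* lead bytes 0xC2 and 0xC3 cover exactly U+0080..U+00FF */
		if (s[i] == 0xC2 || s[i] == 0xC3)
			kind = TEXT_LATIN1;
		else
			return TEXT_WIDER;
	}
	return kind;
}
```

A file classified as TEXT_ASCII can go to TeX untouched; only the
TEXT_LATIN1 case gains anything from the conversion.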

Extending TeX to wydes (runes) will be relatively easy superficially for
input and output (because D.E.K. has organized the code so that these
parts can be easily changed), but will not work with TeX fonts: all the
fonts machinery has to be changed.

Furthermore, this will not work, as is, with all the Unicode
range, since TeX is "left-to-right" (but what is fundamental is that,
all in all, with the exception perhaps of Frege's ideography, all
languages seem to be linear; so a switch in TeX for width and height of
the boxes computed, and hints for dvi drivers to flip/mirror can achieve
the task). So this also is to be adapted (hence the suggestion for
XeTeX).

So for now, TeX is kept 8 bits. I make no assumption about the encoding:
the user has to feed an "8 bits encoding" to TeX. ASCII users have
nothing to change; others, if they want to use another 8 bits encoding
directly (e.g. accented letters in their latin1 codes), have to tcs(1)
the file first.

What I will change is only on the fonts available.

For historical reasons, the fonts derived from the PostScript standard
ones were in "EC" encoding, aka Cork, which maps mainly latin1
characters into the 128-255 range but not at their latin1 positions
(because it was defined in 1990).

A macro set shall install its own expected fonts.

KerTeX shall be usable to its full extent (relative to its present
state) with the data KerTeX provides, here fonts. And to avoid providing
non D.E.K. fonts with the same (cryptic) names as the ones commonly found
in other TeX distributions, the kerTeX ones will use a Unix feature,
the directory hierarchy, to express the dependencies: not an initial
letter for the font foundry, but a subdirectory: adobe/ etc.

This does not prevent anyone from generating other flavours, especially
because by looking at the dir layout and at the conf/KERTEX.post-install
Bourne shell script, everything is shown and explained.
--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C


* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-25  6:50               ` tlaronde
@ 2011-06-25 12:19                 ` erik quanstrom
  2011-06-25 15:03                   ` tlaronde
  0 siblings, 1 reply; 52+ messages in thread
From: erik quanstrom @ 2011-06-25 12:19 UTC (permalink / raw)
  To: 9fans

> So for now, TeX is kept 8 bits. I make no assumption for the encoding
> (and user has to feed "8 bits encoding" to TeX; ASCII users have nothing
> to change; others, if they want to use directly another 8 bits encoding
> (ex.: directly accented letters latin1 code) have to tcs(1) the file
> first.

i am not clear on what "the file" means in this context.  do you mean
the .tex input file or font files?

- erik




* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-25 12:19                 ` erik quanstrom
@ 2011-06-25 15:03                   ` tlaronde
  2011-06-25 15:11                     ` erik quanstrom
  2011-06-25 16:34                     ` Mauricio CA
  0 siblings, 2 replies; 52+ messages in thread
From: tlaronde @ 2011-06-25 15:03 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Sat, Jun 25, 2011 at 08:19:40AM -0400, erik quanstrom wrote:
> > So for now, TeX is kept 8 bits. I make no assumption for the encoding
> > (and user has to feed "8 bits encoding" to TeX; ASCII users have nothing
> > to change; others, if they want to use directly another 8 bits encoding
> > (ex.: directly accented letters latin1 code) have to tcs(1) the file
> > first.
>
> i am not clear on what "the file" means in this context.  do you mean
> the .tex input file or font files?

I mean the .tex file. The font files as seen by TeX are only the metrics
tfm, and they are binaries.

Since TeX is "8 bits", the .tex file must have characters encoded in 8
bits, with the non-control positions of the first half conforming to
ASCII, possibly after a mapping defined at compile time (it can be
remapped at user level, but with apparently "strange" macro commands).
ASCII is used for literals but also for the primitives.
--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C





* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-25 15:03                   ` tlaronde
@ 2011-06-25 15:11                     ` erik quanstrom
  2011-06-25 16:33                       ` tlaronde
  2011-06-25 16:34                     ` Mauricio CA
  1 sibling, 1 reply; 52+ messages in thread
From: erik quanstrom @ 2011-06-25 15:11 UTC (permalink / raw)
  To: 9fans

On Sat Jun 25 11:01:38 EDT 2011, tlaronde@polynum.com wrote:
> On Sat, Jun 25, 2011 at 08:19:40AM -0400, erik quanstrom wrote:
> > > So for now, TeX is kept 8 bits. I make no assumption for the encoding
> > > (and user has to feed "8 bits encoding" to TeX; ASCII users have nothing
> > > to change; others, if they want to use directly another 8 bits encoding
> > > (ex.: directly accented letters latin1 code) have to tcs(1) the file
> > > first.
> >
> > i am not clear on what "the file" means in this context.  do you mean
> > the .tex input file or font files?
>
> I mean the .tex file. The font files as seen by TeX are only the metrics
> tfm, and they are binaries.

so are you planning on hiding this conversion within the tex
executable or some shell script fronting the executable?
that would work.  but letting ancient and deprecated
latin1 escape into editors, &c. would be a mistake.

- erik




* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-25 15:11                     ` erik quanstrom
@ 2011-06-25 16:33                       ` tlaronde
  0 siblings, 0 replies; 52+ messages in thread
From: tlaronde @ 2011-06-25 16:33 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Sat, Jun 25, 2011 at 11:11:50AM -0400, erik quanstrom wrote:
> On Sat Jun 25 11:01:38 EDT 2011, tlaronde@polynum.com wrote:
> >
> > I mean the .tex file. The font files as seen by TeX are only the metrics
> > tfm, and they are binaries.
>
> so are you planning on hiding this conversion within the tex
> executable or some shell script fronting the executable?
> that would work.  but letting ancient and deprecated
> latin1 escape into editors, &c. would be a mistake.

For the moment I will "hide" strictly nothing in the compiled program.
The only modification is an external choice: the fonts built from
Adobe PostScript Times-Roman etc., which have only 256 positions,
will have in the 128-255 range an "encoding" corresponding to
latin1. The Cork (aka EC) encoding, still shipped for historical
reasons with distributions of TeX et al. (there is also the "8r"
encoding that is latin1 compatible IIRC...), does not put the latin1
characters at their latin1 positions.

The simplest for a Plan9 user is indeed to access his JIT compiled macro
set through a script that converts the utf .tex file to some 8 bits
character set before passing it to TeX (latex(1) is just TeX in
disguise, using argv[0] to know which predigested version to load).

The correct solution, later, will be to let TeX directly handle utf as
input... but this means not only extending to wydes instead of bytes,
but a heavy lifting for font support too and, if possible, not only
left-to-right direction.

This is why XeTeX comes in the discussion, but with C++ floating
around it, it is not an immediate candidate for Plan9. (And I'm
personally more keen on having an extended DVI and a dvipdf
driver---because there can be a dvi whatever and there can be a
dvi "standalone" viewer---, than plaguing TeX directly with pdf
whose "interactive" features are growing exponentially with all
the open source solutions lagging far behind Adobe solutions; and I'm
afraid that I will not devote time to support on a "page" interactive
animations, like a paper clip rolling big eyes, even if it is supposed
to sing "la Marseillaise"...)

--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C


* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-25 15:03                   ` tlaronde
  2011-06-25 15:11                     ` erik quanstrom
@ 2011-06-25 16:34                     ` Mauricio CA
  2011-06-25 17:11                       ` tlaronde
  1 sibling, 1 reply; 52+ messages in thread
From: Mauricio CA @ 2011-06-25 16:34 UTC (permalink / raw)
  To: 9fans

> Since TeX is "8 bits", the tex file must have characters encoded in
> 8 bits, with the not control positions of the first half being, after
> perhaps mapping defined at compile time (can be remapped at user level
> but with apparently "strange" macro commands), conforming to ASCII---
> used as literals but also for the primitives.

Is it possible to change the representation of a character from an 8 bits
char to, say, a 32 bits integer? If those integers are still mapped to
the existing 8 bits font metrics, wouldn't the basic engine be kept the
same? This probably means extending the syntax of a few control sequences
denoting characters, though.

Best,
Maurício






* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-25 16:34                     ` Mauricio CA
@ 2011-06-25 17:11                       ` tlaronde
  2011-06-25 18:43                         ` Michael Kerpan
  0 siblings, 1 reply; 52+ messages in thread
From: tlaronde @ 2011-06-25 17:11 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Sat, Jun 25, 2011 at 04:34:17PM +0000, Mauricio CA wrote:
> > Since TeX is "8 bits", the tex file must have characters encoded in
> > 8 bits, with the not control positions of the first half being, after
> > perhaps mapping defined at compile time (can be remapped at user level
> > but with apparently "strange" macro commands), conforming to ASCII---
> > used as literals but also for the primitives.
>
> Is it possible to change the representation of a character from an 8 bits
> char to, say, a 32 bits integer? If those integers are still mapped to
> the existing 8 bits font metrics, wouldn't the basic engine be kept the
> same? This probably means extending the syntax of a few control sequences
> denoting characters, though.

No, if there is to be a promotion, this is from byte to wyde. There is
"prior art" even in TeX (present program): in math mode, the characters
are wydes (almost), this being interpreted as the combination of a font
family and an index in the font. This exists in PostScript too (see the
Red book).
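
To illustrate the math-mode packing: a TeX \mathcode is a 15-bit value
holding 3 bits of class, 4 bits of font family and 8 bits of position in
the font. A sketch in C of how such a value splits; the struct and
function names are invented here and are not those of tex.web or kerTeX:

```c
/* The three fields packed into a 15-bit TeX math code. */
struct mathchar {
	unsigned cls;	/* class, 0..7 */
	unsigned fam;	/* font family, 0..15 */
	unsigned pos;	/* position (index) in the font, 0..255 */
};

/* Split a math code such as "202B (hexadecimal in TeX notation). */
struct mathchar
split_mathcode(unsigned code)
{
	struct mathchar m;

	m.cls = (code >> 12) & 0x7;
	m.fam = (code >> 8) & 0xF;
	m.pos = code & 0xFF;
	return m;
}
```

With plain TeX's \mathcode`\+="202B this gives class 2 (binary
operation), family 0 and position "2B: more than a byte of information
per character, yet the font itself is still indexed with 8 bits.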

This extension would allow TeX to accept utf as input (and as output
for messages) without touching the font format.

But for the 1.0 release of kerTeX, I will make strictly no acrobatics
or tries (using tcs(1) gives a solution, even if not ideal). And I will
first spend time thinking of the next step before starting implementing:
take time to decide; once decided and sure of the solution, speed
implementation. Not the common reverse: start "something" without
thinking; and the "final" release being an asymptotical aim, every year
added in implementation leading to a more minuscule gain without ever
crossing the line...
--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C





* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-25 17:11                       ` tlaronde
@ 2011-06-25 18:43                         ` Michael Kerpan
  2011-06-26  7:57                           ` tlaronde
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Kerpan @ 2011-06-25 18:43 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Modern TeX implementations like XeTeX and LuaTeX handle UTF-8 natively
and also bring all sorts of benefits like OpenType support (automagic
ligatures, real small caps, selectable lining or old-style figures and
more) and the ability to define fonts from the system font pool rather
than using archaic incantations and magic scrolls from the early 90s.
The problem is that these modern implementations are HUGE. On the
average Linux system, TeX, LaTeX and other paraphernalia seem to take
up well over 1 GB these days. I've given up on TeX because it's just
so darn big.

There is, however, hope. Heirloom troff manages to include many of the
same whizz-bang typographic features as XeTeX and friends (including
Unicode support, smartfont support, easy loading of fonts in modern
formats) while taking up about 1/100th the resource footprint. Clearly
what we REALLY need is a filter that takes LaTeX sources and processes
them into TROFF commands to feed to a port of Heirloom troff ;)

Mike


* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-25 18:43                         ` Michael Kerpan
@ 2011-06-26  7:57                           ` tlaronde
  2011-06-27  1:01                             ` Michael Kerpan
  0 siblings, 1 reply; 52+ messages in thread
From: tlaronde @ 2011-06-26  7:57 UTC (permalink / raw)
  To: mjkerpan, Fans of the OS Plan 9 from Bell Labs

On Sat, Jun 25, 2011 at 02:43:32PM -0400, Michael Kerpan wrote:
> Modern TeX implementations like XeTeX and LuaTeX handle UTF-8 natively
> and also bring all sorts of benefits like OpenType support (automagic
> ligatures, real small caps, selectable lining or old-style figures and
> more) and the ability to define fonts from the system font pool rather
> than using archaic incantations and magic scrolls from the early 90s.

I don't know what "automagic" ligatures are; but ligatures are here in
the kerTeX fonts, user having nothing special to do to have them. Small
caps are here. Using the system fonts is here too, at least for T1
fonts: afm2tfm(1) makes them available. For other font formats,
writing a whatever2tfm(1) will do the job.

And "archaic" is definitively a marketing sentence, not a scientific
judgement: "Euclid? Well... it was perhaps good for the epoch..."

> The problem is that these modern implementations are HUGE. On the
> average Linux system, TeX, LaTeX and other paraphernalia seem to take
> up well over 1 GB these days. I've given up on TeX because it's just
> so darn big.

So have I.

>
> There is, however, hope. Heirloom troff manages to include many of the
> same whizz-bang typographic features as XeTeX and friends (including
> Unicode support, smartfont support, easy loading of fonts in modern
> formats) while taking up about 1/100th the resource footprint. Clearly
> what we REALLY need is a filter that takes LaTeX sources and processes
> them into TROFF commands to feed to a port of Heirloom troff ;)

kerTeX is 1/100th of the current TeX distributions and is C89, that is
the most portable. It lacks some Heirloom troff features, but it is for
text and mathematics, includes a font designer: METAFONT, a figure
designer: MetaPost and a bunch of debugging utilities, coding utilities
(WEB), fonts and a state of the art documentation.

So I stick to kerTeX. And I have recorded what _you_ propose to do ;)
Since you seem to claim that the way _you are engaged in_ is easier than
the road I have taken, you should have finished before I have finished
kerTeX, rendering it /* sigh */ obsolete...

Not to mention that I can work on kerTeX only during limited slots of
time, since my main developing time is for a huge beast: KerGIS. And it
should be noted that I manage alone forks of G.R.A.S.S. and TeX and al.,
while "millions of users! thousands of programmers! hundreds of
developers!" seem to be unable to evolve correctly the "community
driven" equivalents... So imagine what one can achieve if one can
concentrate on a far more limited scale? But beware of the tortoise...

This is a lesson "GPL fanatics" have learned: say, by principle, that
"free software" is perfect, and closed source a disaster. Why?
Simply because if someone criticizes open source code the answer is
immediate: "code is here, be my guest". While, with closed source, one
can spend gallons of electronic ink saying: "This sucks! If only
I had the code...".
--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C


* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-26  7:57                           ` tlaronde
@ 2011-06-27  1:01                             ` Michael Kerpan
  2011-06-27 11:48                               ` tlaronde
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Kerpan @ 2011-06-27  1:01 UTC (permalink / raw)
  To: tlaronde; +Cc: Fans of the OS Plan 9 from Bell Labs

On Sun, Jun 26, 2011 at 3:57 AM,  <tlaronde@polynum.com> wrote:

> I don't know what "automagic" ligatures are; but ligatures are here in
> the kerTeX fonts, user having nothing special to do to have them. Small
> caps are here. Using the system fonts is here too, at least for T1
> fonts: afm2tfm(1) makes them available. For other fonts format,
> writing a whatever2tfm(1) will do the job.

In general, using a simple Type 1 font isn't going to get you things
like true small caps, ligatures (beyond maybe the basic "fi" and "fl")
or the ability to choose between old-style and lining figures. The 256
glyph limit means that you have to split things up into multiple fonts.
This works well enough for simply creating a PostScript file that will
be fed straight to a laser printer, but for creating searchable PDF
files, it's far from ideal. In TeX, it also requires a lot of manual
work above and beyond what would be needed to get those features using
Computer Modern. With OpenType support (and using OpenType fonts, of
course), typographic features become as easy to use with third-party
fonts as they are with Computer Modern.

> And "archaic" is definitively a marketing sentence, not a scientific
> judgement: "Euclid? Well... it was perhaps good for the epoch..."

True enough. It's more my opinion than anything else. Still, it must
be an opinion shared by someone else, given the widespread use of
"fontspec" wherever available compared to the older methods.

>> The problem is that these modern implementations are HUGE. On the
>> average Linux system, TeX, LaTeX and other paraphernalia seem to take
>> up well over 1 GB these days. I've given up on TeX because it's just
>> so darn big.
>
> So have I.

> kerTeX is 1/100th of the current TeX distributions and is C89, that is
> the most portable. It lacks some Heirloom troff features, but it is for
> text and mathematics, includes a font designer: METAFONT, a figure
> designer: MetaPost and a bunch of debugging utilities, coding utilities
> (WEB), fonts and a state of the art documentation.

I'm not disparaging your work. In fact I think it's pretty good. I was
mainly trying to point out the problems that have arisen in some
"modern" TeX distros in the past.

> So I stick to kerTeX. And I have recorded what _you_ propose to do ;)
> Since you seem to claim that the way _you are engaged in_ is easier than
> the road I have taken, you should have finished before I have finished
> kerTeX, rendering it /* sigh */ obsolete...

I doubt that, as tongue-in-cheek suggestions seldom seem to turn into
working ideas (at least when they come from me).


* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-27  1:01                             ` Michael Kerpan
@ 2011-06-27 11:48                               ` tlaronde
  2011-06-27 12:36                                 ` erik quanstrom
  0 siblings, 1 reply; 52+ messages in thread
From: tlaronde @ 2011-06-27 11:48 UTC (permalink / raw)
  To: Michael Kerpan; +Cc: Fans of the OS Plan 9 from Bell Labs

On Sun, Jun 26, 2011 at 09:01:13PM -0400, Michael Kerpan wrote:
> On Sun, Jun 26, 2011 at 3:57 AM,  <tlaronde@polynum.com> wrote:
>
> > I don't know what "automagic" ligatures are; but ligatures are here in
> > the kerTeX fonts, user having nothing special to do to have them. Small
> > caps are here. Using the system fonts is here too, at least for T1
> > fonts: afm2tfm(1) makes them available. For other fonts format,
> > writing a whatever2tfm(1) will do the job.
>
> In general using a simple Type 1 font isn't going to get you things
> like true small caps, ligatures (beyond maybe the basic "fi" and "fl")
> or the ability to choose between old-style and lining figures.

These are not limitations of the software itself but limitations due
to the obscurity of the whole process. Ligatures can be added via the
encoding passed to afm2tfm(1). As an example, in the next-to-come
publication, I add the classical standard TeX ones (``, '', fi, fl,
en-dash, em-dash, inverted punctuation for Spanish) plus << and >> for
French guillemets, and ,, for the base double quote.

Once you know how it is done (and since, if the corresponding glyphs do
not exist, the ligature is discarded), it is just a matter of calling the
utility with the correct encoding. And once this is documented, no more
"wizards" needed...

>The 256
> glyph limit means that you had to split things up into multiple fonts,
> This works well enough for simply creating a PostScript file that will
> be fed straight to a laser printer, but for creating searchable PDF
> files, it's far from ideal. In TeX, it also require a lot of manual
> work above and beyond what would be needed to get those features using
> Computer Modern. With OpenType support (and using OpenType fonts, of
> course), typographic features become as easy to use with third-party
> fonts as they are with Computer Modern.
>

Same answer. TeX does not need the design of the glyphs. It needs only
the metrics (Adobe has published the AFM for the core PostScript fonts;
the definition of the fonts is not public, that's why the "urw" ones are
used).

These are not limitations of TeX itself, but of the surrounding
environment and of the "freedom wizardry by obscurity". That's also why
I want to preserve dvi, because one can write a dvi2whatever, while
making pdf directly the layout language ties TeX to external
support. The huge mess "TeX distributions" have become will
sooner or later kill TeX.

One of the major gaps in kerTeX now is a dvi display renderer (for X and
Rio), so that the system would be standalone and sheltered from external
moods.

What Donald E. Knuth wanted is the ability to write his books without
depending on someone else anymore---"we can't print this way, since this
is deprecated, unavailable etc.". KerTeX will definitively miss the
goal if it depends on something else.

The other intellectual context (on my side) is also the following.

How did Michael Ventris find the clues to decipher linear B? The signs
were too numerous to be alphabetical, not enough to be ideographic. So
he guessed they were syllabic with some standalone ideographic ones.

I suspect that if some civilizations have not evolved rapidly, this is
due in part to the way the knowledge is transmitted. It is easy to learn
alphabetic and, furthermore, this disconnects the signs partly from the
sound and totally from the sight of the object (for real ones).
Alphabetic has rules. While ideographic requires erudition, and
since it seems unnatural to have an ideographic base (few signs that
combined can describe higher level notions), it renders new ideas more
difficult to express/transmit.

Unicode is a good idea to avoid "guessing" the language and
plaguing code with language knowledge. With this, utf encoding is the
best idea, keeping ASCII and keeping the "smallest addressable" i.e.
bytes.

But I don't want to have the obligation to "know" 65536 signs to
express what I want to express. I'm sorry, but I think that the
vast majority (remember that for latin1/latin2 accented letters
are just variants so need less "user memory" than plain different
characters) can do with blocks of (fewer than) 256 signs, and switch
fonts when "speaking" about special things (the switch can be
automatic by the way). As far as TeX is concerned, all the control
codepoints (positions) are useless in the fonts. There is still
room available even in the latin1 encoded tfm built for (next)
kerTeX from the PostScript core fonts.

Does a whole Unicode "Times-Roman" font make sense? Ideograms in
"Times-Roman"?

So Unicode is not a panacea. It is a means, not an end. ("Un moyen, pas
une fin.")
--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C


* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-27 11:48                               ` tlaronde
@ 2011-06-27 12:36                                 ` erik quanstrom
  2011-06-27 14:38                                   ` Karljurgen Feuerherm
  2011-06-27 17:20                                   ` tlaronde
  0 siblings, 2 replies; 52+ messages in thread
From: erik quanstrom @ 2011-06-27 12:36 UTC (permalink / raw)
  To: 9fans

> But I don't want to have the obligation to "know" 65536 signs to
> express what I want to express. I'm sorry, but I think that the
> main majority (remember that for latin1/latin2 accented letters
> are just variants so need less "user memory" than plain different
> characters) can do with (less than) 256 signs blocks, and switch
> fonts when "speaking" about special things (the switch can be
> automatic by the way). As far as TeX is concerned, all the control
> codepoints (positions) are useless in the fonts. There is still
> available room even if for the latin1 encoded tfm built for (next)
> kerTeX from PostScript core.

there are currently 0x10ffff+1 codepoints (1114112), not 65536,
but only 23669 + the large chinese blocks are currently defined.

but anyway, i think you are missing the point.  every one of those
codepoints is used, or was used in human written communication.
the fact that you or i probably don't know them all is beside the
point entirely.

there are 600000 words in the oxford english dictionary.  i don't
know them all.  let's suppose i had the power to eliminate all
the ones that i don't know.  wouldn't that be a horrible idea?
then i would not be able to learn any new words.  odious.

so with unicode.  if you strip out all the languages you don't know
by restricting yourself to the latin1 codepoints [0, 256), then you
can't easily add, say, greek or sumerian codepoints should you or
anyone else need them.

since, as you can see, there is a 1:1 identity mapping between latin1
and unicode codepoints [0, 256), i don't see why one wouldn't
give oneself the option to increase this subset to cover more ground.
i use alphas, arrows, math symbols, etc. quite often in code.  and
even more often when i used to use tex.  it's really quite a drag to
read \alpha instead of “α.”
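
The identity mapping mentioned above can be checked mechanically. The following is only a small sketch in Python (not anything from kerTeX or Plan 9); the function names are made up for the illustration. It confirms that each Latin-1 byte value equals the Unicode code point it decodes to, so Latin-1 text can be re-encoded as UTF-8 without any lookup table.

```python
# Size of the Unicode code space: U+0000 through U+10FFFF inclusive.
CODE_SPACE = 0x10FFFF + 1


def latin1_is_identity() -> bool:
    """Check that every Latin-1 byte decodes to the code point of the same value."""
    for byte in range(256):
        ch = bytes([byte]).decode("latin-1")
        if ord(ch) != byte:
            return False
    return True


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin-1 bytes as UTF-8; the text itself is unchanged."""
    return data.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    print(CODE_SPACE)
    print(latin1_is_identity())
    print(latin1_to_utf8(b"caf\xe9"))
```

Going the other way (UTF-8 back to Latin-1) only works for text that stays within the first 256 code points, which is the forward-compatibility concern raised here.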

> Does a whole Unicode "Times-Roman" font make sense? Ideograms in
> "Times-Roman"?

i get confused on terms.  i think the right term is typeface.
extended font collections of a given typeface covering very
wide sections of unicode do exist and are sold by the major
font vendors.

i don't think that it's too hard to imagine that one can make
most symbols look compatible enough.  in fact, i'm using a font
with ~32000 glyphs on my plan 9 terminal right now.

and there's no penalty for having that many glyphs.  it just
means that my font file has a couple hundred subfonts.  these
are only opened if needed.  typically only 3 subfonts are open
at any one time.
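
To make the "subfonts are only opened if needed" point concrete, here is a rough sketch in Python. It assumes a font description in the style of Plan 9's font(6): a first line with height and ascent, then lines of "min max subfont-file". The file names, the `Font` class and the `loader` callback are all invented for the illustration and are not the actual Plan 9 implementation.

```python
from bisect import bisect_right

# Hypothetical font file in the style of Plan 9 font(6): the first line gives
# height and ascent, each following line gives "min max subfont-file".
SAMPLE_FONT = """\
17 14
0x0000 0x00ff latin1.17
0x0370 0x03ff greek.17
0x2190 0x21ff arrows.17
"""


class Font:
    """Font description whose subfonts are loaded only on first use."""

    def __init__(self, text, loader):
        rows = [line.split() for line in text.splitlines() if line.strip()]
        self.height, self.ascent = int(rows[0][0]), int(rows[0][1])
        # Keep (min, max, file name) sorted by min so lookups can use bisection.
        self.ranges = sorted((int(r[0], 0), int(r[1], 0), r[-1]) for r in rows[1:])
        self.starts = [r[0] for r in self.ranges]
        self.loader = loader      # called once per subfont file
        self.open_subfonts = {}   # file name -> loaded subfont

    def subfont_for(self, codepoint):
        """Return the subfont covering codepoint, loading it if needed."""
        i = bisect_right(self.starts, codepoint) - 1
        if i < 0 or codepoint > self.ranges[i][1]:
            return None           # no subfont covers this code point
        name = self.ranges[i][2]
        if name not in self.open_subfonts:
            self.open_subfonts[name] = self.loader(name)
        return self.open_subfonts[name]


if __name__ == "__main__":
    font = Font(SAMPLE_FONT, loader=lambda name: "glyphs:" + name)
    for ch in "a\u03b1b":
        font.subfont_for(ord(ch))
    print(sorted(font.open_subfonts))
```

Text that mixes Latin and Greek letters touches two of the three ranges, so the arrows subfont is never loaded, however many ranges the font file lists.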

- erik

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-27 12:36                                 ` erik quanstrom
@ 2011-06-27 14:38                                   ` Karljurgen Feuerherm
  2011-06-27 17:20                                   ` tlaronde
  1 sibling, 0 replies; 52+ messages in thread
From: Karljurgen Feuerherm @ 2011-06-27 14:38 UTC (permalink / raw)
  To: 9fans

Thanks for bringing up Sumerian (better: Sumero-Akkadian Cuneiform). I
was thinking along exactly those lines. For me at least, solutions that
satisfy 'the majority' are no solutions at all. And obviously, I'm not
alone. 

(Though it could well be that I missed the intent of Thierry's comment
and am barking up the wrong tree.) 

K

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-27 12:36                                 ` erik quanstrom
  2011-06-27 14:38                                   ` Karljurgen Feuerherm
@ 2011-06-27 17:20                                   ` tlaronde
  2011-06-27 17:34                                     ` erik quanstrom
  2011-06-27 23:45                                     ` Karljurgen Feuerherm
  1 sibling, 2 replies; 52+ messages in thread
From: tlaronde @ 2011-06-27 17:20 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, Jun 27, 2011 at 08:36:35AM -0400, erik quanstrom wrote:
>
> and there's no penalty for having that many glyphs.  it just
> means that my font file has a couple hundred subfonts.  these
> are only open if needed.  typically only 3 subfonts are open
> at any one time.

As should be clear from my English being even more disastrous than
usual, I only had a minute or two to write that message.

I DON'T SAY THAT I WILL RESTRICT TEX TO THE FIRST 256 CODEPOINTS.

This is precisely why I have rejected your proposal. KerTeX will
provide "latin1" fonts, because this is what is in the fonts. But if
there are other fonts for Cyrillic, Greek etc., I don't want to render
TeX unusable. There are fonts on the one side; TeX on the other. And TFM
to link them.

I only say that:

1) Forcing the user, as was written in the XeTeX FAQ, to enter the
special codepoint for the fi ligature because (rolled eyes, scornful
wave of the hand) "this is the way this is done with Unicode" is sheer
stupidity. I don't want to be forced to specify a piece of printing
sugar instead of the letters that compose it. I want to be able to use
~ as a visible sign saying "don't break here", and not the "unbreakable"
space plaguing messages nowadays. Etc. I hate languages that are
supposed to be human oriented treating white space as semantically
significant...

2) I say that one can add utf as input for TeX, and use whatever one
wants/needs---if I speak about Linear B, that's perhaps because I have
some interest even in defunct scripts, no?---without dramatically
changing everything in the core TeX engine. TeX, for maths, already
switches fonts by using almost 16 bits. The same can be done for text,
and there is no need to extend the notion of a font metric for TeX
(except marginally for the flipping/mirroring of boxes for direction of
writing), and one can have everything with TeX using 256-glyph
SUBFONTS, and more precisely, 256-entry TFMs. (I add that, all in all,
if languages are not mixed, the present TeX can be used for whatever
direction of writing: let the PS interpreter mirror the page, rotate,
flip etc.; it is more involved when languages are mixed on the same page.)
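
As an illustration of point 2 only (this is not kerTeX code), one conceivable way to address any code point through 256-entry subfonts is to split it into a block number and an 8-bit index. The function names and the "base name plus four hex digits" naming scheme below are assumptions made up for this sketch.

```python
def split_codepoint(cp: int) -> tuple[int, int]:
    """Split a Unicode code point into (256-entry block number, index in block)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    return cp >> 8, cp & 0xFF


def subfont_for(base: str, cp: int) -> tuple[str, int]:
    """Return a hypothetical TFM subfont name and the 8-bit position inside it."""
    block, index = split_codepoint(cp)
    return f"{base}{block:04x}", index


if __name__ == "__main__":
    print(subfont_for("tim", ord("e")))   # plain Latin letter, block 0
    print(subfont_for("tim", 0x03B1))     # Greek small alpha
    print(subfont_for("tim", 0x10000))    # first code point of the Linear B block
```

With such a split, the engine only ever sees an 8-bit position plus a font switch, which is the kind of thing it already handles.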

Subfonts are precisely what you are talking about.

TeX does not use fonts. TeX uses TeX Font Metric (TFM) files. It needs
only the metrics, and one can use whatever font, as long as it is
described according to the expectations of TeX.
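
For readers who have never looked inside a TFM file, here is a small Python sketch of what "only the metrics" means at the byte level. It relies on the documented TFM layout: the file starts with twelve 16-bit big-endian integers (lf, lh, bc, ec, nw, nh, nd, ni, nl, nk, ne, np) whose values must satisfy lf = 6 + lh + (ec - bc + 1) + nw + nh + nd + ni + nl + nk + ne + np. The function name and the sample header are made up for the illustration; the sketch checks the header only, not the tables that follow.

```python
import struct

FIELDS = ("lf", "lh", "bc", "ec", "nw", "nh", "nd", "ni", "nl", "nk", "ne", "np")


def read_tfm_header(data: bytes) -> dict:
    """Decode and sanity-check the twelve 16-bit words that start a TFM file."""
    if len(data) < 24:
        raise ValueError("a TFM file starts with 24 bytes of lengths")
    header = dict(zip(FIELDS, struct.unpack(">12H", data[:24])))
    if header["ec"] > 255:
        raise ValueError("character codes in a TFM file do not exceed 255")
    # lf counts the whole file in 4-byte words and must equal the sum of parts.
    expected = (6 + header["lh"] + (header["ec"] - header["bc"] + 1)
                + sum(header[name] for name in FIELDS[4:]))
    if header["lf"] != expected:
        raise ValueError("inconsistent TFM lengths")
    return header


if __name__ == "__main__":
    # Made-up header describing a full 256-entry font (bc=0, ec=255).
    sample = struct.pack(">12H", 269, 2, 0, 255, 2, 1, 1, 1, 0, 0, 0, 0)
    print(read_tfm_header(sample))
```

The bc and ec fields (first and last character code) are where the 256-entry limit discussed in this thread shows up: no glyph outlines are stored, only sizes indexed by an 8-bit code.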

One can imagine extending TeX a little so that it switches between TFM
"subfonts", letting it master a layout that the _drivers_ will have to
translate according to the native format of the fonts (the drivers
really handle the direction of writing: depending on the hint, the
rendered box is mirrored, flipped etc., TeX needing only to know the
height and width [the correct corner] of the result).

"Simplicity is the shortest path to the truth." I suspect that the
current state is not the truth, considering the path taken and the size
of the change files. (In an interview, D.E.K. spoke about omega, whose
change file [against TeX source] was several times the size of the TeX
source...)
--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-27 17:20                                   ` tlaronde
@ 2011-06-27 17:34                                     ` erik quanstrom
  2011-06-27 18:01                                       ` tlaronde
  2011-06-27 23:45                                     ` Karljurgen Feuerherm
  1 sibling, 1 reply; 52+ messages in thread
From: erik quanstrom @ 2011-06-27 17:34 UTC (permalink / raw)
  To: 9fans

> As can be clear from the even more disastrous level of my English
> than usual, I only had a minute or two to write the message.
>
> I DON'T SAY THAT I WILL RESTRICT TEX TO THE FIRST 256 CODEPOINTS.
>
> This is precisely why I have rejected your proposal. KerTeX will
> provide, because this is what is in the fonts, "latin1" font. But if
> there are other fonts for cyrillic, greek etc. I don't want to render
> TeX unusable. There are fonts on the one side; TeX on another. And TFM
> to link them.

no need to yell.  i must be confused.  i thought you said that you
were using latin1 for .tex files.  i don't see a forward-compatible way
to get from latin1 input to utf-8 input.

> I only say that:
>
> 1) Forcing, as this was written in the XeTeX FAQ, user to enter the
> special codepoint for the fi ligature since, white eyes, scornful wave
> of the hand: "this is the way this is done with Unicode" is sheer
> stupidity. I don't want to be forced to specify a printing sugar
> instead of the composition of the alphabet. I want to be able to use ~
> as a visible sign saying: don't break here, and not the "unbreakable"
> space plaguing messages nowadays. Etc. I hate languages supposed to be
> human oriented taking whites as semantically significant...

i don't even have an opinion on this.  i don't understand the conflation
of the input character set and tex's internal representations.  could
you explain why you are talking about them as the same?

to be brutally honest, tex could internally use an array of monkeys
flinging poo to represent characters /internally/ and i would be much
happier than with a reasonable internal representation and a difficult
and incompatible external representation.  at least that way the monkeys
flinging poo are hermetically sealed within the program and not flinging
poo all over my system.  :-)

> 2) I say that one can add utf as input for TeX, and use whatever one
> wants/needs---if I speak about Linear B that's perhaps because I have
> some interest even in defunct scripting, no?---without dramatically
> changing everything in the core TeX engine. TeX, for maths, already
> switches fonts by using almost 16 bits. The same can be made for text,
> and there is no need to extend the conception of a font metric for TeX
> (except marginally for the flipping/mirroring of boxes for direction of
> writing), and one can have everything with TeX using 256 glyphes
> SUBFONTS, and more precisely, 256 entries TFM. (I add that all in all,
> if languages are not mixed, the present TeX can be used for whatever
> direction of writing: let the PS interpreter mirror the page, rotate,
> flip etc.; more involved when languages are mixed in the same page.)

again, i don't think anyone cares if this is how things work internally.

- erik



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-27 17:34                                     ` erik quanstrom
@ 2011-06-27 18:01                                       ` tlaronde
  2011-06-27 21:17                                         ` Michael Kerpan
  0 siblings, 1 reply; 52+ messages in thread
From: tlaronde @ 2011-06-27 18:01 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, Jun 27, 2011 at 01:34:07PM -0400, erik quanstrom wrote:
>
> i don't even have an opinion on this.  i don't understand the conflation
> of the input character set and tex's internal representations.  could
> you explain why you are talking about them as the same?
>
> to be brutally honest, tex could internally use an array of monkeys
> flinging poo to represent characters /internally/ and i would be much
> happier than with a reasonable internal representation and a difficult
> and incompatible external representation.  at least that way the monkeys
> flinging poo are hermetically sealed within the program and not flinging
> poo all over my system.  :-)

In TeX there is, initially, one defined subset: ASCII, because TeX is a
compiler/interpreter and one needs to be able to send it some
"bootstrapping" commands. This can be rapidly overridden (though one
has to start with some ASCII-like characters), and the result can be
almost arbitrary.

What people were arguing is precisely that this external business,
"state of the art" (that is, soon to be "out of fashion") fonts and
whatever is the mood "du jour", should lead to a rewrite of TeX internals.

My claim is precisely to leave TeX internals alone. The majority of the
work is external (mainly in the dvi drivers). If I want to use
ligatures, I should be able to. If others want to enter the code for
the ligatured glyph directly, they can, but this is their problem and
not a holy rule.

>[...]
> again, i don't think anyone cares if this is how things work internally.
>

Unfortunately wrong. Read back the thread (if you really have
nothing more interesting to do). I have explained this "256 subfonts"
business in the first message, and immediately got answers that
the "correct way" was teaching TeX "modern" fonts.

--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-27 18:01                                       ` tlaronde
@ 2011-06-27 21:17                                         ` Michael Kerpan
  2011-06-28 11:25                                           ` tlaronde
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Kerpan @ 2011-06-27 21:17 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, Jun 27, 2011 at 2:01 PM,  <tlaronde@polynum.com> wrote:
> On Mon, Jun 27, 2011 at 01:34:07PM -0400, erik quanstrom wrote:
>>
>> i don't even have an opinion on this.  i don't understand the conflation
>> of the input character set and tex's internal representations.  could
>> you explain why you are talking about them as the same?
>>
>> to be brutally honest, tex could internally use an array of monkeys
>> flinging poo to represent characters /internally/ and i would be much
>> happier than with a reasonable internal representation and a difficult
>> and incompatible external representation.  at least that way the monkeys
>> flinging poo are hermetically sealed within the program and not flinging
>> poo all over my system.  :-)
>
> In TeX there is, initially, a defined subset: ASCII. Because TeX is a
> compiler/interpreter and one needs to be able to send some
> "bootstrapping" commands. This can be rapidly overwritten (but starting
> with some ASCII like characters). This can be almost arbitrary.
>
> What people were precisely arguing is precisely that external business,
> and "state of the art" (that is soon to be "out of fashion") fonts and
> whatever mood "du jour" should lead to the rewrite of TeX internals.
>
> I precisely claim to let TeX internals alone. The majority of the work
> is external (the main being in the dvi drivers). If I want to use
> ligatures, I shall be able to do. If others want to put directly the
> code for the ligatured glyph, they can, but this is their problem and
> not a holy rule.

That's not how OpenType works, actually. It actually works more like
TeX, in that it allows for files to store text as basic
ASCII/Latin-1/8-bit UTF-8 subset format which an OpenType-enabled
renderer (such as XeTeX, InDesign or even Office 2010) then presents
(on screen or on page) as the correct ligature. Thus the big advantage
of OpenType over, say Type 1, is that it offers a featureset much
closer to Computer Modern's full set of ligatures, accents and
alternatives than Type 1 ever could (at least without serious
scripting to combine multiple Type 1 fonts containing all the needed
glyphs into a single "virtual font" as described in your first post)
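
The idea that the stored text stays as plain letters while the renderer picks the ligature glyph can be sketched as follows. This is a toy in Python, not how an OpenType GSUB table or a TFM lig/kern program is actually encoded; the table contents and the function name are assumptions for the illustration.

```python
# Toy substitution table: sequences of input glyph names -> ligature glyph name.
LIGATURES = {
    ("f", "f", "i"): "ffi",
    ("f", "f", "l"): "ffl",
    ("f", "f"): "ff",
    ("f", "i"): "fi",
    ("f", "l"): "fl",
}


def apply_ligatures(glyphs, table=LIGATURES):
    """Replace known glyph sequences by ligature glyphs, longest match first."""
    longest = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(glyphs):
        for n in range(longest, 1, -1):
            key = tuple(glyphs[i:i + n])
            if key in table:
                out.append(table[key])
                i += len(key)
                break
        else:
            # No ligature starts here: keep the glyph as it is.
            out.append(glyphs[i])
            i += 1
    return out


if __name__ == "__main__":
    print(apply_ligatures(list("office")))
    print(apply_ligatures(list("find")))
```

The input remains the ordinary letters f, f, i; only the list of glyphs handed to the output stage changes.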

> Unfortunately wrong. Read back the thread (if you really have
> nothing more interesting to do). I have explained this "256 subfonts"
> business in the first message, and immediately got answers that
> the "correct way" was teaching TeX "modern" fonts.

The subfont system works fine if you both have a complete Type 1 font
set including all the "expert fonts" including the extra glyphs and
the like AND are willing to put together a mapping for it. The problem
is that fonts haven't shipped (to consumers, at least) in that form
for about 10 years. Unless I fundamentally misunderstand the subfont
system (which I admit that I might), for any font made within the
last 10 years or so, using the subfont/virtual font system would
entail the following steps:
1. Break the complete OpenType font down into a combination of PFBs
and AFMs containing the complete set of characters between them,
carefully remapping each glyph outside of 8-bit range into it so that
they remain accessible. This may break the license agreement for many
fonts and would almost certainly cause the loss of many kerning pairs,
hints and other metadata (I'm not sure how much of that TeX uses, so
that may not be as big a problem as it sounds)
2. Build the virtual font mappings as with a "real" Type 1 set
3. Hope for the best.

Given the complexity of the process involved, I would hope you can
understand why, as a USER, teaching TeX to play nice with modern fonts
looks like a good way to go ;)

Again, none of this is meant as a put-down of your quite impressive
work, but rather as a reminder of some areas where others might run
into problems with making USE of said work.

Mike

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-27 21:17                                         ` Michael Kerpan
@ 2011-06-28 11:25                                           ` tlaronde
  0 siblings, 0 replies; 52+ messages in thread
From: tlaronde @ 2011-06-28 11:25 UTC (permalink / raw)
  To: mjkerpan, Fans of the OS Plan 9 from Bell Labs

On Mon, Jun 27, 2011 at 05:17:16PM -0400, Michael Kerpan wrote:
>
> The subfont system works fine if you both have a complete Type 1 font
> set including all the "expert fonts" including the extra glyphs and
> the like AND are willing to put together a mapping for it. The problem
> is that fonts haven't shipped (to consumers, at least) in that form
> for about 10 years. Unless I fundamentally misunderstand the subfont
> system (which I admit that I might), for  any font made within the
> last 10 years or so, using the subfont/virtual font system would
> entail the following steps:
> 1. Break the complete OpenType font down into a combination of PFBs
> and AFMs containing the complete set of characters between them,
>[...]

You miss my point: the "subfont" approach just leaves the internals of
TeX alone by splitting, for TeX, the TFM into subfonts.

The fonts by themselves are left alone. The dvi drivers deal with the
fonts; not TeX. TeX needs only the metrics. After some substitutions (by
virtual fonts), TeX tells only "take this glyph of this font and put it
here". And the font, possibly, has a foreign link to the TFM used by
TeX.

The main particularity with afm2tfm(1) is that the PostScript standard
encoding is not Unicode, not even latin1. So one needs to specify an
encoding.

As long as the encoding of the fonts is known, a program can "import"
a font (that is, just create the TFM for it) so that TeX knows the
metrics.
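
To show why the glyph names matter more than the positions, here is a rough Python sketch of the re-indexing step. It reads character-metric lines in the AFM style ("C code ; WX width ; N name ; ...") and re-indexes the widths by a chosen encoding. The widths and the two-entry encoding are illustrative values only, and the function names are made up; this is not the afm2tfm(1) algorithm itself.

```python
# A few character-metric lines in AFM style; the figures are illustrative only.
SAMPLE_AFM = """\
StartCharMetrics 3
C 102 ; WX 333 ; N f ; B 20 0 383 683 ;
C 174 ; WX 556 ; N fi ; B 31 0 521 683 ;
C -1 ; WX 444 ; N eacute ; B 25 -10 424 678 ;
EndCharMetrics
"""


def parse_char_metrics(afm_text):
    """Return {glyph name: width} from the "C ..." lines of an AFM file."""
    widths = {}
    for line in afm_text.splitlines():
        if not line.startswith("C "):
            continue
        fields = {}
        for part in line.split(";"):
            tokens = part.split()
            if tokens:
                fields[tokens[0]] = tokens[1:]
        if "N" in fields and "WX" in fields:
            widths[fields["N"][0]] = float(fields["WX"][0])
    return widths


def widths_by_code(widths, encoding):
    """Re-index widths by character code, using an encoding {code: glyph name}."""
    return {code: widths[name] for code, name in encoding.items() if name in widths}


if __name__ == "__main__":
    # Two entries of a latin1-style encoding; the AFM codes (102, -1) are ignored.
    latin1_subset = {0x66: "f", 0xE9: "eacute"}
    print(widths_by_code(parse_char_metrics(SAMPLE_AFM), latin1_subset))
```

Note that eacute has code -1 in the sample (unencoded in the font's own encoding) and still ends up at position 0xE9, because the lookup goes through the name.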

The main support is in the dvi drivers (mainly dvips(1)).
--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-27 17:20                                   ` tlaronde
  2011-06-27 17:34                                     ` erik quanstrom
@ 2011-06-27 23:45                                     ` Karljurgen Feuerherm
  2011-06-27 23:48                                       ` erik quanstrom
  2011-06-28 11:19                                       ` tlaronde
  1 sibling, 2 replies; 52+ messages in thread
From: Karljurgen Feuerherm @ 2011-06-27 23:45 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Thierry,

> I only say that:

> 1) Forcing, as this was written in the XeTeX FAQ, user to enter the
> special codepoint for the fi ligature since, white eyes, scornful wave
> of the hand: "this is the way this is done with Unicode" is sheer
> stupidity.

I don't know who told you that...  just because there is a codepoint
for something does not mean that one has to access that codepoint
directly in all cases. Software at various levels can render a ligature
on the basis of various actual character sequences (e.g. f + i, or f, i
when ligatures are forced, etc.).

It's simply a matter of what level of support one wishes to offer....

KF 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-27 23:45                                     ` Karljurgen Feuerherm
@ 2011-06-27 23:48                                       ` erik quanstrom
  2011-06-28 11:19                                       ` tlaronde
  1 sibling, 0 replies; 52+ messages in thread
From: erik quanstrom @ 2011-06-27 23:48 UTC (permalink / raw)
  To: 9fans

> I don't know who told you that...  just because there is a codepoint
> for something does not mean that one has to access that codepoint
> directly in all cases.  Software at various levels can render a
> ligature on the basis of various actual character sequences (e.g.  f +
> i, or f, i when ligatures are forced, etc.
>
> It's simply a level of what support one wishes to offer....

+1.

- erik



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-27 23:45                                     ` Karljurgen Feuerherm
  2011-06-27 23:48                                       ` erik quanstrom
@ 2011-06-28 11:19                                       ` tlaronde
  2011-06-28 11:32                                         ` tlaronde
                                                           ` (2 more replies)
  1 sibling, 3 replies; 52+ messages in thread
From: tlaronde @ 2011-06-28 11:19 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, Jun 27, 2011 at 07:45:34PM -0400, Karljurgen Feuerherm wrote:
> Thierry,
> 
> > I only say that:
> 
> > 1) Forcing, as this was written in the XeTeX FAQ, user to enter the
> special codepoint for the fi ligature since, white eyes, scornful wave
> of the hand: "this is the way this is done with Unicode" is sheer
> stupidity. 
> 
> I don't know who told you that...  just because there is a codepoint for something does not mean that one has to access that codepoint directly in all cases. Software at various levels can render a ligature on the basis of various actual character sequences (e.g. f + i, or f, i when ligatures are forced, etc. 
> 
> It's simply a level of what support one wishes to offer.... 

This is exactly what I'm trying to say. If one enters \'e, \' is just
the "charname" or macro command to access the acute accent in the font.
One can enter directly the code for the acute accent. Or one can enter
directly the é (if the CID entered is classified as "other" [literal],
and the fonts have something at the corresponding index).
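
The three ways of entering the same letter can be illustrated outside TeX. The following Python sketch is only an analogy, not TeX's macro expansion: it handles a handful of accent commands followed directly by a letter (no braces), turns them into a base letter plus a combining accent, and then lets Unicode normalization (NFC) produce the precomposed character. The table and the function name are assumptions for the example.

```python
import unicodedata

# Small, incomplete table: TeX accent command character -> combining accent.
ACCENTS = {
    "'": "\u0301",   # acute
    "`": "\u0300",   # grave
    "^": "\u0302",   # circumflex
    '"': "\u0308",   # diaeresis
    "~": "\u0303",   # tilde
}


def expand_accents(text: str) -> str:
    """Turn sequences like \\'e into precomposed characters where they exist."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 2 < len(text) and text[i + 1] in ACCENTS:
            # Base letter followed by the combining accent, composed by NFC below.
            out.append(text[i + 2] + ACCENTS[text[i + 1]])
            i += 3
        else:
            out.append(text[i])
            i += 1
    return unicodedata.normalize("NFC", "".join(out))


if __name__ == "__main__":
    spellings = [r"caf\'e ", "cafe\u0301 ", "caf\u00e9 "]
    print([expand_accents(s) == "caf\u00e9 " for s in spellings])
```

All three spellings end up as the same string, which is the sense in which the input convention is a matter of convenience rather than obligation.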

BUT the documentation I found said that with "modern" fonts, one has the
absolute obligation, under threat from Thy Unicode GOD, to enter the
codepoint, and that ligatures were deprecated.

TeX is absolutely agnostic. It is an engine, a compiler/interpreter.
Even tex(1) is just the name of an instance of TeX with a special
convention: D.E. Knuth's plain TeX.
some \'e let
CID
> 
> KF 

-- 
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-28 11:19                                       ` tlaronde
@ 2011-06-28 11:32                                         ` tlaronde
  2011-06-28 12:16                                         ` erik quanstrom
  2011-06-29 23:43                                         ` Karljurgen Feuerherm
  2 siblings, 0 replies; 52+ messages in thread
From: tlaronde @ 2011-06-28 11:32 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Tue, Jun 28, 2011 at 01:19:15PM +0200, tlaronde@polynum.com wrote:
>[...]
> some \'e let
> CID

Please ignore this trailing garbage.
--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-28 11:19                                       ` tlaronde
  2011-06-28 11:32                                         ` tlaronde
@ 2011-06-28 12:16                                         ` erik quanstrom
  2011-06-29 23:43                                         ` Karljurgen Feuerherm
  2 siblings, 0 replies; 52+ messages in thread
From: erik quanstrom @ 2011-06-28 12:16 UTC (permalink / raw)
  To: 9fans

> BUT the documentation found told that with "modern" fonts, one has the
> absolute obligation threatened by Thy Unicode GOD to enter the codepoint
> and that ligatures were deprecated.

well of course, just use tcs.  ;-|.

- erik



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-28 11:19                                       ` tlaronde
  2011-06-28 11:32                                         ` tlaronde
  2011-06-28 12:16                                         ` erik quanstrom
@ 2011-06-29 23:43                                         ` Karljurgen Feuerherm
  2011-06-30 13:02                                           ` tlaronde
  2 siblings, 1 reply; 52+ messages in thread
From: Karljurgen Feuerherm @ 2011-06-29 23:43 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

I'd like to make a few comments concerning what you say below. 

1. I've been involved with Unicode, both UTC and as a representative to
WG2, and I can confidently affirm that there is no Unicode God. No one
has ever said There is no Code but Unicode, and UTC/WG2 is its prophet,
or anything like that. If you have a reference to the Unicode Standard
where I can read in black and white what you are referring to, I will
happily look at it. (This is not intended as a smart remark. I'm quite
seriously interested in understanding the facts of this issue.) 

2. Anyone involved in Unicode, including inner core members of UTC etc.,
recognizes that it's far from perfect. There is acknowledgement that a
number of things could have been handled differently, but weren't.
Stability Policy may seem like a problematic restriction to some in
cases like this, but it guarantees backward compatibility, so has wisdom
to it. 

3. Whatever views one may have on Unicode, for better or worse, it is
what it is. As you said yourself, c'est un moyen et non pas une fin (a
means, not an end).... One is free to use it, or not, and/or devise
alternatives. (But more on alternatives below.)

4. You suggested in an earlier email that you'd like to think the whole
thing through carefully in advance, rather than implement things in
stages, as others do, who then never get to the advanced stages. To me
this begs the question of whether such is always universally the case.
In particular, if anyone or any group tried/had tried to implement all
of what Unicode proposes to be/become (UCS--Universal Character Set),
the sheer magnitude of the task (which of course grows over time since
scripts either in themselves or as a set are not static), he/she/they
would never get the thing off the ground. This is in part why there are
(arguably) flaws in Unicode. In any case, I seriously doubt that even if
one attempted to "redo" it "the right way this time" one would manage.
This is just not within the grasp of human endeavour. The mistakes would
simply be different or in different areas. Likewise, there are plenty of
things one could bring against the process of Unicode endorsing
proposals, i.e. the inherent politics of interested groups, but that
again is always a reality. 

5. All that being said--Plan 9, as far as I can see, intentionally
supports Unicode (see http://plan9.bell-labs.com/plan9/about.html). So
to me, it's a non-starter to want to port *TeX to Plan 9 but rail
against Unicode, whether justifiably or through misunderstanding.

6. Unicode isn't Eternal, any more than any other encoding standard.
(I'm sure there were--and perhaps still are--those who think that BCD,
no wait! EBCD, no wait! ASCII, no wait...!--were/are the be all and end
all). In time, something else will develop in response to developing
needs. 

7. But at present, the recognized standard out there, for most
practical intents and purposes (in particular, to service the needs of
something other than just North American anglophone techie society), is
Unicode, with whatever blemishes it may have.

So it seems to me that in keeping with your principle alluded to above,
and given that we're talking about a Plan 9 environment here, you ought
to be talking UTF-8 right off the bat.

As I said--"seems to me". Could be I'm seriously misunderstanding the
discussion... but then again, the diminishing dialogue in terms of
number of participants suggests to me that there may be at least *some*
truth in what I'm thinking.... 

Please don't think this is intended as a rant, either due to the way
I've formatted this or on account of the content. I'm interested in
following what you're doing; I'm just a bit puzzled, and I sincerely
wish you the best in your efforts with this project. 

K

>>> <tlaronde@polynum.com> 06/28/11 7:19 AM >>>
On Mon, Jun 27, 2011 at 07:45:34PM -0400, Karljurgen Feuerherm wrote:
> Thierry,
>
> > I only say that:
>
> > 1) Forcing, as this was
written in the XeTeX FAQ, user to> special codepoint for the fi ligature since, white eyes, scornful wave
> of the hand: "this is the way this is done with Unicode" is sheer
> stupidity.
>
> I don't know who told you that...  just because there is a codepoint
> for something does not mean that one has to access that codepoint
> directly in all cases. Software at various levels can render a ligature
> on the basis of various actual character sequences (e.g. f + i, or f, i
> when ligatures are forced, etc.).
>
> It's simply a level of what support one wishes to offer....

This is exactly what I'm trying to say. If one enters \'e, \' is just
the "charname" or macro command to access the acute accent in the font.
One can enter directly the code for the acute accent. Or one can enter
directly the é (if the CID entered is classified as "other" [literal],
and the fonts have something at the corresponding index).

BUT the documentation found told that with "modern" fonts, one has the
absolute obligation threatened by Thy Unicode GOD to enter the codepoint
and that ligatures were deprecated.

TeX is absolutely agnostic. It is an engine, a compiler/interpreter.
Even tex(1) is just the name of an instance of TeX with a special
convention: D.E. Knuth's plain TeX.
some \'e let
CID
>
> KF

--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C

[-- Attachment #2: HTML --]
[-- Type: text/html, Size: 8356 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-29 23:43                                         ` Karljurgen Feuerherm
@ 2011-06-30 13:02                                           ` tlaronde
  2011-06-30 13:14                                             ` erik quanstrom
  2011-06-30 14:51                                             ` Karljurgen Feuerherm
  0 siblings, 2 replies; 52+ messages in thread
From: tlaronde @ 2011-06-30 13:02 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Wed, Jun 29, 2011 at 07:43:08PM -0400, Karljurgen Feuerherm wrote:
>[...]

First, to make clear what I was referring to (while making a false
generalization): the XeTeX FAQ:

"However, standard Unicode-compliant fonts do not include ligatures for
these sequences, as the normal expectation is that the actual Unicode
characters will be used in the source text."

Re-reading it, it's not "all ligatures" that are gone with
"Unicode-compliant fonts", but it spoke about the em- and en-dashes and
double quotes. So on these ones, I plead guilty.

But starting with "modern fonts", "modern system", "archaic" and the
like, it's like starting with: "only Adolf Hitler would still use
non-Unicode fonts".

> 4. You suggested in an earlier email that you'd like to think the whole
> thing through carefully in advance, rather than implement things in
> stages, as others do, who then never get to the advanced stages. To me
> this begs the question of whether such is always universally the case.

There are 2 essential things in a human mental process:

1) there is a string of thoughts; even ideas that seem foreign to others
followed a path in the discoverer's mind; he started in the vicinity of his
knowledge. One that gets dropped in the middle of nowhere will never
make a discovery. The former is clearing the virgin forest; the latter
is beating around the bush. The first is the tortoise; the second the
hare.

2) There is no actual infinite: resources, especially time, are
limited. And generally, when a resource is of the highest quality, it is
scarce. For example, I have a stupendous patience; but not a lot of it.

A first step for TeX is obvious: put aside the direction of writing,
and do things so that at least Unicode (in utf encoding) can be
mastered as input (and at least interactive output), and also, since
it is a formatting system, that adequate fonts can be accessed.

But, as the present state allows the use for every character set that
fits in eight bits, by using (for Plan9 users) tcs(1) to feed TeX with
what it expects, I will not delay forever the release of 1.0 waiting for
this next solution.

And this extension to utf will be done in the spirit of utf: it will be
an extension, but compatible with what exists. One could take the
TeXbook and obtain exactly what is described there. In particular,
selecting a Computer Modern font will work as described in the
TeXbook.

>[...]
> 5. All that being said--Plan 9, as far as I can see, intentionally
> supports Unicode (see http://plan9.bell-labs.com/plan9/about.html).
> So to me, it's a
> non-starter to want to port *TeX to Plan 9 but rail against Unicode,
> whether justifiably or through misunderstanding.

I have written that TeX shall accept utf as input and output (for text)
and utf is an encoding of a special all encompassing character set:
Unicode. So where did I write that I don't plan to support Unicode
(because of utf)?

What I did say, and say again, is that, whether or not people continue
throwing "archaic vs modern" and various other Godwin-point arguments, if
it suits my taste to keep ligatures for em- and en-dashes, various
quotes and whatever; and if I want to put, _in the font_, at ASCII control
positions, the characters expected for tex-text compatibility, I will.

This doesn't prevent anybody from doing whatever one likes;
but symmetrically, I'm free to do whatever I want. Especially if I do
the work.
--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-30 13:02                                           ` tlaronde
@ 2011-06-30 13:14                                             ` erik quanstrom
  2011-06-30 13:47                                               ` tlaronde
  2011-06-30 14:51                                             ` Karljurgen Feuerherm
  1 sibling, 1 reply; 52+ messages in thread
From: erik quanstrom @ 2011-06-30 13:14 UTC (permalink / raw)
  To: 9fans

> But, as the present state allows the use for every character set that
> fits in eight bits, by using (for Plan9 users) tcs(1) to feed TeX with
> what it expects, I will not delay forever the release of 1.0 waiting for
> this next solution.

good grief.  how hard is it to write this code!?  this bit depends on just a
few simple functions from the plan 9 c library and that can be easily
appropriated, namely chartorune and fullrune, and a user-defined getc.
(not compiled, just dashed off.  just an example of how easy this is.)

char
texgetutfchar(void)
{
	char ibuf[UTFmax + 1];
	int utfi;
	Rune r;

	utfi = 0;
	for(;;){
		if(utfi == sizeof ibuf - 1){
			/* UTFmax bytes without a complete rune: discard them */
			utfi = 0;
			print("garbage input rejected\n");
		}
		ibuf[utfi++] = getc();	/* user-defined; EOF handling omitted */
		ibuf[utfi] = 0;
		if(fullrune(ibuf, utfi)){
			/* chartorune returns the byte count; the rune lands in r */
			chartorune(&r, ibuf);
			utfi = 0;
			if(r >= 256){
				print("codepoint %#.6ux rejected\n", r);
				continue;
			}
			return (char)r;
		}
	}
}

- erik



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-30 13:14                                             ` erik quanstrom
@ 2011-06-30 13:47                                               ` tlaronde
  0 siblings, 0 replies; 52+ messages in thread
From: tlaronde @ 2011-06-30 13:47 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Jun 30, 2011 at 09:14:10AM -0400, erik quanstrom wrote:
> > But, as the present state allows the use for every character set that
> > fits in eight bits, by using (for Plan9 users) tcs(1) to feed TeX with
> > what it expects, I will not delay forever the release of 1.0 waiting for
> > this next solution.
> 
> good grief.  how hard is it to write this code!?  this bit depends on just a
> few simple functions from the plan 9 c library and that can be easily
> appropriated, namely chartorune and fullrune, and a user-defined getc.
> (not compiled, just dashed off.  just an example of how easy this is.)

This is easy just for input. But as I said, constraining to only the
first 256 codepoints will render TeX unusable for other 8-bit sets (latin2,
etc.).
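
To make the limitation concrete, here is a minimal sketch in portable C
(not the Plan 9 library). The names utf8decode and fitsoctet are invented
for this illustration, and only sequences of 1 to 3 bytes are handled.
It shows that a Latin-1 letter such as U+00E9 passes a "codepoint < 256"
filter, while a Latin-2 letter such as U+0142 (byte 0xB3 in latin2) does
not.

```c
#include <stddef.h>

/*
 * Decode one UTF-8 sequence of at most n bytes from s into *cp.
 * Returns the number of bytes consumed, or 0 if the sequence is
 * malformed, truncated, or longer than 3 bytes (not handled here).
 */
static size_t
utf8decode(const unsigned char *s, size_t n, unsigned long *cp)
{
	if(n >= 1 && s[0] < 0x80){
		*cp = s[0];
		return 1;
	}
	if(n >= 2 && (s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80){
		*cp = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
		return *cp >= 0x80 ? 2 : 0;	/* reject overlong forms */
	}
	if(n >= 3 && (s[0] & 0xF0) == 0xE0
	&& (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80){
		*cp = ((unsigned long)(s[0] & 0x0F) << 12)
			| ((unsigned long)(s[1] & 0x3F) << 6)
			| (s[2] & 0x3F);
		return *cp >= 0x800 ? 3 : 0;	/* reject overlong forms */
	}
	return 0;
}

/*
 * Hypothetical filter: 1 if the sequence decodes to a codepoint that
 * fits in a single octet (the "first 256 codepoints"), 0 otherwise.
 */
static int
fitsoctet(const unsigned char *s, size_t n)
{
	unsigned long cp;

	return utf8decode(s, n, &cp) != 0 && cp < 256;
}
```

With such a filter, the bytes C3 A9 (U+00E9) are accepted and the bytes
C5 82 (U+0142) are rejected, which is the case of the latin2 user.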

Would starting with an ASCII character set and guessing the character
set from the first non-ASCII code (and remaining in this state)
work? And what would be the interaction with, for example, the LaTeX
macro definitions that handle re-encoding, which are out of my reach?

The place where the conversion belongs is (thanks to D.E.K.)
well identified. The problem is that (I speak now about going from byte
to wyde) this has an impact on macro definitions, and it is useless if
there is no adequate font support.

So limiting the input for now to a subset of Unicode will only be a
(superficial) convenience for the users of this subset and forbid the
use of TeX to others. Trying to guess the 8 bits character set could
lead to some surprises (and I'm reluctant even temporarily to introduce
some KERTEX_CS to specify the character set).

And extending at least to whatever left-to-right can not be confined
only to the input/output convention, but some surgery (even if I
want it limited) must be put in the guts of TeX.

So for now, I will complete the fonts; fix what I wanted to fix
(essentially so that if D.E. Knuth had the need to reinstall his
software, he could have the 5 minutes solution at hand); give a look at
the newer version of MetaPost. And after 1.0 will be another day.
-- 
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-30 13:02                                           ` tlaronde
  2011-06-30 13:14                                             ` erik quanstrom
@ 2011-06-30 14:51                                             ` Karljurgen Feuerherm
  2011-06-30 15:22                                               ` Michael Kerpan
  2011-06-30 16:25                                               ` tlaronde
  1 sibling, 2 replies; 52+ messages in thread
From: Karljurgen Feuerherm @ 2011-06-30 14:51 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

[-- Attachment #1: Type: text/plain, Size: 1215 bytes --]

Thanks for this. Two notes:

>Re-reading it, it's not "all ligatures" that are gone with
"Unicode-compliant fonts", but it spoke about the em- and en-dashes and
double quotes. So on these ones, I plead guilty. 

Alright. Not a big deal, it seems to me.

>But starting with "modern fonts", "modern system", "archaic" and the
like, it's like starting with: "only Adolf Hitler would still use
non-Unicode fonts".

Looking here: http://scripts.sil.org/cms/scripts/page.php?item_id=xetex_faq I cannot find this; you'll have to help me out. 

But still: it's not about being Adolf Hitler by any means. XeTeX aims to be a Unicode compliant system, and that means adhering to the Unicode standard. I personally don't see a need to defend the purpose and use of standards. If you don't want to, don't; you are free, as you said, to do as you please, and no one is disputing that. But then you and everyone else have to accept the obvious consequences. For all its flaws (and even the encoding I helped to author has them), to *me* (and to the authors of the FAQ you're looking at) the benefits far far outweigh the downsides. 

Best 

KF 

[-- Attachment #2: HTML --]
[-- Type: text/html, Size: 2688 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-30 14:51                                             ` Karljurgen Feuerherm
@ 2011-06-30 15:22                                               ` Michael Kerpan
  2011-06-30 16:25                                               ` tlaronde
  1 sibling, 0 replies; 52+ messages in thread
From: Michael Kerpan @ 2011-06-30 15:22 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Jun 30, 2011 at 10:51 AM, Karljurgen Feuerherm
<kfeuerherm@wlu.ca> wrote:
> Thanks for this. Two notes:
>
>>Re-reading it, it's not "all ligatures" that are gone with
> "Unicode-compliant fonts", but it spoke about the em- and en-dashes and
> double quotes. So on these ones, I plead guilty.
>
> Alright. Not a big deal, it seems to me.
Unless XeTeX has changed since I last used it, traditional TeX
ligatures ARE still there if you load the fonts properly. I always
used traditional TeX punctuation pseudo-ligatures when I used XeTeX
because, unlike accents, which are easy to type and not used very
frequently, dashes and quotation marks are frequently used and hard to
type. Thus shortcuts are welcome.

Mike



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-30 14:51                                             ` Karljurgen Feuerherm
  2011-06-30 15:22                                               ` Michael Kerpan
@ 2011-06-30 16:25                                               ` tlaronde
  2011-06-30 16:31                                                 ` erik quanstrom
  1 sibling, 1 reply; 52+ messages in thread
From: tlaronde @ 2011-06-30 16:25 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Jun 30, 2011 at 10:51:53AM -0400, Karljurgen Feuerherm wrote:
> [...]
>
> >But starting with "modern fonts", "modern system", "archaic" and the
> like, it's like starting with: "only Adolf Hitler would still use
> non-Unicode fonts".
>
> Looking here: http://scripts.sil.org/cms/scripts/page.php?item_id=xetex_faq I cannot find this; you'll have to help me out.
> [...]

It was not about the XeTeX FAQ this time, it was about this thread.

When I first said: OK, I take the bull by the horns, I will redo from
scratch a TeX distribution, I heard: "current TeX on Plan9, even
if obsolete, is enough..."

When I announced the job was done with the core of TeX, answer: "Nobody
uses TeX: everybody uses LaTeX; so it is almost useless." [This is my
special favorite!]

When I saw that the TFMs provided with recent TeX distributions provide
latin1 glyphs but not at the latin1 (i.e. Unicode) positions, I decided
it was a historical artefact and was inconsistent. Then came my first
message and the avalanche about "teaching TeX _modern_ fonts" etc.

For me _these_ arguments about modern, archaic etc. are Godwin points.
A vast majority of contemporary mathematicians could read the "archaic"
Euclid to learn, for example, that Euclid has never written that "a line
is composed of points" even less "a line is composed of an infinity of
points".  And they should confer this with the fifth book. Because if
the Greeks have not said that, there is probably a reason why...

So back to the technical side: as far as TeX is concerned, there is input
(provided by a user, normally) leading to layout rendered by a dvi
driver. A font interacts with the user input by providing some
facilities (ligatures); since these facilities can be added to TeX view
of the font (TFM), without even changing the fonts as viewed by the
drivers, I don't see why they should be discarded.
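
As an illustration of ligatures living entirely in TeX's view of the
font, here is a minimal sketch in C. The struct, the table and the
function name applyligs are hypothetical. The slot values are those
recalled from the Computer Modern text layout (cmr10) and should be
checked against the font. Real TeX runs the ligature program stored in
the TFM on characters of the current font, not on raw input bytes, so
this is only a simplification of the idea.

```c
#include <stddef.h>

/* One ligature rule: when `next` follows `cur`, both become `lig`. */
struct Lig {
	unsigned char cur, next, lig;
};

/* Assumed slots, as recalled from cmr10; verify before relying on them. */
static const struct Lig ligs[] = {
	{ '-',  '-',  0x7B },	/* -- gives the en-dash */
	{ 0x7B, '-',  0x7C },	/* en-dash followed by - gives the em-dash */
	{ '`',  '`',  0x5C },	/* `` gives the opening double quote */
	{ '\'', '\'', 0x22 },	/* '' gives the closing double quote */
	{ 'f',  'i',  0x0C },	/* f followed by i gives the fi ligature */
};

/*
 * Apply the rules left to right, in place, on n character codes.
 * A ligature result may combine again with the following character.
 * Returns the new length.
 */
static size_t
applyligs(unsigned char *s, size_t n)
{
	size_t in, out, k;
	const size_t nligs = sizeof ligs / sizeof ligs[0];

	if(n == 0)
		return 0;
	out = 0;
	for(in = 1; in < n; in++){
		for(k = 0; k < nligs; k++)
			if(ligs[k].cur == s[out] && ligs[k].next == s[in])
				break;
		if(k < nligs)
			s[out] = ligs[k].lig;	/* merge the pair */
		else
			s[++out] = s[in];	/* no rule: keep the character */
	}
	return out + 1;
}
```

The dvi driver only ever sees the resulting slot (0x7C for "---" in this
sketch), so the font as viewed by the drivers is unchanged.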

Furthermore, I don't see why some special glyphs put, in plain
TeX conventions, in ASCII control positions should not be added to
TeX's view of a font (TFM).

I've read in a hurry the directory layout of XeTeX, the WEB change file
and the FAQ, just in order to have an idea about what was going on and
a rough idea of the work needed to import it in kerTeX. Hence my mistake
about believing "modern fonts" have thrown away _every ligature_.
I'm relieved to see that I was wrong on this one.

But I would probably have read the whole more coolly if people had not
used some arguments.

I don't despise XeTeX. Nor Unicode. And I will take Unicode as is. But I
will take TeX conventions as is too, since I'm working on TeX, and not
another formatting system; since these conventions are confined to the
ASCII subrange and only diverging from ASCII for the non-glyph
positions. I still fail to see what's the big deal?

--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-30 16:25                                               ` tlaronde
@ 2011-06-30 16:31                                                 ` erik quanstrom
  2011-06-30 17:00                                                   ` tlaronde
  0 siblings, 1 reply; 52+ messages in thread
From: erik quanstrom @ 2011-06-30 16:31 UTC (permalink / raw)
  To: 9fans

> I don't despise XeTeX. Nor Unicode. And I will take Unicode as is. But I
> will take TeX conventions as is too, since I'm working on TeX, and not
> another formatting system; since these conventions are confined to the
> ASCII subrange and only diverging from ASCII for the non-glyph
> positions. I still fail to see what's the big deal?

you can't have it both ways.  you can't at the same time say tex is
only defined for ascii, so utf-8 is a non sequitur, and at the same time
put out a version of tex that takes latin1 input.

the question is, should you use latin1 or utf-8.  and i think the answer
to this for plan 9 is pretty clear.  use utf-8.  a subset would be much
better than latin1.

the fact that there is a latin2 is proof that latin1 is misguided in ways
that utf-8 does fix.

- erik

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-30 16:31                                                 ` erik quanstrom
@ 2011-06-30 17:00                                                   ` tlaronde
  2011-06-30 17:12                                                     ` tlaronde
  0 siblings, 1 reply; 52+ messages in thread
From: tlaronde @ 2011-06-30 17:00 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Jun 30, 2011 at 12:31:17PM -0400, erik quanstrom wrote:
> > I don't despise XeTeX. Nor Unicode. And I will take Unicode as is. But I
> > will take TeX conventions as is too, since I'm working on TeX, and not
> > another formatting system; since these conventions are confined to the
> > ASCII subrange and only diverging from ASCII for the non-glyph
> > positions. I still fail to see what's the big deal?
>
> you can't have it both ways.  you can't at the same time say tex is
> only defined for ascii, so utf-8 is a non sequitur, and at the same time
> put out a version of tex that takes latin1 input.

No, this is an error you and others are making.

There is a distinction between the input encoding (for the moment TeX
expects only 8 bits), and some conventions in the font organization.

The Computer Modern fonts provide ASCII "visible" characters (glyphs)
in the ASCII positions. But there are other positions in the 0-127 range
that are free. These positions are used "internally" by the plain TeX
conventions (TeX is the compiler/interpreter; tex(1) is the interpreter
having loaded a special set of conventions, the ones of plain TeX; one
can do almost totally without or totally differently). These free
(as far as a font is concerned) positions are filled with non-ASCII
characters/glyphs. For example, in the text font layout, the 0x1a
position has the glyph for the \ae. If a user, using plain TeX,
specifies \ae, the TFM constructed will give the correct metrics for the
glyph, and the dvi driver will put the correct glyph.

This does not preclude the user from directly entering the unicode
codepoint: in the TFM, if you want, the glyph information is duplicated,
in the conventional plain TeX position, and as a literal in the unicode
position.

In this case, the plain TeX convention is accessed either by the \ae
char definition, the 0x1a code (ASCII control "sub"), or the 0x00e6
unicode.

This is not the input encoding; this is a font mapping.
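
A minimal sketch in C of that duplication, under stated assumptions:
the struct, the table and the function glyphfor are hypothetical names,
the glyph names are the AFM-style ones ("ae", "a"), and 0x1A is the slot
plain TeX uses for \ae. Two different character codes resolve to the
same glyph, which is a property of the font mapping and says nothing
about how the input was encoded.

```c
#include <stddef.h>

/* One slot in TeX's view of the font: a character code and a glyph name. */
struct Slot {
	unsigned int code;
	const char *glyph;	/* glyph name as found in an AFM file */
};

/* Hypothetical excerpt of a font mapping with a duplicated glyph. */
static const struct Slot slots[] = {
	{ 0x1A, "ae" },	/* plain TeX convention: \ae is \char"1A */
	{ 0xE6, "ae" },	/* Unicode (and latin1) position of the same letter */
	{ 0x61, "a" },	/* ordinary ASCII position */
};

/* Return the glyph name for a character code, or NULL if the slot is empty. */
static const char *
glyphfor(unsigned int code)
{
	size_t i;

	for(i = 0; i < sizeof slots / sizeof slots[0]; i++)
		if(slots[i].code == code)
			return slots[i].glyph;
	return NULL;
}
```

Here glyphfor(0x1A) and glyphfor(0xE6) both name "ae", so \ae and a
literal U+00E6 end up selecting the same glyph and the same metrics.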
--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [9fans] [RFC] fonts and unicode/utf [TeX]
  2011-06-30 17:00                                                   ` tlaronde
@ 2011-06-30 17:12                                                     ` tlaronde
  0 siblings, 0 replies; 52+ messages in thread
From: tlaronde @ 2011-06-30 17:12 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Jun 30, 2011 at 07:00:48PM +0200, tlaronde wrote:
>
> This does not preclude the user from directly entering the unicode
> codepoint: in the TFM, if you want, the glyph information is duplicated,
> in the conventional plain TeX position, and as a literal in the unicode
> position.

More precisely (I hope), since the "latin minuscule ae" is duplicated
too in the conventional TeX positions (overwriting only ASCII range
control positions that are useless), the plain TeX conventions work,
while user can also enter directly the unicode for this.
--
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C




^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2011-06-30 17:12 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-16 12:17 [9fans] [RFC] fonts and unicode/utf [TeX] tlaronde
2011-06-16 16:49 ` Russ Cox
2011-06-16 17:37   ` tlaronde
2011-06-16 18:43     ` Bakul Shah
2011-06-16 19:20       ` tlaronde
2011-06-16 17:43 ` tlaronde
2011-06-17 14:18 ` Joel C. Salomon
2011-06-17 15:37   ` tlaronde
2011-06-17 18:07     ` Joel C. Salomon
2011-06-17 18:37       ` tlaronde
2011-06-19 14:21     ` erik quanstrom
2011-06-19 14:07 ` erik quanstrom
2011-06-19 16:34   ` tlaronde
2011-06-19 18:01     ` tlaronde
2011-06-19 22:38     ` erik quanstrom
2011-06-20 11:18       ` tlaronde
2011-06-20 21:53         ` erik quanstrom
2011-06-21 10:56           ` tlaronde
2011-06-24 23:05             ` Mauricio CA
2011-06-25  6:50               ` tlaronde
2011-06-25 12:19                 ` erik quanstrom
2011-06-25 15:03                   ` tlaronde
2011-06-25 15:11                     ` erik quanstrom
2011-06-25 16:33                       ` tlaronde
2011-06-25 16:34                     ` Mauricio CA
2011-06-25 17:11                       ` tlaronde
2011-06-25 18:43                         ` Michael Kerpan
2011-06-26  7:57                           ` tlaronde
2011-06-27  1:01                             ` Michael Kerpan
2011-06-27 11:48                               ` tlaronde
2011-06-27 12:36                                 ` erik quanstrom
2011-06-27 14:38                                   ` Karljurgen Feuerherm
2011-06-27 17:20                                   ` tlaronde
2011-06-27 17:34                                     ` erik quanstrom
2011-06-27 18:01                                       ` tlaronde
2011-06-27 21:17                                         ` Michael Kerpan
2011-06-28 11:25                                           ` tlaronde
2011-06-27 23:45                                     ` Karljurgen Feuerherm
2011-06-27 23:48                                       ` erik quanstrom
2011-06-28 11:19                                       ` tlaronde
2011-06-28 11:32                                         ` tlaronde
2011-06-28 12:16                                         ` erik quanstrom
2011-06-29 23:43                                         ` Karljurgen Feuerherm
2011-06-30 13:02                                           ` tlaronde
2011-06-30 13:14                                             ` erik quanstrom
2011-06-30 13:47                                               ` tlaronde
2011-06-30 14:51                                             ` Karljurgen Feuerherm
2011-06-30 15:22                                               ` Michael Kerpan
2011-06-30 16:25                                               ` tlaronde
2011-06-30 16:31                                                 ` erik quanstrom
2011-06-30 17:00                                                   ` tlaronde
2011-06-30 17:12                                                     ` tlaronde

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).