Raw UTF-8?

discuss@mandoc.bsd.lv

* Raw UTF-8?
@ 2010-07-07  3:13 Anthony J. Bentley
  2010-07-07  9:33 ` Kristaps Dzonsons
  2010-07-07 18:58 ` Ingo Schwarze
  0 siblings, 2 replies; 14+ messages in thread
From: Anthony J. Bentley @ 2010-07-07  3:13 UTC (permalink / raw)
  To: discuss

Hey guys,

When using special characters in manpages, I use plain UTF-8 instead of
the escapes documented in mandoc_char(7), for a couple reasons. I'm just
wondering, is this practice discouraged in any way? Is there a chance
of this _not_ working in future versions of mandoc?

--
Thanks,
Anthony J. Bentley
--
 To unsubscribe send an email to discuss+unsubscribe@mdocml.bsd.lv


* Re: Raw UTF-8?
  2010-07-07  3:13 Raw UTF-8? Anthony J. Bentley
@ 2010-07-07  9:33 ` Kristaps Dzonsons
  2010-07-07 14:39   ` Anthony J. Bentley
  2010-07-07 18:58 ` Ingo Schwarze
  1 sibling, 1 reply; 14+ messages in thread
From: Kristaps Dzonsons @ 2010-07-07  9:33 UTC (permalink / raw)
  To: discuss

> When using special characters in manpages, I use plain UTF-8 instead of
> the escapes documented in mandoc_char(7), for a couple reasons. I'm just
> wondering, is this practice discouraged in any way? Is there a chance
> of this _not_ working in future versions of mandoc?

This is being discussed on tech@ right now.

Currently, once you use any non-ASCII encoding, the manual is no longer 
accessible to all terminals.  This is bad.  Furthermore, -Tps will throw
away your input.  This is more bad.  In fact, only -Thtml will be ok 
with what you do, which is only by dint of it using the same output 
encoding.

groff promises Unicode support in "the next major version".  According 
to their mailing lists, they plan on using \[uNNN] for a Unicode escape 
and on-the-fly translate input UTF-8 into Unicode (effectively using 
"int" instead of "char" for characters).

     http://www.mail-archive.com/groff@gnu.org/msg01378.html
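
As a rough illustration of that on-the-fly translation, here is a minimal
Python sketch that decodes UTF-8 input into integer code points and writes
each non-ASCII one as a Unicode escape.  The function name is hypothetical,
and the four-hex-digit \[uXXXX] form is an assumption about the planned
syntax; this is not groff's actual code.

```python
def utf8_to_roff_escapes(raw: bytes) -> str:
    """Decode UTF-8 bytes and escape every non-ASCII code point."""
    out = []
    for ch in raw.decode("utf-8"):
        cp = ord(ch)  # an integer code point ("int"), not a byte ("char")
        if cp < 0x80:
            out.append(ch)  # plain ASCII passes through unchanged
        else:
            # Assumed escape form: backslash, [u, at least four hex digits, ]
            out.append("\\[u%04X]" % cp)
    return "".join(out)


# Usage: the o-with-diaeresis (U+00F6) becomes an ASCII-only escape.
print(utf8_to_roff_escapes("J\u00f6rg".encode("utf-8")))  # J\[u00F6]rg
```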

I think it's best for the time being to lift the input warnings and 
document that non-ASCII characters will Balkanise the manual.  I'm 
flapping between warning about it and not warning.

What, by the way, are the reasons you have against using the mandoc_char 
escapes?

* Re: Raw UTF-8?
  2010-07-07  9:33 ` Kristaps Dzonsons
@ 2010-07-07 14:39   ` Anthony J. Bentley
  2010-07-07 20:13     ` Ingo Schwarze
  0 siblings, 1 reply; 14+ messages in thread
From: Anthony J. Bentley @ 2010-07-07 14:39 UTC (permalink / raw)
  To: discuss; +Cc: Kristaps Dzonsons

> What, by the way, are the reasons you have against using the mandoc_char 
> escapes?

Mostly it's just less to memorize. Much easier to remember my compose key's
alt / o, as opposed to \(/o or \o or &oslash; or \u00F8  depending on where
I'm working at the moment.

mandoc_char(7) isn't much help either, as it's several pages long and
doesn't display the actual character in the terminal, so I have to take
a minute or so just to find the one character...

--
Thanks,
Anthony J. Bentley

* Re: Raw UTF-8?
  2010-07-07  3:13 Raw UTF-8? Anthony J. Bentley
  2010-07-07  9:33 ` Kristaps Dzonsons
@ 2010-07-07 18:58 ` Ingo Schwarze
  2010-07-07 19:18   ` Joerg Sonnenberger
  1 sibling, 1 reply; 14+ messages in thread
From: Ingo Schwarze @ 2010-07-07 18:58 UTC (permalink / raw)
  To: discuss

Hi Anthony,

> When using special characters in manpages,

I consider that a terrible idea.  In a nutshell, such manuals are
useless on terminals.  If some piece of information is important, you
should really encode it such that all readers can see it.  If it is
unimportant, just leave it out instead of obfuscating it, which will
make some people wonder whether they are missing anything.

We should probably add a warning to discourage people from using
characters needing more than ASCII on output, saying something
like "this manual is not portable and will not display correctly
in some environments".

From my point of view, non-ASCII-output escape sequences are only
supported for backward compatibility with legacy manuals, and displaying
something semi-sensible in their place is done on a best-effort basis,
knowing that it is ultimately unreliable.  Using such escape sequences
in new mdoc(7) source code, you would only show that you don't care
about the usability of your manuals.

For the occasional proper name of an author, use transliteration
to ASCII.  I consider using non-ASCII-output escape sequences in
there a discourtesy with respect to the author, because then some
people will not be able to read the name.

> I use plain UTF-8 instead of the escapes documented in mandoc_char(7),
> for a couple reasons.  I'm just wondering, is this practice
> discouraged in any way?

Yes.  Eight-Bit characters in roff, man and mdoc source code are syntax
errors, just like they are in C and in any sane programming language.
The current implementation passes them through, but it could as well
throw them away, or abort the parser, subject to change without notice.
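
A manual source can be checked for such bytes before it ever reaches a
formatter.  The following is a minimal Python sketch, not part of mandoc;
the function name is hypothetical, and it simply reports every byte above
0x7F by line and column.

```python
def find_non_ascii(source: bytes):
    """Return 1-based (line, column) pairs for every byte above 0x7F."""
    hits = []
    for lineno, line in enumerate(source.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte > 0x7F:  # eight-bit byte: not portable roff input
                hits.append((lineno, col))
    return hits


# Usage: the two UTF-8 bytes of the o-with-diaeresis are both reported.
print(find_non_ascii(b".An J\xc3\xb6rg\n.An Joerg\n"))  # [(1, 6), (1, 7)]
```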

> Is there a chance of this _not_ working in future versions of mandoc?

If it works, that is by mere chance, but not portable in any way,
neither between output devices, nor between platforms, nor between
different versions of mandoc.

Yours,
  Ingo

* Re: Raw UTF-8?
  2010-07-07 18:58 ` Ingo Schwarze
@ 2010-07-07 19:18   ` Joerg Sonnenberger
  2010-07-07 21:12     ` Ingo Schwarze
  0 siblings, 1 reply; 14+ messages in thread
From: Joerg Sonnenberger @ 2010-07-07 19:18 UTC (permalink / raw)
  To: discuss

On Wed, Jul 07, 2010 at 08:58:15PM +0200, Ingo Schwarze wrote:
> For the occasional proper name of an author, use transliteration
> to ASCII.  I consider using non-ASCII-output escape sequences in
> there a discourtesy with respect to the author, because then some
> people will not be able to read the name.

Actually, I would consider the reverse the correct behavior. The escape
sequences should provide the transliteration depending on the device
capabilities. Consider my name -- I would strongly hope that output
devices with proper Latin1/Latin15/UTF-8 support would use the diacritic, but
fall back to the transliterated version otherwise.

> > I use plain UTF-8 instead of the escapes documented in mandoc_char(7),
> > for a couple reasons.  I'm just wondering, is this practice
> > discouraged in any way?
> 
> Yes.  Eight-Bit characters in roff, man and mdoc source code are syntax
> errors, just like they are in C and in any sane programming language.
> The current implementation passes them through, but it could as well
> throw them away, or abort the parser, subject to change without notice.

You know that C99 just like many other modern languages (dialects) allow
full 8bit input?

The primary problem I have with using 8bit input for mandoc(1) (or groff
in general) is that it doesn't have a way to specify the input character
set. If that is addressed, the discussion would move to the more
interesting point of transliteration.

Joerg

* Re: Raw UTF-8?
  2010-07-07 14:39   ` Anthony J. Bentley
@ 2010-07-07 20:13     ` Ingo Schwarze
  0 siblings, 0 replies; 14+ messages in thread
From: Ingo Schwarze @ 2010-07-07 20:13 UTC (permalink / raw)
  To: discuss

Hi Anthony,

Anthony J. Bentley wrote on Wed, Jul 07, 2010 at 08:39:09AM -0600:

> mandoc_char(7) isn't much help either, as it's several pages long and
> doesn't display the actual character in the terminal,

See the problem?
In the manual you write, \(/o won't be displayed either,
no matter how you try to input it.

The output is the problem here, not the input.

I think we should reorganize mandoc_char(7) in the following way.

 1) First the sane escape sequences that render well everywhere
    and serve a real purpose.

    Examples: \~ \  \& \(ba \(em \(en \(hy \e \-

    The number of sane sequences is relatively small,
    which will also solve the problem you are pointing
    out above:  mandoc_char(7) is unreasonably long.
    A typesetting system like TeX needs long character
    tables, and it needs the ability to print obscure
    characters.  A manual page does not.

 2) Then a sentence explaining that what follows is rarely
    needed, because mandoc(1) is not really intended for
    general purpose typesetting, and much less typesetting
    of mathematical formulas, but just for writing manuals,
    encouraging people to express their intention using text,
    not symbols.  Still, all escape sequences in this
    section are guaranteed to render well and may be useful
    in uncommon situations.

    Examples: \(co \(rg \(-> \(rA \(+- \(<= \*(Pi

 3) Then a sentence explaining that what follows is rarely
    needed, because mdoc(7) has alternative concepts handling
    the typical use case better.  Note that writing new man(7)
    code is discouraged anyway.
    Still, all escape sequences in this section are guaranteed
    to render well and may be useful in very uncommon situations.

    Examples: \0  -- use .Dl or .Bd -literal
              \bu -- use .Bl -bullet
              \lq -- use .Qq or .Qo
              \lB -- use .Bq or .Bc

 4) Then a sentence mildly discouraging the use of what follows,
    listing obsolete sequences that render well, but are not
    needed at all because equivalent recommended escape sequences
    exist.

    Examples: \^ \% \| \(at \(mi \(eq \*(Ba

 5) Then a sentence strongly discouraging any use of the rest,
    listing those escape sequences that render badly.

    Examples: \(r! \(sr \(ss \('e \(`u \(~n \(:a \(^o \(,c \(oa \(eu

    Obviously, this is by far the largest group.
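
    Such a grouping could also drive a simple lint.  The Python sketch
    below is only an illustration: the table holds just the two-character
    \(xx examples listed above, the group numbers follow this list, the
    names are hypothetical, and the regular expression ignores the other
    roff escape forms.

```python
import re

# Group number per two-character special-character name, taken only from
# the examples in the list above (1 = sane ... 5 = renders badly).
ESCAPE_GROUP = {
    "ba": 1, "em": 1, "en": 1, "hy": 1,
    "co": 2, "rg": 2, "->": 2, "rA": 2, "+-": 2, "<=": 2,
    "at": 4, "mi": 4, "eq": 4,
    "r!": 5, "sr": 5, "ss": 5, "'e": 5, "`u": 5, "~n": 5,
    ":a": 5, "^o": 5, ",c": 5, "oa": 5, "eu": 5,
}

# Matches only the classic two-character form: backslash, "(", two chars.
TWO_CHAR_ESCAPE = re.compile(r"\\\((..)")


def discouraged_escapes(text: str):
    """Return the group-5 escapes found in text, in order of appearance."""
    names = TWO_CHAR_ESCAPE.findall(text)
    return ["\\(" + name for name in names if ESCAPE_GROUP.get(name) == 5]


# Usage: \(em is fine, the other two are flagged.
print(discouraged_escapes(r"Stra\(sse \(em caf\('e"))
```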

Yours,
  Ingo

* Re: Raw UTF-8?
  2010-07-07 19:18   ` Joerg Sonnenberger
@ 2010-07-07 21:12     ` Ingo Schwarze
  2010-07-07 21:17       ` Joerg Sonnenberger
  0 siblings, 1 reply; 14+ messages in thread
From: Ingo Schwarze @ 2010-07-07 21:12 UTC (permalink / raw)
  To: discuss

Hi Joerg,

Joerg Sonnenberger wrote on Wed, Jul 07, 2010 at 09:18:08PM +0200:

> Consider my name -- I would strongly hope that output devices with
> proper Latin1/Latin15/UTF-8 support would use the diacritic, but fall
> back to the transliterated version otherwise.

You hope in vain.  Did you try?

Both old and new groff render that as 'J"\borg Sonnenberger',
which looks like "Jorg Sonnenberger" on a typical terminal.

Maybe the reason for using the unreliable backspace-encoding
variant instead of the transliteration "oe" is that more languages
than just German might use the "LATIN SMALL LETTER O WITH DIAERESIS",
as Unicode calls it, and who knows what a good transliteration from
those languages into ASCII might look like?

The point is, for correct results, you must transliterate before
encoding, when you still know the context, e.g. the language,
which is often required to figure out a correct transliteration.

Thus, you should really use

.An Joerg Sonnenberger

and never

.An J\(:org Sonnenberger

when documenting your programs.

> You know that C99 just like many other modern languages (dialects)
> allow full 8bit input?

I know that some do, and i have fought with Python code garbled
in that way, and all the more do i call it insane.

> The primary problem I have with using 8bit input for mandoc(1) (or groff
> in general) is that it doesn't have a way to specify the input character
> set. If that is addressed, the discussion would move to the more
> interesting point of transliteration.

In my experience, as soon as you start dealing with character sets,
chaos ensues.  WTF has made matters worse, not better, because now
many people think it is OK to scatter crap all over the place.
In typesetting, the mentioned chaos is unfortunately unavoidable,
and you need to deal with it; but most of the time, it is also easier
to handle there because in most typesetting environments, you deal
with one language at a time, and you know beforehand with which one.

Unless we enjoy pain, bloat and code obfuscation *and* want to be
continuously distracted from serious development, we should keep
mandoc as far away from any kind of charset considerations as
possible.

Yours,
  Ingo

* Re: Raw UTF-8?
  2010-07-07 21:12     ` Ingo Schwarze
@ 2010-07-07 21:17       ` Joerg Sonnenberger
  2010-07-09 21:05         ` Ulrich Spörlein
  0 siblings, 1 reply; 14+ messages in thread
From: Joerg Sonnenberger @ 2010-07-07 21:17 UTC (permalink / raw)
  To: discuss

On Wed, Jul 07, 2010 at 11:12:12PM +0200, Ingo Schwarze wrote:
> Hi Joerg,
> 
> Joerg Sonnenberger wrote on Wed, Jul 07, 2010 at 09:18:08PM +0200:
> 
> > Consider my name -- I would strongly hope that output devices with
> > proper Latin1/Latin15/UTF-8 support would use the diacritic, but fall
> > back to the transliterated version otherwise.
> 
> You hope in vain.  Did you try?

Yes. lp(1) on NetBSD is such an example. It does the right thing with
groff. Depending on the output device (-Tlatin1 vs -Tascii), it will
either use the umlaut for \(:o or the oe transliteration.

Joerg

* Re: Raw UTF-8?
  2010-07-07 21:17       ` Joerg Sonnenberger
@ 2010-07-09 21:05         ` Ulrich Spörlein
  2010-07-10 18:11           ` J.C. Roberts
  2010-07-11 22:38           ` Kristaps Dzonsons
  0 siblings, 2 replies; 14+ messages in thread
From: Ulrich Spörlein @ 2010-07-09 21:05 UTC (permalink / raw)
  To: discuss

On Wed, 07.07.2010 at 23:17:25 +0200, Joerg Sonnenberger wrote:
> On Wed, Jul 07, 2010 at 11:12:12PM +0200, Ingo Schwarze wrote:
> > Hi Joerg,
> > 
> > Joerg Sonnenberger wrote on Wed, Jul 07, 2010 at 09:18:08PM +0200:
> > 
> > > Consider my name -- I would strongly hope that output devices with
> > > proper Latin1/Latin15/UTF-8 support would use the diacritic, but fall
> > > back to the transliterated version otherwise.
> > 
> > You hope in vain.  Did you try?
> 
> Yes. lp(1) on NetBSD is such an example. It does the right thing with
> groff. Depending on the output device (-Tlatin1 vs -Tascii), it will
> either use the umlaut for \(:o or the oe transliteration.

This also works fine with FreeBSD's groff when rendering to UTF-8 aware
terminals using -Tutf8 (and of course in -Tps and -Thtml mode).

I really hope the sentiment expressed in this thread is in jest, as I
would stop considering mandoc(1) a viable alternative for FreeBSD's man
subsystem if it will never support UTF-8 output (and then render \(:o as
ö like it should).

Uli

* Re: Raw UTF-8?
  2010-07-09 21:05         ` Ulrich Spörlein
@ 2010-07-10 18:11           ` J.C. Roberts
  2010-07-11 22:17             ` Ingo Schwarze
  2010-07-11 22:38           ` Kristaps Dzonsons
  1 sibling, 1 reply; 14+ messages in thread
From: J.C. Roberts @ 2010-07-10 18:11 UTC (permalink / raw)
  To: discuss; +Cc: Ulrich Spörlein

On Fri, 9 Jul 2010 22:05:39 +0100 Ulrich Spörlein <uqs@spoerlein.net>
wrote:
>
> On Wed, 07.07.2010 at 23:17:25 +0200, Joerg Sonnenberger wrote:
> > On Wed, Jul 07, 2010 at 11:12:12PM +0200, Ingo Schwarze wrote:
> > > Hi Joerg,
> > > 
> > > Joerg Sonnenberger wrote on Wed, Jul 07, 2010 at 09:18:08PM +0200:
> > > 
> > > > Consider my name -- I would strongly hope that output devices
> > > > with proper Latin1/Latin15/UTF-8 support would use the diacritic,
> > > > but fall back to the transliterated version otherwise.
> > > 
> > > You hope in vain.  Did you try?
> > 
> > Yes. lp(1) on NetBSD is such an example. It does the right thing
> > with groff. Depending on the output device (-Tlatin1 vs -Tascii),
> > it will either use the umlaut for \(:o or the oe transliteration.
> 
> This also works fine with FreeBSD's groff when rendering to UTF-8
> aware terminals using -Tutf8 (and of course in -Tps and -Thtml mode).
> 
> I really hope the sentiment expressed in this thread is in jest, as I
> would stop considering mandoc(1) a viable alternative for FreeBSD's
> man subsystem if it will never support UTF-8 output (and then render
> \(:o as ö like it should).
> 
> Uli

I doubt Ingo was joking and I do understand his concerns, but I agree
UTF-8 support is very important and many consider it a "requirement"
these days.

Personally, I think the \(:o syntax is nonsense. It's an ancient and sad
work-around addressing only one of the countless transliterations and/or
translations needed for a complete solution. If we tried to create 7-bit
strings like this for every possible transliteration and/or translation
of every non-ascii character, the list would be absolutely humongous and
computationally intractable as well as still being incomplete and often
totally inaccurate.

UTF-8 sucks less.

Since we now have UTF, it seems better to error out on the archaic \(:o
syntax to prompt change, rather than support it to prolong a bad idea
and yet another syntax everyone needs to learn.

More importantly, the real problem is the *idea* of automating
transliteration. If you think it through, you'll realize automated
transliteration cannot be completely solved. A complete solution would
require an accurate transliteration, or even translation, to ascii of
every non-ascii character, as well as doing so correctly for every
possible language/usage/context. In essence, you are asking for perfect
automated translation even when perfect manual translation can be
impossible in some situations.

Given the need to support ascii-only terminals/outputs, and given the
need to support non-ascii characters, and given fully automated
transliteration/translation is currently impossible, at first glance it
seems there is an irreconcilable conflict.

Luckily, we can look at it again.
And there is a way to resolve it.

Since we cannot solve the problem of automated transliteration (and
hence, automated translation) for all cases, the idea itself is flawed.
The best thing to do is change the problem we're trying to solve.
Instead of trying to automate the transliteration/translation of
non-ascii characters, we can impose a simple requirement.

The simplest answer would be to allow non-ascii if and only if an ascii
equivalent is provided, otherwise error. This puts both the option to
use non-ascii characters as well as the responsibility of correct
transliteration/translation in the hands of the author.

I don't mean to pick on Joerg, but names are excellent examples as well
as one of the most compelling reasons to have proper support for
non-ascii characters. A format something like:

	{ascii, utf-8}

such as:
 	J{oe, \u00F6;}rg
or
	{Joerg, J\u00F6rg}
or
	{Joerg Sonnenberger, J\u00F6rg Sonnenberger}

Ummm... no, the above is just an example, not a suggestion of syntax.
There's probably an existing IF-THEN-ELSE which could be leveraged
without undue overhead, but you get the main idea... --make the author
provide both the THEN and the ELSE.
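
To make the idea concrete, here is a minimal Python sketch of the
"both alternatives or error" rule. It is not a proposal for mandoc code:
the function name, the device names "ascii" and "utf8", and the choice of
ValueError are all assumptions made for illustration.

```python
def render_pair(ascii_text: str, utf8_text: str, device: str) -> str:
    """Pick the author-supplied alternative that the output device can show."""
    # Refuse the input outright unless a pure-ASCII equivalent is supplied.
    if not ascii_text:
        raise ValueError("no ASCII equivalent provided")
    if any(ord(ch) > 0x7F for ch in ascii_text):
        raise ValueError("ASCII equivalent contains non-ASCII characters")
    # The THEN branch: a capable device gets the native spelling.
    if device == "utf8":
        return utf8_text
    # The ELSE branch: everything else gets the author's transliteration.
    return ascii_text


# Usage: the same source yields a different name per device.
print(render_pair("Joerg", "J\u00f6rg", "ascii"))  # Joerg
```

The point of the sketch is that no transliteration table appears anywhere:
the author supplies both spellings, so the program never has to guess.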

In many situations, even when a terminal is capable of displaying the
UTF-8, it could still be beneficial to also display the ascii, possibly
in parentheses. There are plenty of idiots like me who do not know how
to pronounce or even type an "o" with a diacritic, so showing the ascii
transliterated/translated version really does help. If you saw a formal
name in Japanese, Arabic, Thai, Russian, or any language you don't know,
written in its native character set, could you pronounce or type it?
Worse yet, when it comes to the ascii-fication of non-ascii names of
people, there are tons of variations and different people have different
preferences, so the result is there is no "right" way to do it and the
best practice is to avoid offense by requiring a transcription or
translation from the person. And then there are the people who really
want to be unique (like everyone else) and intentionally (mis?)spell
their name in their own words...

http://pichaus.com/wattoom-zink-pubbawup-gazork-@911d3fd1ffb157c0e23066faca4cf751/

The state of California could not write out my full name correctly on my
driver's license, and the US Federal Government could not write out my
full name correctly on my passport, but at least the latter sent me a
nice "Sorry for the inconvenience" letter. Though I personally learned
not to care about it at an early age, most people are offended if you
get their name wrong.

Throwing an error if an ascii equivalent is not provided is fairly
harsh, but it is necessary to prevent hidden information on ascii-only
terminals/outputs and also to prevent offending anyone (by either
omission in ascii, or by misspelling the ascii equivalent).

	jcr

* Re: Raw UTF-8?
  2010-07-10 18:11           ` J.C. Roberts
@ 2010-07-11 22:17             ` Ingo Schwarze
  0 siblings, 0 replies; 14+ messages in thread
From: Ingo Schwarze @ 2010-07-11 22:17 UTC (permalink / raw)
  To: discuss

Hi Jonathan,

J.C. Roberts wrote on Sat, Jul 10, 2010 at 11:11:18AM -0700:

> UTF-8 support is very important and many consider it a "requirement"
> these days.

Not for manual pages, i just don't see the point.

That said, i wouldn't oppose a -Tlatin1 or -Tutf8 output mode merely
for groff compatibility, as long as it is not too intrusive (which
probably it need not be, implementation will be quite local in one
corner of the terminal output frontend).  But i have no plans to
implement, use or maintain it, and i would test it only in so far
as it must not break anything else.

Also, i would continue urging people to not use it, as manual pages
relying on it would sacrifice portability for no good reason.

And, of course, i will strongly oppose 8-bit-character input.
I'm not willing to deal with multi-byte or wide character support
functions anywhere in mandoc's code.

> Personally, I think the \(:o syntax is nonsense.

Agreed, the reason being that there is no reliable way to render it.
Again, i consider it provided for backward compatibility.

> Since we now have UTF, it seems better to error out on the archaic \(:o
> syntax to prompt change, rather than support it to prolong a bad idea
> and yet another syntax everyone needs to learn.

No, it is used in too many places, and it would not be nice to deny
rendering just because some piece of mdoc(7) or man(7) source code
contains syntax we don't like.  We are not defining new standards
right now, we are re-implementing an existing language.

> There's probably an existing IF-THEN-ELSE which could be leveraged
> without undue overhead, but you get the main idea... --make the author
> provide both the THEN and the ELSE.

You mean, like in
http://www.openbsd.org/cgi-bin/cvsweb/src/share/man/man4/sppp.4#rev1.20

I don't consider that viable, and jmc and I agree to remove
it from the tree when we find it (unless it comes from upstream).

I doubt that people will regularly provide alternatives.
Typing in special characters at all is already tedious,
providing alternatives will not get done.

And even if it is done, the result is incredibly ugly
and hardly maintainable.

> In many situations, even when a terminal is capable of displaying the
> UTF-8, it could still be beneficial to also display the ascii, possibly
> in parentheses.

In my opinion, you are vastly overestimating the importance of special
characters in manual pages (beware, i'm not talking about typesetting
of mathematical papers here!) and you are vastly underestimating the
importance of portability and simplicity.

Yours,
  Ingo

* Re: Raw UTF-8?
  2010-07-09 21:05         ` Ulrich Spörlein
  2010-07-10 18:11           ` J.C. Roberts
@ 2010-07-11 22:38           ` Kristaps Dzonsons
  2010-07-13 19:23             ` Ulrich Spörlein
  1 sibling, 1 reply; 14+ messages in thread
From: Kristaps Dzonsons @ 2010-07-11 22:38 UTC (permalink / raw)
  To: discuss

> This also works fine with FreeBSD's groff when rendering to UTF-8 aware
> terminals using -Tutf8 (and of course in -Tps and -Thtml mode).
> 
> I really hope the sentiment expressed in this thread is in jest, as I
> would stop considering mandoc(1) a viable alternative for FreeBSD's man
> subsystem if it will never support UTF-8 output (and then render \(:o as
> ö like it should).

I think there's a little confusion here.  I see Ingo just wrote and 
answered most questions.  Well, no point in wasting a response...

The state of affairs follows:

  - mandoc/groff accept and understand ASCII input
  - mandoc/groff [sometimes] accept but DO NOT understand non-ASCII input

That UTF-8 input renders on your screen is coincidence: you happen to 
have a UTF-8 terminal and groff hasn't puked on the characters.  You 
implicitly assume your readers' mediums have the same capabilities.

Now for the \[foo] syntax.  First, it exists.  Second, it covers most 
European characters.  Is it general?  No.  Why let it stay?  Because it 
lets \(:u be both "u" (my terminal) and ü (e.g. www output).  If you 
don't use the \[foo] escapes, you're screwing readers.  Yes, we're 
screwing non-western-European manual writers ("me") already, but this is 
not a problem we need to solve right now.

Now for output and The Good Stuff.

-Tutf8 is not hard.  I think I can manage this in coming releases 
without any negative effects.  In fact, it will cut the binary size, as 
I'd key special chars as integers and rewrite them on the fly into 
UTF-8, Latin-1, or whatever, for all outputs.
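
A minimal Python sketch of that scheme, under stated assumptions: the
table below is hypothetical and tiny, the device names are illustrative,
and the ASCII fallbacks for \(em and \(co are guesses rather than what
mandoc emits.  Only the "u" fallback for \(:u comes from this thread.

```python
# Hypothetical table: escape name -> (Unicode code point, ASCII fallback).
SPECIAL_CHARS = {
    ":u": (0x00FC, "u"),    # "u" on an ASCII terminal, as described above
    "em": (0x2014, "--"),   # fallback is an assumption
    "co": (0x00A9, "(C)"),  # fallback is an assumption
}


def render_special(name: str, device: str) -> bytes:
    """Rewrite one special character for the given output device."""
    code_point, fallback = SPECIAL_CHARS[name]
    if device == "utf8":
        return chr(code_point).encode("utf-8")
    if device == "latin1" and code_point <= 0xFF:
        return bytes([code_point])  # Latin-1 bytes equal the code points
    return fallback.encode("ascii")  # ASCII, or Latin-1 without the glyph


# Usage: one integer key, three different byte sequences.
for dev in ("utf8", "latin1", "ascii"):
    print(dev, render_special(":u", dev))
```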

Thanks,

Kristaps

* Re: Raw UTF-8?
  2010-07-11 22:38           ` Kristaps Dzonsons
@ 2010-07-13 19:23             ` Ulrich Spörlein
  2010-07-13 23:25               ` Kristaps Dzonsons
  0 siblings, 1 reply; 14+ messages in thread
From: Ulrich Spörlein @ 2010-07-13 19:23 UTC (permalink / raw)
  To: discuss

On Mon, 12.07.2010 at 00:38:33 +0200, Kristaps Džonsons wrote:
> > This also works fine with FreeBSD's groff when rendering to UTF-8 aware
> > terminals using -Tutf8 (and of course in -Tps and -Thtml mode).
> > 
> > I really hope the sentiment expressed in this thread is in jest, as I
> > would stop considering mandoc(1) a viable alternative for FreeBSD's man
> > subsystem if it will never support UTF-8 output (and then render \(:o as
> > ö like it should).
> 
> I think there's a little confusion here.  I see Ingo just wrote and 
> answered most questions.  Well, no point in wasting a response...
> 
> The state of affairs follows:
> 
>   - mandoc/groff accept and understand ASCII input
>   - mandoc/groff [sometimes] accept but DO NOT understand non-ASCII input
> 
> That UTF-8 input renders on your screen is coincidence: you happen to 
> have a UTF-8 terminal and groff hasn't puked on the characters.  You 
> implicitly assume your readers' mediums have the same capabilities.
> 
> Now for the \[foo] syntax.  First, it exists.  Second, it covers most 
> European characters.  Is it general?  No.  Why let it stay?  Because it 
> lets \(:u be both "u" (my terminal) and ü (e.g. www output).  If you 
> don't use the \[foo] escapes, you're screwing readers.  Yes, we're 
> screwing non-western-European manual writers ("me") already, but this is 
> not a problem we need to solve right now.

I completely agree here, there's nothing fancy we could or should do
regarding input.

> Now for output and The Good Stuff.
> 
> -Tutf8 is not hard.  I think I can manage this in coming releases 
> without any negative effects.  In fact, it will cut the binary size, as 
> I'd key special chars as integers and rewrite them on the fly into 
> UTF-8, Latin-1, or whatever, for all outputs.

Sounds great, do you also plan on adding "special chars" support to -Tps
(mostly for latin1 accents and umlauts)?

Regards,
Uli

* Re: Raw UTF-8?
  2010-07-13 19:23             ` Ulrich Spörlein
@ 2010-07-13 23:25               ` Kristaps Dzonsons
  0 siblings, 0 replies; 14+ messages in thread
From: Kristaps Dzonsons @ 2010-07-13 23:25 UTC (permalink / raw)
  To: discuss

>>> This also works fine with FreeBSD's groff when rendering to UTF-8 aware
>>> terminals using -Tutf8 (and of course in -Tps and -Thtml mode).
>>>
>>> I really hope the sentiment expressed in this thread is in jest, as I
>>> would stop considering mandoc(1) a viable alternative for FreeBSD's man
>>> subsystem if it will never support UTF-8 output (and then render \(:o as
>>> ö like it should).
>> I think there's a little confusion here.  I see Ingo just wrote and 
>> answered most questions.  Well, no point in wasting a response...
>>
>> The state of affairs follows:
>>
>>   - mandoc/groff accept and understand ASCII input
>>   - mandoc/groff [sometimes] accept but DO NOT understand non-ASCII input
>>
>> That UTF-8 input renders on your screen is coincidence: you happen to 
>> have a UTF-8 terminal and groff hasn't puked on the characters.  You 
>> implicitly assume your readers' mediums have the same capabilities.
>>
>> Now for the \[foo] syntax.  First, it exists.  Second, it covers most 
>> European characters.  Is it general?  No.  Why let it stay?  Because it 
>> lets \(:u be both "u" (my terminal) and ü (e.g. www output).  If you 
>> don't use the \[foo] escapes, you're screwing readers.  Yes, we're 
>> screwing non-western-European manual writers ("me") already, but this is 
>> not a problem we need to solve right now.
> 
> I completely agree here, there's nothing fancy we could or should do
> regarding input.

Yes.  Note that the problem space lies entirely within -Tps, which for 
now has hard-coded glyph widths.

> 
>> Now for output and The Good Stuff.
>>
>> -Tutf8 is not hard.  I think I can manage this in coming releases 
>> without any negative effects.  In fact, it will cut the binary size, as 
>> I'd key special chars as integers and rewrite them on the fly into 
>> UTF-8, Latin-1, or whatever, for all outputs.
> 
> Sounds great, do you also plan on adding "special chars" support to -Tps
> (mostly for latin1 accents and umlauts)?

Yes.  I want to roll it into the next release along with the chars.in 
upgrade.

Thanks,

Kristaps
