ntg-context - mailing list for ConTeXt users
From: Arthur Rosendahl <arthur.reutenauer@normalesup.org>
To: Mailing list for ConTeXt users <ntg-context@ntg.nl>
Subject: Re: Hyphenation patterns
Date: Fri, 9 Apr 2021 23:57:56 +0200
Message-ID: <20210409215756.7kjmlqioq75yxahp@phare.normalesup.org>
In-Reply-To: <996CFED2-BF77-4D29-B09E-81395ECF3939@fiee.net>

  Denis’ latest question reminded me of an earlier query he had about
hyphenation, asking why “applicable” and “obligated” were hyphenated by
ConTeXt as ap-plic-a-ble and ob-lig-at-ed, and not ap-pli-ca-ble and
ob-li-ga-te(d) as in Merriam-Webster (the discussion started at
https://mailman.ntg.nl/pipermail/ntg-context/2020/099695.html).

  First of all, I note that while Webster’s dictionary is a useful
guide, and indeed a major reference for any American typographer,
there’s no absolute rule that we have to follow it.  The break
applic-able, for example, does look acceptable to me; oblig-ated, less
so.

  Taco reminded us that when producing a set of hyphenation patterns
from a list of hyphenated words, we’re essentially compressing
information, and that some minor deviations are to be expected.
However, in my experience, unexpected breakpoints are almost never due
to chance, but rather to a deliberate decision.

  Then Hraban said that:

On Fri, Oct 09, 2020 at 10:15:17AM +0200, Henning Hraban Ramm wrote:
> Usually Arthur’s (hail the emperor of hyphenation and protector of the patterns) patterns are flawless, so I guess it’s not a bug but an exception of the rules.

  I see that my self-appointed title is catching on, nice :-)
Unfortunately the patterns are just as likely to contain errors as
anything else, and in this particular case we’ll probably never know for
sure, because the original hyphenated word list was never published (all
the word lists from which patterns were produced in the 80s and 90s have
been lost, for all languages).  We’re thus reduced to guessing the
intent of those who compiled the lists.

  We can get hints from looking at the patterns involved in the
debatable breaks.  Hans has a useful script:

	$ mtxrun --script patterns --language=us --left=2 --right=2 --hyphenate applicable
	hyphenator      |
	hyphenator      | . a p p l i c a b l e .   . a p p l i c a b l e .  
	hyphenator      |    4p1p0                   0 4 1 0 0 0 0 0 0 0 0  
	hyphenator      |      1p2l2                 0 4 1 2 2 0 0 0 0 0 0  
	hyphenator      |      0p0l0i2c1a0b0         0 4 1 2 2 2 1 0 0 0 0  
	hyphenator      |            1c0a0           0 4 1 2 2 2 1 0 0 0 0  
	hyphenator      |            0c0a1b0l0       0 4 1 2 2 2 1 1 0 0 0  
	hyphenator      |                0b2l2       0 4 1 2 2 2 1 1 2 2 0  
	hyphenator      |                0b4l0e0.0   0 4 1 2 2 2 1 1 4 2 0  
	hyphenator      | .0a4p1p2l2i2c1a1b4l2e0.   . a p-p l i c-a-b l e .  
	hyphenator      |
	mtx-patterns    | us 2 2 : applicable : ap-plic-a-ble

  That tells us that seven patterns are involved in hyphenating the word
applicable: 4p1p, 1p2l2, pli2c1ab, 1ca, ca1bl, b2l2, and b4le. (where
the final dot is part of the pattern itself).  The pattern responsible
for the break applic-able is pli2c1ab.  If we now refer to the source
repository for hyphenation patterns (since comments are stripped in the
ConTeXt sources):
https://github.com/hyphenation/tex-hyphen/blob/master/hyph-utf8/tex/generic/hyph-utf8/patterns/tex/hyph-en-us.tex
we can see at line 4508:

	hyphen.tex patterns end here, and additional patterns begin:

which means that the pattern pli2c1ab, line 4817, is an “additional
pattern”.  The background story is that hyphen.tex, the original
hyphenation pattern file for American English, produced in 1982-1983
from a list of hyphenated words (following mostly Webster’s), was later
augmented with more patterns that were supposed to improve hyphenation
for many words.  The person who added these new patterns apparently had
a list of words hyphenated incorrectly (according to him) by hyphen.tex,
but both that list and the one used to produce hyphen.tex are, as
mentioned above, now lost, probably forever.
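
  For readers who want to check the arithmetic in the trace above, here
is a minimal Python sketch of how the pattern values combine: at each
position between letters the highest value wins, and an odd final value
allows a break.  It is not the code that mtxrun runs; the function names
(parse_pattern, hyphenate) are made up for this illustration, and the
pattern list is only the seven patterns reported by the trace, not the
full American English set.

```python
import re


def parse_pattern(pattern):
    """Split a pattern such as 'pli2c1ab' into its letters and its values.

    values[i] is the number standing in front of letters[i]; positions
    without a digit get 0.  A dot counts as a letter (word boundary).
    """
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    position = 0
    for char in pattern:
        if char.isdigit():
            values[position] = int(char)
        else:
            position += 1
    return letters, values


def hyphenate(word, patterns, left=2, right=2):
    """Apply the given patterns to one word and return it with hyphens."""
    padded = "." + word + "."
    levels = [0] * (len(padded) + 1)  # levels[i] sits in front of padded[i]
    for pattern in patterns:
        letters, values = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, value in enumerate(values):
                index = start + offset
                levels[index] = max(levels[index], value)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    # Only consider breaks that leave `left` and `right` letters on each side.
    for k in range(left, len(word) - right + 1):
        if levels[k + 1] % 2 == 1:  # odd value: break allowed before word[k]
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return "-".join(pieces)


# The seven patterns listed in the trace for "applicable".
TRACE_PATTERNS = ["4p1p", "1p2l2", "pli2c1ab", "1ca", "ca1bl", "b2l2", "b4le."]

print(hyphenate("applicable", TRACE_PATTERNS))  # ap-plic-a-ble

# Same word with the "additional pattern" left out.
without_extra = [p for p in TRACE_PATTERNS if p != "pli2c1ab"]
print(hyphenate("applicable", without_extra))   # ap-pli-ca-ble
```

  With pli2c1ab left out, the remaining six patterns give the Webster
break in this sketch, which is consistent with that one pattern being
responsible for applic-able.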

  In any case, the pattern that causes the break applic-able was clearly
added intentionally; and as I said that break seems quite reasonable to
me.  Not so for the one in oblig-ated, so let’s have a look at that:

	$ mtxrun --script patterns --language=us --left=2 --right=2 --hyphenate obligated
	hyphenator      |
	hyphenator      | . o b l i g a t e d .   . o b l i g a t e d .  
	hyphenator      |  0o0b0l0i2g1             0 0 0 0 2 1 0 0 0 0  
	hyphenator      |    0b2l2                 0 0 2 2 2 1 0 0 0 0  
	hyphenator      |      5l0i0g0a0t0e0       0 0 5 2 2 1 0 0 0 0  
	hyphenator      |        2i0g0             0 0 5 2 2 1 0 0 0 0  
	hyphenator      |          1g0a0           0 0 5 2 2 1 0 0 0 0  
	hyphenator      |              2t1e0d0     0 0 5 2 2 1 2 1 0 0  
	hyphenator      | .0o0b5l2i2g1a2t1e0d0.   . o b-l i g-a t-e d .  
	hyphenator      |
	mtx-patterns    | us 2 2 : obligated : ob-lig-at-ed

  Here we see that the dubious break is caused by the pattern obli2g1,
also an “additional pattern” (line 4783), and here it’s not hard to
guess where it comes from: it has to be for the word obligatory,
hyphenated regularly as o-blig-a-to-ry according to M-W -- and myself ;-)
The incorrect breakpoint in oblig-ated is an undesired side effect of
that.
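
  The bookkeeping in the trace for obligated can be replayed in a few
lines of Python to see what obli2g1 changes.  This is only a sketch
under stated assumptions: the helper names (levels_for, breaks) are
hypothetical, the pattern list is limited to the six patterns shown in
the trace, and it assumes the trace lists every pattern that matches the
word.

```python
import re


def levels_for(word, patterns):
    """Return the winning value in front of each character of '.word.'."""
    padded = "." + word + "."
    levels = [0] * (len(padded) + 1)
    for pattern in patterns:
        letters = re.sub(r"\d", "", pattern)
        # One entry per gap around the letters; a missing digit counts as 0.
        digits = [int(d or 0) for d in re.split(r"[^\d]", pattern)]
        start = padded.find(letters)
        while start != -1:
            for i, digit in enumerate(digits):
                levels[start + i] = max(levels[start + i], digit)
            start = padded.find(letters, start + 1)
    return levels


def breaks(word, patterns, left=2, right=2):
    """Hyphenate word: a break is allowed wherever the final value is odd."""
    levels = levels_for(word, patterns)
    cuts = [k for k in range(left, len(word) - right + 1) if levels[k + 1] % 2]
    parts = [word[i:j] for i, j in zip([0] + cuts, cuts + [len(word)])]
    return "-".join(parts)


# The six patterns listed in the trace for "obligated".
OBLIGATED_PATTERNS = ["obli2g1", "b2l2", "5ligate", "2ig", "1ga", "2t1ed"]

print(breaks("obligated", OBLIGATED_PATTERNS))  # ob-lig-at-ed

# Drop the additional pattern and apply the remaining five.
remaining = [p for p in OBLIGATED_PATTERNS if p != "obli2g1"]
print(breaks("obligated", remaining))           # ob-li-gat-ed
```

  In this sketch the break oblig-ated disappears once obli2g1 is
removed, because 1ga is then free to allow a break before the g.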

	Best,

		ArthuR
