ntg-context - mailing list for ConTeXt users
From: <denis.maier@ub.unibe.ch>
To: <j.hagen@xs4all.nl>, <ntg-context@ntg.nl>
Subject: Re: Ligature suppression word list
Date: Tue, 6 Apr 2021 14:59:59 +0000
Message-ID: <02e7ba65f8ba463980ef7e63bb88a2c2@ub.unibe.ch>
In-Reply-To: <a7181820-975a-3d84-6574-a81a6524e427@xs4all.nl>



> -----Original Message-----
> From: Hans Hagen <j.hagen@xs4all.nl>
> Sent: Saturday, 3 April 2021 17:58
> To: mailing list for ConTeXt users <ntg-context@ntg.nl>; Maier, Denis
> Christian (UB) <denis.maier@ub.unibe.ch>
> Subject: Re: [NTG-context] Ligature suppression word list
> 
> On 4/3/2021 5:06 PM, denis.maier@ub.unibe.ch wrote:
> > Hi everyone
> >
> > Now that Hans has implemented the new ligature suppression mechanism
> > via language goodies - thanks again Hans! - we now need to come up
> > with wordlists.
> >
> > I've started working on a list of German words with ligatures that
> > should be suppressed. The list is derived from the word list that
> > comes with the lualatex selnolig package:
> > https://github.com/micoloretan/selnolig/blob/master/selnolig-german-wordlist.tex
> >
> > You can find the current list here:
> > https://github.com/denismaier/context-nolig-wordlist
> >
> > The list is currently organized as follows:
> >
> >  1. L. 25-35: These are words where automatic pattern matching is
> >     more difficult than usual because they contain multiple
> >     ligatures, some of which must be suppressed while others must be
> >     preserved. In the case of « Auflagefläche » it's even the same
> >     combination of letters. So here, we use the bar | to manually
> >     indicate points where no ligature must occur.
> >  2. L. 36ff.: The vast majority of words are currently in this block,
> >     which lists words where an ff, fl, fi, ffi, or ffl ligature has to
> >     be broken up after the first f.
> >  3. L. 1804ff.: Words where ffi, ffl, or fff ligatures have to be
> >     prevented after the second f, so that the first two fs form a
> >     ligature.
> >  4. The remaining blocks, starting at l. 1900, l. 2073, l. 2157,
> >     l. 2225, and l. 2277, suppress ligatures for « ft » and « fft »,
> >     « fb » and « ffb », « fh » and « ffh », « fj » and « ffj », and
> >     « fk » and « ffk ».
> >
> > Obviously, that list is far from being complete, and the question is
> > if it ever can be. Please have a look and feel free to propose more
> > words to be included - either via mail or directly on github.
> >
> > More generally, there's the question of how such a list should be
> > enhanced. I was thinking about two options:
> >
> >  1. The new language options feature includes a tracker that reports
> >     for which words in a given document ligature prevention happened,
> >     and which words haven't been touched by the mechanism. It should
> >     be possible to analyze the log file and to create lists of words
> >     with ligatures. From there it should be a rather simple step to
> >     derive new words for the ligature-suppression word list.
> >  2. A bigger solution might be to use selnolig's patterns in a script
> >     that can be run over a large corpus, such as the DWDS (Digitales
> >     Wörterbuch der deutschen Sprache). That should give us a more
> >     complete list of words where ligatures must be suppressed.
> 
> where is that DWDS ... i can write some code to deal with it (i'd rather start
> from the source than from some interpretation; who knows what more there
> is to uncover)

The DWDS is here: https://www.dwds.de/
But I still need to check how we can extract the words from there...
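
To make option 2 above a bit more concrete, here is a minimal Python sketch, not a working solution. It assumes the corpus has already been exported as a plain list of words (how to get that out of the DWDS is still open). The prefix list is a made-up stand-in and not selnolig's actual patterns, and the names mark_breaks and scan are hypothetical. Matches are written with the bar | at the point where the ligature would be broken:

```python
import re

# Illustrative stand-in only: first components ending in "f".
# These are NOT selnolig's actual patterns.
PREFIXES = ("auf", "kauf", "schilf")

# A word starts with one of the prefixes and continues with f, i or l,
# so an ff/fi/fl ligature would span the component boundary.
PATTERN = re.compile(r"^(?:%s)(?=[fil])" % "|".join(PREFIXES), re.IGNORECASE)


def mark_breaks(word):
    """Return the word with '|' at the break point, or None if no rule applies."""
    match = PATTERN.match(word)
    if match is None:
        return None
    return word[:match.end()] + "|" + word[match.end():]


def scan(words):
    """Return the marked forms of all words matched by the simplified rule."""
    marked_words = []
    for word in words:
        marked = mark_breaks(word.strip())
        if marked is not None:
            marked_words.append(marked)
    return marked_words


if __name__ == "__main__":
    for entry in scan(["Auflagefläche", "Kaffee", "Kaufleute", "Aufgabe"]):
        print(entry)
```

With these stand-in prefixes, « Auflagefläche » comes out as « Auf|lagefläche » (the second fl is left alone), while « Kaffee » and « Aufgabe » are not touched. A naive prefix such as « hof » would wrongly split « Hoffnung », so real patterns need to be more careful than this.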

Denis
___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://context.aanhet.net
archive  : https://bitbucket.org/phg/context-mirror/commits/
wiki     : http://contextgarden.net
___________________________________________________________________________________
