From: <denis.maier@ub.unibe.ch>
To: <j.hagen@xs4all.nl>, <ntg-context@ntg.nl>
Subject: Re: Ligature suppression word list
Date: Tue, 6 Apr 2021 14:59:59 +0000
Message-ID: <02e7ba65f8ba463980ef7e63bb88a2c2@ub.unibe.ch>
In-Reply-To: <a7181820-975a-3d84-6574-a81a6524e427@xs4all.nl>
> -----Original Message-----
> From: Hans Hagen <j.hagen@xs4all.nl>
> Sent: Saturday, 3 April 2021 17:58
> To: mailing list for ConTeXt users <ntg-context@ntg.nl>; Maier, Denis
> Christian (UB) <denis.maier@ub.unibe.ch>
> Subject: Re: [NTG-context] Ligature suppression word list
>
> On 4/3/2021 5:06 PM, denis.maier@ub.unibe.ch wrote:
> > Hi everyone
> >
> > Now that Hans has implemented the new ligature suppression mechanism
> > via language goodies - thanks again Hans! - we need to come up with
> > word lists.
> >
> > I've started working on a list of German words with ligatures that
> > should be suppressed. The list is derived from the word list that
> > comes with the lualatex selnolig package:
> > https://github.com/micoloretan/selnolig/blob/master/selnolig-german-wordlist.tex
> >
> > You can find the current list here:
> > https://github.com/denismaier/context-nolig-wordlist
> >
> > The list is currently organized as follows:
> >
> > 1. L. 25-35: words where automatic pattern matching is harder than
> > usual because they contain multiple ligatures, some of which must be
> > suppressed while others must be preserved. In the case of
> > « Auflagefläche » it's even the same combination of letters. So
> > here, we use the bar | to manually indicate the points where no
> > ligature must occur.
> > 2. L. 36ff.: the vast majority of words are currently in this block,
> > which lists words where an ff, fl, fi, ffi, or ffl ligature has to
> > be broken up after the first f.
> > 3. L. 1804ff.: words where ffi, ffl, or fff ligatures have to be
> > prevented after the second f, so that the first two f's form a
> > ligature.
> > 4. The remaining blocks, starting at l. 1900, l. 2073, l. 2157,
> > l. 2225, and l. 2277, suppress ligatures for « ft » and « fft »,
> > « fb » and « ffb », « fh » and « ffh », « fj » and « ffj », and
> > « fk » and « ffk ».
> >
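To make the bar notation concrete, here is a minimal Python sketch of how such an entry could be read. The function name parse_entry and the entry form "Auf|lagefläche" are assumptions for illustration only; this is not how ConTeXt itself processes the list.

```python
def parse_entry(entry: str) -> tuple[str, list[int]]:
    """Split a bar-marked entry into the plain word and its break offsets.

    Each offset is the index in the plain word *before which* no ligature
    may be formed (hypothetical reading of the bar notation).
    """
    parts = entry.split("|")
    word = "".join(parts)
    breaks = []
    offset = 0
    # Every part except the last one is followed by a bar.
    for part in parts[:-1]:
        offset += len(part)
        breaks.append(offset)
    return word, breaks


# The first "fl" (Auf + lage) is broken, the second one (fläche) is kept.
word, breaks = parse_entry("Auf|lagefläche")
print(word, breaks)  # Auflagefläche [3]
```

An entry without any bar simply yields an empty list of offsets.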
> > Obviously, that list is far from complete, and the question is
> > whether it ever can be. Please have a look and feel free to propose
> > more words to be included - either via mail or directly on GitHub.
> >
> > More generally, there's the question of how such a list should be
> > extended. I was thinking about two options:
> >
> > 1. The new language options feature includes a tracker that records
> > which words in a given document had ligatures suppressed and which
> > words were left untouched by the mechanism. It should be possible
> > to analyze the log file and create lists of words with ligatures.
> > Deriving new words for the ligature-suppression word list from that
> > should be a rather simple step.
> > 2. A bigger solution might be to use selnolig's patterns in a script
> > that can be run over a large corpus, such as the DWDS (Digitales
> > Wörterbuch der deutschen Sprache). That should give us a more
> > complete list of words where ligatures must be suppressed.
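As a rough first step towards option 2, a script could at least collect every word that contains one of the letter sequences mentioned above. A minimal Python sketch, assuming the corpus is already available as plain words; find_candidates and SEQUENCES are hypothetical names. It does not implement selnolig's patterns, so it only finds candidates and says nothing about whether a ligature should actually be suppressed.

```python
import re

# Letter sequences named in this thread, longest first so that "fff" or
# "ffi" is reported rather than just the leading "ff".
SEQUENCES = ["ffi", "ffl", "fff", "fft", "ffb", "ffh", "ffj", "ffk",
             "ff", "fi", "fl", "ft", "fb", "fh", "fj", "fk"]
CANDIDATE = re.compile("|".join(SEQUENCES))


def find_candidates(words):
    """Return {word: [(offset, sequence), ...]} for words with a candidate."""
    result = {}
    for word in words:
        hits = [(m.start(), m.group())
                for m in CANDIDATE.finditer(word.lower())]
        if hits:
            result[word] = hits
    return result


print(find_candidates(["Auflagefläche", "Haus", "Schifffahrt"]))
# {'Auflagefläche': [(2, 'fl'), (7, 'fl')], 'Schifffahrt': [(4, 'fff')]}
```

The resulting candidates would still have to be checked, by hand or against morphological rules, before they go into the word list.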
>
> where is that DWDS ... i can write some code to deal with it (i'd rather start
> from the source than from some interpretation; who knows what more there
> is to uncover)
The DWDS is here: https://www.dwds.de/
But I still need to check how we can extract the words from there...
Denis