> Message: 2
> Date: Tue, 6 Apr 2021 15:03:54 +0000
> From:
> To:
> Subject: Re: [NTG-context] Ligature suppression word list
> Message-ID: <41e6530172b54bffb7a82febff0a6be5@ub.unibe.ch>
> Content-Type: text/plain; charset="iso-8859-1"
>
>> -----Original Message-----
>> From: Hans Hagen
>> Sent: Saturday, 3 April 2021 17:58
>> To: mailing list for ConTeXt users; Maier, Denis Christian (UB)
>> Subject: Re: [NTG-context] Ligature suppression word list
>>
>> […]
>>
>>> 2. A bigger solution might be to use selnolig's patterns in a script
>>> that can be run over a large corpus, such as the DWDS (Digitales
>>> Wörterbuch der deutschen Sprache). That should give us a more
>>> complete list of words where ligatures must be suppressed.
>>
>> where is that DWDS ... i can write some code to deal with it (i'd rather
>> start from the source than from some interpretation; who knows what more
>> there is to uncover)
>
> As it turns out, the linguists who helped with the selnolig package actually used another corpus: the Stuttgart "Deutsch Web as Corpus".
> They describe their approach in this paper:
> https://raw.githubusercontent.com/SHildebrandt/selnolig-check/master/selnolig-check-documentation.pdf

A lot of corpora can be found here: https://wortschatz.uni-leipzig.de/de
especially here: https://wortschatz.uni-leipzig.de/de/download/German

There are corpora for many other languages, too, such as English, French, Dutch, Spanish, Russian, Japanese, Latin, …

HTH
Ralf
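
To make the corpus-scan idea above concrete, here is a minimal sketch, in Python, of how one might run ligature-suppression patterns over such a word list (one word per line, as the Leipzig Wortschatz downloads provide). Everything in it is an illustrative assumption: the boundary patterns, file layout, and names are made up for the example and are not selnolig's actual pattern format nor the code Hans has in mind.

#!/usr/bin/env python3
# Illustrative sketch only: scan a one-word-per-line list and report
# words in which a ligature-forming letter pair sits on a (hand-listed)
# morpheme boundary.  The tiny BOUNDARIES sample below is made up for
# the example; a real run would derive such patterns from selnolig or
# from linguistic analysis of the corpus.

import sys

# "|" marks the morpheme boundary; the letters around it would
# otherwise be joined into a ligature (ff, fi, fl, ffi, ffl, ft).
BOUNDARIES = [
    "auf|fallen",    # auffallen:   ff crosses the prefix boundary
    "auf|lage",      # Auflage:     fl crosses the prefix boundary
    "schiff|fahrt",  # Schifffahrt: ff crosses the compound boundary
    "kauf|leute",    # Kaufleute:   fl crosses the compound boundary
]

def boundary_hits(word, boundaries):
    """Yield (pattern, position) for every boundary found in the word."""
    lower = word.lower()
    for pattern in boundaries:
        plain = pattern.replace("|", "")
        pos = lower.find(plain)
        if pos != -1:
            yield pattern, pos

def main(path):
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            word = line.strip()
            if not word:
                continue
            for pattern, pos in boundary_hits(word, BOUNDARIES):
                # print the word, the matched pattern, and where it starts
                print(f"{word}\t{pattern}\t{pos}")

if __name__ == "__main__":
    main(sys.argv[1])

It would be run as, say, python3 nolig_scan.py wordlist.txt, where both the script name and wordlist.txt are placeholders for whatever one-word-per-line file is downloaded from the Wortschatz site.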