From mboxrd@z Thu Jan  1 00:00:00 1970
From: bakul@bitblocks.com (Bakul Shah)
Date: Fri, 24 Nov 2017 19:25:15 -0800
Subject: [TUHS] Spell - was tmac: Move macro diagnostics away from `quotes'
In-Reply-To: <201711230105.vAN15NI6026629@coolidge.cs.Dartmouth.EDU>
References: <201711230105.vAN15NI6026629@coolidge.cs.Dartmouth.EDU>
Message-ID:

On Nov 22, 2017, at 5:05 PM, Doug McIlroy wrote:
>
> Steve's program was good, but the dictionary isn't an ideal source
> for real text, which abounds in proper names and terms of art.
> It also has a lot of rare words that don't pull their weight in
> a spell checker, and some attractive nuisances, especially obscure
> short words from Scots, botany, etc, which are more likely to
> arise in everyday text as typos than by intent. Given the basic
> success of Steve's program, I undertook to make a more useful
> spelling list, along with more vigorous affix stripping (and a
> stop list to avert associated traps, e.g. "presenation" =
> pre+senate+ion"). That has been described in Bentley's "Programming
> Pearls" and in http://www.cs.dartmouth.edu/~doug/spell.pdf.

This is quite interesting to me. A while ago I looked into building
a spell checker for Gujarati (a Sanskrit-based language) and found
it to be a complicated affair -- words can take multiple suffixes,
since the Gujarati equivalents of prepositions such as from/to/in
are tacked on at the end of a word. But the same endings can also
appear in normal words. And there are other complications....
Even though the language is phonetic, mistakes in using the wrong
form of the long/short vowel signs are common.

After reading your paper I am tempted to revive the effort.
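
For illustration only, here is a minimal sketch of affix stripping
with a stop list, the idea mentioned in the quoted text. It is not
the algorithm from the paper: the stem list, the affix lists, the
"restore a dropped e" rule, and the names accepts() and derivable()
are all made-up assumptions chosen to reproduce the "presenation"
trap. The depth limit also shows how stacked suffixes (the Gujarati
complication) could be peeled off one at a time.

```python
# Toy data, all hypothetical; none of it is taken from the paper.
STEMS = {"senate", "present", "act"}
PREFIXES = ("pre", "re", "un")
SUFFIXES = ("ation", "ion", "ing", "s")
# Words that affix stripping would wrongly accept.
STOP = {"presenation"}


def stems_after_suffix(word):
    """Yield candidate stems left after removing one suffix."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf):
            base = word[: -len(suf)]
            yield base
            yield base + "e"  # assumed rule: undo a dropped silent e


def derivable(word, depth=2):
    """True if word reduces to a known stem in at most `depth` strips."""
    if word in STEMS:
        return True
    if depth == 0:
        return False
    for pre in PREFIXES:
        if word.startswith(pre) and len(word) > len(pre):
            if derivable(word[len(pre):], depth - 1):
                return True
    for stem in stems_after_suffix(word):
        if derivable(stem, depth - 1):
            return True
    return False


def accepts(word):
    """Spell check: the stop list vetoes before affix stripping runs."""
    if word in STOP:
        return False
    return derivable(word)
```

With this toy data, derivable("presenation") succeeds via
pre+senate+ion, which is exactly the trap; accepts("presenation")
rejects it only because of the stop list, while
accepts("presentation") and accepts("actions") still pass.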