* [TUHS] Was the compressed dictionary used?
@ 2025-01-02 12:40 arnold
2025-01-02 14:51 ` [TUHS] " Douglas McIlroy
2025-01-02 15:13 ` Grant Taylor via TUHS
0 siblings, 2 replies; 9+ messages in thread
From: arnold @ 2025-01-02 12:40 UTC (permalink / raw)
To: tuhs
Hi.
The paper on compressing the dictionary was interesting. In the day
of 20 meg disks, compressing a ~ 2.5 meg file down to ~ .5 meg is
a big savings.
Was the compressed dictionary put into use? I could imagine that
spell(1) at least would have needed some library routines to return
a stream of words from it.
Just wondering. Thanks,
Arnold
* [TUHS] Re: Was the compressed dictionary used?
2025-01-02 12:40 [TUHS] Was the compressed dictionary used? arnold
@ 2025-01-02 14:51 ` Douglas McIlroy
2025-01-02 15:12 ` Warner Losh
2025-01-02 16:20 ` arnold
2025-01-02 15:13 ` Grant Taylor via TUHS
1 sibling, 2 replies; 9+ messages in thread
From: Douglas McIlroy @ 2025-01-02 14:51 UTC (permalink / raw)
To: arnold; +Cc: tuhs
I am not aware that the compressed dictionary was used for anything.
Steve Johnson's first shell-script spelling-checker did make a pass
over a dictionary, but not Webster's second, which would have caused
lots of false negatives because it contains so many exotic small words
that could result from typos. My production spell aggressively stripped
affixes and used hashing and other coding tricks to keep its
"dictionary" in the limited memory of a PDP-11. (The whole story is
told in https://www.cs.dartmouth.edu/~doug/spell.pdf and insightfully
described by Jon Bentley in
https://dl.acm.org/doi/pdf/10.1145/3532.315102.) When larger memory
became available, these heroics were replaced by basic common-prefix
coding patterned after Morris and Thompson, just as Arnold surmised.
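For readers who have not met common-prefix (front) coding, here is a minimal sketch of the general idea in Python. It is not the spell code and does not claim to reproduce the Morris and Thompson format; the function names front_encode and front_decode are invented for this illustration. Each word in a sorted list is stored as the number of leading characters it shares with the previous word, plus the remaining suffix.

```python
def front_encode(words):
    """Encode a sorted word list as (shared_prefix_length, suffix) pairs."""
    encoded = []
    prev = ""
    for word in words:
        # Count the leading characters shared with the previous word.
        n = 0
        limit = min(len(prev), len(word))
        while n < limit and prev[n] == word[n]:
            n += 1
        encoded.append((n, word[n:]))
        prev = word
    return encoded


def front_decode(encoded):
    """Rebuild the word list from (shared_prefix_length, suffix) pairs."""
    words = []
    prev = ""
    for n, suffix in encoded:
        word = prev[:n] + suffix
        words.append(word)
        prev = word
    return words


words = ["spell", "spelled", "speller", "spelling", "spend"]
pairs = front_encode(words)
print(pairs)
# [(0, 'spell'), (5, 'ed'), (6, 'r'), (5, 'ing'), (3, 'nd')]
```

Decoding is strictly sequential, which suits a program that streams through the whole list but not one that needs random access.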
* [TUHS] Re: Was the compressed dictionary used?
2025-01-02 14:51 ` [TUHS] " Douglas McIlroy
@ 2025-01-02 15:12 ` Warner Losh
2025-01-02 17:20 ` Douglas McIlroy
2025-01-02 16:20 ` arnold
1 sibling, 1 reply; 9+ messages in thread
From: Warner Losh @ 2025-01-02 15:12 UTC (permalink / raw)
To: Douglas McIlroy; +Cc: The Eunuchs Hysterical Society
On Thu, Jan 2, 2025, 7:51 AM Douglas McIlroy <douglas.mcilroy@dartmouth.edu>
wrote:
> I am not aware that the compressed dictionary was used for anything.
> Steve Johnson's first shell-script spelling-checker did make a pass
> over a dictionary, but not Webster's second, which would have caused
> lots of false negatives because it contains so many exotic small words
> that could result from typos.
Where did the Webster's Second file come from? Did the labs give the public
domain paper dictionary to the equivalent of a typing pool and have them
enter it? Or did it come from elsewhere? Or something else? How was it
checked for accuracy?
Warner
* [TUHS] Re: Was the compressed dictionary used?
2025-01-02 12:40 [TUHS] Was the compressed dictionary used? arnold
2025-01-02 14:51 ` [TUHS] " Douglas McIlroy
@ 2025-01-02 15:13 ` Grant Taylor via TUHS
2025-01-03 3:14 ` John Levine
1 sibling, 1 reply; 9+ messages in thread
From: Grant Taylor via TUHS @ 2025-01-02 15:13 UTC (permalink / raw)
To: tuhs
On 1/2/25 6:40 AM, arnold@skeeve.com wrote:
> The paper on compressing the dictionary was interesting. In the day
> of 20 meg disks, compressing a ~ 2.5 meg file down to ~ .5 meg is a
> big savings.
It's even more important when sending data across the wire.
> Was the compressed dictionary put into use? I could imagine that
> spell(1) at least would have needed some library routines to return
> a stream of words from it.
I couldn't help but think about the DNS on wire compression format which
will re-use part of the existing query name to de-duplicate later parts
of the same query name.
I know it's not the same, but it felt un-ignorably close in both purpose
and method.
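To make the comparison concrete, here is a rough sketch of DNS-style name compression in Python. It assumes the usual wire layout (length-prefixed labels, a zero byte for the root, and a two-byte pointer whose top two bits are set and whose low 14 bits give an offset into the message). The helper names encode_names and decode_name are made up for this example, and real resolvers also handle case folding, label length limits, and pointer loops, all of which are ignored here.

```python
def encode_names(names):
    """Encode names into one buffer, reusing earlier suffixes via pointers."""
    buf = bytearray()
    seen = {}  # tuple of labels (a name suffix) -> offset where it starts
    for name in names:
        labels = name.rstrip(".").split(".")
        for i in range(len(labels)):
            suffix = tuple(labels[i:])
            if suffix in seen:
                # Two-byte pointer: 0b11 marker plus a 14-bit offset.
                offset = seen[suffix]
                buf += bytes([0xC0 | (offset >> 8), offset & 0xFF])
                break
            if len(buf) < 0x4000:  # only offsets that fit in 14 bits
                seen[suffix] = len(buf)
            label = labels[i].encode("ascii")
            buf.append(len(label))
            buf += label
        else:
            buf.append(0)  # root label ends a name written out in full
    return bytes(buf)


def decode_name(buf, offset):
    """Read one name starting at offset, following pointers (no loop guard)."""
    labels = []
    while True:
        length = buf[offset]
        if length == 0:
            break
        if length & 0xC0 == 0xC0:
            offset = ((length & 0x3F) << 8) | buf[offset + 1]
            continue
        labels.append(buf[offset + 1:offset + 1 + length].decode("ascii"))
        offset += 1 + length
    return ".".join(labels)


wire = encode_names(["www.example.com", "mail.example.com"])
print(len(wire), decode_name(wire, 17))
```

The second name costs 7 bytes instead of 18 because "example.com" is replaced by a pointer back to offset 4. That is the same trick as sharing a prefix with the previous dictionary word, only applied to suffixes.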
--
Grant. . . .
unix || die
* [TUHS] Re: Was the compressed dictionary used?
2025-01-02 14:51 ` [TUHS] " Douglas McIlroy
2025-01-02 15:12 ` Warner Losh
@ 2025-01-02 16:20 ` arnold
1 sibling, 0 replies; 9+ messages in thread
From: arnold @ 2025-01-02 16:20 UTC (permalink / raw)
To: douglas.mcilroy, arnold; +Cc: tuhs
Douglas McIlroy <douglas.mcilroy@dartmouth.edu> wrote:
> My production spell aggressively stripped
> affixes and used hashing and other coding tricks to keep its
> "dictionary" in the limited memory of a PDP-11. (The whole story is
> told in https://www.cs.dartmouth.edu/~doug/spell.pdf and insightfully
> described by Jon Bentley in
> https://dl.acm.org/doi/pdf/10.1145/3532.315102.) When larger memory
> became available, these heroics were replaced by basic common-prefix
> coding patterned after Morris and Thompson, just as Arnold surmised.
But all this would have been in the C code for spell, and not in
the dictionary used, right?
Thanks,
Arnold
P.S. A few years ago I made the v10 spell available for today's systems,
see https://github.com/arnoldrobbins/v10spell.
* [TUHS] Re: Was the compressed dictionary used?
2025-01-02 15:12 ` Warner Losh
@ 2025-01-02 17:20 ` Douglas McIlroy
2025-01-02 21:19 ` Warner Losh
0 siblings, 1 reply; 9+ messages in thread
From: Douglas McIlroy @ 2025-01-02 17:20 UTC (permalink / raw)
To: Warner Losh; +Cc: The Eunuchs Hysterical Society
The word list of Webster's 2nd came from an Air Force project along
with several other files, including a medical dictionary and an
alphabetical list of tetragrams found in Web2--something one would
expect to create for oneself nowadays. The files were freely
distributed with no strings attached. We have not noticed any
mistakes. The list includes 76205 entries that contain blanks or
hyphens; these were omitted from the pinhead exercise.
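As a sketch of what creating such a tetragram list for oneself might look like today, here is a few lines of Python. It assumes a tetragram means four consecutive characters inside a single word; the exact definition used for the Air Force file is not stated in this thread, and the function name tetragrams is invented for the example.

```python
def tetragrams(words):
    """Return the sorted set of four-character substrings found in words."""
    grams = set()
    for word in words:
        w = word.lower()
        # Slide a four-character window across the word.
        for i in range(len(w) - 3):
            grams.add(w[i:i + 4])
    return sorted(grams)


print(tetragrams(["spell", "spelt"]))
# ['pell', 'pelt', 'spel']
```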
Doug
* [TUHS] Re: Was the compressed dictionary used?
2025-01-02 17:20 ` Douglas McIlroy
@ 2025-01-02 21:19 ` Warner Losh
2025-01-02 23:32 ` Douglas McIlroy
0 siblings, 1 reply; 9+ messages in thread
From: Warner Losh @ 2025-01-02 21:19 UTC (permalink / raw)
To: Douglas McIlroy; +Cc: The Eunuchs Hysterical Society
The BSDs since 4.4lite have added a lot of missing words, but few
corrections. From FreeBSD:
Capitalized Transvaal, fixed 'stock certificate' to have a 't' and
preconsoidate -> preconsolidate
Ahtena, freen, unknowen and structurelessness were removed
corelate (etc) and freend were removed as typos and only thinly supported
variants.
Not bad for 50 years of nit-pickers poring over the file.
Warner
* [TUHS] Re: Was the compressed dictionary used?
2025-01-02 21:19 ` Warner Losh
@ 2025-01-02 23:32 ` Douglas McIlroy
0 siblings, 0 replies; 9+ messages in thread
From: Douglas McIlroy @ 2025-01-02 23:32 UTC (permalink / raw)
To: Warner Losh; +Cc: The Eunuchs Hysterical Society
Warner,
Thanks for those bugs. Here's a similar list for lucky owners of
Webster's 7th Collegiate:
dissymmettric
brecia
belicoseness
assaugement
A space is missing in the pronunciation field for Ouija.
There must be more bugs in other fields, which constitute the bulk of
the Web7 files.
Doug
* [TUHS] Re: Was the compressed dictionary used?
2025-01-02 15:13 ` Grant Taylor via TUHS
@ 2025-01-03 3:14 ` John Levine
0 siblings, 0 replies; 9+ messages in thread
From: John Levine @ 2025-01-03 3:14 UTC (permalink / raw)
To: tuhs; +Cc: gtaylor
It appears that Grant Taylor via TUHS <gtaylor@tnetconsulting.net> said:
>On 1/2/25 6:40 AM, arnold@skeeve.com wrote:
>> The paper on compressing the dictionary was interesting. In the day
>> of 20 meg disks, compressing a ~ 2.5 meg file down to ~ .5 meg is a
>> big savings.
>
>It's even more important when sending data across the wire.
>
>> Was the compressed dictionary put into use? I could imagine that
>> spell(1) at least would have needed some library routines to return
>> a stream of words from it.
>
>I couldn't help but think about the DNS on wire compression format which
>will re-use part of the existing query name to de-duplicate later parts
>of the same query name.
>
>I know it's not the same, but it felt un-ignorably close in both purpose
>and method.
Lempel and Ziv published the LZ77 paper in 1977 (hence the name) which uses
back pointers into a sliding window of text. Later tweaks brought us LZ78
and compress and gzip.
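A toy version of the sliding-window idea, in Python, may help. This is only a sketch: it uses a naive quadratic search, emits (distance, length, next character) triples rather than any real file format, and the names lz77_compress, lz77_decompress, window and max_len are assumptions made for this example, not anything taken from compress or gzip.

```python
def lz77_compress(data, window=255, max_len=15):
    """Return (distance, length, next_char) triples for the string data."""
    out = []
    i = 0
    while i < len(data):
        best_dist, best_len = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # Matches may overlap position i; always leave one char to emit.
            while (length < max_len
                   and i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_dist, best_len = i - j, length
        out.append((best_dist, best_len, data[i + best_len]))
        i += best_len + 1
    return out


def lz77_decompress(triples):
    """Rebuild the text by copying from earlier output, then adding next_char."""
    chars = []
    for dist, length, nxt in triples:
        for _ in range(length):
            chars.append(chars[-dist])  # back pointer into text already produced
        chars.append(nxt)
    return "".join(chars)


text = "abracadabra abracadabra"
triples = lz77_compress(text)
print(len(text), len(triples))
```

The back pointer plays the same role as the prefix count in the dictionary scheme, except that it may point anywhere in the recent window rather than only at the previous word.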
There are really only two ways to compress data: use a variable length coding scheme with
the shortest codes for the most common tokens, or a dictionary that uses pointers to
repeated strings. Huffman invented the former in 1951, Lempel and Ziv the latter in
1977, although as we've seen people did special purpose versions of the dictionary
approach like this one. Modern schemes use combinations of both.
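For the variable-length side, here is a minimal Huffman code builder in Python. It is a sketch of the textbook construction (repeatedly merge the two least frequent subtrees) and works on single characters; the function name huffman_codes and the choice to return codes as strings of "0" and "1" are assumptions made for readability, not how a real compressor packs bits.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Map each symbol to a bit string; frequent symbols get shorter codes."""
    if not text:
        return {}
    counts = Counter(text)
    # Heap entries: (weight, unique tie breaker, {symbol: code so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A lone symbol still needs a one-bit code.
        return {sym: "0" for sym in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Prefix a bit to every code in each of the two merged subtrees.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


codes = huffman_codes("aaaaaaaabbbc")
print(codes)
```

With eight a's, three b's and one c, "a" gets a one-bit code while "b" and "c" get two bits each, so the twelve characters fit in 16 bits.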
The DNS data formats were invented in about 1982 but I have no idea whether
Mockapetris was familiar with LZ. I suppose I could ask him.
R's,
John