From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on inbox.vuxu.org X-Spam-Level: X-Spam-Status: No, score=0.2 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FROM,HTML_MESSAGE,MAILING_LIST_MULTI, RCVD_IN_DNSWL_NONE,RDNS_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.2 Received: (qmail 21420 invoked from network); 10 Mar 2020 19:39:50 -0000 Received-SPF: pass (minnie.tuhs.org: domain of minnie.tuhs.org designates 45.79.103.53 as permitted sender) receiver=inbox.vuxu.org; client-ip=45.79.103.53 envelope-from= Received: from unknown (HELO minnie.tuhs.org) (45.79.103.53) by inbox.vuxu.org with ESMTP; 10 Mar 2020 19:39:50 -0000 Received: by minnie.tuhs.org (Postfix, from userid 112) id 08CCC9BB78; Wed, 11 Mar 2020 05:39:46 +1000 (AEST) Received: from minnie.tuhs.org (localhost [127.0.0.1]) by minnie.tuhs.org (Postfix) with ESMTP id D43FB9BB47; Wed, 11 Mar 2020 05:39:29 +1000 (AEST) Authentication-Results: minnie.tuhs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.b="T4UC2YX5"; dkim-atps=neutral Received: by minnie.tuhs.org (Postfix, from userid 112) id E4E2D9BB47; Wed, 11 Mar 2020 05:39:27 +1000 (AEST) Received: from mail-qt1-f174.google.com (mail-qt1-f174.google.com [209.85.160.174]) by minnie.tuhs.org (Postfix) with ESMTPS id 2FBF29BB46 for ; Wed, 11 Mar 2020 05:39:27 +1000 (AEST) Received: by mail-qt1-f174.google.com with SMTP id e13so10594293qts.6 for ; Tue, 10 Mar 2020 12:39:27 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=JbeDgwmUx3Uu4vtsDf7/L19mJlL85qYLOjKHGgFwpz0=; b=T4UC2YX5ecunDaUQ6JT6wzI1oVuJ4ZAv+uxxc4QoQ5gU31AlCf+UDkDBt8JBhbKN44 90cKxGosgDXYYPocBrs1Dep/lkS/Lhg6VZNvXDBytpE0RADA246Qs4ExAOPPsIuQMXVC 1TCaG0BkB9O+omO2ZanIc/5TtVwOod2hj9gzz/EUF5MXACtGr9up2tcGJJU0Z4CyxuTA vOVCq8lKfVtTwRJ1Ly2WH6RSij1DRXeulF7N/dsMZgWANx/F9d/tQwWdSHacIOUxbFLX gCwpq9gnAu/jwlYlbRCA7EjfJGR5oF9IllaAfFEGjG7EXCTykZOh2NX32O2R/sR98fmW /zDg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=JbeDgwmUx3Uu4vtsDf7/L19mJlL85qYLOjKHGgFwpz0=; b=C2wGUMOv5vmqiOm2QG8q4xo+QuhJ6Z1ZxyRm4Z5Bk8rhF8HDRfIY8b5HVQLitLoAQ8 inqTHfTvEAbB2P++U//u9/Ss6pxHdsbQv9EnW7WXKQeBW6ubatRzTLlN0iEjMkDRC3Kv Ptq9+yGlTNua4RXsfJ4+vWabNly1Yi8bA4pyztpzQj7RbH0ohizcsEBipXbFJ4g1nzxO IAYK96Hvk+f9BKLEoPBBQcvhMVotkyT1eraesZARIiQSYqEOESJt3I6yyIgaF4YLHyAZ MF7ujQH0USkVMqPyZhCBM+swK+AnfKSa/F5hToNt4KrBw7hYHoz3CjWz4M8NBNDYsZV+ cSuw== X-Gm-Message-State: ANhLgQ07l3/EewlJIPu0F6nPibLyAeCmZ0q8NJrOTLFU+XlklLXAJUAj RSiU80DqP8HLR5P08g29FTOlV1WsSFXVlhzlVvqjtA== X-Google-Smtp-Source: ADFU+vuCSDOabeC2aL+hgqdvh0ug8KrNWYmUxrDIrszk09Bpn+3MIhGLUKxKLwdnfD6oT7z9y+TG1GqUP9YomgkJHe0= X-Received: by 2002:ac8:1b34:: with SMTP id y49mr3494108qtj.91.1583869166283; Tue, 10 Mar 2020 12:39:26 -0700 (PDT) MIME-Version: 1.0 References: <202003101842.02AIgr2x082209@coolidge.cs.dartmouth.edu> In-Reply-To: <202003101842.02AIgr2x082209@coolidge.cs.dartmouth.edu> From: Dan Cross Date: Tue, 10 Mar 2020 15:38:50 -0400 Message-ID: To: Doug McIlroy Content-Type: multipart/alternative; boundary="00000000000006eea305a0854749" Subject: Re: [TUHS] Command line options and complexity X-BeenThere: tuhs@minnie.tuhs.org X-Mailman-Version: 2.1.26 Precedence: list List-Id: The Unix Heritage Society mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: The Eunuchs Hysterical Society Errors-To: tuhs-bounces@minnie.tuhs.org Sender: "TUHS" --00000000000006eea305a0854749 Content-Type: text/plain; charset="UTF-8" On Tue, Mar 10, 2020 at 2:43 PM Doug McIlroy wrote: > > This begs questions of stability > > Astute question. I had that in my original draft, but eliminited > it for what I thought was clarity. Anyway, depending on implementation > of sort, you may need sort -s. Of course it doesn't matter which copy > among several equal lines uniq produces, nor does it matter in sort > when there are no comparison options--they're all the same. > Thanks. That's interesting. Did `sort -s` come later? The idea that you preferred clarity over stability for `sort -u` would indicate so, otherwise one might imagine that `-u` would just imply `-s` and that would be that. > I don't know enough about the > > internals of sed to know even what algorithm it uses > > (... a disk-based merge sort?) > > sed is not a sorting program--basically it copies input to > output, making line-by-line editing changes. That's the > way I meant to use it in sed s/nonkeys//|sort -keys|uniq. > (I have added options to sort, hopefully for clarity). > The argument to sed here means substitute the empty > string for the nonkey fields (specified by a regular expression). > `sed` in my email was a typo, as you speculated below. Interestingly, this `sed` construction prior to `sort` loses information, which perhaps doesn't matter in any given specific case, but is insufficient in general, which I gathered to be the entire reason you implemented `sort -u`. If "sed" was a typo for "sort", It was. all versions of sort that > I know of use an internal sorting algorithm for big chunks > of the file, then combines the chunks by merge. But internal > sorting varies all over the map--variations on quicksort, > radix sort, merge sort, ... > It's the details of the internal sorts that are most interesting in some sense, as the merges are probably fairly straight forward but the internal sorts will affect stability and have other interesting characteristics. As an aside, one must imagine that, in this day and age, a "big chunk" is probably big enough to hold the vast majority of files entirely in RAM, and only exceptionally large files actually require merging multiple blocks. - Dan C. --00000000000006eea305a0854749 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On Tue, Mar 10, 2020 at 2:43 PM Doug McIl= roy <doug@cs.dartmouth.edu&= gt; wrote:
> This begs questions of stability

Astute question. I had that in my original draft, but eliminited
it for what I thought was clarity. Anyway, depending on implementation
of sort, you may need sort -s. Of course it doesn't matter which copy among several equal lines uniq produces, nor does it matter in sort
when there are no comparison options--they're all the same.

Thanks. That's interesting.

Did `sort -s` come later? The idea that you preferred clarity over s= tability for `sort -u` would indicate so, otherwise=C2=A0one might imagine = that `-u` would just imply `-s` and that would be that.

> I don't know enough about the
> internals of sed to know even what algorithm it uses
> (... a disk-based merge sort?)

sed is not a sorting program--basically it copies input to=C2=A0 =C2=A0 =C2= =A0
output, making line-by-line editing changes. That's the=C2=A0 =C2=A0 = =C2=A0 =C2=A0
way I meant to use it in sed s/nonkeys//|sort -keys|uniq.
(I have added options to sort, hopefully for clarity).
The argument to sed here means substitute the empty
string for the nonkey fields (specified by a regular expression).

`sed` in my email was a typo, as you speculated = below.

Interestingly, this `sed` construction prio= r to `sort` loses information, which perhaps doesn't matter in any give= n specific case, but is insufficient in general, which I gathered to be the= entire reason you implemented `sort -u`.

If "sed" was a typo for "sort",

It was.

all versions of sort that
I know of use an internal sorting algorithm for big chunks
of the file, then combines the chunks by merge. But internal
sorting varies all over the map--variations on quicksort,
radix sort, merge sort, ...

It's th= e details of the internal sorts that are most interesting in some sense, as= the merges are probably fairly straight forward but the internal sorts wil= l affect stability and have other interesting characteristics.
As an aside, one must imagine that, in this day and age, a &qu= ot;big chunk" is probably big enough to hold the vast majority of file= s entirely in RAM, and only exceptionally large files actually require merg= ing multiple blocks.

=C2=A0 =C2=A0 =C2=A0 =C2=A0 -= Dan C.

--00000000000006eea305a0854749--