From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on inbox.vuxu.org X-Spam-Level: X-Spam-Status: No, score=0.2 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FROM,HTML_MESSAGE,MAILING_LIST_MULTI, RCVD_IN_DNSWL_NONE,RDNS_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.2 Received: (qmail 6474 invoked from network); 10 Mar 2020 17:40:09 -0000 Received-SPF: pass (minnie.tuhs.org: domain of minnie.tuhs.org designates 45.79.103.53 as permitted sender) receiver=inbox.vuxu.org; client-ip=45.79.103.53 envelope-from= Received: from unknown (HELO minnie.tuhs.org) (45.79.103.53) by inbox.vuxu.org with ESMTP; 10 Mar 2020 17:40:09 -0000 Received: by minnie.tuhs.org (Postfix, from userid 112) id 31BC79BC05; Wed, 11 Mar 2020 03:40:07 +1000 (AEST) Received: from minnie.tuhs.org (localhost [127.0.0.1]) by minnie.tuhs.org (Postfix) with ESMTP id 7D8409BB47; Wed, 11 Mar 2020 03:39:04 +1000 (AEST) Authentication-Results: minnie.tuhs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.b="Uq1i9Rfe"; dkim-atps=neutral Received: by minnie.tuhs.org (Postfix, from userid 112) id F0E839BB47; Wed, 11 Mar 2020 03:39:01 +1000 (AEST) Received: from mail-qk1-f179.google.com (mail-qk1-f179.google.com [209.85.222.179]) by minnie.tuhs.org (Postfix) with ESMTPS id D868B9BB46 for ; Wed, 11 Mar 2020 03:39:00 +1000 (AEST) Received: by mail-qk1-f179.google.com with SMTP id e11so13509791qkg.9 for ; Tue, 10 Mar 2020 10:39:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=Ka9802jT5RbLVbnO/URsLeBIKgdxUzq06/ON+mX1sjM=; b=Uq1i9Rfemzx5nKP0TDaAIWj4+RpcsCOiVLqH3dkv7WVEcJXcEen5cQ3OYePpY5HDAV y+fun9nISgNncUtVewW6a1esqjTHrFsInlT/iRVzcZBbBE65t7uiXnmmpWO+6HtRu3Xg zFmtlA06+LGUjSHm2GdBXNCY+gh5Rdx6e0btFJcDreO4fjxLte0YDnfcV9SuYosHsG5E HqE8XZ3DY8IAQYRAeauOHZ1ojOvXwffUouYXys7lAzcKfWHrMgdtjP75wg05UnXXQxu6 w+eYLl5KaKirvHsn1mvR8qTCWRpRII+dNvAiStr+VSKaLPb6rRQ2nOCQCD9qemU4AoJ4 3vrA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=Ka9802jT5RbLVbnO/URsLeBIKgdxUzq06/ON+mX1sjM=; b=hcLWsIyouMkylomrXLcsTkwduhvZH1npSx5z/oTCIAqjHT5ihWFlrrlJeCTuHjkBFa 0fg0XzO+fsrbEkozJbCK9vegPqnOE5mhLPyftg7q0fsviRUAEp5Q3PQEWypQCHvYDHN2 gcf0bcAG9kn0lU8UH54MlO86jfnmLqAaeZpC9c0wXTJJnLRKYUlU1tlq71M9OjO1FJY9 qsmq8YqfVMp8dGi6T2qZW/hpFiKZNZyC6nKBGPwjlGoCZ5UrjX2kXRWuqzg3hNrnLMuJ Xw/4VLYZVDRIKyxK23aOoYZg2te0TWutuizjlo/v8gj3AIwOClyUGufUBbGHF9J9Zj1I QKrA== X-Gm-Message-State: ANhLgQ1ZfZRf0jIiIjlvQHUT+WBroxBzwhB0UdKCObjk6yz8kJ/fD1lQ JJpTtHOSJtHQlfBtGrysoWksI5Y7rrWw//h4VF6eKQ== X-Google-Smtp-Source: ADFU+vsyr6z/6vwfV0Segskbo/WPR142ckE12y2FQlP3N4NxjPr3QB36Yc1iyPZCWiIOJViNSfywpEBPYY9QWkxOB9E= X-Received: by 2002:ae9:c001:: with SMTP id u1mr18654628qkk.346.1583861939980; Tue, 10 Mar 2020 10:38:59 -0700 (PDT) MIME-Version: 1.0 References: <202003101615.02AGFLgS075920@coolidge.cs.dartmouth.edu> In-Reply-To: <202003101615.02AGFLgS075920@coolidge.cs.dartmouth.edu> From: Dan Cross Date: Tue, 10 Mar 2020 13:38:23 -0400 Message-ID: To: Doug McIlroy Content-Type: multipart/alternative; boundary="0000000000004e4ec605a08398f3" Subject: Re: [TUHS] Command line options and complexity X-BeenThere: tuhs@minnie.tuhs.org X-Mailman-Version: 2.1.26 Precedence: list List-Id: The Unix Heritage Society mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: The Eunuchs Hysterical Society Errors-To: tuhs-bounces@minnie.tuhs.org Sender: "TUHS" --0000000000004e4ec605a08398f3 Content-Type: text/plain; charset="UTF-8" On Tue, Mar 10, 2020 at 12:16 PM Doug McIlroy wrote: > > The idea of a simple rule is great, but the suggested rule fails on sort > -u > > which afaik came after sort | uniq for performance reasons. > > As the guilty party for most of sort's comparison options, I can > attest that efficiency was not an objective of -u. It was invented > precisely because uniq had proved useful, but not when one was > interested in uniqueness only of some key aspect of the data. > > -u differs from uniq in that -u selects samples based on > equality of keys, not equality of lines. In the default > case of whole-line keys, sort -u of course does exactly > what sort|uniq does. > > For many applications of -u with keys, the non-key fields > are not of interest. Then sed s/nonkeys//|sort|uniq may > suffice. But sed did not exist when -u was invented. > And not all sort key specs are easily imitated in sed. > This begs questions of stability: in the event of non-unique keys and non-key fields in the sortable data, which "records" (lines) are kept and which are discarded? Surely the "first" is kept and subsequent entries with the same key suppressed, but I confess I don't know enough about the internals of sed to know even what algorithm it uses (I assume a disk-based merge sort?), but I would imagine these details have changed over time. - Dan C. --0000000000004e4ec605a08398f3 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On Tue, Mar 10, 2020 at 12:16 PM Doug McI= lroy <doug@cs.dartmouth.edu= > wrote:
> The idea of a simple rule is great, but the sug= gested rule fails on sort -u
> which afaik came after sort | uniq for performance reasons.

As the guilty party for most of sort's comparison options, I can
attest that efficiency was not an objective of -u. It was invented
precisely because uniq had proved useful, but not when one was
interested in uniqueness only of some key aspect of the data.

-u differs from uniq in that -u selects samples based on
equality of keys, not equality of lines. In the default
case of whole-line keys, sort -u of course does exactly
what sort|uniq does.

For many applications of -u with keys, the non-key fields
are not of interest. Then sed s/nonkeys//|sort|uniq may
suffice. But sed did not exist when -u was invented.
And not all sort key specs are easily imitated in sed.

This begs questions of stability: in the event of non-uniqu= e keys and non-key fields in the sortable data, which "records" (= lines) are kept and which are discarded? Surely the "first" is ke= pt and subsequent entries with the same key suppressed, but I confess I don= 't know enough about the internals of sed to know even what algorithm i= t uses (I assume a disk-based merge sort?), but I would imagine these detai= ls have changed over time.

=C2=A0 =C2=A0 =C2=A0 = =C2=A0 - Dan C.

--0000000000004e4ec605a08398f3--