On Tue, Mar 10, 2020 at 12:16 PM Doug McIlroy <doug@cs.dartmouth.edu> wrote:
> The idea of a simple rule is great, but the suggested rule fails on sort -u
> which afaik came after sort | uniq for performance reasons.

As the guilty party for most of sort's comparison options, I can
attest that efficiency was not an objective of -u. It was invented
precisely because uniq had proved useful, but not when one was
interested in uniqueness only of some key aspect of the data.

-u differs from uniq in that -u selects samples based on
equality of keys, not equality of lines. In the default
case of whole-line keys, sort -u of course does exactly
what sort|uniq does.
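
A minimal sketch of that difference, assuming a modern POSIX sort with
the -k key syntax and a made-up two-field sample (the data and variable
names are hypothetical, not from the original discussion):

```shell
# Fix the collation order so the results do not depend on the locale.
LC_ALL=C; export LC_ALL

# Hypothetical sample: field 1 is the key, field 2 is non-key data.
data='b 2
a 1
a 9
b 2'

# Whole-line keys: sort -u and sort | uniq give the same result.
whole_u=$(printf '%s\n' "$data" | sort -u)
whole_pipe=$(printf '%s\n' "$data" | sort | uniq)

# Key restricted to field 1: -u keeps one line per distinct key,
# even though "a 1" and "a 9" differ as whole lines.
keyed_u=$(printf '%s\n' "$data" | sort -u -k1,1)

printf '%s\n' "$keyed_u"
```

With whole-line keys the two forms agree. With the key limited to
field 1, only one "a" line and one "b" line survive; which "a" line
that is depends on the implementation.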

For many applications of -u with keys, the non-key fields
are not of interest. Then sed s/nonkeys//|sort|uniq may
suffice. But sed did not exist when -u was invented.
And not all sort key specs are easily imitated in sed.
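
As a rough illustration of the sed alternative, assuming the key is the
first blank-separated field and everything after it is non-key (the
sample data and the sed pattern are hypothetical stand-ins for
"nonkeys"):

```shell
# Fix the collation order so the results do not depend on the locale.
LC_ALL=C; export LC_ALL

# Hypothetical sample: field 1 is the key, field 2 is non-key data.
data='b 2
a 1
a 9
b 2'

# Strip the non-key part (everything from the first space on),
# then let sort | uniq work on what remains.
via_sed=$(printf '%s\n' "$data" | sed 's/ .*//' | sort | uniq)

# The same set of keys via sort -u on field 1, with the non-key
# fields cut away afterwards for comparison.
via_u=$(printf '%s\n' "$data" | sort -u -k1,1 | cut -d' ' -f1)

printf '%s\n' "$via_sed"
```

This only works because the key here is trivially expressible as a sed
substitution; a key such as a numeric or reversed field has no such
simple equivalent.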

This raises questions of stability: when keys are not unique and the sortable data has non-key fields, which "records" (lines) are kept and which are discarded? Presumably the "first" is kept and subsequent entries with the same key are suppressed, but I confess I don't know enough about the internals of sort to know even what algorithm it uses (I assume a disk-based merge sort?), and I would imagine these details have changed over time.

        - Dan C.