On Tue, Mar 10, 2020 at 12:16 PM Doug McIlroy wrote:
> > The idea of a simple rule is great, but the suggested rule fails on sort -u
> > which afaik came after sort | uniq for performance reasons.
>
> As the guilty party for most of sort's comparison options, I can
> attest that efficiency was not an objective of -u. It was invented
> precisely because uniq had proved useful, but not when one was
> interested in uniqueness only of some key aspect of the data.
>
> -u differs from uniq in that -u selects samples based on
> equality of keys, not equality of lines. In the default
> case of whole-line keys, sort -u of course does exactly
> what sort|uniq does.
>
> For many applications of -u with keys, the non-key fields
> are not of interest. Then sed s/nonkeys//|sort|uniq may
> suffice. But sed did not exist when -u was invented.
> And not all sort key specs are easily imitated in sed.

This raises questions of stability: when the sortable data contains
non-unique keys along with non-key fields, which "records" (lines) are
kept and which are discarded? Surely the "first" is kept and subsequent
entries with the same key are suppressed. I confess I don't know enough
about the internals of sort to know even what algorithm it uses (I
assume a disk-based merge sort?), and I would imagine these details
have changed over time.

        - Dan C.
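
For anyone who wants to poke at the distinction, here is a minimal sketch. The sample data and variable names are made up for illustration. It only shows that -u with a key compares keys rather than whole lines. It deliberately does not assert which line of an equal-key set survives, since that is exactly the open question.

```shell
# Hypothetical sample data: a key in field 1, a non-key value in field 2.
data='b 2
a 1
b 1
a 2'

# Whole-line comparison: all four lines are distinct, so all four survive.
whole_line=$(printf '%s\n' "$data" | sort | uniq)

# Default (whole-line) keys: sort -u should match sort | uniq.
default_u=$(printf '%s\n' "$data" | sort -u)

# Key comparison on field 1 only: one line per distinct key survives.
# Which line of each equal-key set is kept is not checked here.
by_key=$(printf '%s\n' "$data" | sort -u -k1,1)

printf 'sort | uniq:\n%s\n' "$whole_line"
printf 'sort -u:\n%s\n' "$default_u"
printf 'sort -u -k1,1:\n%s\n' "$by_key"
```

Comparing the second field of the last output across different sort implementations is one way to see which record each of them actually keeps.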