The Unix Heritage Society mailing list
From: imp@bsdimp.com (Warner Losh)
Subject: [TUHS] Discuss of style and design of computer programs from a user stand point
Date: Sun, 7 May 2017 09:13:11 -0600
Message-ID: <CANCZdfonzdbH4dX_HKbtWpGaJtC=-DxZXD3VbFFtKxNNuEZHgA@mail.gmail.com>
In-Reply-To: <CAGfO01x_pDxSTtyKFgAu9FYAJBQDtqgdSRmkXAO4TO==73GmYw@mail.gmail.com>

On Sat, May 6, 2017 at 7:42 PM, Noel Hunt <noel.hunt at gmail.com> wrote:
> I was about to suggest using the Plan9 port utilities of the
> same name but it seems 'uniq' is not coded to handle Runes
> (aka utf-8). I don't imagine it would be hard to re-write it to
> handle utf-8.

I guess I should have been clearer on what wouldn't work. It can't
possibly work for Japanese and Chinese, where words aren't separated by
whitespace. It would cause problems in hybrid writing systems, where
words can be composed of logograms and phonograms (say Japanese, which
often uses a few kanji with hiragana endings that then run into
hiragana particles or other grammar elements). It can't work without
modification (using class names) for Cyrillic, because there's no A or
Z in words there. It won't work in any language that has a
discontiguous set of letters, which includes many Western European
languages, since all the accented or otherwise decorated letters
aren't in the range A-Z.

So whether or not the underlying tools can handle UTF-8 encoding,
there are problems with the original.
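
To make the A-Z problem concrete, here is a small sketch. It pins the
C locale so that tr works on bytes (behaviour in other locales varies
between tr implementations), and the helper name split_ascii is made
up for the illustration. The accented letters are written as octal
UTF-8 escapes so the example doesn't depend on the file's encoding.

```shell
# Hypothetical helper: the splitting step from the original pipeline,
# pinned to the C locale so every byte outside A-Za-z is a separator.
split_ascii() {
    LC_ALL=C tr -cs A-Za-z '\n'
}

# "café naïve": \303\251 is e-acute, \303\257 is i-diaeresis in UTF-8.
printf 'caf\303\251 na\303\257ve\n' | split_ascii
# prints:
#   caf
#   na
#   ve
```

Two words go in and three fragments come out, with the accented
letters dropped entirely.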

If you used:

 tr -cs "[:alpha:]" '\n' | tr "[:upper:]" "[:lower:]" | sort | uniq -c |
   sort -rn | sed ${1}q

you'd still have issues with languages that don't use word separators,
or that aren't written alphabetically.
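
For reference, a sketch of the class-based pipeline wrapped in a
function, so that $1 is the number of lines to keep. The name
wordfreq is made up for the illustration. What [:alpha:] actually
matches depends on the current locale and on whether the local tr
understands multibyte characters, so the run below sticks to ASCII
input and shows only the mechanics.

```shell
# Hypothetical wrapper: print the $1 most frequent words on stdin.
wordfreq() {
    tr -cs "[:alpha:]" '\n' |       # one word per line
        tr "[:upper:]" "[:lower:]" | # fold case
        sort | uniq -c |             # count each distinct word
        sort -rn |                   # most frequent first
        sed "${1}q"                  # stop after $1 lines
}

# Top word of a short ASCII sample; awk normalises the padding that
# uniq -c puts in front of the count.
printf 'The cat and the dog\n' | wordfreq 1 | awk '{print $1, $2}'
# prints: 2 the
```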

Warner

> On Sun, May 7, 2017 at 11:15 AM, Warner Losh <imp at bsdimp.com> wrote:
>>
>> On Sat, May 6, 2017 at 1:50 PM, Bakul Shah <bakul at bitblocks.com> wrote:
>> > tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
>>
>> The cool thing about this thread is that I learned two things: what tr
>> -s does, and what Nq does for sed...
>>
>> Sadly, this doesn't work so well for text that isn't ASCII-7 english...
>>
>> Warner
>
>

