Branden Robinson" To: tuhs@tuhs.org Message-ID: <20181207083154.ucpd6qim3ghclfhn@crack.deadbeast.net> References: <1C1BB8D2-9535-4A6D-96E8-3792F14BCBAE@163.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="zrx5ihnykqhcfrlz" Content-Disposition: inline In-Reply-To: <1C1BB8D2-9535-4A6D-96E8-3792F14BCBAE@163.com> User-Agent: NeoMutt/20180716 Subject: Re: [TUHS] c's comment X-BeenThere: tuhs@minnie.tuhs.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: The Unix Heritage Society mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: tuhs-bounces@minnie.tuhs.org Sender: "TUHS" --zrx5ihnykqhcfrlz Content-Type: text/plain; charset=us-ascii Content-Disposition: inline At 2018-12-07T04:27:41+0000, Caipenghui wrote: > Why can't c language comments be nested? What is the historical reason > for language design?Feeling awkward. I'm a callow youth compared to many on this list but I'll give it a try. My understanding is that it's not so much a historical reason[1] as a design choice motivated by ease of lexical analysis. As you may be aware, interpretation and compilation of programming languages often split into two parts: lexical analysis and semantic parsing. For instance, in int a = 1; A lexical analyzer breaks this input into several "tokens": * A type declarator; * A variable name; * An assignment operator; * A numerical literal (constant); * A statement separator; * A newline. Because we'll need it in a moment, here's another example: char s[] = "foobar"; /* initial value */ The tokens here are: * A type declarator; * A variable name; * An array indicator, which is really part of the type declarator (nobody said C was easy); * An assignment operator; * A string literal; * A statement separator; * A comment; * A newline. The lexical analyzer ("lexer") categorizes the tokens and hands them to a semantic parser to give them meaning and build a "machine" that will execute the code. (That "machine" is then translated into instructions that will run on a general-purpose computer, either in silicon or software, as we see in virtual machines like Java's.) There is such a thing as "lexerless parsing" which combines these two steps into one, but tokenization vs. semantic analysis remains a useful way to learn about how programs actually become things that execute. To answer your question, it is often desirable, because it can keep matters simple and comprehensible, to have a programming language that is easy to tokenize. Regular expressions are a popular means of doing tokenization, and the classic Unix tool "lex" has convenient support for this. (Its classic counterpart, a parser generator, is known as "yacc".) If you have experience with regular expressions you may realize that there are things that it is hard (or impossible[2]) for them to do. In classic C, there is only one type of comment. It starts with '/*' not inside a string literal, and continues until a '*/' is seen. It is a simple rule. If you wanted to nest comments, then the lexer would have to keep track of state--specifically, how many '/*'s it had seen, and pop one and only one of them for each '*/' it encounters. Furthermore, you have another design decision to make; should '*/' count to close a comment if it occurs inside a string literal inside a comment? People might comment out code containing string literals, after all, and then you have to worry about what those string literals might contain. 
Not only is it easier on a programmer's suffering brain to keep a
programming language lexically simple--see the recent thread on the
nightmare that is Fortran lexing, for instance--but it also affords
easier opportunities for things that are not language implementations
to lexically analyze your language.  A tremendously successful example
of this is "syntax" highlighting in text editors and IDE editor
windows, which mark up your code with pretty colors to help you
understand what you are doing.

At this point you may see, incidentally, why it is more correct to
call "syntax highlighting" lexical highlighting instead.  A
well-trained lexical analyzer can correctly tokenize and highlight
both of these:

    int a = 42;
    int a = "foobar";

But a syntax parser knows that a string literal cannot be assigned to
a variable of integral type--that's a syntax error.  It might be nice
if our text editors caught this kind of mistake, too, and for all I
know Eclipse or NetBeans does.  But doing so adds significantly more
machinery to the development environment.  In my experience, lexical
highlighting largely forecloses major categories of fumble-fingers or
rookie mistakes that used to linger until a compilation was attempted.
Back in the old days (1993!) a freshman programmer on SunOS 4 would be
subjected to a truly incomprehensible chain of compiler errors that
arose from a single lexical mistake like a missing semicolon.  With
the arms race of helpful compiler diagnostics currently going on
between LLVM and GCC, and with our newfangled text editors doing
lexical analysis and spraying terminal windows with avant-garde SGR
escapes to make things colorful, the learning process for C seems less
savage than it used to be.

If you'd like to learn more about lexing and parsing from a practical
perspective, with the fun of implementing your own C-like language
step by step, which you can then customize to your heart's content, I
recommend chapter 8 of _The Unix Programming Environment_, by
Kernighan and Pike, Prentice Hall, 1984.  I have to qualify that
recommendation a bit because you will have to do some work to port
traditional K&R C to ANSI C, and point out that these days people use
flex and bison (or flex and byacc) instead of lex and yacc, but if
you're a moderately seasoned C programmer who hasn't checked off the
"written a compiler" box, K&P hold one's hand pretty well through the
process.  It's how I got my feet wet; it taught me a lot, and was less
intimidating than Aho, Sethi, and Ullman.

I will venture that programming languages that are simple to parse
tend to be easier to learn and retain, and promote more uniformity in
presentation.  In spite of the feats on display in the IOCCC, and
interminable wars over brace style and whitespace, we see less
variation in source code layout in lexically simple languages than we
(historically) did in Fortran.  As much as I would love to have
another example of a hard-to-lex language, I don't know one.  As
others pointed out here, Backus knew the revolution when he saw it,
and swiftly chose the winning side.

I welcome correction on any of the above points by the sages on this
list.

Regards,
Branden

[1] A historical choice would be the BCPL comment style of '//',
    reintroduced in C++ and eventually admitted into C with the C99
    standard.  An ahistorical choice would have been using '@@' for
    this purpose, for instance.
[2] The identity between the CS regular languages and what can be
    recognized by "regular expression" implementations was broken long
    ago, and I am loath to make claims about what something like
    perlre can't do.