From: Caipenghui <Caipenghui_c@163.com>
To: tuhs@minnie.tuhs.org,
"G. Branden Robinson" <g.branden.robinson@gmail.com>,
tuhs@tuhs.org
Subject: Re: [TUHS] c's comment
Date: Fri, 07 Dec 2018 09:53:14 +0000 [thread overview]
Message-ID: <37B2A52F-43FC-4B2B-970F-170318D89135@163.com> (raw)
In-Reply-To: <20181207083154.ucpd6qim3ghclfhn@crack.deadbeast.net>
On December 7, 2018, 8:31:56 AM UTC, "G. Branden Robinson" <g.branden.robinson@gmail.com> wrote:
> At 2018-12-07T04:27:41+0000, Caipenghui wrote:
> > Why can't C language comments be nested? What is the historical
> > reason for the language design? It feels awkward.
>
> I'm a callow youth compared to many on this list but I'll give it a
> try.
>
> My understanding is that it's not so much a historical reason[1] as a
> design choice motivated by ease of lexical analysis.
>
> As you may be aware, interpretation and compilation of programming
> languages often split into two parts: lexical analysis and semantic
> parsing.
>
> For instance, in
>
> int a = 1;
>
> A lexical analyzer breaks this input into several "tokens":
> * A type declarator;
> * A variable name;
> * An assignment operator;
> * A numerical literal (constant);
> * A statement separator;
> * A newline.
>
> Because we'll need it in a moment, here's another example:
>
> char s[] = "foobar"; /* initial value */
>
> The tokens here are:
> * A type declarator;
> * A variable name;
> * An array indicator, which is really part of the type declarator
> (nobody said C was easy);
> * An assignment operator;
> * A string literal;
> * A statement separator;
> * A comment;
> * A newline.
>
> The lexical analyzer ("lexer") categorizes the tokens and hands them to
> a semantic parser to give them meaning and build a "machine" that will
> execute the code. (That "machine" is then translated into instructions
> that will run on a general-purpose computer, either in silicon or
> software, as we see in virtual machines like Java's.)
>
> There is such a thing as "lexerless parsing" which combines these two
> steps into one, but tokenization vs. semantic analysis remains a useful
> way to learn about how programs actually become things that execute.
>
> To answer your question, it is often desirable, because it can keep
> matters simple and comprehensible, to have a programming language that
> is easy to tokenize. Regular expressions are a popular means of doing
> tokenization, and the classic Unix tool "lex" has convenient support
> for this. (Its classic counterpart, a parser generator, is known as
> "yacc".)
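[For what it's worth, in lex notation the classic comment rule fits on one line. This is only a sketch -- the token names (WORD, NUMBER, and so on) are invented here, and in a real scanner they would come from yacc's generated header:]

```lex
%%
"/*"([^*]|\*+[^*/])*\*+"/"   ; /* classic C comment: recognized, discarded */
[A-Za-z_][A-Za-z0-9_]*       { return WORD;   }
[0-9]+                       { return NUMBER; }
\"[^"]*\"                    { return STRING; }
"="                          { return ASSIGN; }
";"                          { return SEMI;   }
[ \t\n]+                     ; /* skip whitespace */
%%
```

[The comment pattern can only end at the first '*/' it reaches; counting nested openers is precisely the sort of balanced matching that a single regular expression cannot do.]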
>
> If you have experience with regular expressions you may realize that
> there are things that it is hard (or impossible[2]) for them to do.
>
> In classic C, there is only one type of comment. It starts with '/*'
> not inside a string literal, and continues until a '*/' is seen.
>
> It is a simple rule. If you wanted to nest comments, then the lexer
> would have to keep track of state--specifically, how many '/*'s it had
> seen, and pop one and only one of them for each '*/' it encounters.
>
> Furthermore, you have another design decision to make: should '*/'
> count to close a comment if it occurs inside a string literal inside a
> comment? People might comment out code containing string literals,
> after all, and then you have to worry about what those string literals
> might contain.
>
> Not only is it easier on a programmer's suffering brain to keep a
> programming language lexically simple--see the recent thread on the
> nightmare that is Fortran lexing, for instance--but it also affords
> easier opportunities for things that are not language implementations
> to lexically analyze your language.
>
> A tremendously successful example of this is "syntax" highlighting in
> text editors and IDE editor windows, which mark up your code with
> pretty colors to help you understand what you are doing.
>
> At this point you may see, incidentally, why it is more correct to
> call "syntax highlighting" lexical highlighting instead.
>
> A well-trained lexical analyzer can correctly tokenize and highlight:
>
> int a = 42;
> int a = "foobar";
>
> But a syntax parser knows that a string literal cannot be assigned to
> a variable of integral type--that's a syntax error. It might be nice
> if our text editors would catch this kind of mistake, too, and for all
> I know Eclipse or NetBeans does. But doing so adds significantly more
> machinery to the development environment. In my experience, lexical
> highlighting largely forecloses major categories of fumble-fingers or
> rookie mistakes that used to linger until a compilation was attempted.
> Back in the old days (1993!) a freshman programmer on SunOS 4 would be
> subjected to a truly incomprehensible chain of compiler errors that
> arose from a single lexical mistake like a missing semicolon. With the
> arms race of helpful compiler diagnostics currently going on between
> LLVM and GCC, and with our newfangled text editors doing lexical
> analysis and spraying terminal windows with avant-garde SGR escapes
> making things colorful, the learning process for C seems less savage
> than it used to be.
>
> If you'd like to learn more about lexing and parsing from a practical
> perspective, with the fun of implementing your own C-like language
> step-by-step which you can then customize to your heart's content, I
> recommend chapter 8 of:
>
> _The Unix Programming Environment_, by Kernighan and Pike,
> Prentice Hall 1984.
>
> I have to qualify that recommendation a bit because you will have to
> do some work to port traditional K&R C to ANSI C, and point out that
> these days people use flex and bison (or flex and byacc) instead of
> lex and yacc, but if you're a moderately seasoned C programmer who
> hasn't checked off the "written a compiler" box, K&P hold one's hand
> pretty well through the process. It's how I got my feet wet, it
> taught me a lot, and it was less intimidating than Aho, Sethi, and
> Ullman.
>
> I will venture that programming languages that are simple to parse
> tend to be easier to learn and retain, and promote more uniformity in
> presentation. In spite of the feats on display in the IOCCC, and
> interminable wars over brace style and whitespace, we see less
> variation in source code layout in lexically simple languages than we
> (historically) did in Fortran. As much as I would love to have
> another example of a hard-to-lex language, I don't know one. As
> others pointed out here, Backus knew the revolution when he saw it,
> and swiftly chose the winning side.
>
> I welcome correction on any of the above points by the sages on this
> list.
>
> Regards,
> Branden
>
> [1] A historical choice would be the BCPL comment style of '//',
> reintroduced in C++ and eventually admitted into C with the C99
> standard. An ahistorical choice would have been using '@@' for this
> purpose, for instance.
>
> [2] The identity between the CS regular languages and what can be
> recognized by "regular expression" implementations was broken long
> ago, and I am loath to make claims about what something like perlre
> can't do.
Thank you for your wonderful comments. I wonder what Dennis Ritchie thought: why not put the comment content on the first line of the comment?
First example:
/* hello world
*
*
*/
Second example:
/*
* hello world
*
*/
To be honest, this looks a bit uncomfortable to perfectionists. As Dennis once said, "You can't understand this."
Caipenghui
Thread overview: 5+ messages
2018-12-07 4:27 Caipenghui
2018-12-07 8:31 ` G. Branden Robinson
2018-12-07 9:53 ` Caipenghui [this message]
2018-12-07 14:06 ` Michael Kjörling
2018-12-07 15:01 Richard Tobin