From: Caipenghui <Caipenghui_c@163.com>
To: tuhs@minnie.tuhs.org,
"G. Branden Robinson" <g.branden.robinson@gmail.com>,
tuhs@tuhs.org
Subject: Re: [TUHS] c's comment
Date: Fri, 07 Dec 2018 09:53:14 +0000 [thread overview]
Message-ID: <37B2A52F-43FC-4B2B-970F-170318D89135@163.com> (raw)
In-Reply-To: <20181207083154.ucpd6qim3ghclfhn@crack.deadbeast.net>
On December 7, 2018, 8:31:56 AM UTC, "G. Branden Robinson" <g.branden.robinson@gmail.com> wrote:
> At 2018-12-07T04:27:41+0000, Caipenghui wrote:
> > Why can't C language comments be nested? What is the historical
> > reason for the language design? It feels awkward.
>
> I'm a callow youth compared to many on this list but I'll give it a
> try.
>
> My understanding is that it's not so much a historical reason[1] as a
> design choice motivated by ease of lexical analysis.
>
> As you may be aware, interpretation and compilation of programming
> languages often split into two parts: lexical analysis and semantic
> parsing.
>
> For instance, in
>
> int a = 1;
>
> A lexical analyzer breaks this input into several "tokens":
> * A type declarator;
> * A variable name;
> * An assignment operator;
> * A numerical literal (constant);
> * A statement separator;
> * A newline.
>
> Because we'll need it in a moment, here's another example:
>
> char s[] = "foobar"; /* initial value */
>
> The tokens here are:
> * A type declarator;
> * A variable name;
> * An array indicator, which is really part of the type declarator
> (nobody said C was easy);
> * An assignment operator;
> * A string literal;
> * A statement separator;
> * A comment;
> * A newline.
>
> The lexical analyzer ("lexer") categorizes the tokens and hands them to
> a semantic parser to give them meaning and build a "machine" that will
> execute the code. (That "machine" is then translated into instructions
> that will run on a general-purpose computer, either in silicon or
> software, as we see in virtual machines like Java's.)
>
> There is such a thing as "lexerless parsing" which combines these two
> steps into one, but tokenization vs. semantic analysis remains a useful
> way to learn about how programs actually become things that execute.
>
> To answer your question, it is often desirable, because it can keep
> matters simple and comprehensible, to have a programming language that
> is easy to tokenize. Regular expressions are a popular means of doing
> tokenization, and the classic Unix tool "lex" has convenient support
> for this. (Its classic counterpart, a parser generator, is known as
> "yacc".)
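[For what it's worth, in lex notation the classic comment rule fits on one line. This is only a sketch -- the token names (WORD, NUMBER, and so on) are invented here, and in a real scanner they would come from yacc's generated header:]

```lex
%%
"/*"([^*]|\*+[^*/])*\*+"/"   ; /* classic C comment: recognized, discarded */
[A-Za-z_][A-Za-z0-9_]*       { return WORD;   }
[0-9]+                       { return NUMBER; }
\"[^"]*\"                    { return STRING; }
"="                          { return ASSIGN; }
";"                          { return SEMI;   }
[ \t\n]+                     ; /* skip whitespace */
%%
```

[The comment pattern can only end at the first '*/' it reaches; counting nested openers is precisely the sort of balanced matching that a single regular expression cannot do.]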
>
> If you have experience with regular expressions you may realize that
> there are things that it is hard (or impossible[2]) for them to do.
>
> In classic C, there is only one type of comment. It starts with '/*'
> not inside a string literal, and continues until a '*/' is seen.
>
> It is a simple rule. If you wanted to nest comments, then the lexer
> would have to keep track of state--specifically, how many '/*'s it had
> seen, and pop one and only one of them for each '*/' it encounters.
>
> Furthermore, you have another design decision to make: should '*/'
> count to close a comment if it occurs inside a string literal inside a
> comment? People might comment out code containing string literals,
> after all, and then you have to worry about what those string literals
> might contain.
>
> Not only is it easier on a programmer's suffering brain to keep a
> programming language lexically simple--see the recent thread on the
> nightmare that is Fortran lexing, for instance--but it also affords
> easier opportunities for things that are not language implementations
> to lexically analyze your language.
>
> A tremendously successful example of this is "syntax" highlighting in
> text editors and IDE editor windows, which mark up your code with
> pretty colors to help you understand what you are doing.
>
> At this point you may see, incidentally, why it is more correct to
> call "syntax highlighting" lexical highlighting instead.
>
> A well-trained lexical analyzer can correctly tokenize and highlight:
>
> int a = 42;
> int a = "foobar";
>
> But a syntax parser knows that a string literal cannot be assigned to
> a variable of integral type--that's a syntax error. It might be nice
> if our text editors would catch this kind of mistake, too, and for all
> I know Eclipse or NetBeans does. But doing so adds significantly more
> machinery to the development environment. In my experience, lexical
> highlighting largely forecloses major categories of fumble-fingers or
> rookie mistakes that used to linger until a compilation was attempted.
> Back in the old days (1993!) a freshman programmer on SunOS 4 would be
> subjected to a truly incomprehensible chain of compiler errors that
> arose from a single lexical mistake like a missing semicolon. With the
> arms race of helpful compiler diagnostics currently going on between
> LLVM and GCC, and with our newfangled text editors doing lexical
> analysis and spraying terminal windows with avant-garde SGR escapes
> making things colorful, the learning process for C seems less savage
> than it used to be.
>
> If you'd like to learn more about lexing and parsing from a practical
> perspective, with the fun of implementing your own C-like language
> step-by-step which you can then customize to your heart's content, I
> recommend chapter 8 of:
>
> _The Unix Programming Environment_, by Kernighan and Pike,
> Prentice Hall 1984.
>
> I have to qualify that recommendation a bit because you will have to
> do some work to port traditional K&R C to ANSI C, and point out that
> these days people use flex and bison (or flex and byacc) instead of
> lex and yacc, but if you're a moderately seasoned C programmer who
> hasn't checked off the "written a compiler" box, K&P hold one's hand
> pretty well through the process. It's how I got my feet wet, it
> taught me a lot, and it was less intimidating than Aho, Sethi, and
> Ullman.
>
> I will venture that programming languages that are simple to parse
> tend to be easier to learn and retain, and promote more uniformity in
> presentation. In spite of the feats on display in the IOCCC, and
> interminable wars over brace style and whitespace, we see less
> variation in source code layout in lexically simple languages than we
> (historically) did in Fortran. As much as I would love to have
> another example of a hard-to-lex language, I don't know one. As
> others pointed out here, Backus knew the revolution when he saw it,
> and swiftly chose the winning side.
>
> I welcome correction on any of the above points by the sages on this
> list.
>
> Regards,
> Branden
>
> [1] A historical choice would be the BCPL comment style of '//',
> reintroduced in C++ and eventually admitted into C with the C99
> standard. An ahistorical choice would have been using '@@' for this
> purpose, for instance.
>
> [2] The identity between the CS regular languages and what can be
> recognized by "regular expression" implementations was broken long
> ago, and I am loath to make claims about what something like perlre
> can't do.
Thank you for your wonderful comments. I wonder what Dennis Ritchie thought: why not put the comment content on the first line of the comment?
First example:
/* hello world
*
*
*/
Second example:
/*
* hello world
*
*/
To be honest, this looks a bit uncomfortable to perfectionists. As Dennis once said, "You can't understand this."
Caipenghui
Thread overview: 5+ messages
2018-12-07 4:27 Caipenghui
2018-12-07 8:31 ` G. Branden Robinson
2018-12-07 9:53 ` Caipenghui [this message]
2018-12-07 14:06 ` Michael Kjörling
2018-12-07 15:01 Richard Tobin