At 2018-12-07T04:27:41+0000, Caipenghui wrote:
> Why can't c language comments be nested? What is the historical
> reason for language design? Feeling awkward.

I'm a callow youth compared to many on this list, but I'll give it a
try.

My understanding is that it's not so much a historical reason[1] as a
design choice motivated by ease of lexical analysis.

As you may be aware, interpretation and compilation of programming
languages often split into two parts: lexical analysis and semantic
parsing.  For instance, in

    int a = 1;

a lexical analyzer breaks the input into several "tokens":

* A type declarator;
* A variable name;
* An assignment operator;
* A numerical literal (constant);
* A statement separator;
* A newline.

Because we'll need it in a moment, here's another example:

    char s[] = "foobar"; /* initial value */

The tokens here are:

* A type declarator;
* A variable name;
* An array indicator, which is really part of the type declarator
  (nobody said C was easy);
* An assignment operator;
* A string literal;
* A statement separator;
* A comment;
* A newline.

The lexical analyzer ("lexer") categorizes the tokens and hands them
to a semantic parser to give them meaning and to build a "machine"
that will execute the code.  (That "machine" is then translated into
instructions that will run on a general-purpose computer, either in
silicon or in software, as we see in virtual machines like Java's.)
A toy lexer in this spirit is sketched below.

There is such a thing as "lexerless parsing", which combines these
two steps into one, but tokenization vs. semantic analysis remains a
useful way to learn how programs actually become things that execute.

To answer your question: it is often desirable, because it keeps
matters simple and comprehensible, to have a programming language
that is easy to tokenize.  Regular expressions are a popular means of
doing tokenization, and the classic Unix tool "lex" has convenient
support for them.  (Its classic counterpart, a parser generator, is
known as "yacc".)  If you have experience with regular expressions,
you may realize that there are things that are hard (or
impossible[2]) for them to do--and recognizing arbitrarily deep
nesting is the textbook example, since counting balanced delimiters
requires unbounded memory, which takes a language out of the
"regular" class.

In classic C, there is only one type of comment.  It starts with a
'/*' not inside a string literal, and continues until a '*/' is
seen.  It is a simple rule.  If you wanted to nest comments, the
lexer would have to keep track of state--specifically, how many
'/*'s it had seen--and pop one and only one of them for each '*/' it
encounters.  (Both rules are sketched below.)  Furthermore, you
would have another design decision to make: should a '*/' close a
comment if it occurs inside a string literal inside a comment?
People might comment out code containing string literals, after all,
and then you have to worry about what those string literals might
contain.

Not only is it easier on a programmer's suffering brain to keep a
programming language lexically simple--see the recent thread on the
nightmare that is Fortran lexing, for instance--but it also affords
easier opportunities for things that are not language implementations
to lexically analyze your language.  A tremendously successful
example of this is "syntax" highlighting in text editors and IDE
editor windows, which marks up your code with pretty colors to help
you understand what you are doing.  At this point you may see,
incidentally, why it would be more correct to call "syntax
highlighting" lexical highlighting instead.
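To make the tokenization story concrete, here is a toy, hand-rolled
lexer loop in the spirit of the examples above.  It is a sketch of my
own devising, not how any real compiler does it, and it deliberately
lumps keywords and identifiers together as "words":

    #include <ctype.h>
    #include <stdio.h>

    /* Token categories for a toy declaration like "int a = 1;". */
    enum token { T_WORD, T_NUMBER, T_ASSIGN, T_SEMI, T_EOF };

    static const char *input;

    /* Classify the next token, advancing the input pointer. */
    static enum token next_token(void)
    {
        while (isspace((unsigned char)*input))
            input++;
        if (*input == '\0')
            return T_EOF;
        if (isalpha((unsigned char)*input) || *input == '_') {
            while (isalnum((unsigned char)*input) || *input == '_')
                input++;
            /* "int" or "a"; a real lexer would consult a keyword
               table here to tell declarators from names */
            return T_WORD;
        }
        if (isdigit((unsigned char)*input)) {
            while (isdigit((unsigned char)*input))
                input++;
            return T_NUMBER;
        }
        if (*input++ == '=')
            return T_ASSIGN;
        return T_SEMI;  /* everything else; here, only ';' */
    }

    int main(void)
    {
        static const char *name[] =
            { "word", "number", "assign", "semicolon", "eof" };
        enum token t;

        input = "int a = 1;";
        while ((t = next_token()) != T_EOF)
            printf("%s\n", name[t]);
        return 0;
    }

Fed "int a = 1;", it prints word, word, assign, number, semicolon,
one token per line, while never needing to know what any of it
means.  That last part is the parser's job.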
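The comment rule itself makes the state question vivid.  Here is a
sketch of both flavors--again my own invention, with made-up names
like skip_classic--showing that the classic rule needs no memory at
all, while the hypothetical nesting rule must carry a depth counter:

    #include <stdio.h>

    /* Classic C rule: the comment runs from the opening delimiter
       to the very next closing one.  No state to track; return a
       pointer just past the close (or the NUL if unterminated). */
    static const char *skip_classic(const char *p)
    {
        for (p += 2; *p != '\0'; p++)
            if (p[0] == '*' && p[1] == '/')
                return p + 2;
        return p;
    }

    /* Hypothetical nesting rule: count each opening delimiter and
       pop exactly one level per closing one.  The depth counter is
       the extra state the classic rule lives without--and we still
       haven't decided what to do about string literals. */
    static const char *skip_nested(const char *p)
    {
        int depth = 1;

        for (p += 2; *p != '\0' && depth > 0; p++) {
            if (p[0] == '/' && p[1] == '*') { depth++; p++; }
            else if (p[0] == '*' && p[1] == '/') { depth--; p++; }
        }
        return p;
    }

    int main(void)
    {
        const char *s = "/* outer /* inner */ rest */ code";

        printf("classic resumes at: \"%s\"\n", skip_classic(s));
        printf("nested resumes at:  \"%s\"\n", skip_nested(s));
        return 0;
    }

The classic rule resumes lexing at " rest */ code", which is exactly
what bites people who comment out a block that already contains a
comment: the first '*/' ends everything, and the remainder is
suddenly live code again.  The nesting rule resumes at " code".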
A well-trained lexical analyzer can correctly tokenize and highlight
both of these:

    int a = 42;
    int a = "foobar";

But a parser knows that a string literal cannot be assigned to a
variable of integral type--strictly a constraint violation rather
than a syntax error, and in any case invisible to the lexer.  It
might be nice if our text editors would catch this kind of mistake
too, and for all I know Eclipse or NetBeans does, but doing so adds
significantly more machinery to the development environment.  In my
experience, lexical highlighting largely forecloses the major
categories of fumble-fingered or rookie mistakes that used to linger
until a compilation was attempted.  Back in the old days (1993!), a
freshman programmer on SunOS 4 would be subjected to a truly
incomprehensible chain of compiler errors arising from a single
lexical mistake like a missing semicolon.  With the arms race of
helpful compiler diagnostics currently going on between LLVM and GCC,
and with our newfangled text editors doing lexical analysis and
spraying terminal windows with avant-garde SGR escapes to make things
colorful, the learning process for C seems less savage than it used
to be.

If you'd like to learn more about lexing and parsing from a practical
perspective, with the fun of implementing your own C-like language
step by step, which you can then customize to your heart's content, I
recommend chapter 8 of:

    _The Unix Programming Environment_, by Kernighan and Pike,
    Prentice Hall, 1984.

I have to qualify that recommendation a bit: you will have to do some
work to port its traditional K&R C to ANSI C, and these days people
use flex and bison (or flex and byacc) instead of lex and yacc.  But
if you're a moderately seasoned C programmer who hasn't checked off
the "written a compiler" box, K&P hold one's hand pretty well through
the process.  It's how I got my feet wet; it taught me a lot and was
less intimidating than Aho, Sethi, and Ullman.

I will venture that programming languages that are simple to parse
tend to be easier to learn and retain, and promote more uniformity in
presentation.  In spite of the feats on display in the IOCCC, and the
interminable wars over brace style and whitespace, we see less
variation in source code layout in lexically simple languages than we
(historically) did in Fortran.  As much as I would love to have
another example of a hard-to-lex language, I don't know one.  As
others have pointed out here, Backus knew the revolution when he saw
it, and swiftly chose the winning side.

I welcome correction on any of the above points by the sages on this
list.

Regards,
Branden

[1] A historical choice would be the BCPL comment style of '//',
    reintroduced in C++ and eventually admitted into C with the C99
    standard.  An ahistorical choice would have been using '@@' for
    the purpose, for instance.

[2] The identity between the regular languages of CS theory and what
    can be recognized by "regular expression" implementations was
    broken long ago, and I am loath to make claims about what
    something like perlre can't do.