Branden Robinson" To: tuhs@tuhs.org Message-ID: <20181207083154.ucpd6qim3ghclfhn@crack.deadbeast.net> References: <1C1BB8D2-9535-4A6D-96E8-3792F14BCBAE@163.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="zrx5ihnykqhcfrlz" Content-Disposition: inline In-Reply-To: <1C1BB8D2-9535-4A6D-96E8-3792F14BCBAE@163.com> User-Agent: NeoMutt/20180716 Subject: Re: [TUHS] c's comment X-BeenThere: tuhs@minnie.tuhs.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: The Unix Heritage Society mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: tuhs-bounces@minnie.tuhs.org Sender: "TUHS" --zrx5ihnykqhcfrlz Content-Type: text/plain; charset=us-ascii Content-Disposition: inline At 2018-12-07T04:27:41+0000, Caipenghui wrote: > Why can't c language comments be nested? What is the historical reason > for language design?Feeling awkward. I'm a callow youth compared to many on this list but I'll give it a try. My understanding is that it's not so much a historical reason[1] as a design choice motivated by ease of lexical analysis. As you may be aware, interpretation and compilation of programming languages often split into two parts: lexical analysis and semantic parsing. For instance, in int a = 1; A lexical analyzer breaks this input into several "tokens": * A type declarator; * A variable name; * An assignment operator; * A numerical literal (constant); * A statement separator; * A newline. Because we'll need it in a moment, here's another example: char s[] = "foobar"; /* initial value */ The tokens here are: * A type declarator; * A variable name; * An array indicator, which is really part of the type declarator (nobody said C was easy); * An assignment operator; * A string literal; * A statement separator; * A comment; * A newline. The lexical analyzer ("lexer") categorizes the tokens and hands them to a semantic parser to give them meaning and build a "machine" that will execute the code. (That "machine" is then translated into instructions that will run on a general-purpose computer, either in silicon or software, as we see in virtual machines like Java's.) There is such a thing as "lexerless parsing" which combines these two steps into one, but tokenization vs. semantic analysis remains a useful way to learn about how programs actually become things that execute. To answer your question, it is often desirable, because it can keep matters simple and comprehensible, to have a programming language that is easy to tokenize. Regular expressions are a popular means of doing tokenization, and the classic Unix tool "lex" has convenient support for this. (Its classic counterpart, a parser generator, is known as "yacc".) If you have experience with regular expressions you may realize that there are things that it is hard (or impossible[2]) for them to do. In classic C, there is only one type of comment. It starts with '/*' not inside a string literal, and continues until a '*/' is seen. It is a simple rule. If you wanted to nest comments, then the lexer would have to keep track of state--specifically, how many '/*'s it had seen, and pop one and only one of them for each '*/' it encounters. Furthermore, you have another design decision to make; should '*/' count to close a comment if it occurs inside a string literal inside a comment? People might comment out code containing string literals, after all, and then you have to worry about what those string literals might contain. 
Not only is it easier on a programmer's suffering brain to keep a
programming language lexically simple--see the recent thread on the
nightmare that is Fortran lexing, for instance--but it also affords
easier opportunities for things that are not language implementations
to lexically analyze your language.  A tremendously successful example
of this is "syntax" highlighting in text editors and IDE editor
windows, which mark up your code with pretty colors to help you
understand what you are doing.

At this point you may see, incidentally, why it is more correct to
call "syntax highlighting" lexical highlighting instead.  A
well-trained lexical analyzer can correctly tokenize and highlight
both of these:

    int a = 42;
    int a = "foobar";

But a syntax parser knows that a string literal cannot be assigned to
a variable of integral type--that's a syntax error.  It might be nice
if our text editors caught this kind of mistake, too, and for all I
know Eclipse or NetBeans does.  But doing so adds significantly more
machinery to the development environment.  In my experience, lexical
highlighting largely forecloses major categories of fumble-fingers or
rookie mistakes that used to linger until a compilation was attempted.
Back in the old days (1993!) a freshman programmer on SunOS 4 would be
subjected to a truly incomprehensible chain of compiler errors that
arose from a single lexical mistake like a missing semicolon.  With
the arms race of helpful compiler diagnostics currently going on
between LLVM and GCC, and with our newfangled text editors doing
lexical analysis and spraying terminal windows with avant-garde SGR
escapes to make things colorful, the learning process for C seems less
savage than it used to be.

If you'd like to learn more about lexing and parsing from a practical
perspective, with the fun of implementing your own C-like language
step by step, which you can then customize to your heart's content, I
recommend chapter 8 of _The Unix Programming Environment_, by
Kernighan and Pike, Prentice Hall, 1984.  I have to qualify that
recommendation a bit because you will have to do some work to port
traditional K&R C to ANSI C, and point out that these days people use
flex and bison (or flex and byacc) instead of lex and yacc, but if
you're a moderately seasoned C programmer who hasn't checked off the
"written a compiler" box, K&P hold one's hand pretty well through the
process.  It's how I got my feet wet; it taught me a lot, and was less
intimidating than Aho, Sethi, and Ullman.

I will venture that programming languages that are simple to parse
tend to be easier to learn and retain, and promote more uniformity in
presentation.  In spite of the feats on display in the IOCCC, and
interminable wars over brace style and whitespace, we see less
variation in source code layout in lexically simple languages than we
(historically) did in Fortran.  As much as I would love to have
another example of a hard-to-lex language, I don't know one.  As
others pointed out here, Backus knew the revolution when he saw it,
and swiftly chose the winning side.

I welcome correction on any of the above points by the sages on this
list.

Regards,
Branden

[1] A historical choice would be the BCPL comment style of '//',
    reintroduced in C++ and eventually admitted into C with the C99
    standard.  An ahistorical choice would have been using '@@' for
    this purpose, for instance.
[2] The identity between the CS regular languages and what can be
    recognized by "regular expression" implementations was broken long
    ago, and I am loath to make claims about what something like
    perlre can't do.