From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from scc-mailout-kit-02.scc.kit.edu (scc-mailout-kit-02.scc.kit.edu [129.13.231.82]) by fantadrom.bsd.lv (OpenSMTPD) with ESMTP id a2555a27 for ; Sat, 4 Feb 2017 07:58:01 -0500 (EST) Received: from asta-nat.asta.uni-karlsruhe.de ([172.22.63.82] helo=hekate.usta.de) by scc-mailout-kit-02.scc.kit.edu with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (envelope-from ) id 1cZzue-0003QT-O1; Sat, 04 Feb 2017 13:58:00 +0100 Received: from donnerwolke.usta.de ([172.24.96.3]) by hekate.usta.de with esmtp (Exim 4.77) (envelope-from ) id 1cZzud-0004VL-N6; Sat, 04 Feb 2017 13:57:55 +0100 Received: from athene.usta.de ([172.24.96.10]) by donnerwolke.usta.de with esmtp (Exim 4.84_2) (envelope-from ) id 1cZzud-0000Vj-As; Sat, 04 Feb 2017 13:57:55 +0100 Received: from localhost (athene.usta.de [local]) by athene.usta.de (OpenSMTPD) with ESMTPA id c1e29eae; Sat, 4 Feb 2017 13:57:55 +0100 (CET) Date: Sat, 4 Feb 2017 13:57:55 +0100 From: Ingo Schwarze To: Michael Stapelberg Cc: tech@mdocml.bsd.lv Subject: Re: [PATCH] Implement -u (UNIX socket batch processing) Message-ID: <20170204125755.GB12010@athene.usta.de> References: X-Mailinglist: mdocml-tech Reply-To: tech@mdocml.bsd.lv MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.6.2 (2016-07-01) Hi Michael, On Sat, Jan 21, 2017 at 1:29 PM, Michael Stapelberg wrote: > Quoting from the commit message for your convenience: >> The option argument identifies the file descriptor number of a UNIX >> listening socket which was inherited from the parent process when said >> parent exec'd mandoc. >> >> mandoc will accept one connection at a time on that socket. As long as >> the connection lives, mandoc reads 1-byte messages with ancillary data >> (control information) from the connection. The ancillary data is used to >> pass three file descriptors from the parent process to mandoc. These >> file descriptors are then dup2'ed to fd 0, fd 1 and fd 2, respectively. >> >> After the file descriptors have been set up, a single manpage is read >> from stdin and processed to stdout/stderr. >> >> This change effectively adds a batch processing mode to mandoc, which >> eliminates the fork+exec overhead that otherwise accompanies the >> conversion of a large number of manpages. The resulting speed-up has >> been observed to exceed a factor of 6, reducing the total conversion >> wall-clock time of about a 470k manpages from more than 2 hours to >> a mere 22 minutes. > To expand on the rationale: quicker conversion > > 1. allows us to provide our users more recent manpages (we can run > more frequently) > 2. hogs fewer resources which can instead be used for actually serving > manpages > 3. massively simplifies our disaster recovery story: we can just > re-generate the entire repository within a small number of hours, > instead of facing days of downtime I think i understand why you need this. I repeatedly thought about it and came to the conclusion that there is little to improve with the basic concept. In particular, opening the file descriptors, in particular those for writing, in the controlling process is the right thing to do. We do not want the same process that is doing the parsing and formatting to open files for writing. That would be unsafe in case of programming errors, and parsers are notoriously buggy. So your original idea of passing the descriptors is clearly better than my first idea of passing filenames instead. > As per http://man.openbsd.org/OpenBSD-current/man3/CMSG_DATA.3, > control information has been around since 4.2BSD, so support across > operating systems should be good. Yes, but i don't like the idea of clobbering that into main.c. At some point, i will have to do privilege separation in mandoc. That will almost certainly use the OpenBSD imsg library, which internally uses cmsg. It will not be a protability problem: imsg is very small and can easily be bundled for systems not having it, exactly like ohash. But if i would commit your code now, it would conflict later, using cmsg on two different levels. Besides, even though your code is required for your purpose, and even though it may also help speeding up bulk comparisons that's i'm sometimes doing to catch regressions, it is not useful for the general public, so it does not belong into main.c. Just like the code in cgi.c, only a handful of server administrators will need it. Besides, for the daemon you need, main.c is mostly the wrong codebase. It contains huge amounts of code that, in your server, are just in the way: many command line options for interactive use, makewhatis(8) integration, database and search functionality, pager and process group control, file system lookup, preformatted passthrough... So your code should be in a separate program. I have called it mandocd.c. It now has 270 lines (compare to the 1108 in main.c), your read_fds() is the largest non-main function, and even in main(), the meat of the code is what you wrote, plus relatively small glue. There was one major problem with what you wrote. You used abstract sockets, and that is non-portable and Linux only. On other systems, you cannot call bind(2) with a name starting in '\0'. To be portable, you have to create the socket with socketpair(2) and without bind(2), listen(2), and accept(2). The fact that you cannot reconnect when you lose the connection may seem like a downside at first, but i don't think it's really any inconvenience. If the controlling process wants to reconnect for whatever reason, it can just as well spawn a new mandocd. I'm not even sure that your client code contained support for reconnecting, even though your server side supported it. If you absolutely need reconnection capability, we have to use a named socket in the file system. But i'd rather not do that if we can avoid it. It just causes security grief because we would have to protect the socket against unauthorized access. I committed new files mandocd.c and catman.c to bsd.lv such that we can work on them in the tree. catman.c is an improved version of your demo.c that actually does something useful: Try making directories src/ and dst/, put some mdoc(7) and/or man(7) manual pages into src/, and run ./catman src dst or ./catman -T html src dst Formatted versions appear in dst/, with the same relative file names. Of course, this is not polished. For example, if you use any subdirs in src/, you have to create the same subdirs in dst/ by hand - i'll probably fix that, and other quirks, later. You didn't have a license on demo.c, but i assume you intended to license it under the same license as your patch to main.c. Please review the Copyright/license headers in the two files and protest if i misunderstood what you intended. The important questions right now are: * Does the concept of a stand-alone file mandocd.c and a "mandocd" executable separate from "mandoc" work for you? * Does using socketpair(2) instead of socket(2), bind(2), listen(2), accept(2) work for you? Of course, any other comments on the committed code are also welcome. Right now, i don't plan to merge this back to VERSION_1_13. You don't need SQLite3 support anyway, so you should be fine switching to mandoc-1.14.1 when it comes out. Yours, Ingo -- To unsubscribe send an email to tech+unsubscribe@mdocml.bsd.lv