tech@mandoc.bsd.lv
 help / color / mirror / Atom feed
From: Michael Stapelberg <stapelberg@debian.org>
To: tech@mdocml.bsd.lv
Subject: Re: [PATCH] Implement -u (UNIX socket batch processing)
Date: Sun, 5 Feb 2017 16:23:05 +0100	[thread overview]
Message-ID: <CANnVG6k9JEQVYtLPLPGN4TrBUAkNWjerTo2F7LTm_A=uS9_PSg@mail.gmail.com> (raw)
In-Reply-To: <20170204125755.GB12010@athene.usta.de>

[-- Attachment #1: Type: text/plain, Size: 7167 bytes --]

Thanks for the thoughtful response, replies inline:

On Sat, Feb 4, 2017 at 1:57 PM, Ingo Schwarze <schwarze@usta.de> wrote:
> Hi Michael,
>
> On Sat, Jan 21, 2017 at 1:29 PM, Michael Stapelberg wrote:
>> Quoting from the commit message for your convenience:
>
>>> The option argument identifies the file descriptor number of a UNIX
>>> listening socket which was inherited from the parent process when said
>>> parent exec'd mandoc.
>>>
>>> mandoc will accept one connection at a time on that socket. As long as
>>> the connection lives, mandoc reads 1-byte messages with ancillary data
>>> (control information) from the connection. The ancillary data is used to
>>> pass three file descriptors from the parent process to mandoc. These
>>> file descriptors are then dup2'ed to fd 0, fd 1 and fd 2, respectively.
>>>
>>> After the file descriptors have been set up, a single manpage is read
>>> from stdin and processed to stdout/stderr.
>>>
>>> This change effectively adds a batch processing mode to mandoc, which
>>> eliminates the fork+exec overhead that otherwise accompanies the
>>> conversion of a large number of manpages. The resulting speed-up has
>>> been observed to exceed a factor of 6, reducing the total conversion
>>> wall-clock time of about a 470k manpages from more than 2 hours to
>>> a mere 22 minutes.
>
>> To expand on the rationale: quicker conversion
>>
>> 1. allows us to provide our users more recent manpages (we can run
>> more frequently)
>> 2. hogs fewer resources which can instead be used for actually serving
>> manpages
>> 3. massively simplifies our disaster recovery story: we can just
>> re-generate the entire repository within a small number of hours,
>> instead of facing days of downtime
>
> I think i understand why you need this.  I repeatedly thought about
> it and came to the conclusion that there is little to improve with
> the basic concept.  In particular, opening the file descriptors,
> in particular those for writing, in the controlling process is the
> right thing to do.  We do not want the same process that is doing
> the parsing and formatting to open files for writing.  That would
> be unsafe in case of programming errors, and parsers are notoriously
> buggy.  So your original idea of passing the descriptors is clearly
> better than my first idea of passing filenames instead.
>
>> As per http://man.openbsd.org/OpenBSD-current/man3/CMSG_DATA.3,
>> control information has been around since 4.2BSD, so support across
>> operating systems should be good.
>
> Yes, but i don't like the idea of clobbering that into main.c.
> At some point, i will have to do privilege separation in mandoc.
> That will almost certainly use the OpenBSD imsg library, which
> internally uses cmsg.  It will not be a protability problem: imsg
> is very small and can easily be bundled for systems not having it,
> exactly like ohash.
>
> But if i would commit your code now, it would conflict later, using
> cmsg on two different levels.  Besides, even though your code is
> required for your purpose, and even though it may also help speeding
> up bulk comparisons that's i'm sometimes doing to catch regressions,
> it is not useful for the general public, so it does not belong into
> main.c.  Just like the code in cgi.c, only a handful of server
> administrators will need it.
>
> Besides, for the daemon you need, main.c is mostly the wrong codebase.
> It contains huge amounts of code that, in your server, are just in
> the way: many command line options for interactive use, makewhatis(8)
> integration, database and search functionality, pager and process
> group control, file system lookup, preformatted passthrough...
>
> So your code should be in a separate program.  I have called it
> mandocd.c.  It now has 270 lines (compare to the 1108 in main.c),
> your read_fds() is the largest non-main function, and even in main(),
> the meat of the code is what you wrote, plus relatively small glue.

That looks good, thanks for merging the patch!

>
>
> There was one major problem with what you wrote.  You used abstract
> sockets, and that is non-portable and Linux only.  On other systems,
> you cannot call bind(2) with a name starting in '\0'.  To be portable,
> you have to create the socket with socketpair(2) and without bind(2),
> listen(2), and accept(2).  The fact that you cannot reconnect when
> you lose the connection may seem like a downside at first, but i don't
> think it's really any inconvenience.  If the controlling process
> wants to reconnect for whatever reason, it can just as well spawn
> a new mandocd.  I'm not even sure that your client code contained
> support for reconnecting, even though your server side supported it.

Agreed. I added reconnects in the early stages of development of this
patch, when I was still using a named socket. With the feature in its
current form, I agree that we do not need reconnects.

>
> If you absolutely need reconnection capability, we have to use a
> named socket in the file system.  But i'd rather not do that if we
> can avoid it.  It just causes security grief because we would have
> to protect the socket against unauthorized access.
>
> I committed new files mandocd.c and catman.c to bsd.lv such that
> we can work on them in the tree.  catman.c is an improved version
> of your demo.c that actually does something useful: Try making
> directories src/ and dst/, put some mdoc(7) and/or man(7) manual
> pages into src/, and run
>
>   ./catman src dst
>
> or
>
>   ./catman -T html src dst
>
> Formatted versions appear in dst/, with the same relative file
> names.  Of course, this is not polished.  For example, if you use
> any subdirs in src/, you have to create the same subdirs in dst/
> by hand - i'll probably fix that, and other quirks, later.
>
> You didn't have a license on demo.c, but i assume you intended to
> license it under the same license as your patch to main.c.

Yes, you are correct.

> Please review the Copyright/license headers in the two files
> and protest if i misunderstood what you intended.
>
>
> The important questions right now are:
>
>  * Does the concept of a stand-alone file mandocd.c
>    and a "mandocd" executable separate from "mandoc"
>    work for you?

Absolutely.

>
>  * Does using socketpair(2) instead of socket(2), bind(2), listen(2),
>    accept(2) work for you?

Yes.

I went ahead and changed debiman to use this interface (and I realize
you still marked it as experimental, so I’ll gladly update when
necessary): https://github.com/Debian/debiman/commit/3715b1eaf9c1793b9a8c7b1787e2d6511ca2b004

>
> Of course, any other comments on the committed code are also
> welcome.

I added a tiny patch which includes stdint.h which is required to find
uint8_t on Linux (fixes compilation of mandocd).

>
> Right now, i don't plan to merge this back to VERSION_1_13.
> You don't need SQLite3 support anyway, so you should be fine
> switching to mandoc-1.14.1 when it comes out.

Yep, sounds good.

-- 
Best regards,
Michael

[-- Attachment #2: 0001-mandocd-add-stdint.h-to-fix-compilation-on-Linux.patch --]
[-- Type: text/x-patch, Size: 532 bytes --]

From ed80ba44eedd0543f9da9b1e231204262121fbd8 Mon Sep 17 00:00:00 2001
From: Michael Stapelberg <stapelberg@debian.org>
Date: Sun, 5 Feb 2017 15:39:46 +0100
Subject: [PATCH] mandocd: add stdint.h to fix compilation on Linux

---
 mandocd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mandocd.c b/mandocd.c
index fce2812..406e991 100644
--- a/mandocd.c
+++ b/mandocd.c
@@ -28,6 +28,7 @@
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
+#include <stdint.h>
 
 #include "mandoc.h"
 #include "roff.h"
-- 
2.11.0


      reply	other threads:[~2017-02-05 15:23 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-01-21 12:29 Michael Stapelberg
2017-01-28 13:02 ` Michael Stapelberg
2017-02-03 18:07   ` Ingo Schwarze
2017-02-04 12:57   ` Ingo Schwarze
2017-02-05 15:23     ` Michael Stapelberg [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CANnVG6k9JEQVYtLPLPGN4TrBUAkNWjerTo2F7LTm_A=uS9_PSg@mail.gmail.com' \
    --to=stapelberg@debian.org \
    --cc=tech@mdocml.bsd.lv \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).