Re: Improve performance of makewhatis

tech@mandoc.bsd.lv
 help / color / mirror / Atom feed

From: Ingo Schwarze <schwarze@usta.de>
To: Wesley Moore <wes@wezm.net>
Cc: tech@mandoc.bsd.lv
Subject: Re: Improve performance of makewhatis
Date: Wed, 15 Nov 2023 18:36:57 +0100	[thread overview]
Message-ID: <ZVUBuQtSRxjQkBLd@asta-kit.de> (raw)
In-Reply-To: <1ff732ac-40f7-44e3-a140-ba9771cad72a@app.fastmail.com>

Hi Wesley,

Wesley Moore wrote on Wed, Nov 15, 2023 at 06:29:15PM +1000:
> On Wed, 15 Nov 2023, at 11:26 AM, Ingo Schwarze wrote:

[...]
>>  * It uses the mandoc(1) parsing and formatting engine by default
>>    since the same date.

> Yes (pretty much). It looks like it was added a day later
> https://github.com/chimera-linux/cports/commit/3c493733f7f0df345816cc69e273fca32fe70830

Ah, being unfamiliar with the Chimera Linux packaging system,
i did not find that commit.  I just added HTML comments referencing
the two commit hashes to  https://mandoc.bsd.lv/ports.html .

It appears the dates are already correct, though,
Github lists both commits as "Oct 30, 2021".

[...]
>> So i expect most users won't feel offended if installing new
>> software takes a few seconds.

> Perhaps that is the case but it bothered me enough to go to the
> effort of writing these patches...

:-)

> [snip]
>> So only 10-15% of the time is spent in roff_getstrn(), which cannot
>> possibly result in the 25-35% overall speedup you report, so something
>> looks fishy here.  Then again, even 25-35% is not a large gain.

> Those are the numbers I saw from my basic benchmarking measuring the
> wall clock time of makewhatis -Tutf -n across three runs and averaging
> the results for each machine. The raw data from my notes is available here:
> 
> https://gist.github.com/wezm/4043bde0dc2974c88b7706c60b58f900

The first "Before" lines for OpenBSD and Raspberry are apparently
off to the upper side, possibly because the pages were not yet
in the buffer cache and had to be physically read from disk.
But even excluding these, we have

OpenBSD    3.89 -> 2.71  -30%
Raspberry  7.18 -> 5.91  -18%
Torrent    2.65 -> 1.90  -38%

which is still way above what might be possible according to my
measurements.

I'm not accusing you of lying, btw.  :-)

This could be due to bugs in mandoc, bugs in your patch, mistakes
in your or my measurement methodology, differences in the samples
of manual page trees we used, differences in the operating systems,
C libraries, or compilers we used, or differences in CFLAGS.

None of these look particularly likely causes to me.  But that only
means i have to be extra careful to not commit anything that breaks
parsing or formatting, given that we are clearly not fully understanding
what is going on.

> [snip]
>> One other thing.  I hate patch series, don't ever send them.
>> They are usually a symptom of poor development methodology,
>> and usually i reject them without even inspecting them.
>> Here, i made an exception because your idea seems to have some merit.
>>
>> Every patch needs to
>>  * perform one well-defined task and
>>  * make the code better even if nothing else is ever committed
>>    building on it.
>>
>> Splitting a patch like you did it here only adds obfuscation
>> If a patch becomes too big for review, then the *task* it performs
>> needs to be split into logically separate steps, rather than
>> splitting code changes into mutiple patch files that essentially
>> all save the same purpose and are similar and closely related in
>> their content.  Such *logical* splitting can become tricky, but
>> i doubt that is needed here.

> Sorry about that. While I've been programming for a couple of decades
> the email based patch workflow is very new to me and different to all
> my experience to date.

Sure, when working for hire, it also happened to me that i had to cope
with dubious methodologies like patch series (which encourage sloppy
design, sloppy implementation, very sloppy code review or no code review
at all) or feature branches (which encourage bit rot, logical conflicts,
integration mishaps, and sloppy and delayed merging and testing on top
of the problems of patch series).  But even if most of the biomass in a
state comes in the form of cows, that doesn't imply cows are particularly
intelligent and everybody should strive for bovinity.  ;-)

> If I post any more changes I'll be sure to follow your suggestions.

Thanks!

>> I'm not yet completely sure what the best way forward is.
>> Probably something like:
>>  1. Identifying all places that do string lookup in tables or lists.
>>     Some can probably be left alone:
>>       arch.c, att.c, cgi.c, chars.c, eqn.c, mansearch.c,
>>       mdoc_argnames in mdoc.c, msec.c, st.c, regtab in roff.c,
>>       tbl_layout.c, tbl_opts.c
>>     But the others should probably be dealt with in a unified manner:
>>       reqtab, mdocmac, manmac  on the one hand
>>       strtab, rentab, pretab, xmbtab  on the other hand
>>       Not sure yet whether xtab can be included into xmbtab.
>>  2. Design and implement a common handling for these using ohash
>>     that is as simple as possible.

> When I have a moment I'll take a look through these and try to come up
> with an approach/plan and post it for feedback before making code changes.

After consulting my pillow, i believe the following may work:

 1st step:
   Change strtab, rentab, xmbtab, roff_freestr, roff_setstrn,
   roff_getstrn, and roff_strdup to use ohash.
   That's very close to what you did in your first two patches
   but avoids any additional complication.  It's one logical
   step because it changes the way struct roffkv is used.

That likely results in the main performance gain.  Once that is
committed, we can optionally consider further steps, one by one,
for example:

 2nd step:
   Implement pretab using the facilities from step 1
   similar to what you did in your third patch.
   That's independent because predefs[] is only used by
   roff_getstrn and does not use struct roffkv yet.

 3rd step:
   Maybe unify reqtab, mdocmac, and manmac in order to resolve
   the two ugly roff_name[] loops in roff_getstrn.
   That's likely independent, too, because it won't use
   struct roffkv even in the final state of the code.

Of course, i cannot guarantee that will work - to provide such a
guarantee, i would have to do the work myself.  ;-)
But it looks promising to me, well in line with your goal,
and similar enough to your original implementation.

By the way, *if* this ends up in an amount of code that is non-trivial
in the sense of Copyright, are you OK with having

 * Copyright (c) 2023 Wesley Moore <wes@wezm.net>

Added to the relevant files and to the LICENSE file in the portable
mandoc root directory?

Yours,
  Ingo
--
 To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv

     prev parent reply	other threads:[~2023-11-15 17:37 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-11-13  1:42 Wesley Moore
2023-11-13  1:42 ` [PATCH 1/3] Use ohash for strtab Wesley Moore
2023-11-13  1:42 ` [PATCH 2/3] Use ohash for rentab Wesley Moore
2023-11-13  1:42 ` [PATCH 3/3] Use ohash for predefined strings Wesley Moore
2023-11-15  1:26 ` Improve performance of makewhatis Ingo Schwarze
2023-11-15  8:29   ` Wesley Moore
2023-11-15 17:36     ` Ingo Schwarze [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZVUBuQtSRxjQkBLd@asta-kit.de \
    --to=schwarze@usta.de \
    --cc=tech@mandoc.bsd.lv \
    --cc=wes@wezm.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).