tech@mandoc.bsd.lv

mandoc does not implement Eq

2024-04-14T17:54:41Z

Hi,

mandoc does not implement the mdoc(7) Eq macro.

This was discussed recently in the got(1) IRC channel, where it was
noticed that the literal string 'Eq' needed to be escaped in the
manpage, but only when viewed with groff.

groff_mdoc(7) says:

   .Eq   The first and second arguments of this macro are the
         opening and closing strings respectively, followed by the
         arguments to be enclosed.

-- 
Anthony J. Bentley
--
 To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv

Re: Improve performance of makewhatis

2023-11-15T17:37:02Z

Hi Wesley,

Wesley Moore wrote on Wed, Nov 15, 2023 at 06:29:15PM +1000:
> On Wed, 15 Nov 2023, at 11:26 AM, Ingo Schwarze wrote:

[...]
>>  * It uses the mandoc(1) parsing and formatting engine by default
>>    since the same date.

> Yes (pretty much). It looks like it was added a day later
> https://github.com/chimera-linux/cports/commit/3c493733f7f0df345816cc69e273fca32fe70830

Ah, being unfamiliar with the Chimera Linux packaging system,
i did not find that commit.  I just added HTML comments referencing
the two commit hashes to  https://mandoc.bsd.lv/ports.html .

It appears the dates are already correct, though,
Github lists both commits as "Oct 30, 2021".

[...]
>> So i expect most users won't feel offended if installing new
>> software takes a few seconds.

> Perhaps that is the case but it bothered me enough to go to the
> effort of writing these patches...

:-)

> [snip]
>> So only 10-15% of the time is spent in roff_getstrn(), which cannot
>> possibly result in the 25-35% overall speedup you report, so something
>> looks fishy here.  Then again, even 25-35% is not a large gain.

> Those are the numbers I saw from my basic benchmarking measuring the
> wall clock time of makewhatis -Tutf -n across three runs and averaging
> the results for each machine. The raw data from my notes is available here:
> 
> https://gist.github.com/wezm/4043bde0dc2974c88b7706c60b58f900

The first "Before" lines for OpenBSD and Raspberry are apparently
off to the upper side, possibly because the pages were not yet
in the buffer cache and had to be physically read from disk.
But even excluding these, we have

OpenBSD    3.89 -> 2.71  -30%
Raspberry  7.18 -> 5.91  -18%
Torrent    2.65 -> 1.90  -38%

which is still way above what might be possible according to my
measurements.

I'm not accusing you of lying, btw.  :-)

This could be due to bugs in mandoc, bugs in your patch, mistakes
in your or my measurement methodology, differences in the samples
of manual page trees we used, differences in the operating systems,
C libraries, or compilers we used, or differences in CFLAGS.

None of these look particularly likely causes to me.  But that only
means i have to be extra careful to not commit anything that breaks
parsing or formatting, given that we are clearly not fully understanding
what is going on.

> [snip]
>> One other thing.  I hate patch series, don't ever send them.
>> They are usually a symptom of poor development methodology,
>> and usually i reject them without even inspecting them.
>> Here, i made an exception because your idea seems to have some merit.
>>
>> Every patch needs to
>>  * perform one well-defined task and
>>  * make the code better even if nothing else is ever committed
>>    building on it.
>>
>> Splitting a patch like you did it here only adds obfuscation
>> If a patch becomes too big for review, then the *task* it performs
>> needs to be split into logically separate steps, rather than
>> splitting code changes into mutiple patch files that essentially
>> all save the same purpose and are similar and closely related in
>> their content.  Such *logical* splitting can become tricky, but
>> i doubt that is needed here.

> Sorry about that. While I've been programming for a couple of decades
> the email based patch workflow is very new to me and different to all
> my experience to date.

Sure, when working for hire, it also happened to me that i had to cope
with dubious methodologies like patch series (which encourage sloppy
design, sloppy implementation, very sloppy code review or no code review
at all) or feature branches (which encourage bit rot, logical conflicts,
integration mishaps, and sloppy and delayed merging and testing on top
of the problems of patch series).  But even if most of the biomass in a
state comes in the form of cows, that doesn't imply cows are particularly
intelligent and everybody should strive for bovinity.  ;-)

> If I post any more changes I'll be sure to follow your suggestions.

Thanks!

>> I'm not yet completely sure what the best way forward is.
>> Probably something like:
>>  1. Identifying all places that do string lookup in tables or lists.
>>     Some can probably be left alone:
>>       arch.c, att.c, cgi.c, chars.c, eqn.c, mansearch.c,
>>       mdoc_argnames in mdoc.c, msec.c, st.c, regtab in roff.c,
>>       tbl_layout.c, tbl_opts.c
>>     But the others should probably be dealt with in a unified manner:
>>       reqtab, mdocmac, manmac  on the one hand
>>       strtab, rentab, pretab, xmbtab  on the other hand
>>       Not sure yet whether xtab can be included into xmbtab.
>>  2. Design and implement a common handling for these using ohash
>>     that is as simple as possible.

> When I have a moment I'll take a look through these and try to come up
> with an approach/plan and post it for feedback before making code changes.

After consulting my pillow, i believe the following may work:

 1st step:
   Change strtab, rentab, xmbtab, roff_freestr, roff_setstrn,
   roff_getstrn, and roff_strdup to use ohash.
   That's very close to what you did in your first two patches
   but avoids any additional complication.  It's one logical
   step because it changes the way struct roffkv is used.

That likely results in the main performance gain.  Once that is
committed, we can optionally consider further steps, one by one,
for example:

 2nd step:
   Implement pretab using the facilities from step 1
   similar to what you did in your third patch.
   That's independent because predefs[] is only used by
   roff_getstrn and does not use struct roffkv yet.

 3rd step:
   Maybe unify reqtab, mdocmac, and manmac in order to resolve
   the two ugly roff_name[] loops in roff_getstrn.
   That's likely independent, too, because it won't use
   struct roffkv even in the final state of the code.

Of course, i cannot guarantee that will work - to provide such a
guarantee, i would have to do the work myself.  ;-)
But it looks promising to me, well in line with your goal,
and similar enough to your original implementation.

By the way, *if* this ends up in an amount of code that is non-trivial
in the sense of Copyright, are you OK with having

 * Copyright (c) 2023 Wesley Moore 

Added to the relevant files and to the LICENSE file in the portable
mandoc root directory?

Yours,
  Ingo
--
 To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv

Re: Improve performance of makewhatis

2023-11-15T08:29:40Z

Hi Ingo,

Thanks for the detailed reply.

On Wed, 15 Nov 2023, at 11:26 AM, Ingo Schwarze wrote:
>
> Can you confirm that these statements are accurate?
>
>  * Chimera Linux includes mandoc in its base system since 2021 Oct 30.

That's correct.

>  * There were no official nor unofficial packages of mandoc
>    for Chimera Linux before that.

Yes that's right.

>  * It uses the mandoc(1) parsing and formatting engine by default
>    since the same date.

Yes (pretty much). It looks like it was added a day later
https://github.com/chimera-linux/cports/commit/3c493733f7f0df345816cc69e273fca32fe70830

>  * The default apropos(1) program is mandoc since the same date.
>  * The default man(1) program is mandoc since the same date.
>  * Chimera Linux does not provide an officially supported
>    alternative for the man(1) program that users can select.
>  * Chimera Linux does not publish its manual pages on the web
>    on a site similar to https://manpages.debian.org/
>    or https://man.openbsd.org/ or https://man.voidlinux.org/ .

Yes all of above are correct.

>> makewhatis -Tutf8 is run to rebuild the index
>> whenever a package containing man pages is installed or updated.

>> I noticed this took multiple seconds even on a Ryzen 9 7950X system.
>
> That may not be a major problem because installing an additional
> package is not usually a fast operation (it usually requires both
> network access and expensive database locking and management outside
> the domain of manual pages) and it isn't done often either.
>
> So i expect most users won't feel offended if installing new
> software takes a few seconds.

Perhaps that is the case but it bothered me enough to go to the effort of writing these patches...

[snip]

> So only 10-15% of the time is spent in roff_getstrn(), which cannot
> possibly result in the 25-35% overall speedup you report, so something
> looks fishy here.  Then again, even 25-35% is not a large gain.

Those are the numbers I saw from my basic benchmarking measuring the wall clock time of makewhatis -Tutf -n across three runs and averaging the results for each machine. The raw data from my notes is available here:

https://gist.github.com/wezm/4043bde0dc2974c88b7706c60b58f900

[snip]

> One other thing.  I hate patch series, don't ever send them.
> They are usually a symptom of poor development methodology,
> and usually i reject them without even inspecting them.
> Here, i made an exception because your idea seems to have some merit.
>
> Every patch needs to
>  * perform one well-defined task and
>  * make the code better even if nothing else is ever committed
>    building on it.
>
> Splitting a patch like you did it here only adds obfuscation
> If a patch becomes too big for review, then the *task* it performs
> needs to be split into logically separate steps, rather than
> splitting code changes into mutiple patch files that essentially
> all save the same purpose and are similar and closely related in
> their content.  Such *logical* splitting can become tricky, but
> i doubt that is needed here.

Sorry about that. While I've been programming for a couple of decades the email based patch workflow is very new to me and different to all my experience to date. If I post any more changes I'll be sure to follow your suggestions.

> I'm not yet completely sure what the best way forward is.
> Probably something like:
>  1. Identifying all places that do string lookup in tables or lists.
>     Some can probably be left alone:
>       arch.c, att.c, cgi.c, chars.c, eqn.c, mansearch.c,
>       mdoc_argnames in mdoc.c, msec.c, st.c, regtab in roff.c,
>       tbl_layout.c, tbl_opts.c
>     But the others should probably be dealt with in a unified manner:
>       reqtab, mdocmac, manmac  on the one hand
>       strtab, rentab, pretab, xmbtab  on the other hand
>       Not sure yet whether xtab can be included into xmbtab.
>  2. Design and implement a common handling for these using ohash
>     that is as simple as possible.

When I have a moment I'll take a look through these and try to come up with an approach/plan and post it for feedback before making code changes.

> Not sure yet whether these are multiple steps or a single step,
> but doing one separate step for every *tab is definitely not the
> way.
>
> I guess that's enough for today...
>
> Yours,
>   Ingo

Wes
--
 To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv

Re: Improve performance of makewhatis

2023-11-15T01:26:10Z

Hi Wesley,

Wesley Moore wrote on Mon, Nov 13, 2023 at 11:42:42AM +1000:

> In this patch set I aimed to improve the performance of makewhatis.
> On my Chimera Linux systems

Interesting.  I just added Chimera Linux to these two pages:

  https://mandoc.bsd.lv/ports.html
  https://mandoc.bsd.lv/porthistory.html

Can you confirm that these statements are accurate?

 * Chimera Linux includes mandoc in its base system since 2021 Oct 30.
 * There were no official nor unofficial packages of mandoc
   for Chimera Linux before that.
 * It uses the mandoc(1) parsing and formatting engine by default
   since the same date.
 * The default apropos(1) program is mandoc since the same date.
 * The default man(1) program is mandoc since the same date.
 * Chimera Linux does not provide an officially supported
   alternative for the man(1) program that users can select.
 * Chimera Linux does not publish its manual pages on the web
   on a site similar to https://manpages.debian.org/
   or https://man.openbsd.org/ or https://man.voidlinux.org/ .

> makewhatis -Tutf8 is run to rebuild the index
> whenever a package containing man pages is installed or updated.

That's not particularly smart.  Rebuilding the whole database from
scratch is unavoidably slow.  The makewhatis(8) options -d and -u
are specifically provided to serve the needs of package management
systems.  Specifically, if a package management systems cares about
performance, it can use -d to add just those manual page files it
just installed to the existing database, rather than rebuilding the
whole database from scratch, which obviously implies reading every
manual page file in the whole system from disk and parsing all those
files.

> I noticed this took multiple seconds even on a Ryzen 9 7950X system.

That may not be a major problem because installing an additional
package is not usually a fast operation (it usually requires both
network access and expensive database locking and management outside
the domain of manual pages) and it isn't done often either.

So i expect most users won't feel offended if installing new
software takes a few seconds.

> Profiling showed a lot of time spent in the loops of roff_getstrn.

Hmm, even though i'm not convinced your usecase is particularly
important, that's interesting, and i wouldn't have expected it.
It may be worth looking into that.

> To speed this up I made use of hash tables. I tested the changes on
> several different systems:
> 
> - OpenBSD amd64 in VM
> - Linux on Raspberry Pi 4
> - Linux on Ryzen 9 7950X
> 
> In each case the wall clock run time for makewhatis -Tutf8 -n showed
> an reduction of between 25% and 35%.

For makewhatis(8), i would say that's such a small gain that it likely
isn't worth much complication.  Then again, it also affects mandoc(1),
so maybe looking into it is worthwhile after all.

Here is what i found with gprof(1) on OpenBSD-current:

share local
 100%  100% main -> mandocdb
  92%   95% (mandocdb) -> mpages_merge
  66%   67% (mpages_merge) -> mparse_readfd -> mparse_buf_r
  32%   26% (mparse_buf_r) -> roff_parseln
  19%   21% MANY -> free -> ofree
16.3%  2.5% (mparse_buf_r) -> mdoc_parseln
 6.8% 18.4% (mparse_buf_r) -> man_parseln
15.6% 14.7% MANY -> roff_word_alloc
14.9% 17.0% (_libc_malloc etc) -> omalloc
14.7% 10.9% (mpages_merge) -> mparse_reset
14.6% 10.6% (roff_parse, roff_expand) -> roff_getstrn

The "share" column refers to /usr/share/man/, i.e. the OpenBSD base
system manuals, a good example of a large, mdoc(7)-heavy collection
of manual pages.  The "local" column refers to /usr/local/man/,
i.e. the optional third-party software i happen to have installed,
which contains mostly man(7) manual pages.

On each line, the two numbers give the time spent in the respective
function including functions called from it, relative to the time
spent in main() including all functions called from there.
The last name on each line is the respective function.
The names before the arrows specify from where it is usually called.

So only 10-15% of the time is spent in roff_getstrn(), which cannot
possibly result in the 25-35% overall speedup you report, so something
looks fishy here.  Then again, even 25-35% is not a large gain.

In general, improving the structure is much more important than
improving the performance because the code has grown historically,
becoming more and more complicated and less and less uniform in
structure, i.e. harder and harder to maintain.  In particular, such
a small gain is clearly not worth further complication.

On the other hand, the critical code path for the optimization
potential you found is:  roff_parse -> roff_getstrn
In that area, code structure is not particularly good.
Parsing for macros and strings is done in various different ways:

 1. The code path roff_parse -> roff_getstrn looks for
    user-defined and renamed macros using r->strtab and r->rentab.
 2. The code path roff_expand -> roff_getstrn looks for
    predefined strings using a linear search in predefs[].
 3. The code paths roff_block(ROFF_am) -> roff_getstrn and
    roff_rn -> roff_getstrn look for standard high-level macros
    directly in the roff_name[] array.
 4. The functions mdoc_pmacro() in mdoc.c and man_pmacro() in man.c
    look for standard high-level macros using local mdoc->mdocmac
    and man->manmac hash tables with roffhash_find().

So it is likely this can be better structured with less duplication
and not gratuitiously using different data structures, while also
using hash tables for macro and string lookup throughout.

> The one caveat of this change is the roff/while/into regression test is
> failing with my changes. I've tried to work out why but I don't know
> roff well enough understand the intent of the test nor have I been able
> to clearly determine why my changes affect it. I was hoping to get some
> help with this.

The roff/while/into test uses .de in a very strange way, and .de
internally uses the string storage, so that's how it may be related,
but i would have to look at the details.  Then again, studying that
is not urgent because your patch is definitely not ready for commit.
It adds complication rather than removing some.  In particular:

 * Do not touch roff_int.h.  It is only intended for internal API
   features needed by more than one parser, see mandoc_headers(3)
   for details, but what you are doing here is local to the roff
   parser (roff.c).
 * struct roff_entry is essentially the same as struct roffkv.
   There should not be two data structures for the same purpose.
 * roff_free_entry() is trivial and only used at one place,
   so it should be inlined.
 * Renaming roff_setstrn to roff_setentry causes quite some churn,
   so don't do that.  If that means using ohash for xmbtab, too,
   so be it.
 * roff_strhash_alloc is also next to trivial.  Maybe it can be
   inlined - maybe not because it is called from a few more
   places.  If we introduce it, roffhash_alloc should also
   use it.
 * The duplication between roffhash_free() and roff_strhash_free()
   and between roffhash_find() and roff_strhash_find() is rather
   ugly.  Not yet sure what to do about that.
 * roff_strhash_insert() is trivial and called only from one place.

Some ideas are also good:

 * Using struct ohash *strtab directly in struct roff is good,
   just like it is already done for struct ohash *reqtab.
 * Introducing r->pretab and clearing it in roff_free()
   rather than in the badly named roff_free1() - just like
   r->reqtab - is good.

One other thing.  I hate patch series, don't ever send them.
They are usually a symptom of poor development methodology,
and usually i reject them without even inspecting them.
Here, i made an exception because your idea seems to have some merit.

Every patch needs to
 * perform one well-defined task and
 * make the code better even if nothing else is ever committed
   building on it.

Splitting a patch like you did it here only adds obfuscation
If a patch becomes too big for review, then the *task* it performs
needs to be split into logically separate steps, rather than
splitting code changes into mutiple patch files that essentially
all save the same purpose and are similar and closely related in
their content.  Such *logical* splitting can become tricky, but
i doubt that is needed here.

I'm not yet completely sure what the best way forward is.
Probably something like:
 1. Identifying all places that do string lookup in tables or lists.
    Some can probably be left alone:
      arch.c, att.c, cgi.c, chars.c, eqn.c, mansearch.c,
      mdoc_argnames in mdoc.c, msec.c, st.c, regtab in roff.c,
      tbl_layout.c, tbl_opts.c
    But the others should probably be dealt with in a unified manner:
      reqtab, mdocmac, manmac  on the one hand
      strtab, rentab, pretab, xmbtab  on the other hand
      Not sure yet whether xtab can be included into xmbtab.
 2. Design and implement a common handling for these using ohash
    that is as simple as possible.

Not sure yet whether these are multiple steps or a single step,
but doing one separate step for every *tab is definitely not the
way.

I guess that's enough for today...

Yours,
  Ingo
--
 To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv

[PATCH 3/3] Use ohash for predefined strings

2023-11-13T01:44:05Z

---
 roff.c | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/roff.c b/roff.c
index 74afaab..ff85a63 100644
--- a/roff.c
+++ b/roff.c
@@ -113,6 +113,7 @@ struct	roff {
 	struct mctx	*mstack; /* stack of macro contexts */
 	int		*rstack; /* stack of inverted `ie' values */
 	struct ohash	*reqtab; /* request lookup table */
+	struct ohash	*pretab; /* predefined strings table */
 	struct roffreg	*regtab; /* number registers */
 	struct ohash	*strtab; /* user-defined strings & macros */
 	struct ohash	*rentab; /* renamed strings & macros */
@@ -241,6 +242,7 @@ static	int		 roff_parse_comment(struct roff *, struct buf *,
 				int, int, char);
 static	int		 roff_parsetext(struct roff *, struct buf *,
 				int, int *);
+static	struct ohash	*roff_pretab_alloc(void);
 static	int		 roff_renamed(ROFF_ARGS);
 static	int		 roff_req_or_macro(ROFF_ARGS);
 static	int		 roff_return(ROFF_ARGS);
@@ -700,6 +702,21 @@ roffhash_find(struct ohash *htab, const char *name, size_t sz)
 
 /* --- key-value table ---------------------------------------------------- */
 
+struct ohash *
+roff_pretab_alloc(void)
+{
+	int i;
+	struct ohash *htab;
+	struct predef pre;
+
+	htab = roff_strhash_alloc();
+	for (i = 0; i < PREDEFS_MAX; i++) {
+		pre = predefs[i];
+		roff_setentry(htab, pre.name, strlen(pre.name), pre.str, strlen(pre.str), 0);
+	}
+	return htab;
+}
+
 struct ohash *
 roff_strhash_alloc(void)
 {
@@ -852,6 +869,7 @@ roff_free(struct roff *r)
 	int		 i;
 
 	roff_free1(r);
+	roff_strhash_free(r->pretab);
 	for (i = 0; i < r->mstacksz; i++)
 		free(r->mstack[i].argv);
 	free(r->mstack);
@@ -866,6 +884,7 @@ roff_alloc(int options)
 
 	r = mandoc_calloc(1, sizeof(struct roff));
 	r->reqtab = roffhash_alloc(0, ROFF_RENAMED);
+	r->pretab = roff_pretab_alloc();
 	r->strtab = roff_strhash_alloc();
 	r->rentab = roff_strhash_alloc();
 	r->options = options | MPARSE_COMMENT;
@@ -4318,8 +4337,8 @@ static const char *
 roff_getstrn(struct roff *r, const char *name, size_t len,
     int *deftype)
 {
-	int			 found, i;
 	const struct roff_entry	*entry;
+	int			 found;
 	enum roff_tok		 tok;
 
 	found = 0;
@@ -4341,16 +4360,13 @@ roff_getstrn(struct roff *r, const char *name, size_t len,
 			found = 1;
 		}
 	}
-	for (i = 0; i < PREDEFS_MAX; i++) {
-		if (strncmp(name, predefs[i].name, len) != 0 ||
-		    predefs[i].name[len] != '\0')
-			continue;
+	entry = roff_strhash_find(r->pretab, name, len);
+	if (entry != NULL) {
 		if (*deftype & ROFFDEF_PRE) {
 			*deftype = ROFFDEF_PRE;
-			return predefs[i].str;
+			return entry->val.p;
 		} else {
 			found = 1;
-			break;
 		}
 	}
 	if (r->man->meta.macroset != MACROSET_MAN) {
-- 
2.42.1

--
 To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv

[PATCH 2/3] Use ohash for rentab

2023-11-13T01:44:02Z

---
 roff.c | 50 ++++++++++++++++++++++++--------------------------
 1 file changed, 24 insertions(+), 26 deletions(-)

diff --git a/roff.c b/roff.c
index 4e5bbf9..74afaab 100644
--- a/roff.c
+++ b/roff.c
@@ -114,8 +114,8 @@ struct	roff {
 	int		*rstack; /* stack of inverted `ie' values */
 	struct ohash	*reqtab; /* request lookup table */
 	struct roffreg	*regtab; /* number registers */
-	struct roffkv	*rentab; /* renamed strings & macros */
 	struct ohash	*strtab; /* user-defined strings & macros */
+	struct ohash	*rentab; /* renamed strings & macros */
 	struct roffkv	*xmbtab; /* multi-byte trans table (`tr') */
 	struct roffstr	*xtab; /* single-byte trans table (`tr') */
 	const char	*current_string; /* value of last called user macro */
@@ -818,10 +818,10 @@ roff_free1(struct roff *r)
 	r->regtab = NULL;
 
 	roff_strhash_free(r->strtab);
-	roff_freestr(r->rentab);
+	roff_strhash_free(r->rentab);
 	roff_freestr(r->xmbtab);
-	r->strtab = NULL;
-	r->rentab = r->xmbtab = NULL;
+	r->strtab = r->rentab = NULL;
+	r->xmbtab = NULL;
 
 	if (r->xtab)
 		for (i = 0; i < 128; i++)
@@ -835,6 +835,7 @@ roff_reset(struct roff *r)
 {
 	roff_free1(r);
 	r->strtab = roff_strhash_alloc();
+	r->rentab = roff_strhash_alloc();
 	r->options |= MPARSE_COMMENT;
 	r->format = r->options & (MPARSE_MDOC | MPARSE_MAN);
 	r->control = '\0';
@@ -866,6 +867,7 @@ roff_alloc(int options)
 	r = mandoc_calloc(1, sizeof(struct roff));
 	r->reqtab = roffhash_alloc(0, ROFF_RENAMED);
 	r->strtab = roff_strhash_alloc();
+	r->rentab = roff_strhash_alloc();
 	r->options = options | MPARSE_COMMENT;
 	r->format = options & (MPARSE_MDOC | MPARSE_MAN);
 	r->mstackpos = -1;
@@ -2084,8 +2086,8 @@ roff_parse(struct roff *r, char *buf, int *pos, int ln, int ppos)
 		*pos = cp - buf;
 	else if (deftype == ROFFDEF_UNDEF) {
 		/* Using an undefined macro defines it to be empty. */
-		roff_setstrn(&r->rentab, mac, maclen, NULL, 0, 0);
 		roff_setentry(r->strtab, mac, maclen, "", 0, 0);
+		roff_setentry(r->rentab, mac, maclen, NULL, 0, 0);
 	}
 	return t;
 }
@@ -2252,8 +2254,8 @@ roff_block(ROFF_ARGS)
 	 */
 
 	if (tok == ROFF_de || tok == ROFF_dei) {
-		roff_setstrn(&r->rentab, name, namesz, NULL, 0, 0);
 		roff_setentry(r->strtab, name, namesz, "", 0, 0);
+		roff_setentry(r->rentab, name, namesz, NULL, 0, 0);
 	} else if (tok == ROFF_am || tok == ROFF_ami) {
 		deftype = ROFFDEF_ANY;
 		value = roff_getstrn(r, iname, namesz, &deftype);
@@ -2265,13 +2267,13 @@ roff_block(ROFF_ARGS)
 		case ROFFDEF_REN: /* call original standard macro. */
 			csz = mandoc_asprintf(&call, ".%.*s \\$* \\\"\n",
 			    (int)strlen(value), value);
-			roff_setstrn(&r->rentab, name, namesz, NULL, 0, 0);
 			roff_setentry(r->strtab, name, namesz, call, csz, 0);
+			roff_setentry(r->rentab, name, namesz, NULL, 0, 0);
 			free(call);
 			break;
 		case ROFFDEF_STD:  /* rename and call standard macro. */
 			rsz = mandoc_asprintf(&rname, "__%s_renamed", name);
-			roff_setstrn(&r->rentab, rname, rsz, name, namesz, 0);
+			roff_setentry(r->rentab, rname, rsz, name, namesz, 0);
 			csz = mandoc_asprintf(&call, ".%.*s \\$* \\\"\n",
 			    (int)rsz, rname);
 			roff_setentry(r->strtab, name, namesz, call, csz, 0);
@@ -2852,7 +2854,7 @@ roff_ds(ROFF_ARGS)
 	/* The rest is the value. */
 	roff_setentry(r->strtab, name, namesz, string, strlen(string),
 	    ROFF_as == tok);
-	roff_setstrn(&r->rentab, name, namesz, NULL, 0, 0);
+	roff_setentry(r->rentab, name, namesz, NULL, 0, 0);
 	return ROFF_IGN;
 }
 
@@ -3281,8 +3283,8 @@ roff_rm(ROFF_ARGS)
 	while (*cp != '\0') {
 		name = cp;
 		namesz = roff_getname(r, &cp, ln, (int)(cp - buf->buf));
-		roff_setstrn(&r->rentab, name, namesz, NULL, 0, 0);
 		roff_setentry(r->strtab, name, namesz, NULL, 0, 0);
+		roff_setentry(r->rentab, name, namesz, NULL, 0, 0);
 		if (name[namesz] == '\\' || name[namesz] == '\t')
 			break;
 	}
@@ -3628,8 +3630,8 @@ roff_als(ROFF_ARGS)
 
 	valsz = mandoc_asprintf(&value, ".%.*s \\$@\\\"\n",
 	    (int)oldsz, oldn);
-	roff_setstrn(&r->rentab, newn, newsz, NULL, 0, 0);
 	roff_setentry(r->strtab, newn, newsz, value, valsz, 0);
+	roff_setentry(r->rentab, newn, newsz, NULL, 0, 0);
 	free(value);
 	return ROFF_IGN;
 }
@@ -3912,26 +3914,26 @@ roff_rn(ROFF_ARGS)
 	value = roff_getstrn(r, oldn, oldsz, &deftype);
 	switch (deftype) {
 	case ROFFDEF_USER:
-		roff_setstrn(&r->rentab, newn, newsz, NULL, 0, 0);
 		roff_setentry(r->strtab, newn, newsz, value, strlen(value), 0);
 		roff_setentry(r->strtab, oldn, oldsz, NULL, 0, 0);
+		roff_setentry(r->rentab, newn, newsz, NULL, 0, 0);
 		break;
 	case ROFFDEF_PRE:
-		roff_setstrn(&r->rentab, newn, newsz, NULL, 0, 0);
 		roff_setentry(r->strtab, newn, newsz, value, strlen(value), 0);
+		roff_setentry(r->rentab, newn, newsz, NULL, 0, 0);
 		break;
 	case ROFFDEF_REN:
-		roff_setstrn(&r->rentab, newn, newsz, value, strlen(value), 0);
-		roff_setstrn(&r->rentab, oldn, oldsz, NULL, 0, 0);
+		roff_setentry(r->rentab, newn, newsz, value, strlen(value), 0);
+		roff_setentry(r->rentab, oldn, oldsz, NULL, 0, 0);
 		roff_setentry(r->strtab, newn, newsz, NULL, 0, 0);
 		break;
 	case ROFFDEF_STD:
-		roff_setstrn(&r->rentab, newn, newsz, oldn, oldsz, 0);
+		roff_setentry(r->rentab, newn, newsz, oldn, oldsz, 0);
 		roff_setentry(r->strtab, newn, newsz, NULL, 0, 0);
 		break;
 	default:
-		roff_setstrn(&r->rentab, newn, newsz, NULL, 0, 0);
 		roff_setentry(r->strtab, newn, newsz, NULL, 0, 0);
+		roff_setentry(r->rentab, newn, newsz, NULL, 0, 0);
 		break;
 	}
 	return ROFF_IGN;
@@ -4167,7 +4169,7 @@ roff_setstr(struct roff *r, const char *name, const char *string,
 	namesz = strlen(name);
 	roff_setentry(r->strtab, name, namesz, string,
 	    string ? strlen(string) : 0, append);
-	roff_setstrn(&r->rentab, name, namesz, NULL, 0, 0);
+	roff_setentry(r->rentab, name, namesz, NULL, 0, 0);
 }
 
 static void
@@ -4316,7 +4318,6 @@ static const char *
 roff_getstrn(struct roff *r, const char *name, size_t len,
     int *deftype)
 {
-	const struct roffkv	*n;
 	int			 found, i;
 	const struct roff_entry	*entry;
 	enum roff_tok		 tok;
@@ -4331,16 +4332,13 @@ roff_getstrn(struct roff *r, const char *name, size_t len,
 			found = 1;
 		}
 	}
-	for (n = r->rentab; n != NULL; n = n->next) {
-		if (strncmp(name, n->key.p, len) != 0 ||
-		    n->key.p[len] != '\0' || n->val.p == NULL)
-			continue;
+	entry = roff_strhash_find(r->rentab, name, len);
+	if (entry != NULL && entry->val.p != NULL) {
 		if (*deftype & ROFFDEF_REN) {
 			*deftype = ROFFDEF_REN;
-			return n->val.p;
+			return entry->val.p;
 		} else {
 			found = 1;
-			break;
 		}
 	}
 	for (i = 0; i < PREDEFS_MAX; i++) {
@@ -4396,8 +4394,8 @@ roff_getstrn(struct roff *r, const char *name, size_t len,
 
 		/* Using an undefined string defines it to be empty. */
 
-		roff_setstrn(&r->rentab, name, len, NULL, 0, 0);
 		roff_setentry(r->strtab, name, len, "", 0, 0);
+		roff_setentry(r->rentab, name, len, NULL, 0, 0);
 	}
 
 	*deftype = 0;
-- 
2.42.1

--
 To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv

[PATCH 1/3] Use ohash for strtab

2023-11-13T01:44:01Z

---
 roff.c     | 190 +++++++++++++++++++++++++++++++++++++++++++++--------
 roff_int.h |   6 ++
 2 files changed, 170 insertions(+), 26 deletions(-)

diff --git a/roff.c b/roff.c
index bdb0210..4e5bbf9 100644
--- a/roff.c
+++ b/roff.c
@@ -71,6 +71,14 @@ struct	roffkv {
 	struct roffkv	*next; /* next in list */
 };
 
+/*
+ * A key-value string pair used as an entry in an ohash.
+ */
+struct	roff_entry {
+	struct roffstr	 val;
+	char		 key[];
+};
+
 /*
  * A single number register as part of a singly-linked list.
  */
@@ -106,8 +114,8 @@ struct	roff {
 	int		*rstack; /* stack of inverted `ie' values */
 	struct ohash	*reqtab; /* request lookup table */
 	struct roffreg	*regtab; /* number registers */
-	struct roffkv	*strtab; /* user-defined strings & macros */
 	struct roffkv	*rentab; /* renamed strings & macros */
+	struct ohash	*strtab; /* user-defined strings & macros */
 	struct roffkv	*xmbtab; /* multi-byte trans table (`tr') */
 	struct roffstr	*xtab; /* single-byte trans table (`tr') */
 	const char	*current_string; /* value of last called user macro */
@@ -204,6 +212,7 @@ static	void		 roff_expand_patch(struct buf *, int,
 static	void		 roff_free1(struct roff *);
 static	void		 roff_freereg(struct roffreg *);
 static	void		 roff_freestr(struct roffkv *);
+static	void		 roff_free_entry(struct roff_entry *entry);
 static	size_t		 roff_getname(struct roff *, char **, int, int);
 static	int		 roff_getnum(const char *, int *, int *, int);
 static	int		 roff_getop(const char *, int *, char *);
@@ -244,6 +253,8 @@ static	void		 roff_setstr(struct roff *,
 				const char *, const char *, int);
 static	void		 roff_setstrn(struct roffkv **, const char *,
 				size_t, const char *, size_t, int);
+static	void		 roff_setentry(struct ohash *r, const char *name,
+				size_t namesz, const char *string, size_t stringsz, int append);
 static	int		 roff_shift(ROFF_ARGS);
 static	int		 roff_so(ROFF_ARGS);
 static	int		 roff_tr(ROFF_ARGS);
@@ -687,6 +698,55 @@ roffhash_find(struct ohash *htab, const char *name, size_t sz)
 	return req == NULL ? TOKEN_NONE : req->tok;
 }
 
+/* --- key-value table ---------------------------------------------------- */
+
+struct ohash *
+roff_strhash_alloc(void)
+{
+	struct ohash	*htab;
+
+	htab = mandoc_malloc(sizeof(*htab));
+	mandoc_ohash_init(htab, 6, offsetof(struct roff_entry, key));
+	return htab;
+}
+
+void
+roff_strhash_free(struct ohash *htab)
+{
+	struct roff_entry	*entry;
+	unsigned int	 slot;
+
+	if (htab == NULL)
+		return;
+	for (entry = ohash_first(htab, &slot); entry != NULL;
+	     entry = ohash_next(htab, &slot))
+		roff_free_entry(entry);
+	ohash_delete(htab);
+	free(htab);
+}
+
+struct roff_entry	*
+roff_strhash_find(struct ohash *htab, const char *name, size_t sz)
+{
+	struct roff_entry	*entry;
+	const char	*end;
+
+	if (sz) {
+		end = name + sz;
+		entry = ohash_find(htab, ohash_qlookupi(htab, name, &end));
+	} else
+		entry = ohash_find(htab, ohash_qlookup(htab, name));
+	return entry;
+}
+
+void
+roff_strhash_insert(struct ohash *htab, struct roff_entry *entry) {
+	unsigned int slot;
+
+	slot = ohash_qlookup(htab, entry->key);
+	ohash_insert(htab, slot, entry);
+}
+
 /* --- stack of request blocks -------------------------------------------- */
 
 /*
@@ -757,10 +817,11 @@ roff_free1(struct roff *r)
 	roff_freereg(r->regtab);
 	r->regtab = NULL;
 
-	roff_freestr(r->strtab);
+	roff_strhash_free(r->strtab);
 	roff_freestr(r->rentab);
 	roff_freestr(r->xmbtab);
-	r->strtab = r->rentab = r->xmbtab = NULL;
+	r->strtab = NULL;
+	r->rentab = r->xmbtab = NULL;
 
 	if (r->xtab)
 		for (i = 0; i < 128; i++)
@@ -773,6 +834,7 @@ void
 roff_reset(struct roff *r)
 {
 	roff_free1(r);
+	r->strtab = roff_strhash_alloc();
 	r->options |= MPARSE_COMMENT;
 	r->format = r->options & (MPARSE_MDOC | MPARSE_MAN);
 	r->control = '\0';
@@ -803,6 +865,7 @@ roff_alloc(int options)
 
 	r = mandoc_calloc(1, sizeof(struct roff));
 	r->reqtab = roffhash_alloc(0, ROFF_RENAMED);
+	r->strtab = roff_strhash_alloc();
 	r->options = options | MPARSE_COMMENT;
 	r->format = options & (MPARSE_MDOC | MPARSE_MAN);
 	r->mstackpos = -1;
@@ -1458,7 +1521,7 @@ roff_expand(struct roff *r, struct buf *buf, int ln, int pos, char ec)
 			if (iendarg - iarg == 2 &&
 			    buf->buf[iarg] == '.' &&
 			    buf->buf[iarg + 1] == 'T') {
-				roff_setstrn(&r->strtab, ".T", 2, NULL, 0, 0);
+				roff_setentry(r->strtab, ".T", 2, NULL, 0, 0);
 				pos = iend;
 				continue;
 			}
@@ -2021,8 +2084,8 @@ roff_parse(struct roff *r, char *buf, int *pos, int ln, int ppos)
 		*pos = cp - buf;
 	else if (deftype == ROFFDEF_UNDEF) {
 		/* Using an undefined macro defines it to be empty. */
-		roff_setstrn(&r->strtab, mac, maclen, "", 0, 0);
 		roff_setstrn(&r->rentab, mac, maclen, NULL, 0, 0);
+		roff_setentry(r->strtab, mac, maclen, "", 0, 0);
 	}
 	return t;
 }
@@ -2189,21 +2252,21 @@ roff_block(ROFF_ARGS)
 	 */
 
 	if (tok == ROFF_de || tok == ROFF_dei) {
-		roff_setstrn(&r->strtab, name, namesz, "", 0, 0);
 		roff_setstrn(&r->rentab, name, namesz, NULL, 0, 0);
+		roff_setentry(r->strtab, name, namesz, "", 0, 0);
 	} else if (tok == ROFF_am || tok == ROFF_ami) {
 		deftype = ROFFDEF_ANY;
 		value = roff_getstrn(r, iname, namesz, &deftype);
 		switch (deftype) {  /* Before appending, ... */
 		case ROFFDEF_PRE: /* copy predefined to user-defined. */
-			roff_setstrn(&r->strtab, name, namesz,
+			roff_setentry(r->strtab, name, namesz,
 			    value, strlen(value), 0);
 			break;
 		case ROFFDEF_REN: /* call original standard macro. */
 			csz = mandoc_asprintf(&call, ".%.*s \\$* \\\"\n",
 			    (int)strlen(value), value);
-			roff_setstrn(&r->strtab, name, namesz, call, csz, 0);
 			roff_setstrn(&r->rentab, name, namesz, NULL, 0, 0);
+			roff_setentry(r->strtab, name, namesz, call, csz, 0);
 			free(call);
 			break;
 		case ROFFDEF_STD:  /* rename and call standard macro. */
@@ -2211,7 +2274,7 @@ roff_block(ROFF_ARGS)
 			roff_setstrn(&r->rentab, rname, rsz, name, namesz, 0);
 			csz = mandoc_asprintf(&call, ".%.*s \\$* \\\"\n",
 			    (int)rsz, rname);
-			roff_setstrn(&r->strtab, name, namesz, call, csz, 0);
+			roff_setentry(r->strtab, name, namesz, call, csz, 0);
 			free(call);
 			free(rname);
 			break;
@@ -2787,7 +2850,7 @@ roff_ds(ROFF_ARGS)
 		string++;
 
 	/* The rest is the value. */
-	roff_setstrn(&r->strtab, name, namesz, string, strlen(string),
+	roff_setentry(r->strtab, name, namesz, string, strlen(string),
 	    ROFF_as == tok);
 	roff_setstrn(&r->rentab, name, namesz, NULL, 0, 0);
 	return ROFF_IGN;
@@ -3218,8 +3281,8 @@ roff_rm(ROFF_ARGS)
 	while (*cp != '\0') {
 		name = cp;
 		namesz = roff_getname(r, &cp, ln, (int)(cp - buf->buf));
-		roff_setstrn(&r->strtab, name, namesz, NULL, 0, 0);
 		roff_setstrn(&r->rentab, name, namesz, NULL, 0, 0);
+		roff_setentry(r->strtab, name, namesz, NULL, 0, 0);
 		if (name[namesz] == '\\' || name[namesz] == '\t')
 			break;
 	}
@@ -3565,8 +3628,8 @@ roff_als(ROFF_ARGS)
 
 	valsz = mandoc_asprintf(&value, ".%.*s \\$@\\\"\n",
 	    (int)oldsz, oldn);
-	roff_setstrn(&r->strtab, newn, newsz, value, valsz, 0);
 	roff_setstrn(&r->rentab, newn, newsz, NULL, 0, 0);
+	roff_setentry(r->strtab, newn, newsz, value, valsz, 0);
 	free(value);
 	return ROFF_IGN;
 }
@@ -3849,26 +3912,26 @@ roff_rn(ROFF_ARGS)
 	value = roff_getstrn(r, oldn, oldsz, &deftype);
 	switch (deftype) {
 	case ROFFDEF_USER:
-		roff_setstrn(&r->strtab, newn, newsz, value, strlen(value), 0);
-		roff_setstrn(&r->strtab, oldn, oldsz, NULL, 0, 0);
 		roff_setstrn(&r->rentab, newn, newsz, NULL, 0, 0);
+		roff_setentry(r->strtab, newn, newsz, value, strlen(value), 0);
+		roff_setentry(r->strtab, oldn, oldsz, NULL, 0, 0);
 		break;
 	case ROFFDEF_PRE:
-		roff_setstrn(&r->strtab, newn, newsz, value, strlen(value), 0);
 		roff_setstrn(&r->rentab, newn, newsz, NULL, 0, 0);
+		roff_setentry(r->strtab, newn, newsz, value, strlen(value), 0);
 		break;
 	case ROFFDEF_REN:
 		roff_setstrn(&r->rentab, newn, newsz, value, strlen(value), 0);
 		roff_setstrn(&r->rentab, oldn, oldsz, NULL, 0, 0);
-		roff_setstrn(&r->strtab, newn, newsz, NULL, 0, 0);
+		roff_setentry(r->strtab, newn, newsz, NULL, 0, 0);
 		break;
 	case ROFFDEF_STD:
 		roff_setstrn(&r->rentab, newn, newsz, oldn, oldsz, 0);
-		roff_setstrn(&r->strtab, newn, newsz, NULL, 0, 0);
+		roff_setentry(r->strtab, newn, newsz, NULL, 0, 0);
 		break;
 	default:
-		roff_setstrn(&r->strtab, newn, newsz, NULL, 0, 0);
 		roff_setstrn(&r->rentab, newn, newsz, NULL, 0, 0);
+		roff_setentry(r->strtab, newn, newsz, NULL, 0, 0);
 		break;
 	}
 	return ROFF_IGN;
@@ -4102,7 +4165,7 @@ roff_setstr(struct roff *r, const char *name, const char *string,
 	size_t	 namesz;
 
 	namesz = strlen(name);
-	roff_setstrn(&r->strtab, name, namesz, string,
+	roff_setentry(r->strtab, name, namesz, string,
 	    string ? strlen(string) : 0, append);
 	roff_setstrn(&r->rentab, name, namesz, NULL, 0, 0);
 }
@@ -4179,25 +4242,93 @@ roff_setstrn(struct roffkv **r, const char *name, size_t namesz,
 	n->val.sz = (int)(c - n->val.p);
 }
 
+static void
+roff_setentry(struct ohash *r, const char *name, size_t namesz,
+		const char *string, size_t stringsz, int append)
+{
+	struct roff_entry	*entry;
+	char			*c;
+	int			 i;
+	size_t			 oldch, newch;
+
+	/* Search for an existing string with the same name. */
+	entry = roff_strhash_find(r, name, namesz);
+	if (NULL == entry) {
+		/* Create a new string table entry. */
+		entry = mandoc_malloc(sizeof(*entry) + namesz + 1);
+		memcpy(entry->key, name, namesz);
+		entry->key[namesz] = '\0';
+
+		entry->val.p = NULL;
+		entry->val.sz = 0;
+		roff_strhash_insert(r, entry);
+	} else if (0 == append) {
+		/* string is already in hash table, free it in
+		 * preparation for updating.
+		 */
+		free(entry->val.p);
+		entry->val.p = NULL;
+		entry->val.sz = 0;
+	}
+
+	if (NULL == string)
+		return;
+
+	/*
+	 * One additional byte for the '\n' in multiline mode,
+	 * and one for the terminating '\0'.
+	 */
+	newch = stringsz + (1 < append ? 2u : 1u);
+
+	if (NULL == entry->val.p) {
+		entry->val.p = mandoc_malloc(newch);
+		*entry->val.p = '\0';
+		oldch = 0;
+	} else {
+		oldch = entry->val.sz;
+		entry->val.p = mandoc_realloc(entry->val.p, oldch + newch);
+	}
+
+	/* Skip existing content in the destination buffer. */
+	c = entry->val.p + (int)oldch;
+
+	/* Append new content to the destination buffer. */
+	i = 0;
+	while (i < (int)stringsz) {
+		/*
+		 * Rudimentary roff copy mode:
+		 * Handle escaped backslashes.
+		 */
+		if ('\\' == string[i] && '\\' == string[i + 1])
+			i++;
+		*c++ = string[i++];
+	}
+
+	/* Append terminating bytes. */
+	if (1 < append)
+		*c++ = '\n';
+
+	*c = '\0';
+	entry->val.sz = (int)(c - entry->val.p);
+}
+
 static const char *
 roff_getstrn(struct roff *r, const char *name, size_t len,
     int *deftype)
 {
 	const struct roffkv	*n;
 	int			 found, i;
+	const struct roff_entry	*entry;
 	enum roff_tok		 tok;
 
 	found = 0;
-	for (n = r->strtab; n != NULL; n = n->next) {
-		if (strncmp(name, n->key.p, len) != 0 ||
-		    n->key.p[len] != '\0' || n->val.p == NULL)
-			continue;
+	entry = roff_strhash_find(r->strtab, name, len);
+	if (entry != NULL && entry->val.p != NULL) {
 		if (*deftype & ROFFDEF_USER) {
 			*deftype = ROFFDEF_USER;
-			return n->val.p;
+			return entry->val.p;
 		} else {
 			found = 1;
-			break;
 		}
 	}
 	for (n = r->rentab; n != NULL; n = n->next) {
@@ -4265,8 +4396,8 @@ roff_getstrn(struct roff *r, const char *name, size_t len,
 
 		/* Using an undefined string defines it to be empty. */
 
-		roff_setstrn(&r->strtab, name, len, "", 0, 0);
 		roff_setstrn(&r->rentab, name, len, NULL, 0, 0);
+		roff_setentry(r->strtab, name, len, "", 0, 0);
 	}
 
 	*deftype = 0;
@@ -4286,6 +4417,13 @@ roff_freestr(struct roffkv *r)
 	}
 }
 
+static void
+roff_free_entry(struct roff_entry *entry)
+{
+	free(entry->val.p);
+	free(entry);
+}
+
 /* --- accessors and utility functions ------------------------------------ */
 
 /*
diff --git a/roff_int.h b/roff_int.h
index a26afa9..b18a782 100644
--- a/roff_int.h
+++ b/roff_int.h
@@ -19,6 +19,7 @@
  */
 
 struct	ohash;
+struct	roff_entry;
 struct	roff_node;
 struct	roff_meta;
 struct	roff;
@@ -82,6 +83,11 @@ struct ohash	 *roffhash_alloc(enum roff_tok, enum roff_tok);
 enum roff_tok	  roffhash_find(struct ohash *, const char *, size_t);
 void		  roffhash_free(struct ohash *);
 
+struct ohash	 *roff_strhash_alloc(void);
+struct roff_entry*roff_strhash_find(struct ohash *htab, const char *name, size_t sz);
+void		  roff_strhash_insert(struct ohash *htab, struct roff_entry *entry);
+void		  roff_strhash_free(struct ohash *htab);
+
 enum mandoc_esc	  roff_escape(const char *, const int, const int,
 			int *, int *, int *, int *, int *);
 void		  roff_state_reset(struct roff_man *);
-- 
2.42.1

--
 To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv

Improve performance of makewhatis

2023-11-13T01:44:00Z

Hello,

In this patch set I aimed to improve the performance of makewhatis. On
my Chimera Linux systems makewhatis -Tutf8 is run to rebuild the index
whenever a package containing man pages is installed or updated. I
noticed this took multiple seconds even on a Ryzen 9 7950X system.

Profiling showed a lot of time spent in the loops of roff_getstrn. To
speed this up I made use of hash tables. I tested the changes on several
different systems:

- OpenBSD amd64 in VM
- Linux on Raspberry Pi 4
- Linux on Ryzen 9 7950X

In each case the wall clock run time for makewhatis -Tutf8 -n showed an
reduction of between 25% and 35%.

The one caveat of this change is the roff/while/into regression test is
failing with my changes. I've tried to work out why but I don't know
roff well enough understand the intent of the test nor have I been able
to clearly determine why my changes affect it. I was hoping to get some
help with this.

Regards,
Wes


--
 To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv

Re: mandoc mishandles tzfile(5)'s .IP \(bu "\w'\(bu 'u"

2023-10-24T19:36:15Z

On 10/23/23 14:23, Ingo Schwarze wrote:
> Does that patch work for you as well as your own?

Yes, thanks, it works for tzfile(5).
--
 To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv

Re: mandoc mishandles tzfile(5)'s .IP \(bu "\w'\(bu 'u"

2023-10-23T21:24:02Z

Hi Paul,

Paul Eggert wrote on Sun, Oct 22, 2023 at 05:41:28PM -0700:
> On 2023-10-22 14:06, Ingo Schwarze wrote:

>> mandoc only supports ASCII strings as arguments to \w, not escape
>> sequences or formatting instructions.

> For the TZDB man pages mandoc need not support all that, just \(bu.

Thank you for identifying a subset of the functionality that is
both useful and feasible.  That's so much better than the TODO
item for \w i have lying around - that one is much less important
than what you suggested but horrifically difficult.

> Just to make sure we're on the same page, I reproduced the problem by 
> running the command "mandoc -man -Tascii t.5", where t.5 contains the 
> following lines:
> 
>          .TH tzfile 5
>          .SH NAME
>          .IP \(bu "\w'\(bu  'u"
>          xxx
>          .PP
>          yyy
> 
> The output should contain two spaces between the bullet's "o" and the 
> "x", but with current mandoc it contains five spaces.

Yes, that matches my understanding.

> Proposed mandoc patch attached. This isn't a perfect emulation of groff, 

In this case, that's a virtue, even though in many other cases,
compatibility with groff is indeed among the main goals.

> nor have I tested with fancy constructs,

Not a major roadblock; such testing becomes significantly easier
with the existing test suite, and by adding to it.

> but it should be good enough for tzfile(5).

And likely for some other pages, too.

Regarding the implementation:

I did not like that you did most of the work in roff_escape.c.
That function is called each and every time any parser or formatter
wants to deal with any escape sequence.  Your patch did additional
work for almost every kind of escape sequence, even though \w is
the only escape sequence needing that kind of work.

On top of that, while changing the internal API mandoc_escape(3)
is not unheard of and can be done when it is really important,
the need for an API break did not seem very urgent to me here.

Consequently, i chose to do all the work at one local place in the
internal function roff_expand().  Even though that requires iterating
the argument of every \w escape sequence twice, it causes less work
regarding the grand total because \w is a rare escape sequence,
occurring much less frequently than other sequences, and even for \w
relevant extra work is only done if the argument contains embedded
escape sequences, which will occur still less frequently.  Encountering
long, complicated, deeply nested esacpe sequences inside \w would
definitely be unusual.  Doing extra work for every parsing and every
formatting of each and every escape sequence sounds clearly worse.

Besides, roff_expand() is really the place where the logic belongs.
The function roff_escape() is a parser.  Its job is to figure out where
the various syntax elements (sequence, name, argument) begin and end
and what the class of the sequence is.  Doing calculations bases on
the content of the argument is out of scope there, and considering that
it's a parser, doing calculations related to formatting even more so.

Calculating the information needed for interpolation really belongs
into the string interpolation function roff_expand(), and given that \w
in particular is related to formatting, the normal internal ESCAPE_*
API that the formatters use for such purposes should be used here,
too.  Potentially, that might also make future refinements of the
functionality easier: the more this code already resenmbles typical
formatter code, the better.

To summarize, i committed the following patch.

Does that patch work for you as well as your own?

Yours,
  Ingo

Log Message:
-----------
Support some escape sequences, in particular character escape sequences,
inside \w arguments, and skip most other escape sequences when measuring
the output length in this way because most escape sequences contribute
little or nothing to text width: for example, consider font escapes in
terminal output.

This implementation is very rudimentary.  In particular, it assumes that
every character has the same width.  No attempt is made to detect 
double-width or zero-width Unicode characters or to take dependencies on
output devices or fonts into account.  These limitations are hard to
avoid because mandoc has to interpolate \w at the parsing stage when the
output device is not yet known.  I really do not want the content of the
syntax tree to depend on the output device.

Feature requested by Paul , who also
submitted a patch, but i chose to commit this very different patch
with almost the same functionality.
His input was still very valuable because complete support for \w is
out of the question, and consequently, the main task is identifying
subsets of the feature that are needed for real-world manual pages 
and can be supported without uprooting the whole forest.

Modified Files:
--------------
    mandoc:
        roff.7
        roff.c
    mandoc/regress/roff/esc:
        w.in
        w.out_ascii
        w.out_lint

Revision Data
-------------
Index: roff.7
===================================================================
RCS file: /home/cvs/mandoc/mandoc/roff.7,v
retrieving revision 1.120
retrieving revision 1.121
diff -Lroff.7 -Lroff.7 -u -p -r1.120 -r1.121
--- roff.7
+++ roff.7
@@ -1,6 +1,6 @@
 .\" $Id$
 .\"
-.\" Copyright (c) 2010-2019, 2022 Ingo Schwarze 
+.\" Copyright (c) 2010-2019, 2022-2023 Ingo Schwarze 
 .\" Copyright (c) 2010, 2011, 2012 Kristaps Dzonsons 
 .\"
 .\" Permission to use, copy, modify, and distribute this software for any
@@ -2224,7 +2224,8 @@ The
 .Xr mandoc 1
 implementation assumes that after expansion of user-defined strings, the
 .Ar string
-only contains normal characters, no escape sequences, and that each
+only contains normal characters, characters expressed as escape sequences,
+and zero-width escape sequences, and that each
 character has a width of 24 basic units.
 .It Ic \eX\(aq Ns Ar string Ns Ic \(aq
 Output
Index: roff.c
===================================================================
RCS file: /home/cvs/mandoc/mandoc/roff.c,v
retrieving revision 1.398
retrieving revision 1.399
diff -Lroff.c -Lroff.c -u -p -r1.398 -r1.399
--- roff.c
+++ roff.c
@@ -1,6 +1,6 @@
 /* $Id$ */
 /*
- * Copyright (c) 2010-2015, 2017-2022 Ingo Schwarze 
+ * Copyright (c) 2010-2015, 2017-2023 Ingo Schwarze 
  * Copyright (c) 2008-2012, 2014 Kristaps Dzonsons 
  *
  * Permission to use, copy, modify, and distribute this software for any
@@ -1362,6 +1362,7 @@ roff_expand(struct roff *r, struct buf *
 	const char	*res;		/* the string to be pasted */
 	const char	*src;		/* source for copying */
 	char		*dst;		/* destination for copying */
+	enum mandoc_esc	 subtype;	/* return value from roff_escape */
 	int		 iesc;		/* index of leading escape char */
 	int		 inam;		/* index of the escape name */
 	int		 iarg;		/* index beginning the argument */
@@ -1551,8 +1552,34 @@ roff_expand(struct roff *r, struct buf *
 			res = ubuf;
 			break;
 		case 'w':
-			(void)snprintf(ubuf, sizeof(ubuf),
-			    "%d", (iendarg - iarg) * 24);
+			rsz = 0;
+			subtype = ESCAPE_UNDEF;
+			while (iarg < iendarg) {
+				asz = subtype == ESCAPE_SKIPCHAR ? 0 : 1;
+				if (buf->buf[iarg] != '\\') {
+					rsz += asz;
+					iarg++;
+					continue;
+				}
+				switch ((subtype = roff_escape(buf->buf, 0,
+				    iarg, NULL, NULL, NULL, NULL, &iarg))) {
+				case ESCAPE_SPECIAL:
+				case ESCAPE_NUMBERED:
+				case ESCAPE_UNICODE:
+				case ESCAPE_OVERSTRIKE:
+				case ESCAPE_UNDEF:
+					break;
+				case ESCAPE_DEVICE:
+					asz *= 8;
+					break;
+				case ESCAPE_EXPAND:
+					abort();
+				default:
+					continue;
+				}
+				rsz += asz;
+			}
+			(void)snprintf(ubuf, sizeof(ubuf), "%d", rsz * 24);
 			res = ubuf;
 			break;
 		default:
Index: w.out_ascii
===================================================================
RCS file: /home/cvs/mandoc/mandoc/regress/roff/esc/w.out_ascii,v
retrieving revision 1.3
retrieving revision 1.4
diff -Lregress/roff/esc/w.out_ascii -Lregress/roff/esc/w.out_ascii -u -p -r1.3 -r1.4
--- regress/roff/esc/w.out_ascii
+++ regress/roff/esc/w.out_ascii
@@ -8,6 +8,13 @@ D\bDE\bES\bSC\bCR\bRI\bIP\bPT\bTI\bIO\bON\bN
      character: 24
      blank: 24
      text: 96
+     special: 24
+     numbered: 24
+     Unicode: 24
+     overstrike: 24
+     undefined: 24
+     zero-width: 0
+     skipchar: 48

    A\bAr\brg\bgu\bum\bme\ben\bnt\bt d\bde\bel\bli\bim\bmi\bit\bte\ber\brs\bs
      unsupported \r: 24u
@@ -27,4 +34,4 @@ D\bDE\bES\bSC\bCR\bRI\bIP\bPT\bTI\bIO\bON\bN
      overstrike: 24u
      unterminated: 72

-OpenBSD                          June 8, 2022                          OpenBSD
+OpenBSD                        October 23, 2023                        OpenBSD
Index: w.out_lint
===================================================================
RCS file: /home/cvs/mandoc/mandoc/regress/roff/esc/w.out_lint,v
retrieving revision 1.7
retrieving revision 1.8
diff -Lregress/roff/esc/w.out_lint -Lregress/roff/esc/w.out_lint -u -p -r1.7 -r1.8
--- regress/roff/esc/w.out_lint
+++ regress/roff/esc/w.out_lint
@@ -1,4 +1,5 @@
-mandoc: w.in:17:20: UNSUPP: unsupported escape sequence: \r
-mandoc: w.in:17:23: UNSUPP: unsupported escape sequence: \r
-mandoc: w.in:23:16: WARNING: undefined escape, printing literally: \G
-mandoc: w.in:51:15: ERROR: incomplete escape sequence: \w'foo
+mandoc: w.in:25:15: WARNING: undefined escape, printing literally: \G
+mandoc: w.in:31:20: UNSUPP: unsupported escape sequence: \r
+mandoc: w.in:31:23: UNSUPP: unsupported escape sequence: \r
+mandoc: w.in:37:16: WARNING: undefined escape, printing literally: \G
+mandoc: w.in:65:15: ERROR: incomplete escape sequence: \w'foo
Index: w.in
===================================================================
RCS file: /home/cvs/mandoc/mandoc/regress/roff/esc/w.in,v
retrieving revision 1.3
retrieving revision 1.4
diff -Lregress/roff/esc/w.in -Lregress/roff/esc/w.in -u -p -r1.3 -r1.4
--- regress/roff/esc/w.in
+++ regress/roff/esc/w.in
@@ -1,4 +1,4 @@
-.\" $OpenBSD: w.in,v 1.4 2022/06/08 13:08:00 schwarze Exp $
+.\" $OpenBSD: w.in,v 1.5 2023/10/23 20:07:19 schwarze Exp $
 .Dd $Mdocdate$
 .Dt ESC-W 1
 .Os
@@ -13,6 +13,20 @@ character: \w'n'
 blank: \w' '
 .br
 text: \w'text'
+.br
+special: \w'\(bu'
+.br
+numbered: \w'\N'100''
+.br
+Unicode: \w'\[u2013]'
+.br
+overstrike: \w'\o'ab''
+.br
+undefined: \w'\G'
+.br
+zero-width: \w'\fB\&\fP'
+.br
+skipchar: \w'a\zb\z\(buc'
 .Ss Argument delimiters
 unsupported \er: \w\rM\ru
 .br
--
 To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv

Re: typo in mandoc_escape.3

2023-10-23T12:48:41Z

Hi Paul,

Paul Eggert wrote on Sun, Oct 22, 2023 at 05:22:48PM -0700:

> There's a typo in mandoc_escape.3: it refers to an internal function 
> that has been renamed. Proposed patch attached.

Thank you, committed.
Nice to see the internal documentation is actually being used.  :-)

Yours,
  Ingo


Log Message:
-----------
update the internal function name roff_res() to roff_expand(),
it was changed some time ago;
patch from Paul 

Modified Files:
--------------
    mandoc:
        mandoc_escape.3

Revision Data
-------------
Index: mandoc_escape.3
===================================================================
RCS file: /home/cvs/mandoc/mandoc/mandoc_escape.3,v
retrieving revision 1.4
retrieving revision 1.5
diff -Lmandoc_escape.3 -Lmandoc_escape.3 -u -p -r1.4 -r1.5
--- mandoc_escape.3
+++ mandoc_escape.3
@@ -100,7 +100,7 @@ width measurements
 and numerical expression control
 .Ic \eB .
 These are handled by
-.Fn roff_res ,
+.Fn roff_expand ,
 a private preprocessor function called from
 .Fn roff_parseln ,
 see the file
--
 To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv

Re: mandoc mishandles tzfile(5)'s .IP \(bu "\w'\(bu 'u"

2023-10-23T10:33:58Z

[-- Attachment #1: Type: text/plain, Size: 2119 bytes --]

Hi Branden,

On Mon, Oct 23, 2023 at 03:30:59AM -0500, G. Branden Robinson wrote:
> Hi Paul,
> 
> At 2023-10-22T17:41:28-0700, Paul Eggert wrote:
> > On 2023-10-22 14:06, Ingo Schwarze wrote:
> > > mandoc only supports ASCII strings as arguments to \w, not escape
> > > sequences or formatting instructions.
> > 
> > For the TZDB man pages mandoc need not support all that, just \(bu.
> > 
> > Just to make sure we're on the same page, I reproduced the problem by
> > running the command "mandoc -man -Tascii t.5", where t.5 contains the
> > following lines:
> > 
> >         .TH tzfile 5
> >         .SH NAME
> >         .IP \(bu "\w'\(bu  'u"
> >         xxx
> >         .PP
> >         yyy
> 
> At the risk of being simplistic, why not just give `IP` an explicit
> measurement as an argument?
> 
> .IP \(bu 2n

He feels that IP \(bu 3n is too long of a space in PDF.  "\w'\(bu  'u"
has the benefit of being 3n in terminals, but shorter in PDF.

This was triggered after my suggestion of using 3[n] instead of 2[n] to
clearly separate the bullet from the bulleted text, as docuemented in
man-pages(7).

Cheers,
Alex

> 
> (Or 3n, or 4n, or whatever looks best to you.)
> 
> > The output should contain two spaces between the bullet's "o" and the
> > "x", but with current mandoc it contains five spaces.
> 
> If you're viewing on a terminal, `.IP \(bu 3n` should achieve this.[1]
> 
> (Typesetters are a different story because how wide a bullet is depends
> on the output device and the font.)
> 
> I'm not saying that better mandoc(1) support for `\w` would be an awful
> thing to have, but it doesn't seem necessary, to me, to solve this
> specific problem.
> 
> Regards,
> Branden
> 
> [1] Strictly, you can leave the "n" off, but I consider that slightly
>     sloppy, and I think that the explicit scaling unit is also helpful
>     as a reminder to the man page author that `IP`'s second argument,
>     unlike most arguments to man(7) macros, will _not_ be formatted as
>     text.

I might take that.

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

mandoc mishandles tzfile(5)'s .IP \(bu "\w'\(bu 'u"

2023-10-23T00:41:31Z

[-- Attachment #1: Type: text/plain, Size: 781 bytes --]

On 2023-10-22 14:06, Ingo Schwarze wrote:
> mandoc only supports 
> ASCII strings as arguments to \w, not escape sequences or formatting 
> instructions.

For the TZDB man pages mandoc need not support all that, just \(bu.

Just to make sure we're on the same page, I reproduced the problem by 
running the command "mandoc -man -Tascii t.5", where t.5 contains the 
following lines:

         .TH tzfile 5
         .SH NAME
         .IP \(bu "\w'\(bu  'u"
         xxx
         .PP
         yyy

The output should contain two spaces between the bullet's "o" and the 
"x", but with current mandoc it contains five spaces.

Proposed mandoc patch attached. This isn't a perfect emulation of groff, 
nor have I tested with fancy constructs, but it should be good enough 
for tzfile(5).

[-- Attachment #2: mandoc-tzfile-fix.txt --]
[-- Type: text/plain, Size: 5541 bytes --]

Index: roff.c
===================================================================
RCS file: /cvs/mandoc/roff.c,v
retrieving revision 1.398
diff -u -r1.398 roff.c
--- roff.c	22 Oct 2023 16:02:01 -0000	1.398
+++ roff.c	22 Oct 2023 20:59:52 -0000
@@ -1367,6 +1367,7 @@
 	int		 iarg;		/* index beginning the argument */
 	int		 iendarg;	/* index right after the argument */
 	int		 iend;		/* index right after the sequence */
+	int		 icols;		/* output columns of sequence */
 	int		 isrc, idst;	/* to reduce \\ and \. in names */
 	int		 deftype;	/* type of definition to paste */
 	int		 argi;		/* macro argument index */
@@ -1404,7 +1405,7 @@
 		 */

 		if (roff_escape(buf->buf, ln, pos, &iesc, &inam,
-		    &iarg, &iendarg, &iend) != ESCAPE_EXPAND) {
+		    &iarg, &iendarg, &iend, &icols) != ESCAPE_EXPAND) {
 			while (pos < iend) {
 				if (buf->buf[pos] == ec) {
 					buf->buf[pos] = '\\';
@@ -1552,7 +1553,7 @@
 			break;
 		case 'w':
 			(void)snprintf(ubuf, sizeof(ubuf),
-			    "%d", (iendarg - iarg) * 24);
+			    "%d", icols * 24);
 			res = ubuf;
 			break;
 		default:
@@ -4030,7 +4031,7 @@
 		if (cp[1] == '{' || cp[1] == '}')
 			break;
 		if (roff_escape(cp, 0, 0, NULL, &inam,
-		    NULL, NULL, &iend) != ESCAPE_UNDEF) {
+		    NULL, NULL, &iend, NULL) != ESCAPE_UNDEF) {
 			mandoc_msg(MANDOCERR_NAMESC, ln, pos,
 			    "%.*s%.*s", namesz, name, iend, cp);
 			cp += iend;
Index: roff_escape.c
===================================================================
RCS file: /cvs/mandoc/roff_escape.c,v
retrieving revision 1.14
diff -u -r1.14 roff_escape.c
--- roff_escape.c	8 Jun 2022 13:23:57 -0000	1.14
+++ roff_escape.c	22 Oct 2023 20:59:52 -0000
@@ -42,7 +42,7 @@
         enum mandoc_esc  rval;

         rval = roff_escape(--*rendarg, 0, 0,
-	    NULL, NULL, &iarg, &iendarg, &iend);
+	    NULL, NULL, &iarg, &iendarg, &iend, NULL);
         assert(rval != ESCAPE_EXPAND);
         if (rarg != NULL)
 	       *rarg = *rendarg + iarg;
@@ -64,14 +64,16 @@
  */
 enum mandoc_esc
 roff_escape(const char *buf, const int ln, const int aesc,
-    int *resc, int *rnam, int *rarg, int *rendarg, int *rend)
+    int *resc, int *rnam, int *rarg, int *rendarg, int *rend, int *rcols)
 {
 	int		 iesc;		/* index of leading escape char */
 	int		 inam;		/* index of escape name */
 	int		 iarg;		/* index beginning the argument */
 	int		 iendarg;	/* index right after the argument */
 	int		 iend;		/* index right after the sequence */
-	int		 sesc, snam, sarg, sendarg, send; /* for sub-escape */
+	int		 icols;		/* column width of sequence */
+	int		 sesc, snam, sarg, sendarg, send, scols;
+					/* for sub-escape */
 	int		 escterm;	/* whether term is escaped */
 	int		 maxl;		/* expected length of the argument */
 	int		 argl;		/* actual length of the argument */
@@ -98,6 +100,7 @@
 	 */

 	iarg = iendarg = iend = inam + 1;
+	icols = 0;
 	maxl = INT_MAX;
 	term = '\0';
 	err = MANDOCERR_OK;
@@ -141,11 +144,13 @@
 	case '\'':
 	case '-':
 	case '0':
-	case ':':
 	case '_':
 	case '`':
 	case 'e':
 	case '~':
+		icols++;
+		/* FALLTHROUGH */
+	case ':':
 		iarg--;
 		argl = 1;
 		rval = ESCAPE_SPECIAL;
@@ -179,6 +184,7 @@
 		break;
 	case '(':
 	case '[':
+		icols++;
 		rval = ESCAPE_SPECIAL;
 		iendarg = iend = --iarg;
 		break;
@@ -208,6 +214,7 @@
 		term = '\b';
 		break;
 	case 'C':
+		icols++;
 		rval = ESCAPE_SPECIAL;
 		term = '\b';
 		break;
@@ -224,6 +231,7 @@
 		term = '\b';
 		break;
 	case 'o':
+		icols++;
 		rval = ESCAPE_OVERSTRIKE;
 		term = '\b';
 		break;
@@ -271,7 +279,7 @@
 	if ((term == '\b' || (term == '\0' && maxl == INT_MAX)) &&
 	    buf[iarg] == buf[iesc]) {
 		stype = roff_escape(buf, ln, iendarg,
-		    &sesc, &snam, &sarg, &sendarg, &send);
+		    &sesc, &snam, &sarg, &sendarg, &send, &scols);
 		if (stype == ESCAPE_EXPAND)
 			goto out_sub;
 	}
@@ -285,11 +293,13 @@
 			    buf[snam]) != NULL) {
 				err = MANDOCERR_ESC_DELIM;
 				iend = send;
+				icols += scols;
 				iarg = iendarg = sesc;
 				goto out;
 			}
 			escterm = 1;
 			iarg = send;
+			icols += scols;
 			term = buf[snam];
 		} else if (strchr("BDHLRSvxNhl", buf[inam]) != NULL &&
 		    strchr(" %&()*+-./0123456789:<=>", buf[iarg]) != NULL) {
@@ -347,10 +357,11 @@
 		}
 		if (buf[iendarg] == buf[iesc]) {
 			stype = roff_escape(buf, ln, iendarg,
-			    &sesc, &snam, &sarg, &sendarg, &send);
+			    &sesc, &snam, &sarg, &sendarg, &send, &scols);
 			if (stype == ESCAPE_EXPAND)
 				goto out_sub;
 			iend = send;
+			icols += scols;
 			if (escterm == 1 &&
 			    (buf[snam] == term || buf[inam] == 'N'))
 				break;
@@ -366,6 +377,8 @@
 				valid_A = 0;
 			if (maxl != INT_MAX)
 				maxl--;
+			if (term == '\'')
+				icols++;
 			iend = ++iendarg;
 		}
 	}
@@ -502,6 +515,7 @@
 	iarg = sarg;
 	iendarg = sendarg;
 	iend = send;
+	icols = scols;
 	rval = ESCAPE_EXPAND;

 out:
@@ -515,6 +529,8 @@
 		*rendarg = iendarg;
 	if (rend != NULL)
 		*rend = iend;
+	if (rcols != NULL)
+		*rcols = icols;
 	if (ln == 0)
 		return rval;

Index: roff_int.h
===================================================================
RCS file: /cvs/mandoc/roff_int.h,v
retrieving revision 1.20
diff -u -r1.20 roff_int.h
--- roff_int.h	2 Jun 2022 11:29:07 -0000	1.20
+++ roff_int.h	22 Oct 2023 20:59:52 -0000
@@ -83,7 +83,7 @@
 void		  roffhash_free(struct ohash *);

 enum mandoc_esc	  roff_escape(const char *, const int, const int,
-			int *, int *, int *, int *, int *);
+			int *, int *, int *, int *, int *, int *);
 void		  roff_state_reset(struct roff_man *);
 void		  roff_validate(struct roff_man *);

typo in mandoc_escape.3

2023-10-23T00:22:56Z

[-- Attachment #1: Type: text/plain, Size: 117 bytes --]

There's a typo in mandoc_escape.3: it refers to an internal function 
that has been renamed. Proposed patch attached.

[-- Attachment #2: mandoc_escape.3.diff --]
[-- Type: text/x-patch, Size: 483 bytes --]

Index: mandoc_escape.3
===================================================================
RCS file: /cvs/mandoc/mandoc_escape.3,v
retrieving revision 1.4
diff -u -r1.4 mandoc_escape.3
--- mandoc_escape.3	4 Jul 2017 23:40:01 -0000	1.4
+++ mandoc_escape.3	22 Oct 2023 20:59:51 -0000
@@ -100,7 +100,7 @@
 and numerical expression control
 .Ic \eB .
 These are handled by
-.Fn roff_res ,
+.Fn roff_expand ,
 a private preprocessor function called from
 .Fn roff_parseln ,
 see the file

Re: [man]: Bullet lists

2023-10-22T11:15:08Z

[-- Attachment #1: Type: text/plain, Size: 21184 bytes --]

Oops, I should have CCd tech@, not discuss@.

On Sun, Oct 22, 2023 at 12:58:17PM +0200, Alejandro Colomar wrote:
> Hi Paul,
> 
> On Sat, Oct 21, 2023 at 05:36:03PM -0700, Paul Eggert wrote:
> > On 2023-10-18 04:16, Alejandro Colomar wrote:
> > 
> > > I guess using '\[bu]' (or '\(bu') might not be appropriate for the tz
> > > project, as it may be less portable to old systems
> > 
> > We should be ok with \(bu even with older nroff; I just checked with Solaris
> > 10 nroff and it generated an ASCII 'o', same as GNU/Linux.
> > 
> > 
> > > The change would be to use '.IP * 3' instead of '.IP * 2'.  3 is for the
> > > width of '*' +2.
> > 
> > 3 is equivalent to 3n, which is too much for PDF output. I installed the
> > attached, which uses \w'\(bu  'u instead of 3. This is equivalent for to 3
> > for nroff, and has a better look for troff.
> 
> Hmm, interesting trick.  To me, 3 doesn't look so excessive; it's at the
> verge of being it, but still acceptable, but can understand your
> preference.
> 
> The only downside I see to your approach (appart from making the source
> more complex, but I'm willing to pay that for better output), is that
> mandoc(1) doesn't seem to understand that, and falls back to the default
> indentation, which is even more excessive.
> 
> Maybe we can report the problem to the maintainers of mandoc(1),
> although I'm not sure how much they'll enjoy such low level roff(7)
> stuff.
> 
> > 
> > This patch also cleans up some of the other indenting irregularities in TZDB
> > man pages; for example, plain ".TP" is good enough, and we don't need ".TP
> > 10". Hope it works for you.
> 
> Thanks!  LGTM.
> 
> > From 4a577028c65c70f6a85726ee34d307ee4a51b24d Mon Sep 17 00:00:00 2001
> > From: Paul Eggert 
> > Date: Sat, 21 Oct 2023 16:10:29 -0700
> > Subject: [PROPOSED] Be more systematic about man page indenting
> > 
> > This responds to a suggestion by Alejandro Colomar in:
> > https://mm.icann.org/pipermail/tz/2023-October/033116.html
> > ---
> >  newtzset.3   |  6 ++--
> >  time2posix.3 |  2 +-
> >  tzfile.5     | 86 +++++++++++++++++++++++++++++-----------------------
> >  zdump.8      |  5 ++-
> >  zic.8        | 48 ++++++++++++++---------------
> >  5 files changed, 78 insertions(+), 69 deletions(-)
> > 
> > diff --git a/newtzset.3 b/newtzset.3
> > index 80617cd7..45ddbd24 100644
> > --- a/newtzset.3
> > +++ b/newtzset.3
> > @@ -115,7 +115,7 @@ it must have the following syntax (spaces inserted for clarity):
> >  .PP
> >  Where:
> >  .RS
> > -.TP 15
> > +.TP
> >  .IR std " and " dst
> >  Three or more bytes that are the designation for the standard
> >  .RI ( std )
> > @@ -205,7 +205,7 @@ The format of
> >  .I date
> >  is one of the following:
> >  .RS
> > -.TP 10
> > +.TP
> >  .BI J n
> >  The Julian day
> >  .I n
> > @@ -238,7 +238,7 @@ first week in which the
> >  .IR d' th
> >  day occurs.  Day zero is Sunday.
> >  .RE
> > -.IP "" 15
> > +.IP
> >  The
> >  .I time
> >  has the same format as
> > diff --git a/time2posix.3 b/time2posix.3
> > index f48402b9..6644060a 100644
> > --- a/time2posix.3
> > +++ b/time2posix.3
> > @@ -100,7 +100,7 @@ and back from,
> >  the POSIX representation over the leap second inserted at the end of June,
> >  1993.
> >  .nf
> > -.ta \w'93/06/30 'u +\w'23:59:59 'u +\w'A+0 'u +\w'X=time2posix(T) 'u
> > +.ta \w'93/06/30\0'u +\w'23:59:59\0'u +\w'A+0\0'u +\w'X=time2posix(T)\0'u
> >  DATE	TIME	T	X=time2posix(T)	posix2time(X)
> >  93/06/30	23:59:59	A+0	B+0	A+0
> >  93/06/30	23:59:60	A+1	B+1	A+1 or A+2
> > diff --git a/tzfile.5 b/tzfile.5
> > index 59d9f6ba..55280282 100644
> > --- a/tzfile.5
> > +++ b/tzfile.5
> > @@ -26,23 +26,24 @@ a signed binary integer is represented using two's complement,
> >  and a boolean is represented by a one-byte binary integer that is
> >  either 0 (false) or 1 (true).
> >  The format begins with a 44-byte header containing the following fields:
> > -.IP * 2
> > +.RS "\w'  'u"
> > +.IP \(bu "\w'\(bu  'u"
> >  The magic four-byte ASCII sequence
> >  .q "TZif"
> >  identifies the file as a timezone information file.
> > -.IP *
> > +.IP \(bu
> >  A byte identifying the version of the file's format
> >  (as of 2021, either an ASCII NUL,
> >  .q "2",
> >  .q "3",
> >  or
> >  .q "4" ).
> > -.IP *
> > +.IP \(bu
> >  Fifteen bytes containing zeros reserved for future use.
> > -.IP *
> > +.IP \(bu
> >  Six four-byte integer values, in the following order:
> > -.RS
> > -.TP
> > +.RS "\w'  \(bu    'u"
> > +.TP "\w'  'u"
> >  .B tzh_ttisutcnt
> >  The number of UT/local indicators stored in the file.
> >  (UT is Universal Time.)
> > @@ -68,14 +69,15 @@ stored in the file.
> >  .PP
> >  The above header is followed by the following fields, whose lengths
> >  depend on the contents of the header:
> > -.IP * 2
> > +.RS "\w'  'u"
> > +.IP \(bu "\w'\(bu  'u"
> >  .B tzh_timecnt
> >  four-byte signed integer values sorted in ascending order.
> >  These values are written in network byte order.
> >  Each is used as a transition time (as returned by
> >  .BR time (2))
> >  at which the rules for computing local time change.
> > -.IP *
> > +.IP \(bu
> >  .B tzh_timecnt
> >  one-byte unsigned integer values;
> >  each one but the last tells which of the different types of local time types
> > @@ -85,20 +87,20 @@ and continuing up to but not including the next transition time.
> >  (The last time type is present only for consistency checking with the
> >  POSIX-style TZ string described below.)
> >  These values serve as indices into the next field.
> > -.IP *
> > +.IP \(bu
> >  .B tzh_typecnt
> >  .B ttinfo
> >  entries, each defined as follows:
> > -.in +.5i
> > +.in +2
> 
> Hmm, I think I'll take this.  The Linux man-pages currently use .in +4n
> for examples, but maybe I can simplify to just +4 without the 'n'.  I'll
> check if the PDF isn't excessive.
> 
> Cheers,
> Alex
> 
> >  .sp
> >  .nf
> > -.ta .5i +\w'unsigned char\0\0'u
> > +.ta \w'\0\0\0\0'u +\w'unsigned char\0'u
> >  struct ttinfo {
> >  	int32_t	tt_utoff;
> >  	unsigned char	tt_isdst;
> >  	unsigned char	tt_desigidx;
> >  };
> > -.in -.5i
> > +.in
> >  .fi
> >  .sp
> >  Each structure is written as a four-byte signed integer value for
> > @@ -132,7 +134,8 @@ Also, in realistic applications
> >  is in the range [\-89999, 93599] (i.e., more than \-25 hours and less
> >  than 26 hours); this allows easy support by implementations that
> >  already support the POSIX-required range [\-24:59:59, 25:59:59].
> > -.IP *
> > +.RS "\w'  'u"
> > +.IP \(bu "\w'\(bu  'u"
> >  .B tzh_charcnt
> >  bytes that represent time zone designations,
> >  which are null-terminated byte strings, each indexed by the
> > @@ -140,7 +143,7 @@ which are null-terminated byte strings, each indexed by the
> >  values mentioned above.
> >  The byte strings can overlap if one is a suffix of the other.
> >  The encoding of these strings is not specified.
> > -.IP *
> > +.IP \(bu
> >  .B tzh_leapcnt
> >  pairs of four-byte values, written in network byte order;
> >  the first value of each pair gives the nonnegative time
> > @@ -167,18 +170,19 @@ otherwise, for timestamps before the first occurrence time,
> >  the leap-second correction is zero if the first pair's correction is 1 or \-1,
> >  and is unspecified otherwise (which can happen only in files
> >  truncated at the start).
> > -.IP *
> > +.IP \(bu
> >  .B tzh_ttisstdcnt
> >  standard/wall indicators, each stored as a one-byte boolean;
> >  they tell whether the transition times associated with local time types
> >  were specified as standard time or local (wall clock) time.
> > -.IP *
> > +.IP \(bu
> >  .B tzh_ttisutcnt
> >  UT/local indicators, each stored as a one-byte boolean;
> >  they tell whether the transition times associated with local time types
> >  were specified as UT or local time.
> >  If a UT/local indicator is set, the corresponding standard/wall indicator
> >  must also be set.
> > +.RE
> >  .PP
> >  The standard/wall and UT/local indicators were designed for
> >  transforming a TZif file's transition times into transitions appropriate
> > @@ -312,15 +316,17 @@ This section documents common problems in reading or writing TZif files.
> >  Most of these are problems in generating TZif files for use by
> >  older readers.
> >  The goals of this section are:
> > -.IP * 2
> > +.RS "\w'  'u"
> > +.IP \(bu "\w'\(bu  'u"
> >  to help TZif writers output files that avoid common
> >  pitfalls in older or buggy TZif readers,
> > -.IP *
> > +.IP \(bu
> >  to help TZif readers avoid common pitfalls when reading
> >  files generated by future TZif writers, and
> > -.IP *
> > +.IP \(bu
> >  to help any future specification authors see what sort of
> >  problems arise when the TZif format is changed.
> > +.RE
> >  .PP
> >  When new versions of the TZif format have been defined, a
> >  design goal has been that a reader can successfully use a TZif
> > @@ -335,21 +341,22 @@ workarounds, as well as to document other common bugs in
> >  readers.
> >  .PP
> >  Interoperability problems with TZif include the following:
> > -.IP * 2
> > +.RS "\w'  'u"
> > +.IP \(bu "\w'\(bu  'u"
> >  Some readers examine only version 1 data.
> >  As a partial workaround, a writer can output as much version 1
> >  data as possible.
> >  However, a reader should ignore version 1 data, and should use
> >  version 2+ data even if the reader's native timestamps have only
> >  32 bits.
> > -.IP *
> > +.IP \(bu
> >  Some readers designed for version 2 might mishandle
> >  timestamps after a version 3 or higher file's last transition, because
> >  they cannot parse extensions to POSIX in the TZ-like string.
> >  As a partial workaround, a writer can output more transitions
> >  than necessary, so that only far-future timestamps are
> >  mishandled by version 2 readers.
> > -.IP *
> > +.IP \(bu
> >  Some readers designed for version 2 do not support
> >  permanent daylight saving time with transitions after 24:00
> >  \(en e.g., a TZ string
> > @@ -367,22 +374,22 @@ for the next time zone east \(en e.g.,
> >  .q "AST4"
> >  for permanent
> >  Atlantic Standard Time (\-04).
> > -.IP *
> > +.IP \(bu
> >  Some readers designed for version 2 or 3, and that require strict
> >  conformance to RFC 8536, reject version 4 files whose leap second
> >  tables are truncated at the start or that end in expiration times.
> > -.IP *
> > +.IP \(bu
> >  Some readers ignore the footer, and instead predict future
> >  timestamps from the time type of the last transition.
> >  As a partial workaround, a writer can output more transitions
> >  than necessary.
> > -.IP *
> > +.IP \(bu
> >  Some readers do not use time type 0 for timestamps before
> >  the first transition, in that they infer a time type using a
> >  heuristic that does not always select time type 0.
> >  As a partial workaround, a writer can output a dummy (no-op)
> >  first transition at an early time.
> > -.IP *
> > +.IP \(bu
> >  Some readers mishandle timestamps before the first
> >  transition that has a timestamp not less than \-2**31.
> >  Readers that support only 32-bit timestamps are likely to be
> > @@ -391,11 +398,11 @@ more prone to this problem, for example, when they process
> >  bits.
> >  As a partial workaround, a writer can output a dummy
> >  transition at timestamp \-2**31.
> > -.IP *
> > +.IP \(bu
> >  Some readers mishandle a transition if its timestamp has
> >  the minimum possible signed 64-bit value.
> >  Timestamps less than \-2**59 are not recommended.
> > -.IP *
> > +.IP \(bu
> >  Some readers mishandle POSIX-style TZ strings that
> >  contain
> >  .q "<"
> > @@ -407,11 +414,11 @@ or
> >  .q ">"
> >  for time zone abbreviations containing only alphabetic
> >  characters.
> > -.IP *
> > +.IP \(bu
> >  Many readers mishandle time zone abbreviations that contain
> >  non-ASCII characters.
> >  These characters are not recommended.
> > -.IP *
> > +.IP \(bu
> >  Some readers may mishandle time zone abbreviations that
> >  contain fewer than 3 or more than 6 characters, or that
> >  contain ASCII characters other than alphanumerics,
> > @@ -419,7 +426,7 @@ contain ASCII characters other than alphanumerics,
> >  and
> >  .q "+".
> >  These abbreviations are not recommended.
> > -.IP *
> > +.IP \(bu
> >  Some readers mishandle TZif files that specify
> >  daylight-saving time UT offsets that are less than the UT
> >  offsets for the corresponding standard time.
> > @@ -435,7 +442,7 @@ thus swapping standard and daylight saving time.
> >  Although this workaround misidentifies which part of the year
> >  uses daylight saving time, it records UT offsets and time zone
> >  abbreviations correctly.
> > -.IP *
> > +.IP \(bu
> >  Some readers generate ambiguous timestamps for positive leap seconds
> >  that occur when the UTC offset is not a multiple of 60 seconds.
> >  For example, in a timezone with UTC offset +01:23:45 and with
> > @@ -446,38 +453,41 @@ instead of mapping the latter to 01:23:46, and they will map 78796815 to
> >  This has not yet been a practical problem, since no civil authority
> >  has observed such UTC offsets since leap seconds were
> >  introduced in 1972.
> > +.RE
> >  .PP
> >  Some interoperability problems are reader bugs that
> >  are listed here mostly as warnings to developers of readers.
> > -.IP * 2
> > +.RS "\w'  'u"
> > +.IP \(bu "\w'\(bu  'u"
> >  Some readers do not support negative timestamps.
> >  Developers of distributed applications should keep this
> >  in mind if they need to deal with pre-1970 data.
> > -.IP *
> > +.IP \(bu
> >  Some readers mishandle timestamps before the first
> >  transition that has a nonnegative timestamp.
> >  Readers that do not support negative timestamps are likely to
> >  be more prone to this problem.
> > -.IP *
> > +.IP \(bu
> >  Some readers mishandle time zone abbreviations like
> >  .q "\*-08"
> >  that contain
> >  .q "+",
> >  .q "\*-",
> >  or digits.
> > -.IP *
> > +.IP \(bu
> >  Some readers mishandle UT offsets that are out of the
> >  traditional range of \-12 through +12 hours, and so do not
> >  support locations like Kiritimati that are outside this
> >  range.
> > -.IP *
> > +.IP \(bu
> >  Some readers mishandle UT offsets in the range [\-3599, \-1]
> >  seconds from UT, because they integer-divide the offset by
> >  3600 to get 0 and then display the hour part as
> >  .q "+00".
> > -.IP *
> > +.IP \(bu
> >  Some readers mishandle UT offsets that are not a multiple
> >  of one hour, or of 15 minutes, or of 1 minute.
> > +.RE
> >  .SH SEE ALSO
> >  .BR time (2),
> >  .BR localtime (3),
> > diff --git a/zdump.8 b/zdump.8
> > index f77c0c79..c3f0bba6 100644
> > --- a/zdump.8
> > +++ b/zdump.8
> > @@ -152,10 +152,9 @@ tabbed columns line up.)
> >  .nf
> >  .sp
> >  .if \n(.g .ft CR
> > -.if t .in +.5i
> > -.if n .in +2
> > +.in +2
> >  .nr w \w'1896-01-13  'u+\n(.i
> > -.ta \w'1896-01-13  'u +\w'12:01:26  'u +\w'-103126  'u +\w'HWT  'u
> > +.ta \w'1896-01-13\0\0'u +\w'12:01:26\0\0'u +\w'-103126\0\0'u +\w'HWT\0\0'u
> >  TZ="Pacific/Honolulu"
> >  -	-	-103126	LMT
> >  1896-01-13	12:01:26	-1030	HST
> > diff --git a/zic.8 b/zic.8
> > index 6bcef7ae..a958ddd1 100644
> > --- a/zic.8
> > +++ b/zic.8
> > @@ -95,7 +95,7 @@ as local time.
> >  .B zic
> >  will act as if the input contained a link line of the form
> >  .sp
> > -.ti +.5i
> > +.ti +2
> >  .ta \w'Link\0\0'u  +\w'\fItimezone\fP\0\0'u
> >  Link	\fItimezone\fP		localtime
> >  .sp
> > @@ -118,7 +118,7 @@ TZ strings like "EET\*-2EEST" that lack transition rules.
> >  .B zic
> >  will act as if the input contained a link line of the form
> >  .sp
> > -.ti +.5i
> > +.ti +2
> >  Link	\fItimezone\fP		posixrules
> >  .sp
> >  If
> > @@ -330,19 +330,19 @@ abbreviation must be unambiguous in context.
> >  .PP
> >  A rule line has the form
> >  .nf
> > -.ti +.5i
> > +.ti +2
> >  .ta \w'Rule\0\0'u +\w'NAME\0\0'u +\w'FROM\0\0'u +\w'1973\0\0'u +\w'\*-\0\0'u +\w'Apr\0\0'u +\w'lastSun\0\0'u +\w'2:00w\0\0'u +\w'1:00d\0\0'u
> >  .sp
> >  Rule	NAME	FROM	TO	\*-	IN	ON	AT	SAVE	LETTER/S
> >  .sp
> >  For example:
> > -.ti +.5i
> > +.ti +2
> >  .sp
> >  Rule	US	1967	1973	\*-	Apr	lastSun	2:00w	1:00d	D
> >  .sp
> >  .fi
> >  The fields that make up a rule line are:
> > -.TP "\w'LETTER/S'u"
> > +.TP
> >  .B NAME
> >  Gives the name of the rule set that contains this line.
> >  The name must start with a character that is neither
> > @@ -404,7 +404,7 @@ Month names may be abbreviated.
> >  Gives the day on which the rule takes effect.
> >  Recognized forms include:
> >  .nf
> > -.in +.5i
> > +.in +2
> >  .sp
> >  .ta \w'Sun<=25\0\0'u
> >  5	the fifth of the month
> > @@ -413,7 +413,7 @@ lastMon	the last Monday in the month
> >  Sun>=8	first Sunday on or after the eighth
> >  Sun<=25	last Sunday on or before the 25th
> >  .fi
> > -.in -.5i
> > +.in
> >  .sp
> >  A weekday name (e.g.,
> >  .BR "Sunday" )
> > @@ -440,7 +440,7 @@ Gives the time of day at which the rule takes effect,
> >  relative to 00:00, the start of a calendar day.
> >  Recognized forms include:
> >  .nf
> > -.in +.5i
> > +.in +2
> >  .sp
> >  .ta \w'00:19:32.13\0\0'u
> >  2	time in hours
> > @@ -454,7 +454,7 @@ Recognized forms include:
> >  \*-2:30	2.5 hours before 00:00
> >  \*-	equivalent to 0
> >  .fi
> > -.in -.5i
> > +.in
> >  .sp
> >  Although
> >  .B zic
> > @@ -532,18 +532,18 @@ the variable part is null.
> >  A zone line has the form
> >  .sp
> >  .nf
> > -.ti +.5i
> > +.ti +2
> >  .ta \w'Zone\0\0'u +\w'Asia/Amman\0\0'u +\w'STDOFF\0\0'u +\w'Jordan\0\0'u +\w'FORMAT\0\0'u
> >  Zone	NAME	STDOFF	RULES	FORMAT	[UNTIL]
> >  .sp
> >  For example:
> >  .sp
> > -.ti +.5i
> > +.ti +2
> >  Zone	Asia/Amman	2:00	Jordan	EE%sT	2017 Oct 27 01:00
> >  .sp
> >  .fi
> >  The fields that make up a zone line are:
> > -.TP "\w'STDOFF'u"
> > +.TP
> >  .B NAME
> >  The name of the timezone.
> >  This is the name used in creating the time conversion information file for the
> > @@ -663,15 +663,15 @@ For example:
> >  .br
> >  .ne 7
> >  .nf
> > -.in +2m
> > +.in +2
> >  .ta \w'# Rule\0\0'u +\w'NAME\0\0'u +\w'FROM\0\0'u +\w'2006\0\0'u +\w'\*-\0\0'u +\w'Oct\0\0'u +\w'lastSun\0\0'u +\w'2:00\0\0'u +\w'SAVE\0\0'u
> >  .sp
> >  # Rule	NAME	FROM	TO	\*-	IN	ON	AT	SAVE	LETTER/S
> >  Rule	US	1967	2006	-	Oct	lastSun	2:00	0	S
> >  Rule	US	1967	1973	-	Apr	lastSun	2:00	1:00	D
> > -.ta \w'Zone\0\0America/Menominee\0\0'u +\w'STDOFF\0\0'u +\w'RULES\0\0'u +\w'FORMAT\0\0'u
> > -# Zone\0\0NAME	STDOFF	RULES	FORMAT	[UNTIL]
> > -Zone\0\0America/Menominee	\*-5:00	\*-	EST	1973 Apr 29 2:00
> > +.ta \w'# Zone\0\0'u +\w'America/Menominee\0\0'u +\w'STDOFF\0\0'u +\w'RULES\0\0'u +\w'FORMAT\0\0'u
> > +# Zone	NAME	STDOFF	RULES	FORMAT	[UNTIL]
> > +Zone	America/Menominee	\*-5:00	\*-	EST	1973 Apr 29 2:00
> >  	\*-6:00	US	C%sT
> >  .sp
> >  .in
> > @@ -687,13 +687,13 @@ interprets this more sensibly as a single transition from 02:00 CST (\*-05) to
> >  A link line has the form
> >  .sp
> >  .nf
> > -.ti +.5i
> > +.ti +2
> >  .ta \w'Link\0\0'u +\w'Europe/Istanbul\0\0'u
> >  Link	TARGET	LINK-NAME
> >  .sp
> >  For example:
> >  .sp
> > -.ti +.5i
> > +.ti +2
> >  Link	Europe/Istanbul	Asia/Istanbul
> >  .sp
> >  .fi
> > @@ -717,7 +717,7 @@ For example:
> >  .sp
> >  .ne 3
> >  .nf
> > -.in +2m
> > +.in +2
> >  .ta \w'Zone\0\0'u +\w'Greenwich\0\0'u
> >  Link	Greenwich	G_M_T
> >  Link	Etc/GMT	Greenwich
> > @@ -737,13 +737,13 @@ The file that describes leap seconds can have leap lines and an
> >  expiration line.
> >  Leap lines have the following form:
> >  .nf
> > -.ti +.5i
> > +.ti +2
> >  .ta \w'Leap\0\0'u +\w'YEAR\0\0'u +\w'MONTH\0\0'u +\w'DAY\0\0'u +\w'HH:MM:SS\0\0'u +\w'CORR\0\0'u
> >  .sp
> >  Leap	YEAR	MONTH	DAY	HH:MM:SS	CORR	R/S
> >  .sp
> >  For example:
> > -.ti +.5i
> > +.ti +2
> >  .sp
> >  Leap	2016	Dec	31	23:59:60	+	S
> >  .sp
> > @@ -791,13 +791,13 @@ option is used.
> >  .PP
> >  The expiration line, if present, has the form:
> >  .nf
> > -.ti +.5i
> > +.ti +2
> >  .ta \w'Expires\0\0'u +\w'YEAR\0\0'u +\w'MONTH\0\0'u +\w'DAY\0\0'u
> >  .sp
> >  Expires	YEAR	MONTH	DAY	HH:MM:SS
> >  .sp
> >  For example:
> > -.ti +.5i
> > +.ti +2
> >  .sp
> >  Expires	2020	Dec	28	00:00:00
> >  .sp
> > @@ -816,7 +816,7 @@ Here is an extended example of
> >  .B zic
> >  input, intended to illustrate many of its features.
> >  .nf
> > -.in +2m
> > +.in +2
> >  .ta \w'# Rule\0\0'u +\w'NAME\0\0'u +\w'FROM\0\0'u +\w'1973\0\0'u +\w'\*-\0\0'u +\w'Apr\0\0'u +\w'lastSun\0\0'u +\w'2:00\0\0'u +\w'SAVE\0\0'u
> >  .sp
> >  # Rule	NAME	FROM	TO	\*-	IN	ON	AT	SAVE	LETTER/S
> > -- 
> > 2.39.2
> > 
> 
> 
> -- 
> <https://www.alejandro-colomar.es/>



-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

Re: Crash core dump when parsing krb5_openlog.3

2023-10-21T17:54:40Z

Hi Bapt,

Ingo Schwarze wrote on Fri, Oct 13, 2023 at 12:57:37PM +0200:
> Baptiste Daroussin wrote on Fri, Oct 13, 2023 at 09:06:56AM +0200:

>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=266882
>> I haven't had time yet to debug, do you actually need more information?

I believe this is fixed by the commit appended below.

I'm also going to update the FreeBSD bug report.

[...]
> Having a quick look at your reproducing input file, i suspect that
> the following input confuses mandoc:
> 
> .\" ouch!
> .ds xx \\*(fP\fR(\fP\\*(lI*\\*(fP
> 
> Notably, you already marked that line with "ouch" in the manual
> page source code, presumably acknowledging that doing such low-level
> gymnastics in a manual page is asking for trouble.  To not handle such
> madness gracefully is still a bug in mandoc though, i do not deny that.
> 
> I suspect that using the extremely special escape sequence "\\"
> inside .ds is confusing mandoc - that sequence is mostly intended
> for being used inside .de.  The pre-parser still sees the "\\" and
> consequently sees no user-defined string replacement escape sequences.
> But when the "xx" string is later used, the "\\*" likely gets resolved
> to "\*", i.e. to a string replacement request, but at the point, the
> pre-parser has already been run so the string does not get replaced
> and makes it through to the formatters, and those cannot handle it.
> Or something like that, i will have to investigate in detail.

That turned out to be true, as far as it goes.

Two aspects are missing from the explanation:

 1. Not only is the "xx" string later used, but it is used
    (1) with delayed expansion (i.e. \\*(xx rather than \*(xx)
    and (2) inside a macro argument.
 2. The macro argument parser already handled delayed expansion.
    What it failed to handle was recursive expansion where both
    the outer and the inner expansion were delayed.

To summarize, as far as i can see, the assertion only triggers
when a manual pages uses delayed expansion recursively inside
a macro argument.

Note that using expansion inside a manual page is already
bad style, even if it is neither recursive nor delayed nor
inside a macro argument...  :-(

Feedback whether the patch works for you is welcome.

Thanks again for the report!
  Ingo


Log Message:
-----------
When parsing a macro argument results in delayed escape sequence 
expansion, re-check for all contained escape sequences whether they 
need delayed expansion, not just for the particular escape sequences
that triggered delayed expansion in the first place.  This is needed
because delayed expansion can result in strings containing nested
escape sequences recursively needing delayed expansion, too.

This fixes an assertion failure in krb5_openlog(3), see:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=266882

Thanks to Wolfram Schneider  for reporting the bug
and to Baptiste Daroussin  for forwarding the report.

Modified Files:
--------------
    mandoc:
        mandoc.h
        roff.c

Revision Data
-------------
Index: mandoc.h
===================================================================
RCS file: /home/cvs/mandoc/mandoc/mandoc.h,v
retrieving revision 1.281
retrieving revision 1.282
diff -Lmandoc.h -Lmandoc.h -u -p -r1.281 -r1.282
--- mandoc.h
+++ mandoc.h
@@ -23,7 +23,6 @@
 #define ASCII_NBRZW	 30  /* non-breaking zero-width space */
 #define ASCII_BREAK	 29  /* breakable zero-width space */
 #define ASCII_HYPH	 28  /* breakable hyphen */
-#define ASCII_ESC	 27  /* escape sequence from copy-in processing */
 #define ASCII_TABREF	 26  /* reset tab reference position */
 
 /*
Index: roff.c
===================================================================
RCS file: /home/cvs/mandoc/mandoc/roff.c,v
retrieving revision 1.396
retrieving revision 1.397
diff -Lroff.c -Lroff.c -u -p -r1.396 -r1.397
--- roff.c
+++ roff.c
@@ -1387,7 +1387,7 @@ roff_expand(struct roff *r, struct buf *
 		 */
 
 		if (buf->buf[pos] != ec) {
-			if (ec != ASCII_ESC && buf->buf[pos] == '\\') {
+			if (buf->buf[pos] == '\\') {
 				roff_expand_patch(buf, pos, "\\e", pos + 1);
 				pos++;
 			}
@@ -1632,12 +1632,7 @@ roff_getarg(struct roff *r, char **cpp, 
 				cp++;
 				break;
 			case '\\':
-				/*
-				 * Signal to roff_expand() that an escape
-				 * sequence resulted from copy-in processing
-				 * and needs to be checked or interpolated.
-				 */
-				cp[-pairs] = ASCII_ESC;
+				cp[-pairs] = '\\';
 				newesc = 1;
 				pairs++;
 				cp++;
@@ -1694,7 +1689,7 @@ roff_getarg(struct roff *r, char **cpp, 
 	buf.buf = start;
 	buf.sz = strlen(start) + 1;
 	buf.next = NULL;
-	if (roff_expand(r, &buf, ln, 0, ASCII_ESC) & ROFF_IGN) {
+	if (roff_expand(r, &buf, ln, 0, '\\') & ROFF_IGN) {
 		free(buf.buf);
 		buf.buf = mandoc_strdup("");
 	}
--
 To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv

Re: mandoc -man -Thtml bug: inconsistent vertical space before .TP

2023-10-19T21:32:56Z

[-- Attachment #1: Type: text/plain, Size: 6564 bytes --]

On Thu, Oct 19, 2023 at 06:19:23PM +0200, Ingo Schwarze wrote:
> Hi Alejandro,
> 
> Alejandro Colomar wrote on Thu, Oct 19, 2023 at 05:17:10PM +0200:
> 
> > I had this gripe with man(7) some years ago.  I thought of using the
> > following instead, which slightly complicates the source code, but makes
> > it more logical.
> > 
> > 	$ cat nested_indent.man 
> > 	.TH nested_indent 7 2023-10-19 experiments
> > 	.SH Ingo said:
> > 	.TP
> > 	Todo
> > 	Currently, when formatting .TP or .IP with a non-empty head,
> > 	[yada yada]
> > 	.RS
> > 	.PP
> > 	When formatting .IP or .RS with an empty head, mandoc needs
> > 	[yada yada]
> > 	.RE
> > 
> > As you can see, here the indentation is controlled by a single RS/RE
> > pair, and everything within it uses PP as a normal paragraph separator.
> 
> While that also generates correct terminal and typographical (PS, PDF)
> output in the same purely presentational sense as .TP .IP .TP, it
> does not help with respect to the semantic problem we are discussing
> here.
> 
> Look at the AST generated by mandoc(1):
> 
>    $ mandoc -T tree nested_indent.man
>   title = "nested_indent"
>   sec   = "7"
>   vol   = "Miscellaneous Information Manual"
>   os    = "experiments"
>   date  = "2023-10-19"
>   
>   SH (block) *2:2
>     SH (head) 2:2 ID=HREF=Ingo_said:
>         Ingo (text) 2:5
>         said: (text) 2:10
>     SH (body) 2:2
>         TP (block) *3:2
>           TP (head) 3:2 ID=HREF
>               Todo (text) *4:1
>           TP (body) 4:1
>               Currently, when formatting .TP or .IP \
>                   with a nonempty head, (text) *5:1
>               [yada yada] (text) *6:1
>         RS (block) *7:2
>           RS (head) 7:2
>           RS (body) 7:2
>               PP (block) *8:2
>                 PP (head) 8:2
>                 PP (body) 8:2
>                     When formatting .IP or .RS with an empty head,
>                         mandoc needs (text) *9:1
>                     [yada yada] (text) *10:1
>         TP (block) *12:2
>           TP (head) 12:2 ID=HREF=final
>               final tag (text) *13:1
>           TP (body) 13:1
>               final body (text) *14:1
> 
> You see that the first .TP, the .RS, and the second .TP are all child
> nodes of the top-level .SH.  The .RS is not a child of the .TP but
> a sibling.  The two .TP nodes still aren't siblings of each other.
> 
> Now on first sight, you might blame me for that and call it a mandoc
> artifact, arguing that mandoc instead ought to treat the .RS as a
> child of the first .TP.  But no, that would be incorrect parsing
> for the following reason: the .TP inmplies an indentation, and
> the .RS also implies an indentation.  If the .RS were a child of
> the .TP, we would get double indentation.  You can make that
> argument even more convincing by adding a width argument to .RS
> and varying that argument.  That way, you see that the .RS is
> indented relative to the .SH, not relative to the .TP.
> 
> There are some cases where it is not completely clear whether one
> man(7) node following another man(7) node is a child or a sibling.
> mandoc(1) makes arbitrary choices in such ambiguous cases, usually
> opting for sibling relations where possible and avoiding unnecessary
> child relationships.  But this is not an ambiguous case.  Just like
> the .IP, the .RS is definitely a sibling and not a child of the .TP.
> As i said, no block can nest inside .TP.
> 
> That's why i brought up .RS in my reply and developed rules
> for handling it in a similar way as .IP, even though you did
> not mention .RS before.
> 
> > You could put the RS before the first paragraph, but then an unwanted
> > line break appears after the tag.
> 
> No matter where you put the .RS, it will never be a child of .TP.
> 
> > (Maybe man(7) could be tweaked so
> > that RS doesn't insert the line break after a TP.)
> 
> Not really a useful idea because .RS doesn't help with the actual
> problem in the first place.
> 
> > In the end I didn't switch to that scheme, because IP just worked, but
> > I might consider it if it proves to be useful.  What do you think?
> 
> As i said, i am not aware of a better solution than .TP .IP .TP.
> In particular, .RS is not better because it causes exactly the
> same trouble and potentially more trouble besides.
> 
> But i also said that trying to define "good style" for man(7)
> is a fool's errand.  Because man(7) code is so exceedingly difficult
> to write, man(7) code that is very clearly bad style is very often
> found in the wild, so there is ample opportunity for saying "this
> is bad style."  In some cases, it is also possible to point out
> better style, for example
> 
>   .BR "some word" .
> 
> is clearly better style than
> 
>   .B some word\c
>   \&.
> 
> even though both are correct man(7) code and even though there are
> situations in man(7) where \c is unavoidable.
> 
> But very frequently, situations arise where man(7) doesn't really
> allow any good solution, and the best you can do is not making
> the source gratuitiously worse than it needs to be.
> 
> The .TP .IP .TP idiom is such an example.  It's definitely ugly from
> both semantic and stylistic points of view, but no good solution
> is available.  I'm willing to go further and claim that no better
> solution can be designed even if you are willing to introduce a new
> macro or change the way the .TP API is defined, even in incompatible
> ways, because it's not this particular macro that is broken.  What is
> broken is the fundamental design of the language: the language not
> only predates the concept of semantic markup, but it also predates
> the concept of block nesting in markup languages.  Yes, that is hard
> to believe for people born after 1970 because those people have
> essentially grown up with HTML and LaTeX and those two markup
> languages have defined their concept of what a markup language is,
> but let's face it, man(7) predates those fundamental concepts,
> and it shows all over the place.
> 
> As long as people are using the language, mandoc(1) needs to somehow
> deal with the mess.  I'm not happy with that because it is wasting a
> lot of development time which could be spent in more productive ways,
> but what can i do...
> 
> Yours,
>   Ingo

Hi Ingo,

Hmmm.  You convinced me (about the problems of man(7)), I think.


Cheers,
Alex

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

Re: mandoc -man -Thtml bug: inconsistent vertical space before .TP

2023-10-19T16:19:28Z

Hi Alejandro,

Alejandro Colomar wrote on Thu, Oct 19, 2023 at 05:17:10PM +0200:

> I had this gripe with man(7) some years ago.  I thought of using the
> following instead, which slightly complicates the source code, but makes
> it more logical.
> 
> 	$ cat nested_indent.man 
> 	.TH nested_indent 7 2023-10-19 experiments
> 	.SH Ingo said:
> 	.TP
> 	Todo
> 	Currently, when formatting .TP or .IP with a non-empty head,
> 	[yada yada]
> 	.RS
> 	.PP
> 	When formatting .IP or .RS with an empty head, mandoc needs
> 	[yada yada]
> 	.RE
> 
> As you can see, here the indentation is controlled by a single RS/RE
> pair, and everything within it uses PP as a normal paragraph separator.

While that also generates correct terminal and typographical (PS, PDF)
output in the same purely presentational sense as .TP .IP .TP, it
does not help with respect to the semantic problem we are discussing
here.

Look at the AST generated by mandoc(1):

   $ mandoc -T tree nested_indent.man
  title = "nested_indent"
  sec   = "7"
  vol   = "Miscellaneous Information Manual"
  os    = "experiments"
  date  = "2023-10-19"
  
  SH (block) *2:2
    SH (head) 2:2 ID=HREF=Ingo_said:
        Ingo (text) 2:5
        said: (text) 2:10
    SH (body) 2:2
        TP (block) *3:2
          TP (head) 3:2 ID=HREF
              Todo (text) *4:1
          TP (body) 4:1
              Currently, when formatting .TP or .IP \
                  with a nonempty head, (text) *5:1
              [yada yada] (text) *6:1
        RS (block) *7:2
          RS (head) 7:2
          RS (body) 7:2
              PP (block) *8:2
                PP (head) 8:2
                PP (body) 8:2
                    When formatting .IP or .RS with an empty head,
                        mandoc needs (text) *9:1
                    [yada yada] (text) *10:1
        TP (block) *12:2
          TP (head) 12:2 ID=HREF=final
              final tag (text) *13:1
          TP (body) 13:1
              final body (text) *14:1

You see that the first .TP, the .RS, and the second .TP are all child
nodes of the top-level .SH.  The .RS is not a child of the .TP but
a sibling.  The two .TP nodes still aren't siblings of each other.

Now on first sight, you might blame me for that and call it a mandoc
artifact, arguing that mandoc instead ought to treat the .RS as a
child of the first .TP.  But no, that would be incorrect parsing
for the following reason: the .TP inmplies an indentation, and
the .RS also implies an indentation.  If the .RS were a child of
the .TP, we would get double indentation.  You can make that
argument even more convincing by adding a width argument to .RS
and varying that argument.  That way, you see that the .RS is
indented relative to the .SH, not relative to the .TP.

There are some cases where it is not completely clear whether one
man(7) node following another man(7) node is a child or a sibling.
mandoc(1) makes arbitrary choices in such ambiguous cases, usually
opting for sibling relations where possible and avoiding unnecessary
child relationships.  But this is not an ambiguous case.  Just like
the .IP, the .RS is definitely a sibling and not a child of the .TP.
As i said, no block can nest inside .TP.

That's why i brought up .RS in my reply and developed rules
for handling it in a similar way as .IP, even though you did
not mention .RS before.

> You could put the RS before the first paragraph, but then an unwanted
> line break appears after the tag.

No matter where you put the .RS, it will never be a child of .TP.

> (Maybe man(7) could be tweaked so
> that RS doesn't insert the line break after a TP.)

Not really a useful idea because .RS doesn't help with the actual
problem in the first place.

> In the end I didn't switch to that scheme, because IP just worked, but
> I might consider it if it proves to be useful.  What do you think?

As i said, i am not aware of a better solution than .TP .IP .TP.
In particular, .RS is not better because it causes exactly the
same trouble and potentially more trouble besides.

But i also said that trying to define "good style" for man(7)
is a fool's errand.  Because man(7) code is so exceedingly difficult
to write, man(7) code that is very clearly bad style is very often
found in the wild, so there is ample opportunity for saying "this
is bad style."  In some cases, it is also possible to point out
better style, for example

  .BR "some word" .

is clearly better style than

  .B some word\c
  \&.

even though both are correct man(7) code and even though there are
situations in man(7) where \c is unavoidable.

But very frequently, situations arise where man(7) doesn't really
allow any good solution, and the best you can do is not making
the source gratuitiously worse than it needs to be.

The .TP .IP .TP idiom is such an example.  It's definitely ugly from
both semantic and stylistic points of view, but no good solution
is available.  I'm willing to go further and claim that no better
solution can be designed even if you are willing to introduce a new
macro or change the way the .TP API is defined, even in incompatible
ways, because it's not this particular macro that is broken.  What is
broken is the fundamental design of the language: the language not
only predates the concept of semantic markup, but it also predates
the concept of block nesting in markup languages.  Yes, that is hard
to believe for people born after 1970 because those people have
essentially grown up with HTML and LaTeX and those two markup
languages have defined their concept of what a markup language is,
but let's face it, man(7) predates those fundamental concepts,
and it shows all over the place.

As long as people are using the language, mandoc(1) needs to somehow
deal with the mess.  I'm not happy with that because it is wasting a
lot of development time which could be spent in more productive ways,
but what can i do...

Yours,
  Ingo
--
 To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv

Re: mandoc -man -Thtml bug: inconsistent vertical space before .TP

2023-10-19T15:17:23Z

[-- Attachment #1: Type: text/plain, Size: 9447 bytes --]

On Thu, Oct 19, 2023 at 04:45:21PM +0200, Ingo Schwarze wrote:
> Hi Alejandro,
> 
> Alejandro Colomar wrote on Mon, Oct 16, 2023 at 07:22:11PM +0200:
> > On Mon, Oct 16, 2023 at 06:28:05PM +0200, Ingo Schwarze wrote:
> >> Alejandro Colomar wrote on Mon, Oct 16, 2023 at 01:32:30PM +0200:
> 
> >>> groff -man -Thtml seems to never produce a blank line before a TP.
> >> What do you mean by "blank line"?
> > What my eyes experience as a relatively large inter-paragraph space.
> 
> Heh.  That's not a very useful definition when talking about HTML code,
> given that the HTML language does not provide any non-deprecated syntax
> or semantics related to paragraph spacing.  :)

I know.  :)

> 
> >>> mandoc -man -Thtml produces one in some cases, and I can't see a
> >>> pattern.
> >>> 
> >>> I found this bug while reading feature_test_macros(7) in the Debian
> >>> online manpages:
> >>> <https://manpages.debian.org/bullseye/manpages/ftm.7.en.html>
> 
> >> You can see that particular page rendered here:
> >>   https://man.bsd.lv/Test/ftm.7
> 
> > I don't see the bug there.  I'm going to guess it's just another case of
> > a missing CSS file.
> 
> Actually, there *is* a problem with the HTML code in that page,
> even though you did not see it because the CSS file hides it.
> 
> That page contains this HTML code:
> 
>   
>     _ISOC11_SOURCE (since glibc 2.16)
>     Exposes declarations consistent with the ISO C11 standard.
>       Defining this macro also enables C99 and C95 features
>       (like _ISOC99_SOURCE).
>   
>   
>     
>     Invoking the C compiler with the option -std=c11
>       produces the same effects as defining this macro.
>   
>   
>     _LARGEFILE64_SOURCE
>     Expose definitions for the alternative API specified by the
>       LFS (Large File Summit) as a &quot;transitional extension&quot;
>       to the Single UNIX Specification. [...]
> 
> That's of course quite nonsensical because _ISOC11_SOURCE and
> _LARGEFILE64_SOURCE are intended by the page author as adjacent
> entries in the same list, but mandoc(1) puts them into different
> lists, with yet another single-item list in between.
> 
> >> It is a long page, and i have been unable to figure out what exactly
> >> you are talking about.
> >> 
> >> Please point me to the precise position in the file where vertical
> >> spacing before a .TP macro feels lacking or execessive to you,
> 
> > In the Debian bullseye page, check the inter-paragraph space before the
> > tag _LARGEFILE64_SOURCE (I see a long vertical space) and the tag
> > _LARGEFILE_SOURCE.
> 
> Now i see, thank you for pointing me to the specific place.
> 
> The trouble is caused by the following man(7) idiom:
> 
>   .TP
>   first tag
>   first body
>   .IP
>   still in the first body
>   .TP
>   second tag
>   second body

Hmm, that's an old enemy showing up.

> 
> The author's intent here is that the two .TP macros mark up adjacent
> items in the same list and the .IP marks up an ordinary paragraph
> break within the item body of the first list entry.
> 
> Now, using .IP for an ordinary paragraph break is no doubt surprising,
> but it works (from a purely presentational point of view) because .IP
> does assert the same vertical spacing as .PP would and because .IP
> asserts the same indentation as the previous .TP did.
> 
> So logically, what the author wanted was a list with two entries,
> one containing two paragraphs of text, the other containing one
> paragraph of text.  Technically, what we got is three paragraphs of
> text, all indented by the same amount, the first and last containing
> a tag and the middle one having an empty tag, with no indication that
> there is any relation between any of the paragraphs, let alone that
> they form a list.  Figuring out the logical list structure, if any,
> is left as a guessing exercise to the formatter.
> 
> Here the conceptual inadequacy of the man(7) language becomes
> blatantly obvious.  With very few exceptions, the language does not
> provide any concept of block nesting.  The only exception is that
> various macros can be nested inside .SH, .SS, and .RS.  But nothing
> can be nested inside a list item body, neither a paragraph break,
> nor .RS, nor another list.

I had this gripe with man(7) some years ago.  I thought of using the
following instead, which slightly complicates the source code, but makes
it more logical.

	$ cat nested_indent.man 
	.TH nested_indent 7 2023-10-19 experiments
	.SH Ingo said:
	.TP
	Todo
	Currently, when formatting .TP or .IP with a non-empty head,
	the HTML formatter looks at the previous and at the following
	abstract syntax tree (AST) node to figure out whether the
	tagged paragraph is part of a list.
	If that previous or follwing AST node is .IP or .RS with an
	empty head, it will have to iterate until it finds an AST node
	that is neither .IP nor .RS or has a non-empty head, evaluating
	the properties of that node instead of the directly preceding
	or following node.
	.RS
	.PP
	When formatting .IP or .RS with an empty head, mandoc needs
	to iterate backwards, searching for an AST node that is neither
	\&.IP nor .RS or has a non-empty head, and figure out whether that
	node is a list item, which again, as explained above, requires
	iterating both forwards and backwards.
	If it turns out we are inside a list, interrupting the list
	must be prevented.  Instead, .IP with an empty head must be
	formatted like .PP, and .RS with an empty head must be formatted
	somewhat like .br.
	.RE

As you can see, here the indentation is controlled by a single RS/RE
pair, and everything within it uses PP as a normal paragraph separator.
You could put the RS before the first paragraph, but then an unwanted
line break appears after the tag.  (Maybe man(7) could be tweaked so
that RS doesn't insert the line break after a TP.)

In the end I didn't switch to that scheme, because IP just worked, but
I might consider it if it proves to be useful.  What do you think?

[CC += Branden, in case he wants to add his opinion too]

Cheers,
Alex

> 
> Consequently, i think the fundamental design of the man(7) language
> is too weak to adequately express a list item containing more than a
> single paragraph of text, and the crude presentational workaround of
> splitting the item body into two unrelated paragraphs with different
> introducing macros indeed looks like the best workaround available.
> 
> The longer i look at the man(7) language, the more convinced i become
> that it is so rotten to the core that trying to provide a style
> guide to write good man(7) pages is nothing but a fool's errand,
> and trying to add a few semantical macros to the man(7) language is
> an even worse fool's errand because a few additions won't cure the
> fundamental design flaws.  If you put lipstick on a pig, it's still
> not going to win any Miss Espana contest.
> 
> Let me quote only one other example that i ran into just today.
> The first real-world example of .MR usage i encountered required
> a trailing \c escape sequence on the preceding line.  Now how
> ironic is that?  A brand-new macro introduced to improve semantics,
> but using is requires terribly arcane low-level presentational
> markup.  I'm progressively becoming convinced that the language
> is irredeemable.
> 
> 
> Consequently, the following needs to be done in mandoc:
> 
> 1. Currently, when formatting .TP or .IP with a non-empty head,
>    the HTML formatter looks at the previous and at the following
>    abstract syntax tree (AST) node to figure out whether the
>    tagged paragraph is part of a list.
>    If that previous or follwing AST node is .IP or .RS with an
>    empty head, it will have to iterate until it finds an AST node
>    that is neither .IP nor .RS or has a non-empty head, evaluating
>    the properties of that node instead of the directly preceding
>    or following node.
> 
> 2. When formatting .IP or .RS with an empty head, mandoc needs
>    to iterate backwards, searching for an AST node that is neither
>    .IP nor .RS or has a non-empty head, and figure out whether that
>    node is a list item, which again, as explained above, requires
>    iterating both forwards and backwards.
>    If it turns out we are inside a list, interrupting the list
>    must be prevented.  Instead, .IP with an empty head must be
>    formatted like .PP, and .RS with an empty head must be formatted
>    somewhat like .br.
> 
> Probably, doing all this in the HTML formatter module would be
> over the top.  I believe such complicated AST inspection should
> be done by the validation module (man_validate.c), which should
> set AST node flags similar to
> 
>  - this node starts a new list
>  - this node starts a new list item
>  - this node merely indicates a paragraph break
>  - this node ends a list
> 
> which the fotmatters can then readily use.
> 
> This required logic is so complicated that i won't code it right
> away, there are more urgent matters to be taken care of.
> Instead, i will add it to the mandoc TODO list.
> 
> Yours,
>   Ingo

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

Re: mandoc -man -Thtml bug: inconsistent vertical space before .TP

2023-10-19T15:10:35Z

Hi Alejandro,

Ingo Schwarze wrote on Thu, Oct 19, 2023 at 04:45:21PM +0200:

> Consequently, the following needs to be done in mandoc:
> 
> 1. Currently, when formatting .TP or .IP with a non-empty head,
>    the HTML formatter looks at the previous and at the following
>    abstract syntax tree (AST) node to figure out whether the
>    tagged paragraph is part of a list.
>    If that previous or follwing AST node is .IP or .RS with an
>    empty head, it will have to iterate until it finds an AST node
>    that is neither .IP nor .RS or has a non-empty head, evaluating
>    the properties of that node instead of the directly preceding
>    or following node.
> 
> 2. When formatting .IP or .RS with an empty head, mandoc needs
>    to iterate backwards, searching for an AST node that is neither
>    .IP nor .RS or has a non-empty head, and figure out whether that
>    node is a list item, which again, as explained above, requires
>    iterating both forwards and backwards.
>    If it turns out we are inside a list, interrupting the list
>    must be prevented.  Instead, .IP with an empty head must be
>    formatted like .PP, and .RS with an empty head must be formatted
>    somewhat like .br.
> 
> Probably, doing all this in the HTML formatter module would be
> over the top.  I believe such complicated AST inspection should
> be done by the validation module (man_validate.c), which should
> set AST node flags similar to
> 
>  - this node starts a new list
>  - this node starts a new list item
>  - this node merely indicates a paragraph break
>  - this node ends a list
> 
> which the fotmatters can then readily use.
> 
> This required logic is so complicated that i won't code it right
> away, there are more urgent matters to be taken care of.
> Instead, i will add it to the mandoc TODO list.

FYI: Added, see the commit below.
  Ingo


Log Message:
-----------
new entries: .MR and .TP .IP .TP

Modified Files:
--------------
    mandoc:
        TODO

Revision Data
-------------
Index: TODO
===================================================================
RCS file: /home/cvs/mandoc/mandoc/TODO,v
retrieving revision 1.329
retrieving revision 1.330
diff -LTODO -LTODO -u -p -r1.329 -r1.330
--- TODO
+++ TODO
@@ -239,6 +239,9 @@ are mere guesses, and some may be wrong.
 
 --- missing man features -----------------------------------------------
 
+- groff_man(7) .MR
+  loc **  exist *  algo *  size *  imp ***
+
 - MANWIDTH
   Markus Waldeck  9 Jun 2015 05:49:56 +0200
   Laura Morales  26 Apr 2018 08:15:55 +0200
@@ -504,6 +507,10 @@ are mere guesses, and some may be wrong.
   loc **  exist **  algo **  size *  imp **
 
 --- HTML issues --------------------------------------------------------
+
+- support the idiom .TP .IP .TP for multi-paragraph list item bodies
+  to: Alejandro Colomar Thu, 19 Oct 2023 16:45:21 +0200
+  loc **  exist **  algo **  size **  imp **
 
 - .Nm without an argument and .Bx cause premature

Nab Sun, 5 Jun 2022 18:30:09 +0200 -- To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv

Re: mandoc -man -Thtml bug: inconsistent vertical space before .TP

2023-10-19T14:45:25Z

Hi Alejandro,

Alejandro Colomar wrote on Mon, Oct 16, 2023 at 07:22:11PM +0200:
> On Mon, Oct 16, 2023 at 06:28:05PM +0200, Ingo Schwarze wrote:
>> Alejandro Colomar wrote on Mon, Oct 16, 2023 at 01:32:30PM +0200:

>>> groff -man -Thtml seems to never produce a blank line before a TP.
>> What do you mean by "blank line"?
> What my eyes experience as a relatively large inter-paragraph space.

Heh.  That's not a very useful definition when talking about HTML code,
given that the HTML language does not provide any non-deprecated syntax
or semantics related to paragraph spacing.  :)

>>> mandoc -man -Thtml produces one in some cases, and I can't see a
>>> pattern.
>>> 
>>> I found this bug while reading feature_test_macros(7) in the Debian
>>> online manpages:
>>> <https://manpages.debian.org/bullseye/manpages/ftm.7.en.html>

>> You can see that particular page rendered here:
>>   https://man.bsd.lv/Test/ftm.7

> I don't see the bug there.  I'm going to guess it's just another case of
> a missing CSS file.

Actually, there *is* a problem with the HTML code in that page,
even though you did not see it because the CSS file hides it.

That page contains this HTML code:

    _ISOC11_SOURCE (since glibc 2.16)
    Exposes declarations consistent with the ISO C11 standard.
      Defining this macro also enables C99 and C95 features
      (like _ISOC99_SOURCE).

    Invoking the C compiler with the option -std=c11
      produces the same effects as defining this macro.

    _LARGEFILE64_SOURCE
    Expose definitions for the alternative API specified by the
      LFS (Large File Summit) as a &quot;transitional extension&quot;
      to the Single UNIX Specification. [...]

That's of course quite nonsensical because _ISOC11_SOURCE and
_LARGEFILE64_SOURCE are intended by the page author as adjacent
entries in the same list, but mandoc(1) puts them into different
lists, with yet another single-item list in between.

>> It is a long page, and i have been unable to figure out what exactly
>> you are talking about.
>> 
>> Please point me to the precise position in the file where vertical
>> spacing before a .TP macro feels lacking or execessive to you,

> In the Debian bullseye page, check the inter-paragraph space before the
> tag _LARGEFILE64_SOURCE (I see a long vertical space) and the tag
> _LARGEFILE_SOURCE.

Now i see, thank you for pointing me to the specific place.

The trouble is caused by the following man(7) idiom:

  .TP
  first tag
  first body
  .IP
  still in the first body
  .TP
  second tag
  second body

The author's intent here is that the two .TP macros mark up adjacent
items in the same list and the .IP marks up an ordinary paragraph
break within the item body of the first list entry.

Now, using .IP for an ordinary paragraph break is no doubt surprising,
but it works (from a purely presentational point of view) because .IP
does assert the same vertical spacing as .PP would and because .IP
asserts the same indentation as the previous .TP did.

So logically, what the author wanted was a list with two entries,
one containing two paragraphs of text, the other containing one
paragraph of text.  Technically, what we got is three paragraphs of
text, all indented by the same amount, the first and last containing
a tag and the middle one having an empty tag, with no indication that
there is any relation between any of the paragraphs, let alone that
they form a list.  Figuring out the logical list structure, if any,
is left as a guessing exercise to the formatter.

Here the conceptual inadequacy of the man(7) language becomes
blatantly obvious.  With very few exceptions, the language does not
provide any concept of block nesting.  The only exception is that
various macros can be nested inside .SH, .SS, and .RS.  But nothing
can be nested inside a list item body, neither a paragraph break,
nor .RS, nor another list.

Consequently, i think the fundamental design of the man(7) language
is too weak to adequately express a list item containing more than a
single paragraph of text, and the crude presentational workaround of
splitting the item body into two unrelated paragraphs with different
introducing macros indeed looks like the best workaround available.

The longer i look at the man(7) language, the more convinced i become
that it is so rotten to the core that trying to provide a style
guide to write good man(7) pages is nothing but a fool's errand,
and trying to add a few semantical macros to the man(7) language is
an even worse fool's errand because a few additions won't cure the
fundamental design flaws.  If you put lipstick on a pig, it's still
not going to win any Miss Espana contest.

Let me quote only one other example that i ran into just today.
The first real-world example of .MR usage i encountered required
a trailing \c escape sequence on the preceding line.  Now how
ironic is that?  A brand-new macro introduced to improve semantics,
but using is requires terribly arcane low-level presentational
markup.  I'm progressively becoming convinced that the language
is irredeemable.

Consequently, the following needs to be done in mandoc:

1. Currently, when formatting .TP or .IP with a non-empty head,
   the HTML formatter looks at the previous and at the following
   abstract syntax tree (AST) node to figure out whether the
   tagged paragraph is part of a list.
   If that previous or follwing AST node is .IP or .RS with an
   empty head, it will have to iterate until it finds an AST node
   that is neither .IP nor .RS or has a non-empty head, evaluating
   the properties of that node instead of the directly preceding
   or following node.

2. When formatting .IP or .RS with an empty head, mandoc needs
   to iterate backwards, searching for an AST node that is neither
   .IP nor .RS or has a non-empty head, and figure out whether that
   node is a list item, which again, as explained above, requires
   iterating both forwards and backwards.
   If it turns out we are inside a list, interrupting the list
   must be prevented.  Instead, .IP with an empty head must be
   formatted like .PP, and .RS with an empty head must be formatted
   somewhat like .br.

Probably, doing all this in the HTML formatter module would be
over the top.  I believe such complicated AST inspection should
be done by the validation module (man_validate.c), which should
set AST node flags similar to

 - this node starts a new list
 - this node starts a new list item
 - this node merely indicates a paragraph break
 - this node ends a list

which the fotmatters can then readily use.

This required logic is so complicated that i won't code it right
away, there are more urgent matters to be taken care of.
Instead, i will add it to the mandoc TODO list.

Yours,
  Ingo
--
 To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv

Re: mandoc -man -Thtml: unwanted line break after bullet (.IP)

2023-10-19T12:49:00Z

[-- Attachment #1: Type: text/plain, Size: 4249 bytes --]

Hi Ingo,

On Thu, Oct 19, 2023 at 01:59:22PM +0200, Ingo Schwarze wrote:
> Hi Alejandro,
> 
> Ingo Schwarze wrote on Wed, Oct 18, 2023 at 02:04:46AM +0200:
> 
> >  1. change the Makefile to always install mandoc.css
> 
> Done with the commit appended below.
> 
> I believe all issues discovered in the thread "unwanted line break
> after bullet" have now been fixed.  In case i missed something,
> please speak up!

Thanks!  Not as far as I can see.

Cheers,
Alex

> 
> Yours,
>   Ingo
> 
> 
> Log Message:
> -----------
> Install mandoc.css by default even if man.cgi(8) is not built.
> It matters because users of "mandoc -T html" typically need it.
> 
> Issue found in a conversation with Alejandro Colomar .
> 
> Modified Files:
> --------------
>     mandoc:
>         Makefile
>         configure
>         configure.local.example
> 
> Revision Data
> -------------
> Index: configure
> ===================================================================
> RCS file: /home/cvs/mandoc/mandoc/configure,v
> retrieving revision 1.82
> retrieving revision 1.83
> diff -Lconfigure -Lconfigure -u -p -r1.82 -r1.83
> --- configure
> +++ configure
> @@ -111,6 +111,7 @@ BIN_FROM_SBIN=
>  INCLUDEDIR=
>  LIBDIR=
>  MANDIR=
> +MISCDIR=
>  READ_ALLOWED_PATH=
>  
>  WWWPREFIX="/var/www"
> @@ -620,6 +621,7 @@ exec > Makefile.local
>  [ -z "${INCLUDEDIR}"      ] && INCLUDEDIR="${PREFIX}/include/mandoc"
>  [ -z "${LIBDIR}"          ] && LIBDIR="${PREFIX}/lib/mandoc"
>  [ -z "${MANDIR}"          ] && MANDIR="${PREFIX}/man"
> +[ -z "${MISCDIR}"         ] && MISCDIR="${PREFIX}/share/misc"
>  
>  [ -z "${HTDOCDIR}"        ] && HTDOCDIR="${WWWPREFIX}/htdocs"
>  [ -z "${CGIBINDIR}"       ] && CGIBINDIR="${WWWPREFIX}/cgi-bin"
> @@ -658,6 +660,7 @@ BIN_FROM_SBIN	= ${BIN_FROM_SBIN}
>  INCLUDEDIR	= ${INCLUDEDIR}
>  LIBDIR		= ${LIBDIR}
>  MANDIR		= ${MANDIR}
> +MISCDIR		= ${MISCDIR}
>  WWWPREFIX	= ${WWWPREFIX}
>  HTDOCDIR	= ${HTDOCDIR}
>  CGIBINDIR	= ${CGIBINDIR}
> Index: Makefile
> ===================================================================
> RCS file: /home/cvs/mandoc/mandoc/Makefile,v
> retrieving revision 1.542
> retrieving revision 1.543
> diff -LMakefile -LMakefile -u -p -r1.542 -r1.543
> --- Makefile
> +++ Makefile
> @@ -416,6 +416,7 @@ base-install: mandoc demandoc soelim
>  	mkdir -p $(DESTDIR)$(MANDIR)/man5
>  	mkdir -p $(DESTDIR)$(MANDIR)/man7
>  	mkdir -p $(DESTDIR)$(MANDIR)/man8
> +	mkdir -p $(DESTDIR)$(MISCDIR)
>  	$(INSTALL_PROGRAM) mandoc demandoc $(DESTDIR)$(BINDIR)
>  	$(INSTALL_PROGRAM) soelim $(DESTDIR)$(BINDIR)/$(BINM_SOELIM)
>  	cd $(DESTDIR)$(BINDIR) && $(LN) mandoc $(BINM_MAN)
> @@ -438,6 +439,7 @@ base-install: mandoc demandoc soelim
>  	$(INSTALL_MAN) mandoc_char.7 $(DESTDIR)$(MANDIR)/man7
>  	$(INSTALL_MAN) makewhatis.8 \
>  		$(DESTDIR)$(MANDIR)/man8/$(BINM_MAKEWHATIS).8
> +	$(INSTALL_DATA) mandoc.css $(DESTDIR)$(MISCDIR)
>  
>  lib-install: libmandoc.a
>  	mkdir -p $(DESTDIR)$(LIBDIR)
> Index: configure.local.example
> ===================================================================
> RCS file: /home/cvs/mandoc/mandoc/configure.local.example,v
> retrieving revision 1.44
> retrieving revision 1.45
> diff -Lconfigure.local.example -Lconfigure.local.example -u -p -r1.44 -r1.45
> --- configure.local.example
> +++ configure.local.example
> @@ -108,11 +108,15 @@ OSNAME="OpenBSD 7.0"
>  # there is no need to copy the whole block.
>  # Even if you set PREFIX to something else, the other variables
>  # pick it up without copying them all over.
> +# MISCDIR is only used for installing the file mandoc.css.
> +# That is important because users of "mandoc -T html" often need it
> +# even if they are not using man.cgi(8), see mandoc(1) for details.
>  
>  PREFIX="/usr/local"
>  BINDIR="${PREFIX}/bin"
>  SBINDIR="${PREFIX}/sbin"
>  MANDIR="${PREFIX}/man"
> +MISCDIR="${PREFIX}/share/misc"
>  
>  # If BINDIR and SBINDIR are not subdirectories of the same parent
>  # directory or if the basename(1) of BINDIR differs from "bin",
> --
>  To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv
> 

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

Re: mandoc -man -Thtml: unwanted line break after bullet (.IP)

2023-10-19T11:59:26Z

Hi Alejandro,

Ingo Schwarze wrote on Wed, Oct 18, 2023 at 02:04:46AM +0200:

>  1. change the Makefile to always install mandoc.css

Done with the commit appended below.

I believe all issues discovered in the thread "unwanted line break
after bullet" have now been fixed.  In case i missed something,
please speak up!

Yours,
  Ingo


Log Message:
-----------
Install mandoc.css by default even if man.cgi(8) is not built.
It matters because users of "mandoc -T html" typically need it.

Issue found in a conversation with Alejandro Colomar .

Modified Files:
--------------
    mandoc:
        Makefile
        configure
        configure.local.example

Revision Data
-------------
Index: configure
===================================================================
RCS file: /home/cvs/mandoc/mandoc/configure,v
retrieving revision 1.82
retrieving revision 1.83
diff -Lconfigure -Lconfigure -u -p -r1.82 -r1.83
--- configure
+++ configure
@@ -111,6 +111,7 @@ BIN_FROM_SBIN=
 INCLUDEDIR=
 LIBDIR=
 MANDIR=
+MISCDIR=
 READ_ALLOWED_PATH=
 
 WWWPREFIX="/var/www"
@@ -620,6 +621,7 @@ exec > Makefile.local
 [ -z "${INCLUDEDIR}"      ] && INCLUDEDIR="${PREFIX}/include/mandoc"
 [ -z "${LIBDIR}"          ] && LIBDIR="${PREFIX}/lib/mandoc"
 [ -z "${MANDIR}"          ] && MANDIR="${PREFIX}/man"
+[ -z "${MISCDIR}"         ] && MISCDIR="${PREFIX}/share/misc"
 
 [ -z "${HTDOCDIR}"        ] && HTDOCDIR="${WWWPREFIX}/htdocs"
 [ -z "${CGIBINDIR}"       ] && CGIBINDIR="${WWWPREFIX}/cgi-bin"
@@ -658,6 +660,7 @@ BIN_FROM_SBIN	= ${BIN_FROM_SBIN}
 INCLUDEDIR	= ${INCLUDEDIR}
 LIBDIR		= ${LIBDIR}
 MANDIR		= ${MANDIR}
+MISCDIR		= ${MISCDIR}
 WWWPREFIX	= ${WWWPREFIX}
 HTDOCDIR	= ${HTDOCDIR}
 CGIBINDIR	= ${CGIBINDIR}
Index: Makefile
===================================================================
RCS file: /home/cvs/mandoc/mandoc/Makefile,v
retrieving revision 1.542
retrieving revision 1.543
diff -LMakefile -LMakefile -u -p -r1.542 -r1.543
--- Makefile
+++ Makefile
@@ -416,6 +416,7 @@ base-install: mandoc demandoc soelim
 	mkdir -p $(DESTDIR)$(MANDIR)/man5
 	mkdir -p $(DESTDIR)$(MANDIR)/man7
 	mkdir -p $(DESTDIR)$(MANDIR)/man8
+	mkdir -p $(DESTDIR)$(MISCDIR)
 	$(INSTALL_PROGRAM) mandoc demandoc $(DESTDIR)$(BINDIR)
 	$(INSTALL_PROGRAM) soelim $(DESTDIR)$(BINDIR)/$(BINM_SOELIM)
 	cd $(DESTDIR)$(BINDIR) && $(LN) mandoc $(BINM_MAN)
@@ -438,6 +439,7 @@ base-install: mandoc demandoc soelim
 	$(INSTALL_MAN) mandoc_char.7 $(DESTDIR)$(MANDIR)/man7
 	$(INSTALL_MAN) makewhatis.8 \
 		$(DESTDIR)$(MANDIR)/man8/$(BINM_MAKEWHATIS).8
+	$(INSTALL_DATA) mandoc.css $(DESTDIR)$(MISCDIR)
 
 lib-install: libmandoc.a
 	mkdir -p $(DESTDIR)$(LIBDIR)
Index: configure.local.example
===================================================================
RCS file: /home/cvs/mandoc/mandoc/configure.local.example,v
retrieving revision 1.44
retrieving revision 1.45
diff -Lconfigure.local.example -Lconfigure.local.example -u -p -r1.44 -r1.45
--- configure.local.example
+++ configure.local.example
@@ -108,11 +108,15 @@ OSNAME="OpenBSD 7.0"
 # there is no need to copy the whole block.
 # Even if you set PREFIX to something else, the other variables
 # pick it up without copying them all over.
+# MISCDIR is only used for installing the file mandoc.css.
+# That is important because users of "mandoc -T html" often need it
+# even if they are not using man.cgi(8), see mandoc(1) for details.
 
 PREFIX="/usr/local"
 BINDIR="${PREFIX}/bin"
 SBINDIR="${PREFIX}/sbin"
 MANDIR="${PREFIX}/man"
+MISCDIR="${PREFIX}/share/misc"
 
 # If BINDIR and SBINDIR are not subdirectories of the same parent
 # directory or if the basename(1) of BINDIR differs from "bin",
--
 To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv

Re: MR macro (from GNU groff)

2023-10-18T21:54:20Z

[-- Attachment #1: Type: text/plain, Size: 1121 bytes --]

Hi Ingo,

On Wed, Oct 18, 2023 at 11:31:48PM +0200, Ingo Schwarze wrote:
> Hi Alejandro,
> 
> Alejandro Colomar wrote on Wed, Oct 18, 2023 at 10:09:56PM +0200:
> 
> > As you know, I have plans to start using MR from GNU groff.  But doing
> > so without wide support from distros and formatters would be stupid.
> [...]
> > Do you have plans to add support for the MR macro?
> 
> Yes, it will be added.  It shouldn't bee too difficult because it's
> essentially a clone of mdoc(7) .Xr.  Also, it is somewhat urgent
> because the groff manual pages already use it, so i have to
> support it in mandoc(1) before i can update the OpenBSD groff port
> to 1.23.

Thanks!

> 
> The MF string register will be ignored, though.  mandoc(1) will
> always set the topic (or to use a more precise term, the *name*)
> of the manual page linked to in the R font.

Sounds reasonable to me, for consistency with mandoc(1)'s formatting of
mdoc(7).

Cheers,
Alex

> 
> Yours,
>   Ingo
> --
>  To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv
> 

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

Re: MR macro (from GNU groff)

2023-10-18T21:31:53Z

Hi Alejandro,

Alejandro Colomar wrote on Wed, Oct 18, 2023 at 10:09:56PM +0200:

> As you know, I have plans to start using MR from GNU groff.  But doing
> so without wide support from distros and formatters would be stupid.
[...]
> Do you have plans to add support for the MR macro?

Yes, it will be added.  It shouldn't bee too difficult because it's
essentially a clone of mdoc(7) .Xr.  Also, it is somewhat urgent
because the groff manual pages already use it, so i have to
support it in mandoc(1) before i can update the OpenBSD groff port
to 1.23.

The MF string register will be ignored, though.  mandoc(1) will
always set the topic (or to use a more precise term, the *name*)
of the manual page linked to in the R font.

Yours,
  Ingo
--
 To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv