tech@mandoc.bsd.lv
From: Ingo Schwarze <schwarze@usta.de>
To: tech@mdocml.bsd.lv
Subject: Re: Improve catman mandocdb(8) heuristics.
Date: Thu, 8 Dec 2011 02:10:01 +0100
Message-ID: <20111208011001.GC19643@iris.usta.de>
In-Reply-To: <4EDF7EB4.7040906@bsd.lv>

Hi Kristaps,

Kristaps Dzonsons wrote on Wed, Dec 07, 2011 at 03:56:52PM +0100:

> Enclosed is a patch to de-backspace Nm/Nd lines for mandocdb(8).

I think that makes sense.
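
For readers who have not seen backspace encoding in catpages, here is
a minimal sketch of what de-backspacing amounts to.  It is not the
code from the patch; strip_backspace() is a hypothetical helper, and
it assumes the simple rule that each backspace cancels the byte kept
just before it.

```c
#include <stddef.h>

/*
 * Remove backspace (overstrike) encoding in place: each '\b'
 * cancels the byte kept just before it, so "N\bNA\bA" becomes
 * "NA" and "_\bf" becomes "f".
 * Returns the new length; the result is not NUL-terminated.
 * Hypothetical helper, not the code from the patch.
 */
static size_t
strip_backspace(char *buf, size_t len)
{
	size_t	 in, out;

	for (in = out = 0; in < len; in++) {
		if ('\b' == buf[in]) {
			/* Backspace erases the previously kept byte. */
			if (out > 0)
				out--;
			continue;
		}
		buf[out++] = buf[in];
	}
	return out;
}
```

Working in place keeps the helper allocation-free, which suits a
buffer that is about to be hashed anyway.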

> This arose from seeing the results for some LAPACK manuals, which
> are notoriously shitty.  It also cleans up handling of the
> non-terminated string a bit

That part seems good, too.

> and adds a quick check to see if SYNOPSIS has been reached right
> after the NAME.  This occurs when manuals look like this:
> 
>  NAME
>  SYNOPSIS
>    Blah blah blah
> 
> Again, LAPACK...

I don't think i like that; it seems too specific, in particular
looking for the exact string "SYNOPSIS".

Perhaps we should move this check before stripping out the backspace
encoding and do just this:

	line = fgetln(stream, &len);
	if (NULL == line || ' ' != line[0] || '\n' != line[len-1]) {
		buf_appendb(dbuf, buf->cp, buf->size);
		hash_put(hash, buf, TYPE_Nd);
		fclose(stream);
		return;
	}
	fclose(stream);
	line[--len] = '\0';

That removes a lot of duplicate code and may even be a better
heuristic.  A page with ANY section header right after the first
one is fundamentally unusable, not just one with SYNOPSIS.
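
The test in that snippet can be isolated as a small predicate.  This
is only a sketch: is_section_body() is a hypothetical name, and it
assumes that section content is always indented by at least one
blank while section headers start in column one.

```c
#include <stddef.h>

/*
 * Return 1 if the line following the first section header looks
 * like section content: it must start with a blank and end with
 * a newline.  A NULL line (EOF or read error) or an unindented
 * line (another section header) returns 0, in which case the
 * caller falls back to the title.
 * Hypothetical helper mirroring the condition in the snippet.
 */
static int
is_section_body(const char *line, size_t len)
{
	if (NULL == line || 0 == len)
		return 0;
	return ' ' == line[0] && '\n' == line[len - 1];
}
```

One caveat for whoever turns this into real code: the buffer returned
by fgetln(3) belongs to the stream and becomes invalid once the
stream is closed, so the line should be copied before fclose().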

Note that your "} else if (0 == len) {" can be dropped as well
if we take that route.

> If it's relevant, a check for NAME could also occur, then loop back
> into the fgets().  Thoughts?

If there is another section before NAME, do we really want to use
the content of the NAME section?  I'd say just use the first
section.

Experience with the current makewhatis(8) tells me that being
clever and trying hard buys us very little: all reasonable pages
do not need clever tricks, so at best that finds a bit of additional
information from a small number of botched pages, i.e. most
probably not the most valuable info, and not the largest amount.

Being very resilient buys us more:  Never complain, always return
something at least semi-useful.  The biggest problem with the
current tool is that being so clever, it easily gets really badly
confused, and then it starts complaining loudly.

Yes, we should implement -t (test mode) later on, to help porters
spot pages with broken NAME sections.  But it is very important
to be absolutely silent and not too clever in production mode.

> This area still needs a bit more attention to handle situations like:
> 
>  foo -[\n]?
>  [whitespace]foo - bar[whitespace][\n]?

Oh well, i think delivering just

  foo(1) -
  foo(1) - bar

is good enough in these two cases.  Stripping trailing whitespace
and falling back to "foo(1) - foo" in the first case would be a bit
fancier and probably worthwhile, but hardly critical.

One thing that i want to do is read through the current
Makewhatis::Formated and see which features should be ported
and which are better done in a simpler way.  Of course,
i won't complain if somebody beats me to it.

Yours,
  Ingo
--
 To unsubscribe send an email to tech+unsubscribe@mdocml.bsd.lv
