tech@mandoc.bsd.lv
* Improve catman mandocdb(8) heuristics.
@ 2011-12-07 14:56 Kristaps Dzonsons
  2011-12-08  1:10 ` Ingo Schwarze
From: Kristaps Dzonsons @ 2011-12-07 14:56 UTC (permalink / raw)
  To: tech

[-- Attachment #1: Type: text/plain, Size: 763 bytes --]

Hi,

Enclosed is a patch to de-backspace Nm/Nd lines for mandocdb(8).  This 
arose from seeing the results for some LAPACK manuals, which are 
notoriously shitty.  It also cleans up handling of the non-terminated 
string a bit and adds a quick check to see if SYNOPSIS has been reached 
right after the NAME.  This occurs when manuals look like this:

  NAME
  SYNOPSIS
    Blah blah blah

Again, LAPACK...
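
As a rough illustration of the de-backspacing step (this is not the code
from the patch): nroff output marks bold and underlined text by
overstriking, so "N\bN" or "_\bN" both stand for a single "N".  The
sketch below is a single-pass variant.  The function name
strip_backspace is made up for this example, and unlike the patch it
also drops the byte preceding a trailing backspace.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Remove backspace overstrike sequences ("x\bx", "_\bx") from a
 * buffer of len bytes that need not be NUL-terminated.
 * Each backspace is dropped together with the byte kept before it;
 * a backspace with nothing before it is dropped alone.
 * Returns the new length; bytes beyond it are left untouched.
 */
static size_t
strip_backspace(char *line, size_t len)
{
	size_t	 in, out;

	assert(line != NULL || len == 0);

	for (in = out = 0; in < len; in++) {
		if (line[in] != '\b')
			line[out++] = line[in];	/* keep ordinary byte */
		else if (out > 0)
			out--;			/* undo previous byte */
	}
	return out;
}
```

Because the read index never falls behind the write index, the buffer is
compacted in place without memmove(3) and without any risk of stepping
outside the array.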

If it's relevant, a check for NAME could also occur, then loop back into 
the fgets().  Thoughts?

This area still needs a bit more attention to handle situations like:

  foo -[\n]?
  [whitespace]foo - bar[whitespace][\n]?

I need to check this over a bit more carefully to make sure I'm not 
trampling past the end of the array, but this is a start.

Best,

Kristaps

[-- Attachment #2: patch.txt --]
[-- Type: text/plain, Size: 1954 bytes --]

Index: mandocdb.c
===================================================================
RCS file: /usr/vhosts/mdocml.bsd.lv/cvs/mdocml/mandocdb.c,v
retrieving revision 1.25
diff -u -p -r1.25 mandocdb.c
--- mandocdb.c	7 Dec 2011 01:57:20 -0000	1.25
+++ mandocdb.c	7 Dec 2011 14:55:39 -0000
@@ -1301,31 +1301,69 @@ pformatted(DB *hash, struct buf *buf, st
 	}
 	fclose(stream);
 
+	/* 
+	 * Strip out backspace-encoding.
+	 * Also handle the bogus case where the backspace is malformed
+	 * at the beginning or end of the line.
+	 */
+
+	while (NULL != (p = memchr(line, '\b', len))) {
+		plen = p - line;
+		if (plen == --len)
+			continue;
+		if (plen > 0) {
+			memmove(p - 1, p + 1, len - plen);
+			len--;
+		} else
+			memmove(p, p + 1, len);
+	}
+
+	/*
+	 * Check if there's no name/description information.  This
+	 * happens with some manuals, e.g., LAPACK.  If not, reuse our
+	 * title.
+	 */
+
+	if (len > 0 && '\n' == line[len - 1]) {
+		line[--len] = '\0';
+		if (0 == strcmp(line, "SYNOPSIS")) {
+			buf_appendb(dbuf, buf->cp, buf->size);
+			hash_put(hash, buf, TYPE_Nd);
+			return;
+		}
+	} else if (0 == len) {
+		buf_appendb(dbuf, buf->cp, buf->size);
+		hash_put(hash, buf, TYPE_Nd);
+		return;
+	}
+
 	/*
 	 * If there is a dash, skip to the text following it.
 	 */
 
-	for (p = line, plen = len; plen; p++, plen--)
-		if ('-' == *p)
-			break;
+	p = memchr(line, '-', len);
+	plen = NULL == p ? 0 : len - (p - line);
+
 	for ( ; plen; p++, plen--)
-		if ('-' != *p && ' ' != *p && 8 != *p)
+		if ('-' != *p && ' ' != *p)
 			break;
-	if (0 == plen) {
-		p = line;
-		plen = len;
-	}
 
 	/*
 	 * Copy the rest of the line, but no more than 70 bytes.
 	 */
 
-	if (70 < plen)
+	if (0 == plen) {
+		p = line;
+		plen = len;
+	} else if (70 < plen)
 		plen = 70;
-	p[plen-1] = '\0';
+
 	buf_appendb(dbuf, p, plen);
+	buf_appendb(dbuf, "", 1);
+
 	buf->len = 0;
 	buf_appendb(buf, p, plen);
+	buf_appendb(buf, "", 1);
 	hash_put(hash, buf, TYPE_Nd);
 }
 

* Re: Improve catman mandocdb(8) heuristics.
  2011-12-07 14:56 Improve catman mandocdb(8) heuristics Kristaps Dzonsons
@ 2011-12-08  1:10 ` Ingo Schwarze
From: Ingo Schwarze @ 2011-12-08  1:10 UTC (permalink / raw)
  To: tech

Hi Kristaps,

Kristaps Dzonsons wrote on Wed, Dec 07, 2011 at 03:56:52PM +0100:

> Enclosed is a patch to de-backspace Nm/Nd lines for mandocdb(8).

I think that makes sense.

> This arose from seeing the results for some LAPACK manuals, which
> are notoriously shitty.  It also cleans up handling of the
> non-terminated string a bit

That part seems good, too.

> and adds a quick check to see if SYNOPSIS has been reached right
> after the NAME.  This occurs when manuals look like this:
> 
>  NAME
>  SYNOPSIS
>    Blah blah blah
> 
> Again, LAPACK...

I don't think I like that: it seems too specific, in particular
looking for the exact string "SYNOPSIS".

Perhaps we should move this check before stripping out the backspace
encoding and do just this:

	line = fgetln(stream, &len);
	if (NULL == line || ' ' != line[0] || '\n' != line[len-1]) {
		buf_appendb(dbuf, buf->cp, buf->size);
		hash_put(hash, buf, TYPE_Nd);
		fclose(stream);
		return;
	}
	fclose(stream);
	line[--len] = '\0';

That removes a lot of duplicate code and may even be a better
heuristic.  ANY section header directly after the first one
makes the NAME section fundamentally unusable, not just SYNOPSIS.

Note that your "} else if (0 == len) {" can be dropped as well
if we take that route.
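
To make the proposed condition easier to test in isolation, it can be
pulled out into a small predicate.  This is only a sketch: the name
usable_name_line is hypothetical, and it assumes the fgetln(3) contract
that a non-NULL result always has a length of at least one byte.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decide whether the line following the first section header of a
 * preformatted page can serve as a description.  line holds len
 * bytes as returned by fgetln(3): not NUL-terminated, and ending in
 * a newline unless the file ended without one.
 */
static int
usable_name_line(const char *line, size_t len)
{
	/* fgetln(3) never returns an empty non-NULL line. */
	assert(line == NULL || len > 0);

	/* End of file or read error right after the header. */
	if (line == NULL)
		return 0;
	/* Section headers start in column 0; body text is indented. */
	if (line[0] != ' ')
		return 0;
	/* No trailing newline: a truncated last line. */
	if (line[len - 1] != '\n')
		return 0;
	return 1;
}
```

Running this check before the backspace stripping still works for
overstruck headers, since a bold "S\bSY\bY..." begins with 'S' rather
than a blank.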

> If it's relevant, a check for NAME could also occur, then loop back
> into the fgets().  Thoughts?

If there is another section before NAME, do we really want to use
the content of the NAME section?  I'd say just use the first
section.

Experience with the current makewhatis(8) tells me that being
clever and trying hard buys us very little: reasonable pages
need no clever tricks, so at best cleverness recovers a bit of
additional information from a small number of botched pages, i.e.
most probably neither the most valuable info nor the largest amount.

Being very resilient buys us more:  Never complain, always return
something at least semi-useful.  The biggest problem with the
current tool is that being so clever, it easily gets really badly
confused, and then it starts complaining loudly.

Yes, we should implement -t (test mode) later on, to help porters
spot pages with broken NAME sections.  But it is very important
to be absolutely silent and not too clever in production mode.

> This area still needs a bit more attention to handle situations like:
> 
>  foo -[\n]?
>  [whitespace]foo - bar[whitespace][\n]?

Oh well, I think delivering just

  foo(1) -
  foo(1) - bar

is good enough in these two cases.  Stripping trailing whitespace,
and falling back to "foo(1) - foo" in the first case, would be a bit
fancier; probably worthwhile, but hardly critical.
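
A minimal sketch of that fancier variant follows, assuming the line has
already been de-backspaced.  The helper name name_desc and its fallback
parameter are invented for this example; the 70-byte cap and the "first
dash wins" rule are taken from the patch, so a hyphenated name is still
split at its first hyphen.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Locate the description in a NAME line such as "  foo - bar  \n".
 * line holds len bytes and need not be NUL-terminated.  Returns a
 * pointer to the description and stores its length in *dlen.  If
 * nothing usable is left, return the fallback string instead (for
 * example the page title).
 */
static const char *
name_desc(const char *line, size_t len, const char *fallback, size_t *dlen)
{
	const char	*p, *end;

	assert(line != NULL && fallback != NULL && dlen != NULL);

	/* Strip trailing whitespace, including the newline. */
	end = line + len;
	while (end > line && isspace((unsigned char)end[-1]))
		end--;

	/* memchr(3) returns NULL without a dash: test before use. */
	p = memchr(line, '-', (size_t)(end - line));
	if (p == NULL) {
		/* No dash: use the whole line minus leading blanks. */
		for (p = line; p < end && *p == ' '; p++)
			continue;
	} else {
		/* Skip the dash(es) and the blanks that follow. */
		while (p < end && (*p == '-' || *p == ' '))
			p++;
	}

	if (p == end) {
		*dlen = strlen(fallback);
		return fallback;
	}

	/* No more than 70 bytes, as in the patch. */
	*dlen = (size_t)(end - p);
	if (*dlen > 70)
		*dlen = 70;
	return p;
}
```

The result is a pointer plus a length rather than a NUL-terminated
string, which matches how the patch hands data to buf_appendb().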

One thing that I want to do is read through the current
Makewhatis::Formated and see which features should be ported
and which are better done in a simpler way.  Of course,
I won't complain if somebody beats me to it.

Yours,
  Ingo