From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from smtp-1.sys.kth.se (smtp-1.sys.kth.se [130.237.32.175])
	by krisdoz.my.domain (8.14.3/8.14.3) with ESMTP id pB7Ev1jX002216
	for ; Wed, 7 Dec 2011 09:57:02 -0500 (EST)
Received: from mailscan-1.sys.kth.se (mailscan-1.sys.kth.se [130.237.32.91])
	by smtp-1.sys.kth.se (Postfix) with ESMTP id D31A4156B62
	for ; Wed, 7 Dec 2011 15:56:55 +0100 (CET)
X-Virus-Scanned: by amavisd-new at kth.se
Received: from smtp-1.sys.kth.se ([130.237.32.175])
	by mailscan-1.sys.kth.se (mailscan-1.sys.kth.se [130.237.32.91]) (amavisd-new, port 10024)
	with LMTP id dV8+mx5sRmyx for ;
	Wed, 7 Dec 2011 15:56:54 +0100 (CET)
X-KTH-Auth: kristaps [193.10.49.5]
X-KTH-mail-from: kristaps@bsd.lv
X-KTH-rcpt-to: tech@mdocml.bsd.lv
Received: from ctime.hhs.se (ctime.hhs.se [193.10.49.5])
	by smtp-1.sys.kth.se (Postfix) with ESMTP id 660821551FC
	for ; Wed, 7 Dec 2011 15:56:53 +0100 (CET)
Message-ID: <4EDF7EB4.7040906@bsd.lv>
Date: Wed, 07 Dec 2011 15:56:52 +0100
From: Kristaps Dzonsons
User-Agent: Mozilla/5.0 (X11; OpenBSD amd64; rv:5.0) Gecko/20110805 Thunderbird/5.0
X-Mailinglist: mdocml-tech
Reply-To: tech@mdocml.bsd.lv
MIME-Version: 1.0
To: tech@mdocml.bsd.lv
Subject: Improve catman mandocdb(8) heuristics.
Content-Type: multipart/mixed;
	boundary="------------040306090009020504030203"

This is a multi-part message in MIME format.
--------------040306090009020504030203
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Hi,

Enclosed is a patch to de-backspace Nm/Nd lines for mandocdb(8).  This
arose from seeing the results for some LAPACK manuals, which are
notoriously shitty.

It also cleans up handling of the non-terminated string a bit and adds a
quick check to see if SYNOPSIS has been reached right after the NAME.
This occurs when manuals look like this:

NAME
SYNOPSIS
Blah blah blah

Again, LAPACK...  If it's relevant, a check for NAME could also occur,
then loop back into the fgets().

Thoughts?

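To make the de-backspacing concrete, here is a minimal standalone sketch of the same idea. It assumes the caller holds the line and its byte length separately, as the patch does; the name strip_backspaces is hypothetical and does not appear in mandocdb.c.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of mandocdb.c): strip backspace
 * overstrike encoding from the first len bytes of line, in place.
 * Each "X\bY" pair collapses to "Y"; a backspace with nothing before
 * it, or at the very end of the line, is simply dropped.
 * Returns the new length.  The buffer is not NUL-terminated here.
 */
size_t
strip_backspaces(char *line, size_t len)
{
	char	*p;
	size_t	 off;

	while (NULL != (p = memchr(line, '\b', len))) {
		off = (size_t)(p - line);
		len--;			/* the backspace itself goes away */
		if (off == len)
			continue;	/* trailing backspace: just shorten */
		if (off > 0) {
			/* Drop the overstruck byte along with the backspace. */
			memmove(p - 1, p + 1, len - off);
			len--;
		} else {
			/* Leading backspace: nothing before it to drop. */
			memmove(p, p + 1, len);
		}
	}
	return len;
}
```

Called on the twelve bytes of "N\bNA\bAM\bME\bE" (bold overstrike), it leaves "NAME" in the first four bytes and returns 4; underline encoding such as "_\bf_\bo_\bo" reduces to "foo" the same way.
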
This area still needs a bit more attention to handle situations like:

foo -[\n]?
[whitespace]foo - bar[whitespace][\n]?

I need to check this over a bit more carefully to see if I'm not
trampling past the array, but this is a start.

Best,

Kristaps

--------------040306090009020504030203
Content-Type: text/plain;
	name="patch.txt"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
	filename="patch.txt"

Index: mandocdb.c
===================================================================
RCS file: /usr/vhosts/mdocml.bsd.lv/cvs/mdocml/mandocdb.c,v
retrieving revision 1.25
diff -u -p -r1.25 mandocdb.c
--- mandocdb.c	7 Dec 2011 01:57:20 -0000	1.25
+++ mandocdb.c	7 Dec 2011 14:55:39 -0000
@@ -1301,31 +1301,69 @@ pformatted(DB *hash, struct buf *buf, st
 	}
 	fclose(stream);
 
+	/*
+	 * Strip out backspace-encoding.
+	 * Also handle the bogus case where the backspace is malformed
+	 * at the beginning or end of the line.
+	 */
+
+	while (NULL != (p = memchr(line, '\b', len))) {
+		plen = p - line;
+		if (plen == --len)
+			continue;
+		if (plen > 0) {
+			memmove(p - 1, p + 1, len - plen);
+			len--;
+		} else
+			memmove(p, p + 1, len);
+	}
+
+	/*
+	 * Check if there's no name/description information.  This
+	 * happens with some manuals, e.g., LAPACK.  If not, reuse our
+	 * title.
+	 */
+
+	if (len > 0 && '\n' == line[len - 1]) {
+		line[--len] = '\0';
+		if (0 == strcmp(line, "SYNOPSIS")) {
+			buf_appendb(dbuf, buf->cp, buf->size);
+			hash_put(hash, buf, TYPE_Nd);
+			return;
+		}
+	} else if (0 == len) {
+		buf_appendb(dbuf, buf->cp, buf->size);
+		hash_put(hash, buf, TYPE_Nd);
+		return;
+	}
+
 	/*
 	 * If there is a dash, skip to the text following it.
 	 */
 
-	for (p = line, plen = len; plen; p++, plen--)
-		if ('-' == *p)
-			break;
+	p = memchr(line, '-', len);
+	plen = len - (p - line);
+
 	for ( ; plen; p++, plen--)
-		if ('-' != *p && ' ' != *p && 8 != *p)
+		if ('-' != *p && ' ' != *p)
 			break;
-	if (0 == plen) {
-		p = line;
-		plen = len;
-	}
 
 	/*
 	 * Copy the rest of the line, but no more than 70 bytes.
 	 */
 
-	if (70 < plen)
+	if (0 == plen) {
+		p = line;
+		plen = len;
+	} else if (70 < plen)
 		plen = 70;
-	p[plen-1] = '\0';
+
 	buf_appendb(dbuf, p, plen);
+	buf_appendb(dbuf, "", 1);
+
 	buf->len = 0;
 	buf_appendb(buf, p, plen);
+	buf_appendb(buf, "", 1);
 	hash_put(hash, buf, TYPE_Nd);
 }
 

--------------040306090009020504030203--
--
 To unsubscribe send an email to tech+unsubscribe@mdocml.bsd.lv
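
The open cases mentioned in the message (a dash with nothing after it, and leading or trailing whitespace around "foo - bar") could be handled along the following lines. This is only a sketch under stated assumptions, not the change that was committed: find_desc is a hypothetical helper that does not exist in mandocdb.c, and trimming blanks on both ends is an assumption about the desired behaviour. It also guards the memchr() result, which is NULL when the line has no dash.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of mandocdb.c): locate the description
 * in a "name - description" line of len bytes.  Stores the description
 * length in *outlen and returns a pointer to its start.  Falls back to
 * the whole trimmed line when there is no dash or nothing follows it.
 */
const char *
find_desc(const char *line, size_t len, size_t *outlen)
{
	const char	*p, *start;
	size_t		 plen;

	/* Trim the trailing newline and trailing blanks. */
	while (len > 0 && ('\n' == line[len - 1] ||
	    ' ' == line[len - 1] || '\t' == line[len - 1]))
		len--;

	/* Trim leading blanks. */
	while (len > 0 && (' ' == *line || '\t' == *line)) {
		line++;
		len--;
	}

	start = line;
	plen = len;

	/* memchr() returns NULL without a dash, so check before using it. */
	if (len > 0 && NULL != (p = memchr(line, '-', len))) {
		plen = len - (size_t)(p - line);
		/* Skip the dash run and the blanks after it. */
		for ( ; plen > 0; p++, plen--)
			if ('-' != *p && ' ' != *p)
				break;
		if (plen > 0)
			start = p;
		else
			plen = len;	/* nothing after the dash */
	}

	/* Cap at 70 bytes, as in the patch. */
	if (plen > 70)
		plen = 70;

	*outlen = plen;
	return start;
}
```

Like the patch, this stops at the first dash in the line, so a hyphenated name would be split early; telling a name hyphen from the separator would need a stricter match such as " - ".
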