Date: Thu, 8 Dec 2011 02:10:01 +0100
From: Ingo Schwarze
To: tech@mdocml.bsd.lv
Subject: Re: Improve catman mandocdb(8) heuristics.
Message-ID: <20111208011001.GC19643@iris.usta.de>
References: <4EDF7EB4.7040906@bsd.lv>
X-Mailinglist: mdocml-tech
Reply-To: tech@mdocml.bsd.lv
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4EDF7EB4.7040906@bsd.lv>
User-Agent: Mutt/1.5.21 (2010-09-15)

Hi Kristaps,

Kristaps Dzonsons wrote on Wed, Dec 07, 2011 at 03:56:52PM +0100:

> Enclosed is a patch to de-backspace Nm/Nd lines for mandocdb(8).

I think that makes sense.

> This arose from seeing the results for some LAPACK manuals, which
> are notoriously shitty.  It also cleans up handling of the
> non-terminated string a bit

That part seems good, too.

> and adds a quick check to see if SYNOPSIS has been reached right
> after the NAME.  This occurs when manuals look like this:
>
> NAME
> SYNOPSIS
>   Blah blah blah
>
> Again, LAPACK...

I don't think i like that, it seems too specific, in particular
looking for the exact string "SYNOPSIS".  Perhaps we should move
this check before stripping out the backspace encoding and do
just this:

	line = fgetln(stream, &len);
	if (NULL == line || ' ' != line[0] || '\n' != line[len-1]) {
		buf_appendb(dbuf, buf->cp, buf->size);
		hash_put(hash, buf, TYPE_Nd);
		fclose(stream);
		return;
	}
	fclose(stream);
	line[--len] = '\0';

That removes a lot of duplicate code and may even be better
heuristics.  ANY section header right after the first one is
fundamentally unusable, not just SYNOPSIS.

Note that your "} else if (0 == len) {" can be dropped as well
if we take that route.

> If it's relevant, a check for NAME could also occur, then loop back
> into the fgets().  Thoughts?

If there is another section before NAME, do we really want to
use the content of the NAME section?  I'd say just use the first
section.

Experience with the current makewhatis(8) tells me that being
clever and trying hard buys us very little: all reasonable pages
do not need clever tricks, so at best that finds a bit of additional
information from a small number of botched pages, i.e. most probably
not the most valuable info, and not the largest amount.

Being very resilient buys us more: Never complain, always return
something at least semi-useful.  The biggest problem with the current
tool is that being so clever, it easily gets really badly confused,
and then it starts complaining loudly.

Yes, we should implement -t (test mode) later on, to help porters
spot pages with broken NAME sections.  But it is very important
to be absolutely silent and not too clever in production mode.

> This area still needs a bit more attention to handle situations like:
>
> foo -[\n]?
> [whitespace]foo - bar[whitespace][\n]?

Oh well, i think delivering just

  foo(1) -
  foo(1) - bar

is good enough in these two cases, stripping trailing whitespace
and falling back to "foo(1) - foo" would be a bit more fancy in
the first case, probably worthwhile, but hardly critical.

One thing that i want to do is read through the current
Makewhatis::Formated and see which features should be ported
and which are better done in a simpler way.  Of course, i won't
complain if somebody beats me to it.

Yours,
  Ingo
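
For illustration only, here is a minimal sketch of the "any section header right after the first one is unusable" check discussed above, combined with stripping surrounding whitespace and falling back to the page title. It is not the mandocdb(8) code. The names line_is_body() and pick_desc() are hypothetical, and it works on a plain buffer rather than fgetln(3) so that it stays portable. It assumes that the fallback in the snippet above stores the page title as the description, and it does not try to split "name - description".

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from mandocdb(8).
 * Returns 1 if the line after the first section header looks like
 * indented body text: it exists, starts with a blank, and ends in
 * a newline.  A header such as "SYNOPSIS" starts in column 0 and fails.
 */
int
line_is_body(const char *line, size_t len)
{
	return line != NULL && len > 0 &&
	    line[0] == ' ' && line[len - 1] == '\n';
}

/*
 * Hypothetical helper: write a description into out (NUL-terminated,
 * truncated to outsz - 1 bytes).  Uses the body line with the newline
 * and surrounding blanks removed; falls back to title (assumed to be
 * non-NULL) when the line is unusable or nothing but whitespace.
 */
void
pick_desc(const char *line, size_t len, const char *title,
    char *out, size_t outsz)
{
	size_t start = 0;

	if (outsz == 0)
		return;

	if (line_is_body(line, len)) {
		len--;	/* drop the trailing newline */
		while (len > 0 &&
		    (line[len - 1] == ' ' || line[len - 1] == '\t'))
			len--;
		while (start < len &&
		    (line[start] == ' ' || line[start] == '\t'))
			start++;
	} else
		len = 0;	/* unusable line: force the fallback */

	if (start < len) {
		line += start;
		len -= start;
	} else {
		line = title;
		len = strlen(title);
	}

	if (len >= outsz)
		len = outsz - 1;
	memcpy(out, line, len);
	out[len] = '\0';
}
```

The sketch copies the text into a caller-supplied buffer on purpose: a pointer returned by fgetln(3) is not NUL-terminated and should not be relied on after further operations on the stream, so anything needed later is best copied before fclose().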