From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from scc-mailout.scc.kit.edu (scc-mailout.scc.kit.edu [129.13.185.202])
	by krisdoz.my.domain (8.14.3/8.14.3) with ESMTP id p5L0BHeb012689
	for <tech@mdocml.bsd.lv>; Mon, 20 Jun 2011 20:11:20 -0400 (EDT)
Received: from hekate.usta.de (asta-nat.asta.uni-karlsruhe.de [172.22.63.82])
	by scc-mailout-02.scc.kit.edu with esmtp (Exim 4.72 #1)
	id 1QYoYl-0007j4-Gu; Tue, 21 Jun 2011 02:11:15 +0200
Received: from donnerwolke.usta.de ([172.24.96.3])
	by hekate.usta.de with esmtp (Exim 4.72)
	(envelope-from <schwarze@usta.de>)
	id 1QYoYl-0005tK-La
	for tech@mdocml.bsd.lv; Tue, 21 Jun 2011 02:11:15 +0200
Received: from iris.usta.de ([172.24.96.5] helo=usta.de)
	by donnerwolke.usta.de with esmtp (Exim 4.69)
	(envelope-from <schwarze@usta.de>)
	id 1QYoYl-0002hw-J5
	for tech@mdocml.bsd.lv; Tue, 21 Jun 2011 02:11:15 +0200
Received: from schwarze by usta.de with local (Exim 4.72)
	(envelope-from <schwarze@usta.de>)
	id 1QYoYl-0005og-IJ
	for tech@mdocml.bsd.lv; Tue, 21 Jun 2011 02:11:15 +0200
Date: Tue, 21 Jun 2011 02:11:15 +0200
From: Ingo Schwarze <schwarze@usta.de>
To: tech@mdocml.bsd.lv
Subject: Re: Perfect hashes.
Message-ID: <20110621001115.GC27716@iris.usta.de>
References: <4DDD69EF.3050008@bsd.lv>
X-Mailinglist: mdocml-tech
Reply-To: tech@mdocml.bsd.lv
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4DDD69EF.3050008@bsd.lv>
User-Agent: Mutt/1.5.21 (2010-09-15)

Hi Kristaps,

uh, sorry, i slacked on this one.

Kristaps Dzonsons wrote on Wed, May 25, 2011 at 10:43:27PM +0200:

[...]
> I'm reviving this conversation because I'd like a uniform interface
> for the roff, mdoc, man, chars, and mdoc-args tables.  For the time
> being, each of these has its own method, which is error-prone, ugly,
> and bloated.

I fully agree with that.

> Perfect hashes make sense because the above data is fixed.

I'm not so sure about that.  At least in roff.c, we have
user-defined macros.  Currently, we are handling these with
a linear search.  That's absolutely fine for now, but in the
long run, it may not be ideal for documents making heavy use
of them.  Besides, in traditional roff, user-defined macros can
override built-in requests.  Hence, i'm not even sure roff
request tables will always remain constant.

> Furthermore, they have no cost of allocation and have little
> overhead in terms of unused table-space.

Fair enough.

> The speed-up is a bit more than 1%

Wow.  *That's* impressive!   =8-0

> I'm leaning toward implementing similar steps for all existing hash
> files.  Does anybody have any objections?

Not really strong objections, but maybe you want to consider whether
that isn't premature optimization.  I suspect these tables will
still remain variable, maybe some might, at some point, be merged
with others or even get dynamical entries.  Maybe not, but who
knows.  Even though the code base is not newish any longer, the
design is still not set in stone, and there are various ideas
floating around.

So isn't simplicity and flexibility still more important than
fine-tuning and optimization?

> My intent in this isn't so much performance as uniform interfacing,

I strongly agree with that.

> code readability,

Hence:

> +  static const unsigned char asso_values[] =
> +    {
> +      248, 248, 248, 248, 248, 248, 248, 248, 248, 248,
> +      248, 248, 248, 248, 248, 248, 248, 248, 248, 248,
> +      248, 248, 248, 248, 248, 248, 248, 248, 248, 248,
> +      248, 248, 248, 248, 248, 248, 248, 120, 248, 248,
> +      248, 248, 248, 248, 248, 248, 248, 248, 248,  19,
> +      248, 248, 248, 248, 248, 248, 248, 248, 248, 248,
> +      248, 248, 248, 248, 248,  70,  20, 115,   0,  17,
> +        2, 248, 125,  92,  60, 248,  87,   4,  77, 105,
> +       67,  42, 112,  40, 125,  85,  97, 248,   4, 248,
> +      248, 248, 248, 248, 248, 248, 248,  12,  29,   5,
> +      125, 110, 102, 248,  50,  90, 248,  24,  32, 115,
> +       90,   0,  80,  15, 100,  95,  60, 248,  37, 248,
> +       10,  25, 248, 248, 248, 248, 248, 248
> +    };

You must be joking, Mr. Dzonsons!  ;-)

There are two problems with the above table:
It is rather obfuscated code.  The art of programming is to write
code that is good enough for the computer, but above all easily
intelligible (and, if possible, pleasing) to humans.

And even worse, that code can't be edited manually.
So what do i do when i need to add a macro for my next ugly
hack (you know, the one you will rip out again a few months
later)?  First muck around with gperf?  Than commit a
hundred-line diff just for mdoc_hash.c?  That's gross.

For both reasons, i dislike automatically generated code.
Personally, i like to stay away from autogenerated code.
When it's not really needed, why bother with it?
When the chosen aproach really requires it, for whatever
reason, then it was the wrong approach in the first place.

I have seen much worse instances of code generation and obfuscation,
so i'm not that strongly opposed, but i'm not at all convinced either.

> a good hash for the roff symbols (ds/de) is also under
> investigation.  Any suggestions---not dbopen---on a hash table
> for all Unices?

Currently, anything is good enough, even the linear search...
For inserting and deleting at run time, tree structures
may (or may not) be another option.

I'd say just choose the simplest and most flexible algorithm
and use it everywhere, consistently.  Then wait a bit whether
any design decisions need to be made - we still have those
unsolved issues with roff macros in man and mdoc, and escape
sequences that are functionally equivalent to macros but do
not enter the ASTs, and so on and so forth.  And those issues
are not even the most pressing ones, currently, if you ask
me, but they are certainly more pressing than hash optimizations.

Currently, improving makewhatis to handle as many macros as
possible and getting full text search with that tool would
probably bring us more benefit than optimizing the de/ds search
algorithm.  Oh, and an eqn parser and renderer would be great,
too, X11 hackers would love us for that.

Cleaning up the hashes is fine as well, but i'd say, for now,
use "simplicity" and "uniformity", and maybe a bit of
flexibility, as the only design goals.

Yours,
  Ingo
--
 To unsubscribe send an email to tech+unsubscribe@mdocml.bsd.lv