From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from scc-mailout.scc.kit.edu (scc-mailout.scc.kit.edu [129.13.185.202]) by krisdoz.my.domain (8.14.3/8.14.3) with ESMTP id p5L0BHeb012689 for ; Mon, 20 Jun 2011 20:11:20 -0400 (EDT) Received: from hekate.usta.de (asta-nat.asta.uni-karlsruhe.de [172.22.63.82]) by scc-mailout-02.scc.kit.edu with esmtp (Exim 4.72 #1) id 1QYoYl-0007j4-Gu; Tue, 21 Jun 2011 02:11:15 +0200 Received: from donnerwolke.usta.de ([172.24.96.3]) by hekate.usta.de with esmtp (Exim 4.72) (envelope-from ) id 1QYoYl-0005tK-La for tech@mdocml.bsd.lv; Tue, 21 Jun 2011 02:11:15 +0200 Received: from iris.usta.de ([172.24.96.5] helo=usta.de) by donnerwolke.usta.de with esmtp (Exim 4.69) (envelope-from ) id 1QYoYl-0002hw-J5 for tech@mdocml.bsd.lv; Tue, 21 Jun 2011 02:11:15 +0200 Received: from schwarze by usta.de with local (Exim 4.72) (envelope-from ) id 1QYoYl-0005og-IJ for tech@mdocml.bsd.lv; Tue, 21 Jun 2011 02:11:15 +0200 Date: Tue, 21 Jun 2011 02:11:15 +0200 From: Ingo Schwarze To: tech@mdocml.bsd.lv Subject: Re: Perfect hashes. Message-ID: <20110621001115.GC27716@iris.usta.de> References: <4DDD69EF.3050008@bsd.lv> X-Mailinglist: mdocml-tech Reply-To: tech@mdocml.bsd.lv MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4DDD69EF.3050008@bsd.lv> User-Agent: Mutt/1.5.21 (2010-09-15) Hi Kristaps, uh, sorry, i slacked on this one. Kristaps Dzonsons wrote on Wed, May 25, 2011 at 10:43:27PM +0200: [...] > I'm reviving this conversation because I'd like a uniform interface > for the roff, mdoc, man, chars, and mdoc-args tables. For the time > being, each of these has its own method, which is error-prone, ugly, > and bloated. I fully agree with that. > Perfect hashes make sense because the above data is fixed. I'm not so sure about that. At least in roff.c, we have user-defined macros. Currently, we are handling these with a linear search. That's absolutely fine for now, but in the long run, it may not be ideal for documents making heavy use of them. Besides, in traditional roff, user-defined macros can override built-in requests. Hence, i'm not even sure roff request tables will always remain constant. > Furthermore, they have no cost of allocation and have little > overhead in terms of unused table-space. Fair enough. > The speed-up is a bit more than 1% Wow. *That's* impressive! =8-0 > I'm leaning toward implementing similar steps for all existing hash > files. Does anybody have any objections? Not really strong objections, but maybe you want to consider whether that isn't premature optimization. I suspect these tables will still remain variable, maybe some might, at some point, be merged with others or even get dynamical entries. Maybe not, but who knows. Even though the code base is not newish any longer, the design is still not set in stone, and there are various ideas floating around. So isn't simplicity and flexibility still more important than fine-tuning and optimization? > My intent in this isn't so much performance as uniform interfacing, I strongly agree with that. > code readability, Hence: > + static const unsigned char asso_values[] = > + { > + 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, > + 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, > + 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, > + 248, 248, 248, 248, 248, 248, 248, 120, 248, 248, > + 248, 248, 248, 248, 248, 248, 248, 248, 248, 19, > + 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, > + 248, 248, 248, 248, 248, 70, 20, 115, 0, 17, > + 2, 248, 125, 92, 60, 248, 87, 4, 77, 105, > + 67, 42, 112, 40, 125, 85, 97, 248, 4, 248, > + 248, 248, 248, 248, 248, 248, 248, 12, 29, 5, > + 125, 110, 102, 248, 50, 90, 248, 24, 32, 115, > + 90, 0, 80, 15, 100, 95, 60, 248, 37, 248, > + 10, 25, 248, 248, 248, 248, 248, 248 > + }; You must be joking, Mr. Dzonsons! ;-) There are two problems with the above table: It is rather obfuscated code. The art of programming is to write code that is good enough for the computer, but above all easily intelligible (and, if possible, pleasing) to humans. And even worse, that code can't be edited manually. So what do i do when i need to add a macro for my next ugly hack (you know, the one you will rip out again a few months later)? First muck around with gperf? Than commit a hundred-line diff just for mdoc_hash.c? That's gross. For both reasons, i dislike automatically generated code. Personally, i like to stay away from autogenerated code. When it's not really needed, why bother with it? When the chosen aproach really requires it, for whatever reason, then it was the wrong approach in the first place. I have seen much worse instances of code generation and obfuscation, so i'm not that strongly opposed, but i'm not at all convinced either. > a good hash for the roff symbols (ds/de) is also under > investigation. Any suggestions---not dbopen---on a hash table > for all Unices? Currently, anything is good enough, even the linear search... For inserting and deleting at run time, tree structures may (or may not) be another option. I'd say just choose the simplest and most flexible algorithm and use it everywhere, consistently. Then wait a bit whether any design decisions need to be made - we still have those unsolved issues with roff macros in man and mdoc, and escape sequences that are functionally equivalent to macros but do not enter the ASTs, and so on and so forth. And those issues are not even the most pressing ones, currently, if you ask me, but they are certainly more pressing than hash optimizations. Currently, improving makewhatis to handle as many macros as possible and getting full text search with that tool would probably bring us more benefit than optimizing the de/ds search algorithm. Oh, and an eqn parser and renderer would be great, too, X11 hackers would love us for that. Cleaning up the hashes is fine as well, but i'd say, for now, use "simplicity" and "uniformity", and maybe a bit of flexibility, as the only design goals. Yours, Ingo -- To unsubscribe send an email to tech+unsubscribe@mdocml.bsd.lv