On Thursday, 13 June 2024, at 6:08 AM, ori wrote:
Sounds fairly interesting, though I'm curious how it compares; my guess was that because of the lack of locality due to using hashes for the score, a trie wouldn't be that different from a hash table.
You are right. Lack of locality is a main issue. 

But, as Noam Preil has shown in his workshop presentation, it is nearly impossible to work on the current implementation. Example: index.c has 19202 bytes, whereas trie.c has 3232, doing roughly the same work.

The main drawback of the current venti is the disk based index. A fair comparison will have to put this on a ramdisk. The index section of my data set now is 5GB, but 2GB would be surely sufficient. When I have my implemention ready to work (this is a Herculian task, but I intend just to disable index disk writing without changing the rest), I shall publish the performance data.

One idea to improve locality is avoiding  the storage of the full score in the trie nodes. Only the nodes referring to the first 6 nibbles tend to be densely populated. The leaves  need not really to contain the score, as the index addr stored there also points to a place where the score is found in the the arena partition. This should not really be detrimental for performance as the block is needed anyway.