You are right. Lack of locality is a main issue.
But, as Noam Preil has shown in his workshop presentation, it is nearly impossible to work on the current implementation. Example: index.c has 19202 bytes, whereas trie.c has 3232, doing roughly the same work.
The main drawback of the current venti is the disk based index. A fair comparison will have to put this on a ramdisk. The index section of my data set now is 5GB, but 2GB would be surely sufficient. When I have my implemention ready to work (this is a Herculian task, but I intend just to disable index disk writing without changing the rest), I shall publish the performance data.
One idea to improve locality is avoiding the storage of the full score in the trie nodes. Only the nodes referring to the first 6 nibbles tend to be densely populated. The leaves need not really to contain the score, as the index addr stored there also points to a place where the score is found in the the arena partition. This should not really be detrimental for performance as the block is needed anyway.