* [9fans] yet another try to fixup venti
@ 2024-06-11 20:52 wb.kloke
2024-06-12 20:12 ` [9fans] " wb.kloke
2024-06-13 4:08 ` [9fans] " ori
0 siblings, 2 replies; 8+ messages in thread
From: wb.kloke @ 2024-06-11 20:52 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 1780 bytes --]
After studying Steve Stallion's SSD venti disaster, I decided to do my own try to fix the issues of venti.
Despite my reservations on the lasting wisdom of some of the design choices, I try to use the traditional arena disk layout.
Only the on-disk index is replaced with a trie-based in-memory structure.
The trienodes represent either the score and IAddr data as leaves or 16 indices for the next nibble of the score to search further. There is no need for a Bloom filter, as the trie search is not less performant for negative results. The actual trienode size is 64 bytes now, but can probably shorted to 48 bytes.
So far, I have managed to convert buildindex into buildtrie. If -v option is used, the contents of the trie are printed in lexical order of the score.
The data from my experiments are:
I used my 4 arena files, each 20GB, containing about 10 million clumps in standard 500MB arenas. Data from the arena directories are read in in about one and a half minute. (There is one error in one of the arenas.) IMHO this is acceptable as startup time for a venti server.
The trie has about 14m nodes, which are stored in a contiguous array. The trie, which is now 32 bit indexed, thus may be reduced to 24 bit index for the current data amount.
For larger storage, there is a design choice, either use 24 bit indices and 48 byte trie nodes, and 256 trie arrays, or use 32bit indices and 64 byte trienodes in a single array.
After I manage to push my data to a planport fork on github, you will hear more.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/T21878aa53884911b-Mb074534433ed9a094542eef4
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 2555 bytes --]
^ permalink raw reply [flat|nested] 8+ messages in thread
* [9fans] Re: yet another try to fixup venti
2024-06-11 20:52 [9fans] yet another try to fixup venti wb.kloke
@ 2024-06-12 20:12 ` wb.kloke
2024-06-13 4:08 ` [9fans] " ori
1 sibling, 0 replies; 8+ messages in thread
From: wb.kloke @ 2024-06-12 20:12 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 884 bytes --]
For now, I failed to push my changes to github.
If anybody is interested in the files, they are available in http://pkeus.de/~wb/mventi
Copy the files int $PLAN9/src/cmd/venti/srv
and try "mk o.buildtrie" .
I also inserted my code into index.c to check that index lookup gives the same result as before.
The code snippet inserted into loadientry function is
tf = trie_retrieve(score,&ie2);
assert(tf != ~0 );
assert(ie2.addr==ie->ia.addr);
assert(ie2.type==ie->ia.type);
assert(ie2.size==ie->ia.size);
assert(ie2.blocks==ie->ia.blocks);
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/T21878aa53884911b-M7f3b63c6b43eee3e26c57981
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 1819 bytes --]
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [9fans] yet another try to fixup venti
2024-06-11 20:52 [9fans] yet another try to fixup venti wb.kloke
2024-06-12 20:12 ` [9fans] " wb.kloke
@ 2024-06-13 4:08 ` ori
2024-06-13 15:52 ` wb.kloke
1 sibling, 1 reply; 8+ messages in thread
From: ori @ 2024-06-13 4:08 UTC (permalink / raw)
To: 9fans
Sounds fairly interesting, though I'm curious how it compares;
my guess was that because of the lack of locality due to using
hashes for the score, a trie wouldn't be that different from a
hash table.
Quoth wb.kloke@gmail.com:
> After studying Steve Stallion's SSD venti disaster, I decided to do my own try to fix the issues of venti.
>
> Despite my reservations on the lasting wisdom of some of the design choices, I try to use the traditional arena disk layout.
> Only the on-disk index is replaced with a trie-based in-memory structure.
>
> The trienodes represent either the score and IAddr data as leaves or 16 indices for the next nibble of the score to search further. There is no need for a Bloom filter, as the trie search is not less performant for negative results. The actual trienode size is 64 bytes now, but can probably shorted to 48 bytes.
>
> So far, I have managed to convert buildindex into buildtrie. If -v option is used, the contents of the trie are printed in lexical order of the score.
>
> The data from my experiments are:
>
> I used my 4 arena files, each 20GB, containing about 10 million clumps in standard 500MB arenas. Data from the arena directories are read in in about one and a half minute. (There is one error in one of the arenas.) IMHO this is acceptable as startup time for a venti server.
>
> The trie has about 14m nodes, which are stored in a contiguous array. The trie, which is now 32 bit indexed, thus may be reduced to 24 bit index for the current data amount.
>
> For larger storage, there is a design choice, either use 24 bit indices and 48 byte trie nodes, and 256 trie arrays, or use 32bit indices and 64 byte trienodes in a single array.
>
> After I manage to push my data to a planport fork on github, you will hear more.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/T21878aa53884911b-M7cf5960cda854dba36823793
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [9fans] yet another try to fixup venti
2024-06-13 4:08 ` [9fans] " ori
@ 2024-06-13 15:52 ` wb.kloke
2024-06-13 19:41 ` wb.kloke
0 siblings, 1 reply; 8+ messages in thread
From: wb.kloke @ 2024-06-13 15:52 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 1554 bytes --]
On Thursday, 13 June 2024, at 6:08 AM, ori wrote:
> Sounds fairly interesting, though I'm curious how it compares;
my guess was that because of the lack of locality due to using
hashes for the score, a trie wouldn't be that different from a
hash table.
You are right. Lack of locality is a main issue.
But, as Noam Preil has shown in his workshop presentation, it is nearly impossible to work on the current implementation. Example: index.c has 19202 bytes, whereas trie.c has 3232, doing roughly the same work.
The main drawback of the current venti is the disk based index. A fair comparison will have to put this on a ramdisk. The index section of my data set now is 5GB, but 2GB would be surely sufficient. When I have my implemention ready to work (this is a Herculian task, but I intend just to disable index disk writing without changing the rest), I shall publish the performance data.
One idea to improve locality is avoiding the storage of the full score in the trie nodes. Only the nodes referring to the first 6 nibbles tend to be densely populated. The leaves need not really to contain the score, as the index addr stored there also points to a place where the score is found in the the arena partition. This should not really be detrimental for performance as the block is needed anyway.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/T21878aa53884911b-Mf21f7a13249d0b37cde75c53
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 2269 bytes --]
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [9fans] yet another try to fixup venti
2024-06-13 15:52 ` wb.kloke
@ 2024-06-13 19:41 ` wb.kloke
2024-06-16 9:19 ` wb.kloke
0 siblings, 1 reply; 8+ messages in thread
From: wb.kloke @ 2024-06-13 19:41 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 1372 bytes --]
I updated http://pkeus.de/~wb/mventi, adding a file mventi.c which is venti.c + parts of index.c (to avoid linking the real index.o).
I have some performance data now.
I had to add a 4th arenapartition, which sealed the old partitions, so readonly acces is sufficient for them,.
I served the partitions on the original venti system via nfs.
On my AMD A10 workstation (32MB, 4 cores) it took 12min real time to vcat a 50GB ffs partiition to /dev/null.
On the WD Mycloud system ( on which the arena partitions are local, ithe same task took 33min.
mventi:
> time vcat ffs:ef6c183eabb7caafed7f85a3447a12accfbbeb8b > /dev/null
> 1540096 blocks total
> 1214607 blocks in use, 2754705 file reads
>
> real 12m11,573s
> user 2m33,463s
> sys 0m34,604s
original venti:
> ime vcat ffs:ef6c183eabb7caafed7f85a3447a12accfbbeb8b > /dev/null
> 1540096 blocks total
> 1214607 blocks in use, 2754705 file reads
> 504.283u 94.546s 2000.095r vcat ffs:ef6c183eabb7caafed7f85a3447a12accfbbeb8b
>
I should add the remark, that I was quite content with the performance of the same file served as a dump file system
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/T21878aa53884911b-M019008e416672a06797b66a8
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 2365 bytes --]
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [9fans] yet another try to fixup venti
2024-06-13 19:41 ` wb.kloke
@ 2024-06-16 9:19 ` wb.kloke
2024-06-20 15:32 ` wb.kloke
0 siblings, 1 reply; 8+ messages in thread
From: wb.kloke @ 2024-06-16 9:19 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 1089 bytes --]
Some news on my effort.
This morning I used my venti do to real work, serving the fossil filesystem to boot a 386 vm.
So far, it looks good. I did not try to write yet.
I changed my trie.c for some optimisations:
I ditched my union trienode for separate struct trieleaf and struct trienode with the effect that leaves are now stored in 32 byte instead of the 64 byte nodes.
This reduces the actual memory footprint from 15m*64 to 10m*32+5m*64. The next optimisation will reduce the trienode size to 48 bytes.
Another optimisation is the shortening of the first 4 levels of the trie, which are trivial. now, I just use the first 2 bytes of the score to directly addressing the first explicit trienode, which may contain 16 indices to either other trienodes or leaves, discerned by 0 for empty, positive for leaf and negative for trienode.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/T21878aa53884911b-Mccf878acc9623014dca300d2
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 2040 bytes --]
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [9fans] yet another try to fixup venti
2024-06-16 9:19 ` wb.kloke
@ 2024-06-20 15:32 ` wb.kloke
2024-08-16 17:27 ` wb.kloke
0 siblings, 1 reply; 8+ messages in thread
From: wb.kloke @ 2024-06-20 15:32 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 635 bytes --]
After I managed to avoid storing the full scores in main memory, I am confident that the files in http://pkeus.de/~wb/mventi are worth a try for the public.
So far, I don't see much restrictions on use. The arena on-disk format is unchanged.
I now see chances to further shorten the trienodes to 32 bytes. So, the server may be used for storage of about 100GB with 500MB available main memory.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/T21878aa53884911b-Mfb25eaa53a05b455b4a5f5fb
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 1915 bytes --]
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [9fans] yet another try to fixup venti
2024-06-20 15:32 ` wb.kloke
@ 2024-08-16 17:27 ` wb.kloke
0 siblings, 0 replies; 8+ messages in thread
From: wb.kloke @ 2024-08-16 17:27 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 925 bytes --]
Now I have managed to get my files pushed to github. So the new location of my plan9port fork is now at
https://github.com/vestein463/plan9port
In addition, there are a couple of FreeBSD related hosted ther, eg. a program ggatev,
which can be used to serve a vbackup'ed UFS filesystem as a geom, thus even writes are now possible (with help of geom_union) on the backup.
In the meantime, I changed my plan9 VMs to use my mventi as 2nd stage storage for fossil.
I still would not recommend to use write active (unsealed) arenas on SSD (the main culprit isect has gone ), as the arena directory will still be written in partial blocks, probably leading to unnecessary disk wear.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/T21878aa53884911b-M35c05eb463b41f5ad747d5f0
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 1639 bytes --]
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2024-08-16 17:27 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-11 20:52 [9fans] yet another try to fixup venti wb.kloke
2024-06-12 20:12 ` [9fans] " wb.kloke
2024-06-13 4:08 ` [9fans] " ori
2024-06-13 15:52 ` wb.kloke
2024-06-13 19:41 ` wb.kloke
2024-06-16 9:19 ` wb.kloke
2024-06-20 15:32 ` wb.kloke
2024-08-16 17:27 ` wb.kloke
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).