From mboxrd@z Thu Jan  1 00:00:00 1970
From: arisawa <arisawa@ar.aichi-u.ac.jp>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable
Message-Id: <5E38D9FC-75B6-40C9-AB64-E210DACE0B4E@ar.aichi-u.ac.jp>
Date: Tue,  6 Aug 2013 10:14:36 +0900
To: 9fans@9fans.net
Mime-Version: 1.0 (Mac OS X Mail 6.3 \(1503\))
Subject: [9fans] text database Kirara
Topicbox-Message-UUID: 6f42e866-ead8-11e9-9d60-3106f5b1d025

Hello 9fans,

I have written a text database named Kirara.
The following is a brief introduction to Kirara.
If you are interested in, get Kirara from:
  http://plan9.aichi-u.ac.jp/netlib/kirara/

Kenji Arisawa

-------------

Kirara

-------------

Kirara is a text indexing/retrieval tool for Plan 9.

Personal use: index/retrieve local files.

Kirara is based on the idea similar to Glimpse.

(1) indexing + grep
(2) multi-level indexing

(a) small space for indexing
(b) small update time
(c) quick search

Note that:
small indexing   <->	quick search
Kirara makes more index -> quick search
Glimpse is single-level indexing.

-------------

Query

Kirara does not support phrase search.
The database is index of words,

supporting:
QE mode (query expression mode)
	'&', '|', '*'
The example:
	'snoopy&html'
	'snoop*&htm*'

RE mode (regular expression mode)
	'&', RE
where RE denotes regular expression.
The example:
	'sn.*y&h.+l'

RE mode is a bit slow. (a few second.)

-------------

Words

Two or more runes.
All words are converted to lower case.
In English, words is composed of alphabets.
The number of runes is configurable

Assumption:
Text is composed of space-separated words
popular in English and many European Languages,
but not in Japanese.

-------------

The user's interface

Best match with Rio
	term% kfind snoop
	G snoop /sys/src/9/ip/
	G snoop /sys/src/cmd/spell/
	G snoop /sys/src/9/kw/
	...
	term% G snoop /sys/src/9/ip
	devip.c:34: 	Qsnoop,
	devip.c:95: 	case Qsnoop:
	devip.c:98: 		devdir(c, q, "snoop", qlen(cv->sq), =
cv->owner, 0400, dp);
	...
Note that: two steps
1. find directories
2. find files and the contents
Step 2 is actually 'grep'. we can use RE.

Two-steps search is not a weekness, but a desirable feature.
Because we have so many files that are hit by the query.

-------------

The organization

My example

/n/other/kirara/sysdb
target: (/lib /sys/lib /sys/src /sys/man /sys/include /sys/doc /rc)

/n/other/kirara/usrdb
target: $home/^(bin/rc lib netlib doc adm issues srclib src sources)

Indexing target is fully configurable.

-------------

Multi-Level Indexing

(1) Indexing (top level)
word to directory mapping

sysdb/index		# main index					=
# used for RE mode
sysdb/mindex	# meta index (alphabetic index)	# used for QE mode
sysdb/dind/*	# rough index of each directory
sysdb/QTDir		# map table (QID, mtime, path-to-dir)

index		# word to dir QID
	aa	0000000000014f0a
	aa	000000000001a1e0
	aa	000000000001a26e

mindex	# word to range in index
	aa 0 126669
	ab 126669 491569
	ac 491569 1258566
	ad 1258566 1852467
	...
dind/*				# `*' is a directory QID
	0000000000014f05
	0000000000014f0a
	000000000001a1ce

usrdb is same.

(2) Indexing (directory level)	# optional
word to file mapping

sysdb/find/*/ind.gz	# fine index of the directory (gzipped)
sysdb/find/*/qtn	# map table (QID, mtime, name)
where `*' is a directory QID

usrdb is same as sysdb.

-------------

Experiment

(a) hardware
GA-H61M-USB3-B3
Intel Pentium G860 (3GHz)
DDR3 PC3 4GB

(b) software
9front
cwfs64x

-------------

The performance (compression ratio)

target	 target	 num_of_dirs	  indexing
sysdb:   556 MB	   1790 dirs		 49 MB
usrdb:	6620 MB	   8948 dirs		150 MB

compression ratio: 49/556 (sysdb)
note: usrdb includes many non-text file.

-------------

The performance (retrieval time)
system dependent

RQ search	# kfind foo

0.1 seconds.

It is not important to make this time smaller.
(sufficiently small)

RE search	# kfind -r foo

a few seconds


-------------

The performance (construction/update)

(a) Construction time
system dependent

Initial construction
need
	10 minutes for sysdb
	30 minutes for usrdb

(b) Updating time
two commands for update

mkdb
	20 seconds to a few minutes for usrdb
	depends largely on state of cache

mkdb1 (currently only for usrdb)
	5 to 15 seconds for usrdb
	mkdb1 needs event log

-------------

Scalability

Main factors

(a) retrieval time
QE search: proportional to number of dirs that include the query
RE search: proportional to size of index

(b) initial construction time
proportional to total data

(c) update time
mkdb: 	proportional to number of dirs and the changes
mkdb1:	proportional to changes and size of index

-------------

Used Tools

(1) rc

(2) grep, sed, awk, sort, diff, gzip, ...

(3) some new tools written in C

-------------

What Kirara means?

Kirara is name of a girl that appeared in a Japanese comic book.
(But I have never read the book.)
The name is seldom used in real world.
=46rom the name we Japanese imagine something glittering.
I like the name.

-------------

References

[1] GLIMPSE: A Tool to Search Through Entire File Systems
Udi Manber and Sun Wu (1993)
http://webglimpse.net/pubs/glimpse.pdf
[2] Glimpse Documentation
http://webglimpse.net/gdocs/glimpsehelp.html