Re: [9fans] text database Kirara

9fans - fans of the OS Plan 9 from Bell Labs

From: brz-systemd-dev@intma.in
To: 9fans@9fans.net
Subject: Re: [9fans] text database Kirara
Date: Tue,  6 Aug 2013 14:14:07 -0400
Message-ID: <7fb94c8c89d208107f12d7e69d02f2d6@neinchan.znet>
In-Reply-To: <5E38D9FC-75B6-40C9-AB64-E210DACE0B4E@ar.aichi-u.ac.jp>

[-- Attachment #1: Type: text/plain, Size: 622 bytes --]

I've played around with Kirara for a couple of hours now, and am pretty
surprised at how simple it is.  It has already become part of my
workflow.  Being able to quickly (and easily) search for relevant
snippets of code throughout the system is quite useful.

I feel compelled to mention that the code is unusually high in
quality.  (This shows even in the rc scripts.)
Now I'm going to have to look through your other projects.

Thanks for releasing this.

- BurnZeZ

Bug:
	kirara-1.1/INSTALL:9: mkdir -p $kirarar/bin/^(rc $objtype)
	Here (and on line 11), '$kirarar' is used instead of '$kirara'.

[-- Attachment #2: Type: message/rfc822, Size: 7706 bytes --]

From: arisawa <arisawa@ar.aichi-u.ac.jp>
To: 9fans@9fans.net
Subject: [9fans] text database Kirara
Date: Tue, 6 Aug 2013 10:14:36 +0900
Message-ID: <5E38D9FC-75B6-40C9-AB64-E210DACE0B4E@ar.aichi-u.ac.jp>

Hello 9fans,

I have written a text database named Kirara.
The following is a brief introduction to Kirara.
If you are interested, get Kirara from:
  http://plan9.aichi-u.ac.jp/netlib/kirara/

Kenji Arisawa

-------------

Kirara

-------------

Kirara is a text indexing/retrieval tool for Plan 9.

Personal use: index/retrieve local files.

Kirara is based on an idea similar to that of Glimpse.

(1) indexing + grep
(2) multi-level indexing

(a) small space for indexing
(b) small update time
(c) quick search

Note the trade-off:
small index   <->	quick search
Kirara builds more index to get a quicker search.
Glimpse uses single-level indexing.

-------------

Query

Kirara does not support phrase search.
The database is an index of words,
supporting two query modes:

QE mode (query expression mode)
	'&', '|', '*'
Examples:
	'snoopy&html'
	'snoop*&htm*'

RE mode (regular expression mode)
	'&', RE
where RE denotes a regular expression.
Example:
	'sn.*y&h.+l'

RE mode is a bit slow (a few seconds).
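
The QE operators can be illustrated with a small sketch in Python. This is
not Kirara's code: the toy INDEX, the function names, and the rule that '&'
binds tighter than '|' are all assumptions made for the example. It treats
'&' as set intersection, '|' as set union, and '*' as a wildcard over indexed
words.

```python
from fnmatch import fnmatchcase

# Hypothetical toy index: word -> set of directories containing that word.
INDEX = {
    "snoopy": {"/sys/src/9/ip", "/usr/web"},
    "snoop": {"/sys/src/9/ip", "/sys/src/cmd/spell"},
    "html": {"/usr/web", "/sys/doc"},
    "htmlfmt": {"/sys/src/cmd/spell"},
}

def dirs_for(pattern, index):
    """Directories holding any indexed word that matches pattern."""
    pattern = pattern.lower()  # the index stores lower-case words
    hits = set()
    for word, dirs in index.items():
        # fnmatchcase also gives '?' and '[...]' a meaning; only '*' is used here.
        if fnmatchcase(word, pattern):
            hits |= dirs
    return hits

def qe_search(query, index):
    """Evaluate a QE-style query; ASSUMES '&' binds tighter than '|'."""
    result = set()
    for alternative in query.split("|"):
        terms = [t.strip() for t in alternative.split("&") if t.strip()]
        if not terms:
            continue
        matched = dirs_for(terms[0], index)
        for term in terms[1:]:
            matched &= dirs_for(term, index)  # '&' is intersection
        result |= matched                     # '|' is union
    return result

print(sorted(qe_search("snoopy&html", INDEX)))   # ['/usr/web']
print(sorted(qe_search("snoop*&htm*", INDEX)))   # ['/sys/src/cmd/spell', '/usr/web']
```

Because the result is a set of directories rather than files, 'snoop*&htm*'
matches a directory where the two words occur in different files.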

-------------

Words

A word is two or more runes.
All words are converted to lower case.
In English, a word is composed of alphabetic characters.
The minimum number of runes is configurable.

Assumption:
Text is composed of space-separated words,
as is common in English and many European languages,
but not in Japanese.
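
A minimal sketch of these word rules in Python, for illustration only. The
function name words() and the parameter min_runes are hypothetical, and the
exact character classes Kirara accepts are not stated here, so "runs of
letters" is an assumption.

```python
import re

# Runs of letters; digits, underscores, and punctuation separate words.
# Python str patterns are Unicode-aware, so one character is one rune.
WORD_RE = re.compile(r"[^\W\d_]+")

def words(text, min_runes=2):
    """Return the set of lower-cased words with at least min_runes runes."""
    return {w for w in WORD_RE.findall(text.lower()) if len(w) >= min_runes}

line = 'devdir(c, q, "snoop", qlen(cv->sq), cv->owner, 0400, dp);'
print(sorted(words(line)))
# ['cv', 'devdir', 'dp', 'owner', 'qlen', 'snoop', 'sq']
```

The one-rune names c and q and the number 0400 are dropped, which keeps the
index small at the cost of not being able to search for them.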

-------------

The user interface

Works best with Rio
	term% kfind snoop
	G snoop /sys/src/9/ip/
	G snoop /sys/src/cmd/spell/
	G snoop /sys/src/9/kw/
	...
	term% G snoop /sys/src/9/ip
	devip.c:34: 	Qsnoop,
	devip.c:95: 	case Qsnoop:
	devip.c:98: 		devdir(c, q, "snoop", qlen(cv->sq), cv->owner, 0400, dp);
	...
Note that the search takes two steps:
1. find directories
2. find files and the contents
Step 2 is actually 'grep', so we can use RE.

Two-step search is not a weakness but a desirable feature,
because a query can hit a great many files.
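
The two steps can be sketched in Python as follows. Everything here is a
stand-in: TREE is hypothetical in-memory data instead of real files, and
step1() and step2() are illustrative names, not kfind or G themselves.

```python
import re

# Hypothetical data: directory -> {file name -> file text}.
TREE = {
    "/sys/src/9/ip": {
        "devip.c": '\tQsnoop,\n\tcase Qsnoop:\n\t\tdevdir(c, q, "snoop");\n',
    },
    "/sys/src/cmd/spell": {"words": "snoop\nsnore\n"},
    "/sys/src/cmd/ls": {"ls.c": "void main(void)\n"},
}

def build_dir_index(tree):
    """Word -> directories; the index records directories, not files."""
    index = {}
    for directory, files in tree.items():
        for text in files.values():
            for w in re.findall(r"[^\W\d_]{2,}", text.lower()):
                index.setdefault(w, set()).add(directory)
    return index

def step1(word, index):
    """Step 1: use the index to narrow the search to candidate directories."""
    return sorted(index.get(word.lower(), ()))

def step2(pattern, directory, tree):
    """Step 2: grep-like scan of one directory, so regular expressions work."""
    rx = re.compile(pattern)
    return [
        f"{name}:{n}: {line}"
        for name, text in sorted(tree[directory].items())
        for n, line in enumerate(text.splitlines(), 1)
        if rx.search(line)
    ]

index = build_dir_index(TREE)
print(step1("snoop", index))   # ['/sys/src/9/ip', '/sys/src/cmd/spell']
for hit in step2("snoop", "/sys/src/9/ip", TREE):
    print(hit)
```

Step 1 reads only the index; step 2 reads only the files of the directory
the user picked, which is why the second step can afford a plain scan.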

-------------

The organization

My example

/n/other/kirara/sysdb
target: (/lib /sys/lib /sys/src /sys/man /sys/include /sys/doc /rc)

/n/other/kirara/usrdb
target: $home/^(bin/rc lib netlib doc adm issues srclib src sources)

Indexing target is fully configurable.

-------------

Multi-Level Indexing

(1) Indexing (top level)
word to directory mapping

sysdb/index		# main index					# used for RE mode
sysdb/mindex	# meta index (alphabetic index)	# used for QE mode
sysdb/dind/*	# rough index of each directory
sysdb/QTDir		# map table (QID, mtime, path-to-dir)

index		# word to dir QID
	aa	0000000000014f0a
	aa	000000000001a1e0
	aa	000000000001a26e

mindex	# word to range in index
	aa 0 126669
	ab 126669 491569
	ac 491569 1258566
	ad 1258566 1852467
	...
dind/*				# `*' is a directory QID
	0000000000014f05
	0000000000014f0a
	000000000001a1ce

usrdb is organized the same way.
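
The relation between index and mindex can be sketched in Python. This is one
reading of the samples above, not Kirara's implementation: it assumes that
the index is sorted, that word and QID are separated by a tab, and that each
mindex entry maps a two-rune prefix to a [start, end) byte range in the
index. The names build_mindex() and lookup() are hypothetical.

```python
# Hypothetical sorted index: one "word<TAB>directory QID" entry per line.
INDEX = (
    b"aa\t0000000000014f0a\n"
    b"aa\t000000000001a1e0\n"
    b"aardvark\t000000000001a26e\n"
    b"ab\t0000000000014f05\n"
    b"abort\t0000000000014f0a\n"
)

def build_mindex(index_bytes):
    """Map each two-rune prefix to its [start, end) byte range in the index."""
    mindex = {}
    offset = 0
    for line in index_bytes.splitlines(keepends=True):
        prefix = line.split(b"\t", 1)[0].decode()[:2]
        start, _ = mindex.get(prefix, (offset, offset))
        # The index is sorted, so lines sharing a prefix are contiguous.
        mindex[prefix] = (start, offset + len(line))
        offset += len(line)
    return mindex

def lookup(word, index_bytes, mindex):
    """Scan only the byte range for the word's prefix, not the whole index."""
    start, end = mindex.get(word[:2], (0, 0))
    key = word.encode()
    qids = []
    for line in index_bytes[start:end].splitlines():
        w, qid = line.split(b"\t")
        if w == key:
            qids.append(qid.decode())
    return qids

mindex = build_mindex(INDEX)
print(mindex)                        # {'aa': (0, 66), 'ab': (66, 109)}
print(lookup("aa", INDEX, mindex))   # ['0000000000014f0a', '000000000001a1e0']
```

With a real file, the slice would become a seek to start followed by a read
of end - start bytes, so a lookup touches only one prefix's part of the index.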

(2) Indexing (directory level)	# optional
word to file mapping

sysdb/find/*/ind.gz	# fine index of the directory (gzipped)
sysdb/find/*/qtn	# map table (QID, mtime, name)
where `*' is a directory QID

usrdb is organized the same way as sysdb.

-------------

Experiment

(a) hardware
GA-H61M-USB3-B3
Intel Pentium G860 (3GHz)
DDR3 PC3 4GB

(b) software
9front
cwfs64x

-------------

The performance (compression ratio)

db       target size   num_of_dirs   index size
sysdb:      556 MB     1790 dirs         49 MB
usrdb:     6620 MB     8948 dirs        150 MB

compression ratio: 49/556, about 9% (sysdb)
note: usrdb includes many non-text files.
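
For reference, the ratios for both databases can be computed from the
figures in the table above. The short Python sketch below uses only those
numbers; the function name ratio() is just for the example.

```python
# (target size in MB, index size in MB), taken from the table above.
SIZES_MB = {"sysdb": (556, 49), "usrdb": (6620, 150)}

def ratio(target_mb, index_mb):
    """Index size as a fraction of the indexed data."""
    return index_mb / target_mb

for name, (target, index) in SIZES_MB.items():
    print(f"{name}: {index}/{target} = {ratio(target, index):.1%}")
# sysdb: 49/556 = 8.8%
# usrdb: 150/6620 = 2.3%
```

The lower figure for usrdb is consistent with the note that usrdb includes
many non-text files.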

-------------

The performance (retrieval time)
system dependent

QE search	# kfind foo

0.1 seconds.

This time is already sufficiently small;
making it smaller is not important.

RE search	# kfind -r foo

a few seconds

-------------

The performance (construction/update)

(a) Construction time
system dependent

Initial construction needs
	10 minutes for sysdb
	30 minutes for usrdb

(b) Updating time
two commands for update

mkdb
	20 seconds to a few minutes for usrdb
	depends largely on state of cache

mkdb1 (currently only for usrdb)
	5 to 15 seconds for usrdb
	mkdb1 needs an event log

-------------

Scalability

Main factors

(a) retrieval time
QE search: proportional to number of dirs that include the query
RE search: proportional to size of index

(b) initial construction time
proportional to total data

(c) update time
mkdb: 	proportional to number of dirs and the changes
mkdb1:	proportional to changes and size of index
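
Why a full update scales with both the number of directories and the number
of changes can be illustrated with a sketch. It ASSUMES that an update
compares a stored map table of (QID, mtime, path) entries against a fresh
scan; how mkdb really works is not described here, and plan_update() is a
hypothetical name.

```python
def plan_update(stored, current):
    """Compare two map tables of the form {qid: (mtime, path)}.

    Returns (added, changed, removed) lists of QIDs. Every directory has to
    be looked at once, but only added and changed ones need re-indexing.
    """
    added = sorted(q for q in current if q not in stored)
    removed = sorted(q for q in stored if q not in current)
    changed = sorted(
        q for q in current
        if q in stored and current[q][0] != stored[q][0]
    )
    return added, changed, removed

# Hypothetical tables: QID -> (mtime, path-to-dir).
stored = {
    "0000000000014f05": (1375750000, "/sys/src/9/ip"),
    "0000000000014f0a": (1375750000, "/sys/src/cmd/spell"),
}
current = {
    "0000000000014f05": (1375760000, "/sys/src/9/ip"),  # mtime changed
    "000000000001a1ce": (1375760000, "/sys/src/9/kw"),  # new directory
}
print(plan_update(stored, current))
# (['000000000001a1ce'], ['0000000000014f05'], ['0000000000014f0a'])
```

An event log would presumably remove the need for the full scan, which may
be why mkdb1 depends on the changes and the index size rather than on the
number of directories.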

-------------

Used Tools

(1) rc

(2) grep, sed, awk, sort, diff, gzip, ...

(3) some new tools written in C

-------------

What does Kirara mean?

Kirara is the name of a girl who appeared in a Japanese comic book.
(But I have never read the book.)
The name is seldom used in the real world.
To us Japanese, the name suggests something glittering.
I like the name.

-------------

References

[1] GLIMPSE: A Tool to Search Through Entire File Systems
Udi Manber and Sun Wu (1993)
http://webglimpse.net/pubs/glimpse.pdf
[2] Glimpse Documentation
http://webglimpse.net/gdocs/glimpsehelp.html
