9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] mkindex of dict(7)
@ 2009-01-05  6:08 Akshat Kumar
  2009-01-05  7:15 ` Akshat Kumar
  0 siblings, 1 reply; 8+ messages in thread
From: Akshat Kumar @ 2009-01-05  6:08 UTC (permalink / raw)
  To: 9fans

In its current state, /sys/src/cmd/dict/mkindex suicides if
/lib/dict/oed2 is not present and '-d' option is not specified (along
with the dict name) -- fix:
	move /sys/src/cmd/dict/mkindex:57 to
	after /sys/src/cmd/dict/mkindex:62
(that is, place the Bseek after the conditional on Bopen/bdict)
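The reordering amounts to "check the open, then seek". Below is a minimal sketch of that pattern in Python, not the Plan 9 C of mkindex itself; `open_dict` and `start_indexing` are hypothetical names, and the error message is invented for the example.

```python
import sys


def open_dict(path):
    """Return a binary file handle, or None if the dictionary cannot be opened."""
    try:
        return open(path, "rb")
    except OSError as err:
        print(f"mkindex sketch: can't open {path}: {err}", file=sys.stderr)
        return None


def start_indexing(path, offset=0):
    """Check the result of the open first, and only then seek on the handle."""
    handle = open_dict(path)
    if handle is None:
        # Bail out here: seeking on a handle that was never opened
        # is the failure mode described above.
        return None
    handle.seek(offset)
    return handle
```

With the check placed first, a missing dictionary file yields a diagnostic and a clean return rather than a crash.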

Is no one using it these days?
ak






* Re: [9fans] mkindex of dict(7)
  2009-01-05  6:08 [9fans] mkindex of dict(7) Akshat Kumar
@ 2009-01-05  7:15 ` Akshat Kumar
  2009-01-05 14:13   ` erik quanstrom
  0 siblings, 1 reply; 8+ messages in thread
From: Akshat Kumar @ 2009-01-05  7:15 UTC (permalink / raw)
  To: 9fans

/sys/src/cmd/dict/dict.c states:
/*
 * Assumed index file structure: lines of form
 * 	[^\t]+\t[0-9]+
 * First field is key, second is byte offset into dictionary.
 * Should be sorted with args -u -t'	' +0f -1 +0 -1 +1n -2
 */

whereas, /sys/src/cmd/dict/mkindex outputs:
<byte offset>	<key>
e.g.,
0	ヽ [くりかえし]
(custom dictionary from EDICT)
or
158928	Ab*sorb`a*bil"i*ty
(pgw)

thus, the resulting index from mkindex seems to not be usable with
dict(7)
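To make the mismatch concrete, here is a rough Python sketch that swaps the two fields and checks a line against the form quoted from dict.c. The function names `swap_fields` and `is_index_line` are made up for illustration; this is not part of the dict sources.

```python
import re

# Line shape quoted from the dict.c comment: key, tab, decimal byte offset.
INDEX_LINE = re.compile(r"[^\t]+\t[0-9]+")


def swap_fields(line):
    """Turn '<byte offset>\\t<key>' into '<key>\\t<byte offset>'."""
    offset, key = line.rstrip("\n").split("\t", 1)
    return f"{key}\t{offset}"


def is_index_line(line):
    """True if the whole line matches the form dict.c expects."""
    return INDEX_LINE.fullmatch(line) is not None


raw = "158928\tAb*sorb`a*bil\"i*ty"   # as printed by mkindex
fixed = swap_fields(raw)              # key first, offset second
print(is_index_line(raw), is_index_line(fixed))
```

The swapped line has the right shape, but the whole file would still need sorting as the dict.c comment says.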


Perhaps, before I dive into thinking I'm fixing things,
someone would be kind enough to look into the above?
ak





* Re: [9fans] mkindex of dict(7)
  2009-01-05  7:15 ` Akshat Kumar
@ 2009-01-05 14:13   ` erik quanstrom
  2009-01-05 14:41     ` erik quanstrom
  0 siblings, 1 reply; 8+ messages in thread
From: erik quanstrom @ 2009-01-05 14:13 UTC (permalink / raw)
  To: 9fans

> /sys/src/cmd/dict/dict.c states:
> /*
>  * Assumed index file structure: lines of form
>  * 	[^\t]+\t[0-9]+
>  * First field is key, second is byte offset into dictionary.
>  * Should be sorted with args -u -t'	' +0f -1 +0 -1 +1n -2
>  */
>
> whereas, /sys/src/cmd/dict/mkindex outputs:
> <byte offset>	<key>
> e.g.,
> 0	ヽ [くりかえし]
>
> Perhaps, before I dive into thinking I'm fixing things,
> someone would be kind enough to look into the above?
> ak

clearly there was some post processing.
why do you need to regenerate the index?

- erik




* Re: [9fans] mkindex of dict(7)
  2009-01-05 14:13   ` erik quanstrom
@ 2009-01-05 14:41     ` erik quanstrom
  2009-01-06  5:34       ` Akshat Kumar
  0 siblings, 1 reply; 8+ messages in thread
From: erik quanstrom @ 2009-01-05 14:41 UTC (permalink / raw)
  To: 9fans

> clearly there was some post processing.
> why do you need to regenerate the index?

garr.  reply to myself.

there are a number of awk and rc scripts in
the directory.  it would be a good quick
project to put the pieces together in the
mkfile, but i think some of the pieces are
missing.  they wouldn't be hard to recreate.

running pgwindexraw through
	awk -F^'	' -f canonind.awk
seems pretty reasonable, but there was different
processing applied before the sort to remove
the syllable markers and the leading ||.
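a rough sketch of that kind of cleanup, in Python rather than awk. this is not what canonind.awk does (i have not reproduced it here); `canonical_key` is a made-up name, and the set of marker characters is a guess based only on the sample key Ab*sorb`a*bil"i*ty quoted earlier in the thread.

```python
# Assumption: these are the syllable and stress marks, judging only from
# the sample key Ab*sorb`a*bil"i*ty quoted earlier in the thread.
MARKS = str.maketrans("", "", "*`\"")


def canonical_key(key):
    """Drop a leading '||' and the assumed marker characters."""
    if key.startswith("||"):
        key = key[2:]
    return key.translate(MARKS)


print(canonical_key('Ab*sorb`a*bil"i*ty'))
```

the cleaned key would go in the first field of the index before sorting; the byte offset stays untouched.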

- erik





* Re: [9fans] mkindex of dict(7)
  2009-01-05 14:41     ` erik quanstrom
@ 2009-01-06  5:34       ` Akshat Kumar
  2009-01-06 12:41         ` Fazlul Shahriar
  0 siblings, 1 reply; 8+ messages in thread
From: Akshat Kumar @ 2009-01-06  5:34 UTC (permalink / raw)
  To: 9fans

Regarding the dict index files, what I understand is that dict(7)
receives a pattern (may also be a byte offset or whatever, but suppose
pattern), looks it up in the first fields of the lines in the dict
index, and uses the corresponding byte offset in the index to find the
full line in the dict file.  Well, I've been trying to make the EDICT
dictionary[1] usable with dict(7), using just the "simple" dict scheme
as described in /sys/src/cmd/dict/simple.c, and have made (for now) a
	<kanji>	<byte offset>
index file from the output of mkindex (piping through to sed and
switching the order of the kanji and byte offset).  I've tried quite a
few ways of making that index file, but have not yet succeeded in
getting dict(7) to actually find a corresponding line in the dict file
(`pattern not found'), given any kanji in the first fields in the
index file as a pattern.
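My understanding of the key-to-offset step can be sketched in Python as follows. This is only an illustration of the idea, not the dict(7) code: `build_index` and `entry_at` are hypothetical names, the two sample entries are made up, and the key is assumed to be the text before the first tab of each line (matching the tab I inserted after each kanji/kana).

```python
import io


def entry_at(dict_file, offset):
    """Seek to a byte offset and return the line that starts there."""
    dict_file.seek(offset)
    return dict_file.readline().decode("utf-8").rstrip("\n")


def build_index(dict_bytes):
    """Map the text before the first tab of each line to its byte offset."""
    index = {}
    offset = 0
    for line in dict_bytes.splitlines(keepends=True):
        key = line.split(b"\t", 1)[0].decode("utf-8")
        index[key] = offset
        offset += len(line)  # counted in bytes, not characters
    return index


# Made-up sample entries in the tab-after-key layout.
data = "猫\t[ねこ] /cat/\n犬\t[いぬ] /dog/\n".encode("utf-8")
index = build_index(data)
print(entry_at(io.BytesIO(data), index["犬"]))
```

Note that the offsets are byte counts, so after the conversion to UTF-8 they must be computed on the converted file, not on the EUC-JP original.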

I cannot attach the index file nor the dictionary file with this
E-Mail, since both are too big -- though I've put them online[2] --
but the dictionary file made available at [1] is in a slightly
different format (inserted tab after each kanji/kana) and charset
(EUC-JP/JIS X 0208 → UTF-8) than I have converted at [2].  If anyone
is willing to help figure this out, I'd be very grateful.


[1] http://www.csse.monash.edu.au/~jwb/edict_doc.html
	(see FORMAT for default formatting, and CURRENT VERSION & DOWNLOAD
	 to grab edict.gz)
[2] http://sounine.nanosouffle.net/magic/webls?dir=/comp/dict


Please alert me if the information here is insufficient --
I also don't mind if you go ahead and make the dict files yourself...
just let me in on it --
ak





* Re: [9fans] mkindex of dict(7)
  2009-01-06  5:34       ` Akshat Kumar
@ 2009-01-06 12:41         ` Fazlul Shahriar
  2009-01-06 12:50           ` Akshat Kumar
  0 siblings, 1 reply; 8+ messages in thread
From: Fazlul Shahriar @ 2009-01-06 12:41 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Hi,

You need to sort your index file. Looks like dict(7) is doing binary
search on it. Once sorted, it works fine.
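Here is a small Python sketch of why the order matters. It is not the dict(7) implementation: `lookup` is a hypothetical helper, the keys and offsets are made up, and plain code-point order stands in for the sort arguments given in the dict.c comment.

```python
from bisect import bisect_left


def lookup(index, key):
    """Binary-search (key, offset) pairs; only correct if index is sorted by key."""
    keys = [k for k, _ in index]
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return index[i][1]
    return None  # the 'pattern not found' case


# Made-up entries in file order, which is not key order.
unsorted_index = [("猫", 0), ("鳥", 19), ("犬", 38)]
sorted_index = sorted(unsorted_index)

print(lookup(unsorted_index, "犬"), lookup(sorted_index, "犬"))
```

On the unsorted list the search narrows to the wrong slot and reports a miss even though the key is present; on the sorted list the same key is found.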

fhs

On Tue, Jan 6, 2009 at 12:34 AM, Akshat Kumar
<akumar@sounine.nanosouffle.net> wrote:
> Regarding the dict index files, what I understand is that dict(7)
> receives a pattern (may also be a byte offset or whatever, but suppose
> pattern), looks it up in the first fields of the lines in the dict
> index, and uses the corresponding byte offset in the index to find the
> full line in the dict file.  Well, I've been trying to make the EDICT
> dictionary[1] usable with dict(7), using just the "simple" dict scheme
> as described in /sys/src/cmd/dict/simple.c, and have made (for now) a
>        <kanji> <byte offset>
> index file from the output of mkindex (piping through to sed and
> switching the order of the kanji and byte offset).  I've tried quite a
> few ways of making that index file, but have not yet succeeded in
> getting dict(7) to actually find a corresponding line in the dict file
> (`pattern not found'), given any kanji in the first fields in the
> index file as a pattern.
>
> I cannot attach the index file nor the dictionary file with this
> E-Mail, since both are too big -- though I've put them online[2] --
> but the dictionary file made available at [1] is in a slightly
> different format (inserted tab after each kanji/kana) and charset
> (EUC-JP/JIS X 0208 → UTF-8) than I have converted at [2].  If anyone
> is willing to help figure this out, I'd be very grateful.
>
>
> [1] http://www.csse.monash.edu.au/~jwb/edict_doc.html
>        (see FORMAT for default formatting, and CURRENT VERSION & DOWNLOAD
>         to grab edict.gz)
> [2] http://sounine.nanosouffle.net/magic/webls?dir=/comp/dict
>
>
> Please alert me if the information here is insufficient --
> I also don't mind if you go ahead and make the dict files yourself...
> just let me in on it --
> ak


* Re: [9fans] mkindex of dict(7)
  2009-01-06 12:41         ` Fazlul Shahriar
@ 2009-01-06 12:50           ` Akshat Kumar
  0 siblings, 0 replies; 8+ messages in thread
From: Akshat Kumar @ 2009-01-06 12:50 UTC (permalink / raw)
  To: 9fans

...
> You need to sort your index file.

Aha! This was exactly the problem!

> Looks like dict(7) is doing binary
> search on it. Once sorted, it works fine.
>

Indeed.


> fhs
>
Thanks,
ak






