ntg-context - mailing list for ConTeXt users
* [NTG-context] Quickly invoke a self-defined index sorting file?
From: autumnus  @ 2025-01-12  8:58 UTC (permalink / raw)
  To: ntg-context

hi,

I defined an index file for sorting Chinese (it is rather large, almost 4 MB):

https://github.com/Soanguy/ConTeXt-Chinese-Example/blob/master/sorting/sort-alpha.lua

But I have to load this file with \input every time to enable it.
How can I integrate it into my local ConTeXt installation so that \setupregister can enable it directly?

I want to use
%%%
\setupregister[index][n=4,language=cn-alpha,]
%%%

instead of
%%%
\input sort-alpha.lua
\setupregister[index][
  n=4,
  language={cn-alpha},]
%%%

In addition to this, I found that the following notification appeared on the tex terminal, 
probably because there are too many characters in the index file (tens of thousands of characters). 
Can I avoid this notification?

tex memory      > bumping category 'token' succeeded, details: all=16000000 | ext=0 | ini=568315 | itm=8 | max=10000000 | mem=2000000 | min=2000000 | ptr=1999080 | set=10000000 | stp=1000000 | top=2000000
tex memory      > bumping category 'token' succeeded, details: all=24000000 | ext=0 | ini=568315 | itm=8 | max=10000000 | mem=3000000 | min=2000000 | ptr=2999080 | set=10000000 | stp=1000000 | top=3000000
tex memory      > bumping category 'token' succeeded, details: all=32000000 | ext=0 | ini=568315 | itm=8 | max=10000000 | mem=4000000 | min=2000000 | ptr=3999080 | set=10000000 | stp=1000000 | top=4000000
tex memory      > bumping category 'token' succeeded, details: all=40000000 | ext=0 | ini=568315 | itm=8 | max=10000000 | mem=5000000 | min=2000000 | ptr=4999080 | set=10000000 | stp=1000000 | top=5000000
tex memory      > bumping category 'token' succeeded, details: all=48000000 | ext=0 | ini=568315 | itm=8 | max=10000000 | mem=6000000 | min=2000000 | ptr=5999080 | set=10000000 | stp=1000000 | top=6000000
tex memory      > bumping category 'token' succeeded, details: all=56000000 | ext=0 | ini=568315 | itm=8 | max=10000000 | mem=7000000 | min=2000000 | ptr=6999080 | set=10000000 | stp=1000000 | top=7000000

autumnus
___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / https://mailman.ntg.nl/mailman3/lists/ntg-context.ntg.nl
webpage  : https://www.pragma-ade.nl / https://context.aanhet.net (mirror)
archive  : https://github.com/contextgarden/context
wiki     : https://wiki.contextgarden.net
___________________________________________________________________________________


* [NTG-context] Re: Quickly invoke a self-defined index sorting file?
From: Hans Hagen @ 2025-01-12 11:01 UTC (permalink / raw)
  To: autumnus, mailing list for ConTeXt users

On 1/12/2025 9:58 AM, autumnus wrote:
> hi,
> 
> I defined an index file for sorting Chinese. (a bit large with almost 4MB)
> 
> https://github.com/Soanguy/ConTeXt-Chinese-Example/blob/master/sorting/sort-alpha.lua
> 
> But I have to pre-reference this file with input every time to enable it.
> How do I fuse these files with my local context system and use setup to enable it directly.
> 
> I want directly use
> %%%
> \setupregister[index][n=4,language=cn-alpha,]
> %%%
> 
> instead of
> %%%%
> \input sort-alpha.lua
> \setupregister[index][
>    n=4,
>    language={cn-alpha},]
> %%%

remove \startluacode and \stopluacode in that file and do this instead:

\registerctxluafile{sort-hanzi}{}

\starttext
     test
\stoptext

currently you first load that whole file in memory tokenized (i.e. 1 
byte becomes 8 bytes) which is fine (and fast) for reasonable size files 
but in your case it has to bump token memory
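
The numbers in the quoted log are consistent with this explanation. The following is a minimal sketch in plain Lua, assuming that `mem` is the number of token slots, `itm` the bytes per slot, `stp` the number of slots added per bump, and `min` the starting size. These readings of the log fields are inferred from the arithmetic only and are not confirmed anywhere in the thread.

```lua
-- Values copied from the "tex memory" log lines in the first message.
local itm = 8          -- assumed meaning: bytes per token slot
local stp = 1000000    -- assumed meaning: slots added per bump
local min = 2000000    -- assumed meaning: slot count in the first log line

-- Slot count and total bytes for the n-th log line (n = 0 is the first line).
local function allocated(n)
    local mem = min + n * stp
    return mem, mem * itm
end

-- Reproduce the mem= and all= columns of the six log lines.
for n = 0, 5 do
    local mem, all = allocated(n)
    print(string.format("mem=%d all=%d", mem, all))
end
```

Under these assumptions the loop prints mem=2000000 all=16000000 up to mem=7000000 all=56000000, matching the six lines of the log.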

also, you don't really need the huge entries table because you're not 
going to split the index for every first character

maybe this

definitions["cn-hanzi"].entries = table.setmetatableindex(function(t,k)
     if utfbyte(k) < 1000 then
         return "latin"
     else
         return "chinese"
     end
end)

print(definitions["cn-hanzi"].entries['a'])
print(definitions["cn-hanzi"].entries['咗'])
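
For readers without a ConTeXt installation at hand, the same fallback idea can be tried in plain Lua 5.3 or later. This is only a sketch: `setmetatable` with an `__index` function stands in for `table.setmetatableindex`, and `utf8.codepoint` stands in for `utfbyte`; treating those pairs as equivalent for this purpose is an assumption.

```lua
-- A table with no stored keys: every lookup is answered by the __index function.
local entries = setmetatable({ }, {
    __index = function(t, k)
        -- classify by the code point of the first character of the key
        if utf8.codepoint(k) < 1000 then
            return "latin"
        else
            return "chinese"
        end
    end
})

print(entries["a"])    -- latin   (U+0061)
print(entries["咗"])   -- chinese (U+5497)
```

Nothing is cached here, so the function runs on every lookup; that keeps the table empty, which is the point of replacing a large entries table.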

but even then ... korean and japanese don't have that either, so 
basically you only need the order (is that order defined somewhere in 
unicode?)



-- 

-----------------------------------------------------------------
                                           Hans Hagen | PRAGMA ADE
               Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
        tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------

* [NTG-context] Re: Quickly invoke a self-defined index sorting file?
From: autumnus  @ 2025-01-12 11:53 UTC (permalink / raw)
  To: ntg-context

Thanks for the explanation.

After using \registerctxluafile{sort-hanzi}{}, 
the bumping message did not appear.

In terms of daily practical use, I really don't need so many characters.
I just don't have the energy to pick out those thousands of commonly used Chinese characters 
from these 40,000 or 50,000 characters
(In China, for example, there are only about 6,000-8,000 characters actually used on a daily basis.
 In Japanese, you may only need about 1000-3000 characters)

There are only two commonly used sort orders for these characters
(neither has anything to do with Unicode order):
1 by the actual pronunciation of the characters, and
2 by the order in which the characters are written (strokes).
(Japanese is probably mostly sorted by pronunciation based on kana,
but the pronunciation of kanji in Japanese is much more complicated than in Chinese.)

I am not able to implement sorting by strokes at the moment,
so the three indexes I designed are all sorted by the actual pronunciation of the Chinese characters.

They differ only in their entries:
1 Sort in the order a, b, c, d, ... and use these letters as entries (the most common scheme).
2 Sort in the order a, ai, ao, an, ... and use these pronunciations as entries.
3 Sort Chinese characters directly by their pronunciation and use them as entries.
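
The first two schemes can be sketched in plain Lua 5.3 or later. Everything below is illustrative: the six-character `pinyin` table is a hand-typed, toneless sample, and the helper names `sortkey` and `entry` are hypothetical and not part of sort-alpha.lua or of ConTeXt. The third scheme is left out.

```lua
-- Hypothetical toneless readings for six characters; a real table would be
-- generated from a data set rather than typed by hand.
local pinyin = {
    ["词"] = "ci",  ["语"] = "yu",  ["拼"] = "pin",
    ["音"] = "yin", ["数"] = "shu", ["据"] = "ju",
}

-- Spell out each character; characters without a reading are kept as they are.
local function sortkey(word)
    local parts = { }
    for _, cp in utf8.codes(word) do
        local ch = utf8.char(cp)
        parts[#parts + 1] = pinyin[ch] or ch
    end
    return table.concat(parts, " ")
end

-- Scheme 1: the first letter is the entry. Scheme 2: the first syllable is.
local function entry(word, scheme)
    local key = sortkey(word)
    if scheme == 1 then
        return utf8.char(utf8.codepoint(key))
    end
    return key:match("^%S+")
end

local words = { "数据词语拼音", "词语拼音数据", "据词语拼音数" }
table.sort(words, function(a, b) return sortkey(a) < sortkey(b) end)

for _, w in ipairs(words) do
    print(entry(w, 1), entry(w, 2), w)
end
```

Joining the syllables with a space makes a shorter syllable sort before a longer one that starts the same way (a before ai), because the space compares lower than any letter.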

I know almost nothing about Lua myself, so I just used sort-lang as a template.

For Japanese, the approach I have seen in LaTeX so far 
is also to mark up the pronunciation explicitly 
and then sort by the kana reading 
(a kanji may have more than one pronunciation, 
sometimes as many as five), 
unless there is a tool that can add the readings 
to the characters in the index at compile time.

* [NTG-context] Re: Quickly invoke a self-defined index sorting file?
From: Hans Hagen @ 2025-01-12 12:18 UTC (permalink / raw)
  To: ntg-context

On 1/12/2025 12:53 PM, autumnus wrote:
> Thanks for the explanation.
> 
> After using \registerctxluafile{sort-hanzi}{},
> the bumping message did not appear.
> 
> In terms of daily practical use, I really don't need so many characters.
> I just don't have the energy to pick out those thousands of commonly used Chinese characters
> from these 40,000 or 50,000 characters
> (In China, for example, there are only about 6,000-8,000 characters actually used on a daily basis.
>   In Japanese, you may only need about 1000-3000 characters)

so the entries table can just be omitted then

> There are only two commonly used sorts for these characters:
> (Sorting has nothing to do with unicode sorting)
> 1 according to the actual pronunciation of the  characters and
> 2 according to the order in which the characters are written (strokes).
> (The situation in Japanese should probably be mostly sorted by actual pronunciation based on kana,
> but the pronunciation of kanji in Japanese is much more complicated than in Chinese.)
> 
> But sorting by strokes, I don't have the ability to achieve it at the moment.
> So the three indexes I designed are sorted according to the actual pronunciation of the Chinese characters.
> 
> The difference between them is only in the entries.
> 1 Sort in the order of a, b, c d, and use these letters as entries.(mostly used)
> 2 Sort in the order of a ai ao an ...... , and these pronunciations are used as entries.
> 3 Sort Chinese characters directly by their pronunciation and use them as entries.
> 
> Because I know almost nothing about lua myself, just referring to sort-lang (just applying templates)

you can set up a combination of sorting if needed

so the 'order' table is what matters in your case, is that table made 
from some public list?

> For the sorting of Japanese, the sorting I see on latex
> Unless there is a tool that can simultaneously
> phonetize the Chinese characters in the index at compile time.

if we have the basic data (how to pronounce a single char) then runtime 
is no big deal

Hans


* [NTG-context] Re: Quickly invoke a self-defined index sorting file?
From: autumnus  @ 2025-01-12 13:54 UTC (permalink / raw)
  To: ntg-context

Chinese Hanzi PinYin  : https://github.com/mozillazg/pinyin-data/tree/master
Chinese Hanzi Stroke : https://github.com/leo-liu/zhmakeindex/blob/master/CJK/strokes.go

Japanese Hanzi pronunciation: https://github.com/cjkvi/cjkvi-tables/blob/master/joyo2010.txt

* [NTG-context] Re: Quickly invoke a self-defined index sorting file?
From: Hans Hagen via ntg-context @ 2025-01-12 15:59 UTC (permalink / raw)
  To: ntg-context; +Cc: Hans Hagen

On 1/12/2025 2:54 PM, autumnus wrote:

> Chinese Hanzi PinYin  : https://github.com/mozillazg/pinyin-data/tree/master
> Chinese Hanzi Stroke : https://github.com/leo-liu/zhmakeindex/blob/master/CJK/strokes.go

assuming that you downloaded kHanyuPinlu.txt you can run the attached 
test file. You might want to set up an order / entries for the latin variant

just prototyping here
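
Both attached Lua files rely on one Lua pattern to pull a reading and a character out of each data line. The pattern can be tried in plain Lua without the data file. In this sketch the first sample line is the one shown in the attachment's own comment; the second is a made-up line in the same layout, added only to show what happens when several readings are listed.

```lua
-- Sample data: line 1 is taken from the comment in the attachment,
-- line 2 is an illustrative assumption, not copied from kHanyuPinlu.txt.
local data = "U+4E00: yī  # 一\nU+4E0D: bù,bú  # 不\n"

local mapping = { }
-- First capture: the reading up to the first comma or space.
-- Second capture: the character that follows "# ".
for reading, char in string.gmatch(data, ".-: ([^, ]+).-# (%S+)") do
    mapping[char] = reading   -- only the first listed reading is kept
end

print(mapping["一"], mapping["不"])
```

Because `[^, ]+` stops at the first comma, a character with several readings is mapped to the first one only, so the order of readings in the data file decides the result.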

Hans


[-- Attachment #2: lang-imp-cn.lua --]
[-- Type: text/plain, Size: 441 bytes --]

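-- Build a character -> reading table from the data file.
-- Returns nil when no data could be loaded.
local function whatever(name)
    local data = io.loaddata(name)
    if data then
        local mapping = { }
        -- each line looks like: U+4E00: yī  # 一
        -- a = first listed reading, b = the character itself
        for a, b in string.gmatch(data,".-: ([^, ]+).-# (%S+)") do
            mapping[b] = a
        end
        return mapping
    end
end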

return {
    name = "cn transliterations",
    transliterations = {
        ["hanyu to pinlu"] = {
            mapping = whatever("kHanyuPinlu.txt")
        }
    }
}

[-- Attachment #3: sort-imp-cn.lua --]
[-- Type: text/plain, Size: 666 bytes --]

local utfchar, utfbyte  = utf.char, utf.byte
local sorters           = sorters or { }
local definitions       = sorters.definitions or { }
local replacementoffset = sorters.constants.replacementoffset
local variables         = interfaces.variables or { }

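-- Build the replacements list for the sorter from the data file:
-- each item is { character, reading }.
-- Returns nil when no data could be loaded.
local function whatever(name)
    local data = io.loaddata(name)
    if data then
        local replacements = { }
        -- each line looks like: U+4E00: yī  # 一
        -- a = first listed reading, b = the character itself
        for a, b in string.gmatch(data,".-: ([^, ]+).-# (%S+)") do
            replacements[#replacements+1] = { b, a }
        end
        return replacements
    end
end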

definitions["cn-hanzi"] = {
 -- orders = {
 -- }
    replacements = whatever("kHanyuPinlu.txt"),
}


[-- Attachment #4: test.tex --]
[-- Type: text/plain, Size: 907 bytes --]

\mainlanguage[cn]

\registerctxluafile{sort-imp-cn}{}

\usetransliteration[cn]

\definetransliteration
    [MyChinese]
    [color=blue,
     vector={hanyu to pinlu}]

\starttext

    \definedfont[name:adobesongstdlight]

    test 词语拼音数据 \index{词语拼音数据}\par
    test 据词语拼音数 \index{据词语拼音数}\par
    test 数据词语拼音 \index{数据词语拼音}\par

    \blank

    \starttransliteration[MyChinese]
        \tf
        词语拼音数据
        据词语拼音数
        数据词语拼音
    \stoptransliteration

    \placeregister
      [index]
      [n=1,indicator=no]

    \placeregister
      [index]
      [language=cn-hanzi,n=1]

    \starttransliteration[MyChinese]
        \tf
        \placeregister
          [index]
          [language=cn-hanzi,n=1,inbetween=]
    \stoptransliteration

\stoptext

