From: Philipp Gesang <pgesang@ix.urz.uni-heidelberg.de>
To: mailing list for ConTeXt users <ntg-context@ntg.nl>
Subject: Re: two buglets
Date: Tue, 5 Oct 2010 14:15:43 +0200
Message-ID: <20101005121543.GD13466@aides>
In-Reply-To: <E02D0E63-FC11-4F9B-AE45-7D6CE6A4D3FF@uni-bonn.de>
On 2010-10-03 <17:43:21>, Thomas A. Schmitz wrote:
> OK, I'll write something for German and English, but the thing
> is that we need more input what users expect. For mixtures with
> foreign languages, there might not be generally accepted rules at
> all, so people will define something on an ad-hoc basis.
Hi Thomas and others,
Technically speaking, the problem is solved by ISO 14651.[1]
In praxi, multilingual sorting depends on local rules, of
which “one index per script|language” seems to be the most
common.
Some time ago I wrote an LPeg grammar from the BNF in [1]. It
matches the collation rules from [2], but as I couldn’t figure out
how to map them onto ConTeXt’s sorting mechanism, I never got around
to actually capturing the information. As I won’t have the time
to try it with the new structure of sort-lan, I guess I’ll just
attach the PEG grammar for anyone to use as a starting point.
Unicode collation would be great to have in ConTeXt.
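To give an idea of what "capturing the information" might look like, here
is a minimal sketch that handles only lines of the form
`<symbol> <weight>;<weight>;...`. The helper name `parse_line`, the
simplified token patterns, and the sample line are assumptions made for
illustration; they are not part of the attached grammar and do not cover
the full syntax of [1].

```lua
local lpeg = require "lpeg"
local C, Cg, Ct, P, R, S = lpeg.C, lpeg.Cg, lpeg.Ct, lpeg.P, lpeg.R, lpeg.S

-- Simplified tokens (assumption: weights are either <...> symbols or IGNORE).
local space   = S" \t"
local idchar  = R("az", "AZ", "09") + S"-_"
local symbol  = P"<" * C(idchar^1) * P">"     -- captures the name without <>
local level   = symbol + C(P"IGNORE")
local weights = Ct(level * (P";" * level)^0)  -- list of level weights

-- One weight assignment: the symbol name plus its list of weights.
local line = Ct(Cg(symbol, "symbol") * space^1 * Cg(weights, "weights"))

-- Hypothetical helper: returns a table { symbol = ..., weights = {...} }
-- or nil if the line is not a weight assignment.
local function parse_line(str)
    return line:match(str)
end

local t = parse_line("<U0041> <S0041>;<BASE>;<CAP>;IGNORE")
print(t.symbol, #t.weights, t.weights[1], t.weights[4])
```

The resulting tables could then be collected per symbol and mapped onto
whatever structure the sorter expects; how that mapping should look for
sort-lan is exactly the open question.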
> transliteration. The problem with polytonic Greek is that so many
> different unicode characters need to have the same sort entry. If
Isn’t that just what the Greek rules in sort-lan.lua do? If not,
then that would be a bug.
····startsnippet·················································
definitions["gr"] = {
entries = {
["α"] = "α", ["ά"] = "α", ["ὰ"] = "α", ["ᾶ"] = "α", ["ᾳ"] = "α",
["ἀ"] = "α", ["ἁ"] = "α", ["ἄ"] = "α", ["ἂ"] = "α", ["ἆ"] = "α",
["ἁ"] = "α", ["ἅ"] = "α", ["ἃ"] = "α", ["ἇ"] = "α", ["ᾁ"] = "α",
["ᾴ"] = "α", ["ᾲ"] = "α", ["ᾷ"] = "α", ["ᾄ"] = "α", ["ᾂ"] = "α",
["ᾅ"] = "α", ["ᾃ"] = "α", ["ᾆ"] = "α", ["ᾇ"] = "α", ["β"] = "β",
····stopsnippet··················································
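The effect of such an entries table can be sketched in plain Lua. This is
only an illustration of the idea, not how sort-lan.lua works internally:
the `entries` excerpt, the `sortkey` helper, and the sample words are made
up, and the code assumes precomposed UTF-8 input.

```lua
-- Hypothetical excerpt in the spirit of the table above (three letters only).
local entries = {
    ["α"] = "α", ["ά"] = "α", ["ἀ"] = "α", ["ἄ"] = "α", ["ᾶ"] = "α",
    ["β"] = "β",
    ["γ"] = "γ",
}

-- Lua pattern for one UTF-8 encoded character: a lead byte followed by
-- any continuation bytes.
local utf8char = "[\1-\127\194-\244][\128-\191]*"

-- Replace every character by its sort entry; unknown characters are kept.
local function sortkey(word)
    local key = {}
    for ch in word:gmatch(utf8char) do
        key[#key + 1] = entries[ch] or ch
    end
    return table.concat(key)
end

local words = { "βάα", "ἄβγ", "αβα", "ἀγα" }
table.sort(words, function(a, b) return sortkey(a) < sortkey(b) end)
print(table.concat(words, " "))
```

With the accents folded away, all variants of alpha compare as the same
letter, so the accented words sort among the plain ones instead of after
them.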
Always nice to have a decent discussion on sorting ;)
Philipp
[1] http://standards.iso.org/ittf/PubliclyAvailableStandards/c044872_ISO_IEC_14651_2007(E).zip
[2] http://www.iso.org/ittf/ISO14651_2006_TABLE1_En.txt
--
() ascii ribbon campaign - against html e-mail
/\ www.asciiribbon.org - against proprietary attachments
[-- Attachment #1.1.2: iso14651-parser.lua --]
[-- Type: text/plain, Size: 3747 bytes --]
-- Newer LPeg versions do not set a global, so keep the module in a local.
local lpeg = require "lpeg"
local C, Cs, Ct, P, R, S, V, match = lpeg.C, lpeg.Cs, lpeg.Ct, lpeg.P, lpeg.R, lpeg.S, lpeg.V, lpeg.match
local rules = P{
[1] = "weight_table",
-- Define collation tables as sequences of lines
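-- PEG choice is ordered and both alternatives always succeed (they are ^0
-- repetitions), so the more general tailored_table has to be tried first.
weight_table = V"tailored_table" + V"common_template_table",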
common_template_table = V"simple_line"^0,
tailored_table = V"table_line"^0,
-- Define the line types
simple_line = (V"symbol_definition" + V"collating_element" +
V"weight_assignment" + V"order_end")^-1 * V"line_completion" --/ function (first) io.write("simple: "..first) end
,
--table_line = V"simple_line" + V"tailoring_line",
table_line = V"tailoring_line" + V"simple_line",
tailoring_line = (V"reorder_after" + V"order_start" + V"reorder_end" +
V"section_definition" + V"reorder_section_after") *
V"line_completion" --/ function (first) io.write("tailoring: "..first) end
,
-- Define the basic syntax for collation weighting
symbol_definition = P"collating-symbol" * V"space"^1 * V"symbol_element",
symbol_element = V"symbol"-V"symbol_range" + V"symbol_range",
symbol_range = V"symbol" * P".." * V"symbol",
symbol = V"simple_symbol" + V"ucs_symbol",
ucs_symbol = (P"<U" * V"one_to_eight_digit_hex_string" * P">") +
(P"<U-" * V"one_to_eight_digit_hex_string" * P">"),
simple_symbol = P"<" * V"identifier" * P">",
collating_element = P"collating-element" * V"space"^1 * V"symbol" * V"space"^1 *
P"from" * V"space"^1 * V"quoted_symbol_sequence",
quoted_symbol_sequence = P'"' * V"simple_weight"^1 * P'"',
--weight_assignment = V"simple_weight" + V"symbol_weight",
weight_assignment = V"symbol_weight" + V"simple_weight",
simple_weight = V"symbol_element" + P"UNDEFINED",
symbol_weight = V"symbol_element" * V"space"^1 * V"weight_list",
weight_list = V"level_token" * (V"semicolon" * V"level_token")^0,
level_token = V"symbol_group" + P"IGNORE",
symbol_group = V"symbol_element" + V"quoted_symbol_sequence",
order_end = P"order_end",
-- Define the tailoring syntax
reorder_after = P"reorder-after" * V"space"^1 * V"target_symbol",
target_symbol = V"symbol",
order_start = P"order_start" * V"space"^1 * V"multiple_level_direction",
multiple_level_direction = V"direction" * (V"semicolon" * V"direction")^0 * P",position"^-1,
direction = P"forward" + P"backward",
reorder_end = P"reorder-end",
section_definition = V"section_definition_simple" + V"section_definition_list",
section_definition_simple = P"section" * V"space"^1 * V"section_identifier",
section_identifier = V"identifier",
section_definition_list = P"section" * V"space"^1 * V"section_identifier" * V"space"^1 * V"symbol_list",
symbol_list = V"symbol_element" * (V"semicolon" * V"symbol_element")^0,
reorder_section_after = P"reorder-section-after" * V"space"^1 * V"section_identifier" * V"space"^1 * V"target_symbol",
-- Define low-level tokens used by the rest of the syntax
identifier = (V"letter" + V"digit") * V"id_part"^0,
id_part = V"letter" + V"digit" + S"-_",
line_completion = V"space"^0 * V"comment"^-1 * V"EOL",
comment = V"comment_char" * V"character"^0,
-- p^-8 would also accept zero digits; require one, then allow up to seven more.
one_to_eight_digit_hex_string = V"hex_upper" * V"hex_upper"^-7,
hex_numeric_string = V"hex_upper"^1,
space = S" \t",
semicolon = P";",
comment_char = P"%",
digit = R"09",
hex_upper = V"digit" + S"ABCDEF",
letter = R"az" + R"AZ",
EOL = P"\n",
character = 1-V"EOL",
}
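-- Read the whole collation table; assert reports a missing file clearly.
local f = assert(io.open("iso14651.txt", "r"))
local tab = f:read("*all")
f:close()
--rules:print()
-- Prints the index just past the matched prefix; the whole file was
-- matched if this equals #tab + 1.
print(rules:match(tab))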
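Since the script only prints a position, a small helper can turn that
number into something readable. The following is a sketch in plain Lua;
the name `unmatched_line` and the sample text are made up for
illustration, and the helper only assumes that `pos` is the value
returned by `rules:match`.

```lua
-- Hypothetical helper: given the text and the position returned by the
-- match, return the number and content of the first line that was not
-- consumed, or nil if everything was matched.
local function unmatched_line(text, pos)
    if pos > #text then
        return nil -- the whole text was consumed
    end
    -- Count the newlines before pos to get a 1-based line number.
    local _, newlines = text:sub(1, pos - 1):gsub("\n", "")
    local line = text:match("[^\n]*", pos)
    return newlines + 1, line
end

local sample = "<a> <b>\n<c> <d>\n?? broken\n<e> <f>\n"
print(unmatched_line(sample, 17))
```

Because every line rule in the grammar ends with an EOL, the returned
position always falls on the start of a line, so the reported line is the
first one the grammar could not handle.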