From mboxrd@z Thu Jan 1 00:00:00 1970 X-Msuck: nntp://news.gmane.io/gmane.comp.tex.context/62331 Path: news.gmane.org!not-for-mail From: Philipp Gesang Newsgroups: gmane.comp.tex.context Subject: Re: two buglets Date: Tue, 5 Oct 2010 14:15:43 +0200 Message-ID: <20101005121543.GD13466@aides> References: <4B743BBE.2050307@wxs.nl> <4CA85AFD.7050200@wxs.nl> <1D533ABC-6546-40C3-9CD4-9659A86D9A5E@uni-bonn.de> <4CA89CCA.8020705@wxs.nl> Reply-To: mailing list for ConTeXt users NNTP-Posting-Host: lo.gmane.org Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============1828040144==" X-Trace: dough.gmane.org 1286280940 9920 80.91.229.12 (5 Oct 2010 12:15:40 GMT) X-Complaints-To: usenet@dough.gmane.org NNTP-Posting-Date: Tue, 5 Oct 2010 12:15:40 +0000 (UTC) To: mailing list for ConTeXt users Original-X-From: ntg-context-bounces@ntg.nl Tue Oct 05 14:15:36 2010 Return-path: Envelope-to: gctc-ntg-context-518@m.gmane.org Original-Received: from balder.ntg.nl ([195.12.62.10]) by lo.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1P36Qi-0000Tv-Co for gctc-ntg-context-518@m.gmane.org; Tue, 05 Oct 2010 14:15:36 +0200 Original-Received: from localhost (localhost [127.0.0.1]) by balder.ntg.nl (Postfix) with ESMTP id 81D10CA670; Tue, 5 Oct 2010 14:15:35 +0200 (CEST) X-Virus-Scanned: Debian amavisd-new at balder.ntg.nl Original-Received: from balder.ntg.nl ([127.0.0.1]) by localhost (balder.ntg.nl [127.0.0.1]) (amavisd-new, port 10024) with LMTP id 1qY0UURexYar; Tue, 5 Oct 2010 14:15:29 +0200 (CEST) Original-Received: from balder.ntg.nl (localhost [127.0.0.1]) by balder.ntg.nl (Postfix) with ESMTP id 38146CA623; Tue, 5 Oct 2010 14:15:29 +0200 (CEST) Original-Received: from localhost (localhost [127.0.0.1]) by balder.ntg.nl (Postfix) with ESMTP id 541F0CA623 for ; Tue, 5 Oct 2010 14:15:27 +0200 (CEST) X-Virus-Scanned: Debian amavisd-new at balder.ntg.nl Original-Received: from balder.ntg.nl ([127.0.0.1]) by localhost (balder.ntg.nl [127.0.0.1]) (amavisd-new, port 10024) 
with LMTP id UE+PpcGxQqBA for ; Tue, 5 Oct 2010 14:15:14 +0200 (CEST) Original-Received: from filter4-ams.mf.surf.net (filter4-ams.mf.surf.net [192.87.102.72]) by balder.ntg.nl (Postfix) with ESMTP id DD57FCA620 for ; Tue, 5 Oct 2010 14:15:14 +0200 (CEST) Original-Received: from relay2.uni-heidelberg.de (relay2.uni-heidelberg.de [129.206.210.211]) by filter4-ams.mf.surf.net (8.14.3/8.14.3/Debian-5+lenny1) with ESMTP id o95CFDKe027600 for ; Tue, 5 Oct 2010 14:15:14 +0200 Original-Received: from ix.urz.uni-heidelberg.de (cyrus-portal.urz.uni-heidelberg.de [129.206.100.176]) by relay2.uni-heidelberg.de (8.13.8/8.13.8) with ESMTP id o95CH3TD004059 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO) for ; Tue, 5 Oct 2010 14:17:03 +0200 Original-Received: from extmail.urz.uni-heidelberg.de (extmail.urz.uni-heidelberg.de [129.206.100.140]) by ix.urz.uni-heidelberg.de (8.13.8/8.13.8) with ESMTP id o95CFCDg032471 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO) for ; Tue, 5 Oct 2010 14:15:12 +0200 Original-Received: from localhost (mnhm-4d010aa6.pool.mediaWays.net [77.1.10.166]) (authenticated bits=0) by extmail.urz.uni-heidelberg.de (8.13.4/8.13.1) with ESMTP id o95CElow015678 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NO) for ; Tue, 5 Oct 2010 14:14:48 +0200 In-Reply-To: X-Operating-System: Linux aides 2.6.35-ARCH X-Polite-Request: "Please try to be nice, don't send html mail." User-Agent: Mutt/1.5.20 (2009-06-14) X-Bayes-Prob: 0.0001 (Score 0, tokens from: @@RPTN) X-CanIt-Geo: ip=129.206.210.211; country=DE; region=01; city=Heidelberg; latitude=49.4167; longitude=8.7000; http://maps.google.com/maps?q=49.4167,8.7000&z=6 X-CanItPRO-Stream: uu:ntg-context@ntg.nl (inherits from uu:default, base:default) X-Canit-Stats-ID: 03Df0feRp - 3dbdd6dbd79b - 20101005 X-Scanned-By: CanIt (www . roaringpenguin . 
com) on 192.87.102.72
X-BeenThere: ntg-context@ntg.nl
X-Mailman-Version: 2.1.12
Precedence: list
List-Id: mailing list for ConTeXt users
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Original-Sender: ntg-context-bounces@ntg.nl
Errors-To: ntg-context-bounces@ntg.nl
Xref: news.gmane.org gmane.comp.tex.context:62331
Archived-At:

--===============1828040144==
Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="+SfteS7bOf3dGlBC"
Content-Disposition: inline

--+SfteS7bOf3dGlBC
Content-Type: multipart/mixed; boundary="PHCdUe6m4AxPMzOu"
Content-Disposition: inline

--PHCdUe6m4AxPMzOu
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On 2010-10-03 <17:43:21>, Thomas A. Schmitz wrote:
> OK, I'll write something for German and English, but the thing
> is that we need more input what users expect. For mixtures with
> foreign languages, there might not be generally accepted rules at
> all, so people will define something on an ad-hoc basis.

Hi Thomas and others,

Technically speaking, the problem is solved by ISO 14651 [1]. In
practice, multilingual sorting depends on local rules, of which
“one index per script/language” seems to be the most common.

Some time ago I wrote an LPeg grammar from the BNF in [1]. It
matches the collation rules from [2], but since I couldn’t figure
out how to map them onto ConTeXt’s sorting mechanism, I never got
around to actually capturing the information. As I won’t have the
time to try it with the new structure of sort-lan, I guess I’ll
just attach the PEG grammar for anyone to use as a starting point.
Unicode collation would be great to have in ConTeXt.

> transliteration. The problem with polytonic Greek is that so many
> different unicode characters need to have the same sort entry. If

Isn’t that just what the Greek rules in sort-lan.lua do? If not,
then it would be a bug.
····startsnippet·················································
definitions["gr"] = {
    entries = {
        ["α"] = "α", ["ά"] = "α", ["ὰ"] = "α", ["ᾶ"] = "α", ["ᾳ"] = "α",
        ["ἀ"] = "α", ["ἁ"] = "α", ["ἄ"] = "α", ["ἂ"] = "α", ["ἆ"] = "α",
        ["ἁ"] = "α", ["ἅ"] = "α", ["ἃ"] = "α", ["ἇ"] = "α", ["ᾁ"] = "α",
        ["ᾴ"] = "α", ["ᾲ"] = "α", ["ᾷ"] = "α", ["ᾄ"] = "α", ["ᾂ"] = "α",
        ["ᾅ"] = "α", ["ᾃ"] = "α", ["ᾆ"] = "α", ["ᾇ"] = "α", ["β"] = "β",
····stopsnippet··················································

Always nice to have a decent discussion on sorting ;)

Philipp


[1] http://standards.iso.org/ittf/PubliclyAvailableStandards/c044872_ISO_IEC_14651_2007(E).zip
[2] http://www.iso.org/ittf/ISO14651_2006_TABLE1_En.txt

-- 
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments

--PHCdUe6m4AxPMzOu
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="iso14651-parser.lua"

-- Bind the module to a local; newer Lua/LPeg versions do not set a global "lpeg".
local lpeg = require "lpeg"

local C, Cs, Ct, P, R, S, V, match = lpeg.C, lpeg.Cs, lpeg.Ct, lpeg.P, lpeg.R, lpeg.S, lpeg.V, lpeg.match

local iso_parser

rules = P{
  [1] = "weight_table",

  -- Define collation tables as sequences of lines
  weight_table = V"common_template_table" + V"tailored_table",
  common_template_table = V"simple_line"^0,
  tailored_table = V"table_line"^0,

  -- Define the line types
  simple_line = (V"symbol_definition" + V"collating_element" + V"weight_assignment" + V"order_end")^-1 * V"line_completion"
    --/ function (first) io.write("simple: "..first) end
    ,
  --table_line = V"simple_line" + V"tailoring_line",
  table_line = V"tailoring_line" + V"simple_line",
  tailoring_line = (V"reorder_after" + V"order_start" + V"reorder_end" + V"section_definition" + V"reorder_section_after") * V"line_completion"
    --/ function (first) io.write("tailoring: "..first) end
    ,

  -- Define the basic syntax for collation weighting
  symbol_definition = P"collating-symbol" * V"space"^1 * V"symbol_element",
  symbol_element = V"symbol" - V"symbol_range" + V"symbol_range",
  symbol_range = V"symbol" * P".." * V"symbol",
  symbol = V"simple_symbol" + V"ucs_symbol",
  -- As it stands, both alternatives below are empty patterns; the
  -- <Uxxxx>-style symbols used in [2] are presumably meant here.
  ucs_symbol = (P"") + (P""),
  simple_symbol = P"<" * V"identifier" * P">",
  collating_element = P"collating-element" * V"space"^1 * V"symbol" * V"space"^1 * P"from" * V"space"^1 * V"quoted_symbol_sequence",
  quoted_symbol_sequence = P'"' * V"simple_weight"^1 * P'"',
  --weight_assignment = V"simple_weight" + V"symbol_weight",
  weight_assignment = V"symbol_weight" + V"simple_weight",
  simple_weight = V"symbol_element" + P"UNDEFINED",
  symbol_weight = V"symbol_element" * V"space"^1 * V"weight_list",
  weight_list = V"level_token" * (V"semicolon" * V"level_token")^0,
  level_token = V"symbol_group" + P"IGNORE",
  symbol_group = V"symbol_element" + V"quoted_symbol_sequence",
  order_end = P"order_end",

  -- Define the tailoring syntax
  reorder_after = P"reorder-after" * V"space"^1 * V"target_symbol",
  target_symbol = V"symbol",
  order_start = P"order_start" * V"space"^1 * V"multiple_level_direction",
  multiple_level_direction = V"direction" * (V"semicolon" * V"direction")^0 * P",position"^-1,
  direction = P"forward" + P"backward",
  reorder_end = P"reorder-end",
  section_definition = V"section_definition_simple" + V"section_definition_list",
  section_definition_simple = P"section" * V"space"^1 * V"section_identifier",
  section_identifier = V"identifier",
  section_definition_list = P"section" * V"space"^1 * V"section_identifier" * V"space"^1 * V"symbol_list",
  symbol_list = V"symbol_element" * (V"semicolon" * V"symbol_element")^0,
  reorder_section_after = P"reorder-section-after" * V"space"^1 * V"section_identifier" * V"space"^1 * V"target_symbol",

  -- Define low-level tokens used by the rest of the syntax
  identifier = (V"letter" + V"digit") * V"id_part"^0,
  id_part = V"letter" + V"digit" + S"-_",
  line_completion = V"space"^0 * V"comment"^-1 * V"EOL",
  comment = V"comment_char" * V"character"^0,
  -- one mandatory digit plus up to seven more (a bare ^-8 would also accept zero digits)
  one_to_eight_digit_hex_string = V"hex_upper" * V"hex_upper"^-7,
  hex_numeric_string = V"hex_upper"^1,
  space = S" \t",
  semicolon = P";",
  comment_char = P"%",
  digit = R"09",
  hex_upper = V"digit" + S"ABCDEF",
  letter = R"az" + R"AZ",
  EOL = P"\n",
  character = 1 - V"EOL",
}

-- Read the whole collation table; fail with a message if the file is missing.
local f = assert(io.open("iso14651.txt", "r"))
local tab = f:read("*all")
f:close()

--rules:print()
print(rules:match(tab))

--PHCdUe6m4AxPMzOu--

--+SfteS7bOf3dGlBC
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)

iEYEARECAAYFAkyrFu8ACgkQ02lYlJYWs9LoPACeIk7OQLWWdz6625ahOC1MWwg/
RD0AoI006l8DoAUaemMRJjlvmhYXL+Em
=F0jp
-----END PGP SIGNATURE-----

--+SfteS7bOf3dGlBC--

--===============1828040144==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : http://foundry.supelec.fr/projects/contextrev/
wiki     : http://contextgarden.net
___________________________________________________________________________________

--===============1828040144==--