From mboxrd@z Thu Jan 1 00:00:00 1970
From: Philipp Gesang
Newsgroups: gmane.comp.tex.context
Subject: polish sorting
Date: Wed, 18 Aug 2010 18:08:56 +0200
Message-ID: <20100818160856.GB13324@aides>
Reply-To: mailing list for ConTeXt users
To: ntg-context@ntg.nl
List-Id: mailing list for ConTeXt users
Xref: news.gmane.org gmane.comp.tex.context:61030
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============0993570439=="

--===============0993570439==
Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="oOB74oR0WcNeq9Zb"
Content-Disposition: inline

--oOB74oR0WcNeq9Zb
Content-Type: multipart/mixed; boundary="oFbHfjnMgUMsrGjO"
Content-Disposition: inline

--oFbHfjnMgUMsrGjO
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: 7bit

Hi,

I'm creating some sorting tables. While researching this topic I
stumbled on the Polish dictionary sorting rules: if two strings are
equal except for case, the one that begins with a lowercase letter
takes precedence.[1] (This seems to apply to the Swedish order as
well, but I have no means to verify that. Apparently, my German
dictionary (from 1991) follows the same rule without explicitly
stating so.)

ConTeXt seems to prefer it the other way round, so I modified two
functions from sort-ini.lua to handle that; but I'm not happy with
this solution.

So my question: is there already, or could we have, some mechanism to
influence the details of sorting in ConTeXt?

Thanks for your help,
Philipp

[1] , p. 7.
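
To make the rule concrete, here is a minimal sketch in plain Lua that
does not depend on ConTeXt's sorters at all. The function name
compare_lower_first is made up for illustration and is not part of
sort-ini.lua. It also assumes ASCII input: string.lower does not fold
letters such as "Ł", so real Polish text would need utf.lower instead.

```lua
-- Hypothetical helper, not part of sort-ini.lua: compare two words
-- case-insensitively and let the lowercase-initial word win a tie.
-- Assumption: ASCII input only (string.lower ignores non-ASCII letters).
local function compare_lower_first(a, b)
    local la, lb = string.lower(a), string.lower(b)
    if la ~= lb then
        -- the words differ even when case is ignored: ordinary order decides
        return la < lb
    end
    -- the words are equal except for case: check how each one begins
    local a_lower = a:sub(1, 1) == la:sub(1, 1)
    local b_lower = b:sub(1, 1) == lb:sub(1, 1)
    return a_lower and not b_lower
end

local words = { "Szereguje", "dziwnie", "szereguje", "Dziwnie" }
table.sort(words, compare_lower_first)
print(table.concat(words, " "))  --> dziwnie Dziwnie szereguje Szereguje
```

The comparer returns a boolean because that is what table.sort expects;
the comparers in sort-ini.lua return -1, 0 or 1 instead, so this shows
the ordering rule only, not a drop-in replacement.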
-- 
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments

--oFbHfjnMgUMsrGjO
Content-Type: text/plain; charset=utf-8
Content-Disposition: attachment; filename="playground.lua"
Content-Transfer-Encoding: 8bit

--- testing environment for sorters
dofile "polishsort.lua"

document.whatever       = { }
document.whatever.words = { }

local my = {}

-- replace every character listed in patt by repl
function my.gsub (s, patt, repl)
    patt = lpeg.S(patt)
    patt = lpeg.Cs((patt / repl + 1)^0)
    return lpeg.match(patt, s)
end

--- based on http://www.mail-archive.com/ntg-context@ntg.nl/msg47525.html
function document.whatever.sorttext()
    local dwtext = document.whatever.text
    --local split = sorters.splitters.utf
    local split = sorters.splitters.utflower
    dwtext = my.gsub(dwtext, '\n\t\v"', " ")
    dwtext = string.explode(dwtext, " +")
    local dwwords = document.whatever.words
    for i=1, #dwtext do
        local current = string.strip(dwtext[i])
        if current ~= "" then
            table.insert(dwwords, { word = current })
        end
    end
    for i=1, #dwwords do
        local word = dwwords[i]
        word.split = split(word.word)
    end
    --sorters.sort(dwwords, sorters.comparers.basic)
    sorters.sort(dwwords, sorters.comparers.polish)
end

function document.whatever.flushtext()
    local words    = document.whatever.words
    local previous = false
    local p_word   = false
    for i=1, #words do
        local word = words[i]
        local letter, current = sorters.firstofsplit(word)
        letter = utf.lower(letter)
        if previous ~= current then
            previous = current
            context.section(letter)
        end
        local c_word = word.word
        if p_word ~= c_word then
            context(tostring(i) .. ": " .. c_word)
            context.par()
            p_word = c_word
        end
    end
end

function testrun (lang)
    --f = assert(io.open("anna-utf.txt", "r"))
    --f = assert(io.open("sltext.txt", "r"))
    document.whatever.text = [[
polskie słowa dziwnie się szereguje
Polskie Słowa Dziwnie Się Szereguje
]]
    sorters.setlanguage(lang)
    context.starttext()
    document.whatever.sorttext()
    document.whatever.flushtext()
    context.stoptext()
end

testrun("pl")

--oFbHfjnMgUMsrGjO
Content-Type: text/plain; charset=utf-8
Content-Disposition: attachment; filename="polishsort.lua"
Content-Transfer-Encoding: 8bit

--- Polish sorting (including the letters q, v, x)

sorters.replacements["pl"] = {}

sorters.entries["pl"] = {
    ["a"] = "a", ["ą"] = "ą", ["b"] = "b", ["c"] = "c", ["ć"] = "ć",
    ["d"] = "d", ["e"] = "e", ["ę"] = "ę", ["f"] = "f", ["g"] = "g",
    ["h"] = "h", ["i"] = "i", ["j"] = "j", ["k"] = "k", ["l"] = "l",
    ["ł"] = "ł", ["m"] = "m", ["n"] = "n", ["ń"] = "ń", ["o"] = "o",
    ["ó"] = "ó", ["p"] = "p", ["q"] = "q", ["r"] = "r", ["s"] = "s",
    ["ś"] = "ś", ["t"] = "t", ["u"] = "u", ["v"] = "v", ["w"] = "w",
    ["x"] = "x", ["y"] = "y", ["z"] = "z", ["ź"] = "ź", ["ż"] = "ż",
}

sorters.mappings["pl"] = {
    ["a"] =  1, ["ą"] =  2, ["b"] =  3, ["c"] =  4, ["ć"] =  5,
    ["d"] =  6, ["e"] =  7, ["ę"] =  8, ["f"] =  9, ["g"] = 10,
    ["h"] = 11, ["i"] = 12, ["j"] = 13, ["k"] = 14, ["l"] = 15,
    ["ł"] = 16, ["m"] = 17, ["n"] = 18, ["ń"] = 19, ["o"] = 20,
    ["ó"] = 21, ["p"] = 22, ["q"] = 23, ["r"] = 24, ["s"] = 25,
    ["ś"] = 26, ["t"] = 27, ["u"] = 28, ["v"] = 29, ["w"] = 30,
    ["x"] = 31, ["y"] = 32, ["z"] = 33, ["ź"] = 34, ["ż"] = 35,
}

local currentreplacements = sorters.replacements["pl"] or {}
local currentmappings     = sorters.mappings["pl"] or {}
local currententries      = sorters.entries["pl"] or {}

local utfcharacters = string.utfcharacters
local utfbyte       = utf.byte
local gsub          = string.gsub -- needed by utflower when replacements are defined

-- unchanged, needs to be in local scope
local function basicsort(sort_a,sort_b)
    if not sort_a or not sort_b then
        return 0
    elseif #sort_a > #sort_b then
        if #sort_b == 0 then
            return 1
        else
            for i=1,#sort_b do
                local ai, bi = sort_a[i], sort_b[i]
                if ai > bi then
                    return 1
                elseif ai < bi then
                    return -1
                end
            end
            return 1
        end
    elseif #sort_a < #sort_b then
        if #sort_a == 0 then
            return -1
        else
            for i=1,#sort_a do
                local ai, bi = sort_a[i], sort_b[i]
                if ai > bi then
                    return 1
                elseif ai < bi then
                    return -1
                end
            end
            return -1
        end
    elseif #sort_a == 0 then
        return 0
    else
        for i=1,#sort_a do
            local ai, bi = sort_a[i], sort_b[i]
            if ai > bi then
                return 1
            elseif ai < bi then
                return -1
            end
        end
        return 0
    end
end

-- modified from sorters.comparers.basic(str)
function sorters.comparers.polish(a,b)
    local ea, eb = a.split, b.split
    local na, nb = #ea, #eb
    if na == 0 and nb == 0 then
        -- simple variant (single word)
        local result = basicsort(ea.e,eb.e)
        if result == 0 then
            -- equal except for case: the lowercase-initial word goes first
            if eb.first_lower and not ea.first_lower then
                return 1
            elseif ea.first_lower and not eb.first_lower then
                return -1
            else
                return 0
            end
        else
            return basicsort(ea.m, eb.m)
        end
    else
        -- complex variant, used in register (multiple words)
        local result = 0
        for i=1,nb < na and nb or na do
            local eai, ebi = ea[i], eb[i]
            result = basicsort(eai.e,ebi.e)
            if result == 0 then
                result = basicsort(eai.m,ebi.m) -- only needed if there are m's
            end
            if result ~= 0 then
                break
            end
        end
        if result ~= 0 then
            return result
        elseif na > nb then
            return 1
        elseif nb > na then
            return -1
        else
            -- same tie-break as above, decided by the first word
            if eb[1].first_lower and not ea[1].first_lower then
                return 1
            elseif ea[1].first_lower and not eb[1].first_lower then
                return -1
            else
                return 0
            end
        end
    end
end

-- modified from sorters.splitters.utf(str)
function sorters.splitters.utflower(str)
    -- remember the original first character before lowercasing the string
    local first_char = utf.sub(str,1,1)
    str = utf.lower(str)
    if #currentreplacements > 0 then
        for k=1,#currentreplacements do
            local v = currentreplacements[k]
            str = gsub(str,v[1],v[2])
        end
    end
    local s, e, m, n = { }, { }, { }, 0
    for sc in utfcharacters(str) do -- maybe an lpeg
        local ec, mc = currententries[sc], currentmappings[sc] or utfbyte(sc)
        n = n + 1
        s[n] = sc
        e[n] = currentmappings[ec] or mc
        m[n] = mc
    end
    return { s = s, e = e, m = m, first_lower = first_char == utf.lower(first_char) }
end

--oFbHfjnMgUMsrGjO--

--oOB74oR0WcNeq9Zb
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)

iEYEARECAAYFAkxsBZgACgkQ02lYlJYWs9IBfgCgiI/sY692FbinSYnGUMNxbUEl
0mgAn3TruC7Kn41aqZsFlMl02NcDdebq
=YhFy
-----END PGP SIGNATURE-----

--oOB74oR0WcNeq9Zb--

--===============0993570439==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : http://foundry.supelec.fr/projects/contextrev/
wiki     : http://contextgarden.net
___________________________________________________________________________________

--===============0993570439==--