9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] Character case mappings
@ 2013-06-24 13:15 Steffen Daode Nurpmeso
  2013-06-24 15:11 ` erik quanstrom
  0 siblings, 1 reply; 5+ messages in thread
From: Steffen Daode Nurpmeso @ 2013-06-24 13:15 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

'Thing is; i'm writing a Unicode-aware library for ISO C99
environments (*earliest* alpha state) and at the moment i use
binary searches (i only have display widths and simple case
mappings right now).  For combined upper/lower case mappings i
end up with

  static struct _casemap {
    uint32_t start;      /* First code point */
    uint32_t accu  : 16; /* Relative distance to mapping */
    _Bool isneg    : 1;  /* Accu must be subtracted */
    _Bool isup     : 1;  /* Code point is uppercase */
    _Bool islull   : 1;  /* Is Lu/Ll range (.accu = range start & 1) */
    _Bool isemap   : 1;  /* Has a one-to-many mapping */
    uint32_t count : 12; /* Number of entries in this range */
  } const _casemaps[] = {
    {0x000041,    32, 0,1,0,0, 26},
    ...
    {0x010428,    40, 1,0,0,0, 40},
  }; /* 250 entries */

that can be accessed via

  static struct _casemap const *
  _find_casemap(uint32_t codep)
  {
    struct _casemap const *cme = _casemaps, *dp;
    uint32_t min = 0, max = ARRAYCOUNT(_casemaps) - 1;

    if (codep >= cme[min].start && codep < cme[max].start + cme[max].count)
      do {
        uint32_t mid = (min + max) >> 1,
          s = (dp = cme + mid)->start;
        if (codep < s)
          max = --mid;
        else if (codep >= s + dp->count)
          min = ++mid;
        else {
          cme += mid;
          goto jleave;
        }
      } while (max >= min);
    cme = NULL;
  jleave:
    return cme;
  }

  uint32_t
  sud_simple_tolower(uint32_t codep)
  {
    struct _casemap const *cme = _find_casemap(codep);

    if (cme == NULL)
      ;
    else if (! cme->islull) {
      if (cme->isup)
        codep = cme->isneg ? codep - cme->accu : codep + cme->accu;
    } else if ((codep & 1) == cme->accu)
      ++codep;
    return codep;
  }

  uint32_t
  sud_simple_toupper(uint32_t codep)
  {
    struct _casemap const *cme = _find_casemap(codep);

    if (cme == NULL)
      ;
    else if (! cme->islull) {
      if (! cme->isup)
        codep = cme->isneg ? codep - cme->accu : codep + cme->accu;
    } else if ((codep & 1) != cme->accu)
      --codep;
    return codep;
  }

My S-CText (on <sourceforge DOT net SLASH p SLASH s-ctext SLASH
code SLASH>) tests all 0x10FFFF code points correctly with the
above.  Now when i look at sys/src/libc/port/runetype.c (of
plan9front) i think it is generated, but i cannot find the
script or program that creates it, which would be of interest to
me.  And maybe Plan9 would be interested to see the above patched
into that at some later time?
Thank you and ciao,

--steffen
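
A minimal, self-contained sketch of the range-table lookup described in the message above, for readers who want to try the idea. It is not taken from S-CText: the `demo_` names, the three-entry table, and the dropped `isemap` flag are assumptions made for illustration. It shows the two kinds of range: a fixed-distance range (A-Z, a-z) and an alternating Lu/Ll range (U+0100..U+012F), where `accu` holds `start & 1`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout modelled on the struct in the message above. */
struct demo_casemap {
  uint32_t start;         /* First code point of the range */
  unsigned accu   : 16;   /* Distance to mapping, or (start & 1) if islull */
  unsigned isneg  : 1;    /* Distance must be subtracted */
  unsigned isup   : 1;    /* Range holds uppercase code points */
  unsigned islull : 1;    /* Alternating upper/lower pairs */
  unsigned count  : 12;   /* Number of code points in the range */
};

/* Assumed sample data, sorted by start; a real table has far more rows. */
static const struct demo_casemap demo_casemaps[] = {
  {0x0041, 32, 0, 1, 0, 26},  /* A-Z: lowercase is +32 */
  {0x0061, 32, 1, 0, 0, 26},  /* a-z: uppercase is -32 */
  {0x0100,  0, 0, 0, 1, 48},  /* U+0100..U+012F: even = upper, odd = lower */
};

/* Binary search for the range containing cp; NULL if cp has no mapping. */
static const struct demo_casemap *
demo_find(uint32_t cp)
{
  size_t lo = 0, hi = sizeof demo_casemaps / sizeof demo_casemaps[0];

  assert(cp <= 0x10FFFF);
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    const struct demo_casemap *e = &demo_casemaps[mid];

    if (cp < e->start)
      hi = mid;
    else if (cp >= e->start + e->count)
      lo = mid + 1;
    else
      return e;
  }
  return NULL;
}

uint32_t
demo_tolower(uint32_t cp)
{
  const struct demo_casemap *e = demo_find(cp);

  if (e == NULL)
    return cp;
  if (e->islull)  /* Uppercase has the same parity as the range start */
    return ((cp & 1) == e->accu) ? cp + 1 : cp;
  if (e->isup)
    return e->isneg ? cp - e->accu : cp + e->accu;
  return cp;
}

uint32_t
demo_toupper(uint32_t cp)
{
  const struct demo_casemap *e = demo_find(cp);

  if (e == NULL)
    return cp;
  if (e->islull)  /* Lowercase has the opposite parity */
    return ((cp & 1) != e->accu) ? cp - 1 : cp;
  if (!e->isup)
    return e->isneg ? cp - e->accu : cp + e->accu;
  return cp;
}
```

The search uses a half-open interval [lo, hi) so that no index can wrap below zero, which means the up-front bounds check of the original is not needed here.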




* Re: [9fans] Character case mappings
  2013-06-24 13:15 [9fans] Character case mappings Steffen Daode Nurpmeso
@ 2013-06-24 15:11 ` erik quanstrom
  2013-06-24 20:25   ` Steffen Daode Nurpmeso
  0 siblings, 1 reply; 5+ messages in thread
From: erik quanstrom @ 2013-06-24 15:11 UTC (permalink / raw)
  To: 9fans

> My S-CText (on <sourceforge DOT net SLASH p SLASH s-ctext SLASH
> code SLASH>) tests all 0x10FFFF code points correctly with the
> above.  Now when i look at sys/src/libc/port/runetype.c (of
> plan9front) i think it is generated, but i cannot find the
> script or program that creates it, which would be of interest to
> me.  And maybe Plan9 would be interested to see the above patched
> into that at some later time?
> Thank you and ciao,

that's close to the approach taken, except that a binary search
needs a separate table for each sort key, so simple tables of
(various width) integers were made.  it was also noted that
splitting ("bursting") the tables at the boundary between the basic
and extended planes was possible in many cases.

for example, for decompositions, if r is a precombined form
and r is in the basic plane, then for r = r' + c, r' and c are both
in the basic plane.  thus we can burst this table and put the
basic plane mappings (1000 of them) in a more compact table
that doesn't use vlongs.  the extended plane table is tiny
(18 entries); a binary search is only worth using there for symmetry.

static
uint	__decompose2[] =
{
	0x00c0,	0x00410300,	 /* À -> A 0300 */
[... 998 entries skipped ... ]
	0xfb4e,	0x05e405bf,	 /* פֿ -> פ 05bf */
};

static
uvlong	__decompose264[] =
{
	0x1109a,	0x00011099000110baull,	 /* 𑂚 -> 𑂙 + 110ba */
[... 16 entries skipped ...]
	0x1d1c0,	0x0001d1bc0001d16full,	 /* 𝆺𝅥𝅯 -> 𝆺𝅥 + 1d16f */
};

static uint*
bsearch32(uint c, uint *t, int n, int ne)
{
	uint *p;
	int m;

	while(n > 1) {
		m = n/2;
		p = t + m*ne;
		if(c >= p[0]) {
			t = p;
			n = n-m;
		} else
			n = m;
	}
	if(n && c == t[0])
		return t;
	return 0;
}

[bsearch64 omitted]

int
runedecompose(Rune a, Rune *d)
{
	uint *p;
	uvlong *q;

	if(a <= 0xffff){
		p = bsearch32(a, __decompose2, nelem(__decompose2)/2, 2);
		if(p){
			d[0] = p[1] >> 16;
			d[1] = p[1] & 0xffff;
			return 0;
		}
	}else{
		q = bsearch64(a, __decompose264, nelem(__decompose264)/2, 2);
		if(q){
			d[0] = q[1] >> 32;
			d[1] = q[1] & 0xffffffff;
			return 0;
		}
	}
	return -1;
}

all the other rune tables work this way.  there is one
table per property.  having a structure doesn't fit the
current programming interface, nor usage.

- erik
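
A minimal, self-contained sketch of the packed-table scheme in the message above, written in portable C99 rather than Plan 9 C. The `demo_` names and the tiny tables are assumptions for illustration; the real tables hold 1000 and 18 entries. Since the 64-bit search routine is omitted in the message, `demo_bsearch64` here is only a guess at its shape, mirroring the 32-bit one. Basic-plane entries pack both halves of the decomposition into one 32-bit word (16 bits each); extended-plane entries pack them into one 64-bit word (32 bits each).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NELEM(a) (sizeof (a) / sizeof (a)[0])

/* (key, base << 16 | combining) pairs, sorted by key. Assumed sample data. */
static const uint32_t demo_decompose[] = {
  0x00c0, 0x00410300,  /* U+00C0 -> U+0041 + U+0300 */
  0x00e9, 0x00650301,  /* U+00E9 -> U+0065 + U+0301 */
  0xfb4e, 0x05e405bf,  /* U+FB4E -> U+05E4 + U+05BF */
};

/* (key, base << 32 | combining) pairs, sorted by key. Assumed sample data. */
static const uint64_t demo_decompose64[] = {
  0x1109a, 0x00011099000110baULL,  /* U+1109A -> U+11099 + U+110BA */
  0x1d1c0, 0x0001d1bc0001d16fULL,  /* U+1D1C0 -> U+1D1BC + U+1D16F */
};

/* Search n (key, value) pairs for key c; return the pair or NULL. */
static const uint32_t *
demo_bsearch32(uint32_t c, const uint32_t *t, size_t n)
{
  while (n > 1) {
    size_t m = n / 2;
    const uint32_t *p = t + m * 2;

    if (c >= p[0]) {  /* Key is in the upper half */
      t = p;
      n -= m;
    } else
      n = m;
  }
  return (n && c == t[0]) ? t : NULL;
}

/* Hypothetical 64-bit counterpart; same loop with wider elements. */
static const uint64_t *
demo_bsearch64(uint64_t c, const uint64_t *t, size_t n)
{
  while (n > 1) {
    size_t m = n / 2;
    const uint64_t *p = t + m * 2;

    if (c >= p[0]) {
      t = p;
      n -= m;
    } else
      n = m;
  }
  return (n && c == t[0]) ? t : NULL;
}

/* Store the two-part decomposition of a in d; return 0, or -1 if none. */
int
demo_runedecompose(uint32_t a, uint32_t d[2])
{
  assert(d != NULL);
  if (a <= 0xffff) {
    const uint32_t *p = demo_bsearch32(a, demo_decompose,
        DEMO_NELEM(demo_decompose) / 2);
    if (p) {
      d[0] = p[1] >> 16;
      d[1] = p[1] & 0xffff;
      return 0;
    }
  } else {
    const uint64_t *q = demo_bsearch64(a, demo_decompose64,
        DEMO_NELEM(demo_decompose64) / 2);
    if (q) {
      d[0] = (uint32_t)(q[1] >> 32);
      d[1] = (uint32_t)(q[1] & 0xffffffff);
      return 0;
    }
  }
  return -1;
}
```

With 32-bit halves, an extended-plane entry such as U+11099 + U+110BA is stored as 0x00011099000110ba, so that the shift by 32 and the 32-bit mask recover both parts.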




* Re: [9fans] Character case mappings
  2013-06-24 15:11 ` erik quanstrom
@ 2013-06-24 20:25   ` Steffen Daode Nurpmeso
  2013-06-24 20:59     ` erik quanstrom
  0 siblings, 1 reply; 5+ messages in thread
From: Steffen Daode Nurpmeso @ 2013-06-24 20:25 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

erik quanstrom <quanstro@quanstro.net> wrote:
 |all the other rune tables work this way.  there is one
 |table per property.  having a structure doesn't fit the
 |current programming interface, nor usage.

uuh, ok, 9atom seems to have seen a lot of progress compared to
what i have yet looked at.
I'm still waiting to find the time to work through the Ballesteros
"Introduction to Operating Systems Abstractions", but i have
already read the manual page of the C compiler (i think it was)
that stated something like "structures are now almost first class
members".  So maybe the day will come that i can tweak
9CD/sys/src/cmd/runetype/ the right way, because i can.

 |- erik

--steffen




* Re: [9fans] Character case mappings
  2013-06-24 20:25   ` Steffen Daode Nurpmeso
@ 2013-06-24 20:59     ` erik quanstrom
  2013-06-25 12:11       ` Steffen Daode Nurpmeso
  0 siblings, 1 reply; 5+ messages in thread
From: erik quanstrom @ 2013-06-24 20:59 UTC (permalink / raw)
  To: 9fans

On Mon Jun 24 16:26:37 EDT 2013, sdaoden@gmail.com wrote:
> erik quanstrom <quanstro@quanstro.net> wrote:
>  |all the other rune tables work this way.  there is one
>  |table per property.  having a structure doesn't fit the
>  |current programming interface, nor usage.
>
> uuh, ok, 9atom seems to have seen a lot of progress compared to
> what i have yet looked at.

just a few tables.  and a bit of time spent applying them.  ;-)
if you have plan 9 installed and can

	nflag=-n srv $nflag -q tcp!atom.9atom.org atom &&
		mount $nflag /srv/atom /n/atom atom

then the tables, &c. are in /n/atom/plan9/sys/src/libc/port.
the awk code to generate them, and the supporting functions
are in /n/atom/plan9/sys/src/cmd/runetype.

a particularly nifty (if straightforward) application is grep -I, which is like
grep -i, but translates its input with tolowerrune(tobaserune(r))
rather than tolower(c).  also straightforward is rune/case, which is
like tr 'A-Z' 'a-z', except generalized for unicode.

see also,
http://www.9atom.org/magic/man2html/1/rune
http://www.9atom.org/magic/man2html/2/isalpharune
http://www.9atom.org/magic/man2html/2/runeclass

- erik
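
A minimal sketch of the folding idea behind grep -I as described in the message above: strip the accent first, then lowercase, then compare. It does not use the 9atom library; `sketch_tobase` and `sketch_tolower` are hypothetical stand-ins for tobaserune and tolowerrune, backed by a four-entry table and an ASCII-only rule, so that the example compiles anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sample data: (precomposed, base) pairs. */
static const uint32_t sketch_base[][2] = {
  {0x00c0, 0x0041},  /* U+00C0 -> A */
  {0x00c9, 0x0045},  /* U+00C9 -> E */
  {0x00e0, 0x0061},  /* U+00E0 -> a */
  {0x00e9, 0x0065},  /* U+00E9 -> e */
};

/* Stand-in for tobaserune: map a precomposed letter to its base letter. */
uint32_t
sketch_tobase(uint32_t r)
{
  size_t i;

  for (i = 0; i < sizeof sketch_base / sizeof sketch_base[0]; i++)
    if (sketch_base[i][0] == r)
      return sketch_base[i][1];
  return r;
}

/* Stand-in for tolowerrune: ASCII only in this sketch. */
uint32_t
sketch_tolower(uint32_t r)
{
  return (r >= 'A' && r <= 'Z') ? r + ('a' - 'A') : r;
}

/* The composition named in the message: tolower(tobase(r)). */
uint32_t
sketch_fold(uint32_t r)
{
  return sketch_tolower(sketch_tobase(r));
}

/* Compare n code points of a and b after folding; 1 if equal, else 0. */
int
sketch_foldeq(const uint32_t *a, const uint32_t *b, size_t n)
{
  size_t i;

  assert(a != NULL && b != NULL);
  for (i = 0; i < n; i++)
    if (sketch_fold(a[i]) != sketch_fold(b[i]))
      return 0;
  return 1;
}
```

Applying the base mapping before the case mapping means the lowercase step only has to know about base letters, which is why the ASCII-only rule is enough for this sample table.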




* Re: [9fans] Character case mappings
  2013-06-24 20:59     ` erik quanstrom
@ 2013-06-25 12:11       ` Steffen Daode Nurpmeso
  0 siblings, 0 replies; 5+ messages in thread
From: Steffen Daode Nurpmeso @ 2013-06-25 12:11 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

erik quanstrom <quanstro@quanstro.net> wrote:
 |> uuh, ok, 9atom seems to have seen a lot of progress compared to
 |> what i have yet looked at.
 |
 |just a few tables.  and a bit of time spent applying them.  ;-) 
 |if you have plan 9 installed and can 
 |
 |	nflag=-n srv $nflag -q tcp!atom.9atom.org atom &&
 |		mount $nflag /srv/atom /n/atom atom

Unfortunately not yet; but i have the distribution since
yesterday.  (The git(1) pack is 121 MB.  And what i've seen before
belonged to go, yet i wrote Plan9 since it seemed to have a common
origin.)

 |then the tables, &c. are in /n/atom/plan9/sys/src/libc/port.
 |the awk code to generate them, and the supporting functions
 |are in /n/atom/plan9/sys/src/cmd/runetype.
 |
 |a particularly nifty (if straightforward) application is grep -I, which is like
 |grep -i, but translates its input with tolowerrune(tobaserune(r))
 |rather than tolower(c).  also straightforward is rune/case, which is
 |like tr 'A-Z' 'a-z', except generalized for unicode.

May be worth taking a deeper look into a system that works for
non-English.

Btw. i thought i was so smart due to my "Ctx" objects for bracket
expressions, format string conversions etc. -- and even said so --
only to find out that on Plan9 there existed something rather
similar years before!  Pretty awkward.

 |see also,
 |http://www.9atom.org/magic/man2html/1/rune
 |http://www.9atom.org/magic/man2html/2/isalpharune
 |http://www.9atom.org/magic/man2html/2/runeclass

yea yea, maybe: i'm not familiar with something that just works,
i'm using BSD for such a long time.
Looking into upas doesn't make me much happier, too.  Sigh.

 |- erik

--steffen



