[9front] [PATCH] introduce code points above BMP to /lib/unicode

* [9front] [PATCH] introduce code points above BMP to /lib/unicode
@ 2022-10-17  5:24 Jacob Moody
  2022-10-17  6:34 ` qwx
  0 siblings, 1 reply; 6+ messages in thread
From: Jacob Moody @ 2022-10-17  5:24 UTC (permalink / raw)
  To: 9front

[-- Attachment #1: Type: text/plain, Size: 664 bytes --]

Our /lib/unicode is a bit out of date; this updates our stripped-down
version of UnicodeData.txt that we keep in /lib to cover characters
and code ranges above the Basic Multilingual Plane.

This does balloon the file a bit compared to the ~200k original.
; 800k	/lib/unicode

The full patch is attached.  Of note, the lack of zero padding in the
BMP range is replicated from the upstream UnicodeData.txt.  I would be
open to zero padding ours, but this would change the results of
existing scripts that use look(1) with /lib/unicode.
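
For anyone wondering why the padding matters, here is a minimal sketch
in Python.  It assumes each /lib/unicode line is a lowercase hex code,
a tab, and a lowercase name (an assumption about the stripped-down
format, not something stated above), and it models look(1) as a binary
search for a prefix over sorted lines.  make_line and look are
hypothetical helpers for illustration only.

```python
import bisect

def make_line(code, name, width=4):
    # Hex code zero-padded to at least `width` digits, a tab, then the name.
    return f"{code:0{width}x}\t{name}"

def look(prefix, lines):
    # Rough model of look(1): binary search a sorted list for the first
    # candidate, then collect every line that starts with the prefix.
    i = bisect.bisect_left(lines, prefix)
    out = []
    while i < len(lines) and lines[i].startswith(prefix):
        out.append(lines[i])
        i += 1
    return out

chars = [
    (0x41, "latin capital letter a"),
    (0x3bb, "greek small letter lamda"),
    (0x1f600, "grinning face"),
]

# Upstream-style keys: four digits in the BMP, more digits above it.
upstream_style = sorted(make_line(c, n, 4) for c, n in chars)
# Hypothetical alternative: every key padded to six digits.
padded_style = sorted(make_line(c, n, 6) for c, n in chars)
```

With the unpadded keys, look("0041", upstream_style) finds the entry.
Once everything is padded to six digits the same query finds nothing,
and a script would have to ask for 000041 instead.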

Not sure how much use others get out of /lib/unicode, but wanted
to ask if people thought it was worth the size to update.

Thanks,
moody

[-- Attachment #2.1: Type: text/plain, Size: 348 bytes --]

from postmaster@9front:
The following attachment had content that we can't
prove to be harmless.  To avoid possible automatic
execution, we changed the content headers.
The original header was:

	Content-Type: application/gzip; name="unicode.patch.gz"
	Content-Disposition: attachment; filename="unicode.patch.gz"
	Content-Transfer-Encoding: base64

[-- Attachment #2.2: unicode.patch.gz.suspect --]
[-- Type: application/octet-stream, Size: 117597 bytes --]


* Re: [9front] [PATCH] introduce code points above BMP to /lib/unicode
  2022-10-17  5:24 [9front] [PATCH] introduce code points above BMP to /lib/unicode Jacob Moody
@ 2022-10-17  6:34 ` qwx
  2022-10-17 20:28   ` Jacob Moody
  0 siblings, 1 reply; 6+ messages in thread
From: qwx @ 2022-10-17  6:34 UTC (permalink / raw)
  To: 9front

On Mon Oct 17 07:25:49 +0200 2022, moody@mail.posixcafe.org wrote:

> Our /lib/unicode is a bit out of date, this updates our stripped down
> version of UnicodeData.txt that we keep in /lib to cover characters
> and code ranges above the Basic Multilingual Plane.
> 
> This does balloon the file a bit compared to the ~200k original.
> ; 800k	/lib/unicode
> 
> The full patch is attached.  Of note the non-zero padding of the BMP
> range is replicated in the upstream UnicodeData.txt, I would be open
> to zero padding ours but this would change the results of existing
> scripts that use look(1) with /lib/unicode.
> 
> Not sure how much use others get out of /lib/unicode, but wanted
> to ask if people thought it was worth the size to update.
> 
> Thanks,
> moody

I'm definitely in favor, thanks for doing this.  Maybe a problem on my
end, but I can't gunzip the attached patch.

Cheers,
qwx


* Re: [9front] [PATCH] introduce code points above BMP to /lib/unicode
  2022-10-17  6:34 ` qwx
@ 2022-10-17 20:28   ` Jacob Moody
  2022-10-23  5:07     ` qwx
  0 siblings, 1 reply; 6+ messages in thread
From: Jacob Moody @ 2022-10-17 20:28 UTC (permalink / raw)
  To: 9front

Another thing I'd like to get some input on is whether we
should just include the entirety of UnicodeData.txt.  There are some
fields in there, notably decomposition mappings, that would be quite
useful.  It would also be nice to generate the ranges used in things
like runetype(2) from the upstream documents so that we can more
easily keep up to date.
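
As a rough illustration of what the full file offers, here is a
minimal sketch in Python.  It parses a handful of lines in the upstream
UnicodeData.txt layout (semicolon-separated, with the general category
in the third field and the decomposition mapping in the sixth) and
collapses the code points of one category into ranges.  parse and
ranges are hypothetical helpers, and the sketch ignores the
"<..., First>" / "<..., Last>" range entries that the real file also
contains.

```python
SAMPLE = """\
0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;
0032;DIGIT TWO;Nd;0;EN;;2;2;2;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;LATIN CAPITAL LETTER A GRAVE;;;00E0;
0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;;;;
"""

def parse(text):
    records = {}
    for line in text.splitlines():
        f = line.split(";")
        code = int(f[0], 16)
        decomp = f[5].split()
        # A leading <tag> marks a compatibility mapping; no tag means canonical.
        tag = decomp.pop(0) if decomp and decomp[0].startswith("<") else None
        records[code] = {
            "name": f[1],
            "category": f[2],
            "tag": tag,
            "decomp": [int(x, 16) for x in decomp],
        }
    return records

def ranges(records, category):
    # Collapse consecutive code points of one category into (first, last) pairs.
    out = []
    for code in sorted(c for c, r in records.items() if r["category"] == category):
        if out and out[-1][1] + 1 == code:
            out[-1][1] = code
        else:
            out.append([code, code])
    return [tuple(r) for r in out]

records = parse(SAMPLE)
```

Something along these lines could emit range tables of the sort
runetype(2) relies on, though the real tables would need every
category and the First/Last entries handled as well.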

On this topic, I have been considering what should be done about
compositional runes in general, as we currently do nothing with them.
For some quick background, these are runes typically (but not
exclusively) used for diacritic or tonal markings in Unicode that are
meant to be combined with another base rune.  For various reasons many
combinations have specific precomposed runes they map to.  Currently
our fonts support only these precomposed variants.

One way we could get better is to put in some Unicode normalization
(specifically, I am looking at NFC) in someplace like libdraw.  Checking
for normalization is cheap, and fixing up strings under the hood would
be an easy way to make (better) use of the bitmaps in our fonts
already.  NFC canonically decomposes then recomposes the runes to
consistently fully precompose the string before handing it off
to the fonts.
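
To make the check-then-fix idea concrete, here is a minimal sketch
using Python's standard unicodedata module (is_normalized needs Python
3.8 or later).  It only illustrates NFC itself; to_nfc is a
hypothetical helper, and nothing here reflects how libdraw would
actually do it.

```python
import unicodedata

def to_nfc(s):
    # Cheap check first; only rewrite strings that actually need it.
    if unicodedata.is_normalized("NFC", s):
        return s
    return unicodedata.normalize("NFC", s)

decomposed = "e\u0301"          # 'e' followed by COMBINING ACUTE ACCENT
composed = to_nfc(decomposed)   # single precomposed rune U+00E9
```

After the call, the two-rune sequence has become the one precomposed
rune that existing fonts already carry a bitmap for.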

It is also worth pointing out that we can't precompose everything;
there are ranges in Unicode where you have no option but to implement
shaping yourself.  This won't address those, and it would be nice to not
get in the way of that down the road.  Realistically this would allow
us to support a large majority of decomposed Latin, decomposed Korean,
and some other decomposed edge cases that do provide precomposed
variants.
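
A small sketch of both sides of that, again in Python.  Decomposed
Korean composes arithmetically (the Hangul syllable composition from
the Unicode standard), while a sequence such as q plus a combining
acute has no precomposed rune and comes out of NFC unchanged.
compose_hangul is a hypothetical helper that assumes its arguments are
valid conjoining jamo and does no range checking.

```python
import unicodedata

L_BASE, V_BASE, T_BASE, S_BASE = 0x1100, 0x1161, 0x11A7, 0xAC00
V_COUNT, T_COUNT = 21, 28

def compose_hangul(lead, vowel, trail=None):
    # Syllable index = (lead index * 21 + vowel index) * 28 + trail index,
    # where a missing trailing consonant counts as index 0.
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

jamo = "\u1112\u1161\u11ab"     # three conjoining jamo
syllable = compose_hangul(*jamo)

# No precomposed "q with acute" exists, so NFC leaves two runes behind.
no_precomposed = unicodedata.normalize("NFC", "q\u0301")
```

The first case is the kind normalization alone can fix; the second is
the kind that would still need real shaping support.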

This matters if keyboard maps provide these combinational runes, which
as I understand it is not uncommon.  With this change, the
combinational runes would essentially become zero-width code points from
the perspective of libdraw users.  This means backspacing (without
any changes) would require two (or more) hits to fully strike out the
rune, progressively unwinding the modifications.  This makes sense to
me, but I can't make assumptions about how others use these runes.
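
The two possible backspace behaviours can be sketched as follows, in
Python for brevity.  Both functions are hypothetical, and treating
"any rune with a non-zero combining class" as part of the preceding
rune is a simplifying assumption; full grapheme cluster segmentation
is more involved than this.

```python
import unicodedata

def backspace_rune(s):
    # One hit removes one code point, unwinding marks one at a time.
    return s[:-1]

def backspace_cluster(s):
    # One hit removes the base rune together with any combining marks
    # that follow it.
    i = len(s)
    while i > 0 and unicodedata.combining(s[i - 1]) != 0:
        i -= 1
    return s[:max(i - 1, 0)]

text = "ne\u0301"   # 'n', 'e', COMBINING ACUTE ACCENT
```

Here backspace_rune(text) leaves "ne" (the accent is struck first),
while backspace_cluster(text) leaves just "n".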

A bit of a ramble, but I wanted to write out what I've been thinking
so someone else can pick it apart if they'd like.

Thanks,
moody


* Re: [9front] [PATCH] introduce code points above BMP to /lib/unicode
  2022-10-17 20:28   ` Jacob Moody
@ 2022-10-23  5:07     ` qwx
  2022-10-23 14:28       ` ori
  0 siblings, 1 reply; 6+ messages in thread
From: qwx @ 2022-10-23  5:07 UTC (permalink / raw)
  To: 9front

On Mon Oct 17 22:28:26 +0200 2022, moody@mail.posixcafe.org wrote:
> Some additional discussion I'd like to get some input on is if we
> should just include the entirety of UnicodeData.txt.  There are some
> fields in there, notably decomposition mappings, that would be quite
> useful.  It would also be nice to generate the ranges used in things
> like runetype(2) from the upstream documents so that we can more
> easily keep up to date.
> 
> On this topic, I have been considering what should be done about
> compositional runes in general, as we currently do nothing with them.
[...]

I think that if this is of significant practical value and an
improvement in quality of life here, it should be done.  My question
is, and maybe I've missed an obvious answer, how often is this needed
or used in general, and what do people do when it's missing?  I
haven't been able to follow all of the discussions on input methods
and I don't know much about the subject, but I'm curious about how far
this must be pushed.

> [...] With this change, the
> combinational runes would essentially become zero-width code points from
> the perspective of libdraw users.  This means backspacing (without
> any changes) would require two (or more) hits to fully strike out the
> rune, progressively unwinding the modifications.  This makes sense to
> me, but I can't make assumptions about how others use these runes.

Again, I can't speak for anyone, but personally I'd always expect one
single backspace to erase any megarune, which is also what I've seen
in virtual keyboards on the shitphones I've touched.  I'd be very
confused if some characters in the middle of a sentence refuse to be
removed at once.

Anyway, thanks for looking into this!

Cheers,
qwx


* Re: [9front] [PATCH] introduce code points above BMP to /lib/unicode
  2022-10-23  5:07     ` qwx
@ 2022-10-23 14:28       ` ori
  2022-10-23 16:58         ` qwx
  0 siblings, 1 reply; 6+ messages in thread
From: ori @ 2022-10-23 14:28 UTC (permalink / raw)
  To: 9front

Quoth qwx@sciops.net:
> On Mon Oct 17 22:28:26 +0200 2022, moody@mail.posixcafe.org wrote:
> > Some additional discussion I'd like to get some input on is if we
> > should just include the entirety of UnicodeData.txt.  There are some
> > fields in there, notably decomposition mappings, that would be quite
> > useful.  It would also be nice to generate the ranges used in things
> > like runetype(2) from the upstream documents so that we can more
> > easily keep up to date.
> > 
> > On this topic, I have been considering what should be done about
> > compositional runes in general, as we currently do nothing with them.
> [...]
> 
> I think that if this is of significant practical value and an
> improvement in quality of life here, it should be done.  My question
> is, and maybe I've missed an obvious answer, how often is this needed
> or used in general, and what do people do when it's missing?  I
> haven't been able to follow all of the discussions on input methods
> and I don't know much about the subject, but I'm curious about how far
> this must be pushed.

Currently 9front only works well with the latinoid languages, and
with Moody's work, I suspect passably with Chinese and Japanese.

Languages like Hebrew (the only non-Latin language I know) are
unusable, though with Hebrew the larger problem is the lack of
right-to-left support.

As for what people do when it's missing: use English.  I can't
do Hebrew on 9front; it doesn't work.

Saying "it will never work" is an option; I think our UI will
always be in English, for example, but it seems like it would
be a nice goal to type and view any language correctly.

At the same time, it *is* a lot of complexity.

> > [...] With this change, the
> > combinational runes would essentially become zero-width code points from
> > the perspective of libdraw users.  This means backspacing (without
> > any changes) would require two (or more) hits to fully strike out the
> > rune, progressively unwinding the modifications.  This makes sense to
> > me, but I can't make assumptions about how others use these runes.
> 
> Again, I can't speak for anyone, but personally I'd always expect one
> single backspace to erase any megarune, which is also what I've seen
> in virtual keyboards on the shitphones I've touched.  I'd be very
> confused if some characters in the middle of a sentence refuse to be
> removed at once.

In Hebrew input, at least, I'd expect the opposite; I think it will
end up depending on culture what behavior is 'normal', but on this
sort of thing, we can set our own expectations.



* Re: [9front] [PATCH] introduce code points above BMP to /lib/unicode
  2022-10-23 14:28       ` ori
@ 2022-10-23 16:58         ` qwx
  0 siblings, 0 replies; 6+ messages in thread
From: qwx @ 2022-10-23 16:58 UTC (permalink / raw)
  To: 9front

On Sun Oct 23 16:27:01 +0200 2022, ori@eigenstate.org wrote:
> Currently 9front only works well with the latinoid languages, and
> with Moody's work, I suspect passably with Chinese and Japanese.
> 
> Languages like Hebrew (the only non-Latin language I know) are
> unusable, though with Hebrew the larger problem is the lack of
> right-to-left support.
> 
> As for what people do when it's missing: use English.  I can't
> do Hebrew on 9front; it doesn't work.
> 
> Saying "it will never work" is an option; I think our UI will
> always be in English, for example, but it seems like it would
> be a nice goal to type and view any language correctly.
> 
> At the same time, it *is* a lot of complexity.
[...]
> > Again, I can't speak for anyone, but personally I'd always expect one
> > single backspace to erase any megarune, which is also what I've seen
> > in virtual keyboards on the shitphones I've touched.  I'd be very
> > confused if some characters in the middle of a sentence refuse to be
> > removed at once.
> 
> In Hebrew input, at least, I'd expect the opposite; I think it will
> end up depending on culture what behavior is 'normal', but on this
> sort of thing, we can set our own expectations.


Makes sense.  There are other scripts that are very difficult as well
and, as you say, put together it involves quite a lot of work.  There
are lots of ideas about changing rio and libdraw and what not in some
future, but perhaps in the meantime, if these unicodedata and
composition changes would actually help on their own, imho it makes
sense to pursue them.

Just my 2¢, and thanks for clarifying :)

Cheers,
qwx
