[-- Attachment #1: Type: text/plain, Size: 664 bytes --]

Our /lib/unicode is a bit out of date, this updates our stripped down
version of UnicodeData.txt that we keep in /lib to cover characters
and code ranges above the Basic Multilingual Plane.

This does balloon the file a bit compared to the ~200k original.
; 800k /lib/unicode

The full patch is attached. Of note the non-zero padding of the BMP
range is replicated in the upstream UnicodeData.txt, I would be open
to zero padding ours but this would change the results of existing
scripts that use look(1) with /lib/unicode.

Not sure how much use others get out of /lib/unicode, but wanted
to ask if people thought it was worth the size to update.

Thanks,
moody

[-- Attachment #2.1: Type: text/plain, Size: 348 bytes --]

from postmaster@9front:
The following attachment had content that we can't prove to be harmless.
To avoid possible automatic execution, we changed the content headers.
The original header was:

Content-Type: application/gzip; name="unicode.patch.gz"
Content-Disposition: attachment; filename="unicode.patch.gz"
Content-Transfer-Encoding: base64

[-- Attachment #2.2: unicode.patch.gz.suspect --]
[-- Type: application/octet-stream, Size: 117597 bytes --]
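
To make the zero-padding concern in the message above concrete, here is a
rough sketch, not taken from the patch. It assumes each /lib/unicode line is
"hex code point, tab, lowercased name"; that layout, the helper names
strip_line and lookup, and the pad parameter are all illustrative
assumptions. Only the semicolon-separated UnicodeData.txt field order
(code point first, name second) is relied on.

```python
# Three real UnicodeData.txt entries, used as sample input.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9
1F600;GRINNING FACE;So;0;ON;;;;;N;;;;;
"""

def strip_line(line, pad=0):
    """Keep only code point and name (assumed /lib/unicode layout).

    pad=0 leaves the upstream width alone (4 digits in the BMP,
    more above it); pad=6 zero-pads every code point to 6 digits.
    """
    fields = line.split(";")
    code, name = fields[0], fields[1]
    return f"{code.zfill(pad)}\t{name.lower()}"

def lookup(lines, prefix):
    """Return lines that start with prefix, a stand-in for a prefix search."""
    return [l for l in lines if l.startswith(prefix)]

unpadded = [strip_line(l) for l in SAMPLE.splitlines()]
padded = [strip_line(l, pad=6) for l in SAMPLE.splitlines()]
```

With these definitions, lookup(unpadded, "0041") finds the entry for A,
while lookup(padded, "0041") finds nothing because the line now starts with
"000041". That is the kind of breakage an existing prefix-based script would
see if the file were zero padded.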
On Mon Oct 17 07:25:49 +0200 2022, moody@mail.posixcafe.org wrote:
> Our /lib/unicode is a bit out of date, this updates our stripped down
> version of UnicodeData.txt that we keep in /lib to cover characters
> and code ranges above the Basic Multilingual Plane.
>
> This does balloon the file a bit compared to the ~200k original.
> ; 800k /lib/unicode
>
> The full patch is attached. Of note the non-zero padding of the BMP
> range is replicated in the upstream UnicodeData.txt, I would be open
> to zero padding ours but this would change the results of existing
> scripts that use look(1) with /lib/unicode.
>
> Not sure how much use others get out of /lib/unicode, but wanted
> to ask if people thought it was worth the size to update.
>
> Thanks,
> moody
I'm definitely in favor, thanks for doing this. Maybe a problem on my
end, but I can't gunzip the attached patch.
Cheers,
qwx
Some additional discussion I'd like to get some input on is if we
should just include the entirety of UnicodeData.txt. There are some
fields in there, notably decomposition mappings, that would be quite
useful. It would also be nice to generate the ranges used in things
like runetype(2) from the upstream documents so that we can more
easily keep up to date.

On this topic, I have been considering what should be done about
compositional runes in general, as we currently do nothing with them.
For some quick background, these are runes typically used for
diacritic or tonal markings (but not exclusively) in Unicode that are
meant to be combined with another base rune. For various reasons many
combinations have specific precomposed runes they map to. Currently
our fonts support only these precomposed variants.

One way we could get better is to put in some Unicode normalization,
specifically I am looking at NFC, in someplace like libdraw. Checking
for normalization is cheap, and fixing up strings under the hood would
be an easy way to make (better) use of the bitmaps in our fonts
already. NFC canonically decomposes then recomposes the runes to
consistently fully precompose the string before handing it off to the
fonts.

It is worth pointing out also that we can't precompose everything,
there are ranges in Unicode where you have no option but to implement
shaping yourself. This won't address those, and it would be nice to
not get in the way of that down the road. Realistically this would
allow us to support a large majority of decomposed Latin, decomposed
Korean, and some other decomposed edge cases that do provide
precomposed variants.

This matters if keyboard maps provide these combinational runes, which
as I understand it is not uncommon. With this change, the
combinational runes would essentially become zero width codepoints to
the perspective of libdraw users. Which means backspacing (without
any changes) would require two (or more) hits to fully strike out the
rune, progressively unwinding the modifications. This makes sense to
me, but I can't make assumptions about how others use these runes.

A bit of a ramble, but I wanted to write out what I've been thinking
so someone else can pick it apart if they'd like.

Thanks,
moody
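
For anyone who wants to poke at what NFC does to the cases mentioned above,
here is a small sketch using Python's standard unicodedata module. It is
only an illustration of the normalization form itself, not of how this
would be wired into libdraw; the function names to_nfc and needs_fixup are
made up for the example.

```python
import unicodedata

def to_nfc(s):
    """Canonically decompose, then recompose (Normalization Form C)."""
    return unicodedata.normalize("NFC", s)

def needs_fixup(s):
    """True if s is not already in NFC (simple compare, not a quick check)."""
    return to_nfc(s) != s

# Decomposed Latin: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
latin = "e\u0301"
# Decomposed Korean: three conjoining jamo that form one syllable.
korean = "\u1112\u1161\u11ab"
# No precomposed form exists for 'q' with acute, so NFC leaves it alone.
stuck = "q\u0301"

latin_nfc = to_nfc(latin)    # single code point U+00E9
korean_nfc = to_nfc(korean)  # single code point U+D55C
stuck_nfc = to_nfc(stuck)    # still two code points
```

The first two strings collapse to a single precomposed rune that a font
with only precomposed bitmaps could draw. The last one is unchanged, which
is the "we can't precompose everything" case.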
On Mon Oct 17 22:28:26 +0200 2022, moody@mail.posixcafe.org wrote:
> Some additional discussion I'd like to get some input on is if we
> should just include the entirety of UnicodeData.txt. There are some
> fields in there, notably decomposition mappings, that would be quite
> useful. It would also be nice to generate the ranges used in things
> like runetype(2) from the upstream documents so that we can more
> easily keep up to date.
>
> On this topic, I have been considering what should be done about
> compositional runes in general, as we currently do nothing with them.
[...]

I think that if this is of significant practical value and an
improvement in quality of life here, it should be done. My question
is, and maybe I've missed an obvious answer, how often is this needed
or used in general, and what do people do when it's missing? I
haven't been able to follow all of the discussions on input methods
and I don't know much about the subject, but I'm curious about how far
this must be pushed.

> [...] With this change, the
> combinational runes would essentially become zero width codepoints to
> the perspective of libdraw users. Which means backspacing (without
> any changes) would require two (or more) hits to fully strike out the
> rune, progressively unwinding the modifications. This makes sense to
> me, but I can't make assumptions about how others use these runes.

Again, I can't speak for anyone, but personally I'd always expect one
single backspace to erase any megarune, which is also what I've seen
in virtual keyboards on the shitphones I've touched. I'd be very
confused if some characters in the middle of a sentence refuse to be
removed at once.

Anyway, thanks for looking into this!

Cheers,
qwx
Quoth qwx@sciops.net:
> On Mon Oct 17 22:28:26 +0200 2022, moody@mail.posixcafe.org wrote:
> > Some additional discussion I'd like to get some input on is if we
> > should just include the entirety of UnicodeData.txt. There are some
> > fields in there, notably decomposition mappings, that would be quite
> > useful. It would also be nice to generate the ranges used in things
> > like runetype(2) from the upstream documents so that we can more
> > easily keep up to date.
> >
> > On this topic, I have been considering what should be done about
> > compositional runes in general, as we currently do nothing with them.
> [...]
>
> I think that if this is of significant practical value and an
> improvement in quality of life here, it should be done. My question
> is, and maybe I've missed an obvious answer, how often is this needed
> or used in general, and what do people do when it's missing? I
> haven't been able to follow all of the discussions on input methods
> and I don't know much about the subject, but I'm curious about how far
> this must be pushed.

Currently 9front only works well with the latinoid languages, and
with Moody's work, I suspect passably with Chinese and Japanese.

Languages like Hebrew (the only non-Latin language I know) are
unusable, though with Hebrew the larger problem is the lack of
right to left support.

as far as what people do when it's missing? use English; I can't
do Hebrew on 9front. doesn't work.

saying "it will never work" is an option; I think our UI will
always be in English, for example, but it seems like it would
be a nice goal to type and view any language correctly.

at the same time, it *is* a lot of complexity.

> > [...] With this change, the
> > combinational runes would essentially become zero width codepoints to
> > the perspective of libdraw users. Which means backspacing (without
> > any changes) would require two (or more) hits to fully strike out the
> > rune, progressively unwinding the modifications. This makes sense to
> > me, but I can't make assumptions about how others use these runes.
>
> Again, I can't speak for anyone, but personally I'd always expect one
> single backspace to erase any megarune, which is also what I've seen
> in virtual keyboards on the shitphones I've touched. I'd be very
> confused if some characters in the middle of a sentence refuse to be
> removed at once.

In Hebrew input, at least, I'd expect the opposite; I think it will
end up depending on culture what behavior is 'normal', but on this
sort of thing, we can set our own expectations.
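
The two backspace behaviors discussed in this thread can be compared with a
short sketch. This is only an illustration with made-up function names, and
it treats "base rune plus any trailing runes with a non-zero canonical
combining class" as one unit, which is a simplification and not full
grapheme cluster segmentation.

```python
import unicodedata

def backspace_rune(s):
    """Remove only the last code point, unwinding marks one at a time."""
    return s[:-1]

def backspace_cluster(s):
    """Remove the last base rune together with its trailing combining marks."""
    i = len(s)
    # Skip backwards over combining marks (non-zero combining class).
    while i > 0 and unicodedata.combining(s[i - 1]) != 0:
        i -= 1
    # Drop the base rune in front of them as well, if there is one.
    return s[:max(i - 1, 0)]

# 'a', then 'e' with a combining acute accent.
latin = "ae\u0301"
# Hebrew shin followed by the point qamats (a combining mark).
hebrew = "\u05e9\u05b8"
```

Here backspace_rune(hebrew) leaves the bare letter behind, while
backspace_cluster(hebrew) removes letter and point in one hit. Either
policy is a few lines; the open question in the thread is which one users
of a given script expect.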
On Sun Oct 23 16:27:01 +0200 2022, ori@eigenstate.org wrote:
> Currently 9front only works well with the latinoid languages, and
> with Moody's work, I suspect passably with Chinese and Japanese.
>
> Languages like Hebrew (the only non-Latin language I know) are
> unusable, though with Hebrew the larger problem is the lack of
> right to left support.
>
> as far as what people do when it's missing? use English; I can't
> do Hebrew on 9front. doesn't work.
>
> saying "it will never work" is an option; I think our UI will
> always be in English, for example, but it seems like it would
> be a nice goal to type and view any language correctly.
>
> at the same time, it *is* a lot of complexity.
[...]
> > Again, I can't speak for anyone, but personally I'd always expect one
> > single backspace to erase any megarune, which is also what I've seen
> > in virtual keyboards on the shitphones I've touched. I'd be very
> > confused if some characters in the middle of a sentence refuse to be
> > removed at once.
>
> In Hebrew input, at least, I'd expect the opposite; I think it will
> end up depending on culture what behavior is 'normal', but on this
> sort of thing, we can set our own expectations.

Makes sense. There are other scripts that are very difficult as well
and, as you say, put together it involves quite a lot of work. There
are lots of ideas about changing rio and libdraw and whatnot in some
future, but perhaps in the meantime, if these unicodedata and
composition changes would actually help on their own, imho it makes
sense to pursue them.

Just my 2¢, and thanks for clarifying :)

Cheers,
qwx