zsh-workers
* [BUG] ZLE character width with emoji presentation variation selectors in Unicode
@ 2024-05-09 14:45 Advait Maybhate
  2024-05-10  9:37 ` Mikael Magnusson
  0 siblings, 1 reply; 8+ messages in thread
From: Advait Maybhate @ 2024-05-09 14:45 UTC (permalink / raw)
  To: zsh-workers; +Cc: Aloke Desai, Zach Bai


Hey folks!


Wanted to file a bug report/get a discussion going on the best way to
handle emoji variation selectors with Unicode characters.


Metadata:

Zsh version: zsh 5.9 (x86_64-apple-darwin23.0), OS version: macOS Sonoma
14.3.1

Terminal: tested across Warp, Kitty, default Mac terminal, Alacritty, iTerm
2

ZLE incorrectly treats a character followed by the emoji variation selector
as 1 cell wide instead of 2 cells wide, causing off-by-one cursor movement
issues in terminals that (correctly) treat it as 2 cells wide.

This is most easily reproduced in Kitty (v0.34), which renders and
calculates these emojis as 2 cells (most terminal emulators seem to
handle this case of Unicode incorrectly).

To repro:

   - Paste the command “echo ☁️” into Kitty (the last character is U+2601
     followed by U+FE0F). Note that the paste goes through bracketed paste
     mode in Zsh.


Expected behavior:

   - ZLE contains “echo ☁️”.


Actual behavior:

   - ZLE contains “eecho ☁️” (note the additional “e” at the beginning,
     shown in inverted colors from the bracketed paste). The PTY recording
     confirms that this is due to an off-by-one in the cursor-movement
     instruction.
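
To make the repro concrete, here is a minimal Python sketch that lists what the pasted cloud actually consists of. The helper name `describe` is only illustrative; the point is that the "character" is two code points, a base symbol plus a selector.

```python
import unicodedata

# The cloud sequence from the repro: base symbol followed by the selector.
pasted = "\u2601\ufe0f"

def describe(text):
    """Return (code point, Unicode name, UTF-8 bytes as hex) per code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex())
        for ch in text
    ]

for row in describe(pasted):
    print(row)
```

Any width calculation that looks at one code point at a time sees these two entries separately, which is where the disagreement with the terminal can start.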


Screenshot: link
<https://github.com/warpdotdev/Warp/assets/12927474/b8ae2aae-7be4-4a9b-a471-423d098b5c8a>


I’d love to discuss how to fix this for terminals that do respect variation
selectors. One way to do this could be via a new `terminfo` entry, but I’d
love to know what Zsh devs think! I’m an engineer building the Warp
terminal, so I’d be happy to work on any terminal-side changes for this with
`terminfo` (we actually use bracketed paste mode for all commands, to best
support multiline commands with Warp's input editor)!

Notably, Fish 3.6 seems to calculate the width correctly as 2 cells (this
is what originally prompted my investigation, due to the Starship prompt -
see fish-shell/issues/10461
<https://github.com/fish-shell/fish-shell/issues/10461>), along with Bash
(using bracketed paste with Bash 5.2).


I’ve seen 2017/msg00432 <https://www.zsh.org/mla/users/2017/msg00432.html>
which is related to this, but deals with U+FE0E, not U+FE0F.


Thanks!


Best,

Advait


* Re: [BUG] ZLE character width with emoji presentation variation selectors in Unicode
  2024-05-09 14:45 [BUG] ZLE character width with emoji presentation variation selectors in Unicode Advait Maybhate
@ 2024-05-10  9:37 ` Mikael Magnusson
  2024-05-10  9:54   ` Mikael Magnusson
  0 siblings, 1 reply; 8+ messages in thread
From: Mikael Magnusson @ 2024-05-10  9:37 UTC (permalink / raw)
  To: Advait Maybhate; +Cc: zsh-workers

On Thu, May 9, 2024 at 4:46 PM Advait Maybhate <advait@warp.dev> wrote:
>
> Hey folks!
>
>
> Wanted to file a bug report/get a discussion going on the best way to handle emoji variation selectors with Unicode characters.
>
>
> Metadata:
>
> Zsh version: zsh 5.9 (x86_64-apple-darwin23.0), OS version: macOS Sonoma 14.3.1
>
> Terminal: tested across Warp, Kitty, default Mac terminal, Alacritty, iTerm 2
>
>
> ZLE incorrectly treats characters with the emoji variation selector as 1 character instead of 2 characters, causing off-by-one cursor movement issues in terminals that (correctly) treat it as 2 characters.
>
>
> This is most easily reproduced in Kitty (v0.34), which renders and calculates these emojis as 2 cells (most terminal emulators seem to incorrectly handle this case of Unicode).
>
>
> To repro:
>
> Paste in the command “echo ☁️” into Kitty (the last character is \0x2601 followed by \0xFE0F). Note that this results in bracketed paste mode in Zsh.
>
>
> Expected behavior:
>
> ZLE contains “echo ☁️”.
>
>
> Actual behavior:
>
> ZLE contains “eecho ☁️” (note the additional “e” at the beginning here - inverted colors from the bracketed paste). Confirmed that this is due to an off-by-one on the cursor instruction, from the PTY recording.
>
>
> Screenshot: link
>
>
> I’d love to discuss how to fix this for terminals that do respect variation selectors. One way to do this could be via a new `terminfo` entry, but I’d love to know what ZSH devs think! I’m an engineer building the Warp terminal, so I’d be happy to work on any terminal-side changes of this with `terminfo` (we actually use bracketed paste mode for all commands, to best support multiline commands with Warp's input editor)!
>
>
> Notably, Fish 3.6 seems to calculate the width correctly as 2 cells (this is what originally prompted my investigation, due to the Starship prompt - see fish-shell/issues/10461), along with Bash (using bracketed paste with Bash 5.2).
>
>
> I’ve seen 2017/msg00432 which is related to this, but deals with 0xFE0E not 0xFE0F.

Generally speaking, it is impossible to handle combining emoji: since
the specification allows the rendering to either combine or not
combine the glyphs, zsh cannot know how much space they will take up.
Of course, your problem isn't even about combining emoji, but as far
as I can see the same conceptual problem applies here: there is no way
for zsh to know what "render as an image" implies for glyph width; all
we can do is call wcwidth. I took a quick look at some Unicode emoji
standards pages and none of them even mention the word "width". If you
can find an authoritative part of the standard talking about emoji
width, feel free to link it... In my terminal your example renders as
1 glyph wide, which agrees with zsh's guess, and I don't get any
display errors.

-- 
Mikael Magnusson



* Re: [BUG] ZLE character width with emoji presentation variation selectors in Unicode
  2024-05-10  9:37 ` Mikael Magnusson
@ 2024-05-10  9:54   ` Mikael Magnusson
  2024-05-10 17:11     ` Advait Maybhate
  0 siblings, 1 reply; 8+ messages in thread
From: Mikael Magnusson @ 2024-05-10  9:54 UTC (permalink / raw)
  To: Advait Maybhate; +Cc: zsh-workers

On Fri, May 10, 2024 at 11:37 AM Mikael Magnusson <mikachu@gmail.com> wrote:
>
> On Thu, May 9, 2024 at 4:46 PM Advait Maybhate <advait@warp.dev> wrote:
> >
> > Hey folks!
> >
> >
> > Wanted to file a bug report/get a discussion going on the best way to handle emoji variation selectors with Unicode characters.
> >
> >
> > Metadata:
> >
> > Zsh version: zsh 5.9 (x86_64-apple-darwin23.0), OS version: macOS Sonoma 14.3.1
> >
> > Terminal: tested across Warp, Kitty, default Mac terminal, Alacritty, iTerm 2
> >
> >
> > ZLE incorrectly treats characters with the emoji variation selector as 1 character instead of 2 characters, causing off-by-one cursor movement issues in terminals that (correctly) treat it as 2 characters.
> >
> >
> > This is most easily reproduced in Kitty (v0.34), which renders and calculates these emojis as 2 cells (most terminal emulators seem to incorrectly handle this case of Unicode).
> >
> >
> > To repro:
> >
> > Paste in the command “echo ☁️” into Kitty (the last character is \0x2601 followed by \0xFE0F). Note that this results in bracketed paste mode in Zsh.
> >
> >
> > Expected behavior:
> >
> > ZLE contains “echo ☁️”.
> >
> >
> > Actual behavior:
> >
> > ZLE contains “eecho ☁️” (note the additional “e” at the beginning here - inverted colors from the bracketed paste). Confirmed that this is due to an off-by-one on the cursor instruction, from the PTY recording.
> >
> >
> > Screenshot: link
> >
> >
> > I’d love to discuss how to fix this for terminals that do respect variation selectors. One way to do this could be via a new `terminfo` entry, but I’d love to know what ZSH devs think! I’m an engineer building the Warp terminal, so I’d be happy to work on any terminal-side changes of this with `terminfo` (we actually use bracketed paste mode for all commands, to best support multiline commands with Warp's input editor)!
> >
> >
> > Notably, Fish 3.6 seems to calculate the width correctly as 2 cells (this is what originally prompted my investigation, due to the Starship prompt - see fish-shell/issues/10461), along with Bash (using bracketed paste with Bash 5.2).
> >
> >
> > I’ve seen 2017/msg00432 which is related to this, but deals with 0xFE0E not 0xFE0F.
>
> Generally speaking it is impossible to handle combining emoji, since
> the specification allows the rendering to either combine or not
> combine the glyphs, it is not possible for zsh to know how much space
> they will take up. Of course, your problem isn't even about combining
> emoji, but as far as I can see the same conceptual problem applies
> here; there is no way for zsh to know what "render as an image"
> implies for glyph width, all we can do is call wcwidth.

I also meant to say: if wcwidth for the base glyph is 1, then adding a
composing character with a width of 0 after it will not magically
change the width of the base glyph, and cannot do so.
https://www.unicode.org/reports/tr51/ does mention that "Current
practice is for emoji to have a square aspect ratio, deriving from
their origin in Japanese. For interoperability, it is recommended that
this practice be continued with current and future emoji. They will
typically have about the same vertical placement and advance width as
CJK ideographs." But zsh cannot have some custom tables of emoji
widths; either wcwidth works correctly or it doesn't.
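
The per-code-point arithmetic described here can be sketched in Python. This is not zsh's code and not the C library's wcwidth; `codepoint_width` is a rough stand-in built on Python's `unicodedata`, assumed only for illustration.

```python
import unicodedata

def codepoint_width(ch):
    """Rough stand-in for wcwidth(): one code point in, one cell count out."""
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0  # combining marks and format characters occupy no cells
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # East Asian wide or fullwidth
    return 1

def naive_width(text):
    """Sum per-code-point widths, with no knowledge of sequences."""
    return sum(codepoint_width(ch) for ch in text)

# Base glyph (width 1) plus selector (width 0) gives 1.
print(naive_width("\u2601\ufe0f"))
```

With this model a zero-width code point can never raise the total, so the pasted command is counted one cell shorter than a terminal that draws the cloud in 2 cells.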

-- 
Mikael Magnusson



* Re: [BUG] ZLE character width with emoji presentation variation selectors in Unicode
  2024-05-10  9:54   ` Mikael Magnusson
@ 2024-05-10 17:11     ` Advait Maybhate
  2024-05-10 18:57       ` Mikael Magnusson
  2024-05-10 20:40       ` Bart Schaefer
  0 siblings, 2 replies; 8+ messages in thread
From: Advait Maybhate @ 2024-05-10 17:11 UTC (permalink / raw)
  To: Mikael Magnusson; +Cc: zsh-workers, Aloke Desai, Zach Bai


Gotcha, thanks for the context! Combining emojis are weird :)

Hmm, agreed that it won't be possible to use the same standard across all
terminals - hence, I was thinking terminfo would allow the terminal to
indicate whether it supports these variation selectors with wide characters?

Yep, I was referencing TR51 from Unicode as well (emoji presentation
selectors
<https://www.unicode.org/reports/tr51/#def_emoji_presentation_selector>).

For examples of display errors/differences across terminals
for U+2601 U+FE0F (images hosted on GitHub to avoid embeds here):
- Kitty
  <https://github.com/warpdotdev/Warp/assets/12927474/b8ae2aae-7be4-4a9b-a471-423d098b5c8a>:
  the prior example with bracketed paste. Kitty renders this as 2 cells
  wide and the width is computed as 2 cells wide.
- Default Mac terminal
  <https://github.com/warpdotdev/Warp/assets/12927474/a4af9db8-7741-4607-aab4-7dc170e9baa2>:
  rendered as 2 cells wide, but the width is computed as 1 cell wide.
  Results in the next character overlapping the emoji.
- iTerm 2
  <https://github.com/warpdotdev/Warp/assets/12927474/ea082464-0856-4e46-89aa-59d4014949f2>:
  same as the default Mac terminal (next char overlaps).
- Alacritty
  <https://github.com/warpdotdev/Warp/assets/12927474/e134c39a-99d6-4f6b-84c5-86c8ae1edf51>:
  renders as 1 cell wide and the width is also computed as 1 cell wide.
  Essentially ignores the emoji variation selector.
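
The four observations can be summarised in a small Python sketch. The numbers are the ones reported in this message; the labels and the `classify` helper are my own shorthand, and the shell width of 1 is what zsh assumes according to this thread.

```python
# (cells the glyph is drawn in, cells the cursor advances) for U+2601 U+FE0F,
# as reported in this thread.
OBSERVED = {
    "Kitty": (2, 2),
    "Terminal.app": (2, 1),
    "iTerm2": (2, 1),
    "Alacritty": (1, 1),
}

SHELL_WIDTH = 1  # the width zsh assumes, per this thread

def classify(rendered, advance, shell_width=SHELL_WIDTH):
    """Label what the user sees for one terminal."""
    if advance != shell_width:
        return "cursor drift"   # shell and terminal disagree on the column
    if rendered > advance:
        return "glyph overlap"  # next character is drawn over the emoji
    return "consistent"

for name, (rendered, advance) in OBSERVED.items():
    print(f"{name}: {classify(rendered, advance)}")
```

Only the terminal whose cursor advance differs from the shell's assumption shows the off-by-one; the others either overlap glyphs or ignore the selector.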

In fish's case, I believe they use ridiculousfish/widecharwidth
<https://github.com/ridiculousfish/widecharwidth> which does seem to handle
emoji presentation selectors. unicode-width, a Rust crate from the
unicode-rs project, recently added support for correctly reporting the
width of these sequences as well: unicode-width/pull/41
<https://github.com/unicode-rs/unicode-width/pull/41>. I believe the
width for something like U+2601 U+FE0F should be 2 (assuming the
terminal supports it)?

From looking a bit into wcwidth, it seems like it doesn't inherently
support width for a sequence of code points. I just tried this out in C++
with ICU (International Components for Unicode library) and grapheme
clusters to demonstrate the width calculation as 2 with this sequence:
gist.github.com/Advait-M/a326cd2e474b9520dc893765ec4cb2c4.
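
For readers without ICU at hand, a sequence-aware calculation can be sketched in Python. This is not the code from the gist and not what widecharwidth or unicode-width do internally; it assumes, as a simplification, that any 1-cell base followed by U+FE0F becomes 2 cells, whereas a real implementation would check the base against the list of valid emoji variation sequences.

```python
import unicodedata

VS16 = "\ufe0f"  # emoji presentation selector

def codepoint_width(ch):
    """Rough per-code-point width, similar in spirit to wcwidth()."""
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1

def sequence_width(text):
    """Width that widens a 1-cell base when VS16 follows it (simplified)."""
    total = 0
    for i, ch in enumerate(text):
        width = codepoint_width(ch)
        following = text[i + 1] if i + 1 < len(text) else ""
        if following == VS16 and width == 1:
            width = 2  # assumption: emoji presentation takes 2 cells
        total += width
    return total

print(sequence_width("\u2601"), sequence_width("\u2601\ufe0f"))
```

The difference from a per-code-point sum is the lookahead: the width of the base depends on what follows it, which a single-argument wcwidth call cannot express.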

Best,
Advait


* Re: [BUG] ZLE character width with emoji presentation variation selectors in Unicode
  2024-05-10 17:11     ` Advait Maybhate
@ 2024-05-10 18:57       ` Mikael Magnusson
  2024-05-14  0:08         ` Advait Maybhate
  2024-05-10 20:40       ` Bart Schaefer
  1 sibling, 1 reply; 8+ messages in thread
From: Mikael Magnusson @ 2024-05-10 18:57 UTC (permalink / raw)
  To: Advait Maybhate; +Cc: zsh-workers

On Fri, May 10, 2024 at 7:12 PM Advait Maybhate <advait@warp.dev> wrote:
>
> Gotcha, thanks for the context! Combining emojis are weird :)
>
> Hmm, agreed that it won't be possible to use the same standard across all terminals - hence, I was thinking terminfo would allow the terminal to indicate whether it supports these variation selectors with wide characters?
>
> Yep, I was referencing TR51 from Unicode as well (emoji presentation selectors).

From what I could tell (I'm not an expert), there is no phrasing that
implies the width should be different for the emoji presentation form
and the text presentation form.

> From looking a bit into wcwidth, it seems like it doesn't inherently support width for a sequence of code points. I just tried this out in C++ with ICU (International Components for Unicode library) and grapheme clusters to demonstrate the width calculation as 2 with this sequence: gist.github.com/Advait-M/a326cd2e474b9520dc893765ec4cb2c4.

Yes, normal compose sequences are a base character with a width, and
composing characters with 0 width (but effectively rendering to the
left of the insertion point, on top of the base character.)

-- 
Mikael Magnusson



* Re: [BUG] ZLE character width with emoji presentation variation selectors in Unicode
  2024-05-10 17:11     ` Advait Maybhate
  2024-05-10 18:57       ` Mikael Magnusson
@ 2024-05-10 20:40       ` Bart Schaefer
  2024-05-14  0:04         ` Advait Maybhate
  1 sibling, 1 reply; 8+ messages in thread
From: Bart Schaefer @ 2024-05-10 20:40 UTC (permalink / raw)
  To: Advait Maybhate; +Cc: zsh-workers, Aloke Desai, Zach Bai

On Fri, May 10, 2024 at 10:12 AM Advait Maybhate <advait@warp.dev> wrote:
>
> In fish's case, I believe they use ridiculousfish/widecharwidth which does seem to handle emoji presentation selectors. unicode-width, part of the Rust stdlib, recently added support for correctly reporting the width of these sequences as well: unicode-width/pull/41.

Note that if your primary concern is emojis in prompts (rather than in
text typed as command input), zsh has the %G (for "glitch") prompt
sequence.  So if you write e.g.

PS1="Cloudy %{☁️%2G%}% "

then zsh will correctly reserve 2 positions for the glyph when
calculating the prompt. (Note gmail may have messed up the copy-paste
of the emoji; do it right and it works).
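
If prompts are generated by a tool, the %{...%G%} wrapping can be produced mechanically. A small Python sketch follows; the `glitch` helper is hypothetical and only builds the string, and the prompt here ends in a plain space rather than reproducing the trailing characters of the example above.

```python
def glitch(glyph, cells):
    """Wrap a glyph so zsh reserves `cells` columns for it in a prompt.

    Inside %{...%}, %<n>G tells zsh to assume n character widths of output.
    """
    return "%{" + glyph + f"%{cells}G" + "%}"

# Cloud with emoji presentation selector, reserved as 2 cells.
ps1 = "Cloudy " + glitch("\u2601\ufe0f", 2) + " "
print(ps1)
```

This only affects prompt width accounting; it does nothing for text in the editing buffer.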



* Re: [BUG] ZLE character width with emoji presentation variation selectors in Unicode
  2024-05-10 20:40       ` Bart Schaefer
@ 2024-05-14  0:04         ` Advait Maybhate
  0 siblings, 0 replies; 8+ messages in thread
From: Advait Maybhate @ 2024-05-14  0:04 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: zsh-workers, Aloke Desai, Zach Bai


Ah cool - didn't know about the glitch prompt. Thanks for the info!

Hmm, though the width reservation already seems to work correctly in the
prompt itself; it is the bracketed paste case that fails (when testing in
Kitty for correctness), so the glitch prompt doesn't entirely help here,
unfortunately.




* Re: [BUG] ZLE character width with emoji presentation variation selectors in Unicode
  2024-05-10 18:57       ` Mikael Magnusson
@ 2024-05-14  0:08         ` Advait Maybhate
  0 siblings, 0 replies; 8+ messages in thread
From: Advait Maybhate @ 2024-05-14  0:08 UTC (permalink / raw)
  To: Mikael Magnusson; +Cc: zsh-workers


Agreed that there's no particular phrasing for this in the Unicode spec wrt
the exact width differences. I believe that'll largely be left up to the
renderer (in web, mobile, desktop, etc. contexts).

Given that, it seems like the optimal path forward might be to ask the
terminal emulator for this information to ensure alignment in what the
shell thinks vs. the terminal (for widths)?
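
One way to "ask the terminal" without any new terminfo entry is to print the text and compare cursor position reports before and after. The Python sketch below illustrates that idea only; it is not something proposed in this thread and not how zsh works. It uses the standard cursor position request (ESC [ 6 n, answered with ESC [ row ; col R), takes hypothetical `write` and `read_reply` callables for the terminal I/O, and assumes the text does not wrap to the next line.

```python
import re

CPR_REQUEST = "\x1b[6n"                        # ask for the cursor position
CPR_REPLY = re.compile(r"\x1b\[(\d+);(\d+)R")  # reply: ESC [ row ; col R

def parse_column(reply):
    """Extract the 1-based column from a cursor position report."""
    match = CPR_REPLY.fullmatch(reply)
    if match is None:
        raise ValueError(f"not a cursor position report: {reply!r}")
    return int(match.group(2))

def measure_width(text, write, read_reply):
    """Return how many columns the terminal advanced while printing text.

    write(str) sends data to the terminal; read_reply() returns one full
    cursor position report. Both are supplied by the caller.
    """
    write("\r" + CPR_REQUEST)      # start from a known column
    start = parse_column(read_reply())
    write(text + CPR_REQUEST)      # print, then ask where the cursor ended up
    end = parse_column(read_reply())
    return end - start
```

The cost is a round trip to the terminal per measured string, so a real implementation would probably probe once at startup and cache the answer.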

Gotcha re composing characters - that makes sense, thanks for explaining!

But yep, I've got a fallback mechanism in mind here for Zsh (render as 2
cells wide but only reserve 1 cell, to match the shell, similar to iTerm).
My goal with opening this issue was to kick off a discussion on the
"correct" way to approach this in Zsh and how to best support it going
forward, since the fallback I've got in mind is suboptimal for Zsh
(compared to Bash/Fish) within Warp, for example, due to these
limitations.

Best,
Advait

