ruby-core@ruby-lang.org archive (unofficial mirror)
* [ruby-core:114936] [Ruby master Feature#19908] Update to Unicode 15.1
@ 2023-10-02  6:55 nobu (Nobuyoshi Nakada) via ruby-core
  2023-10-02 14:06 ` [ruby-core:114939] " Игорь Пятчиц via ruby-core
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: nobu (Nobuyoshi Nakada) via ruby-core @ 2023-10-02  6:55 UTC (permalink / raw)
  To: ruby-core; +Cc: nobu (Nobuyoshi Nakada)

Issue #19908 has been reported by nobu (Nobuyoshi Nakada).



----------------------------------------

Feature #19908: Update to Unicode 15.1

https://bugs.ruby-lang.org/issues/19908



* Author: nobu (Nobuyoshi Nakada)

* Status: Assigned

* Priority: Normal

* Assignee: duerst (Martin Dürst)

* Target version: 3.3

----------------------------------------

Unicode 15.1 has been released.

The current enc-unicode.rb seems to fail because of the `Indic_Conjunct_Break` properties, which have values.

I'm not sure how these properties should be handled.

`/\p{InCB_Linker}/` or `/\p{InCB=Linker}/`, as in the comments in that file?

https://github.com/nobu/ruby/tree/unicode-15.1 is the former.
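
One way to see which spelling a given Ruby build accepts is to compile the pattern at run time and rescue the error. This is only a minimal sketch: `property_supported?` is a hypothetical helper name, not part of Ruby, and the results for the `InCB` spellings depend on the build, so none are stated here.

```ruby
# Hypothetical helper: true if this Ruby's regexp engine accepts \p{name}.
def property_supported?(name)
  Regexp.new("\\p{#{name}}")
  true
rescue RegexpError
  # Unknown property names are rejected when the pattern is compiled.
  false
end

# Print one line per candidate spelling; output varies by Ruby version.
%w[Alpha InCB_Linker InCB=Linker Indic_Conjunct_Break=Linker].each do |name|
  puts format("%-30s %s", name, property_supported?(name))
end
```

Building the pattern with `Regexp.new` rather than a literal keeps an unsupported name from becoming a syntax error for the whole file.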







-- 

https://bugs.ruby-lang.org/

 ______________________________________________
 ruby-core mailing list -- ruby-core@ml.ruby-lang.org
 To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
 ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/


* [ruby-core:114939] Re: [Ruby master Feature#19908] Update to Unicode 15.1
  2023-10-02  6:55 [ruby-core:114936] [Ruby master Feature#19908] Update to Unicode 15.1 nobu (Nobuyoshi Nakada) via ruby-core
@ 2023-10-02 14:06 ` Игорь Пятчиц via ruby-core
  2023-12-26  6:52 ` [ruby-core:115899] " duerst via ruby-core
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Игорь Пятчиц via ruby-core @ 2023-10-02 14:06 UTC (permalink / raw)
  To: Ruby developers
  Cc: Игорь
	Пятчиц



🤘👍

On Mon, 2 Oct 2023 at 12:55, nobu (Nobuyoshi Nakada) via ruby-core <ruby-core@ml.ruby-lang.org> wrote:

> Issue #19908 has been reported by nobu (Nobuyoshi Nakada).
>
> Feature #19908: Update to Unicode 15.1
> https://bugs.ruby-lang.org/issues/19908


* [ruby-core:115899] [Ruby master Feature#19908] Update to Unicode 15.1
  2023-10-02  6:55 [ruby-core:114936] [Ruby master Feature#19908] Update to Unicode 15.1 nobu (Nobuyoshi Nakada) via ruby-core
  2023-10-02 14:06 ` [ruby-core:114939] " Игорь Пятчиц via ruby-core
@ 2023-12-26  6:52 ` duerst via ruby-core
  2023-12-26 11:42 ` [ruby-core:115906] " duerst via ruby-core
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: duerst via ruby-core @ 2023-12-26  6:52 UTC (permalink / raw)
  To: ruby-core; +Cc: duerst

Issue #19908 has been updated by duerst (Martin Dürst).





There is a more serious issue than just whether to use an '_' or an '=' in the property name: Unicode 15.1 makes some serious changes to grapheme clusters.



Our implementation (function 'node_extended_grapheme_cluster' in regparse.c) is based on Unicode 11.0, in particular https://www.unicode.org/reports/tr29/tr29-33.html#Grapheme_Cluster_Boundaries. This is quite a bit different from the current version at https://www.unicode.org/reports/tr29/tr29-43.html#Grapheme_Cluster_Boundaries. One major difference is that for Unicode 11.0, there was a regular expression for grapheme clusters, which I just implemented in the above function. Unicode 15.1 just says that it's possible to use a regular expression, but doesn't give this regular expression.



From reading through https://www.unicode.org/versions/Unicode15.1.0/#Migration, that's the main issue affecting Ruby.
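
For context, the grapheme cluster rules surface in Ruby through `\X` in regular expressions and through `String#grapheme_clusters`. The sketch below uses cases whose result is the same under Unicode 11.0 and 15.1. The last line is the kind of Devanagari sequence that the 15.1 rules treat differently, so its result is printed without an expected value.

```ruby
# Combining sequence: "e" + U+0301 COMBINING ACUTE ACCENT is one cluster.
p "e\u0301".grapheme_clusters.size          # => 1

# CR LF is kept together as a single cluster.
p "\r\n".scan(/\X/).size                    # => 1

# Two regional indicators (a flag) form one cluster.
p "\u{1F1EF 1F1F5}".grapheme_clusters.size  # => 1

# Consonant + virama + consonant: the count depends on the Unicode
# version this Ruby was built against, so no expected value is given.
p "\u0915\u094D\u0937".grapheme_clusters.size
```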



----------------------------------------

Feature #19908: Update to Unicode 15.1

https://bugs.ruby-lang.org/issues/19908#change-105854




* [ruby-core:115906] [Ruby master Feature#19908] Update to Unicode 15.1
  2023-10-02  6:55 [ruby-core:114936] [Ruby master Feature#19908] Update to Unicode 15.1 nobu (Nobuyoshi Nakada) via ruby-core
  2023-10-02 14:06 ` [ruby-core:114939] " Игорь Пятчиц via ruby-core
  2023-12-26  6:52 ` [ruby-core:115899] " duerst via ruby-core
@ 2023-12-26 11:42 ` duerst via ruby-core
  2024-01-06 21:28 ` [ruby-core:116056] " janosch-x via ruby-core
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: duerst via ruby-core @ 2023-12-26 11:42 UTC (permalink / raw)
  To: ruby-core; +Cc: duerst

Issue #19908 has been updated by duerst (Martin Dürst).


@nobu:
We have `Grapheme_Cluster_Break=...`, so I think '=' may be appropriate. But `Grapheme_Cluster_Break=...` uses a long, explicit name. So shouldn't it be `Indic_Conjunct_Break=...`, not just `InCB=...`?

----------------------------------------
Feature #19908: Update to Unicode 15.1
https://bugs.ruby-lang.org/issues/19908#change-105861


* [ruby-core:116056] [Ruby master Feature#19908] Update to Unicode 15.1
  2023-10-02  6:55 [ruby-core:114936] [Ruby master Feature#19908] Update to Unicode 15.1 nobu (Nobuyoshi Nakada) via ruby-core
                   ` (2 preceding siblings ...)
  2023-12-26 11:42 ` [ruby-core:115906] " duerst via ruby-core
@ 2024-01-06 21:28 ` janosch-x via ruby-core
  2024-01-09  1:25 ` [ruby-core:116099] " duerst via ruby-core
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: janosch-x via ruby-core @ 2024-01-06 21:28 UTC (permalink / raw)
  To: ruby-core; +Cc: janosch-x

Issue #19908 has been updated by janosch-x (Janosch Müller).





Isn't [this](https://www.unicode.org/reports/tr29/tr29-43.html#Regex_Definitions) the updated regular expression?



```diff

 ccs-base :=     [\p{L}\p{N}\p{P}\p{S}\p{Zs}]
 ccs-extend :=  [\p{M}\p{Join_Control}]
 extended_base :=       ccs-base
 | hangul-syllable
-crlf :=        CR LF
+crlf :=        CR LF | CR | LF
 legacy-core := hangul-syllable
 | ri-sequence
 | xpicto-sequence
 legacy-postcore :=    [Extend ZWJ]
 core :=        hangul-syllable
 | ri-sequence
 | xpicto-sequence
+| conjunctCluster
 | [^Control CR LF]
 postcore :=    [Extend ZWJ SpacingMark]
 precore :=     Prepend
 hangul-syllable :=    L* (V+ | LV V* | LVT) T*
 | L+
 | T+
 xpicto-sequence :=     \p{Extended_Pictographic} (Extend* ZWJ \p{Extended_Pictographic})*
+conjunctCluster :=     \p{InCB=Consonant} ([\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant})+
```
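
To make the added `conjunctCluster` rule concrete, here is a rough sketch of it in Ruby with the `InCB` classes replaced by small hand-picked Devanagari subsets, since `\p{InCB=...}` is exactly what is not yet settled. The constant names and the subsets are assumptions for illustration only; the real sets come from the Unicode data files.

```ruby
# Illustrative subsets only (Devanagari); not the full InCB value sets.
CONSONANT   = "[\u0915-\u0939]"        # KA..HA
LINKER      = "\u094D"                 # VIRAMA
EXT_OR_LINK = "[\u093C\u200D\u094D]"   # NUKTA, ZWJ, plus the linker itself

# conjunctCluster := Consonant ([Extend Linker]* Linker [Extend Linker]* Consonant)+
CONJUNCT_CLUSTER =
  /#{CONSONANT}(?:#{EXT_OR_LINK}*#{LINKER}#{EXT_OR_LINK}*#{CONSONANT})+/

ksha = "\u0915\u094D\u0937"                 # KA + VIRAMA + SSA
p CONJUNCT_CLUSTER.match?(ksha)             # => true
p CONJUNCT_CLUSTER.match?("\u0915\u0937")   # => false (no linker between the consonants)
p ksha[CONJUNCT_CLUSTER] == ksha            # => true (the whole string is one match)
```

The rule requires at least one linker between consecutive consonants, which is why two bare consonants do not match.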



----------------------------------------

Feature #19908: Update to Unicode 15.1

https://bugs.ruby-lang.org/issues/19908#change-106054




* [ruby-core:116099] [Ruby master Feature#19908] Update to Unicode 15.1
  2023-10-02  6:55 [ruby-core:114936] [Ruby master Feature#19908] Update to Unicode 15.1 nobu (Nobuyoshi Nakada) via ruby-core
                   ` (3 preceding siblings ...)
  2024-01-06 21:28 ` [ruby-core:116056] " janosch-x via ruby-core
@ 2024-01-09  1:25 ` duerst via ruby-core
  2024-09-12  1:56 ` [ruby-core:119128] " hsbt (Hiroshi SHIBATA) via ruby-core
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: duerst via ruby-core @ 2024-01-09  1:25 UTC (permalink / raw)
  To: ruby-core; +Cc: duerst

Issue #19908 has been updated by duerst (Martin Dürst).





@janosch-x You are correct, thanks! I noticed it a few days ago, but hadn't yet gotten around to writing about it here. You beat me to it!



----------------------------------------

Feature #19908: Update to Unicode 15.1

https://bugs.ruby-lang.org/issues/19908#change-106096




* [ruby-core:119128] [Ruby master Feature#19908] Update to Unicode 15.1
  2023-10-02  6:55 [ruby-core:114936] [Ruby master Feature#19908] Update to Unicode 15.1 nobu (Nobuyoshi Nakada) via ruby-core
                   ` (4 preceding siblings ...)
  2024-01-09  1:25 ` [ruby-core:116099] " duerst via ruby-core
@ 2024-09-12  1:56 ` hsbt (Hiroshi SHIBATA) via ruby-core
  2024-09-12  3:21 ` [ruby-core:119130] " duerst via ruby-core
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: hsbt (Hiroshi SHIBATA) via ruby-core @ 2024-09-12  1:56 UTC (permalink / raw)
  To: ruby-core; +Cc: hsbt (Hiroshi SHIBATA)

Issue #19908 has been updated by hsbt (Hiroshi SHIBATA).


Unicode 16.0 has been released.

https://www.unicode.org/versions/Unicode16.0.0/

Should we move to this instead of 15.1?

----------------------------------------
Feature #19908: Update to Unicode 15.1
https://bugs.ruby-lang.org/issues/19908#change-109722


* [ruby-core:119130] [Ruby master Feature#19908] Update to Unicode 15.1
  2023-10-02  6:55 [ruby-core:114936] [Ruby master Feature#19908] Update to Unicode 15.1 nobu (Nobuyoshi Nakada) via ruby-core
                   ` (5 preceding siblings ...)
  2024-09-12  1:56 ` [ruby-core:119128] " hsbt (Hiroshi SHIBATA) via ruby-core
@ 2024-09-12  3:21 ` duerst via ruby-core
  2024-09-12  3:53 ` [ruby-core:119131] " hsbt (Hiroshi SHIBATA) via ruby-core
  2025-01-01 15:06 ` [ruby-core:120460] " ima1zumi (Mari Imaizumi) via ruby-core
  8 siblings, 0 replies; 10+ messages in thread
From: duerst via ruby-core @ 2024-09-12  3:21 UTC (permalink / raw)
  To: ruby-core; +Cc: duerst

Issue #19908 has been updated by duerst (Martin Dürst).


hsbt (Hiroshi SHIBATA) wrote in #note-8:
> Unicode 16.0 has been released.

> Should we move this instead of 15.1?

I think it's more prudent to do 15.1 first, then 16.0. I hope to be able to work on this soon. I created a separate issue for 16.0.

----------------------------------------
Feature #19908: Update to Unicode 15.1
https://bugs.ruby-lang.org/issues/19908#change-109725


* [ruby-core:119131] [Ruby master Feature#19908] Update to Unicode 15.1
  2023-10-02  6:55 [ruby-core:114936] [Ruby master Feature#19908] Update to Unicode 15.1 nobu (Nobuyoshi Nakada) via ruby-core
                   ` (6 preceding siblings ...)
  2024-09-12  3:21 ` [ruby-core:119130] " duerst via ruby-core
@ 2024-09-12  3:53 ` hsbt (Hiroshi SHIBATA) via ruby-core
  2025-01-01 15:06 ` [ruby-core:120460] " ima1zumi (Mari Imaizumi) via ruby-core
  8 siblings, 0 replies; 10+ messages in thread
From: hsbt (Hiroshi SHIBATA) via ruby-core @ 2024-09-12  3:53 UTC (permalink / raw)
  To: ruby-core; +Cc: hsbt (Hiroshi SHIBATA)

Issue #19908 has been updated by hsbt (Hiroshi SHIBATA).


> I think it's more prudent to do 15.1 first, then 16.0.

Agreed, thanks!

----------------------------------------
Feature #19908: Update to Unicode 15.1
https://bugs.ruby-lang.org/issues/19908#change-109726


* [ruby-core:120460] [Ruby master Feature#19908] Update to Unicode 15.1
  2023-10-02  6:55 [ruby-core:114936] [Ruby master Feature#19908] Update to Unicode 15.1 nobu (Nobuyoshi Nakada) via ruby-core
                   ` (7 preceding siblings ...)
  2024-09-12  3:53 ` [ruby-core:119131] " hsbt (Hiroshi SHIBATA) via ruby-core
@ 2025-01-01 15:06 ` ima1zumi (Mari Imaizumi) via ruby-core
  8 siblings, 0 replies; 10+ messages in thread
From: ima1zumi (Mari Imaizumi) via ruby-core @ 2025-01-01 15:06 UTC (permalink / raw)
  To: ruby-core; +Cc: ima1zumi (Mari Imaizumi)

Issue #19908 has been updated by ima1zumi (Mari Imaizumi).


@duerst

I'm interested in working on this issue. Are you planning to start it? If not, I'd like to try.

----------------------------------------
Feature #19908: Update to Unicode 15.1
https://bugs.ruby-lang.org/issues/19908#change-111243


Thread overview: 10+ messages
2023-10-02  6:55 [ruby-core:114936] [Ruby master Feature#19908] Update to Unicode 15.1 nobu (Nobuyoshi Nakada) via ruby-core
2023-10-02 14:06 ` [ruby-core:114939] " Игорь Пятчиц via ruby-core
2023-12-26  6:52 ` [ruby-core:115899] " duerst via ruby-core
2023-12-26 11:42 ` [ruby-core:115906] " duerst via ruby-core
2024-01-06 21:28 ` [ruby-core:116056] " janosch-x via ruby-core
2024-01-09  1:25 ` [ruby-core:116099] " duerst via ruby-core
2024-09-12  1:56 ` [ruby-core:119128] " hsbt (Hiroshi SHIBATA) via ruby-core
2024-09-12  3:21 ` [ruby-core:119130] " duerst via ruby-core
2024-09-12  3:53 ` [ruby-core:119131] " hsbt (Hiroshi SHIBATA) via ruby-core
2025-01-01 15:06 ` [ruby-core:120460] " ima1zumi (Mari Imaizumi) via ruby-core
