mailing list of musl libc
 help / color / mirror / code / Atom feed
* [musl] Paintcans for reverse iterating strings
@ 2025-06-29  0:08 Rich Felker
  2025-06-29  3:51 ` Rich Felker
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Rich Felker @ 2025-06-29  0:08 UTC (permalink / raw)
  To: musl

One thing we're going to need for LC_COLLATE in locales where
second-level weights are applied in reverse order (diacritic marks
later in the string weigh more than earlier ones) is the ability to
traverse (& live transform to NFD) the input string in reverse.

In order not to take quadratic time doing this, we need a strategy for
laying down a logarithmic (thus bounded by a small constant) number of
waypoints ("paintcans") we can resume forward-processing at to
simulate reverse-order random access.

I'm posting here a proposal for comments on whether this seems fairly
optimal or has any obvious improvements:


Maintain an array of up to N waypoints, initially populated just by
the halfway point.

When requesting a position, first remove any waypoints past the
requested position from the array, then start from the last remaining
waypoint and work forward. When crossing the halfway point between the
last waypoint and the desired position, if the waypoint array is not
already full, add the halfway position as a waypoint.


With unlimited N, I'm pretty sure this gives O(N log N) total
traversal time. In a finite real-machine model, N=8*sizeof(size_t)
then gives the same. And if we know ahead of time that the size of the
input is way less than the maximum possible length (note: total length
is known from a previous pass), we can pre-populate linear-spaced
early waypoints or space them differently (like 1/4 instead of 1/2 the
way to the target) so that less time is spent backtracking (I haven't
yet thought much about which is best).

Note that N=64 is no problem in terms of storage space; it's under 4k.
And the majority would not even get used/touched except on
pathologically long inputs to strcoll/strxfrm, and only in the case of
a locale with reversed second-level weights.

Rich

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] Paintcans for reverse iterating strings
  2025-06-29  0:08 [musl] Paintcans for reverse iterating strings Rich Felker
@ 2025-06-29  3:51 ` Rich Felker
  2025-06-29  7:58 ` Alexander Monakov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Rich Felker @ 2025-06-29  3:51 UTC (permalink / raw)
  To: musl

On Sat, Jun 28, 2025 at 08:08:43PM -0400, Rich Felker wrote:
> One thing we're going to need for LC_COLLATE in locales where
> second-level weights are applied in reverse order (diacritic marks
> later in the string weigh more than earlier ones) is the ability to
> traverse (& live transform to NFD) the input string in reverse.
> 
> In order not to take quadratic time doing this, we need a strategy for
> laying down a logarithmic (thus bounded by a small constant) number of
> waypoints ("paintcans") we can resume forward-processing at to
> simulate reverse-order random access.
> 
> I'm posting here a proposal for comments on whether this seems fairly
> optimal or has any obvious improvements:
> 
> 
> Maintain an array of up to N waypoints, initially populated just by
> the halfway point.
> 
> When requesting a position, first remove any waypoints past the
> requested position from the array, then start from the last remaining
> waypoint and work forward. When crossing the halfway point between the
> last waypoint and the desired position, if the waypoint array is not
> already full, add the halfway position as a waypoint.
> 
> 
> With unlimited N, I'm pretty sure this gives O(N log N) total
> traversal time. In a finite real-machine model, N=8*sizeof(size_t)
> then gives the same. And if we know ahead of time that the size of the
> input is way less than the maximum possible length (note: total length
> is known from a previous pass), we can pre-populate linear-spaced
> early waypoints or space them differently (like 1/4 instead of 1/2 the
> way to the target) so that less time is spent backtracking (I haven't
> yet thought much about which is best).
> 
> Note that N=64 is no problem in terms of storage space; it's under 4k.
> And the majority would not even get used/touched except on
> pathologically long inputs to strcoll/strxfrm, and only in the case of
> a locale with reversed second-level weights.

At least empirically this seems to work. I'm getting expected output
and for a 21-character string it performs the forward-iterate
operation 63 times, which is not bad. This is without any logic to
emit extra waypoints when log(len) is small and only a few slots are
needed for the halfway steps.

Rich

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] Paintcans for reverse iterating strings
  2025-06-29  0:08 [musl] Paintcans for reverse iterating strings Rich Felker
  2025-06-29  3:51 ` Rich Felker
@ 2025-06-29  7:58 ` Alexander Monakov
  2025-06-29 10:39   ` Joakim Sindholt
  2025-06-29  8:30 ` David Wang
  2025-06-29 15:39 ` Nick Wellnhofer
  3 siblings, 1 reply; 12+ messages in thread
From: Alexander Monakov @ 2025-06-29  7:58 UTC (permalink / raw)
  To: musl


On Sat, 28 Jun 2025, Rich Felker wrote:

> One thing we're going to need for LC_COLLATE in locales where
> second-level weights are applied in reverse order (diacritic marks
> later in the string weigh more than earlier ones) is the ability to
> traverse (& live transform to NFD) the input string in reverse.

Apologies if I'm forgetting some essential context, but what is the
encoding of the input string, is it not always UTF-8? Reverse iteration
of valid UTF-8 is easy (unless you need something more specific than
just reverse iteration).

Alexander

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re:[musl] Paintcans for reverse iterating strings
  2025-06-29  0:08 [musl] Paintcans for reverse iterating strings Rich Felker
  2025-06-29  3:51 ` Rich Felker
  2025-06-29  7:58 ` Alexander Monakov
@ 2025-06-29  8:30 ` David Wang
  2025-06-29 15:06   ` [musl] " Rich Felker
  2025-06-29 15:39 ` Nick Wellnhofer
  3 siblings, 1 reply; 12+ messages in thread
From: David Wang @ 2025-06-29  8:30 UTC (permalink / raw)
  To: musl


At 2025-06-29 08:08:43, "Rich Felker" <dalias@libc.org> wrote:
>One thing we're going to need for LC_COLLATE in locales where
>second-level weights are applied in reverse order (diacritic marks
>later in the string weigh more than earlier ones) is the ability to
>traverse (& live transform to NFD) the input string in reverse.
>
>In order not to take quadratic time doing this, we need a strategy for
>laying down a logarithmic (thus bounded by a small constant) number of
>waypoints ("paintcans") we can resume forward-processing at to
>simulate reverse-order random access.
>
Hi,
Maybe some mumbo jumbo, (I have no knowledge of LC_COLLATE,  just takes it  an interesting  algorithm problem),
but is it possible to build a "small"  finite state machine with "deterministic reverse transformations"
for all the "diacritic marks"?
If that possible,  first iterating forward and keep state for each "diacritic marks", and 
when reverse iterating string, a reverse state transformation can be carried out, maybe for each "diacritic marks"
, or could be only for current "diacritic marks".
(But this does not work for "random" access)

David 




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Paintcans for reverse iterating strings
  2025-06-29  7:58 ` Alexander Monakov
@ 2025-06-29 10:39   ` Joakim Sindholt
  2025-06-29 15:14     ` Rich Felker
  0 siblings, 1 reply; 12+ messages in thread
From: Joakim Sindholt @ 2025-06-29 10:39 UTC (permalink / raw)
  To: musl; +Cc: Alexander Monakov

On Sun, 29 Jun 2025 10:58:11 +0300 (MSK), Alexander Monakov <amonakov@ispras.ru> wrote:
> 
> On Sat, 28 Jun 2025, Rich Felker wrote:
> 
> > One thing we're going to need for LC_COLLATE in locales where
> > second-level weights are applied in reverse order (diacritic marks
> > later in the string weigh more than earlier ones) is the ability to
> > traverse (& live transform to NFD) the input string in reverse.
> 
> Apologies if I'm forgetting some essential context, but what is the
> encoding of the input string, is it not always UTF-8? Reverse iteration
> of valid UTF-8 is easy (unless you need something more specific than
> just reverse iteration).

As I understand the problem it's not a matter of iterating UTF-8 in
reverse but the normalized NFD codepoint stream it transforms into in
reverse.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] Re:[musl] Paintcans for reverse iterating strings
  2025-06-29  8:30 ` David Wang
@ 2025-06-29 15:06   ` Rich Felker
  0 siblings, 0 replies; 12+ messages in thread
From: Rich Felker @ 2025-06-29 15:06 UTC (permalink / raw)
  To: David Wang; +Cc: musl

On Sun, Jun 29, 2025 at 04:30:12PM +0800, David Wang wrote:
> 
> At 2025-06-29 08:08:43, "Rich Felker" <dalias@libc.org> wrote:
> >One thing we're going to need for LC_COLLATE in locales where
> >second-level weights are applied in reverse order (diacritic marks
> >later in the string weigh more than earlier ones) is the ability to
> >traverse (& live transform to NFD) the input string in reverse.
> >
> >In order not to take quadratic time doing this, we need a strategy for
> >laying down a logarithmic (thus bounded by a small constant) number of
> >waypoints ("paintcans") we can resume forward-processing at to
> >simulate reverse-order random access.
> >
> Hi,
> Maybe some mumbo jumbo, (I have no knowledge of LC_COLLATE,  just takes it  an interesting  algorithm problem),
> but is it possible to build a "small"  finite state machine with "deterministic reverse transformations"
> for all the "diacritic marks"?
> If that possible,  first iterating forward and keep state for each "diacritic marks", and 
> when reverse iterating string, a reverse state transformation can be carried out, maybe for each "diacritic marks"
> , or could be only for current "diacritic marks".
> (But this does not work for "random" access)

The state in question is conceptually unbounded in size (within the
range of size_t). It's potentially all characters up to the current
point.

In the common case of just a small number of nonstarters between
successive starters, in theory you can process UTF-8 backwards from
the end until you find a starter, then go forward. However, you need
a method to avoid quadratic time when you don't have that common-case,
and unless a hybrid approach of doing both is a lot faster in the
common case, it's just gratuitous complexity/codesize/bug-surface.

FWIW (I'll address this elsewhere) it's actually not even iterating
normalized characters in reverse, but iterating collation elements in
reverse. This adds a good deal more complexity to trying to do a
direct reverse scanning.

Rich

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Paintcans for reverse iterating strings
  2025-06-29 10:39   ` Joakim Sindholt
@ 2025-06-29 15:14     ` Rich Felker
  0 siblings, 0 replies; 12+ messages in thread
From: Rich Felker @ 2025-06-29 15:14 UTC (permalink / raw)
  To: Joakim Sindholt; +Cc: musl, Alexander Monakov

On Sun, Jun 29, 2025 at 12:39:03PM +0200, Joakim Sindholt wrote:
> On Sun, 29 Jun 2025 10:58:11 +0300 (MSK), Alexander Monakov <amonakov@ispras.ru> wrote:
> > 
> > On Sat, 28 Jun 2025, Rich Felker wrote:
> > 
> > > One thing we're going to need for LC_COLLATE in locales where
> > > second-level weights are applied in reverse order (diacritic marks
> > > later in the string weigh more than earlier ones) is the ability to
> > > traverse (& live transform to NFD) the input string in reverse.
> > 
> > Apologies if I'm forgetting some essential context, but what is the
> > encoding of the input string, is it not always UTF-8? Reverse iteration
> > of valid UTF-8 is easy (unless you need something more specific than
> > just reverse iteration).
> 
> As I understand the problem it's not a matter of iterating UTF-8 in
> reverse but the normalized NFD codepoint stream it transforms into in
> reverse.

Yes, it's actually worse. It's the collation elements obtained from
forward traversal of the normalized NDF codepoint stream, which need
to be iterated in reverse. So there are like 4 levels of stacked
iteration:

    UTF-8 decoding -> decomposition -> canonical reordering ->
    collation elements

That last arrow is a multi-input (greedy), multi-output (individual
characters or sequences of characters can map to more than one
collation element). In theory it has its own "position within the
output", but since it's always the last step, it doesn't actually have
to be implemented as an iterator. The calling code could just iterate
(forwards or backwards as needed) over the array of elements obtained
from a given [sequence of] input[s]. There are only a finite number of
such arrays, of bounded size; they're not dynamically generated but
part of mmapped data.

Rich


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Paintcans for reverse iterating strings
  2025-06-29  0:08 [musl] Paintcans for reverse iterating strings Rich Felker
                   ` (2 preceding siblings ...)
  2025-06-29  8:30 ` David Wang
@ 2025-06-29 15:39 ` Nick Wellnhofer
  2025-06-29 16:03   ` [musl] " Rich Felker
  3 siblings, 1 reply; 12+ messages in thread
From: Nick Wellnhofer @ 2025-06-29 15:39 UTC (permalink / raw)
  To: musl

On Jun 29, 2025, at 02:08, Rich Felker <dalias@libc.org> wrote:
> One thing we're going to need for LC_COLLATE in locales where
> second-level weights are applied in reverse order (diacritic marks
> later in the string weigh more than earlier ones) is the ability to
> traverse (& live transform to NFD) the input string in reverse.

Assuming the context is strcoll and we're comparing two strings, wouldn't it be possible to compare the strings in normal, forward direction but instead of stopping at the first difference, comparing all collation elements and returning the last difference (if any)?

Nick



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] Paintcans for reverse iterating strings
  2025-06-29 15:39 ` Nick Wellnhofer
@ 2025-06-29 16:03   ` Rich Felker
  2025-06-29 16:25     ` Nick Wellnhofer
  0 siblings, 1 reply; 12+ messages in thread
From: Rich Felker @ 2025-06-29 16:03 UTC (permalink / raw)
  To: Nick Wellnhofer; +Cc: musl

On Sun, Jun 29, 2025 at 05:39:14PM +0200, Nick Wellnhofer wrote:
> On Jun 29, 2025, at 02:08, Rich Felker <dalias@libc.org> wrote:
> > One thing we're going to need for LC_COLLATE in locales where
> > second-level weights are applied in reverse order (diacritic marks
> > later in the string weigh more than earlier ones) is the ability to
> > traverse (& live transform to NFD) the input string in reverse.
> 
> Assuming the context is strcoll and we're comparing two strings,
> wouldn't it be possible to compare the strings in normal, forward
> direction but instead of stopping at the first difference, comparing
> all collation elements and returning the last difference (if any)?

I believe you can do something like this for strcoll. Note that,
normally, you don't even get to second level weights when using
strcoll.

Where you can't do it is strxfrm (transforming into a byte sequence
that can be byte-by-byte compared).

But this is a very useful observation you've made, in that the code
should be factored such that reverse iteration need not be linked
unless strxfrm/wcsxfrm is linked.

Rich

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Paintcans for reverse iterating strings
  2025-06-29 16:03   ` [musl] " Rich Felker
@ 2025-06-29 16:25     ` Nick Wellnhofer
  2025-06-29 16:48       ` Rich Felker
  0 siblings, 1 reply; 12+ messages in thread
From: Nick Wellnhofer @ 2025-06-29 16:25 UTC (permalink / raw)
  To: musl



> On Jun 29, 2025, at 18:03, Rich Felker <dalias@libc.org> wrote:
> 
> On Sun, Jun 29, 2025 at 05:39:14PM +0200, Nick Wellnhofer wrote:
>> On Jun 29, 2025, at 02:08, Rich Felker <dalias@libc.org> wrote:
>>> One thing we're going to need for LC_COLLATE in locales where
>>> second-level weights are applied in reverse order (diacritic marks
>>> later in the string weigh more than earlier ones) is the ability to
>>> traverse (& live transform to NFD) the input string in reverse.
>> 
>> Assuming the context is strcoll and we're comparing two strings,
>> wouldn't it be possible to compare the strings in normal, forward
>> direction but instead of stopping at the first difference, comparing
>> all collation elements and returning the last difference (if any)?
> 
> I believe you can do something like this for strcoll. Note that,
> normally, you don't even get to second level weights when using
> strcoll.
> 
> Where you can't do it is strxfrm (transforming into a byte sequence
> that can be byte-by-byte compared).

For strxfrm, I assume you have to perform Step 3 of the UCA main algorithm or something equivalent: https://www.unicode.org/reports/tr10/#Step_3

So you just have to reverse the second-level weights afterwards which seems trivial. Am I missing something?

Nick



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Paintcans for reverse iterating strings
  2025-06-29 16:25     ` Nick Wellnhofer
@ 2025-06-29 16:48       ` Rich Felker
  2025-06-30 15:18         ` [musl] " Rich Felker
  0 siblings, 1 reply; 12+ messages in thread
From: Rich Felker @ 2025-06-29 16:48 UTC (permalink / raw)
  To: Nick Wellnhofer; +Cc: musl

On Sun, Jun 29, 2025 at 06:25:15PM +0200, Nick Wellnhofer wrote:
> 
> 
> > On Jun 29, 2025, at 18:03, Rich Felker <dalias@libc.org> wrote:
> > 
> > On Sun, Jun 29, 2025 at 05:39:14PM +0200, Nick Wellnhofer wrote:
> >> On Jun 29, 2025, at 02:08, Rich Felker <dalias@libc.org> wrote:
> >>> One thing we're going to need for LC_COLLATE in locales where
> >>> second-level weights are applied in reverse order (diacritic marks
> >>> later in the string weigh more than earlier ones) is the ability to
> >>> traverse (& live transform to NFD) the input string in reverse.
> >> 
> >> Assuming the context is strcoll and we're comparing two strings,
> >> wouldn't it be possible to compare the strings in normal, forward
> >> direction but instead of stopping at the first difference, comparing
> >> all collation elements and returning the last difference (if any)?
> > 
> > I believe you can do something like this for strcoll. Note that,
> > normally, you don't even get to second level weights when using
> > strcoll.
> > 
> > Where you can't do it is strxfrm (transforming into a byte sequence
> > that can be byte-by-byte compared).
> 
> For strxfrm, I assume you have to perform Step 3 of the UCA main
> algorithm or something equivalent:
> https://www.unicode.org/reports/tr10/#Step_3
> 
> So you just have to reverse the second-level weights afterwards
> which seems trivial. Am I missing something?

Hm, indeed maybe you can do the reversal in-place in the destination
buffer. I didn't think of that because I was assuming you needed to be
able to do the same for strcoll where you don't have a destination
buffer, but since you've found that it doesn't need to be done at all
for strcoll (because you can just keep a running last-difference and
use the final one) it seems like you always have a buffer you can do
the reversal in.

Note that this does require the weights themselves to be iterable in
reverse, which is not trivially the case in the CLDR format with
fractional byte-based weights rather than 16-bit integers, but they
can be assigned or transformed so that it's easy, or my algorithm can
just be applied to the output weight buffer (which is a lot less
costly than re-running iterators that decode, decompose, reorder, and
lookup collation elements).

Thanks!

Rich


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] Paintcans for reverse iterating strings
  2025-06-29 16:48       ` Rich Felker
@ 2025-06-30 15:18         ` Rich Felker
  0 siblings, 0 replies; 12+ messages in thread
From: Rich Felker @ 2025-06-30 15:18 UTC (permalink / raw)
  To: Nick Wellnhofer; +Cc: musl

On Sun, Jun 29, 2025 at 12:48:23PM -0400, Rich Felker wrote:
> On Sun, Jun 29, 2025 at 06:25:15PM +0200, Nick Wellnhofer wrote:
> > 
> > 
> > > On Jun 29, 2025, at 18:03, Rich Felker <dalias@libc.org> wrote:
> > > 
> > > On Sun, Jun 29, 2025 at 05:39:14PM +0200, Nick Wellnhofer wrote:
> > >> On Jun 29, 2025, at 02:08, Rich Felker <dalias@libc.org> wrote:
> > >>> One thing we're going to need for LC_COLLATE in locales where
> > >>> second-level weights are applied in reverse order (diacritic marks
> > >>> later in the string weigh more than earlier ones) is the ability to
> > >>> traverse (& live transform to NFD) the input string in reverse.
> > >> 
> > >> Assuming the context is strcoll and we're comparing two strings,
> > >> wouldn't it be possible to compare the strings in normal, forward
> > >> direction but instead of stopping at the first difference, comparing
> > >> all collation elements and returning the last difference (if any)?
> > > 
> > > I believe you can do something like this for strcoll. Note that,
> > > normally, you don't even get to second level weights when using
> > > strcoll.
> > > 
> > > Where you can't do it is strxfrm (transforming into a byte sequence
> > > that can be byte-by-byte compared).
> > 
> > For strxfrm, I assume you have to perform Step 3 of the UCA main
> > algorithm or something equivalent:
> > https://www.unicode.org/reports/tr10/#Step_3
> > 
> > So you just have to reverse the second-level weights afterwards
> > which seems trivial. Am I missing something?
> 
> Hm, indeed maybe you can do the reversal in-place in the destination
> buffer. I didn't think of that because I was assuming you needed to be
> able to do the same for strcoll where you don't have a destination
> buffer, but since you've found that it doesn't need to be done at all
> for strcoll (because you can just keep a running last-difference and
> use the final one) it seems like you always have a buffer you can do
> the reversal in.
> 
> Note that this does require the weights themselves to be iterable in
> reverse, which is not trivially the case in the CLDR format with
> fractional byte-based weights rather than 16-bit integers, but they
> can be assigned or transformed so that it's easy, or my algorithm can
> just be applied to the output weight buffer (which is a lot less
> costly than re-running iterators that decode, decompose, reorder, and
> lookup collation elements).

OK, this admits a trivial solution that works regardless of the
encoding of the weights, without any need for being synchronizable for
reverse traversal.

When emitting the second-level weights during the first pass (before
reversing them), reverse the bytes for each collating element on a
byte-by-byte basis. Then, the entire span of bytes in the output
corresponding to the second level weights can be reversed,
byte-by-byte without any consideration for where the collating element
boundaries are, to produce a final bytestring consisting of collating
elements in reverse order.

Moreover, since "second order is reversed" is a property of the locale
that's fixed at the time the locale file is generated, the reversal
first reversal could even have taken place in the on-disk format, so
that it doesn't have to be processed at runtime. I'm not sure yet if
this is a good idea when strcoll is probably the more-likely case that
needs to perform well, but it could be considered.

Rich

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-06-30 15:19 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-29  0:08 [musl] Paintcans for reverse iterating strings Rich Felker
2025-06-29  3:51 ` Rich Felker
2025-06-29  7:58 ` Alexander Monakov
2025-06-29 10:39   ` Joakim Sindholt
2025-06-29 15:14     ` Rich Felker
2025-06-29  8:30 ` David Wang
2025-06-29 15:06   ` [musl] " Rich Felker
2025-06-29 15:39 ` Nick Wellnhofer
2025-06-29 16:03   ` [musl] " Rich Felker
2025-06-29 16:25     ` Nick Wellnhofer
2025-06-29 16:48       ` Rich Felker
2025-06-30 15:18         ` [musl] " Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).