mailing list of musl libc
 help / color / mirror / code / Atom feed
* [musl] Paintcans for reverse iterating strings
@ 2025-06-29  0:08 Rich Felker
  2025-06-29  3:51 ` Rich Felker
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Rich Felker @ 2025-06-29  0:08 UTC (permalink / raw)
  To: musl

One thing we're going to need for LC_COLLATE in locales where
second-level weights are applied in reverse order (diacritic marks
later in the string weigh more than earlier ones) is the ability to
traverse (& live transform to NFD) the input string in reverse.

In order not to take quadratic time doing this, we need a strategy for
laying down a logarithmic (thus bounded by a small constant) number of
waypoints ("paintcans") we can resume forward-processing at to
simulate reverse-order random access.

I'm posting here a proposal for comments on whether this seems fairly
optimal or has any obvious improvements:


Maintain an array of up to N waypoints, initially populated just by
the halfway point.

When requesting a position, first remove any waypoints past the
requested position from the array, then start from the last remaining
waypoint and work forward. When crossing the halfway point between the
last waypoint and the desired position, if the waypoint array is not
already full, add the halfway position as a waypoint.


With unlimited N, I'm pretty sure this gives O(N log N) total
traversal time. In a finite real-machine model, N=8*sizeof(size_t)
then gives the same. And if we know ahead of time that the size of the
input is way less than the maximum possible length (note: total length
is known from a previous pass), we can pre-populate linear-spaced
early waypoints or space them differently (like 1/4 instead of 1/2 the
way to the target) so that less time is spent backtracking (I haven't
yet thought much about which is best).

Note that N=64 is no problem in terms of storage space; it's under 4k.
And the majority would not even get used/touched except on
pathologically long inputs to strcoll/strxfrm, and only in the case of
a locale with reversed second-level weights.

Rich

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [musl] Paintcans for reverse iterating strings
  2025-06-29  0:08 [musl] Paintcans for reverse iterating strings Rich Felker
@ 2025-06-29  3:51 ` Rich Felker
  2025-06-29  7:58 ` Alexander Monakov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Rich Felker @ 2025-06-29  3:51 UTC (permalink / raw)
  To: musl

On Sat, Jun 28, 2025 at 08:08:43PM -0400, Rich Felker wrote:
> One thing we're going to need for LC_COLLATE in locales where
> second-level weights are applied in reverse order (diacritic marks
> later in the string weigh more than earlier ones) is the ability to
> traverse (& live transform to NFD) the input string in reverse.
> 
> In order not to take quadratic time doing this, we need a strategy for
> laying down a logarithmic (thus bounded by a small constant) number of
> waypoints ("paintcans") we can resume forward-processing at to
> simulate reverse-order random access.
> 
> I'm posting here a proposal for comments on whether this seems fairly
> optimal or has any obvious improvements:
> 
> 
> Maintain an array of up to N waypoints, initially populated just by
> the halfway point.
> 
> When requesting a position, first remove any waypoints past the
> requested position from the array, then start from the last remaining
> waypoint and work forward. When crossing the halfway point between the
> last waypoint and the desired position, if the waypoint array is not
> already full, add the halfway position as a waypoint.
> 
> 
> With unlimited N, I'm pretty sure this gives O(N log N) total
> traversal time. In a finite real-machine model, N=8*sizeof(size_t)
> then gives the same. And if we know ahead of time that the size of the
> input is way less than the maximum possible length (note: total length
> is known from a previous pass), we can pre-populate linear-spaced
> early waypoints or space them differently (like 1/4 instead of 1/2 the
> way to the target) so that less time is spent backtracking (I haven't
> yet thought much about which is best).
> 
> Note that N=64 is no problem in terms of storage space; it's under 4k.
> And the majority would not even get used/touched except on
> pathologically long inputs to strcoll/strxfrm, and only in the case of
> a locale with reversed second-level weights.

At least empirically this seems to work. I'm getting expected output
and for a 21-character string it performs the forward-iterate
operation 63 times, which is not bad. This is without any logic to
emit extra waypoints when log(len) is small and only a few slots are
needed for the halfway steps.

Rich

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [musl] Paintcans for reverse iterating strings
  2025-06-29  0:08 [musl] Paintcans for reverse iterating strings Rich Felker
  2025-06-29  3:51 ` Rich Felker
@ 2025-06-29  7:58 ` Alexander Monakov
       [not found] ` <22432008.194c.197baceeae8.Coremail.00107082@163.com>
       [not found] ` <3936BA69-92D7-4FAA-80DE-2FDE6861A536@aevum.de>
  3 siblings, 0 replies; 6+ messages in thread
From: Alexander Monakov @ 2025-06-29  7:58 UTC (permalink / raw)
  To: musl


On Sat, 28 Jun 2025, Rich Felker wrote:

> One thing we're going to need for LC_COLLATE in locales where
> second-level weights are applied in reverse order (diacritic marks
> later in the string weigh more than earlier ones) is the ability to
> traverse (& live transform to NFD) the input string in reverse.

Apologies if I'm forgetting some essential context, but what is the
encoding of the input string, is it not always UTF-8? Reverse iteration
of valid UTF-8 is easy (unless you need something more specific than
just reverse iteration).

Alexander

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [musl] Re:[musl] Paintcans for reverse iterating strings
       [not found] ` <22432008.194c.197baceeae8.Coremail.00107082@163.com>
@ 2025-06-29 15:06   ` Rich Felker
  0 siblings, 0 replies; 6+ messages in thread
From: Rich Felker @ 2025-06-29 15:06 UTC (permalink / raw)
  To: David Wang; +Cc: musl

On Sun, Jun 29, 2025 at 04:30:12PM +0800, David Wang wrote:
> 
> At 2025-06-29 08:08:43, "Rich Felker" <dalias@libc.org> wrote:
> >One thing we're going to need for LC_COLLATE in locales where
> >second-level weights are applied in reverse order (diacritic marks
> >later in the string weigh more than earlier ones) is the ability to
> >traverse (& live transform to NFD) the input string in reverse.
> >
> >In order not to take quadratic time doing this, we need a strategy for
> >laying down a logarithmic (thus bounded by a small constant) number of
> >waypoints ("paintcans") we can resume forward-processing at to
> >simulate reverse-order random access.
> >
> Hi,
> Maybe some mumbo jumbo, (I have no knowledge of LC_COLLATE,  just takes it  an interesting  algorithm problem),
> but is it possible to build a "small"  finite state machine with "deterministic reverse transformations"
> for all the "diacritic marks"?
> If that possible,  first iterating forward and keep state for each "diacritic marks", and 
> when reverse iterating string, a reverse state transformation can be carried out, maybe for each "diacritic marks"
> , or could be only for current "diacritic marks".
> (But this does not work for "random" access)

The state in question is conceptually unbounded in size (within the
range of size_t). It's potentially all characters up to the current
point.

In the common case of just a small number of nonstarters between
successive starters, in theory you can process UTF-8 backwards from
the end until you find a starter, then go forward. However, you need
a method to avoid quadratic time when you don't have that common-case,
and unless a hybrid approach of doing both is a lot faster in the
common case, it's just gratuitous complexity/codesize/bug-surface.

FWIW (I'll address this elsewhere) it's actually not even iterating
normalized characters in reverse, but iterating collation elements in
reverse. This adds a good deal more complexity to trying to do a
direct reverse scanning.

Rich

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [musl] Paintcans for reverse iterating strings
       [not found] ` <3936BA69-92D7-4FAA-80DE-2FDE6861A536@aevum.de>
@ 2025-06-29 16:03   ` Rich Felker
       [not found]     ` <B7E7DF17-45AB-4858-BEE2-BE0B0A1BF82B@aevum.de>
  0 siblings, 1 reply; 6+ messages in thread
From: Rich Felker @ 2025-06-29 16:03 UTC (permalink / raw)
  To: Nick Wellnhofer; +Cc: musl

On Sun, Jun 29, 2025 at 05:39:14PM +0200, Nick Wellnhofer wrote:
> On Jun 29, 2025, at 02:08, Rich Felker <dalias@libc.org> wrote:
> > One thing we're going to need for LC_COLLATE in locales where
> > second-level weights are applied in reverse order (diacritic marks
> > later in the string weigh more than earlier ones) is the ability to
> > traverse (& live transform to NFD) the input string in reverse.
> 
> Assuming the context is strcoll and we're comparing two strings,
> wouldn't it be possible to compare the strings in normal, forward
> direction but instead of stopping at the first difference, comparing
> all collation elements and returning the last difference (if any)?

I believe you can do something like this for strcoll. Note that,
normally, you don't even get to second level weights when using
strcoll.

Where you can't do it is strxfrm (transforming into a byte sequence
that can be byte-by-byte compared).

But this is a very useful observation you've made, in that the code
should be factored such that reverse iteration need not be linked
unless strxfrm/wcsxfrm is linked.

Rich

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [musl] Paintcans for reverse iterating strings
       [not found]       ` <20250629164822.GX1827@brightrain.aerifal.cx>
@ 2025-06-30 15:18         ` Rich Felker
  0 siblings, 0 replies; 6+ messages in thread
From: Rich Felker @ 2025-06-30 15:18 UTC (permalink / raw)
  To: Nick Wellnhofer; +Cc: musl

On Sun, Jun 29, 2025 at 12:48:23PM -0400, Rich Felker wrote:
> On Sun, Jun 29, 2025 at 06:25:15PM +0200, Nick Wellnhofer wrote:
> > 
> > 
> > > On Jun 29, 2025, at 18:03, Rich Felker <dalias@libc.org> wrote:
> > > 
> > > On Sun, Jun 29, 2025 at 05:39:14PM +0200, Nick Wellnhofer wrote:
> > >> On Jun 29, 2025, at 02:08, Rich Felker <dalias@libc.org> wrote:
> > >>> One thing we're going to need for LC_COLLATE in locales where
> > >>> second-level weights are applied in reverse order (diacritic marks
> > >>> later in the string weigh more than earlier ones) is the ability to
> > >>> traverse (& live transform to NFD) the input string in reverse.
> > >> 
> > >> Assuming the context is strcoll and we're comparing two strings,
> > >> wouldn't it be possible to compare the strings in normal, forward
> > >> direction but instead of stopping at the first difference, comparing
> > >> all collation elements and returning the last difference (if any)?
> > > 
> > > I believe you can do something like this for strcoll. Note that,
> > > normally, you don't even get to second level weights when using
> > > strcoll.
> > > 
> > > Where you can't do it is strxfrm (transforming into a byte sequence
> > > that can be byte-by-byte compared).
> > 
> > For strxfrm, I assume you have to perform Step 3 of the UCA main
> > algorithm or something equivalent:
> > https://www.unicode.org/reports/tr10/#Step_3
> > 
> > So you just have to reverse the second-level weights afterwards
> > which seems trivial. Am I missing something?
> 
> Hm, indeed maybe you can do the reversal in-place in the destination
> buffer. I didn't think of that because I was assuming you needed to be
> able to do the same for strcoll where you don't have a destination
> buffer, but since you've found that it doesn't need to be done at all
> for strcoll (because you can just keep a running last-difference and
> use the final one) it seems like you always have a buffer you can do
> the reversal in.
> 
> Note that this does require the weights themselves to be iterable in
> reverse, which is not trivially the case in the CLDR format with
> fractional byte-based weights rather than 16-bit integers, but they
> can be assigned or transformed so that it's easy, or my algorithm can
> just be applied to the output weight buffer (which is a lot less
> costly than re-running iterators that decode, decompose, reorder, and
> lookup collation elements).

OK, this admits a trivial solution that works regardless of the
encoding of the weights, without any need for being synchronizable for
reverse traversal.

When emitting the second-level weights during the first pass (before
reversing them), reverse the bytes for each collating element on a
byte-by-byte basis. Then, the entire span of bytes in the output
corresponding to the second level weights can be reversed,
byte-by-byte without any consideration for where the collating element
boundaries are, to produce a final bytestring consisting of collating
elements in reverse order.

Moreover, since "second order is reversed" is a property of the locale
that's fixed at the time the locale file is generated, the reversal
first reversal could even have taken place in the on-disk format, so
that it doesn't have to be processed at runtime. I'm not sure yet if
this is a good idea when strcoll is probably the more-likely case that
needs to perform well, but it could be considered.

Rich

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-06-30 15:19 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-29  0:08 [musl] Paintcans for reverse iterating strings Rich Felker
2025-06-29  3:51 ` Rich Felker
2025-06-29  7:58 ` Alexander Monakov
     [not found] ` <22432008.194c.197baceeae8.Coremail.00107082@163.com>
2025-06-29 15:06   ` [musl] " Rich Felker
     [not found] ` <3936BA69-92D7-4FAA-80DE-2FDE6861A536@aevum.de>
2025-06-29 16:03   ` [musl] " Rich Felker
     [not found]     ` <B7E7DF17-45AB-4858-BEE2-BE0B0A1BF82B@aevum.de>
     [not found]       ` <20250629164822.GX1827@brightrain.aerifal.cx>
2025-06-30 15:18         ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).