num_in_chars incremented after each mbrtowc()

zsh-workers
 help / color / mirror / code / Atom feed

* num_in_chars incremented after each mbrtowc()
@ 2015-12-06  9:37 Sebastian Gniazdowski
  2015-12-06 10:24 ` Sebastian Gniazdowski
  2015-12-06 15:49 ` Peter Stephenson
  0 siblings, 2 replies; 8+ messages in thread
From: Sebastian Gniazdowski @ 2015-12-06  9:37 UTC (permalink / raw)
  To: Zsh hackers list

[-- Attachment #1: Type: text/plain, Size: 239 bytes --]

Hello,
while working my hands off on implementing display width handling in
params.c rather than subst.c I encountered a bug in mb_metastrlenend.
It will reveal itself only on improper unicode strings.

Best regards,
Sebastian Gniazdowski

[-- Attachment #2: num_in_char.diff --]
[-- Type: text/plain, Size: 503 bytes --]

diff --git a/Src/utils.c b/Src/utils.c
index ca810de..b0a6625 100644
--- a/Src/utils.c
+++ b/Src/utils.c
@@ -5072,10 +5072,9 @@ mb_metastrlenend(char *ptr, int width, char *eptr)
 	    inchar = *ptr;
 	ptr++;
 	ret = mbrtowc(&wc, &inchar, 1, &mb_shiftstate);
+        num_in_char++;
 
-	if (ret == MB_INCOMPLETE) {
-	    num_in_char++;
-	} else {
+	if (ret != MB_INCOMPLETE) {
 	    if (ret == MB_INVALID) {
 		/* Reset, treat as single character */
 		memset(&mb_shiftstate, 0, sizeof(mb_shiftstate));

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: num_in_chars incremented after each mbrtowc()
  2015-12-06  9:37 num_in_chars incremented after each mbrtowc() Sebastian Gniazdowski
@ 2015-12-06 10:24 ` Sebastian Gniazdowski
  2015-12-06 15:49 ` Peter Stephenson
  1 sibling, 0 replies; 8+ messages in thread
From: Sebastian Gniazdowski @ 2015-12-06 10:24 UTC (permalink / raw)
  To: Zsh hackers list

[-- Attachment #1: Type: text/plain, Size: 482 bytes --]

Found one other missing incrementation of num_in_char. Diff against
utils.c with the previous patch applied

Best regards,
Sebastian Gniazdowski


On 6 December 2015 at 10:37, Sebastian Gniazdowski
<sgniazdowski@gmail.com> wrote:
> Hello,
> while working my hands off on implementing display width handling in
> params.c rather than subst.c I encountered a bug in mb_metastrlenend.
> It will reveal itself only on improper unicode strings.
>
> Best regards,
> Sebastian Gniazdowski

[-- Attachment #2: num_in_char2.diff --]
[-- Type: text/plain, Size: 483 bytes --]

diff --git a/Src/utils.c b/Src/utils.c
index b0a6625..d42d5c3 100644
--- a/Src/utils.c
+++ b/Src/utils.c
@@ -5066,9 +5066,10 @@ mb_metastrlenend(char *ptr, int width, char *eptr)
 
     memset(&mb_shiftstate, 0, sizeof(mb_shiftstate));
     while (*ptr && !(eptr && ptr >= eptr)) {
-	if (*ptr == Meta)
+	if (*ptr == Meta) {
 	    inchar = *++ptr ^ 32;
-	else
+            num_in_char++;
+        } else
 	    inchar = *ptr;
 	ptr++;
 	ret = mbrtowc(&wc, &inchar, 1, &mb_shiftstate);

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: num_in_chars incremented after each mbrtowc()
  2015-12-06  9:37 num_in_chars incremented after each mbrtowc() Sebastian Gniazdowski
  2015-12-06 10:24 ` Sebastian Gniazdowski
@ 2015-12-06 15:49 ` Peter Stephenson
  2015-12-06 16:40   ` Sebastian Gniazdowski
  2015-12-06 17:03   ` Sebastian Gniazdowski
  1 sibling, 2 replies; 8+ messages in thread
From: Peter Stephenson @ 2015-12-06 15:49 UTC (permalink / raw)
  To: Zsh hackers list

On Sun, 6 Dec 2015 10:37:21 +0100
Sebastian Gniazdowski <sgniazdowski@gmail.com> wrote:
> Hello,
> while working my hands off on implementing display width handling in
> params.c rather than subst.c I encountered a bug in mb_metastrlenend.
> It will reveal itself only on improper unicode strings.

I don't understand your patch: the change is to increment num_in_char in
exactly the cases where it is deliberately set to 0 later to reflect the
fact we've now got a complete character and the effect is included in
num instead.

Somebody complained about this function a couple of months ago and I
explained, then, too; it suggests it needs some more comments, so I've
added some.

It may be the real difficulty is with the API, in which case you'll need
to say (in words, not videos, please) what you're expecting.  As long as
this is consistently dealt with in callers of the function it might be
possible to change --- I guess you're only worried about the case for
returning a width, which is uncommon in the code and indeed doesn't
really have a well defined result for incomplete/invalid characters.
Maybe you have a particular strategy in mind.

Consequently I haven't looked at your other patch.

pws

diff --git a/Src/utils.c b/Src/utils.c
index d131383..45f8286 100644
--- a/Src/utils.c
+++ b/Src/utils.c
@@ -5179,6 +5179,17 @@ mb_metastrlenend(char *ptr, int width, char *eptr)
 	ret = mbrtowc(&wc, &inchar, 1, &mb_shiftstate);

 	if (ret == MB_INCOMPLETE) {
+	    /*
+	     * "num_in_char" is only used for incomplete characters.  The
+	     * assumption is that we will output this octet as a single
+	     * character (of single width) if we don't get a complete
+	     * character; if we do get a complete character, num_in_char
+	     * becomes irrelevant and is set to zero.
+	     *
+	     * This is in contrast to "num" which counts the characters
+	     * or widths in complete characters.  The two are summed,
+	     * so we don't count characters twice.
+	     */
 	    num_in_char++;
 	} else {
 	    if (ret == MB_INVALID) {

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: num_in_chars incremented after each mbrtowc()
  2015-12-06 15:49 ` Peter Stephenson
@ 2015-12-06 16:40   ` Sebastian Gniazdowski
  2015-12-06 17:13     ` Peter Stephenson
  2015-12-06 17:03   ` Sebastian Gniazdowski
  1 sibling, 1 reply; 8+ messages in thread
From: Sebastian Gniazdowski @ 2015-12-06 16:40 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

I start with the second patch (which you haven't look at?) as it is
more justified. To put it simply – ptr is incremented, num_in_char
isn't. That's wrong.

To put it less simply, I've written a function that returns raw
pointer into string and sets prevcharlen and nextcharlen. These two
values come from num_in_char. If I don't add one more incrementation
of num_in_char to the "if (*ptr == Meta) {" block, characters are
displayed as errors (i.e. the question marks symbols).

The first patch is less justified. Only bytes of incomplete characters
are counted in num_in_chars and that is what the code wants. But if
one defines the variable as "holds current byte width of character" it
is clear that *every* call to mbrtowc() should increment that
variable. Currently last byte will not be counted because there will
be no MB_INCOMPLETE returned.

To put it simply, call to mbrtowc() must be followed by num_in_char++,
it cannot be anything different.

Best regards,
Sebastian Gniazdowski

On 6 December 2015 at 16:49, Peter Stephenson
<p.w.stephenson@ntlworld.com> wrote:
> On Sun, 6 Dec 2015 10:37:21 +0100
> Sebastian Gniazdowski <sgniazdowski@gmail.com> wrote:
>> Hello,
>> while working my hands off on implementing display width handling in
>> params.c rather than subst.c I encountered a bug in mb_metastrlenend.
>> It will reveal itself only on improper unicode strings.
>
> I don't understand your patch: the change is to increment num_in_char in
> exactly the cases where it is deliberately set to 0 later to reflect the
> fact we've now got a complete character and the effect is included in
> num instead.
>
> Somebody complained about this function a couple of months ago and I
> explained, then, too; it suggests it needs some more comments, so I've
> added some.
>
> It may be the real difficulty is with the API, in which case you'll need
> to say (in words, not videos, please) what you're expecting.  As long as
> this is consistently dealt with in callers of the function it might be
> possible to change --- I guess you're only worried about the case for
> returning a width, which is uncommon in the code and indeed doesn't
> really have a well defined result for incomplete/invalid characters.
> Maybe you have a particular strategy in mind.
>
> Consequently I haven't looked at your other patch.
>
> pws
>
> diff --git a/Src/utils.c b/Src/utils.c
> index d131383..45f8286 100644
> --- a/Src/utils.c
> +++ b/Src/utils.c
> @@ -5179,6 +5179,17 @@ mb_metastrlenend(char *ptr, int width, char *eptr)
>         ret = mbrtowc(&wc, &inchar, 1, &mb_shiftstate);
>
>         if (ret == MB_INCOMPLETE) {
> +           /*
> +            * "num_in_char" is only used for incomplete characters.  The
> +            * assumption is that we will output this octet as a single
> +            * character (of single width) if we don't get a complete
> +            * character; if we do get a complete character, num_in_char
> +            * becomes irrelevant and is set to zero.
> +            *
> +            * This is in contrast to "num" which counts the characters
> +            * or widths in complete characters.  The two are summed,
> +            * so we don't count characters twice.
> +            */
>             num_in_char++;
>         } else {
>             if (ret == MB_INVALID) {


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: num_in_chars incremented after each mbrtowc()
  2015-12-06 15:49 ` Peter Stephenson
  2015-12-06 16:40   ` Sebastian Gniazdowski
@ 2015-12-06 17:03   ` Sebastian Gniazdowski
  2015-12-06 17:33     ` Peter Stephenson
  1 sibling, 1 reply; 8+ messages in thread
From: Sebastian Gniazdowski @ 2015-12-06 17:03 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

On 6 December 2015 at 16:49, Peter Stephenson
<p.w.stephenson@ntlworld.com> wrote:
> On Sun, 6 Dec 2015 10:37:21 +0100
> Sebastian Gniazdowski <sgniazdowski@gmail.com> wrote:
> Somebody complained about this function a couple of months ago and I
> explained, then, too; it suggests it needs some more comments, so I've
> added some.

It was me that complained. I couldn't get through to you with message
that num_in_char should be treated as representing single character.
Now you've written the same:

> +            * "num_in_char" is only used for incomplete characters.  The
> +            * assumption is that we will output this octet as a single
> +            * character (of single width) if we don't get a complete
> +            * character; if we do get a complete character, num_in_char

So, despite having num_in_char arbitrarily large, it represents single
character. The code in mb_metastrlenend() does something else:

    /* If incomplete, treat remainder as trailing single bytes */
    return num + num_in_char;

It should be:
    return num + num_in_char > 0 ? 1 : 0;

Best regards,
Sebastian Gniazdowski


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: num_in_chars incremented after each mbrtowc()
  2015-12-06 16:40   ` Sebastian Gniazdowski
@ 2015-12-06 17:13     ` Peter Stephenson
  0 siblings, 0 replies; 8+ messages in thread
From: Peter Stephenson @ 2015-12-06 17:13 UTC (permalink / raw)
  To: Sebastian Gniazdowski; +Cc: Zsh hackers list

Sorry, I'm still not following; it still seems to indicate you don't
understand the function (which could well be my fault, which is why I've
improved the documentation).  Consequently, as I asked, what I need to
know is what output you expect from the function under what
circumstances.

(I do not need a patch and the following is without reference to the
innards of the function.)

Under what circumstances is what you are getting wrong?  You imply it is
to do with invalid characters (of some sort).  What (this is the key
question) do you expect to be returned for invalid characters?
Preferably with an example (text not video).  Also, is your issue
restricted to the question of width, or does it also apply to when the
function is returning a charater count (I'm guessing it's just widths?)

E.g.: I pass in the arguments ...  for the function and I expect this to
cause return ... so that the final output looks like ...

(What you say would be consistent, for example, with the fact that you
expect 2 to be returned for an octet from an invalid character because
the shell or something else happens to display the ones you've got as 2
characters.  That's certainly not designed into the function at the
moment and if that's what you want it's going to take a lot more work
because the width of display of invalid characters in the line editor
isn't fixed.  However, I'm not sure that the display you're talking
about is even in the line editor at all --- which means we're out of
luck because there is no fixed way of displaying invalid characters.
But I don't yet know this is the issue you're talking about.)

Thanks
pws

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: num_in_chars incremented after each mbrtowc()
  2015-12-06 17:03   ` Sebastian Gniazdowski
@ 2015-12-06 17:33     ` Peter Stephenson
  2015-12-06 17:40       ` Peter Stephenson
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Stephenson @ 2015-12-06 17:33 UTC (permalink / raw)
  To: Sebastian Gniazdowski; +Cc: Zsh hackers list

On Sun, 6 Dec 2015 18:03:08 +0100
> It should be:
>     return num + num_in_char > 0 ? 1 : 0;

OK, here's my full explanation of what this function actually does; it's
not what you say it does, but that doesn't mean I've got it right, of
course.  However, the real question is the one in the previous message,
about what API you actually need for what you're doing.

num is the "real" answer for real chracters.  It counts:

- 1 for MB_INVALID, and we count only the first octet from the input
string, then move down the input string for more.  We assume we'll
represent the character as a single width.

- 1 or the width for a valid character, depending on what the caller
requested.

Under these circumstances num_in_char is irrelevant.

num_in_char is only useful if we get some number of bytes for
MB_INCOMPLETE.  They can only occur if there are no more characters at
the end, since otherwise we would get MB_INVALID, not MB_INCOMPLETE.  In
this case, since we're never going to produce anything else, we *assume*
(and this assumption may be wrong) that the right way to deal with it is
as individual octets.  As there is no standard (for obvious reasons) as
to how to deal with incomplete characters, this may not be the most
convenient answer in practice.

If you think there is a logical error in the above, please state it.

Are you saying, for example, that a trailing set of chracters that are
MB_INCOMPLETE appear as a single output (albeit invalid) character (I
guess with a single width)?  That would mean the right return value was

     return num + (num_in_char > 0 ? 1 : 0);

(perhaps that was even what you meant above?)

pws

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: num_in_chars incremented after each mbrtowc()
  2015-12-06 17:33     ` Peter Stephenson
@ 2015-12-06 17:40       ` Peter Stephenson
  0 siblings, 0 replies; 8+ messages in thread
From: Peter Stephenson @ 2015-12-06 17:40 UTC (permalink / raw)
  To: Zsh hackers list

On Sun, 6 Dec 2015 17:33:55 +0000
Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> Are you saying, for example, that a trailing set of chracters that are
> MB_INCOMPLETE appear as a single output (albeit invalid) character (I
> guess with a single width)?  That would mean the right return value was
> 
>      return num + (num_in_char > 0 ? 1 : 0);
> 
> (perhaps that was even what you meant above?)

Ah, reading your previous message in the light of the above, I think
that *is* what you're saying.

OK, as I said there isn't a really "right" answer here, just a
convenient one.  So if this is what works for you let's go with that.

pws

diff --git a/Src/utils.c b/Src/utils.c
index 45f8286..fc2b192 100644
--- a/Src/utils.c
+++ b/Src/utils.c
@@ -5180,11 +5180,15 @@ mb_metastrlenend(char *ptr, int width, char *eptr)
 
 	if (ret == MB_INCOMPLETE) {
 	    /*
-	     * "num_in_char" is only used for incomplete characters.  The
-	     * assumption is that we will output this ocatet as a single
+	     * "num_in_char" is only used for incomplete characters.
+	     * The assumption is that we will output all trailing octets
+	     * that form part of an incomplete character as a single
 	     * character (of single width) if we don't get a complete
-	     * character; if we do get a complete character, num_in_char
-	     * becomes irrelevant and is set to zero.
+	     * character.  This is purely pragmatic --- I'm not aware
+	     * of a standard way of dealing with incomplete characters.
+	     *
+	     * If we do get a complete character, num_in_char
+	     * becomes irrelevant and is set to zero
 	     *
 	     * This is in contrast to "num" which counts the characters
 	     * or widths in complete characters.  The two are summed,
@@ -5216,8 +5220,8 @@ mb_metastrlenend(char *ptr, int width, char *eptr)
 	}
     }
 
-    /* If incomplete, treat remainder as trailing single bytes */
-    return num + num_in_char;
+    /* If incomplete, treat remainder as trailing single character */
+    return num + (num_in_char ? 1 : 0);
 }
 
 /*


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2015-12-06 17:40 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-06  9:37 num_in_chars incremented after each mbrtowc() Sebastian Gniazdowski
2015-12-06 10:24 ` Sebastian Gniazdowski
2015-12-06 15:49 ` Peter Stephenson
2015-12-06 16:40   ` Sebastian Gniazdowski
2015-12-06 17:13     ` Peter Stephenson
2015-12-06 17:03   ` Sebastian Gniazdowski
2015-12-06 17:33     ` Peter Stephenson
2015-12-06 17:40       ` Peter Stephenson

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).