zsh-workers
* zsh generates invalid UTF-8 encoding in the history
@ 2016-10-05 11:48 Vincent Lefevre
  2016-10-05 12:41 ` Mikael Magnusson
  2016-10-05 17:25 ` Bart Schaefer
  0 siblings, 2 replies; 7+ messages in thread
From: Vincent Lefevre @ 2016-10-05 11:48 UTC
  To: zsh-workers

With Debian's zsh 5.2-5 + some patches, when I execute commands with
some particular Unicode characters, the UTF-8 sequences are rewritten
incorrectly in the history. For instance:

cventin:~> unicode ─
U+2500 BOX DRAWINGS LIGHT HORIZONTAL
UTF-8: e2 94 80 UTF-16BE: 2500 Decimal: &#9472; Octal: \022400
─
Category: So (Symbol, Other)
Unicode block: 2500..257F; Box Drawing
Bidi: ON (Other Neutrals)

But in the history, instead of getting e2 94 80, I get: e2 83 b4 80.
Concerning "e2 83 b4 80":

cventin:~> unicode --fromcp utf-8 -x e283b4
U+20F4  - No such unicode character name in database
UTF-8: e2 83 b4 UTF-16BE: 20f4 Decimal: &#8436; Octal: \020364
⃴ (⃴)
Uppercase: 20F4
Category: Cn (Other, Not Assigned)
Unicode block: 20D0..20FF; Combining Diacritical Marks for Symbols

and the 80 on its own is not a valid UTF-8 sequence.
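
The byte-level claims above can be checked without the unicode tool. A minimal sketch in Python; the variable and function names are illustrative and not from the thread:

```python
# Bytes typed on the command line vs. bytes found in the history file.
expected = bytes.fromhex("e29480")
stored = bytes.fromhex("e283b480")

# The expected sequence decodes to U+2500 (BOX DRAWINGS LIGHT HORIZONTAL).
assert expected.decode("utf-8") == "\u2500"

# The first three stored bytes decode to the unassigned code point U+20F4.
assert stored[:3].decode("utf-8") == "\u20f4"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Here is_valid_utf8(stored) is False because the trailing 0x80 is a continuation byte with no lead byte in front of it.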

This breaks various tools that process the history (grep, lesspipe,
etc.): first because the expected character is no longer present,
and second because the invalid UTF-8 is not treated as a character.
For instance:

cventin:~> grep -av '^.*$' .zhistory | tail -n 1 | hd
00000000  3a 20 31 34 37 35 36 36  36 34 31 38 3a 30 3b 75  |: 1475666418:0;u|
00000010  6e 69 63 6f 64 65 20 e2  83 b4 80 0a              |nicode .....|
0000001c

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)



* Re: zsh generates invalid UTF-8 encoding in the history
  2016-10-05 11:48 zsh generates invalid UTF-8 encoding in the history Vincent Lefevre
@ 2016-10-05 12:41 ` Mikael Magnusson
  2016-10-06 18:31   ` Bart Schaefer
  2016-10-05 17:25 ` Bart Schaefer
  1 sibling, 1 reply; 7+ messages in thread
From: Mikael Magnusson @ 2016-10-05 12:41 UTC
  To: zsh workers

On Wed, Oct 5, 2016 at 1:48 PM, Vincent Lefevre <vincent@vinc17.net> wrote:
> With Debian's zsh 5.2-5 + some patches, when I execute commands with
> some particular Unicode characters, the UTF-8 sequences are rewritten
> incorrectly in the history. For instance:

History entries are written in metafied form to the history file. You
can use the 'history' command to print a readable version of history,
or use this small utility http://mika.l3ib.org/code/unmetafy.c
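
For readers who only want the gist of the decoding step, here is a minimal Python sketch. It assumes, based on the bytes reported at the start of this thread and on 0x83 being zsh's Meta byte, that a metafied byte is stored as 0x83 followed by the original byte XOR 0x20. It is not the code from unmetafy.c, and the function name is illustrative.

```python
META = 0x83  # zsh's Meta byte, as seen in the history file dump


def unmetafy(data: bytes) -> bytes:
    """Undo metafication: drop each Meta byte and XOR the next byte with 0x20."""
    out = bytearray()
    it = iter(data)
    for b in it:
        if b == META:
            nxt = next(it, None)
            if nxt is None:
                # Dangling Meta at end of input: keep it unchanged.
                out.append(b)
                break
            out.append(nxt ^ 0x20)
        else:
            out.append(b)
    return bytes(out)
```

Applied to the bytes e2 83 b4 80 from the original report, this yields e2 94 80, the UTF-8 encoding of U+2500.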

-- 
Mikael Magnusson



* Re: zsh generates invalid UTF-8 encoding in the history
  2016-10-05 11:48 zsh generates invalid UTF-8 encoding in the history Vincent Lefevre
  2016-10-05 12:41 ` Mikael Magnusson
@ 2016-10-05 17:25 ` Bart Schaefer
  1 sibling, 0 replies; 7+ messages in thread
From: Bart Schaefer @ 2016-10-05 17:25 UTC
  To: zsh-workers

On Oct 5,  1:48pm, Vincent Lefevre wrote:
}
} With Debian's zsh 5.2-5 + some patches, when I execute commands with
} some particular Unicode characters, the UTF-8 sequences are rewritten
} incorrectly in the history.

They're not rewritten incorrectly; the zsh history file is stored in
zsh's internal metafied format.



* Re: zsh generates invalid UTF-8 encoding in the history
  2016-10-05 12:41 ` Mikael Magnusson
@ 2016-10-06 18:31   ` Bart Schaefer
  2016-10-07  8:57     ` Vincent Lefevre
  0 siblings, 1 reply; 7+ messages in thread
From: Bart Schaefer @ 2016-10-06 18:31 UTC
  To: zsh workers

On Oct 5,  2:41pm, Mikael Magnusson wrote:
}
} History entries are written in metafied form to the history file. You
} can use the 'history' command to print a readable version of history,
} or use this small utility http://mika.l3ib.org/code/unmetafy.c

Nice little utility, but it doesn't handle converting "\\\n" into "\n".
Maybe that's not desirable anyway, as the downstream utility might
want to do that itself.
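
If a downstream tool does want to do that conversion itself, it could look roughly like the following Python sketch. It assumes that a physical line ending in a backslash continues into the next line, and it does not handle the extended-history prefix or a command that genuinely ends in a backslash. The function name is illustrative.

```python
def unfold_entries(text):
    """Join backslash-newline continuations into single history entries."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a trailing newline does not start a new entry

    entries = []
    parts = []
    for line in lines:
        if line.endswith("\\"):
            # Drop the backslash and keep collecting this entry.
            parts.append(line[:-1])
        else:
            parts.append(line)
            entries.append("\n".join(parts))
            parts = []
    if parts:
        # Input ended in the middle of a continued entry.
        entries.append("\n".join(parts))
    return entries
```

Each returned entry contains real newlines where the file had backslash-newline pairs.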

To unmetafy with the shell (including backslash-newline processing):

    () { fc -pa $HISTFILE $SAVEHIST && fc -rnl -1 1 }

Or more generically

    unmetafy() {
      () { fc -pa $1 ${$(wc -l $1)[1]} && fc -rnl -1 1 } ${1:-=(cat)}
    }

Vincent might also consider using a zshaddhistory hook to maintain a copy
of the history file that is unmetafied and stripped of extended-history
markup, though that gets a bit more complicated.



* Re: zsh generates invalid UTF-8 encoding in the history
  2016-10-06 18:31   ` Bart Schaefer
@ 2016-10-07  8:57     ` Vincent Lefevre
  2016-10-07 17:01       ` Bart Schaefer
  0 siblings, 1 reply; 7+ messages in thread
From: Vincent Lefevre @ 2016-10-07  8:57 UTC
  To: zsh-workers

On 2016-10-06 11:31:12 -0700, Bart Schaefer wrote:
> To unmetafy with the shell (including backslash-newline processing):
> 
>     () { fc -pa $HISTFILE $SAVEHIST && fc -rnl -1 1 }

Thanks. Note that it is not clear in the man page whether the
least recent starts at 0 or at 1. And when -r is used, whether
one should use <most recent> <least recent> or the reverse.

And how can one cleanly append history lines to a history file?
I was using "cat some_file >> ~/.zhistory", which seems to work,
but I suppose that there may be clashes if zsh thinks that some
lines are already metafied. I can't see a way to do this, and
perhaps there should be a new feature.

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)



* Re: zsh generates invalid UTF-8 encoding in the history
  2016-10-07  8:57     ` Vincent Lefevre
@ 2016-10-07 17:01       ` Bart Schaefer
  2017-11-29 15:46         ` Vincent Lefevre
  0 siblings, 1 reply; 7+ messages in thread
From: Bart Schaefer @ 2016-10-07 17:01 UTC
  To: zsh-workers

On Oct 7, 10:57am, Vincent Lefevre wrote:
} Subject: Re: zsh generates invalid UTF-8 encoding in the history
}
} Thanks. Note that it is not clear in the man page whether the
} least recent starts at 0 or at 1. And when -r is used, whether
} one should use <most recent> <least recent> or the reverse.

Indeed, the use of "first last" in the doc is a bit confusing, and
so is the interpretation of the numbers by the command.  It's always
most then least recent; but negative numbers count backwards
from the most recent (largest number), and the default behavior is
described in terms of negative offsets, so the order is easy to mix up.
 
} And how can one cleanly append history lines to a history file?
} I was using "cat some_file >> ~/.zhistory", which seems to work,

It'll work as long as there are no 0x83 bytes in some_file.
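
A quick way to check that condition before appending is to scan the file for 0x83 bytes. A minimal Python sketch with illustrative function names:

```python
META = 0x83  # the byte that zsh would interpret as Meta when reading the file


def has_meta_byte(data: bytes) -> bool:
    """Return True if data contains at least one 0x83 byte."""
    return META in data


def lines_with_meta(data: bytes):
    """Return the 1-based numbers of the lines that contain a 0x83 byte."""
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if META in line
    ]
```

For example, U+0103 is encoded in UTF-8 as c4 83, so a line containing it would be flagged, whereas U+2500 (e2 94 80) would not.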

To be completely safe, you need to do something like this:

  # Pass input file as $1, output as $2
  append_plain_file_to_history_file() {
      emulate -LR zsh
      local -a entries

      # Implementation issue:  read -r ignores backslash-newline
      # folding, but without -r embedded backslashes are stripped,
      # which seems a bigger problem.  Fix up $entries later.

      IFS=$'\n' read -r -d '' -A entries <$1
      (( $#entries )) || return

      # Must supply a file name here to set HISTSIZE and SAVEHIST
      fc -pa /dev/null $#entries $(( SAVEHIST + $#entries ))

      while (( $#entries )); do
          if [[ "$entries[1]" == *\\ ]]; then
              entries[1,2]=( ${entries[1]%\\}$'\n'${entries[2]} )
          else
              print -S $entries[1]
              shift 1 entries
          fi
      done
      fc -A ${2:-$HISTFILE}

      # Reset SAVEHIST to avoid attempting to lock /dev/null
      SAVEHIST=0	# fc -p makes this implicitly local
  }
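
For comparison, this is roughly what the encoding step amounts to at the byte level, sketched in Python. The set of bytes that gets escaped is an assumption here: 0x00 plus 0x83 through 0xa2, which is consistent with 0x94 being escaped and 0x80 being left alone in the example at the start of the thread. The real set is defined in the zsh sources and may differ between versions, so it is a parameter. All names are illustrative.

```python
META = 0x83

# ASSUMPTION: bytes that zsh escapes when metafying; version-dependent.
ASSUMED_META_BYTES = frozenset({0x00}) | frozenset(range(0x83, 0xA3))


def metafy(data: bytes, meta_bytes=ASSUMED_META_BYTES) -> bytes:
    """Escape each byte in meta_bytes as Meta followed by (byte XOR 0x20)."""
    out = bytearray()
    for b in data:
        if b in meta_bytes:
            out += bytes((META, b ^ 0x20))
        else:
            out.append(b)
    return bytes(out)
```

Under that assumption, e2 94 80 becomes e2 83 b4 80, matching the bytes reported in the history file, and plain ASCII command lines pass through unchanged.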



* Re: zsh generates invalid UTF-8 encoding in the history
  2016-10-07 17:01       ` Bart Schaefer
@ 2017-11-29 15:46         ` Vincent Lefevre
  0 siblings, 0 replies; 7+ messages in thread
From: Vincent Lefevre @ 2017-11-29 15:46 UTC
  To: zsh-workers

Hi,

This is a bit old, but...

On 2016-10-07 10:01:37 -0700, Bart Schaefer wrote:
> On Oct 7, 10:57am, Vincent Lefevre wrote:
> } And how can one cleanly append history lines to a history file?
> } I was using "cat some_file >> ~/.zhistory", which seems to work,
> 
> It'll work as long as there are no 0x83 bytes in some_file.
> 
> To be completely safe, you need to do something like this:
> 
>   # Pass input file as $1, output as $2
>   append_plain_file_to_history_file() {
>       emulate -LR zsh
>       local -a entries
> 
>       # Implementation issue:  read -r ignores backslash-newline
>       # folding, but without -r embedded backslashes are stripped,
>       # which seems a bigger problem.  Fix up $entries later.
> 
>       IFS=$'\n' read -r -d '' -A entries <$1
>       (( $#entries )) || return
> 
>       # Must supply a file name here to set HISTSIZE and SAVEHIST
>       fc -pa /dev/null $#entries $(( SAVEHIST + $#entries ))
> 
>       while (( $#entries )); do
>           if [[ "$entries[1]" == *\\ ]]; then
>               entries[1,2]=( ${entries[1]%\\}$'\n'${entries[2]} )
>           else
>               print -S $entries[1]

There should be a -r option:

  print -r -S $entries[1]

otherwise \r yields a CR character.

>               shift 1 entries
>           fi
>       done
>       fc -A ${2:-$HISTFILE}
> 
>       # Reset SAVEHIST to avoid attempting to lock /dev/null
>       SAVEHIST=0	# fc -p makes this implicitly local
>   }

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

