zsh_error_db --- hash-based database of error messages

zsh-workers
 help / color / mirror / code / Atom feed

* zsh_error_db --- hash-based database of error messages
@ 2022-12-16 16:42 Peter Stephenson
  2022-12-16 17:46 ` Daniel Shahaf
  2022-12-17 21:33 ` Oliver Kiddle
  0 siblings, 2 replies; 6+ messages in thread
From: Peter Stephenson @ 2022-12-16 16:42 UTC (permalink / raw)
  To: zsh workers

Following on from the retread of the discussion on error messages,
here's a very simply proof of concept for a hash-based database of
error messages.  Even if it's adopted I don't intend the C code
to get much larger as the point is to allow it to be able to do
everything in shell code.

Apart from more user-friendly messages which someone with no
access to the C code can dedicate their life to tweaking,
this is obviously a boon for internationalization (which
I approve of so much I've even spelled it with a 'z').

Trivial example:

% typeset -A zsh_error_db=(E3 "not very many matches found: %s")
% setopt nomatch                                                
% echo fudge*
zsh: not very many matches found: fudge*

Apart from updating the error message to examine the hash, the
only change is to prefix the error message format string with
the code such as "E3" and a colon.  This would obviously
be done universally.

The E, a set of digits, and a colon is tested for.  I originally
didn't have the E in front, but see my other comment on hash lookups
below.  There is obviously room for doing this differently, and
some way of making it easy to avoid duplicates would probably be
good, but I think it needs to be reasonably simple.

I didn't look exhaustively, but it seems we don't have a potted
function that takes the name of a hash and an entry and looks
it up as a string, though the code for this isn't very verbose
anyway.  We don't actually force the entry of the hash to
be a string type, but in practice I think it always is.
The "E" prefix results because at one stage numbers were
doing positional parameter lookups; I've since then gone
into the hash handling at lower level so that doesn't happen
any  more, but I left the E as giving a slightly more
extensible interface, which I expect everyone else is going
to tear apart anyway.

The signatures of the format strings, i.e. where the % escapes
live, is checked for, though if there are cases where
we have more than a single letter code it'll fall over.
There's no provision for reordering escapes, either, but I
wouldn't expect to do that in an initial implementation anyway.

pws

diff --git a/Src/exec.c b/Src/exec.c
index 2b7e0c7c5..5ad56eced 100644
--- a/Src/exec.c
+++ b/Src/exec.c
@@ -3333,7 +3333,7 @@ execcmd_exec(Estate state, Execcmd_params eparams,
 		     * has provided a status.
 		     */
 		    if (badcshglob == 1) {
-			zerr("no match");
+			zerr("E2:no match");
 			lastval = 1;
 			if (forked)
 			    _realexit();
diff --git a/Src/glob.c b/Src/glob.c
index 490bafc37..493c4227a 100644
--- a/Src/glob.c
+++ b/Src/glob.c
@@ -1868,7 +1868,7 @@ zglob(LinkList list, LinkNode np, int nountok)
 	if (isset(CSHNULLGLOB)) {
 	    badcshglob |= 1;	/* at least one cmd. line expansion failed */
 	} else if (isset(NOMATCH)) {
-	    zerr("no matches found: %s", ostr);
+	    zerr("E3:no matches found: %s", ostr);
 	    zfree(matchbuf, 0);
 	    restore_globstate(saved);
 	    return;
diff --git a/Src/subst.c b/Src/subst.c
index 0f98e6ea3..571673d11 100644
--- a/Src/subst.c
+++ b/Src/subst.c
@@ -504,7 +504,7 @@ globlist(LinkList list, int flags)
     if (noerrs)
 	badcshglob = 0;
     else if (badcshglob == 1)
-	zerr("no match");
+	zerr("E1:no match");
 }

 /* perform substitution on a single word */
diff --git a/Src/utils.c b/Src/utils.c
index edf5d3df7..e8070df1e 100644
--- a/Src/utils.c
+++ b/Src/utils.c
@@ -119,6 +119,93 @@ set_widearray(char *mb_array, Widechar_array wca)
 }
 #endif

+/**/
+static const char *
+zerrmsg_bad_signature(const char *code, const char *internal)
+{
+    /* Don't attempt to use error message system here! */
+    fprintf(stderr, "zsh_err_db entry `%s' has incorrect signature for:\n%s\n",
+	code, internal);
+    return internal;
+}
+
+/* Attempt to use hash zsh_error_db to update message */
+/**/
+static const char *
+zerrmsg_from_hash(const char *msg)
+{
+    Param errdb, msgpm;
+    HashTable errtab;
+    const char *postcode = msg, *sigmsg, *sigvar, *imsg;
+    char *errcode, *newmsg;
+
+    if (*postcode++ != 'E')
+	return msg;
+    while (idigit(*postcode))
+	++postcode;
+    if (postcode == msg || *postcode != ':')
+	return msg;
+
+    imsg = postcode+1;
+    errdb = (Param)paramtab->getnode(paramtab, "zsh_error_db");
+    if (!errdb || !(errdb->node.flags & PM_HASHED)) {
+	return imsg;
+    }
+
+    errcode = dupstrpfx(msg, postcode-msg);
+    errtab = errdb->gsu.h->getfn(errdb);
+    if (!errtab)
+	return imsg;
+    msgpm = (Param)errtab->getnode(errtab, errcode);
+    if (PM_TYPE(msgpm->node.flags)) {
+	/* Not a plain string, bail out (safety) */
+	return imsg;
+    }
+    newmsg = msgpm->gsu.s->getfn(msgpm);
+
+    if (!newmsg || !*newmsg)
+	return imsg;
+
+    /* Check the %-signature matches */
+    sigmsg = imsg;
+    sigvar = newmsg;
+
+    for (;;) {
+	while (*sigmsg && *sigmsg != '%')
+	    sigmsg++;
+	if (!*sigmsg)
+	    break;
+	++sigmsg;
+	if (*sigmsg == '%') {
+	    ++sigmsg;
+	    continue;
+	}
+	while (*sigvar) {
+	    if (*sigvar++ == '%')
+	    {
+		if (*sigvar != '%')
+		    break;
+		++sigvar;
+	    }
+	}
+	if (!*sigvar || *sigvar != *sigmsg)
+	    return zerrmsg_bad_signature(errcode, imsg);
+	++sigvar;
+	++sigmsg;
+    }
+    while (*sigvar)
+    {
+	if (*sigvar++ == '%')
+	{
+	    if (*sigvar != '%')
+		return zerrmsg_bad_signature(errcode, imsg);
+	    ++sigvar;
+	}
+    }
+
+    return newmsg;
+}
+

 /* Print an error

@@ -305,6 +392,8 @@ zerrmsg(FILE *file, const char *fmt, va_list ap)
     } else
 	fputc((unsigned char)' ', file);

+    fmt = zerrmsg_from_hash(fmt);
+
     while (*fmt)
 	if (*fmt == '%') {
 	    fmt++;

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: zsh_error_db --- hash-based database of error messages
  2022-12-16 16:42 zsh_error_db --- hash-based database of error messages Peter Stephenson
@ 2022-12-16 17:46 ` Daniel Shahaf
  2022-12-16 18:05   ` Peter Stephenson
  2022-12-17 21:33 ` Oliver Kiddle
  1 sibling, 1 reply; 6+ messages in thread
From: Daniel Shahaf @ 2022-12-16 17:46 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: zsh workers

Peter Stephenson wrote on Fri, Dec 16, 2022 at 16:42:53 +0000:
> Following on from the retread of the discussion on error messages,
> here's a very simply proof of concept for a hash-based database of
> error messages.  Even if it's adopted I don't intend the C code
> to get much larger as the point is to allow it to be able to do
> everything in shell code.
> 

So, tl;dr:

- Every error message would get an E42 identifier in the source string.

- The "E42" will be looked up as a string key in a well-known assoc,
  where the value will be a more elaborate error message.

- The more elaborate message, if there is one, will be used instead of
  the default message.

Code review below.

> +++ b/Src/utils.c
> @@ -119,6 +119,93 @@ set_widearray(char *mb_array, Widechar_array wca)
> +/* Attempt to use hash zsh_error_db to update message */
> +/**/
> +static const char *
> +zerrmsg_from_hash(const char *msg)
> +{
> +    Param errdb, msgpm;
> +    HashTable errtab;
> +    const char *postcode = msg, *sigmsg, *sigvar, *imsg;
> +    char *errcode, *newmsg;
> +
> +    if (*postcode++ != 'E')
> +	return msg;
> +    while (idigit(*postcode))
> +	++postcode;
> +    if (postcode == msg || *postcode != ':')
> +	return msg;

The first disjunct can't be true at this point, due to the earlier
«*postcode++».  Maybe you meant «postcode == msg + 1», to catch the case
"E:foo" with no digits?

> +
> +    imsg = postcode+1;
> +    errdb = (Param)paramtab->getnode(paramtab, "zsh_error_db");
> +    if (!errdb || !(errdb->node.flags & PM_HASHED)) {
> +	return imsg;
> +    }
> +
> +    errcode = dupstrpfx(msg, postcode-msg);
> +    errtab = errdb->gsu.h->getfn(errdb);
> +    if (!errtab)
> +	return imsg;
> +    msgpm = (Param)errtab->getnode(errtab, errcode);
> +    if (PM_TYPE(msgpm->node.flags)) {
> +	/* Not a plain string, bail out (safety) */
> +	return imsg;
> +    }
> +    newmsg = msgpm->gsu.s->getfn(msgpm);
> +
> +    if (!newmsg || !*newmsg)
> +	return imsg;
> +
> +    /* Check the %-signature matches */
> +    sigmsg = imsg;
> +    sigvar = newmsg;
> +
> +    for (;;) {
> +	while (*sigmsg && *sigmsg != '%')
> +	    sigmsg++;
> +	if (!*sigmsg)
> +	    break;

That's just reinventing strchr(), isn't it?

> +	++sigmsg;
> +	if (*sigmsg == '%') {
> +	    ++sigmsg;
> +	    continue;
> +	}
> +	while (*sigvar) {
> +	    if (*sigvar++ == '%')
> +	    {
> +		if (*sigvar != '%')
> +		    break;
> +		++sigvar;
> +	    }
> +	}
> +	if (!*sigvar || *sigvar != *sigmsg)
> +	    return zerrmsg_bad_signature(errcode, imsg);
> +	++sigvar;
> +	++sigmsg;
> +    }
> +    while (*sigvar)
> +    {
> +	if (*sigvar++ == '%')
> +	{
> +	    if (*sigvar != '%')
> +		return zerrmsg_bad_signature(errcode, imsg);
> +	    ++sigvar;
> +	}
> +    }
> +
> +    return newmsg;
> +}
> +
>  
>  /* Print an error
>  


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: zsh_error_db --- hash-based database of error messages
  2022-12-16 17:46 ` Daniel Shahaf
@ 2022-12-16 18:05   ` Peter Stephenson
  2022-12-16 20:05     ` Daniel Shahaf
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Stephenson @ 2022-12-16 18:05 UTC (permalink / raw)
  To: zsh-workers

On Fri, 2022-12-16 at 17:46 +0000, Daniel Shahaf wrote:
> Peter Stephenson wrote on Fri, Dec 16, 2022 at 16:42:53 +0000:
> > Following on from the retread of the discussion on error messages,
> > here's a very simply proof of concept for a hash-based database of
> > error messages.  Even if it's adopted I don't intend the C code
> > to get much larger as the point is to allow it to be able to do
> > everything in shell code.
> > 
> 
> So, tl;dr:
> 
> - Every error message would get an E42 identifier in the source string.
> 
> - The "E42" will be looked up as a string key in a well-known assoc,
>   where the value will be a more elaborate error message.
> 
> - The more elaborate message, if there is one, will be used instead of
>   the default message.

Yes, that's it in a nutshell; any further complexity would ideally
be in shell code.

I think at least we need something better than E <num> to try to avoid
duplicates, for example E <file-code> <num>.  It's easy to search
the file for a duplicate, a bit more of a pain to search the entire
codebase.  Maybe a two letter code, so we can have ZW for zle_word.c?

> Code review below.

I think those are all accurate, I'll update for the next version.

pws



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: zsh_error_db --- hash-based database of error messages
  2022-12-16 18:05   ` Peter Stephenson
@ 2022-12-16 20:05     ` Daniel Shahaf
  0 siblings, 0 replies; 6+ messages in thread
From: Daniel Shahaf @ 2022-12-16 20:05 UTC (permalink / raw)
  To: zsh-workers

Peter Stephenson wrote on Fri, 16 Dec 2022 18:05 +00:00:
> On Fri, 2022-12-16 at 17:46 +0000, Daniel Shahaf wrote:
>> Peter Stephenson wrote on Fri, Dec 16, 2022 at 16:42:53 +0000:
>> > Following on from the retread of the discussion on error messages,
>> > here's a very simply proof of concept for a hash-based database of
>> > error messages.  Even if it's adopted I don't intend the C code
>> > to get much larger as the point is to allow it to be able to do
>> > everything in shell code.
>> > 
>> 
>> So, tl;dr:
>> 
>> - Every error message would get an E42 identifier in the source string.
>> 
>> - The "E42" will be looked up as a string key in a well-known assoc,
>>   where the value will be a more elaborate error message.
>> 
>> - The more elaborate message, if there is one, will be used instead of
>>   the default message.
>

FWIW, I'm not sure whether I would prefer for the elaborate message to
be shown /instead of/ the default message or /in addition/ to it.

> Yes, that's it in a nutshell; any further complexity would ideally
> be in shell code.

How some other projects do this:

- Apache httpd: a APLOGNO(00042) macro in the C sources expands to
  "AH00042: ".  See
  https://svn.apache.org/repos/asf/httpd/httpd/trunk/docs/log-message-tags/README
  for details, including why they use a macro (their reasoning seems to
  apply to our case too).  They have a script for use before commit that
  assigns new error message numbers to newly-added APLOGNO() calls.

- Vim: E<number>.  I don't know what they do when they add new errors,
  but every E42 code has an entry in the manual's tags file (≈ an index
  entry), so something like `< runtime/doc/tags grep -E '^E[0-9]' |
  sort -n | tail -n1` would programmatically determine the highest
  unused E-number.

- Subversion: E<number> where the number is taken from a particular .h
  file in the project's sources.  The numbers are assigned non-consecutively
  and grouped by module (e.g., all CLI parse error codes are numerically
  next to each other).

> I think at least we need something better than E <num> to try to avoid
> duplicates, for example E <file-code> <num>.  It's easy to search
> the file for a duplicate, a bit more of a pain to search the entire
> codebase.  Maybe a two letter code, so we can have ZW for zle_word.c?

I don't think grepping all of Src/ is that much of a problem.  I do that
regularly.  It's also what httpd's aforementioned script does, since httpd
doesn't have a single file registry of all error numbers it has assigned.

So, perhaps E<num> would suffice for us, too.  Where it might fall short
is with third-party modules.  Therefore, I guess should invent some
scheme for third-party modules to use, say, «X-<identifier>:E<num>»
where <identifier> matches /[A-Za-z_][A-Za-z0- 9_]+/.

Conversely, if we want per-file error numbers:

- "A single .c source file" seems like the wrong level of abstraction to
  expose in a user-facing error message.

- Two-letter codes would effect a risk of collisions between different
  files.  I suppose we could use ${__file__#**/Src/} instead of the
  acronym of __file__'s basename, but see the previous bullet.  Or we
  could have a single file registry of assigned two-letter codes, but then
  why don't we amend that and have a single file registry of assigned
  global numeric codes?

- How'd we implement a script that automatically determines the highest
  error number already in use in a particular file?  (If we adopt the APLOGNO()
  idea this will be easy, but I'm not assuming either way.)

Cheers,

Daniel


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: zsh_error_db --- hash-based database of error messages
  2022-12-16 16:42 zsh_error_db --- hash-based database of error messages Peter Stephenson
  2022-12-16 17:46 ` Daniel Shahaf
@ 2022-12-17 21:33 ` Oliver Kiddle
  2022-12-19  9:19   ` Peter Stephenson
  1 sibling, 1 reply; 6+ messages in thread
From: Oliver Kiddle @ 2022-12-17 21:33 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: zsh workers

Peter Stephenson wrote:
> Following on from the retread of the discussion on error messages,
> here's a very simply proof of concept for a hash-based database of
> error messages.  Even if it's adopted I don't intend the C code
> to get much larger as the point is to allow it to be able to do
> everything in shell code.
>
> Apart from more user-friendly messages which someone with no
> access to the C code can dedicate their life to tweaking,
> this is obviously a boon for internationalization (which

If we want to address internationalisation, we might come to regret not
just using the normal POSIX catgets(3) interface and catalogue files.
What they do looks simple on the surface but there's probably hidden
areas of complexity that we'd end up duplicating. I can see the value of
allowing things to be done in shell code but the two are not mutually
exclusive.

While error messages encompass the majority of strings that zsh ever
needs to print there are some strings that aren't error messages.

> Apart from updating the error message to examine the hash, the
> only change is to prefix the error message format string with
> the code such as "E3" and a colon.  This would obviously
> be done universally.

E and a number sounds fine. I'm not keen on trying to encode file names
such as ZW for zle_word.c because we may want to move code from one file
to another in the future. If there's a need to divide things up, perhaps
do something like 1000s for history, 2000s for zle, etc like we have for
test cases. Parsing the E-number out of the string may be less flexible
than adding an extra parameter and/or having a macro to wrap it. If
duplicates are a concern scripts can be written to do checking or maybe
some clever use of X macros.

The best example for easy error messages I've come across is the Rust
compiler. This quotes your code with highlighting and colours and
finishes with a message like:
  For more information about this error, try `rustc --explain E0463`.
So there are two levels of detail about an error available.

If the only way to get friendly error explanations is to autoload
functions and enable plugins, the type of users that will benefit from
them will likely not find out about them until they're knowledgeable
enough not to want them. Perhaps prompting with the explain invocation
needs to be enabled by default with a setopt to turn that off. For
people that do turn it off, a special parameter could store the number
of the last error that was printed so that explanations can still be
accessed.

> The signatures of the format strings, i.e. where the % escapes
> live, is checked for, though if there are cases where
> we have more than a single letter code it'll fall over.
> There's no provision for reordering escapes, either, but I
> wouldn't expect to do that in an initial implementation anyway.

Reordering is quite common for internationalisation. Our printf builtin
does handle that with things like %2$s so we already have an
implementation of parsing printf formats but directed at a rather
aifferent purpose.

Oliver

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: zsh_error_db --- hash-based database of error messages
  2022-12-17 21:33 ` Oliver Kiddle
@ 2022-12-19  9:19   ` Peter Stephenson
  0 siblings, 0 replies; 6+ messages in thread
From: Peter Stephenson @ 2022-12-19  9:19 UTC (permalink / raw)
  To: Oliver Kiddle; +Cc: zsh workers

> If we want to address internationalisation, we might come to regret not
> just using the normal POSIX catgets(3) interface and catalogue files.
> What they do looks simple on the surface but there's probably hidden
> areas of complexity that we'd end up duplicating. I can see the value of
> allowing things to be done in shell code but the two are not mutually
> exclusive.

Actually, internationalization was only one possible use.  The idea
was to allow for more descriptive error messages for those who wanted
them.  But if that's not going anywhere this can be ditched.  It was
only a proposal to allow the minimum of work within the shell internals
that seemed it might get something off the ground.

> E and a number sounds fine.

It sounds like that, and a script to keep this in order, is the way
this would be heading, if it seems worth pursuing.  The script
should be able to list out the default messages, checked for
duplicates, and check for the next unassigned number.  If
the number appears in a macro, as noted by Daniel, that makes
it easy.  That can use ## to join strings so still doesn't
need a change to the message API.

pws

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-12-19  9:20 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-16 16:42 zsh_error_db --- hash-based database of error messages Peter Stephenson
2022-12-16 17:46 ` Daniel Shahaf
2022-12-16 18:05   ` Peter Stephenson
2022-12-16 20:05     ` Daniel Shahaf
2022-12-17 21:33 ` Oliver Kiddle
2022-12-19  9:19   ` Peter Stephenson

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).