public inbox for developer@lists.illumos.org (since 2011-08)
Subject: [REVIEW] 16127 regex misidentifies mixed sets as singletons
From: Bill Sommerfeld
Date: 2023-12-19 21:19 UTC
To: developer

Issue: https://www.illumos.org/issues/16127
CR: https://code.illumos.org/c/illumos-gate/+/3193
Diff: 
https://code.illumos.org/~diff/fb3b42210e4a2961ef48c4e3513986be3e03a552

While working on other fixes to the regex code I spotted a bug in the
"singleton" function used by regcomp() to turn character set matches
into exact character matches if a character set has exactly one
element.

The underlying cset representation is complex; most critically for
review, it records "small" characters (codepoint less than either 128
or 256 depending on locale) in a bit vector, and "wide" characters in
a secondary array.

Unfortunately, the "singleton" function used to identify singleton sets
treated a cset as a singleton if either the "small" or the "wide" set
had exactly one element (it would then ignore the other set).
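
To make the failure mode concrete, here is a minimal sketch in C.  It
is not the illumos code: toy_cset, toy_add, singleton_buggy and
singleton_fixed are hypothetical names, the 128 cutoff is one of the two
bounds mentioned above, and the toy structure leaves out everything
else the real cset holds.  It only contrasts "either half has one
element" with "both halves together have one element".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	TOY_NSMALL	128	/* assumed bound for "small" code points */
#define	TOY_MAXWIDE	8	/* arbitrary capacity for this sketch */

/* Hypothetical, simplified stand-in for the real cset. */
typedef struct {
	bool	small[TOY_NSMALL];	/* models the bit vector */
	long	wides[TOY_MAXWIDE];	/* models the secondary array */
	size_t	nwides;
} toy_cset;

/* Add one code point to whichever half it belongs in. */
static void
toy_add(toy_cset *cs, long ch)
{
	if (ch >= 0 && ch < TOY_NSMALL) {
		cs->small[ch] = true;
	} else {
		assert(cs->nwides < TOY_MAXWIDE);
		cs->wides[cs->nwides++] = ch;
	}
}

/* Count the small members; *last receives the highest one seen. */
static size_t
toy_count_small(const toy_cset *cs, long *last)
{
	size_t n = 0;

	for (long ch = 0; ch < TOY_NSMALL; ch++) {
		if (cs->small[ch]) {
			n++;
			*last = ch;
		}
	}
	return (n);
}

/* Buggy shape: one element in EITHER half is taken as a singleton. */
static long
singleton_buggy(const toy_cset *cs)
{
	long s = -1;

	if (toy_count_small(cs, &s) == 1)
		return (s);		/* wide half ignored */
	if (cs->nwides == 1)
		return (cs->wides[0]);	/* small half ignored */
	return (-1);
}

/* Fixed shape: a singleton only if both halves TOGETHER hold one. */
static long
singleton_fixed(const toy_cset *cs)
{
	long s = -1;
	size_t nsmall = toy_count_small(cs, &s);

	if (nsmall + cs->nwides != 1)
		return (-1);
	return (nsmall == 1 ? s : cs->wides[0]);
}
```

For a set built from 'a', 'b' and 0xE0 (the code point of the accented
character in the example below), singleton_buggy() returns 0xE0 while
singleton_fixed() returns -1, meaning "not a singleton".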

The easiest way to demonstrate this bug:

	$ export LANG=C.UTF-8
	$ echo 'a' | grep '[abà]'

It should match (and print "a"), but instead it doesn't match: the set
holds exactly one wide character (the accented one), so the whole set
is misinterpreted as a singleton containing only that character.
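
The same check can be made from C through the standard regcomp() and
regexec() interfaces.  This is only a sketch: try_match and repro_16127
are hypothetical helper names, and it assumes a C.UTF-8 locale is
installed.  The accented character is written as its UTF-8 bytes so the
source file's own encoding doesn't matter.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/*
 * Compile pattern as a basic regular expression and match it against
 * subject.  Returns 0 on a match, REG_NOMATCH on no match, or another
 * nonzero regcomp() error code if the pattern does not compile.
 */
static int
try_match(const char *pattern, const char *subject)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && subject != NULL);

	rc = regcomp(&re, pattern, REG_NOSUB);
	if (rc != 0)
		return (rc);
	rc = regexec(&re, subject, 0, NULL, 0);
	regfree(&re);
	return (rc);
}

/*
 * Mirrors the grep example.  Returns -1 if the C.UTF-8 locale is not
 * available, otherwise the try_match() result; 0 is the correct
 * answer, REG_NOMATCH is the symptom described above.
 */
static int
repro_16127(void)
{
	if (setlocale(LC_ALL, "C.UTF-8") == NULL)
		return (-1);
	/* "\xc3\xa0" is a-grave (U+00E0) encoded in UTF-8 */
	return (try_match("[ab\xc3\xa0]", "a"));
}
```

A pattern whose set has only "small" members, such as '[ab]', does not
go through the mixed small/wide case and can serve as a control.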

Fixing this requires reworking "singleton" and adding a few test cases.

Thanks in advance for your review.

					- Bill



