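This caught my attention:

    static wchar_t
    charref(char *x, char *y)
    {
        wchar_t wc;
        size_t ret;

        if (!(patglobflags & GF_MULTIBYTE) || !(STOUC(*x) & 0x80))
            return (wchar_t) STOUC(*x);

Well, this is definitely not valid for an arbitrary multibyte character set. The shortcut assumes that any byte with the high bit clear is a complete character whose wide value equals the byte value. That holds for UTF-8, but not for, say, a stateful encoding such as ISO-2022-JP. I am just curious whether it is possible to consistently assume that UTF-8 is in use. That would definitely simplify things.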
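
For comparison, a version that makes no assumption about the encoding would have to hand every character to mbrtowc(), ASCII or not. Here is a minimal sketch of that; decode_char() is a hypothetical name of mine, not anything in the zsh source, and the fallback to the raw byte value on a bad sequence is an assumption about what the caller would want:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper, not zsh's charref(): decode the character that
 * starts at s (looking at no more than n bytes) using the encoding of
 * the current locale.
 *
 * Returns the number of bytes consumed.  On an invalid or incomplete
 * sequence it returns 0 and stores the raw byte value in *out.
 */
static size_t
decode_char(const char *s, size_t n, wchar_t *out)
{
    mbstate_t mbs;
    wchar_t wc;
    size_t ret;

    assert(s != NULL && out != NULL && n > 0);

    /* Start from the initial shift state for every call. */
    memset(&mbs, 0, sizeof mbs);
    ret = mbrtowc(&wc, s, n, &mbs);

    if (ret == (size_t)-1 || ret == (size_t)-2) {
        /* Invalid or truncated sequence: fall back to the byte itself. */
        *out = (wchar_t)(unsigned char)*s;
        return 0;
    }
    if (ret == 0) {
        /* mbrtowc() returns 0 for the null character; it is one byte. */
        *out = L'\0';
        return 1;
    }

    *out = wc;
    return ret;
}
```

The cost is one library call per character, which is presumably what the high-bit test is there to avoid. If UTF-8 (or at least an ASCII-compatible, stateless encoding) can be assumed, the test is safe and the call is only needed for bytes with the high bit set.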