Broken assumptions in gen_wctype

Fri Mar 6 19:39:41 UTC 2015

I just tried building uclibc (via buildroot) on my Alpine Linux
system, which is based on musl libc. I encountered some portability
issues in the locale generation code which I'll report separately
later, but the big issue I found here was a silent failure to generate
wctables.h. Looking into gen_wctype.c, I found it's failing at this
test:

	if ((l != (short)l) || (u != (short)u)) {
		verbose_msg("range assumption error!  %x  %ld  %ld\n", c, l, u);
		return EXIT_FAILURE;
	}

Apparently uclibc's locale system encodes an assumption that
towlower/towupper map each character to one whose offset from the
original fits in a signed 16-bit integer. This assumption is false for
current versions of Unicode (and even fairly old ones); at least the
following characters fail it:

char   low  up   dl du
0265 ɥ 0265 a78d 0 42280
0266 ɦ 0266 a7aa 0 42308
1d79 ᵹ 1d79 a77d 0 35332
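The arithmetic can be checked without depending on any host libc's
case tables. The following is only a minimal sketch: the pairs are
copied by hand from the table above, and the names pairs,
offset_fits_int16 and count_overflows are made up for illustration;
none of them come from gen_wctype.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Pairs copied from the table above: { character, other-case form }. */
static const uint32_t pairs[][2] = {
	{ 0x0265, 0xa78d },
	{ 0x0266, 0xa7aa },
	{ 0x1d79, 0xa77d },
};

/* Hypothetical helper (not from gen_wctype.c): nonzero if the offset
 * mapped - c fits in a signed 16-bit value. */
static int offset_fits_int16(uint32_t c, uint32_t mapped)
{
	long d;

	assert(c <= 0x10ffff && mapped <= 0x10ffff);
	d = (long)mapped - (long)c;
	return d >= INT16_MIN && d <= INT16_MAX;
}

/* Print every listed pair whose offset overflows 16 signed bits and
 * return how many there were. */
static int count_overflows(void)
{
	int n = 0;
	size_t i;

	for (i = 0; i < sizeof pairs / sizeof pairs[0]; i++) {
		uint32_t c = pairs[i][0], m = pairs[i][1];

		if (!offset_fits_int16(c, m)) {
			printf("%04lx -> %04lx: offset %ld\n",
			       (unsigned long)c, (unsigned long)m,
			       (long)m - (long)c);
			n++;
		}
	}
	return n;
}
```

Calling count_overflows() from a main() should report all three
pairs, since each offset exceeds 32767.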

Presumably the only reason uclibc's gen_wctype works right now on
glibc-based hosts is that glibc's Unicode data is severely
outdated. This is about to change; see
https://sourceware.org/ml/libc-alpha/2014-11/msg00664.html and the
thread that extends into the following months. So without a fix,
uclibc will probably break soon "in the wild".

A suitable replacement assumption might be that towupper/towlower stay
in the same "plane", so that instead of a signed 16-bit offset, uclibc
could store an unsigned 16-bit value that replaces the low 16 bits of
the character. I have no idea how practical this might be to
implement, but if it works it would at least avoid increasing the size
of the tables.
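
To make the idea concrete, here is a rough sketch of the same-plane
encoding. It is not a patch: same_plane, encode_low16 and decode_low16
are hypothetical names, and how such values would fit into the real
table layout is not addressed here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: nonzero if both code points lie in the same plane,
 * i.e. they agree in every bit above the low 16. */
static int same_plane(uint32_t c, uint32_t mapped)
{
	return (c >> 16) == (mapped >> 16);
}

/* Hypothetical table entry: the low 16 bits of the mapped character.
 * The same-plane assumption is checked instead of the offset range. */
static uint16_t encode_low16(uint32_t c, uint32_t mapped)
{
	assert(same_plane(c, mapped));
	return (uint16_t)(mapped & 0xffff);
}

/* Rebuild the mapped character: keep the plane bits of c and replace
 * its low 16 bits with the stored value. */
static uint32_t decode_low16(uint32_t c, uint16_t low)
{
	return (c & ~(uint32_t)0xffff) | low;
}
```

With this, decode_low16(0x0265, encode_low16(0x0265, 0xa78d)) gives
back 0xa78d even though the offset does not fit in a short. The entry
is still 16 bits wide.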

Rich