Broken assumptions in gen_wctype

Rich Felker dalias at libc.org
Fri Mar 6 19:39:41 UTC 2015


I just tried building uclibc (via buildroot) on my Alpine Linux
system, which is based on musl libc. I encountered some portability
issues in the locale generation code which I'll report separately
later, but the big issue I found here was a silent failure to generate
wctables.h. Looking into gen_wctype.c, I found it's failing at this
test:

	if ((l != (short)l) || (u != (short)u)) {
		verbose_msg("range assumption error!  %x  %ld  %ld\n", c, l, u);
		return EXIT_FAILURE;
	}

Apparently uclibc's locale system encodes an assumption that
towlower/towupper map each character to a result whose distance from
the original character fits in a signed 16-bit offset. This assumption
is false for current versions of Unicode (and even fairly old ones);
at least the following characters fail it:

char   low  up   dl du
0265 ɥ 0265 a78d 0 42280
0266 ɦ 0266 a7aa 0 42308
1d79 ᵹ 1d79 a77d 0 35332
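
To make the failure concrete, here is a minimal sketch that recomputes
the offsets for the three mappings listed above and applies the same
signed 16-bit range test. The helper names (fits_signed16, case_delta,
check_table_examples) are hypothetical and do not come from
gen_wctype.c; the code points are hard-coded so the result does not
depend on the host's locale data.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: equivalent in intent to gen_wctype's
 * (l != (short)l) test, written with explicit 16-bit limits. */
int fits_signed16(long delta)
{
	return delta >= INT16_MIN && delta <= INT16_MAX;
}

/* Offset of a case mapping relative to the source character. */
long case_delta(unsigned long from, unsigned long to)
{
	return (long)to - (long)from;
}

/* Self-check using the three mappings from the table above. */
void check_table_examples(void)
{
	assert(case_delta(0x0265, 0xa78d) == 42280);
	assert(case_delta(0x0266, 0xa7aa) == 42308);
	assert(case_delta(0x1d79, 0xa77d) == 35332);

	/* All three offsets exceed 32767, so the range test rejects them. */
	assert(!fits_signed16(case_delta(0x0265, 0xa78d)));
	assert(!fits_signed16(case_delta(0x0266, 0xa7aa)));
	assert(!fits_signed16(case_delta(0x1d79, 0xa77d)));
}
```

Calling check_table_examples() from a test program should pass
silently; each offset is larger than the signed 16-bit maximum.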

Presumably the only reason uclibc's gen_wctype works right now on
glibc-based hosts is that glibc's character data tracks a severely
outdated version of Unicode. This is about to change; see
https://sourceware.org/ml/libc-alpha/2014-11/msg00664.html and the
thread that extends into the following months. So without a fix,
uclibc will probably break soon "in the wild".

A suitable replacement assumption might be that towupper/towlower stay
in the same "plane", so that instead of a signed 16-bit offset, uclibc
could store an unsigned 16-bit value that replaces the low 16 bits of
the character. I have no idea how practical this might be to
implement, but if it works it would at least avoid increasing the size
of the tables.
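
The same-plane idea could be sketched roughly as follows. This is only
an illustration under the assumption stated above (every mapping stays
within its 64K plane); the function names are hypothetical and nothing
here reflects how uclibc's tables are actually laid out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical check: do both characters lie in the same 64K plane? */
int same_plane(uint32_t from, uint32_t to)
{
	return (from >> 16) == (to >> 16);
}

/* Hypothetical table entry: the low 16 bits of the mapped character. */
uint16_t encode_low16(uint32_t to)
{
	return (uint16_t)(to & 0xffff);
}

/* Rebuild the mapped character: keep the plane bits of c and
 * replace its low 16 bits with the stored value. */
uint32_t apply_low16(uint32_t c, uint16_t low)
{
	return (c & ~(uint32_t)0xffff) | (uint32_t)low;
}

/* Self-check with one of the mappings that breaks the signed offset. */
void check_low16_example(void)
{
	assert(same_plane(0x0265, 0xa78d));
	assert(apply_low16(0x0265, encode_low16(0xa78d)) == 0xa78d);
}
```

An entry is still 16 bits wide, so the table size would not grow; a
generator using this scheme would need to fail loudly if same_plane()
were ever false for some mapping.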

Rich

