Broken assumptions in gen_wctype

Sat Mar 14 22:15:47 UTC 2015

On March 6, 2015 8:39:41 PM GMT+01:00, Rich Felker <dalias at libc.org> wrote:
>I just tried building uclibc (via buildroot) on my Alpine Linux
>system, which is based on musl libc. I encountered some portability
>issues in the locale generation code which I'll report separately
>later, but the big issue I found here was a silent failure to generate
>wctables.h. Looking into gen_wctype.c, I found it's failing at this
>test:
>
>	if ((l != (short)l) || (u != (short)u)) {
>		verbose_msg("range assumption error!  %x  %ld  %ld\n", c, l, u);
>		return EXIT_FAILURE;
>	}
>
>Apparently uclibc's locale system encodes an assumption that
>towlower/towupper move a character by an offset that fits in a signed
>16-bit value. This assumption is false for current versions of
>Unicode (and even fairly old ones); at least the following characters
>fail it:
>
>char   low  up   dl du
>0265 ɥ 0265 a78d 0 42280
>0266 ɦ 0266 a7aa 0 42308
>1d79 ᵹ 1d79 a77d 0 35332
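
To make the failure concrete, here is a minimal sketch of the arithmetic behind that test, using the code points from the table. The helper names are made up for illustration and are not taken from gen_wctype.c, and the sketch assumes a 16-bit short, as on the hosts discussed here.

```c
#include <limits.h>

/* Hypothetical helper: distance from a character to its case-mapped
   counterpart, like the l and u values in the quoted test. */
static long case_offset(unsigned long c, unsigned long mapped)
{
	return (long)mapped - (long)c;
}

/* Equivalent to the quoted (l != (short)l) check, written as a range
   test: non-zero when the offset survives storage in a short. */
static int offset_fits_short(long off)
{
	return off >= SHRT_MIN && off <= SHRT_MAX;
}
```

For 0265 and a78d the offset is 42280, which is above SHRT_MAX (32767), so the generator bails out. An ordinary pair such as 'a' and 'A' gives -32 and passes.
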
>
>Presumably the only reason uclibc's gen_wctype works right now on
>glibc-based hosts is that glibc's Unicode alignment is severely
>outdated. This is about to change; see
>https://sourceware.org/ml/libc-alpha/2014-11/msg00664.html and the
>thread that extends into the following months. So without a fix,
>uclibc will probably break soon "in the wild".
>
>A suitable replacement assumption might be that towupper/towlower stay
>in the same "plane", so that instead of a signed 16-bit offset, uclibc
>could use an unsigned 16-bit replacement of the low 16 bits. I have no
>idea how practical this might be to implement but if it works it would
>at least avoid increasing the size of the tables.
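
A rough sketch of how the same-plane idea could look, assuming each table entry holds the low 16 bits of the mapped character instead of a signed offset. All names here are hypothetical; nothing below is existing uclibc code, and whether it fits the current table layout is untested.

```c
#include <stdint.h>

/* Hypothetical check: both code points lie in the same 64K plane. */
static int same_plane(uint32_t a, uint32_t b)
{
	return (a >> 16) == (b >> 16);
}

/* Hypothetical table entry: the low 16 bits of the mapped character. */
static uint16_t encode_low16(uint32_t mapped)
{
	return (uint16_t)(mapped & 0xffff);
}

/* Keep the plane of c and substitute the stored low 16 bits. */
static uint32_t apply_low16(uint32_t c, uint16_t low16)
{
	return (c & ~(uint32_t)0xffff) | low16;
}
```

With this encoding 0265 would store a78d directly, so the distance between the two characters no longer matters. One difference from the offset scheme: a character with no case mapping would have to store its own low 16 bits rather than 0.
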

Mhm. Thanks a lot for the heads-up!

Cheers,