Bug 6356 - Unicode support logic badly broken
Status: NEW
Alias: None
Product: Busybox
Classification: Unclassified
Component: Other
Version: unspecified
Hardware: PC Linux
Importance: P5 normal
Target Milestone: ---
Assignee: unassigned
URL:
Keywords:
Duplicates: 7538
Depends on:
Blocks:
 
Reported: 2013-06-29 19:38 UTC by bugdal
Modified: 2016-02-18 07:02 UTC
CC List: 2 users

See Also:
Host:
Target:
Build:


Description bugdal 2013-06-29 19:38:13 UTC
The unicode detection logic in Busybox is presently broken; rather than letting setlocale() determine the correct locale via the various LC_* environment variables and the implementation-defined default if none of them are set, it looks for the LANG variable itself and uses the C locale if LANG is not set.

My best guess is that the motivation for this elaborate-but-incorrect approach was to allow changes in the LANG variable in shells to take place immediately, even if they're not added to the environment in the shell process. However, the logic is not correct, so it's hardly usable.

To fix the issue, simply change the setlocale line in libbb/unicode.c from:

setlocale(LC_ALL, (LANG && LANG[0]) ? LANG : "C");

to:

setlocale(LC_ALL, "");

However, this does not address runtime changes to locale in shells. The only way I know to achieve the latter without setenv/putenv in the shell process whenever an environment variable is changed (which would be a bad idea) is to run the external "locale" utility (which could be a busybox internal if someone implements it) in a separate process, with all the locale-related variables exported to it. It can then determine the resulting LC_CTYPE and return it to the parent for the parent to pass to setlocale.

Personally, I think it would be acceptable for the shell to only honor the locale setting when it's started, and not to attempt to handle changes while it's running.
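
As a rough illustration of the startup-only approach described above, here is a minimal sketch of how a program can let setlocale() pick the locale from the environment and then ask libc whether the resulting character encoding is UTF-8. The function name env_locale_is_utf8 is hypothetical and this is not the actual libbb/unicode.c code; it assumes a libc that provides nl_langinfo(CODESET).

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper, not the real busybox code.
 * Returns 1 if the locale selected by the environment uses UTF-8, else 0. */
int env_locale_is_utf8(void)
{
	/* "" asks libc to apply LC_ALL, the LC_* variables and LANG itself. */
	if (setlocale(LC_ALL, "") == NULL) {
		/* The requested locale is not available; fall back to "C". */
		setlocale(LC_ALL, "C");
	}

	/* nl_langinfo(CODESET) names the encoding of the LC_CTYPE category. */
	const char *codeset = nl_langinfo(CODESET);

	/* "utf8" is accepted as well, on the assumption that some libcs
	 * may spell the codeset name that way. */
	return strcmp(codeset, "UTF-8") == 0 || strcmp(codeset, "utf8") == 0;
}
```

Called once at startup, this honors whichever of LC_ALL, LC_CTYPE or LANG is set, without the program reading any of those variables directly.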
Comment 1 Denys Vlasenko 2013-06-30 19:32:58 UTC
(In reply to comment #0)
> The unicode detection logic in Busybox is presently broken; rather than letting
> setlocale() determine the correct locale via the various LC_* environment
> variables and the implementation-defined default if none of them are set, it
> looks for the LANG variable itself and uses the C locale if LANG is not set.

There is no plan to have full locale support in busybox.
For example, messages will not be translated, and only ASCII or Unicode encodings will be supported.

Please describe the environment in which you observe incorrect busybox behavior, what that behavior is, and what behavior you consider to be correct in this case.
Comment 2 bugdal 2013-07-01 03:04:10 UTC
I realize I should have been clearer. I am not talking about message translation or any other i18n features, purely about processing character data as UTF-8 rather than a legacy 8-bit codepage.

Assuming busybox was compiled to use the system locale for Unicode, and is built against glibc or uClibc, the following commands should reproduce the bug:

unset LANG LC_ALL LC_CTYPE
LC_CTYPE=en_US.UTF-8 busybox ash

The expected behavior is that UTF-8 line editing works, since the LC_CTYPE category is set to a UTF-8 locale. The bug is that it doesn't work because busybox is only inspecting the LANG variable rather than letting the system locale selection logic do its thing.
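
For reference, the selection order that the system logic applies to the LC_CTYPE category can be sketched as below. This is only an illustration of the POSIX precedence (LC_ALL first, then LC_CTYPE, then LANG); the function name effective_ctype is hypothetical, and the "C" fallback is an assumption, since the default when nothing is set is implementation-defined.

```c
#include <stddef.h>

/* Hypothetical helper: returns the locale name that governs LC_CTYPE,
 * given the values of the three variables. NULL or "" means "not set". */
const char *effective_ctype(const char *lc_all, const char *lc_ctype,
                            const char *lang)
{
	if (lc_all && lc_all[0])
		return lc_all;      /* LC_ALL overrides every category */
	if (lc_ctype && lc_ctype[0])
		return lc_ctype;    /* then the category-specific variable */
	if (lang && lang[0])
		return lang;        /* LANG is only the last fallback */

	/* Nothing set: the real default is implementation-defined;
	 * "C" is assumed here for the sake of the example. */
	return "C";
}
```

With the variables from the commands above (LANG and LC_ALL unset, LC_CTYPE=en_US.UTF-8), this yields en_US.UTF-8, whereas a check that only looks at LANG ends up with "C".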

I posted more about the issue on the mailing list, in this thread:
http://lists.busybox.net/pipermail/busybox/2013-June/079463.html
Comment 3 BitJam 2014-08-03 00:23:38 UTC
I've hit a similar problem when trying to use a busybox shell script as /init in an initrd (initramfs).   What's frustrating is that the environment is set by the bootloader (both isolinux and legacy grub) according to the boot parameters given by the user.  That would be okay except for the fact that in this (not uncommon) situation the LANG variable appears to be somewhat immutable.

I just need the length of unicode strings.  The following code to count the number of characters in a string works consistently in my development environment regardless of what LANG is set to in the calling environment but fails when run in the /init script in an initrd:

echo -n "$1" | LANG=en_US.UTF-8 sed 's/./x/g' | wc -c

Likewise, exporting LANG does not help:

export LANG=en_US.UTF-8
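
To make clear what the sed/wc pipeline above is meant to compute when the locale is UTF-8, here is a small sketch that counts characters (code points) rather than bytes. It is only an illustration: the function name utf8_char_count is hypothetical, it is not busybox code, and it assumes the input is valid UTF-8.

```c
#include <stddef.h>

/* Counts code points in a NUL-terminated string assumed to be valid UTF-8.
 * Every byte except a continuation byte (bit pattern 10xxxxxx) starts a
 * new character, so counting the non-continuation bytes gives the length. */
size_t utf8_char_count(const char *s)
{
	size_t n = 0;

	for (; *s; s++) {
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	}
	return n;
}
```

When the tools run in the C locale instead, each byte is treated as one character, so the pipeline reports the byte count for any non-ASCII string.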

Someone else ran into this problem and submitted a patch to force the LANG variable to always be en_US.UTF-8 in order to get consistent Unicode support:
http://lists.busybox.net/pipermail/busybox/2014-June/081021.html

> Exporting LANG  in rcS didnt have an effect.

If I could consistently control the LANG variable, I would be happy.  I think this was the problem the rejected patch was addressing.  The submitter had tried using export and the results were not consistent.  If "export LANG=en_US.UTF-8" worked as expected then there would be no need for the patch.  I'd also be happy if I could somehow unconditionally enable unicode support perhaps via a config option or a command line parameter.

The only way I've found to get consistent unicode support is to use a ulang=xx boot parameter for the language to be used and always add a lang=en_US.UTF-8 boot parameter. This workaround is fragile, confusing, and cumbersome.

It is possible the immutability is caused by the bootloaders but I ran into the same problem with two different bootloaders.  It would be great if busybox could work around it somehow.
Comment 4 Mike Frysinger 2016-02-18 07:02:21 UTC
*** Bug 7538 has been marked as a duplicate of this bug. ***