| Summary: | Unicode support logic badly broken | ||
|---|---|---|---|
| Product: | Busybox | Reporter: | bugdal |
| Component: | Other | Assignee: | unassigned |
| Status: | NEW --- | ||
| Severity: | normal | CC: | busybox-cvs, mad.deer |
| Priority: | P5 | ||
| Version: | unspecified | ||
| Target Milestone: | --- | ||
| Hardware: | PC | ||
| OS: | Linux | ||
| Host: | Target: | ||
| Build: | |||
|
Description
bugdal
2013-06-29 19:38:13 UTC
(In reply to comment #0)
> The unicode detection logic in Busybox is presently broken; rather than letting
> setlocale() determine the correct locale via the various LC_* environment
> variables and the implementation-defined default if none of them are set, it
> looks for the LANG variable itself and uses the C locale if LANG is not set.

There is no plan to have full locale support in busybox. For example, messages will not be translated, and only ASCII or Unicode encodings will be supported.

Please describe in what environment you observe incorrect busybox behavior, what that behavior is, and what behavior you consider to be correct in this case.

I realize I should have been clearer. I am not talking about message translation or any other i18n features, purely about processing character data as UTF-8 rather than a legacy 8-bit codepage.

Assuming busybox was compiled to use the system locale for Unicode, and is built against glibc or uClibc, the following commands should reproduce the bug:

    unset LANG LC_ALL LC_CTYPE
    LC_CTYPE=en_US.UTF-8 busybox ash

The expected behavior is that UTF-8 line editing works, since the LC_CTYPE category is set to a UTF-8 locale. The bug is that it doesn't work, because busybox inspects only the LANG variable rather than letting the system locale selection logic do its thing.

I posted more about the issue on the mailing list, in this thread:

http://lists.busybox.net/pipermail/busybox/2013-June/079463.html

I've hit a similar problem when trying to use a busybox shell script as /init in an initrd (initramfs). What's frustrating is that the environment is set by the bootloader (both isolinux and legacy grub) according to the boot parameters given by the user. That would be okay except for the fact that in this (not uncommon) situation the LANG variable appears to be somewhat immutable. I just need the length of Unicode strings.
The following code to count the number of characters in a string works consistently in my development environment regardless of what LANG is set to in the calling environment, but fails when run in the /init script in an initrd:

    echo -n "$1" | LANG=en_US.UTF-8 sed 's/./x/g' | wc -c

Likewise, exporting LANG does not help:

    export LANG=en_US.UTF-8

Someone else ran into this problem and submitted a patch to force the LANG variable to always be en_US.UTF-8 in order to get consistent Unicode support:

http://lists.busybox.net/pipermail/busybox/2014-June/081021.html

> Exporting LANG in rcS didnt have an effect.

If I could consistently control the LANG variable, I would be happy. I think this was the problem the rejected patch was addressing. The submitter had tried using export and the results were not consistent. If "export LANG=en_US.UTF-8" worked as expected, then there would be no need for the patch.

I'd also be happy if I could somehow unconditionally enable Unicode support, perhaps via a config option or a command line parameter.

The only way I've found to get consistent Unicode support is to use a ulang=xx boot parameter for the language to be used and always add a lang=en_US.UTF-8 boot parameter. This workaround is fragile, confusing, and cumbersome.

It is possible the immutability is caused by the bootloaders, but I ran into the same problem with two different bootloaders. It would be great if busybox could work around it somehow. |
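
For anyone landing here with the same need, the two points raised in this thread can be sketched in plain shell. This is only an illustration under stated assumptions, not busybox code: the function names `effective_ctype` and `utf8_len` are hypothetical, `"C"` stands in for the implementation-defined default locale, and the counting helper assumes the input is valid UTF-8 and that the available `tr` accepts octal ranges. The first function shows the POSIX precedence the original report expects (LC_ALL, then LC_CTYPE, then LANG). The second counts code points without depending on any locale setting, by deleting UTF-8 continuation bytes before counting.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of busybox.

# Print the locale name that POSIX precedence selects for LC_CTYPE:
# LC_ALL wins if set and non-empty, then LC_CTYPE, then LANG.
# "C" is an assumption standing in for the implementation-defined default.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'C\n'
    fi
}

# Count code points in a UTF-8 string independently of the locale.
# Continuation bytes (0x80-0xBF, octal 200-277) are deleted, leaving
# exactly one byte per code point, and the remaining bytes are counted.
# The trailing tr strips padding that some wc implementations emit.
utf8_len() {
    printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '
}
```

With the reproduction case from this report (LANG and LC_ALL unset, LC_CTYPE=en_US.UTF-8), `effective_ctype` prints `en_US.UTF-8`, which is the locale a setlocale()-based check would be expected to pick. `utf8_len "$1"` could replace the sed pipeline above; note that it counts code points, so combining sequences count as more than one.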