Created attachment 3289
Patch against latest master GIT

Unicode internal char tests are not working. But there is no need to use these checks to filter ls output, because invalid chars cannot exist in filenames.

So I made a small hack for when CONFIG_UNICODE_PRESERVE_BROKEN is selected:

https://github.com/tpruvot/android_external_busybox/commit/27a33d514c0fc12b06b99f31871330e67c27e9d6
https://github.com/tpruvot/android_external_busybox/commits/1_19

My fork, made for Bionic (and the Android build system), is linked to your git, so I can easily merge your latest changes.
Created attachment 3295
config file
Comment on attachment 3289
Patch against latest master GIT

diff from commit 12bc152b31420c3e3d441c87a995fe7b65dd23fe
(In reply to comment #0)
> Created attachment 3289
> Patch against latest master GIT
>
> Unicode internal char tests are not working.

Can you be more specific? What is your .config? Which filenames are not shown properly, and in what way?

> But there is no need to use these checks to filter ls output... because invalid chars cannot exist in filenames.

Try this:

touch `printf '\xff'`

Bingo: a file with an invalid Unicode string as its name is created.
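To see why a name consisting of the single byte 0xff is not valid Unicode: no well-formed UTF-8 sequence contains 0xff, whether as a lead byte or as a continuation byte. The following is a minimal validator sketch in C, not code taken from busybox; `is_valid_utf8` is a hypothetical helper introduced only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not from busybox): returns true if the n bytes
 * at s form well-formed UTF-8. */
static bool is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = s[i];
        unsigned long cp;
        size_t len;

        if (c < 0x80) {                 /* plain ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            len = 2; cp = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {
            len = 3; cp = c & 0x0F;
        } else if ((c & 0xF8) == 0xF0) {
            len = 4; cp = c & 0x07;
        } else {
            /* stray continuation byte (0x80..0xbf) or 0xf8..0xff */
            return false;
        }

        if (n - i < len)                /* sequence is cut short */
            return false;

        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;           /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        /* reject overlong forms, surrogates and values above U+10FFFF */
        if ((len == 2 && cp < 0x80) ||
            (len == 3 && cp < 0x800) ||
            (len == 4 && cp < 0x10000))
            return false;
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;
        if (cp > 0x10FFFF)
            return false;

        i += len;
    }
    return true;
}
```

With this sketch, the one-byte name created by the touch command above is rejected, while the two-byte sequence 0xce, 0x94 (U+0394) is accepted. The kernel performs no such check on filenames, which is why the file can be created at all.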
The config is in the attachments.

1.19 doesn't show Unicode filenames, such as Russian or Chinese ones (with both your tree and my Bionic fork, static and dynamic builds).

In your code (and mine):

https://github.com/tpruvot/android_external_busybox/blob/gingerbread/libbb/unicode.c#L42

the Unicode width test always results in 3.

I think I tried every kind of Unicode config, with the LOCALE env etc., and we only get "?" or "??" per char in ls output.
(In reply to comment #4)
> config is there in attachments

I can reproduce it with your config when I use a libc with limited locale support. With a "big" config and glibc, I can't.

> 1.19 doesn't show unicode filenames Like Russian Or Chinese (yours, my bionic
> fork) static and dynamic builds...
>
> in your code (and mine) :
>
> https://github.com/tpruvot/android_external_busybox/blob/gingerbread/libbb/unicode.c#L42

Yes. This is the test that checks whether the libc locale subsystem understands the byte sequence 0xce, 0x94 as one Unicode character. If it doesn't (width != 1), we conclude that the current locale (as set by $LANG) is not Unicode, and turn Unicode support off:

    unicode_status = (width == 1 ? UNICODE_ON : UNICODE_OFF);

From that point on, any chars with the high bit set are treated as invalid.

Either set $LANG properly (say, to "en_US.utf8") and/or make sure your libc does support Unicode, or unset CONFIG_UNICODE_USING_LOCALE in the .config. I unset it and now it works for me:

$ ./busybox ls /.2/video_rus/*_11.avi
/.2/video_rus/Штрафбат_11.avi
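The locale probe described in this comment can be sketched in a few lines of C. This is an illustration under stated assumptions, not the busybox source: the enum and the functions `probe_width`, `status_from_width` and `detect_unicode` are hypothetical names, and only the final comparison (`width == 1`) comes from the thread. The sketch relies on the POSIX behavior of `mbstowcs()` with a NULL destination, which returns the number of wide characters without storing them.

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names; the thread only shows UNICODE_ON / UNICODE_OFF. */
enum unicode_state { UNICODE_OFF, UNICODE_ON };

/* Ask libc how many characters it sees in the UTF-8 encoding of U+0394
 * (bytes 0xce 0x94). Returns (size_t)-1 if the current locale rejects
 * the sequence as invalid. */
static size_t probe_width(void)
{
    static const char delta[] = "\xce\x94";

    return mbstowcs(NULL, delta, 0);
}

/* Same decision as the line quoted in the thread: exactly one
 * character means the locale decodes UTF-8. */
static enum unicode_state status_from_width(size_t width)
{
    return width == 1 ? UNICODE_ON : UNICODE_OFF;
}

/* Select the locale from the environment ($LC_ALL, $LANG, ...),
 * then run the probe. */
static enum unicode_state detect_unicode(void)
{
    setlocale(LC_ALL, "");
    return status_from_width(probe_width());
}
```

On a libc with limited locale support, `setlocale()` cannot switch to a UTF-8 locale, so the probe sees something other than a single character and the result is UNICODE_OFF regardless of what the terminal can actually display.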
The problem is: you don't need to filter/convert ls output if the terminal is in UTF-8.
(In reply to comment #6)
> The problem is :
>
> You don't need to filter/convert ls output if terminal is in utf8

How does the program know that the terminal is in UTF-8?

If you selected CONFIG_UNICODE_USING_LOCALE=y, you said "use setlocale(LC_ALL, getenv("LANG"))". On your system that doesn't work.

If your system assumes that everything is working in Unicode, then set:

# CONFIG_UNICODE_USING_LOCALE is not set
# CONFIG_FEATURE_CHECK_UNICODE_IN_ENV is not set

and busybox will also assume that everything is working in Unicode; it will not look at $LANG.
CONFIG_UNICODE_PRESERVE_BROKEN is made to skip the filter, so why not do the same when Unicode is auto-disabled? The code would be smaller.
(In reply to comment #8)
> CONFIG_UNICODE_PRESERVE_BROKEN is made to skip filter, so why not doing the
> same if unicode is auto-disabled?

UNICODE_PRESERVE_BROKEN allows invalid Unicode sequences *on input*, not on output. Read its help text:

config UNICODE_PRESERVE_BROKEN
        bool "Make it possible to enter sequences of chars which are not Unicode"
        default n
        depends on UNICODE_SUPPORT
        help
          With this option on, invalid UTF-8 bytes are not substituted
          with the selected substitution character.
          For example, this means that entering 'l', 's', ' ', 0xff, [Enter]
          at shell prompt will list file named 0xff (single char name
          with char value 255), not file named '?'.

When we are in UNICODE_OFF mode, it means that we think that *output* devices don't support Unicode. And since bbox doesn't support anything other than ASCII and Unicode, and since bbox ls wants to produce non-garbled output, in UNICODE_OFF mode it replaces any bytes in filenames that are not printable ASCII with '?'.
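The UNICODE_OFF output behavior described above amounts to masking every byte outside the printable ASCII range. A minimal C sketch, assuming "printable ASCII" means bytes 0x20 through 0x7e; `mask_non_ascii` is a hypothetical name and this is not the busybox implementation:

```c
#include <stddef.h>

/* Hypothetical helper (not from busybox): overwrite, in place, every
 * byte outside printable ASCII (0x20..0x7e) with '?'. */
static void mask_non_ascii(char *name)
{
    for (unsigned char *p = (unsigned char *)name; *p != '\0'; p++) {
        if (*p < 0x20 || *p > 0x7e)
            *p = '?';
    }
}
```

Because the replacement works byte by byte, a two-byte UTF-8 character such as a Cyrillic letter comes out as "??", and a three-byte character would come out as "???".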
Closing, since this is not a bug. If you disagree, please reopen and explain what the buggy behavior is.