Bug 3697 - [Patch] 1.19.0 (git) Allow unicode output for "busybox ls"
Summary: [Patch] 1.19.0 (git) Allow unicode output for "busybox ls"
Status: RESOLVED INVALID
Alias: None
Product: Busybox
Classification: Unclassified
Component: Other
Version: 1.18.x
Hardware: Other Linux
Importance: P5 major
Target Milestone: ---
Assignee: unassigned
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2011-05-09 18:31 UTC by Tanguy Pruvot
Modified: 2011-05-14 13:28 UTC
CC List: 2 users

See Also:
Host: x86_64
Target: arm
Build: arm-2010q1-188-arm-gnueabi and google bionic


Attachments
Patch against lastest master GIT (779 bytes, patch)
2011-05-09 18:31 UTC, Tanguy Pruvot
config file (26.90 KB, text/plain)
2011-05-09 18:35 UTC, Tanguy Pruvot

Description Tanguy Pruvot 2011-05-09 18:31:07 UTC
Created attachment 3289
Patch against lastest master GIT

Unicode internal char tests are not working.

But there is no need to use these checks to filter ls output, because invalid chars cannot exist in filenames.

So I made a small hack that applies when CONFIG_UNICODE_PRESERVE_BROKEN is selected:

https://github.com/tpruvot/android_external_busybox/commit/27a33d514c0fc12b06b99f31871330e67c27e9d6

https://github.com/tpruvot/android_external_busybox/commits/1_19

My fork, made for Bionic (and the Android build system), tracks your git repository, so I can easily merge your latest changes.
Comment 1 Tanguy Pruvot 2011-05-09 18:35:06 UTC
Created attachment 3295
config file
Comment 2 Tanguy Pruvot 2011-05-09 18:35:57 UTC
Comment on attachment 3289
Patch against lastest master GIT

Diff from commit 12bc152b31420c3e3d441c87a995fe7b65dd23fe
Comment 3 Denys Vlasenko 2011-05-12 00:16:46 UTC
(In reply to comment #0)
> Created attachment 3289
> Patch against lastest master GIT
> 
> Unicode internal char tests are not working.

Can you be more specific? What is your .config? Which filenames are not shown properly, and in what way are they not shown properly?

> But there is no need to use these checks to filter ls output, because invalid chars cannot exist in filenames.

Try this:

touch `printf '\xff'`

Bingo: file with invalid Unicode string as its name is created.
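
The reason that name is invalid: byte 0xff can never occur in well-formed UTF-8, yet the filesystem accepts it. A minimal sketch of such a validity check in C follows; `is_valid_utf8` is a hypothetical helper written for illustration and is not taken from the busybox sources.

```c
#include <stddef.h>

/* Hypothetical helper (not busybox code): returns 1 if the len bytes
 * at s form well-formed UTF-8, 0 otherwise. */
static int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;      /* continuation bytes expected after c */
        unsigned long cp;  /* code point being decoded */

        if (c < 0x80) {    /* plain ASCII */
            i++;
            continue;
        } else if (c >= 0xc2 && c <= 0xdf) {
            extra = 1; cp = c & 0x1f;
        } else if (c >= 0xe0 && c <= 0xef) {
            extra = 2; cp = c & 0x0f;
        } else if (c >= 0xf0 && c <= 0xf4) {
            extra = 3; cp = c & 0x07;
        } else {
            return 0;      /* 0x80..0xc1 and 0xf5..0xff never start a character */
        }

        if (len - i - 1 < extra)
            return 0;      /* sequence is cut off */

        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xc0) != 0x80)
                return 0;  /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3f);
        }

        /* Reject overlong encodings, surrogates, and values above U+10FFFF. */
        if (extra == 2 && cp < 0x800) return 0;
        if (extra == 3 && cp < 0x10000) return 0;
        if (cp >= 0xd800 && cp <= 0xdfff) return 0;
        if (cp > 0x10ffff) return 0;

        i += extra + 1;
    }
    return 1;
}
```

Called on the one-byte name created by the touch command above, this returns 0; called on the two bytes 0xce, 0x94 (U+0394), it returns 1.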
Comment 4 Tanguy Pruvot 2011-05-12 00:23:35 UTC
The config is in the attachments.

1.19 doesn't show Unicode filenames such as Russian or Chinese ones (in both your tree and my Bionic fork, static and dynamic builds).

In your code (and mine):

https://github.com/tpruvot/android_external_busybox/blob/gingerbread/libbb/unicode.c#L42

The Unicode width test always yields 3. I tried, I think, every kind of Unicode config together with the locale environment variables, and we only get "?" or "??" per char in ls output.
Comment 5 Denys Vlasenko 2011-05-12 00:58:37 UTC
(In reply to comment #4)
> The config is in the attachments.

I can reproduce it with your config, when I use libc with limited locale support. On "big" config with glibc, I can't.
 
> 1.19 doesn't show Unicode filenames such as Russian or Chinese ones (in both
> your tree and my Bionic fork, static and dynamic builds).
>
> In your code (and mine):
> 
> https://github.com/tpruvot/android_external_busybox/blob/gingerbread/libbb/unicode.c#L42

Yes. This tests whether the libc locale subsystem understands the byte sequence 0xce, 0x94 as one Unicode character. If it doesn't (width != 1), we conclude that the current locale (as set by $LANG) is not Unicode, and turn Unicode support off:

unicode_status = (width == 1 ? UNICODE_ON : UNICODE_OFF);

and from that point on, any byte with the high bit set will be treated as invalid.

Either set $LANG properly (say, to "en_US.utf8") and make sure your libc does support Unicode, or unset CONFIG_UNICODE_USING_LOCALE in the .config. I unset it and now it works for me:

$ ./busybox ls /.2/video_rus/*_11.avi
/.2/video_rus/Штрафбат_11.avi
Comment 6 Tanguy Pruvot 2011-05-12 01:06:09 UTC
The problem is:

You don't need to filter/convert ls output if the terminal is in UTF-8.
Comment 7 Denys Vlasenko 2011-05-12 01:12:16 UTC
(In reply to comment #6)
> The problem is:
> 
> You don't need to filter/convert ls output if the terminal is in UTF-8.

How does the program know that the terminal is in UTF-8?

If you selected CONFIG_UNICODE_USING_LOCALE=y, you said "use setlocale(LC_ALL, getenv("LANG"))". For your system it doesn't work.

If your system assumes that everything is working in Unicode, then set:

# CONFIG_UNICODE_USING_LOCALE is not set
# CONFIG_FEATURE_CHECK_UNICODE_IN_ENV is not set

and busybox will also assume that everything is working in Unicode; it will not look at $LANG.
Comment 8 Tanguy Pruvot 2011-05-12 02:32:21 UTC
CONFIG_UNICODE_PRESERVE_BROKEN exists to skip the filter, so why not do the
same when Unicode is auto-disabled?

That would also make the code smaller.
Comment 9 Denys Vlasenko 2011-05-12 10:02:10 UTC
(In reply to comment #8)
> CONFIG_UNICODE_PRESERVE_BROKEN exists to skip the filter, so why not do the
> same when Unicode is auto-disabled?

UNICODE_PRESERVE_BROKEN enables invalid Unicode sequences *on input*, not on output. Read its help text:

config UNICODE_PRESERVE_BROKEN
        bool "Make it possible to enter sequences of chars which are not Unicode"
        default n
        depends on UNICODE_SUPPORT
        help
          With this option on, invalid UTF-8 bytes are not substituted
          with the selected substitution character.
          For example, this means that entering 'l', 's', ' ', 0xff, [Enter]
          at shell prompt will list file named 0xff (single char name
          with char value 255), not file named '?'.

When we are in UNICODE_OFF mode, it means that we think that *output* devices don't support Unicode. Since bbox doesn't support anything other than ASCII and Unicode, and since bbox ls wants to produce non-garbled output, in UNICODE_OFF mode it replaces any non-ASCII-printable bytes in filenames with '?'.
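
The effect of that replacement can be sketched as follows. `mask_non_ascii` is a hypothetical name used only for illustration, and the sketch assumes "ASCII-printable" means bytes 0x20 through 0x7e; it is not the actual ls code.

```c
/* Hypothetical helper (not busybox code): overwrite, in place, every
 * byte outside printable ASCII (0x20..0x7e) with '?'. */
static void mask_non_ascii(char *s)
{
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;

        if (c < 0x20 || c > 0x7e)
            *s = '?';
    }
}
```

Because the substitution works byte by byte, a Cyrillic letter encoded as two UTF-8 bytes comes out as "??", which is consistent with the "?" or "??" per char reported in comment 4.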
Comment 10 Denys Vlasenko 2011-05-14 13:28:04 UTC
Closing, since this is not a bug. If you disagree, please reopen and explain what the buggy behavior is.