Shell does not support UTF-8 #13

Closed
ghost opened this issue Oct 6, 2016 · 7 comments

Comments

ghost commented Oct 6, 2016

I can't input UTF-8 text (Russian) into the shell or output it from the shell.

This applies to all variants: busybox:uclibc, busybox:glibc, busybox:musl

Member

tianon commented Oct 6, 2016

Do you have some examples that don't work so we can more easily attempt to test/verify? (perhaps a short shell script which shows the failure, or specific characters you're having trouble inputting?)

I've done some very simple testing and it seems to be working, although the input looks a bit strange (the shell replaces the UTF-8 character with ? as soon as I paste it in, but it prints back out correctly):

$ docker run -it --rm busybox
/ # echo '?'

/ # echo -e "\xE2\x98\xA0"
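
To confirm that the output path passes UTF-8 through untouched, independent of what the terminal renders, the bytes can be dumped as hex. This is only a minimal sketch, assuming `od` and `tr` are available in the image. The helper name `utf8_hex` is made up for illustration, and octal escapes are used because POSIX `printf` does not guarantee `\x` escapes:

```shell
# Hypothetical helper: print everything read from stdin as one hex string.
utf8_hex() {
    od -An -tx1 | tr -d ' \n'
}

# \342\230\240 is the octal spelling of the bytes E2 98 A0 (U+2620).
printf '\342\230\240' | utf8_hex
echo    # terminate the line after the hex string
```

If the three bytes come back unchanged, the problem is limited to how the line editor echoes input, not to the data itself.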

Author

ghost commented Oct 6, 2016

Output works, sorry for that.

But input does not. When I type UTF-8 characters, the shell shows '?' marks, as you pointed out.

In vi it shows '.' marks (docker run -it --rm --entrypoint vi busybox), but text edited in vi is printed correctly.

Member

tianon commented Oct 7, 2016

Interesting -- while looking into this, I see CONFIG_UNICODE_SUPPORT exists, and it's already enabled. There's also CONFIG_FEATURE_CHECK_UNICODE_IN_ENV, which checks whether LANG is set to something ending in .utf8, but it's disabled (and when it's disabled, "Unicode support will be always enabled and active.")
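
For illustration, the kind of check that CONFIG_FEATURE_CHECK_UNICODE_IN_ENV is described as doing could look roughly like the sketch below. It only mirrors the description above (a value ending in .utf8) and is not BusyBox's actual code, which may consult other variables or accept other spellings. The function name `locale_ends_in_utf8` is hypothetical:

```shell
# Sketch of the check as described in this thread, not BusyBox's real logic.
# Returns 0 (true) when the given locale string ends in ".utf8".
locale_ends_in_utf8() {
    case "$1" in
        *.utf8) return 0 ;;
        *)      return 1 ;;
    esac
}

# Try a few sample values for LANG.
for value in "ru_RU.utf8" "C" ""; do
    if locale_ends_in_utf8 "$value"; then
        echo "LANG='$value': Unicode handling would be active"
    else
        echo "LANG='$value': Unicode handling would be off"
    fi
done
```

Since the option is disabled in this build, no such check runs and the environment should not matter here.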

Member

tianon commented Oct 7, 2016

This looks related:

CONFIG_SUBST_WCHAR:                                               

Typical values are 63 for '?' (works with any output device),     
30 for ASCII substitute control code,                             
65533 (0xfffd) for Unicode replacement character.                 

Symbol: SUBST_WCHAR [=63]                                         
Prompt: Character code to substitute unprintable characters with  
  Defined at Config.in:186                                        
  Depends on: UNICODE_SUPPORT                                     
  Location:                                                       
    -> Busybox Settings                                           
      -> General Configuration                                    
        -> Support Unicode (UNICODE_SUPPORT [=y])                 

Member

tianon commented Oct 7, 2016

And this:

CONFIG_LAST_SUPPORTED_WCHAR:                                                                        

Any character with Unicode value bigger than this is assumed                                        
to be non-printable on output device. Many applets replace                                          
such chars with substitution character.                                                             

The idea is that many valid printable Unicode chars are                                             
nevertheless are not displayed correctly. Think about                                               
combining charachers, double-wide hieroglyphs, obscure                                              
characters in dozens of ancient scripts...                                                          
Many terminals, terminal emulators, xterms etc will fail                                            
to handle them correctly. Choose the smallest value                                                 
which suits your needs.                                                                             

Typical values are:                                                                                 
126 - ASCII only                                                                                    
767 (0x2ff) - there are no combining chars in [0..767] range                                        
              (the range includes Latin 1, Latin Ext. A and B),                                     
              code is ~700 bytes smaller for this case.                                             
4351 (0x10ff) - there are no double-wide chars in [0..4351] range,                                  
              code is ~300 bytes smaller for this case.                                             
12799 (0x31ff) - nearly all non-ideographic characters are                                          
              available in [0..12799] range, including                                              
              East Asian scripts like katakana, hiragana, hangul,                                   
              bopomofo...                                                                           
0 - off, any valid printable Unicode character will be printed.                                     

Symbol: LAST_SUPPORTED_WCHAR [=767]                                                                 
Prompt: Range of supported Unicode characters                                                       
  Defined at Config.in:195                                                                          
  Depends on: UNICODE_SUPPORT                                                                       
  Location:                                                                                         
    -> Busybox Settings                                                                             
      -> General Configuration                                                                      
        -> Support Unicode (UNICODE_SUPPORT [=y])                                                   
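
Taken together with SUBST_WCHAR, this setting would seem to explain the '?' marks: the Cyrillic block is U+0400 to U+04FF (1024 to 1279 decimal), which is above the configured limit of 767. The sketch below only models the rule as the help text states it and is not BusyBox's actual code, which may apply further checks. The function name `shown_codepoint` is hypothetical:

```shell
# Sketch of the rule from the help text, not BusyBox's real implementation.
# $1 = code point (decimal), $2 = LAST_SUPPORTED_WCHAR, $3 = SUBST_WCHAR.
# Prints the code point that would end up on screen.
shown_codepoint() {
    if [ "$2" -ne 0 ] && [ "$1" -gt "$2" ]; then
        echo "$3"    # above the limit: replaced with the substitute
    else
        echo "$1"    # within the limit, or the limit is off (0)
    fi
}

shown_codepoint 1055 767 63    # U+041F (Cyrillic capital Pe) -> 63, i.e. '?'
shown_codepoint 9760 767 63    # U+2620 -> 63, i.e. '?'
shown_codepoint 233 767 63     # U+00E9 -> 233, left as is
shown_codepoint 1055 0 63      # limit off -> 1055, left as is
```

Under this reading, raising LAST_SUPPORTED_WCHAR (or setting it to 0) would be the setting to experiment with.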

Member

tianon commented Oct 7, 2016

For what it's worth, here are the values Debian uses for its busybox package:

CONFIG_UNICODE_SUPPORT=y
# CONFIG_UNICODE_USING_LOCALE is not set
CONFIG_FEATURE_CHECK_UNICODE_IN_ENV=y
CONFIG_SUBST_WCHAR=63
CONFIG_LAST_SUPPORTED_WCHAR=767
CONFIG_UNICODE_COMBINING_WCHARS=y
CONFIG_UNICODE_WIDE_WCHARS=y

Member

tianon commented Oct 7, 2016

Contrast that with Alpine's values:

CONFIG_UNICODE_SUPPORT=y
CONFIG_UNICODE_USING_LOCALE=y
# CONFIG_FEATURE_CHECK_UNICODE_IN_ENV is not set
CONFIG_SUBST_WCHAR=63
CONFIG_LAST_SUPPORTED_WCHAR=1114111
CONFIG_UNICODE_COMBINING_WCHARS=y
CONFIG_UNICODE_WIDE_WCHARS=y
# CONFIG_UNICODE_BIDI_SUPPORT is not set
# CONFIG_UNICODE_NEUTRAL_TABLE is not set
CONFIG_UNICODE_PRESERVE_BROKEN=y
