How do C++ and g++ deal with Unicode?
I'm trying to figure out the proper way to deal with Unicode in C++. I want to understand how g++ handles literal wide character strings, and regular C strings containing Unicode characters. I have set up some basic tests, and I don't really understand what is happening.
    wstring ws1(L"«¬.txt"); // these first 2 characters correspond to 0xAB, 0xAC
    string s1("«¬.txt");

    ifstream in_file( s1.c_str() );
    // wifstream in_file( s1.c_str() ); // this throws an exception when I
                                        // call in_file >> s;
    string s;
    in_file >> s; // s now contains «¬

    wstring ws = texttowide(s);

    wcout << ws << endl; // these two lines work independently of each other,
                         // but combining them makes the second one print incorrectly
    cout << s << endl;

    printf( "%s", s.c_str() ); // same case here, these work independently of one another,
                               // but calling one after the other makes the second call
                               // print incorrectly
    wprintf( L"%s", ws.c_str() );

    wstring texttowide(string s)
    {
        mbstate_t mbstate;
        char *cc = new char[s.length() + 1];
        strcpy(cc, s.c_str());
        cc[s.length()] = 0;
        size_t numbytes = mbsrtowcs(0, (const char **)&cc, 0, &mbstate);
        wchar_t *buff = new wchar_t[numbytes + 1];
        mbsrtowcs(buff, (const char **)&cc, numbytes + 1, &mbstate);
        wstring ws = buff;
        delete [] cc;
        delete [] buff;
        return ws;
    }
It seems that calls to wcout and wprintf corrupt the stream somehow, and that it is always safe to call cout and printf as long as the strings are encoded as UTF-8.
Would the best way to handle Unicode be to convert all input to wide before processing, and convert all output to UTF-8 before sending it to the output?
The most comprehensive way to handle Unicode is to use a Unicode library such as ICU. Unicode has many more aspects than a bunch of encodings. C++ does not offer APIs to work with any of these aspects. ICU does.
If you only want to handle encodings, a workable approach is to use the built-in C++ facilities correctly. This includes calling
std::setlocale(LC_ALL, /* some system-specific locale name, e.g. */ "en_US.UTF-8")
at the beginning of the program. Also, do not use cout/printf and wcout/wprintf in the same program. (You can use regular and wide stream objects other than the standard handles in the same program.)
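The reason mixing narrow and wide output on the same handle misbehaves is that a C stream acquires an orientation (byte or wide) from the first I/O operation performed on it, and keeps it afterwards. Here is a minimal sketch that makes this visible on a temporary file rather than on stdout. The helper name `orientation_after_first_write` is made up for illustration, and the behaviour shown assumes a C library that implements `fwide` as the C standard describes (glibc does).

```cpp
#include <cstdio>
#include <cwchar>

// Hypothetical helper: writes once to a fresh temporary stream and reports
// the orientation the stream has afterwards.
// Returns > 0 for wide-oriented, < 0 for byte-oriented, and 0 if the stream
// has no orientation (or the temporary file could not be created).
int orientation_after_first_write(bool write_wide_first) {
    std::FILE* f = std::tmpfile();
    if (!f) return 0;

    if (write_wide_first)
        std::fputws(L"x", f);   // first operation is wide
    else
        std::fputs("x", f);     // first operation is narrow

    int orientation = std::fwide(f, 0);  // mode 0 only queries, never changes
    std::fclose(f);
    return orientation;
}
```

stdout behaves the same way: once printf or cout has written to it, it is byte-oriented, and later wprintf or wcout calls on it are not expected to work (and vice versa).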
Converting all input to wide and converting all output to UTF-8 is a reasonable strategy. Working with UTF-8 throughout is reasonable too. A lot depends on your application. C++11 has built-in UTF-8, UTF-16 and UTF-32 string types that simplify the task somewhat.
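To illustrate the "convert on input" side without depending on the current locale, here is a rough sketch of a hand-written UTF-8 to UTF-32 decoder that fills a C++11 `std::u32string`. The function name `utf8_to_utf32` is my own, not a standard API, and the decoder is deliberately simplified: it replaces malformed or truncated sequences with U+FFFD but does not reject overlong encodings or surrogate code points. For real work a library such as ICU is the safer choice.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-32 code points.
// Malformed or truncated sequences produce U+FFFD and decoding resumes
// at the next byte. Overlong forms and surrogates are NOT rejected.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;   // number of continuation bytes expected
        bool ok = true;

        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else                            { ok = false; }  // stray continuation or invalid lead

        if (i + extra >= in.size()) ok = false;          // sequence is truncated

        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;          // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);
        }

        if (ok) { out.push_back(cp);     i += extra + 1; }
        else    { out.push_back(0xFFFD); ++i; }
    }
    return out;
}
```

With the file name from the question written as explicit bytes, `utf8_to_utf32("\xC2\xAB\xC2\xAC.txt")` yields six code points, the first two being U+00AB and U+00AC.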
Whatever you do, don't use elements of the extended character set in ordinary string literals. (In C++11 it's OK to use them in UTF-8/16/32 string literals.)
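As a small sketch of the C++11 literal prefixes, the functions below spell the question's file name with universal character names (`\u00AB`, `\u00AC`) and report how many code units each encoding needs. The function names are invented for this example. `sizeof` is used for the UTF-8 case so that the code compiles both before and after C++20, where the element type of a `u8` literal changed from `char` to `char8_t`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers: code-unit counts for "«¬.txt" in each encoding.

// UTF-8: each of « and ¬ takes two bytes, so 2 + 2 + 4 = 8 code units.
std::size_t utf8_units() {
    return sizeof(u8"\u00AB\u00AC.txt") - 1;  // minus the terminating null
}

// UTF-16: both characters are in the BMP, one code unit each.
std::size_t utf16_units() {
    return std::u16string(u"\u00AB\u00AC.txt").size();
}

// UTF-32: always one code unit per code point.
std::size_t utf32_units() {
    return std::u32string(U"\u00AB\u00AC.txt").size();
}
```

Because the encoding of these literals is fixed by the prefix, the result does not depend on the source file encoding or on the execution character set chosen by the compiler.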