2023-10-07

Accessing individual characters (wchar_t) in a wstring

I am reading text from a file that contains Unicode characters and storing it in a wstring. I want to iterate over the wstring to determine which characters need more than one byte of storage.

My issue is that str.length() (where str is a wstring) seems to report the number of bytes in the string instead of the number of characters. Also, when I iterate over the string using str[i], the bracket operator seems to return only 1 byte at a time.

Here is some example code to replicate my issue:

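#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main() {
  wifstream inFile;
  inFile.open(L"myFile.txt");

  wstring str;
  getline(inFile, str);

  // Print the element count, then each element with its integer value.
  wcout << str.length() << endl;
  for (unsigned int i = 0; i < str.length(); i++) {
    wcout << str[i] << L" (" << (unsigned int)str[i] << L')' << endl;
  }

  // Write the same wstring back out to a second file.
  wofstream outFile;
  outFile.open(L"outFile.txt");
  outFile << str << endl;

  outFile.close();
  inFile.close();
}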

Output of code:

5
H (72)
├ (195)
í (161)
l (108)
o (111)

I tried with a file that contains the string "Hálo". str.length() reports 5, which appears to be the minimum number of bytes needed to store the string (one byte for every character except the á, which takes two). This confuses me because sizeof(wchar_t) is 2 in my environment, so I figured an array of 4 characters within the wstring would require at least 8 bytes. Yet it seems "Hálo" is being stored as 01001000 {11000011 10100001} 01101100 01101111 (curly brackets mark the Unicode character). So as I iterate over this, every element comes back as if it were just a char, and the Unicode character á comes back as 2 characters, ├í.
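To make the numbers above concrete, here is a minimal sketch of what I suspect is happening. It assumes the file is UTF-8 and that the stream widens every byte to its own wchar_t without any multi-byte decoding. The helper widen_bytewise and the constant kHaloUtf8 are hypothetical names of my own, not part of any library.

```cpp
#include <string>

// "Hálo" as UTF-8 bytes: á is the two-byte sequence 0xC3 0xA1 (assumed file encoding).
const std::string kHaloUtf8 = "H\xC3\xA1lo";

// Hypothetical helper: widen each byte to its own wchar_t,
// with no multi-byte decoding at all.
std::wstring widen_bytewise(const std::string& bytes) {
    std::wstring out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        out.push_back(static_cast<wchar_t>(b));
    }
    return out;
}
```

Under those assumptions, widen_bytewise(kHaloUtf8) has length 5 and its elements are 72, 195, 161, 108, 111, the same integers my loop prints. Each element still occupies sizeof(wchar_t) bytes in memory, so the wstring would not really be packed into 5 bytes; it would just hold 5 elements.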

Strangely enough, when I write the wstring to a file (as in the code above), the text comes out as expected, with the Unicode character properly interpreted.
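A possible explanation for the file looking correct, sketched under the assumption that the output stream narrows each wchar_t back to a single byte: if every element holds a value between 0 and 255, the bytes written would be exactly the bytes that were read. The helper narrow_bytewise is a hypothetical name used only for illustration.

```cpp
#include <string>

// Hypothetical helper: narrow each wchar_t to one byte.
// Assumes every element holds a value in the range 0..255.
std::string narrow_bytewise(const std::wstring& units) {
    std::string out;
    out.reserve(units.size());
    for (wchar_t u : units) {
        out.push_back(static_cast<char>(static_cast<unsigned char>(u)));
    }
    return out;
}
```

Feeding it the five values 72, 195, 161, 108, 111 gives back the byte sequence 48 C3 A1 6C 6F, which a UTF-8 aware editor would display as "Hálo".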

Is there a way to iterate over the actual characters within the wstring instead of just the bytes? Also, why is the wstring storing it in just 5 bytes instead of 8? I suppose it saves space but it makes accessing the elements seem unintuitive.
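For reference, this is roughly the kind of per-character iteration I am after, written as a minimal sketch that decodes the bytes by hand. It assumes the input is well-formed UTF-8 and does no validation; decode_utf8 is a hypothetical helper of my own, not a standard library function.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-8 byte string into Unicode code points.
// Sketch only: assumes well-formed input and performs no validation.
std::vector<char32_t> decode_utf8(const std::string& bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t len;
        if (lead < 0x80) {                 // 0xxxxxxx: single byte
            cp = lead;
            len = 1;
        } else if ((lead >> 5) == 0x6) {   // 110xxxxx: two-byte sequence
            cp = lead & 0x1F;
            len = 2;
        } else if ((lead >> 4) == 0xE) {   // 1110xxxx: three-byte sequence
            cp = lead & 0x0F;
            len = 3;
        } else {                           // treated as 11110xxx: four-byte sequence
            cp = lead & 0x07;
            len = 4;
        }
        // Fold in the low 6 bits of each continuation byte (10xxxxxx).
        for (std::size_t k = 1; k < len && i + k < bytes.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For the bytes of "Hálo" (48 C3 A1 6C 6F) this yields four code points, 72, 225, 108, 111. Any code point above 127 is one that needed more than one byte in the file.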

EDIT: I understand that my terminal may not be able to display a wchar_t properly, though I would still hope to print the integer value of it.


