Problems with Unicode files and Chinese characters

I was working on an XML parser that we had written some time back. We used Microsoft's SAX (Simple API for XML) implementation for parsing the XML. Here is a very useful and elaborate SAX tutorial. All was working fine until a few XML files with Chinese characters showed up. Well, basically the program revolved around:

–> Parse the input XML

–> Do something with the parsed data

–> Create an output XML

The output XML, of course, depended on the data we processed in step 2. The problem was that when the input XML contained Chinese characters, our output XML would contain boxes! And this immediately reminded me of this post by Joel Spolsky. I checked the code and found what was wrong. I was reading the data into WCHAR from the input XML, and while writing the data I converted it to a multi-byte string using wcstombs, which obviously was incorrect. wcstombs can only produce characters that the current locale's multi-byte encoding can represent, so when the API tried to convert the Chinese characters it went nuts! So I went ahead and changed the code. The changed code looked something like this:

WCHAR buffer[MAX_BUFF_SIZE] = L"";
WCHAR temp;
DWORD bytesWritten = 0, bytesRead = 0;
int counter = 0;

do
{
    //Read one WCHAR from the file and store it in buffer
    if(ReadFile(hInputFile, &temp, sizeof(temp), &bytesRead, NULL) && bytesRead == sizeof(temp))
    {
        buffer[counter++] = temp;
        if(temp == L'\n')
        {
            //got a line, do something with it
            ..
            ..
            WriteFile(hOutputFile, buffer, counter, &bytesWritten, NULL);
            //reset the counter for the next line
            counter = 0;
        }
    }
}while (bytesRead > 0);

After this I saw that a lot of things were missing in the output file! You can see the problem at first glance, can't you? The variable counter keeps track of the number of WCHAR characters, each of which is 2 bytes wide. But the third parameter of WriteFile() specifies the number of bytes to write, so passing counter meant that only half the data was getting written. The WriteFile call should really have been:

WriteFile(hOutputFile, buffer, counter*sizeof(buffer[0]), &bytesWritten, NULL);
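To make the character-count versus byte-count distinction concrete, here is a minimal sketch in portable standard C rather than Win32. The names utf16_unit, utf16_byte_count and write_units are made up for illustration, and fwrite stands in for WriteFile; the only point is that the length passed to a byte-oriented write call has to be the element count multiplied by the element size.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the Win32 WCHAR type: one 16-bit UTF-16 code unit. */
typedef uint16_t utf16_unit;

/* Number of bytes occupied by `count` 16-bit code units. */
size_t utf16_byte_count(size_t count)
{
    return count * sizeof(utf16_unit);
}

/* Writes `count` code units from `buffer` to `out`.
   fwrite is called with an element size of 1, so the length argument is a
   byte count, just like the third parameter of WriteFile.
   Returns the number of bytes actually written. */
size_t write_units(FILE *out, const utf16_unit *buffer, size_t count)
{
    return fwrite(buffer, 1, utf16_byte_count(count), out);
}
```

For a three-character line, utf16_byte_count(3) is 6, so write_units asks for six bytes. Passing the bare count of 3 instead would write only the first half of the line, which is the truncation described above.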

With the byte count corrected, the missing data came back. But there was still a problem: I was still getting boxes. What else could be wrong now? So I binged around a little and found out that a UTF-16 little-endian file, which is what Windows tools usually mean by "Unicode", is expected to start with the two bytes 0xFF 0xFE, the byte order mark. Joel mentions this in his post on encoding schemes. Without the 0xFF 0xFE, the text editor assumed it was a normal text file encoded in ANSI or something similar, so it processed every byte of the data as a separate ANSI character, and that turned all the zero bytes into boxes. The 0xFF 0xFE tells editors to treat the file as Unicode and interpret the data two bytes at a time. I wrote the two bytes into the output file before writing anything else into it and it worked. The Chinese characters showed up correctly in the output file.
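The byte layout described above can be sketched as follows. This is an illustration in portable standard C, not the code from the actual parser: the function names encode_utf16le_with_bom and has_utf16le_bom are hypothetical, and the output goes to a memory buffer that could then be handed to WriteFile in one call. The code units are serialised byte by byte, so the result is little-endian regardless of the machine it runs on.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* UTF-16 little-endian byte order mark, in the order it appears on disk. */
static const unsigned char UTF16LE_BOM[2] = {0xFF, 0xFE};

/* Writes the BOM followed by `count` code units, low byte first, into `out`.
   Returns the number of bytes produced, or 0 if `out_size` is too small. */
size_t encode_utf16le_with_bom(const uint16_t *units, size_t count,
                               unsigned char *out, size_t out_size)
{
    size_t needed = sizeof(UTF16LE_BOM) + count * 2;
    size_t i;

    if (out_size < needed)
        return 0;

    memcpy(out, UTF16LE_BOM, sizeof(UTF16LE_BOM));
    for (i = 0; i < count; i++) {
        out[2 + 2 * i] = (unsigned char)(units[i] & 0xFF); /* low byte  */
        out[3 + 2 * i] = (unsigned char)(units[i] >> 8);   /* high byte */
    }
    return needed;
}

/* Returns 1 if `data` starts with the UTF-16LE BOM, 0 otherwise.
   This is roughly the check an editor can use to pick the encoding. */
int has_utf16le_bom(const unsigned char *data, size_t size)
{
    return size >= 2 && data[0] == 0xFF && data[1] == 0xFE;
}
```

For the character U+4E2D followed by a newline, the encoded bytes are FF FE 2D 4E 0A 00. On Windows, which is little-endian, writing the single WCHAR value 0xFEFF at the start of the file produces the same two leading bytes.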
