{"id":281,"date":"2009-06-12T23:31:05","date_gmt":"2009-06-12T18:01:05","guid":{"rendered":"http:\/\/techtwaddle.net\/?p=281"},"modified":"2011-04-12T23:31:22","modified_gmt":"2011-04-12T18:01:22","slug":"problems-with-unicode-files-and-chinese-characters","status":"publish","type":"post","link":"https:\/\/techtwaddle.co.in\/blog\/2009\/06\/12\/problems-with-unicode-files-and-chinese-characters\/","title":{"rendered":"Problems with UNICODE files and chinese characters"},"content":{"rendered":"<div style=\"text-align: justify;\"><span style=\"font-family: Comic Sans MS;\">I was working on an XML parser that we had written some time back. We used Microsoft&#8217;s SAX (Simple API for XML) for parsing the XML. Here is a very useful and elaborate <\/span><a href=\"http:\/\/www.codeguru.com\/cpp\/data\/mfc_database\/xml\/article.php\/c11517__1\/\" style=\"font-family: Comic Sans MS;\">SAX tutorial<\/a><span style=\"font-family: Comic Sans MS;\">. All was working fine until a few XML files with Chinese characters showed up. Well, basically the program revolved around:<\/span><br style=\"font-family: Comic Sans MS;\" \/><br \/>\n<br style=\"font-family: Comic Sans MS;\" \/><br \/>\n<span style=\"font-family: Comic Sans MS;\">&#8211;&gt; parse the input XML<\/span><br style=\"font-family: Comic Sans MS;\" \/><br \/>\n<span style=\"font-family: Comic Sans MS;\">&#8211;&gt; do something with the parsed data<\/span><br style=\"font-family: Comic Sans MS;\" \/><br \/>\n<span style=\"font-family: Comic Sans MS;\">&#8211;&gt; and create an output XML<\/span><br style=\"font-family: Comic Sans MS;\" \/><br \/>\n<br style=\"font-family: Comic Sans MS;\" \/><br \/>\n<span style=\"font-family: Comic Sans MS;\">The output XML, of course, depended on the data we parsed in step 2. The problem was that when the input XML contained Chinese characters, our output XML would contain boxes! 
And this immediately reminded me of this <\/span><a href=\"http:\/\/www.joelonsoftware.com\/articles\/Unicode.html\" style=\"font-family: Comic Sans MS;\">post by Joel Spolsky<\/a><span style=\"font-family: Comic Sans MS;\">. I checked the code and found what was wrong. I was reading the data from the input XML into WCHAR, and while writing it out I converted it to a multi-byte string using <\/span><a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/5d7tc9zw(VS.80).aspx\" style=\"font-family: Comic Sans MS;\">wcstombs<\/a><span style=\"font-family: Comic Sans MS;\">, which obviously was incorrect. When the API tried to convert the Chinese characters into multi-byte it went nuts! So I went ahead and changed the code. The changed code looked something like this:<\/span><br style=\"font-family: Comic Sans MS;\" \/><br \/>\n<br style=\"font-family: Comic Sans MS;\" \/><br \/>\n<span style=\"color: rgb(0, 0, 128); font-family: Verdana;\">WCHAR buffer[MAX_BUFF_SIZE] = L&quot;&quot;;<\/span><br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<span style=\"color: rgb(0, 0, 128); font-family: Verdana;\">WCHAR temp;<\/span><br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<span style=\"color: rgb(0, 0, 128); font-family: Verdana;\">DWORD bytesWritten = 0, bytesRead = 0;<\/span><br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<span style=\"color: rgb(0, 0, 128); font-family: Verdana;\">int counter = 0;<\/span><br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<span style=\"color: rgb(0, 0, 128); font-family: Verdana;\">do<\/span><br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<span style=\"color: rgb(0, 0, 128); font-family: Verdana;\">{<\/span><br \/>\n<span style=\"font-family: Verdana; color: rgb(0, 0, 
128);\">&nbsp;&nbsp;&nbsp; \/\/Read one WCHAR from the file and store it in buffer<\/span><br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<span style=\"color: rgb(0, 0, 128); font-family: Verdana;\">&nbsp;&nbsp;&nbsp; if(ReadFile(hInputFile, &amp;temp, sizeof(temp), &amp;bytesRead, NULL) &amp;&amp; bytesRead &gt; 0)<\/span><br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<span style=\"color: rgb(0, 0, 128); font-family: Verdana;\">&nbsp;&nbsp;&nbsp; {<\/span><br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<span style=\"color: rgb(0, 0, 128); font-family: Verdana;\">&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; buffer[counter++] = temp;<\/span><br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<span style=\"color: rgb(0, 0, 128); font-family: Verdana;\">&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; if(temp == L'\\n')<\/span><br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<span style=\"color: rgb(0, 0, 128); font-family: Verdana;\">&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; {<\/span><br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<span style=\"color: rgb(0, 0, 128); font-family: Verdana;\">&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; \/\/got a line, do something with it<\/span><br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<span style=\"color: rgb(0, 0, 128); font-family: Verdana;\">&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; ..<\/span><br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<span style=\"color: rgb(0, 0, 128); font-family: Verdana;\">&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; ..<\/span><br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<span style=\"color: rgb(0, 0, 128); font-family: Verdana;\">&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; WriteFile(hOutputFile, buffer, counter, 
&amp;bytesWritten, NULL);<\/span><br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<br \/>\n<span style=\"font-family: Verdana; color: rgb(0, 0, 128);\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; \/\/reset the counter for the next line<\/span><br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<span style=\"color: rgb(0, 0, 128); font-family: Verdana;\">&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; counter = 0;<\/span><br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<span style=\"color: rgb(0, 0, 128); font-family: Verdana;\">&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<\/span><br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<span style=\"color: rgb(0, 0, 128); font-family: Verdana;\">&nbsp;&nbsp;&nbsp; }<\/span><br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<br style=\"color: rgb(0, 0, 128); font-family: Verdana;\" \/><br \/>\n<span style=\"color: rgb(0, 0, 128); font-family: Verdana;\">}while (bytesRead &gt; 0);<\/span><br style=\"font-family: Comic Sans MS;\" \/><br \/>\n<br style=\"font-family: Comic Sans MS;\" \/><br \/>\n<span style=\"font-family: Comic Sans MS;\">After this I saw that a lot of things were missing in the output file! You can see the problem at first glance, can&#8217;t you? The variable <span style=\"font-family: Verdana; color: rgb(0, 0, 128);\">counter <\/span>keeps track of the number of <span style=\"font-family: Verdana; color: rgb(0, 0, 128);\">WCHAR <\/span>characters, each of which is 2 bytes wide. In the call to <span style=\"font-family: Verdana; color: rgb(0, 0, 128);\">WriteFile()<\/span>, the third parameter specifies the number of bytes to write, and I was passing <span style=\"font-family: Verdana; color: rgb(0, 0, 128);\">counter<\/span>, so only half the data was getting written. 
The write file call should really have been:<\/span><br style=\"font-family: Comic Sans MS;\" \/><br \/>\n<br style=\"font-family: Comic Sans MS;\" \/><br \/>\n<span style=\"color: rgb(0, 0, 128); font-family: Verdana;\">WriteFile(hOutputFile, buffer, counter*sizeof(buffer[0]), &amp;bytesWritten, NULL);<\/span><br style=\"font-family: Verdana;\" \/><br \/>\n<br style=\"font-family: Comic Sans MS;\" \/><br \/>\n<span style=\"font-family: Comic Sans MS;\">That fixed the missing data. But there was still a problem: I was still getting boxes. What else could be wrong now! So I binged around a little and found out that a UTF-16 little-endian file (what Windows calls a UNICODE file) should start with the two bytes <span style=\"font-family: Verdana;\">0xFF 0xFE<\/span>, the byte order mark. Joel mentions this in his <a href=\"http:\/\/www.joelonsoftware.com\/articles\/Unicode.html\">post<\/a> on encoding schemes. Without the 0xFF 0xFE, the text editor assumed it was a normal text file encoded in ANSI or something, so it processed every byte of the data as an ANSI character, and that turned all the zero bytes into boxes. The 0xFF 0xFE tells editors to treat the file as UNICODE, so they interpret the data two bytes at a time. I wrote the two bytes into the output file before writing anything else into it and it worked. The Chinese characters showed correctly in the output file.<\/span><\/div>\n","protected":false},"excerpt":{"rendered":"<p>I was working on an XML parser that we had written some time back. We used Microsoft&#8217;s SAX (Simple API for XML) for parsing the XML. Here is a very useful and elaborate SAX tutorial. All was working fine until a few XML files with Chinese characters showed up. 
Well, basically the program revolved around: &hellip; <a href=\"https:\/\/techtwaddle.co.in\/blog\/2009\/06\/12\/problems-with-unicode-files-and-chinese-characters\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Problems with UNICODE files and chinese characters<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","jetpack_publicize_message":"","jetpack_is_tweetstorm":false},"categories":[1],"tags":[],"jetpack_featured_media_url":"","jetpack_publicize_connections":[],"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p1ktFF-4x","_links":{"self":[{"href":"https:\/\/techtwaddle.co.in\/blog\/wp-json\/wp\/v2\/posts\/281"}],"collection":[{"href":"https:\/\/techtwaddle.co.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtwaddle.co.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtwaddle.co.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/techtwaddle.co.in\/blog\/wp-json\/wp\/v2\/comments?post=281"}],"version-history":[{"count":1,"href":"https:\/\/techtwaddle.co.in\/blog\/wp-json\/wp\/v2\/posts\/281\/revisions"}],"predecessor-version":[{"id":282,"href":"https:\/\/techtwaddle.co.in\/blog\/wp-json\/wp\/v2\/posts\/281\/revisions\/282"}],"wp:attachment":[{"href":"https:\/\/techtwaddle.co.in\/blog\/wp-json\/wp\/v2\/media?parent=281"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtwaddle.co.in\/blog\/wp-json\/wp\/v2\/categories?post=281"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtwaddle.co.in\/blog\/wp-json\/wp\/v2\/tags?post=281"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}