Unicode Made Easy in Visual Studio .NET 2003
A quick Unicode tutorial ~ 29 June 2008 ~ by Ryan Geiss
What is Unicode?
Try loading a filename with funky foreign characters in it into your program -
it will probably break. You might be able to get away with some of the accented
characters in the Latin alphabet, and a few other unusual symbols, but what about
a filename completely in Arabic, or Vietnamese, or Kanji? Unicode is the answer -
on Windows it is stored as a 16-bit wide character (rather than the traditional
8-bit, Latin-based 'char') that can encode characters from just about any language
on the planet. And, believe it or not, it's actually very easy to modify your
program to work with it. I did it recently, for the first time (shame on me for
waiting so long), and was very surprised at how easily the change was made.
In Visual Studio .NET 2003, writing an application in C/C++, here are the changes
I had to make in order for Unicode to work, so my application could handle strings
(filenames) in any language. Before beginning, realize that the type 'TCHAR'
maps to two different things, depending on your project settings: if Unicode is
turned on, it maps to WCHAR, which is an unsigned short; and if Unicode is
turned off, it maps to plain char. Under Unicode, a null-terminated string
is simply terminated by a 16-bit zero instead of an 8-bit zero. What's more, the
first 128 code points of Unicode match ASCII exactly (and the first 256 match
Latin-1), so when you check if a TCHAR == 'a', it's going to work, either way.
It's pleasantly surprising how interchangeable the code is, once you switch to TCHARs.
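For example, this comparison compiles and behaves identically whether Unicode
is on or off (a minimal sketch; is_lower_a is a made-up helper, just for
illustration):
#include <tchar.h>
// 'a' is 97 in ASCII and U+0061 (also 97) in Unicode, so this
// works whether TCHAR is 8 or 16 bits wide:
int is_lower_a(TCHAR c)
{
    return c == 'a';
}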
Here are the steps:
1. In your project settings (ALT+F7), go to Configuration properties,
then General, and on the right, under "Character Set", select "Unicode".
(This will cause the UNICODE and _UNICODE preprocessor symbols to be
defined for your project.)
2. replace all of your 'char' variables with 'TCHAR' (...at least, the ones that
are used for string processing - not the ones where you meant to use BYTE).
TCHAR is a type, defined in tchar.h, that maps to 'char' when _UNICODE is not
#defined, or to wchar_t when it is.
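Conceptually, that mapping boils down to something like this (a simplified
sketch, not the literal contents of tchar.h):
#ifdef _UNICODE
typedef wchar_t TCHAR; // 16 bits per character on Windows
#else
typedef char TCHAR; // plain 8-bit characters
#endif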
3. put an L prefix on, or _T() around, all string literals; i.e. change all string literals from
"my string" into L"my string" or _T("my string"). (Prefer _T() if you ever want to build
without Unicode - it applies the L prefix only when _UNICODE is defined.) This is the
most time-consuming part.
4. paste this code into your 'global' header file, the one that all of your code can see:
#include <tchar.h>
#define my_strcmpi(x,y)      _wcsicmp(x,y)
#define my_strcmp(x,y)       wcscmp(x,y)
#define my_strrchr(x,y)      wcsrchr(x,y)
#define my_strcat(x,y)       wcscat(x,y)
#define my_strcpy(x,y)       wcscpy(x,y)
// for these, be aware that 'n' is the # of CHARACTERS, not bytes!!
#define my_strlen_chars(x)   wcslen(x)
#define my_strnicmp(x,y,n)   _wcsnicmp(x,y,n)
#define my_strncat(x,y,n)    wcsncat(x,y,n)
#define my_strncpy(x,y,n)    wcsncpy(x,y,n)
#define my_sprintf           swprintf
#define my_sscanf            swscanf
#define my_fopen             _wfopen
#define my_fclose            fclose
5. go through your code and prefix all of your calls to string-manipulation
functions (in the standard C library) with "my_", to match the
remapping in the previous step. For example, strrchr() becomes my_strrchr(),
strcmpi() becomes my_strcmpi(), and so on.
6. All of your string manipulation code that used to work on char*'s will
still work just fine - because, when Unicode is enabled, a TCHAR maps to
WCHAR (when off, it maps to char), which is basically an unsigned short.
If you advance a TCHAR* by one (p++), it advances one character -
woohoo. If you check a TCHAR* to see if it's at end-of-string (*p==0),
it works - woohoo again. However, there is one place you can
trip up... strlen.
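Incidentally, a hand-rolled character count shows both of those behaviors
in action (a minimal sketch; my_tchar_count is a made-up name, not part of
the remapping above):
int my_tchar_count(const TCHAR* p) // counts CHARACTERS, not bytes
{
    int n = 0;
    while (*p++) // p++ advances one character; a zero TCHAR marks end-of-string
        n++;
    return n;
}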
7. Check everywhere you've used strlen(), and see if you were treating the return
value as a number of BYTES, or a number of CHARACTERS. I recommend rewriting
your code to treat it as the number of CHARACTERS. Keep an eye on what you
*do* with the result of strlen(). For example:
char szData[] = "This is a test string.";
char* p = szData;
p += strlen(szData);
strcpy(p, ".. and this is a second string.");
...made Unicode-friendly, using the steps above, it works just fine:
TCHAR szData[] = L"This is a test string.";
TCHAR* p = szData;
p += my_strlen_chars(szData);
my_strcpy(p, L".. and this is a second string.");
However, if we were using strlen to get the number of chars in a string,
and using that as if it were the number of BYTES in the string, THEN we'd
be in trouble. For example:
TCHAR szData[] = "This is a test string.";
TCHAR szCopy[256];
int nChars = strlen(szData); // WRONG BAD NO NO NO
char* p = szData;
memcpy(szCopy, szData, nChars+1); // OOPS - only got half the string!
The above code tries to copy a string, assuming that the number returned
by strlen() can be used as the # of bytes in the string. WRONG. It can't.
So - just double-check wherever you use strlen(), and you will probably be
just fine.
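For the record, here is the corrected copy - a minimal sketch; the key is to
scale the character count by sizeof(TCHAR) to get bytes:
TCHAR szData[] = _T("This is a test string.");
TCHAR szCopy[256];
int nChars = my_strlen_chars(szData); // # of CHARACTERS
memcpy(szCopy, szData, (nChars+1)*sizeof(TCHAR)); // bytes = (chars + null) * sizeof(TCHAR)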
--------------------------------------------------------------------
Again, the great thing about these changes is that now, if you switch your
project back from Unicode to the default (ANSI/multi-byte) character set,
voila - you are back to regular non-Unicode, and your code works fine.
No code changes at all. :) (Well, you also have to modify my list of
string-manipulation function mappings above to also remap to the non-Unicode
case... using #ifdefs, a one-time, two-minute task - see the sketch below.)
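A minimal sketch of that two-way mapping (abbreviated - extend the same
pattern to the rest of the my_* list above):
#include <tchar.h>
#ifdef _UNICODE
#define my_strcpy(x,y)       wcscpy(x,y)
#define my_strlen_chars(x)   wcslen(x)
#define my_fopen             _wfopen
#else
#define my_strcpy(x,y)       strcpy(x,y)
#define my_strlen_chars(x)   strlen(x)
#define my_fopen             fopen
#endif
// ...and so on, for the rest of the mappings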
One more final tip: for lazy/easy conversion between ordinary (char) strings
and WCHAR strings, use *sprintf with "%S" instead of "%s". In the Microsoft
C runtime, the uppercase %S tells the function that the string argument has
the *opposite* width of the function's natural width, and converts it on
the fly:
sprintf(szDest, "%S", wszSrc); // convert WCHAR string to simple string (lossy)
swprintf(wszDest, L"%S", szSrc); // convert simple string to WCHAR string
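Put together, a self-contained sketch (buffer names and sizes are arbitrary;
this uses the old VS2003-era swprintf signature, and %S is Microsoft-specific):
#include <stdio.h>
#include <wchar.h>
int main()
{
    char    szName[256] = "plain old string";
    wchar_t wszName[256];
    char    szBack[256];
    swprintf(wszName, L"%S", szName); // widen: char -> wchar_t
    sprintf(szBack, "%S", wszName);   // narrow: wchar_t -> char (lossy for non-ANSI characters)
    return 0;
}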
That's about it for now... hopefully, several billion people will thank
you for taking the time to do this!
This document copyright (c)2008+ Ryan M. Geiss.