Unicode Made Easy in Visual Studio .NET 2003
A quick Unicode tutorial ~ 29 June 2008 ~ by Ryan Geiss
What is Unicode?
Try loading a filename with funky foreign characters in it into your program -
it will probably break. You might be able to get away with some of the accented
characters in the Latin alphabet, and a few other unusual symbols, but what about
a filename completely in Arabic, or Vietnamese, or Kanji? Unicode is the answer -
on Windows it is stored as a 16-bit wide character (rather than the traditional
8-bit, Latin-based 'char') that can encode characters from just about any language
on the planet. And, believe it or not, it's actually very easy to modify your
program to work with it. I did it recently, for the first time (shame on me for
waiting so long), and was very surprised at how easily the change was made.
In Visual Studio .NET 2003, writing an application in C/C++, here are the changes
I had to make in order for Unicode to work, so my application could handle strings
(filenames) in any language. Before beginning, realize that the type 'TCHAR'
maps to two different things, depending on your project settings: if Unicode is
turned on, it maps to WCHAR, which is an unsigned short; and if Unicode is
turned off, it maps to plain char. Under Unicode, a null-terminated string
is simply terminated by a 16-bit zero instead of an 8-bit zero. What's more, the
first 128 code points of Unicode match ASCII exactly (and the first 256 match
Latin-1), so when you check if a TCHAR == 'a', it's going to work, either way.
It's pleasantly surprising how interchangeable the code is, once you switch to TCHARs.
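For example, this comparison compiles and behaves identically whether Unicode
is on or off (a minimal sketch; is_lower_a is a made-up helper, just for
illustration):
#include <tchar.h>
// 'a' is 97 in ASCII and U+0061 (also 97) in Unicode, so this
// works whether TCHAR is 8 or 16 bits wide:
int is_lower_a(TCHAR c)
{
    return c == 'a';
}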
Here are the steps:
1. In your project settings (ALT+F7), go to Configuration properties,
then General, and on the right, under "Character Set", select "Unicode".
(This will cause the UNICODE and _UNICODE preprocessor symbols to be
defined for your project.)
2. replace all of your 'char' variables with 'TCHAR' (...at least, the ones that
are used for string processing - not the ones where you meant to use BYTE).
TCHAR is a type, defined in tchar.h, that maps to 'char' when _UNICODE is not
#defined, or to wchar_t when it is.
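Conceptually, that mapping boils down to something like this (a simplified
sketch, not the literal contents of tchar.h):
#ifdef _UNICODE
typedef wchar_t TCHAR; // 16 bits per character on Windows
#else
typedef char TCHAR; // plain 8-bit characters
#endif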
3. put an L prefix on, or _T() around, all string literals; i.e. change all string literals from
"my string" into L"my string" or _T("my string"). (Prefer _T() if you ever want to build
without Unicode - it applies the L prefix only when _UNICODE is defined.) This is the
most time-consuming part.
4. paste this code into your 'global' header file, the one that all of your code can see:
#include <tchar.h>
#define my_strcmpi(x,y)      _wcsicmp(x,y)
#define my_strcmp(x,y)       wcscmp(x,y)
#define my_strrchr(x,y)      wcsrchr(x,y)
#define my_strcat(x,y)       wcscat(x,y)
#define my_strcpy(x,y)       wcscpy(x,y)
// for these, be aware that 'n' is the # of CHARACTERS, not bytes!!
#define my_strlen_chars(x)   wcslen(x)
#define my_strnicmp(x,y,n)   _wcsnicmp(x,y,n)
#define my_strncat(x,y,n)    wcsncat(x,y,n)
#define my_strncpy(x,y,n)    wcsncpy(x,y,n)
#define my_sprintf           swprintf
#define my_sscanf            swscanf
#define my_fopen             _wfopen
#define my_fclose            fclose
5. go through your code and prefix all of your calls to string-manipulation
functions (in the standard C library) with "my_", to match the
remapping in the previous step. For example, strrchr() becomes my_strrchr(),
strcmpi() becomes my_strcmpi(), and so on.
6. All of your string manipulation code that used to work on char*'s will
still work just fine - because, when Unicode is enabled, a TCHAR maps to
WCHAR (when off, it maps to char), which is basically an unsigned short.
If you advance a TCHAR* by one (p++), it advances one character -
woohoo. If you check a TCHAR* to see if it's at end-of-string (*p==0),
it works - woohoo again. However, there is one place you can
trip up... strlen.
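Incidentally, a hand-rolled character count shows both of those behaviors
in action (a minimal sketch; my_tchar_count is a made-up name, not part of
the remapping above):
int my_tchar_count(const TCHAR* p) // counts CHARACTERS, not bytes
{
    int n = 0;
    while (*p++) // p++ advances one character; a zero TCHAR marks end-of-string
        n++;
    return n;
}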
7. Check everywhere you've used strlen(), and see if you were treating the return
value as a number of BYTES, or a number of CHARACTERS. I recommend rewriting
your code to treat it as the number of CHARACTERS. Keep an eye on what you
*do* with the result of strlen(). For example:
char szData[] = "This is a test string.";
char* p = szData;
p += strlen(szData);
strcpy(p, ".. and this is a second string.");
...made Unicode-friendly, using the steps above, it works just fine:
TCHAR szData[] = L"This is a test string.";
TCHAR* p = szData;
p += my_strlen_chars(szData);
my_strcpy(p, L".. and this is a second string.");
However, if we were using strlen to get the number of chars in a string,
and using that as if it were the number of BYTES in the string, THEN we'd
be in trouble. For example:
TCHAR szData[] = "This is a test string.";
TCHAR szCopy[256];
int nChars = strlen(szData); // WRONG BAD NO NO NO
char* p = szData;
memcpy(szCopy, szData, nChars+1); // OOPS - only got half the string!
The above code tries to copy a string, assuming that the number returned
by strlen() can be used as the # of bytes in the string. WRONG. It can't.
So - just double-check wherever you use strlen(), and you will probably be
just fine.
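For the record, here is the corrected copy - a minimal sketch; the key is to
scale the character count by sizeof(TCHAR) to get bytes:
TCHAR szData[] = _T("This is a test string.");
TCHAR szCopy[256];
int nChars = my_strlen_chars(szData); // # of CHARACTERS
memcpy(szCopy, szData, (nChars+1)*sizeof(TCHAR)); // bytes = (chars + null) * sizeof(TCHAR)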
--------------------------------------------------------------------
Again, the great thing about these changes is that now, if you switch your
project back from Unicode to the default (ANSI/multi-byte) character set,
voila - you are back to regular non-Unicode, and your code works fine.
No code changes at all. :) (Well, you also have to modify my list of
string-manipulation function mappings above to also remap to the non-Unicode
case... using #ifdefs, a one-time, two-minute task - see the sketch below.)
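A minimal sketch of that two-way mapping (abbreviated - extend the same
pattern to the rest of the my_* list above):
#include <tchar.h>
#ifdef _UNICODE
#define my_strcpy(x,y)       wcscpy(x,y)
#define my_strlen_chars(x)   wcslen(x)
#define my_fopen             _wfopen
#else
#define my_strcpy(x,y)       strcpy(x,y)
#define my_strlen_chars(x)   strlen(x)
#define my_fopen             fopen
#endif
// ...and so on, for the rest of the mappings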
One more final tip: for lazy/easy conversion between ordinary (char) strings
and WCHAR strings, use *sprintf with "%S" instead of "%s". In the Microsoft
C runtime, the uppercase %S tells the function that the string argument has
the *opposite* width of the function's natural width, and converts it on
the fly:
sprintf(szDest, "%S", wszSrc); // convert WCHAR string to simple string (lossy)
swprintf(wszDest, L"%S", szSrc); // convert simple string to WCHAR string
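Put together, a self-contained sketch (buffer names and sizes are arbitrary;
this uses the old VS2003-era swprintf signature, and %S is Microsoft-specific):
#include <stdio.h>
#include <wchar.h>
int main()
{
    char    szName[256] = "plain old string";
    wchar_t wszName[256];
    char    szBack[256];
    swprintf(wszName, L"%S", szName); // widen: char -> wchar_t
    sprintf(szBack, "%S", wszName);   // narrow: wchar_t -> char (lossy for non-ANSI characters)
    return 0;
}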
That's about it for now... hopefully, several billion people will thank
you for taking the time to do this!
This document copyright (c)2008+ Ryan M. Geiss.