Monday, July 15, 2013

Strange characters in your Web apps? It could be Oracle's Unicode handling.

Do you have strange characters in your Web app? Like a reversed question mark? Or characters like these: ⁿ‼↨♫☼◙ Sometimes it happens where you should see an apostrophe or a quote character like : Daniel◙s Bar-B-Q .

Here's one reason why: Oracle and Unicode. My scenario went like this:
  1. A user has a Document with some Unicode characters on it like the fancy "smart quote" characters (a Unicode u2019 for apostrophe)
  2. The user does a copy & paste to put the text into the Web app's comments field (a Java String)
  3. The web app saves the String into an Oracle VARCHAR2(4000) column called comments
  4. The Oracle instance has a popular character setting: US-ASCII-7, which does not support Unicode characters
  5. Oracle converts the character by truncating the leftmost bits, rendering an unprintable character ASCII(19) into the comments column
  6. Later, the Webapp reads the Oracle column, displays it to the browser, and displays the unprintable ASCII(19) character. as something Strange like ♫ or ◙.

Truncate Bits!!! What the...? WHY??
Yeah, that's right. Oracle truncated the bits, losing the original character - silently killing those Unicode characters without so much as a "by your leave" . Isn't "character assassination" illegal? How could Oracle DO such a thing?! The answer is not a simple one. First you need to understand Unicode character sets, and then realize that Oracle has tons of options and support for different character sets. But we can imagine what Oracle designers were thinking when converting a big 16-bit character into a tiny 7-bit space:
  • Throw an error. This would rollback transactions and force developers to handle or prevent the errors - pbbbt... what else you got?
  • Remove the Unicode character(s). That would change the size of the value stored.
  • Convert the character to something else, quietly. This is actually what happens, but it has it's own pitfalls: silent but deadly... pitfalls
Oracle can actually do some impressive up-converts : like from Windows-1252 to UTF-8. But unfortunately, the down-convert smashing of "smart quote" chars into ASCII uses the most brain-dead algorithm - truncating the left most bits - and u2019 becomes x19, which is unprintable. And if you try to convert that to XML, x19 is not even a valid XML character and the parser crashes.

Why Unicode? What are Smart Quotes?
So how did these Unicode characters get into my App anyhow? "My app only uses English, it's a simple app", you might be saying. The answer is that popular apps like Microsoft Office will automatically change your " or ' characters into more fancy, curly quote characters as you type. Nice, eh? Why not make your text look better? Who wouldn't want that? So now millions of MS Word documents with Unicode chars are floating around the planet, even your users are typing them, and they are copy & pasting those strings into your app. So THAT's where the Unicode came from for my scenario above.
And truly this should be OK, because tons of systems support Unicode already, Java and other languages use Unicode for their Strings internally already. And lots of apps like your Web browser support Unicode too. Oracle supports Unicode as well, but you do have to be aware at installation time. Oracle's US7ASCII setting is very popular, including on my Oracle instance, but I wouldn't recommend it. In today's world, you need Unicode, UTF-8 is well supported, and would have helped me here.

Now What? How do I Fix It?
You have a few options:

1. Long-term, switch to UTF-8. UTF-8 is well-supported and it's a super-set of ASCII, so all of the old ASCII text should convert in all of your VARCHAR2 columns. For almost all characters, the size is still just 8-bits, but can be larger for smart-quotes and others. However, converting an entire Oracle instance is nuclear - if you have a large database, this can take time, and the conversion your DBA recommends may be akin to backing up everything, and re-importing all of your data: hours or days of effort.

2. Use NVARCHAR2. You might consider the Oracle NVARCHAR2 column type just for your large comment-style fields that need free-form text. This will only require an ALTER TABLE on that column. But you will have to remember this for new Tables with similar needs.

3. Use a BLOB. If you store the field in a BLOB column, Oracle will not do any conversion. For columns where you don't care about indexes and search, like a comments section on your app, this should be OK. In Java, for example, you may not have to change any code at all. This is because you can still directly use this column as a String, i.e. JDBC Statement.setString() and getString() methods work fine without the messy JDBC BLOB handling. Or, JPA's @Lob annotation can also be placed on a String for minimum coding effort. As an added benefit, your users can paste the entire contents of "War and Peace" without having a size limit of 4000 in VARCHAR2. Note: Oracle's CLOB column type will still do character conversion, so it will not help here.

4. Filter the characters. In the short term I ended up using a regex to filter out and remove any unprintable characters. Unfortunately this lost the character, turning "Dan's Bar-B-Q" into "Dans Bar-B-Q". Users were not too happy with that, but it did stop the bleeding until I can implement one of the above solutions.

Conclusion, Recommendation
If you have strange characters appearing in your application, check the character set in your database engine. Oracle, SQL Server, Postgres, MySQL - they can all support a wide range of settings. Also, think about the design of your fields, do you really want VARCHAR2(4000) for a comments field? Maybe BLOB would be better?
  • Use UTF-8 in your database engine
  • Use a BLOB (careful, not CLOB)for a comments column. You'll get a big size, and still use a simple String in memory
  • Make a separate table for all of the comment fields, you probably have many comment fields in your app:

Other Solutions
Windows-1252. This Microsoft-created character set looks attractive because it is still 8-bit, and it includes the curvy smart-quote characters and many other useful punctuation characters. Oracle supports it too, calling it WE8MSWIN1252 in their settings. However, this is really just a stop-gap, a hack, until good Unicode support arrived. So I cannot recommend this unless you are desperate. Technically it would work, but now that UTF-8 support is so good everywhere, and with disk space at pennies per GB, there's no need for it, and IMHO, Windows-1252 character set support should start to fade.


Anonymous said...

I ran into this and this nicely summarized my issue. Thank you!

Anonymous said...

Wonderful write, thanks

Anonymous said...

wonderful article