UTF-8/Unicode
- http://en.wikipedia.org/wiki/Unicode (‘Unicode is a computing industry standard for the consistent representation and handling of text expressed in most of the world’s writing systems. Developed in conjunction with the Universal Character Set standard and published in book form as The Unicode Standard, the latest version of Unicode consists of a repertoire of more than 107,000 characters covering 90 scripts, a set of code charts for visual reference, an encoding methodology and set of standard character encodings, an enumeration of character properties such as upper and lower case, a set of reference data computer files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering, and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).’)
- http://www.unicode.org/faq/unicode_web.html
- http://jimmyg.org/work/code/stringconvert/index.html (‘Any application you are working with should deal with Unicode strings internally. You should never work with ordinary Python strings because as soon as someone enters a non-ASCII character in your application it is likely to break in an unpredictable way because ordinary 8-bit Python strings can’t handle these characters. Best practice is to always decode strings to Unicode objects from whatever encoding they are in (often UTF-8) as soon as they enter your application. You then work with Unicode throughout your application and then encode the Unicode back to whatever is needed (again often UTF-8) as the string leaves your application’)
- http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode (‘Quite a few software professionals have learned that they need to worry about internationalizing software, and some of those have learned how to go about doing it. For those getting started, herewith a brief introduction to Unicode, the one technology that you have to get comfortable with if you’re going to do a good job as a software citizen of the world.’)
- http://pylonsbook.com/en/1.1/unicode.html (‘If you’ve ever come across text in a foreign language that contains lots of question mark characters in unexpected positions or if you’ve written Python code that causes an exception such as the following one to be raised, then chances are you have run into a problem with character sets, encodings, and Unicode…Encoding Unicode characters with a variable number of bytes for each character as UTF-8 has an interesting side effect. It means that UTF-8 encoded Unicode for the characters represented by the ASCII character set has the same binary representation as ASCII itself. This means computers can treat UTF-8 encoded Unicode as ASCII without any errors being raised as long as characters used are in the first 128 Unicode code points. This explains why your application might already be working perfectly well with certain Unicode strings even though you haven’t made a special effort to work with any character set except ASCII. This is also why as soon as a character such as £ or é is entered, the application will break because these are not ASCII characters; therefore, treating their UTF-8 encoded versions as ASCII will cause the kind of UnicodeDecodeError shown at the start of the chapter.’)
- http://diveintopython3.org/strings.html (‘…Western European languages like French, Spanish, and German have more letters than English. Or, more precisely, they have letters combined with various diacritical marks, like the ñ character in Spanish. The most common encoding for these languages is CP-1252, also called “windows-1252” because it is widely used on Microsoft Windows. The CP-1252 encoding shares characters with ASCII in the 0–127 range, but then extends into the 128–255 range for characters like n-with-a-tilde-over-it (241), u-with-two-dots-over-it (252), &c. It’s still a single-byte encoding, though; the highest possible number, 255, still fits in one byte…Unicode is a system designed to represent every character from every language. Unicode represents each letter, character, or ideograph as a 4-byte number. Each number represents a unique character used in at least one of the world’s languages…’)
- http://rishida.net/scripts/uniview/ (from http://simonwillison.net/2009/Dec/15/unicode/ : ‘Fantastically useful tool to convert strings of characters in to every unicode and/or escaping syntax you can possibly imagine.’)
- http://rishida.net/scripts/uniview/help.html (‘UniView is an XHTML-based application to look up characters, character blocks, paste in and discover unknown characters, store your own info about characters, search on character data, do hex/dec/ncr conversions, highlight character types, etc. etc. It supports Unicode 5.2 (beta) and is written with Web Standards to work on a variety of browsers’)
- http://www.stereoplex.com/two-voices/python-unicode-and-unicodedecodeerror (‘In the years I’ve been developing in Python, Unicode seems to be the topic which causes the greatest amount of confusion amongst developers. Hopefully much of this confusion should go away in Python 3, for reasons I’ll come to at the end; but until then, the UnicodeDecodeError is the bane of many developers’ lives.’)
- http://pyright.blogspot.com/2009/12/more-about-python-31-unicode.html (‘In a previous post, had a done a simple demonstration of Unicode decomposition and normalization with the latin character ä. This time I will do the same demonstration with non-latin characters beyond the range of 255’)
- http://french.joelonsoftware.com/Articles/Unicode.html
- http://plumberjack.blogspot.com/2010/07/using-custom-formatter-to-deal-with.html (‘Sometimes, you want to use Unicode in messages, and different logging handlers deal with Unicode in different ways. For example, FileHandler allows you to specify an encoding, which is then used to encode Unicode messages to bytes. In Python 2.x, SMTPHandler doesn’t do any encoding, which can lead to UnicodeEncodeErrors being raised when smtplib writes the message to a socket. To avoid this, you can use a Formatter which encodes the message for you, as in the following example...’)
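
Several of the excerpts above make claims that are easy to check by hand: decode bytes as they enter the application and encode as they leave, ASCII text has the same bytes in UTF-8, and CP-1252 stores ñ and ü as single bytes 241 and 252. What follows is a minimal sketch in Python 3, where `str` is already Unicode; the sample string and the variable names are made up for illustration and do not come from any of the linked pages.

```python
# Bytes as they might arrive from outside (file, socket, form post).
# The sample value is hypothetical and hard-coded for the demonstration.
raw = "café £5".encode("utf-8")

# Decode once at the boundary, then work with text inside the application.
text = raw.decode("utf-8")
shouted = text.upper()

# Encode again only when the data leaves the application.
outgoing = shouted.encode("utf-8")

# Text limited to the first 128 code points has identical ASCII and UTF-8 bytes.
ascii_matches_utf8 = "hello".encode("ascii") == "hello".encode("utf-8")

# é and £ need more than one byte in UTF-8, so decoding them as ASCII fails.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# CP-1252 is a single-byte encoding: one byte value per character.
cp1252_values = list("ñü".encode("cp1252"))
```

Here `ascii_failed` ends up `True` and `cp1252_values` holds the two byte values quoted in the Dive Into Python 3 excerpt.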

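
The decomposition and normalization that the pyright post mentions can be tried with the standard `unicodedata` module. This is only a sketch, not the code from that post: the `codepoints` helper is hypothetical, and the Cyrillic й is my own choice of a non-Latin character above code point 255.

```python
import unicodedata


def codepoints(s):
    """Return the code points of s as 'U+XXXX' strings (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in s]


# Latin ä (U+00E4) splits into 'a' plus a combining diaeresis under NFD.
a_umlaut = "\u00e4"
decomposed = unicodedata.normalize("NFD", a_umlaut)

# NFC puts the pair back together into the single precomposed character.
recomposed = unicodedata.normalize("NFC", decomposed)

# Cyrillic й (U+0439) splits into и (U+0438) plus a combining breve.
short_i = "\u0439"
short_i_nfd = unicodedata.normalize("NFD", short_i)

print(codepoints(decomposed))   # ['U+0061', 'U+0308']
print(codepoints(short_i_nfd))  # ['U+0438', 'U+0306']
```

Two strings that look identical on screen can therefore compare unequal unless both are normalized to the same form first.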