Saturday, 22 December 2007

Wrapping my head around Unicode in Python

Most interesting tutorial I've found: http://boodebr.org/main/python/all-about-python-and-unicode

Key points:

  • Unicode strings are sequences of abstract "symbols" or "objects" (no fixed computer representation, so don't think in bytes); codecs transform them into byte strings so you can print them, store them on disk, send them across the network...
  • a unicode string example with some Greek characters: unicodeString = u"abc_\u03a0\u03a3\u03a9.txt"
  • you shouldn't 'print' a unicode string without encoding it first (when Python cannot detect the output encoding, e.g. when output is piped, it falls back to ASCII, which raises a UnicodeEncodeError if there are non-ASCII characters)
  • you can print a unicode "representation": print repr(unicodeString)
  • you encode with the .encode method: binary = unicodeString.encode("utf-8")
  • you can see the binary result like this: print "UTF-8", repr(unicodeString.encode('utf-8'))
  • you can choose what happens to characters the codec cannot encode: print "ASCII", unicodeString.encode('ascii', 'replace') # replaces them with '?'
  • from binary to unicode: unicode(utf8_string, 'utf-8') # you must specify the encoding, otherwise Python assumes it's ASCII
  • once you have a unicode object, it behaves like a regular string object, so there is no new syntax to learn (other than the \u and \U escapes)
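
The key points above can be put together in one short round trip. This is only a sketch: the variable names are mine, and it uses the bytes .decode() method rather than unicode() so that, as far as I can tell, the same lines run on both Python 2.7 and Python 3.3+.

```python
# Sketch of the encode/decode round trip described in the key points.
# Variable names are illustrative, not taken from the tutorial.
unicode_string = u"abc_\u03a0\u03a3\u03a9.txt"

# unicode -> bytes: pick a codec explicitly
utf8_bytes = unicode_string.encode("utf-8")
print(repr(utf8_bytes))

# ASCII cannot represent the Greek letters; 'replace' substitutes '?'
ascii_bytes = unicode_string.encode("ascii", "replace")
print(repr(ascii_bytes))

# Without an error handler the same call raises UnicodeEncodeError
try:
    unicode_string.encode("ascii")
except UnicodeEncodeError as exc:
    print("cannot encode: %s" % exc)

# bytes -> unicode: name the codec the bytes were written with
decoded = utf8_bytes.decode("utf-8")
assert decoded == unicode_string
```

Each Greek letter takes two bytes in UTF-8, so the encoded byte string is longer than the unicode string it came from.
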
Other links:
* http://www.jorendorff.com/articles/unicode/python.html
* http://evanjones.ca/python-utf8.html
* http://vim.sourceforge.net/tips/tip.php?tip_id=246
* http://farmdev.com/thoughts/23/what-i-thought-i-knew-about-unicode-in-python-amounted-to-nothing/
from my wikinote
