Saturday, 22 December 2007

Wrapping my head around Unicode in Python

Most interesting tutorial I've found:


  • Unicode strings are made of "symbols" or "objects" (no fixed computer representation, so don't think in bytes); codecs transform them into binary strings (so you can print them, store them on disk, send them across the network...).
  • an example Unicode string with some Greek characters: unicodeString = u"abc_\u03a0\u03a3\u03a9.txt"
  • you shouldn't 'print' a Unicode string without encoding it first (by default Python will encode it as ASCII, which can lead to errors if there are non-ASCII characters)
  • you can print a unicode "representation": print repr(unicodeString)
  • you encode with the .encode method: binary = unicodeString.encode("utf-8")
  • you can see the binary result like this: print "UTF-8", repr(unicodeString.encode('utf-8'))
  • print "ASCII", unicodeString.encode('ascii', 'replace') # replaces characters that cannot be encoded with '?'
  • from binary to Unicode: unicode(utf8_string, 'utf-8') # you must specify the encoding; otherwise Python assumes it's ASCII
  • once you have a Unicode object, it behaves exactly like a regular string object, so there is no new syntax to learn (other than the \u and \U escapes)
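
To check my understanding of the list above, here is a minimal sketch of the encode/decode round trip. It reuses the example string from the notes; the other variable names (utf8_bytes, ascii_bytes, strict_failed, decoded) are my own. I wrote it so that it should run on both Python 2.6+ and Python 3.3+, which is why it uses .decode('utf-8') rather than unicode(..., 'utf-8') and avoids the print statement.

```python
# Sketch of the Unicode <-> bytes round trip described in the notes.
# Assumption: Python 2.6+ or Python 3.3+ (both accept u"" and b"" literals).

unicode_string = u"abc_\u03a0\u03a3\u03a9.txt"  # example string from the notes

# Unicode -> bytes: always name the codec explicitly.
utf8_bytes = unicode_string.encode("utf-8")

# Each Greek letter takes two bytes in UTF-8, so the byte string is longer
# than the Unicode string (11 symbols versus 14 bytes).
assert len(unicode_string) == 11
assert len(utf8_bytes) == 14

# ASCII cannot represent the Greek letters; 'replace' substitutes '?'.
ascii_bytes = unicode_string.encode("ascii", "replace")

# Without an error handler, the same call raises UnicodeEncodeError.
try:
    unicode_string.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# bytes -> Unicode: name the codec the bytes were written in.
decoded = utf8_bytes.decode("utf-8")
assert decoded == unicode_string
```

The 'replace' handler loses information (the three Greek letters all become '?'), so it is only useful for display; the UTF-8 bytes are the ones that decode back to the original string.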
Other links:
from my wikinote
