u”In The Beginning”

I have to blog this: the story of where UTF-8 came from.  It's linked from Joel Spolsky’s excellent article on Unicode, which has been in Favorites\Unicode like, forever (well, since late 2003).

On the subject of Python and Unicode: I find that none of the IDEs that don’t cost real money can handle Unicode paste on Windows XP.  Boa Constrictor, IDLE, Pythonwin, PyCrust etc – all fail when submitted to the Москва test (and if that word appears as a set of question marks or boxes, then some feed that doesn’t grok Unicode has mangled this posting).  The test is simple: copy the word Москва from Notepad (which handles UTF-8 files very nicely, thank you) and paste it into the Python environment of your choice, in a command such as:

a = 'Москва'

or

a = u'Москва'

Several fail at this point, replacing the pasted Unicode string with question marks.  Those that pass then get subjected to Part 2, in which I grill them mercilessly with:

print a

None have so far succeeded in printing the string as it should be shown.  In the case of Pythonwin I’ve tracked through the source looking for how pasting is handled and become mired in a swamp of win32 integration, locale and pywin32 interactions.
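
For what it's worth, here is a minimal sketch of one way the question marks could arise. It assumes that the pasted text gets squeezed through a Western European code page such as cp1252, with lossy replacement, somewhere along the way. That is only a guess for illustration, not something I have confirmed in the Pythonwin source. The variable names are mine, and the snippet is written so that it also runs on newer Pythons:

```python
# -*- coding: utf-8 -*-
# Sketch: what a lossy trip through a Western European code page does to
# Cyrillic text. The choice of cp1252 is an assumption for illustration.
word = u'Москва'

# cp1252 has no Cyrillic letters, so with errors='replace' every
# character that cannot be encoded is swapped for a question mark.
mangled = word.encode('cp1252', 'replace')

# UTF-8 can represent every character, so this round trip is lossless.
round_tripped = word.encode('utf-8').decode('utf-8')
```

If something like this is happening, the damage is done at the encoding step, before print ever sees the string.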

Feel free to try different settings of the default encoding in site.py and if you get it to work, please, post it somewhere!

Let me not be misunderstood here; Python’s Unicode support is excellent.  The mismatch appears to be where the Python rubber meets the win32 road.

6 thoughts on “u”In The Beginning”

  1. You *should not* print unicode strings to a typical stdout. Never, ever, on any platform. You can, however, with a snippet like this, given a correct terminal encoding:

    import codecs, sys
    sys.stdout = codecs.getwriter('yourterminalencoding')(sys.stdout)

    Also, it is not correct to do:
    a = 'Москва'

    You should do:
    a = u'Москва'

    Though you need to make sure that the encoding of the .py file itself can be determined (there is some stuff you can put at the top of the file to ensure this).
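
The "stuff you can put at the top of the file" is presumably the source encoding declaration described in PEP 263. A minimal sketch, assuming the .py file really is saved as UTF-8 (the variable names are only illustrative):

```python
# -*- coding: utf-8 -*-
# The declaration above has to sit on the first or second line of the
# file. It tells the interpreter how to decode the bytes of the source,
# so the literal below is read as six Cyrillic characters.
a = u'Москва'

n_chars = len(a)                  # number of characters in the string
utf8_bytes = a.encode('utf-8')    # each Cyrillic letter takes two bytes
```

Note that this only covers source files; it says nothing about text pasted into an interactive window.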

    • Hang on – if the stdout can support Unicode, why shouldn’t it support printing to it? Should Unicode programs only ever work on ASCII platforms? What about all those Unix systems happily operating in Japan? Where does it say that stdout is an ASCII output stream?
      I know about the encoding of the .py file trick – as I said, I have no problems with Python’s support of Unicode, just with the way in which the interactive environments on Windows handle it.
      regards
      ben

      • You can only write str to stdout. unicode will be coerced to str using the default encoding, but that can never be expected to get set to anything useful, so you should ***ALWAYS*** wrap stdout if you are printing unicode to it.
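
The wrapping idea can be sketched as follows. This is a hedged illustration rather than a drop-in fix: `wrap_for_unicode` is a hypothetical helper name, an in-memory `io.BytesIO` stands in for the terminal's byte stream so the result can be inspected, and UTF-8 is assumed as the terminal encoding.

```python
# -*- coding: utf-8 -*-
import codecs
import io


def wrap_for_unicode(byte_stream, encoding):
    """Return a writer that encodes unicode text before writing it.

    Hypothetical helper for illustration; not part of any library.
    """
    writer_class = codecs.getwriter(encoding)
    # errors='replace' degrades to '?' instead of raising an exception
    # when a character cannot be represented in the target encoding.
    return writer_class(byte_stream, errors='replace')


# A BytesIO stands in for the real terminal stream (an assumption made
# so the written bytes can be examined afterwards).
fake_terminal = io.BytesIO()
out = wrap_for_unicode(fake_terminal, 'utf-8')
out.write(u'Москва\n')
written = fake_terminal.getvalue()
```

The point of the wrapper is that the encoding is chosen explicitly, once, instead of being left to whatever the default encoding happens to be.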

      • Well, I think I see your general point Bob, but it’s not related to the point I’m making. In an interactive Python environment, on a Unicode-capable platform, should there not be a way in which I can enter Unicode literals copied and pasted from another Unicode application? I think so. That’s all I’m saying 🙂

        I’m happy to wrap stdout in any form of wrapper required (and stdin for that matter) if it will allow Pythonwin to exchange Unicode strings with the win32 system that supports them. In fact, thinking back, I don’t even think that Pythonwin uses stdout and stdin for the interactive window – it does its magic using the strings that are sent from the pywin32 layer, so there need be no question of encoding.

        Anyway, back to the issue about stdout for a moment, if it will only accept strings (types.StringType), then my question is; to what encoding do those strings need to be converted to work on a win32 platform? In that respect, you’re entirely correct; I should have been clearer in what I posted.

        best regards
        ben
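
As a rough sketch of how a program might pick that encoding: ask the output stream first, then fall back to the locale. `guess_terminal_encoding` is a hypothetical helper name, and the assumption that `sys.stdout.encoding` can be missing or empty in a GUI shell is just that, an assumption I have not checked against Pythonwin.

```python
import codecs
import locale
import sys


def guess_terminal_encoding():
    """Best-effort guess at the encoding to use for output.

    Hypothetical helper for illustration; not part of any library.
    """
    # Some stream objects have no usable 'encoding' attribute at all.
    encoding = getattr(sys.stdout, 'encoding', None)
    if not encoding:
        # Fall back to what the C library's locale reports.
        encoding = locale.getpreferredencoding()
    try:
        codecs.lookup(encoding)
    except (LookupError, TypeError):
        # Unknown or missing name: ASCII is the only safe assumption.
        encoding = 'ascii'
    return encoding


terminal_encoding = guess_terminal_encoding()
codec_info = codecs.lookup(terminal_encoding)
```

On a Western European Windows install the locale fallback would be expected to report a code page such as cp1252, which cannot represent Москва, so a guess like this explains the problem more than it solves it.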

      • I think win32 is typically latin-1 or utf-16-le (wchar?). I’m really not sure, I migrated off win32 several years ago. On OS X everything is UTF-8 (stdin/stdout of the terminal) or utf-16-be (NSString/CFString use it internally). As far as pasteboards go in OS X, they’re handled for you behind the scenes for textfields and such so no special code has to be written.

        When I did an interactive console for OS X, I wrote my own stdout/stdin objects that treated the data as UTF-8. Not sure how smart the win32 apps are.

      • Latin-1’s the general default for Western Europe/US, but you can have a variety of code pages set up – for example, for me (being a Brit), locale.getdefaultlocale() returns ('en_GB', 'cp1252'). But that’s just a locale decision made by the C library – the win32 libraries can handle wide (Unicode) characters fine. I think it would be better if, a la OS X, strings were UTF-8, but one works with what one has 🙂

        There should be a way to do this – Pythonwin, for example, when a Unicode string is pasted into it, will display it fine at the time of pasting, but the actual value that gets assigned is all ????, etc. So somewhere there’s a mismatch. I’ll keep digging…

        It’d be nice to migrate off win32, but sometimes these choices are not open to us…

        regards
        ben
