Python's Unicode handling is really painful. I only wanted to write a small script, but all my time went into fighting Unicode. My Python version:

python -c 'import sys;print sys.version'
2.6.4 (r264:75706, Dec  7 2009, 18:45:15) 
[GCC 4.4.1]



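import os,sys
from BeautifulSoup import BeautifulSoup, SoupStrainer

def get_info(cont):
    # cont has already been decoded to a unicode object by the caller
    print type(cont)
    soup = BeautifulSoup(cont)
    # Collect every <a> tag in the page
    a = soup.findAll('a')
    print type(a)
    # This is the call that raises UnicodeEncodeError
    print(a)

if __name__ == "__main__":
    # Read the HTML from stdin and decode the UTF-8 bytes to unicode
    s = sys.stdin.read()
    s = unicode(s, 'utf-8')
    get_info(s)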

Error output:

505 ~/script/notsobad/python/tool>cat /tmp/book_2742/index.html | ./book_res.py 


Traceback (most recent call last):
  File "./book_res.py", line 28, in <module>
    get_info(s)
  File "./book_res.py", line 20, in get_info
    print(a)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 79-82: ordinal not in range(128)

type(a) is unicode, yet print a still raises the error. In the end I found the answer on the Python tutor mailing list: http://mail.python.org/pipermail/tutor/2005-August/040991.html http://mail.python.org/pipermail/tutor/2005-August/040993.html

This is the first question in the BeautifulSoup FAQ at http://www.crummy.com/software/BeautifulSoup/FAQ.html Unfortunately the author of BS considers this a problem with your Python installation! So it seems he doesn’t have a good understanding of Python and Unicode. (OK, I can forgive him that, I think there are only a handful of people who really do understand it completely.) The first fix given doesn’t work. The second fix works but it is not a good idea to change the default encoding for your Python install. There is a hack you can use to change the default encoding just for one program; in your program put reload(sys); sys.setdefaultencoding('utf-8') This seems to fix the problem you are having.

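The BeautifulSoup documentation covers this too, but it treats it as a problem with the Python installation: http://www.crummy.com/software/BeautifulSoup/documentation.html#Why%20can%27t%20Beautiful%20Soup%20print%20out%20the%20non-ASCII%20characters%20I%20gave%20it? The cause appears to be the default encoding reported by sys.getdefaultencoding(), which is the codec used when str() is called on a unicode object.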

511 ~/script/notsobad/python/tool>python -c 'import sys;print sys.getdefaultencoding()'
ascii
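
The failure does not really depend on BeautifulSoup. As a minimal sketch (the sample string and variable names are my own, not from the script above), encoding non-ASCII text with the ascii codec raises the same UnicodeEncodeError. On Python 2, str() on a unicode object performs this encode step implicitly with the default encoding; the codec is named explicitly here so the snippet behaves the same on Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
# Minimal reproduction, independent of BeautifulSoup.
# The sample text is an assumption chosen for illustration.
text = u'\u4e2d\u6587'  # two non-ASCII characters

try:
    # Same codec that Python 2 uses by default for implicit conversions
    text.encode('ascii')
    failed = False
    message = ''
except UnicodeEncodeError as exc:
    failed = True
    message = str(exc)

# A codec that can represent the characters succeeds
encoded = text.encode('utf-8')
```

The ascii codec only covers code points below 128, which is exactly what the "ordinal not in range(128)" part of the traceback is saying.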

This default can be changed by editing sitecustomize.py:

512 ~/script/notsobad/python/tool>locate sitecustomize.py
/etc/python2.6/
/usr/lib/python2.6/sitecustomize.py
/usr/lib/python2.6/sitecustomize.pyc

Add these two lines:

import sys
sys.setdefaultencoding("utf-8")

Or add this one line in the code itself:

reload(sys); sys.setdefaultencoding('utf-8')
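
Since the mailing list reply warns against changing the default encoding, here is a hedged alternative sketch: encode explicitly at the point of output and leave the default alone. The helper name write_text and the sample markup are hypothetical, not part of the original script, and I have only reasoned through this rather than wiring it into book_res.py.

```python
# -*- coding: utf-8 -*-
import io
import sys

def write_text(text, stream=None, encoding='utf-8'):
    """Encode text explicitly and write the resulting bytes to a stream.

    write_text is a hypothetical helper, not part of BeautifulSoup.
    """
    if stream is None:
        # Python 3 exposes the byte stream as sys.stdout.buffer;
        # Python 2's sys.stdout accepts byte strings directly.
        stream = getattr(sys.stdout, 'buffer', sys.stdout)
    data = text.encode(encoding)
    stream.write(data + b'\n')
    return data

# Demo against an in-memory byte stream instead of the terminal
buf = io.BytesIO()
write_text(u'<a href="/book">\u4e66</a>', stream=buf)
```

In get_info, the idea would be to loop over a and pass the unicode form of each tag to write_text instead of calling print(a) on the whole result, assuming the tags convert cleanly with unicode().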

And the world is quiet at last...