Python's Unicode handling is genuinely painful. I only wanted to write a small script, and all of my time went into dealing with Unicode instead. My Python version:
python -c 'import sys;print sys.version'
2.6.4 (r264:75706, Dec 7 2009, 18:45:15)
[GCC 4.4.1]
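import os,sys
from BeautifulSoup import BeautifulSoup, SoupStrainer

def get_info(cont):
    print type(cont)
    soup = BeautifulSoup(cont)
    # Collect every <a> tag in the document
    a = soup.findAll('a')
    print type(a)
    # This is the line that raises UnicodeEncodeError in the traceback
    print(a)

if __name__ == "__main__":
    s = sys.stdin.read()
    # Decode the raw bytes read from stdin into a unicode object
    s = unicode(s, 'utf-8')
    get_info(s)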
Error output:
505 ~/script/notsobad/python/tool>cat /tmp/book_2742/index.html | ./book_res.py
Traceback (most recent call last):
  File "./book_res.py", line 28, in <module>
    get_info(s)
  File "./book_res.py", line 20, in get_info
    print(a)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 79-82: ordinal not in range(128)
type(a) is unicode, yet print a raises an error. Eventually I found this thread on the Python tutor mailing list:
http://mail.python.org/pipermail/tutor/2005-August/040991.html
http://mail.python.org/pipermail/tutor/2005-August/040993.html
This is the first question in the BeautifulSoup FAQ at http://www.crummy.com/software/BeautifulSoup/FAQ.html Unfortunately the author of BS considers this a problem with your Python installation! So it seems he doesn’t have a good understanding of Python and Unicode. (OK, I can forgive him that, I think there are only a handful of people who really do understand it completely.) The first fix given doesn’t work. The second fix works but it is not a good idea to change the default encoding for your Python install. There is a hack you can use to change the default encoding just for one program; in your program put reload(sys); sys.setdefaultencoding('utf-8') This seems to fix the problem you are having.
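The BeautifulSoup documentation covers this too, but it treats it as a problem with the Python installation:
http://www.crummy.com/software/BeautifulSoup/documentation.html#Why%20can%27t%20Beautiful%20Soup%20print%20out%20the%20non-ASCII%20characters%20I%20gave%20it?
The cause appears to be the default encoding returned by sys.getdefaultencoding(), which Python uses when str() is called on a unicode object.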
511 ~/script/notsobad/python/tool>python -c 'import sys;print sys.getdefaultencoding()'
ascii
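To make the role of the default encoding concrete, here is a minimal sketch that does explicitly what str() does implicitly on a unicode object in Python 2: encode it with a codec. The sample string and the helper name try_encode are made up for illustration; they are not from my script.

```python
# Sketch only: the sample text and try_encode() are hypothetical.
text = u'\u4e2d\u6587'  # two non-ASCII characters

def try_encode(value, codec):
    """Return the encoded bytes, or the exception if the codec cannot represent value."""
    try:
        return value.encode(codec)
    except UnicodeEncodeError as exc:
        return exc

# With 'ascii' (the default reported above) the encode step fails,
# the same kind of error as in the traceback.
ascii_result = try_encode(text, 'ascii')

# With 'utf-8' the same text encodes without trouble.
utf8_result = try_encode(text, 'utf-8')
```

So as long as the default stays ascii, any implicit conversion of non-ASCII unicode text to a byte string will fail this way.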
This default can be changed by editing sitecustomize.py:
512 ~/script/notsobad/python/tool>locate sitecustomize.py
/etc/python2.6/
/usr/lib/python2.6/sitecustomize.py
/usr/lib/python2.6/sitecustomize.pyc
Add these two lines:
import sys
sys.setdefaultencoding("utf-8")
Or, inside the script itself, add this one line:
reload(sys); sys.setdefaultencoding('utf-8')
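Since the mailing list reply calls changing the default encoding a hack, a possible alternative is to leave the default alone and encode explicitly at the point of output. The following is only a sketch under that assumption: write_utf8 is a hypothetical helper name, not part of BeautifulSoup or of my script, and I have not checked it against the original page.

```python
import io
import sys

def write_utf8(text, stream=None):
    """Encode text as UTF-8 and write the resulting bytes to a binary stream."""
    if stream is None:
        # Python 3 exposes the byte stream as sys.stdout.buffer;
        # on Python 2, sys.stdout itself accepts byte strings.
        stream = getattr(sys.stdout, 'buffer', sys.stdout)
    stream.write(text.encode('utf-8'))

# Demonstration against an in-memory buffer instead of the terminal.
buf = io.BytesIO()
write_utf8(u'caf\u00e9\n', buf)
captured = buf.getvalue()
```

In the script above, something like write_utf8(unicode(tag)) for each tag in a would take the place of print(a), assuming the tags convert cleanly to unicode.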
And with that, the world is quiet again.