Implementing strip_tags in Python, and notes on unicode

Implementing strip_tags in Python to remove HTML tags

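from HTMLParser import HTMLParser  # Python 2; on Python 3 the module is html.parser

def strip_tags(html):
    result = []
    parser = HTMLParser()
    # Route every text node into the list; tags are ignored by default.
    parser.handle_data = result.append
    parser.feed(html)
    parser.close()
    return ''.join(result)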
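
For comparison, here is a minimal sketch of the same idea written as a subclass rather than by assigning handle_data on an instance. It assumes Python 3, where the parser lives in html.parser. The class name TagStripper and the sample input are made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 location of the module


class TagStripper(HTMLParser):
    """Hypothetical helper: collects text nodes and drops all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.parts.append(data)


def strip_tags(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered in the parser
    return ''.join(parser.parts)


print(strip_tags('<p>Hello <b>world</b>!</p>'))  # Hello world!
```

Subclassing keeps the collected text on the parser object itself, which may be easier to extend later, for example to skip the contents of script tags.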

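About unicode: see http://evanjones.ca/python-utf8.html

Summary: when an encoding error shows up, check the variable's type with type(). That makes the problem easy to find; print will not do, because it shows the text but not whether the value is str or unicode. unicode <--> str can be converted into each other: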

In [28]: a = 'a'

In [29]: type a
-------> type(a)
Out[29]: <type 'str'>

In [30]: b = unicode(a, 'utf-8')

In [31]: type b
-------> type(b)
Out[31]: <type 'unicode'>

In [32]: b.decode('utf-8')
Out[32]: u'a'

In [33]: b.encode('utf-8')
Out[33]: 'a'
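
The session above uses a pure ASCII value, which hides most of the trouble: in Python 2, calling decode on a unicode object first encodes it implicitly with the ASCII codec, so In [32] only works because 'a' is ASCII. Below is a small sketch of the same round trip with a non-ASCII string. It is written to run on both Python 2 and Python 3; the names text and raw and the sample word are assumptions chosen for illustration.

```python
# -*- coding: utf-8 -*-

text = u'caf\xe9'            # text: unicode on Python 2, str on Python 3
raw = text.encode('utf-8')   # encoded bytes: str on Python 2, bytes on Python 3

# The types differ, which is what type() reveals when debugging.
print(type(text))
print(type(raw))

# Encoding turns the single character e-acute into two UTF-8 bytes.
assert raw == b'caf\xc3\xa9'

# Decoding with the same codec restores the original text.
assert raw.decode('utf-8') == text
assert type(text) is not type(raw)
```

The rule of thumb this illustrates: encode goes from text to bytes, decode goes from bytes to text, and checking type() tells you which of the two you are holding.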