Problem extracting the text content of HTML with BeautifulSoup
I defined a function to extract the text:
def gettext(html):
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html)
    return soup.get_text()
This is my function for downloading the content:
def downURL(url, filename):
    print url
    print filename
    try:
        fp = urllib2.urlopen(url)
    except:
        print 'download exception'
        return 0
    op = open(filename, "wb")
    while 1:
        s = fp.read()
        if not s:
            break
        s = gettext(s)
        op.write(s)
    fp.close()
    return 1
When I run it, I get this warning:
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
and also this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 106-124: ordinal not in range(128)
[Solution]
op.write(s.encode('utf-8'))
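As a rough illustration of where the UnicodeEncodeError comes from: get_text() returns a unicode string, and writing that to a file opened in binary mode makes Python 2 encode it with the default ascii codec. The sketch below reproduces the failure with an explicit encode and then shows an explicit utf-8 encode succeeding. The sample string and the choice of utf-8 are assumptions, not something taken from the original post, and the snippet is written so that it runs on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Stand-in for the unicode string that soup.get_text() would return
# (hypothetical sample, not taken from the original page).
text = u"\u4e2d\u6587 text"

try:
    # The same conversion Python 2 attempts implicitly in op.write(text)
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    print(exc)  # the "'ascii' codec can't encode characters" message

# Encoding explicitly yields bytes that a file opened with "wb" accepts.
data = text.encode("utf-8")
```

In downURL that would mean writing the encoded bytes rather than the unicode string itself.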
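[Solution]
I already explained this to you on Baidu Zhidao.
I won't repeat it all here; I'll just post the related material for reference:
[Summary] Common character encoding and decoding errors in Python 2.x and how to fix them
[Summary] Notes on using the third-party Python library BeautifulSoup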
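Putting the two points together, here is a minimal sketch, not a definitive fix. It assumes bs4 is installed. The save_text helper, the encoding parameter, and the choice of utf-8 for the output file are hypothetical additions for illustration. Passing from_encoding may help with the REPLACEMENT CHARACTER warning when the page's real charset is known, since that warning means BeautifulSoup's guess at the encoding could not decode every byte.

```python
import io


def gettext(html, encoding=None):
    """Return the text of an HTML document as a unicode string."""
    # Third-party dependency, imported lazily as in the question.
    from bs4 import BeautifulSoup
    # Naming the parser avoids depending on whichever one is installed;
    # from_encoding is only a hint for byte input with a known charset.
    soup = BeautifulSoup(html, "html.parser", from_encoding=encoding)
    return soup.get_text()


def save_text(text, filename, encoding="utf-8"):
    """Write unicode text to filename, encoding it explicitly."""
    # io.open behaves the same on Python 2 and 3 and does the encoding,
    # so no implicit ascii conversion takes place.
    with io.open(filename, "w", encoding=encoding) as op:
        op.write(text)
```

Inside downURL this could be used as save_text(gettext(fp.read(), "gbk"), filename), where "gbk" is only a guess at the page's charset and should be replaced by the real one.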