Problems extracting text content from HTML with BeautifulSoup

2013-03-06

I defined a function to extract the text:
def gettext(html):
    from bs4 import BeautifulSoup
    soup= BeautifulSoup(html)
    return soup.get_text()
This is my function for downloading the content:
def downURL(url,filename):
    print url
    print filename
    try:
        fp = urllib2.urlopen(url)
    except:
        print 'download exception'
        return 0
    op = open(filename,"wb")
    while 1:
        s = fp.read()
        if not s:
            break
        s=gettext(s)
        op.write(s)
        fp.close()

        return 1
When I run it, I get this warning:
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
and also this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 106-124: ordinal not in range(128)
[Solution]
op.write(str(s))
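[Solution]
I already explained this to you on Baidu Zhidao.
I won't repeat it all here; I'll just post the related material for reference:
[Summary] Common character encoding and decoding errors in Python 2.x and how to fix them

[Summary] Notes on using the third-party Python library BeautifulSoup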
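
For reference, here is a minimal sketch of one common way around the UnicodeEncodeError above. It assumes the cause is that soup.get_text() returns decoded (unicode) text while the file was opened with "wb", so Python 2 falls back to the ASCII codec when writing. The helper names save_text and load_text, the sample string, and the temporary file path are made up for illustration and are not from the original post; the sample string only stands in for the output of gettext().

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def save_text(text, filename, encoding="utf-8"):
    """Write already-decoded text to filename with an explicit encoding.

    Hypothetical helper, not part of the original post.
    """
    # io.open encodes the text on write, so the implicit ASCII codec
    # is never involved (works on Python 2.7 and Python 3).
    with io.open(filename, "w", encoding=encoding) as op:
        op.write(text)


def load_text(filename, encoding="utf-8"):
    """Read the file back as decoded text (hypothetical helper)."""
    with io.open(filename, "r", encoding=encoding) as fp:
        return fp.read()


# Stand-in for the unicode text that soup.get_text() would return.
sample = u"BeautifulSoup \u63d0\u53d6\u7684\u6587\u672c"
path = os.path.join(tempfile.mkdtemp(), "page.txt")
save_text(sample, path)
```

In downURL this would mean opening the output file with io.open(filename, "w", encoding="utf-8") instead of open(filename, "wb"), or equivalently keeping "wb" and writing s.encode("utf-8"). As for the REPLACEMENT CHARACTER warning, it probably means BeautifulSoup guessed the page's encoding wrong; if the real encoding is known, passing it through the from_encoding argument of BeautifulSoup may help.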
