Question about extracting web page information with Python regular expressions

2012-02-09


Extracting web page information with Python regular expressions
I am using Python 3.2.

1. The page is UTF-8 encoded. Here is the code that fetches the page source:

Python code

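        # Assumed imports (not shown in the post): import http.cookiejar; import urllib.request as ur
        opener = ur.build_opener(ur.HTTPCookieProcessor(http.cookiejar.CookieJar()))
        opener.addheaders = [('User-agent', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322)')]
        html = opener.open(self.m_url)
        # Decode the raw bytes into a str
        html = html.read().decode('utf-8')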

The relevant fragment of the page looks like this:

HTML code

  <div class="actifl">
    <h2>中文文字</h2>

2. When I run the regular expression against the fragment I copied out of the page, "中文文字" is extracted correctly:

Python code

regx = '<div class=\"actifl\">(.*?)<h2>(?P<title>.*?)</h2>'
pattern = re.compile(regx, re.U | re.S)
match1 = pattern.match('<div class=\"actifl\">  \n<h2>中文文字</h2>')

3. But when I run it against the HTML downloaded directly from the site, it always fails:

Python code

regx = '<div class=\"actifl\">(.*?)<h2>(?P<title>.*?)</h2>'
pattern2 = re.compile(regx, re.U | re.S)
match2 = pattern2.match(html)

Maybe the problem is the html variable. Is it an encoding issue, or is the regular expression wrong?

PS:
(1) When I read a text file with open('1.txt','r'), is it decoded as GBK by default?
If the file I want to read is UTF-8, do I need to do
f=open('1.txt','rb')
f=f.decode('utf-8')
so that f ends up as a unicode string?
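
On (1): without an encoding argument, text-mode open() in Python 3 falls back to a platform-dependent default, typically whatever locale.getpreferredencoding(False) reports, which on a Chinese-language Windows is usually GBK (cp936). Note also that a file object has no decode() method; the bytes come from read(). A minimal sketch of two ways to read a UTF-8 file, using a temporary file as a hypothetical stand-in for 1.txt:

```python
import locale
import os
import tempfile

# Create a small UTF-8 file to read back (a stand-in for '1.txt').
path = os.path.join(tempfile.mkdtemp(), '1.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('中文文字')

# Text mode without encoding= typically uses this platform-dependent default.
default_encoding = locale.getpreferredencoding(False)

# Option 1: text mode with an explicit encoding; read() returns str.
with open(path, 'r', encoding='utf-8') as f:
    text_a = f.read()

# Option 2: binary mode; read() returns bytes, which are then decoded to str.
with open(path, 'rb') as f:
    text_b = f.read().decode('utf-8')
```

Passing encoding='utf-8' explicitly is usually the simpler choice, since the result no longer depends on the machine's locale.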

(2) What types of argument does print() actually accept?
print ('111')
Suppose var is of type bytes:
print (var.decode('utf-8')) Why does this raise an error?

Aren't var.decode('utf-8') and '111' both of type str?
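
On (2): print() accepts any objects, converts each one with str(), and writes the result to sys.stdout, so both calls really do pass a str. The error message is not shown here; one common cause, assumed in this sketch, is a UnicodeEncodeError raised when the console encoding (sys.stdout.encoding) cannot represent some character in the decoded text. The ASCII stream below is only a stand-in for such a console:

```python
import io

var = '中文文字'.encode('utf-8')   # bytes
text = var.decode('utf-8')         # str, the same type as '111'

# Simulate a console whose encoding cannot represent the text
# (ASCII purely for illustration; a GBK console fails on fewer characters).
narrow_console = io.TextIOWrapper(io.BytesIO(), encoding='ascii')
try:
    print(text, file=narrow_console)
    narrow_console.flush()
    failed = False
except UnicodeEncodeError:
    failed = True

# Same encoding, but unencodable characters are replaced with '?'
# instead of raising.
tolerant = io.TextIOWrapper(io.BytesIO(), encoding='ascii', errors='replace')
print(text, file=tolerant)
tolerant.flush()
written = tolerant.buffer.getvalue()
```

If this is the cause, checking sys.stdout.encoding on the affected machine shows which characters the console can actually display.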

Thanks, everyone.

[Solution]
Please post the HTML of the page you are trying to match.
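
Judging only from the code posted in the question, one likely cause is that pattern.match() only matches at the very beginning of the string. The hand-copied fragment starts with the <div>, but a downloaded page normally starts with a doctype or <html> tag, so match() returns None, whereas pattern.search() scans the whole string. A minimal sketch, with a made-up page string standing in for the downloaded html:

```python
import re

# Hypothetical stand-in for the downloaded page: the target <div> is not at
# the very start of the string, as in any complete HTML document.
html = (
    '<!DOCTYPE html>\n<html><body>\n'
    '  <div class="actifl">    <h2>中文文字</h2>\n'
    '</body></html>'
)

# re.U is already the default for str patterns in Python 3, so only re.S is
# passed here (it lets '.' match newlines as well).
pattern = re.compile(r'<div class="actifl">(.*?)<h2>(?P<title>.*?)</h2>', re.S)

# match() succeeds only if the pattern matches at position 0.
anchored = pattern.match(html)

# search() looks for the first position in the string where the pattern matches.
found = pattern.search(html)
title = found.group('title') if found else None
```

If search() also returns None on the real page, the markup probably differs from the copied fragment (for example extra attributes on the <div>), which is why seeing the actual HTML would help.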
