Problem scraping an ASP page with Python
I want to scrape data from a page on a website, using the Python code I normally use:
import StringIO
import pycurl

# Page to download
address = 'http://xin.cz3.nus.edu.sg/group/cjttd/ZFTTDDRUG.asp?ID=DAP000001'
print address
c = pycurl.Curl()
html = StringIO.StringIO()  # in-memory buffer that collects the response body
c.setopt(pycurl.URL, address)
c.setopt(pycurl.WRITEFUNCTION, html.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)  # follow HTTP redirects
c.setopt(pycurl.MAXREDIRS, 5)       # but at most 5 of them
c.perform()
page = html.getvalue()
But the content I get back is not what the page shows when opened normally in a browser.
What comes back is:
Sorry, the page cannot be displayed
There is a problem with the page you are trying to reach and it cannot be displayed.
--------------------------------------------
Please try the following:
• Contact the Web site administrator to let them know that this error has occured for this URL address.
HTTP 500.100 - Internal server error: ASP error.
Internet Information Services
--------------------------------------------
Technical Information (for support personnel)
Error Type:
SessionID, ASP 0164 (0x80004005)
An invalid TimeOut value was specified.
/LM/W3SVC/1/ROOT/global.asa, line 33
• Browser Type:
PycURL/7.18.2
• Page:
GET /group ttd/ZFTTDDRUG.asp
The same thing happens with urllib2.
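This is strange: I can scrape other pages without any problem, and only pages from this site fail. Have they configured something to block scraping, or can ASP pages simply not be fetched this way?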
Could someone who knows please advise? Thanks!!
[Solution]
Set a user agent:
c.setopt(pycurl.USERAGENT, 'any')
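Since the question mentions that urllib2 fails in the same way, the same idea can probably be applied there too. Below is a minimal sketch, assuming Python 3 (where urllib2 became urllib.request) and assuming the server only objects to the client's default User-Agent string. The helper names build_request and fetch and the 'Mozilla/5.0' value are made up for illustration and are not from the original post.

```python
import urllib.request

ADDRESS = 'http://xin.cz3.nus.edu.sg/group/cjttd/ZFTTDDRUG.asp?ID=DAP000001'


def build_request(url, user_agent='Mozilla/5.0'):
    """Return a Request object that carries an explicit User-Agent header.

    Without this, urllib.request sends its own default agent string
    ("Python-urllib/x.y"), much like pycurl sent "PycURL/7.18.2".
    """
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch(url, user_agent='Mozilla/5.0', timeout=30):
    """Download the page body as bytes (performs a real network request)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Usage (hypothetical, needs network access):
# page = fetch(ADDRESS)
```

The request is built in its own function so that the header can be inspected without touching the network. Whether this particular server accepts any agent string, as the 'any' value in the pycurl answer suggests, is not something the sketch can guarantee.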