Multithreaded Fetching of Google Search Result Pages in Python

2013-04-12


(1) Scraping Google search links with urllib2 + BeautifulSoup

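Recently, a project I am working on needed to process Google search results, and I had already learned some of the Python tools for handling web pages. In practice I used urllib2 and BeautifulSoup to fetch pages, but when scraping Google search results I found that working directly on the source of the results page yields many "dirty" links.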

The figure below shows the results of searching for "titanic james":

[Figure: screenshot of the Google results page for the "titanic james" search]

The links marked in red in the figure are unwanted; those marked in blue are the ones to fetch and process.

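These "dirty" links could of course be removed with filtering rules, but that would make the program more complex. While I was glumly writing those rules, a classmate pointed out that Google probably offers an API for this, and it suddenly clicked.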
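To give a sense of what such filtering rules can look like, here is a minimal sketch in Python 3 using only the standard library (the post itself used Python 2 with urllib2 and BeautifulSoup). The two rules shown, dropping relative links and dropping links that point back to Google's own hosts, are assumptions chosen for illustration, not the rules written for this project. `LinkCollector`, `filter_links` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def filter_links(hrefs, blocked_suffixes=("google.com",)):
    """Keep absolute http(s) links whose host is not on a blocked domain.

    The blocked-domain rule is an assumed example of a "dirty link" rule.
    """
    kept = []
    for href in hrefs:
        parts = urlparse(href)
        if parts.scheme not in ("http", "https"):
            continue  # relative links, javascript: links, and so on
        host = parts.hostname or ""
        if any(host == s or host.endswith("." + s) for s in blocked_suffixes):
            continue  # link points back to the search engine itself
        kept.append(href)
    return kept


# Made-up HTML standing in for a results page.
sample_html = """
<a href="/preferences?hl=en">Settings</a>
<a href="http://www.google.com/intl/en/about.html">About</a>
<a href="http://www.imdb.com/title/tt0120338/">Titanic (1997)</a>
<a href="https://en.wikipedia.org/wiki/James_Cameron">James Cameron</a>
"""

collector = LinkCollector()
collector.feed(sample_html)
clean_links = filter_links(collector.hrefs)
print(clean_links)
```

Even this small version needs a rule per kind of unwanted link, which is the complexity the paragraph above refers to.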

(2) Google Web Search API + multithreading

After some searching, I found that the Google Web Search API was what I needed. Google provides detailed usage instructions, with notes for different languages: https://developers.google.com/web-search/docs/
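As a rough sketch of the request and response handling involved, the snippet below builds a query URL and pulls the result links out of a decoded JSON body. It is written in Python 3, whereas the script in this post is Python 2. The endpoint and the `v`, `q`, `rsz` and `start` parameters are the ones used by that script; the sample JSON is hand-written and only mirrors the `responseData` / `results` / `url` fields the script reads, and `build_search_url` and `extract_urls` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://ajax.googleapis.com/ajax/services/search/web"


def build_search_url(keywords, results_per_page=8, start=0):
    """Build a query URL with the v, q, rsz and start parameters."""
    query = urlencode({"v": "1.0", "q": keywords,
                       "rsz": results_per_page, "start": start})
    return API_ENDPOINT + "?" + query


def extract_urls(payload):
    """Return the 'url' of each entry under responseData.results."""
    data = payload.get("responseData") or {}  # tolerate a null responseData
    return [item["url"] for item in data.get("results", [])]


# Hand-written sample standing in for a response body.
sample_body = ('{"responseData": {"results": ['
               '{"url": "http://example.com/a"}, '
               '{"url": "http://example.com/b"}]}}')

print(build_search_url("titanic james", 8, 16))
print(extract_urls(json.loads(sample_body)))
```

Note that `urlencode` encodes the space in the keywords as `+`, while `urllib.quote` in the Python 2 script produces `%20`; both are valid in a query string.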

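The documentation gives an example of searching from Python. The script below builds on it, requesting each page of results in its own thread and saving every linked page to disk: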

    # -*- coding: utf-8 -*-
    import urllib2, urllib
    import simplejson
    import os, time, threading
    import common, html_filter   # the author's own helper modules

    # input the keywords
    keywords = raw_input('Enter the keywords: ')

    # define rnum_perpage, pages
    rnum_perpage = 8
    pages = 8

    # thread function: fetch one page of search results, then download each link
    def thread_scratch(url, rnum_perpage, page):
        url_set = []
        try:
            request = urllib2.Request(url, None, {'Referer': 'http://www.sina.com'})
            response = urllib2.urlopen(request)
            # Process the JSON string.
            results = simplejson.load(response)
            info = results['responseData']['results']
        except Exception, e:
            print 'error occurred'
            print e
        else:
            for minfo in info:
                url_set.append(minfo['url'])
                print minfo['url']
        # process the links
        i = 0
        for u in url_set:
            try:
                request_url = urllib2.Request(u, None, {'Referer': 'http://www.sina.com'})
                request_url.add_header('User-agent', 'CSC')
                response_data = urllib2.urlopen(request_url).read()
                # filter the content
                # content_data = html_filter.filter_tags(response_data)
                # write to file; the start offset keeps file numbers unique across threads
                filenum = i + page
                filename = dir_name + '/related_html_' + str(filenum)
                print '  write start: related_html_' + str(filenum)
                f = open(filename, 'w+', -1)
                f.write(response_data)
                # print content_data
                f.close()
                print '  write down: related_html_' + str(filenum)
            except Exception, e:
                print 'error occurred 2'
                print e
            i = i + 1
        return

    # create the output directory
    dir_name = 'related_html_' + urllib.quote(keywords)
    if os.path.exists(dir_name):
        print 'exists  file'
        common.delete_dir_or_file(dir_name)
    os.makedirs(dir_name)

    # fetch the pages
    print 'start to scratch web pages:'
    for x in range(pages):
        print "page:%s" % (x + 1)
        page = x * rnum_perpage
        url = ('https://ajax.googleapis.com/ajax/services/search/web'
               '?v=1.0&q=%s&rsz=%s&start=%s') % (urllib.quote(keywords), rnum_perpage, page)
        print url
        t = threading.Thread(target=thread_scratch, args=(url, rnum_perpage, page))
        t.start()

    # the main thread waits for the worker threads to finish
    main_thread = threading.currentThread()
    for t in threading.enumerate():
        if t is main_thread:
            continue
        t.join()
    print 'Scraping finished'
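The script waits for its workers by joining every thread returned by `threading.enumerate()` except the main one. A common alternative, sketched here in Python 3, is to keep the `Thread` objects in a list and join only those. This is an illustrative variant, not the author's code: `fetch_page` is a hypothetical stand-in that records the start offset instead of downloading anything, so the example runs without network access.

```python
import threading


def fetch_page(start, results, lock):
    """Stand-in for the real download: record which offset was handled."""
    with lock:  # protect the shared list while appending
        results.append(start)


def run_workers(pages=8, results_per_page=8):
    """Start one thread per results page and wait for all of them."""
    results = []
    lock = threading.Lock()
    threads = []
    for x in range(pages):
        start = x * results_per_page  # same offset arithmetic as the script
        t = threading.Thread(target=fetch_page, args=(start, results, lock))
        t.start()
        threads.append(t)
    # Join only the threads started here, not every live thread.
    for t in threads:
        t.join()
    return sorted(results)


offsets = run_workers()
print(offsets)
```

Joining an explicit list avoids waiting on unrelated threads that a library may have started in the same process.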

You are welcome to follow my personal blog: cuishichao.com
