Scraping Baidu Search Results with Web-Harvest

2012-09-04

For a project at work I needed a program that scrapes information for a given keyword, so I built one with Web-Harvest.
The Java program:

import java.io.IOException;

import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;

public class BaiduQ {
    public static void main(String[] args) throws IOException {
        // Load the scraper configuration; the second Scraper argument is the working directory
        ScraperConfiguration config = new ScraperConfiguration("E:/webharvest/baiduQ.xml");
        Scraper scraper = new Scraper(config, "E:/webharvest");
        String sd = "玩具"; // search keyword ("toys")
        scraper.addVariableToContext("baiduURL", "http://www.baidu.com/s?wd=" + sd); // query URL
        scraper.addVariableToContext("fileName", sd); // base name of the result file
        scraper.setDebug(true);
        scraper.execute();
    }
}

The corresponding baiduQ.xml:

<config charset="gbk">

    <!-- Download the result page and convert it to well-formed XML -->
    <var-def name="start" id="startpage">
        <var-def name="baiduURL" overwrite="false"></var-def>
        <var-def name="fileName" overwrite="false"></var-def>
        <html-to-xml>
            <http url="${baiduURL}"/>
        </html-to-xml>
    </var-def>

    <!-- Select every result table on the page -->
    <var-def name="urlList" id="urlList">
        <xpath expression="//table[@class='result']">
            <var name="start"/>
        </xpath>
    </var-def>

    <!-- Write one website entry per result table into the output file -->
    <file action="write" path="${fileName}.xml" charset="utf-8">
        <![CDATA[ <catalog> ]]>
        <loop item="item" index="i">
            <list><var name="urlList"/></list>
            <body>
                <xquery>
                    <xq-param name="item" type="node()"><var name="item"/></xq-param>
                    <xq-expression><![CDATA[
                        declare variable $item as node() external;

                        let $name := data($item//tr/td/a/font)
                        let $url := data($item//tr/td/a[1]/@href)
                        return
                            <website>
                                <name>{normalize-space($name)}</name>
                                <url>{normalize-space($url)}</url>
                            </website>
                    ]]></xq-expression>
                </xquery>
            </body>
        </loop>
        <![CDATA[ </catalog> ]]>
    </file>
</config>
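
To see what the xpath step and the two XQuery paths pick out, the same expressions can be tried with the JDK's own javax.xml.xpath API, outside Web-Harvest. This is only an illustrative sketch: the class ResultTableDemo, the method extract, and the sample markup are all made up for the example, and a real Baidu page is far messier than this fragment.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class ResultTableDemo {

    // Hypothetical, heavily simplified stand-in for the cleaned-up result page.
    static final String SAMPLE =
        "<html><body>"
        + "<table class=\"result\"><tr><td>"
        + "<a href=\"http://example.com/a\"><font>Toy shop A</font></a>"
        + "</td></tr></table>"
        + "<table class=\"other\"><tr><td>not a result</td></tr></table>"
        + "<table class=\"result\"><tr><td>"
        + "<a href=\"http://example.com/b\"><font> Toy shop B </font></a>"
        + "</td></tr></table>"
        + "</body></html>";

    /** Returns one {name, url} pair per table whose class attribute is "result". */
    static List<String[]> extract(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Same expression as the xpath element in baiduQ.xml.
        NodeList tables = (NodeList) xpath.evaluate(
            "//table[@class='result']", doc, XPathConstants.NODESET);

        List<String[]> results = new ArrayList<>();
        for (int i = 0; i < tables.getLength(); i++) {
            Node table = tables.item(i);
            // Relative forms of $item//tr/td/a/font and $item//tr/td/a[1]/@href.
            String name = xpath.evaluate("normalize-space(.//tr/td/a/font)", table);
            String url = xpath.evaluate("normalize-space(.//tr/td/a[1]/@href)", table);
            results.add(new String[] { name, url });
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        for (String[] r : extract(SAMPLE)) {
            System.out.println(r[0] + " -> " + r[1]);
        }
    }
}
```

Only the two tables with class "result" are picked up; the third table is skipped, which is the filtering the configuration relies on.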

When run, the program saves the results to an XML file, which essentially solves the scraping problem.
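
Once a run finishes, the result file can be read back with the JDK's DOM parser. The following is a rough sketch rather than part of the original program: CatalogReader and readCatalog are hypothetical names, and it assumes the file is well-formed XML with the catalog / website / name / url layout written by baiduQ.xml.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class CatalogReader {

    /** Parses a catalog stream and returns {name, url} pairs in document order. */
    static List<String[]> readCatalog(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        NodeList sites = doc.getElementsByTagName("website");
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < sites.getLength(); i++) {
            Element site = (Element) sites.item(i);
            result.add(new String[] { text(site, "name"), text(site, "url") });
        }
        return result;
    }

    // Trimmed text of the first descendant element with the given tag, or "" if missing.
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample data; for a real run, pass a FileInputStream
        // opened on the generated file instead.
        String sample =
            "<catalog>"
            + "<website><name>Toy shop A</name><url>http://example.com/a</url></website>"
            + "<website><name>Toy shop B</name><url>http://example.com/b</url></website>"
            + "</catalog>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        for (String[] site : readCatalog(in)) {
            System.out.println(site[0] + " -> " + site[1]);
        }
    }
}
```

A list of pairs is used instead of a map so that two results sharing the same title do not overwrite each other.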
