
Scraping Koubei.com Store Data

2012-08-07 


Just a personal toy project, for fun.

The technical challenges:

1) Crawling records quickly and efficiently, de-duplicating them, and supporting incremental crawls later on.

2) The contact details on Koubei.com are served as images, so the phone numbers have to be batch-converted with OCR.
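crawler4j tracks visited URLs within a single run, but incremental crawling across runs needs a fingerprint set of pages already seen. A minimal in-memory sketch of that idea follows; the normalization rules are illustrative assumptions, not Koubei-specific, and a real incremental crawler would persist the set (for example in Berkeley DB, which crawler4j already depends on):

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Minimal URL de-duplication sketch: normalize each URL, then keep a
// set of fingerprints so the same page is never fetched twice.
public class UrlDeduper {
    private final Set<String> seen = new HashSet<String>();

    // Illustrative normalization: lower-case, drop any fragment,
    // drop a trailing slash.
    static String normalize(String url) {
        String u = url.toLowerCase(Locale.ROOT);
        int hash = u.indexOf('#');
        if (hash >= 0) {
            u = u.substring(0, hash);
        }
        if (u.endsWith("/")) {
            u = u.substring(0, u.length() - 1);
        }
        return u;
    }

    // Returns true only the first time a (normalized) URL is offered.
    public boolean markNew(String url) {
        return seen.add(normalize(url));
    }
}
```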


This is only an experiment and is not recommended for use in a real project. Partial code follows.


Open-source components used:

crawler4j

AspriseOCR


Dependencies: put all of the crawler4j jars into your application directory.


Berkeley DB Java Edition 4.0.71 or higher
fastutil 5.1.5
DSI Utilities 1.0.10 or higher
Apache HttpClient 4.0.1
Apache Log4j 1.2.15
Apache Commons Logging 1.1.1
Apache Commons Codec 1.4

All of the jars above must be on your project's lib path, or the crawl will fail.


Step 1: crawl the Koubei.com store pages.
Part of the crawling code follows.

Main entry point: Controller.java

package com.aragon.crawdata;

import org.junit.Test;

import edu.uci.ics.crawler4j.crawler.CrawlController;

public class Controller {

    private static String CRAWSITE = "http://beijing.koubei.com/";

    @Test
    public void test() {
        CrawlController controller;
        try {
            controller = new CrawlController("/data/crawl/root");
            controller.addSeed(CRAWSITE);
            controller.start(MyCrawler.class, 3); // 3 concurrent crawler threads
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The crawler itself: MyCrawler.java

package com.aragon.crawdata;

import java.util.ArrayList;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private static String CRAWSITE = "http://beijing.koubei.com/store/";

    Pattern filters = Pattern.compile(".(detail--storeId-*)$");

    SaveInDb savein = new SaveInDb();

    public MyCrawler() {
    }

    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        if (filters.matcher(href).matches()) {
            return false;
        }
        return href.startsWith(CRAWSITE);
    }

    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        System.out.println("Url:" + url);
        Template4Koubei koubei = new Template4Koubei();
        try {
            CrawDataModel model = koubei.translationData(page.getHTML());
            if (model != null) {
                // Save to the database. Note: concatenating values into the SQL
                // string like this is vulnerable to SQL injection and to quoting
                // bugs; a PreparedStatement would be safer.
                savein.update("insert into info(type,name,address,tel1,tel2,tel3,otherinfo,othertype,ownername) values ('"
                        + model.getType() + "','" + model.getName() + "','"
                        + model.getAddress() + "','" + model.getTel1() + "','"
                        + model.getTel2() + "','" + model.getTel3() + "','"
                        + model.getOtherInfo() + "','" + model.getOtherType()
                        + "','" + model.getOwnerName() + "')");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        String text = page.getText();
        ArrayList<WebURL> links = page.getURLs();
    }
}
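The regex used by shouldVisit is easy to get wrong: `-*` matches zero or more literal hyphens, not arbitrary trailing characters, so the two spellings behave very differently. A pattern can be checked in isolation before a crawl; the sample URL below is invented for illustration:

```java
import java.util.regex.Pattern;

// Check a URL filter pattern against a sample URL before crawling.
public class FilterCheck {
    public static void main(String[] args) {
        // As written above: "-*" matches only a run of hyphens at the end.
        Pattern asWritten = Pattern.compile(".(detail--storeId-*)$");
        // With ".*": any characters may follow "detail--storeId-".
        Pattern withDotStar = Pattern.compile(".*detail--storeId-.*$");

        String url = "http://beijing.koubei.com/store/detail--storeId-12345";
        System.out.println(asWritten.matcher(url).matches());   // false
        System.out.println(withDotStar.matcher(url).matches()); // true
    }
}
```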

The class that actually parses a Koubei store page: Template4Koubei.java

package com.aragon.crawdata;

import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang.StringUtils;
import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.Div;
import org.htmlparser.tags.ImageTag;
import org.htmlparser.tags.Span;
import org.htmlparser.tags.TableColumn;
import org.htmlparser.util.NodeList;

public class Template4Koubei {

    private static String NAME_TAG = "title-wrap yk-fix-float"; // store name
    private static String TYPE_TAG = "detail-info-item";        // category
    private static String ADDRESS_TAG = "info yk-fix-float";    // address
    private static String TEL_TAG = "detail-info-item";         // phone (served as image)
    private static String MOB_TAG = "detail-info-item";
    private static String DETAIL_TAG = "detail-intro";
    private static String OTHERINFO_TAG = "detail-info-item";

    public CrawDataModel translationData(String webHtml) throws Exception {
        CrawDataModel model = new CrawDataModel();
        Object[] name = findCrawDataByCol(webHtml, NAME_TAG);
        if (name != null && name[0] != null) {
            model.setName(StringUtils.deleteWhitespace(name[0].toString()));
            Object[] type = findCrawDataByCol(webHtml, TYPE_TAG);
            if (type != null) {
                model.setType(StringUtils.deleteWhitespace(type[0].toString()));
            }
            Object[] tel = findCrawDataByImageFilter(webHtml, TEL_TAG);
            if (tel != null) {
                for (int i = 0; i < tel.length; i++) {
                    if (i == 0) {
                        model.setTel1(StringUtils.deleteWhitespace(tel[0].toString()));
                    }
                    if (i == 1) {
                        model.setTel2(StringUtils.deleteWhitespace(tel[1].toString()));
                    }
                    if (i == 2) {
                        model.setTel3(StringUtils.deleteWhitespace(tel[2].toString()));
                    }
                }
            }
            Object[] otherType = findCrawDataByCol(webHtml, OTHERINFO_TAG);
            String otherTypeString = "";
            if (otherType != null) {
                for (int i = 0; i < otherType.length; i++) {
                    if (i < 3) {
                        otherTypeString += StringUtils.deleteWhitespace(otherType[i].toString());
                    }
                }
                model.setOtherType(otherTypeString);
            }
            Object[] address = findCrawDataByCol(webHtml, ADDRESS_TAG);
            if (address != null) {
                model.setAddress(StringUtils.deleteWhitespace(address[0].toString()));
            }
            Object[] detail = findCrawDataByCol(webHtml, DETAIL_TAG);
            if (detail != null) {
                model.setOtherInfo(StringUtils.deleteWhitespace(detail[0].toString()));
            }
        } else {
            return null;
        }
        // The address block sometimes contains the "电话" (phone) label;
        // split that part off into otherType.
        if (model.getAddress() != null) {
            if (model.getAddress().indexOf("电话") > 0) {
                model.setOtherType(model.getAddress().substring(model.getAddress().indexOf("电话")));
                model.setAddress(model.getAddress().substring(0, model.getAddress().indexOf("电话")));
            }
        }
        return model;
    }

    public Object[] findCrawDataByCol(String webHtml, String Tag) throws Exception {
        List<String> resultTextList = new ArrayList<String>();
        Parser parser = new Parser();
        parser.setInputHTML(webHtml);
        // HasAttributeFilter matches nodes by attribute value; htmlparser
        // ships several other NodeFilter implementations (see its API docs).
        NodeFilter filter = new HasAttributeFilter("class", Tag);
        // extractAllNodesThatAre(Class) is deprecated as of 1.6;
        // extractAllNodesThatMatch is more flexible and works with custom tags.
        NodeList nodeList = parser.extractAllNodesThatMatch(filter);
        if (nodeList == null || nodeList.size() == 0) {
            return null;
        }
        Node[] nodes = nodeList.toNodeArray();
        String line = "";
        for (int i = 0; i < nodes.length; i++) {
            // Cast each matching node to its concrete tag class and take its text.
            Node node = nodes[i];
            if (node instanceof Span) {
                line = ((Span) node).toPlainTextString();
            } else if (node instanceof TableColumn) {
                line = ((TableColumn) node).toPlainTextString();
            } else if (node instanceof Div) {
                line = ((Div) node).toPlainTextString();
            }
            if (!StringUtil.isTrimEmpty(line)) {
                resultTextList.add(line);
            }
        }
        return resultTextList.toArray();
    }

    public Object[] findCrawDataByImageFilter(String webHtml, String Tag) throws Exception {
        List<String> resultTextList = new ArrayList<String>();
        Parser parser = new Parser();
        parser.setInputHTML(webHtml);
        // Phone numbers are rendered as images, so match ImageTag nodes instead.
        NodeFilter filter = new NodeClassFilter(ImageTag.class);
        NodeList nodeList = parser.extractAllNodesThatMatch(filter);
        if (nodeList == null || nodeList.size() == 0) {
            return null;
        }
        Node[] nodes = nodeList.toNodeArray();
        String line = "";
        for (int i = 0; i < nodes.length; i++) {
            Node node = nodes[i];
            // Keep only images whose parent element carries the target class.
            if (node.getParent().getText().indexOf(Tag) > 0) {
                line = ((ImageTag) node).getImageURL();
            }
            if (!StringUtil.isTrimEmpty(line)) {
                resultTextList.add(line);
            }
        }
        return resultTextList.toArray();
    }
}
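The heavy lifting above is done by htmlparser's HasAttributeFilter. Purely to make the idea concrete, here is a toy, library-free sketch of "find the text inside the element whose class attribute equals X". The markup is invented, and regex-on-HTML is fragile; real pages should go through an HTML parser as Template4Koubei does:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy stand-in for an attribute filter: extract the text content of the
// first element whose class attribute exactly equals the given value.
// Breaks on nested or messy markup; for illustration only.
public class ClassTextExtractor {
    public static String textOfClass(String html, String cls) {
        Pattern p = Pattern.compile(
                "<\\w+[^>]*class=\"" + Pattern.quote(cls) + "\"[^>]*>([^<]*)<",
                Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<div class=\"title-wrap yk-fix-float\"> Some Store </div>";
        System.out.println(textOfClass(html, "title-wrap yk-fix-float")); // Some Store
    }
}
```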

The model object: CrawDataModel.java

package com.aragon.crawdata;

import java.io.Serializable;

public class CrawDataModel implements Serializable {

    private Integer id;
    private String type;
    private String name;
    private String address;
    private String tel1;
    private String tel2;
    private String tel3;
    private String ownerName;
    private String otherInfo;
    private String otherType;

    public Integer getId() { return id; }
    public void setId(Integer id) { this.id = id; }
    public String getType() { return type; }
    public void setType(String type) { this.type = type; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getAddress() { return address; }
    public void setAddress(String address) { this.address = address; }
    public String getTel1() { return tel1; }
    public void setTel1(String tel1) { this.tel1 = tel1; }
    public String getTel2() { return tel2; }
    public void setTel2(String tel2) { this.tel2 = tel2; }
    public String getTel3() { return tel3; }
    public void setTel3(String tel3) { this.tel3 = tel3; }
    public String getOwnerName() { return ownerName; }
    public void setOwnerName(String ownerName) { this.ownerName = ownerName; }
    public String getOtherInfo() { return otherInfo; }
    public void setOtherInfo(String otherInfo) { this.otherInfo = otherInfo; }
    public String getOtherType() { return otherType; }
    public void setOtherType(String otherType) { this.otherType = otherType; }
}

With the steps above, Koubei's store listings end up in your database. On a poor 2 Mbit broadband line I crawled more than 2,000 records in an hour (admittedly without any de-duplication).


Step 2: convert the phone and mobile images into actual digit strings.


The phone and mobile contact details scraped from Koubei.com are images. To convert them into digits we use AspriseOCR; there are plenty of write-ups online about getting it working, which a quick Google search will turn up.
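One thing worth checking before reaching for OCR at all: in the showphone URL used by the test below, the `phone` query parameter looks like URL-encoded Base64, and decoding it yields a digit string directly. The trailing `%23...` token in the real URL suggests the site may add extra obfuscation on top, so treat this as an observation to experiment with, not a guaranteed shortcut. The sketch uses `java.util.Base64` (Java 8+); on older JDKs, commons-codec's Base64 (already a crawler4j dependency) does the same job:

```java
import java.net.URLDecoder;
import java.util.Base64;

// Decode the Base64-looking "phone" query parameter from a showphone URL.
// This may make OCR unnecessary for some pages, but the real URLs carry an
// extra "#..." token that could mean additional obfuscation; experiment first.
public class PhoneParamDecode {
    public static String decodePhoneParam(String encoded) throws Exception {
        String unescaped = URLDecoder.decode(encoded, "UTF-8"); // %3D%3D -> ==
        byte[] raw = Base64.getDecoder().decode(unescaped);
        return new String(raw, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(decodePhoneParam("MTI5NDU5ODg0Mw%3D%3D")); // 1294598843
    }
}
```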

Below is its test class. My application code is too long to include here, so I am pasting just the test; technically it covers the whole flow.

package com.aragon.crawdata;

import java.io.File;

/**
 * Given the URL of a phone-number image, download it, run OCR on it,
 * and return the recognized content as a string.
 */
public class RecognisePicture {

    public static void main(String[] args) {
        recognise("http://count.koubei.com/showphone/showphone.php?f=jpg&w=96&h=10&bc=255,255,255&fc=0,0,0&fs=10&fn=arial&phone=MTI5NDU5ODg0Mw%3D%3D%23OQ532L8m6okoAzY6");
    }

    /**
     * Recognize the image at the given URL.
     */
    public static String recognise(String fromUrl) {
        String result = "";
        // Download the image file to local disk.
        File file = DownLoadWithUrl.downLoadFile(fromUrl);
        if (file != null) {
            // Run OCR over the downloaded file.
            result = ParseJPEG_withOCR.getRecogniseStr(file);
            // Delete the temporary local file.
            DownLoadWithUrl.delFile(file);
        }
        System.out.println("Recognized phone number: " + result);
        return result;
    }
}
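DownLoadWithUrl and ParseJPEG_withOCR are helper classes from my application that are not shown. The download half needs nothing beyond the JDK; the sketch below mirrors the helper's method names but is a re-implementation, not the original:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

// Minimal stand-in for the DownLoadWithUrl helper used above:
// stream the bytes at a URL into a temp file and return the File.
public class SimpleDownloader {
    public static File downLoadFile(String fromUrl) {
        try {
            File tmp = File.createTempFile("phone-img-", ".jpg");
            InputStream in = new URL(fromUrl).openStream();
            OutputStream out = new FileOutputStream(tmp);
            try {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            } finally {
                in.close();
                out.close();
            }
            return tmp;
        } catch (Exception e) {
            e.printStackTrace();
            return null;
        }
    }

    public static void delFile(File f) {
        if (f != null && f.exists()) {
            f.delete();
        }
    }
}
```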

The test output:

Recognized phone number: 01O 51402688

(Note the letter "O" where a zero should be: the OCR output still needs some cleanup.)

And that should save the manual data-entry folks quite a bit of work. Enjoy it!


