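Could someone help me write a single-page PHP scraper, with a working example?
For example, I want to scrape this page: http://news.163.com/12/0613/20/83TJ7PA700014JB6.html
Requirements:
Scrape the title
Scrape the body text
Thanks!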
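[Solution]
First, go to http://simplehtmldom.sourceforge.net/index.htm (click "Download latest version from Sourceforge") and download simple_html_dom.php. It is a dead-simple substitute for hand-written regular expressions, and the official site has a detailed tutorial that is easy to follow.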
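<?php
// The page is served as GB2312, so send the output in the same charset.
header("Content-type: text/html; charset=gb2312");
require dirname(__FILE__) . '/simple_html_dom.php';

// Fetch the page with cURL, sending a browser-like User-Agent.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://news.163.com/12/0613/20/83TJ7PA700014JB6.html');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5');
$htmls = curl_exec($ch);
if ($htmls === false) {
    die('cURL error: ' . curl_error($ch));
}
curl_close($ch);

// Parse the HTML and select the nodes by id.
$html = str_get_html($htmls);
if (!$html) {
    die('Could not parse the page');
}
foreach ($html->find('#h1title') as $title) {
    echo $title->plaintext . '<br />'; // title
}
foreach ($html->find('#endText') as $content) {
    echo $content->plaintext; // body text
}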
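If downloading a third-party file is not an option, the same id lookups can probably be done with the DOM extension that ships with PHP. The following is only a sketch under a few assumptions: textById() is a hypothetical helper name, the ids h1title and endText are taken from the snippet above, and the id passed in is assumed not to contain double quotes. DOMDocument generally picks up the charset from the page's meta tag and returns UTF-8 text, so the output header may need to change accordingly.

```php
<?php
// Return the trimmed text of the first element with the given id, or null if none exists.
// Hypothetical helper for illustration; not part of simple_html_dom or the original answer.
function textById(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid markup; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // textContent concatenates all descendant text, i.e. the element with its tags removed.
    return trim($nodes->item(0)->textContent);
}
```

With the downloaded page in $htmls, the title would be textById($htmls, 'h1title') and the body text textById($htmls, 'endText').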
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    if (isset($array['header']) && $array['header']) {
        curl_setopt($ch, CURLOPT_HEADER, 1);
    }
    if (isset($array['httpheader'])) {
        curl_setopt($ch, CURLOPT_HTTPHEADER, $array['httpheader']);
    }
    if (isset($array['referer'])) {
        curl_setopt($ch, CURLOPT_REFERER, $array['referer']);
    }
    if (isset($array['post'])) {
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $array['post']);
    }
    if (isset($array['cookie'])) {
        curl_setopt($ch, CURLOPT_COOKIE, $array['cookie']);
    }
    // Run the request first: curl_error() and curl_errno() only report on a transfer that has already happened.
    $r['html'] = curl_exec($ch);
    $r['erro'] = curl_error($ch);
    $r['errno'] = curl_errno($ch);
    $r['http_code'] = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $r;
}
/**
 * Fetch the CAPTCHA image and cookie
 * @param Null
 * @return array('img'=>String, 'cookie'=>String)
 */
function getVFCode ()
{
    $vfcode = array(
        'header' => true,
        'cookie' => false,
        'url'=>'http://ptlogin2.qq.com/getimage?aid='.$_GET['aid'].'&'.@$_GET['t'],
    );
    $r = $this->curlFunc($vfcode);
Source: http://www.phpnewer.com/index.php/Tszj/detail/id/436.html
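[Solution]
Just fetch the page. The title is whatever sits between the title tags, and the body text is whatever sits between the body tags; then use regular expressions to strip out the parts you don't need.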
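A minimal sketch of that regex approach might look like the following. It assumes the page has already been fetched into a string (for example with cURL), and extractTitleAndBody() is a hypothetical helper name, not something from the thread. Regular expressions are fragile against irregular markup, so treat this as a starting point rather than a robust parser.

```php
<?php
// Pull the <title> text and the tag-stripped <body> text out of an HTML string.
// Hypothetical helper for illustration only.
function extractTitleAndBody(string $html): array
{
    $title = '';
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        $title = trim(strip_tags($m[1]));
    }

    $body = '';
    if (preg_match('/<body[^>]*>(.*?)<\/body>/is', $html, $m)) {
        // Remove script and style blocks first so their contents do not end up in the text.
        $body = preg_replace('/<(script|style)\b[^>]*>.*?<\/\1>/is', '', $m[1]);
        $body = strip_tags($body);
        // Collapse runs of ASCII whitespace; byte-safe for GB2312 as well as UTF-8 input.
        $body = trim(preg_replace('/[ \t\r\n]+/', ' ', $body));
    }

    return array('title' => $title, 'body' => $body);
}
```

On a real news page the body will still include navigation, ads, and footer text, so narrowing the match to the article container is usually the next step.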