A question about file_get_contents

2012-12-22

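I'm writing a scraper for the 19lou forum and can't get the page content. Can anyone take a look? I started with file_get_contents, but what it returns is just a chunk of garbled text. What does that mean?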


$url="http://www.19lou.com/forum-1366-thread-9501352334616489-1-1.html";

$str=file_get_contents($url);

echo $str;

exit;

Scraping other forums works fine, though. Has 19lou put anti-scraping measures in place?


$url="http://www.caomeihui.com/thread-252-1-1.html";

$str=file_get_contents($url);

echo $str;

exit;

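[Best answer]
I get the same result as you. I'm really out of ideas.
[Other answer]
Still looking for an answer.
[Other answer]
Why not just use a ready-made scraping tool?
[Other answer]
It may be caused by an encoding mismatch.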
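
If an encoding mismatch really were the cause, the usual remedy is to convert the fetched bytes before printing them. A minimal sketch, assuming the remote page is GBK (an assumption, check the Content-Type header or the page's meta tag) and that the mbstring extension is loaded. The helper name to_utf8 is made up for illustration.

```php
<?php
// Hypothetical helper: converts $text to UTF-8 unless it already is valid UTF-8.
// The default source encoding 'GBK' is an assumption about the remote page.
function to_utf8(string $text, string $sourceEncoding = 'GBK'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave it untouched
    }
    return mb_convert_encoding($text, 'UTF-8', $sourceEncoding);
}
```

It would be applied right after the fetch, e.g. echo to_utf8(file_get_contents($url)). If the output is still unreadable after that, the problem is probably not the encoding.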

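[Other answer]

Quote:

I'm writing a scraper for the 19lou forum and can't get the page content. Can anyone take a look? I started with file_get_contents, but what it returns is just a chunk of garbled text. What does that mean?
PHP code: $url="http://www.19lou.com/forum-1366-thread-9501352334616489-1-1.html"; $str=fi……

You can actually still fetch something? When I run your code I get bounced straight to the 114 navigation error page. What a pain.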
[Other answer]
Can you really call this garbled text?

<script language="javascript">

var url=document.location.href;
var s = url.indexOf("?");

var redirectUrl='';
var domain=document.domain.substr(document.domain.indexOf(".")+1);

if(s>0)
{
redirectUrl=url.substr(s+1);
SetCookie("_Z3nY0d4C_","37XgPK9h",365,"/",domain);
document.location.href=redirectUrl;

}
else
{

document.location.href="http://www."+domain;
}

function SetCookie (name, value) {
var expdate = new Date();
var argv = SetCookie.arguments;
var argc = SetCookie.arguments.length;
var expires = (argc > 2) ? argv[2] : null;
var path = (argc > 3) ? argv[3] : null;
var domain = (argc > 4) ? argv[4] : null;
var secure = (argc > 5) ? argv[5] : false;

if(expires!=null && expires>=0) expdate.setTime(expdate.getTime() + ( expires * 24*60*60*1000 ));

document.cookie = name + "=" + escape (value) +((expires == null || expires < 0) ? ((expires==-1)?"; expires=-1":"") : ("; expires="+ expdate.toGMTString()))
+((path == null) ? "" : ("; path=" + path)) +((domain == null) ? "" : ("; domain=" + domain))
+((secure == true) ? "; secure" : "");
}

</script>
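
To make it easier to see what that script does, here is a rough PHP port of its branching logic. It is only a sketch for reading along, not something the site provides: the function name emulate_safe_redirect and the shape of the returned array are made up for illustration. Given the address the page was loaded from, it reports which cookie the script would set and where the browser would be sent next.

```php
<?php
// Hypothetical helper: mirrors the branching of the JavaScript above,
// purely to illustrate what a browser would do with this page.
function emulate_safe_redirect(string $pageUrl): array
{
    $host = (string) parse_url($pageUrl, PHP_URL_HOST);

    // JS: document.domain.substr(document.domain.indexOf(".") + 1)
    // indexOf() returns -1 when there is no dot, so substr(0) keeps the whole host.
    $dot = strpos($host, '.');
    $domain = ($dot === false) ? $host : substr($host, $dot + 1);

    // JS: var s = url.indexOf("?"); if (s > 0) { ... }
    $q = strpos($pageUrl, '?');
    if ($q !== false && $q > 0) {
        return [
            // Cookie name and value are copied from the script itself
            'cookie'   => ['name' => '_Z3nY0d4C_', 'value' => '37XgPK9h', 'domain' => $domain],
            // Everything after the "?" is treated as the address to return to
            'redirect' => substr($pageUrl, $q + 1),
        ];
    }

    // No query string: the script just sends the browser to http://www.<domain>
    return ['cookie' => null, 'redirect' => 'http://www.' . $domain];
}
```

Two things follow from this reading. When the script is echoed from a local test page with no "?" in its address, the else branch fires, and a host without a dot such as localhost ends up at http://www.localhost. When it is loaded as safeRedirect.htm?<original address>, it sets the cookie and goes back to the original address, which suggests the server is mainly checking for that cookie.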

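[Other answer]
To the poster on the 2nd floor: this is exactly the technique 19lou uses, and it's what has me so confused.
[Other answer]

Quote:

Can you really call this garbled text?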

<script language="javascript">

var url=document.location.href;
var s = url.indexOf("?");

var redirectUrl='';
var domain=document.domain.substr(document.domain.indexOf(".")+1);
……

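Moderator, how do I actually make use of this? I'm a bit slow, sorry.
[Other answer]
Bumping my own thread. Could everyone lend a hand?
[Other answer]
Could it be a restriction done in JS?
[Other answer]
I can confirm that I can't scrape it: localhost simply turns into http://www.localhost/. They have anti-scraping protection in place. Their side can tell that the request is coming from another server.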
[Other answer]

$url = "http://www.19lou.com/forum-1366-thread-9501352334616489-1-1.html";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, true);
// Custom request headers go in CURLOPT_HTTPHEADER; CURLOPT_HEADER is only a boolean switch.
//curl_setopt($ch, CURLOPT_HTTPHEADER, array("Connection:keep-alive","Content-Encoding:gzip","Content-Language:zh-CN","Content-Type:text/html;charset=GBK"));
curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4');
$html = curl_exec($ch);
curl_close($ch);
echo $html;

I tried cURL with CURLOPT_HEADER set to true, and it printed:

HTTP/1.1 302 Moved Temporarily
Server: nginx
Date: Fri, 09 Nov 2012 16:40:25 GMT
Content-Type: text/html
Content-Length: 154
Connection: keep-alive
Location: http://www.19lou.com/safeRedirect.htm?http://www.19lou.com/forum-1366-thread-9501352334616489-1-1.html
Set-Cookie: BIGipServerforum_web_pool=151060746.20480.0000; path=/
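
The interesting parts of that response are the 302 status and the Location header. To pick them out in code instead of reading them by eye, something like the following might do. It is a sketch that assumes the raw header block arrives with one header per line, as curl_exec() returns it when CURLOPT_HEADER is true; the function name parse_response_head is made up for illustration.

```php
<?php
// Hypothetical helper: pulls the status code and header fields out of a raw
// header block, as returned by curl_exec() when CURLOPT_HEADER is true.
function parse_response_head(string $rawHead): array
{
    $lines = preg_split('/\r\n|\n|\r/', trim($rawHead));
    $result = ['status' => null, 'headers' => []];

    // Status line, e.g. "HTTP/1.1 302 Moved Temporarily"
    if (preg_match('#^HTTP/\S+\s+(\d{3})#', (string) array_shift($lines), $m)) {
        $result['status'] = (int) $m[1];
    }

    foreach ($lines as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip anything that is not "Name: value"
        }
        [$name, $value] = explode(':', $line, 2);
        // Header names are case-insensitive, so store them in lowercase
        $result['headers'][strtolower(trim($name))][] = trim($value);
    }
    return $result;
}
```

With the response above, the status is 302 and the location entry points at safeRedirect.htm followed by "?" and the original thread address, which is the page carrying the JavaScript quoted earlier in the thread.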

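To fetch the page correctly, you probably need to imitate a browser and pass their check by sending extra request headers. Those are set with curl_setopt($ch, CURLOPT_HTTPHEADER, array(...)); CURLOPT_HEADER itself only takes true or false.
I'm not very familiar with these header options, and after fiddling with them for a while I still couldn't get it working. I hope this at least serves as a reference and helps you solve the scraping problem.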
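
One thing that might be worth trying, going by the JavaScript quoted earlier in the thread: that page sets a cookie named _Z3nY0d4C_ with the value 37XgPK9h and then returns to the original address, so sending the same cookie with the very first request may be enough to skip the redirect. This is untested guesswork. The cookie value may change over time, and the site may check other things as well. The helper names build_cookie_string and fetch_with_cookies are made up for illustration.

```php
<?php
// Hypothetical helper: joins name/value pairs into a Cookie header value.
function build_cookie_string(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}

// Hypothetical helper: fetches $url while presenting the given cookies.
// Returns [HTTP status code, response body or false on failure].
function fetch_with_cookies(string $url, array $cookies): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_ENCODING       => '', // accept and decode any compression curl supports
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4',
        CURLOPT_COOKIE         => build_cookie_string($cookies),
    ]);
    $body   = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return [$status, $body];
}

// Usage (needs network access and the curl extension, so it is left commented out):
// [$status, $html] = fetch_with_cookies(
//     'http://www.19lou.com/forum-1366-thread-9501352334616489-1-1.html',
//     ['_Z3nY0d4C_' => '37XgPK9h']
// );
```

If the status still comes back as 302, the cookie alone is not what the server is looking for, and the next step would be to compare the full set of headers a real browser sends.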
