
100 points: need a regex for scraping web page data (solution)

2013-11-29 


<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title></title>
</head>
<body>
    <form id="form1" runat="server">
    <div>
        <section class="main-sort" id="order">
            <div class="type-sort fn-left">
                <span>排序:</span><a href="javascript:;" data-seed="sort-tj" class="sort-tj cur"><span class="sort-tjIcon">推荐</span></a>
                <a href="javascript:;" data-seed="sort-price" class="sort-price" data-price="0"><span class="sort-priceIcon">价格</span></a>
                <input type="checkbox" class="J_sort J_sortCheck" data-seed="sort-yh" data-item="sale" data-val=""><label>优惠促销</label>
                <input type="checkbox" class="J_sort J_sortCheck" data-seed="sort-bzz" data-item="halftrip" data-val=""><label>半自助</label>
            </div>
            <div class="common-page fn-right">
                <span class="j-pageCurrent">1</span>/<span class="j-pageAll">1</span>
                <a href="javascript:;" class="page-prev no-page">上一页</a>
                <a href="javascript:;" class="page-next no-page">下一页</a>
            </div>
        </section>
        <section class="listbox">
            <div class="ListLoding" style="display: none;">数据正在加载。。。</div>
<ul class="listul J_pagelist">
                
<li data-info="[12]" data-days="6天" data-to="尊爵贵族线" data-halftrip="0" data-tag="尊爵贵族线" data-price="6466" data-chunk="line-SHS67909" class="lineitem cfix" style="display: list-item;">
                    <div class="img fn-left">
<a title="【纯净游】轻松台湾纯玩西线6日游" target="_blank" href="http://sh.uzai.com/tour-67909.html"><img width="125px" height="67px" alt="100分 求个正则 抓取网页数据解决办法" data-img="http://r.uzaicdn.com/pic/15434/m/w160/h120/t1" src="http://r.uzaicdn.com/pic/15434/m/w160/h120/t1"></a>
                        <div class="prd-num">产品编号:SHS67909</div>
</div>
<dl class="info fn-left">
<dt class="t">
                            <a title="【纯净游】轻松台湾纯玩西线6日游" target="_blank" href="http://sh.uzai.com/tour-67909.html">【纯净游】轻松台湾纯玩西线6日游</a>

                        </dt>
<dd class="desc">台北连住二晚市区5花酒店、尽享台北市区半天的自由活动,随心所欲安排您的行程</dd>
<dd class="moredesc">
                            
                                    <span>满意度:<span class="n">96%</span></span>
                                
                            <span class="pin"><span class="n">6</span>人点评</span>
                            
                                <span>最近出发班期:<span class="n">12/22</span></span>
                                


                            <a class="date" onclick="mychecks(67909,this,'http://sh.uzai.com/tour-67909.html')" href="javascript:;">全部班期</a>
                        </dd>
</dl>
<div class="detail fn-right">
                        
                        <p class="price"><span class="u">¥</span><span class="n">6466</span>起</p>
                        <span data-di="50" rel="J_popDisong" class="d J_powerFloat"><em class="dsnum">50</em></span>
                        <span data-song="200" rel="J_popDisong" class="s m-5 J_powerFloat"><em class="dsnum">200</em></span>
                    </div>
</li>
                
<li data-info="[12][1]" data-days="8天" data-to="尊爵贵族线" data-halftrip="0" data-tag="尊爵贵族线" data-price="7366" data-chunk="line-SHS67808" class="lineitem cfix" style="display: list-item;">
                    <div class="img fn-left">
<a title="【发现美味】台湾纯玩环岛8日(国航往返)" target="_blank" href="http://sh.uzai.com/tour-67808.html"><img width="125px" height="67px" alt="100分 求个正则 抓取网页数据解决办法" data-img="http://r.uzaicdn.com/pic/15451/m/w160/h120/t1" src="http://r.uzaicdn.com/pic/15451/m/w160/h120/t1"></a>
                        <div class="prd-num">产品编号:SHS67808</div>
</div>
<dl class="info fn-left">
<dt class="t">
                            <a title="【发现美味】台湾纯玩环岛8日(国航往返)" target="_blank" href="http://sh.uzai.com/tour-67808.html">【发现美味】台湾纯玩环岛8日(国航往返)</a>

                        </dt>
<dd class="desc">尝特色美食、台北连住二晚市区酒店、台南一晚升级五星香格里拉大饭店</dd>
<dd class="moredesc">
                            
                                    <span>满意度:<span class="n">96%</span></span>
                                
                            <span class="pin"><span class="n">6</span>人点评</span>
                            
                                <span>最近出发班期:<span class="n">12/29、1/5</span></span>
                                
                            <a class="date" onclick="mychecks(67808,this,'http://sh.uzai.com/tour-67808.html')" href="javascript:;">全部班期</a>
                        </dd>
</dl>
<div class="detail fn-right">
                        
                        <p class="price"><span class="u">¥</span><span class="n">7366</span>起</p>
                        <span data-di="50" rel="J_popDisong" class="d J_powerFloat"><em class="dsnum">50</em></span>


                        <span data-song="200" rel="J_popDisong" class="s m-5 J_powerFloat"><em class="dsnum">200</em></span>
                    </div>
</li>
                
</ul>
            <div class="noshuju" style="display: none;">
                <span>对不起,没有找到符合条件的产品!</span><a href="javascript:;">重新筛选</a>
            </div>
</section>
    </div>
    </form>
</body>
</html>



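I want to count the <li> elements under the <ul class="listul J_pagelist"> element,
so that I know how many times a tour route appears on the page.
I need a regex that matches this data and gives me the total number of <li> elements.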
[Solution]
HtmlAgilityPack

It makes working with HTML as simple as working with XML: you query the HTML content with XPath.
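
A minimal sketch of this suggestion, assuming the HtmlAgilityPack NuGet package is referenced and that the `class` attribute is exactly `listul J_pagelist`, as in the posted page. The `LineCounter` class and `CountLineItems` method are hypothetical names used only for illustration:

```csharp
using HtmlAgilityPack;

public static class LineCounter
{
    // Hypothetical helper: counts the <li> children of the target <ul>.
    public static int CountLineItems(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Direct <li> children of the <ul> whose class attribute matches exactly.
        HtmlNodeCollection items =
            doc.DocumentNode.SelectNodes("//ul[@class='listul J_pagelist']/li");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        return items == null ? 0 : items.Count;
    }
}
```

Because the XPath is anchored on the `<ul>`, `<li>` elements elsewhere on the page are not counted, which is the case the original poster was worried about.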
[Solution]
Quote:

Regex.Matches(s, "<li data-info=\"\\[").Count

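Hi, this does work, but what I need is the number of <li> elements inside <ul class="listul J_pagelist"> </ul>.
The page contains many HTML tags, and if other nodes use the same markup, this does not give the right count.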


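        // Requires: using System; using System.IO; using System.Text; using System.Text.RegularExpressions;
        void Test3()
        {
            // Read the saved page as GB2312 text.
            string s = File.ReadAllText(@"E:\test\网页抓取测试\3.txt", Encoding.GetEncoding("gb2312"));
            // Count every <li> whose opening tag starts with data-info="[
            int result = Regex.Matches(s, "<li data-info=\"\\[").Count;
            Console.WriteLine(result);
        }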

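There is always some part that differs; use that part to tell the nodes apart.
And if, as you say, the page contained several <ul class="listul J_pagelist"> elements, you would not know which ul's li elements to take either.

<ul class="listul J_pagelist"> appears exactly once in the whole page, so it is a unique identifier. How should the regex be written if it includes this?

If you really must include the ul, you have to split it into two statements: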

 string html = Regex.Match(s, "(?is)<ul class=\"listul J_pagelist\">.*?</ul>").Value;
Console.WriteLine(Regex.Matches(html, "<li data-info=\"\\[").Count);
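
The two statements above can be wrapped into a small self-contained helper, sketched below. The `RegexLineCounter` class and `CountLineItems` method are hypothetical names, and the sketch assumes the target `<ul>` contains no nested `<ul>`, since the lazy match stops at the first `</ul>`:

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexLineCounter
{
    // Hypothetical helper: isolate the target <ul>, then count its <li> items.
    public static int CountLineItems(string page)
    {
        // (?is): case-insensitive, and '.' also matches newlines.
        // The lazy .*? stops at the first </ul>, so nested <ul> elements are not handled.
        string list = Regex.Match(page,
            "(?is)<ul class=\"listul J_pagelist\">.*?</ul>").Value;

        // If the <ul> is not found, Value is an empty string and the count is 0.
        return Regex.Matches(list, "<li data-info=\"\\[").Count;
    }
}
```

Calling `RegexLineCounter.CountLineItems(s)` with the page text read in `Test3` would then ignore any `<li data-info="[...">` that sits outside the target list.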
