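Filtering URLs in Nutch 1.6
Environment: Windows XP + Nutch 1.6 source + Eclipse
Question:
1: URL filtering rules, the order in which they are applied, and the configuration files involved
2: The configuration file conf/regex-urlfilter.txt contains the following: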
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
# accept URLs containing certain characters as probable queries, etc.
#+[?=&!#]
#-[~]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/
#-.*(/.+?)/.*?\1/.*?\1/
+^http://list.taobao.com/[\s\S]*#![\s\S]*
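Problem:
Pages of the form http://list.taobao.com/xxxxxx#!http://xxxxxx are not being crawled.
Why is that?
Is the rule itself wrong, or is it configured in the wrong place?
Any advice from the experts would be appreciated.
Thanks!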
[Solution]
Nutch questions are better posted on the search board; there are more people there who know this area well.
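One possible cause, which this thread does not confirm, is that the part after # (the URL fragment) is dropped before the regex filter ever sees the URL. As far as I know, the basic URL normalizer in Nutch 1.x removes the fragment, so a rule that requires #! would then never match. The sketch below is plain Java, not Nutch code: the class name FragmentFilterCheck and the helper withoutFragment are made up for illustration, and it only assumes that the filter applies the rule with a "find"-style match. It checks the accept rule from regex-urlfilter.txt against the URL with and without its fragment.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

// Hypothetical stand-alone check; none of these names come from Nutch.
class FragmentFilterCheck {
    // The accept rule from regex-urlfilter.txt, without the leading '+'.
    static final Pattern RULE =
            Pattern.compile("^http://list.taobao.com/[\\s\\S]*#![\\s\\S]*");

    // Assumption: the rule is applied as a search (find), not a full match.
    static boolean accepted(String url) {
        return RULE.matcher(url).find();
    }

    // Rebuilds the URL without its fragment, imitating what a
    // fragment-stripping normalizer would hand to the filter.
    static String withoutFragment(String url) {
        URI u = URI.create(url);
        try {
            return new URI(u.getScheme(), u.getAuthority(),
                           u.getPath(), u.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "http://list.taobao.com/xxxxxx#!http://xxxxxx";
        String stripped = withoutFragment(raw);
        System.out.println(raw + " accepted: " + accepted(raw));
        System.out.println(stripped + " accepted: " + accepted(stripped));
    }
}
```

If this is what is happening, the rule accepts the raw URL but rejects the same URL once the fragment is gone. Since the file contains no other + rule, every URL without #! would be rejected as well, including a plain http://list.taobao.com/ seed. It would be worth checking which normalizer plugins are enabled in plugin.includes before changing the regex.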