Clustering NetEase News Crawled by Nutch, Using Mahout
Crawl news from 163 (NetEase):
bin/nutch crawl urls/ -dir crawl/www.163.com/ -depth 5 -topN 1000 -threads 5
Convert the extracted plain-text files into a SequenceFile:
bin/mahout seqdirectory -c UTF-8 -i wjMahout/www.163.com/parse_text/single -o wjMahout/www.163.com/seqfile
Inspect the SequenceFile:
bin/mahout seqdumper -s wjMahout/www.163.com/seqfile/chunk-0
Then convert the SequenceFile into vectors:
bin/mahout seq2sparse -i wjMahout/www.163.com/seqfile/ -o wjMahout/www.163.com/vectors/ -ow -chunk 5 -x 90 -seq -ml 50 -n 2 -nv
The -a option of seq2sparse specifies the Lucene analyzer class; the default is org.apache.lucene.analysis.standard.StandardAnalyzer.
Here I use the IK analyzer instead, by adding: -a org.wltea.analyzer.lucene.IKAnalyzer
(This uses the default analyzer and default TF-IDF weighting. -n 2 applies the L2 norm, which suits the cosine distance we use below for clustering and similarity; -x 90 means that a token appearing in more than 90% of the documents is treated as a stop word; -nv produces named vectors, which makes the downstream data files easier to inspect.)
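Why L2 normalization (-n 2) pairs well with cosine distance can be sketched in a few lines of plain Java (this is an illustrative toy, not Mahout's implementation): once vectors are L2-normalized, the dot product of two vectors *is* their cosine similarity.

```java
// Toy sketch (not Mahout code): after L2 normalization, cosine distance
// between two vectors reduces to 1 minus their dot product.
public class CosineSketch {
    // Return an L2-normalized copy of v.
    static double[] normalize(double[] v) {
        double norm = 0;
        for (double x : v) norm += x * x;
        norm = Math.sqrt(norm);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
        return out;
    }

    // Cosine distance = 1 - cosine similarity; for unit vectors this is
    // simply 1 - dot(a, b).
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0;
        for (int i = 0; i < a.length; i++) dot += a[i] * b[i];
        return 1.0 - dot;
    }

    public static void main(String[] args) {
        // Toy TF-IDF vectors over a three-term dictionary.
        double[] doc1 = normalize(new double[] {1.0, 2.0, 0.0});
        double[] doc2 = normalize(new double[] {2.0, 4.0, 0.0}); // same direction as doc1
        double[] doc3 = normalize(new double[] {0.0, 0.0, 5.0}); // orthogonal to doc1
        System.out.println(cosineDistance(doc1, doc2)); // ~0.0 (same topic direction)
        System.out.println(cosineDistance(doc1, doc3)); // 1.0 (no shared terms)
    }
}
```

Note that document length drops out: doc2 is doc1 scaled by 2, yet their distance is ~0, which is exactly the behavior we want when comparing news articles of different lengths.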
bin/mahout kmeans -i wjMahout/www.163.com/vectors/tfidf-vectors -c wjMahout/www.163.com/kmeans-centroids -cl -o wjMahout/www.163.com/kmeans-clusters -k 5 -ow -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure
(If both -c and -k are specified, kmeans writes k random seed vectors into the -c directory; if -c is given without -k, the -c directory is treated as input and each vector in it seeds a cluster. -cl tells kmeans to also assign the input document vectors to clusters at the end of the run and write them to wjMahout/www.163.com/kmeans-clusters/clusteredPoints; without -cl, the documents are not assigned to clusters.)
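The seeding and assignment behavior described above can be sketched as a minimal plain-Java k-means loop (a stand-in for illustration only; Mahout's RandomSeedGenerator and CosineDistanceMeasure are far more elaborate): pick k random points as seeds (-k), iterate assign/recompute up to a maximum number of passes (-x), and return the final per-point assignment (-cl).

```java
import java.util.Random;

// Minimal k-means sketch (not Mahout): random seeding, iterative refinement
// with cosine distance, and a final cluster assignment per point.
public class KMeansSketch {
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Index of the centroid nearest to v under cosine distance.
    static int nearest(double[] v, double[][] centroids) {
        int best = 0;
        double bestD = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = cosineDistance(v, centroids[c]);
            if (d < bestD) { bestD = d; best = c; }
        }
        return best;
    }

    // Returns each point's final cluster index (the -cl step).
    static int[] cluster(double[][] points, int k, int maxIter, long seed) {
        Random rnd = new Random(seed);
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++)  // random seed vectors, like -c with -k
            centroids[c] = points[rnd.nextInt(points.length)].clone();
        int[] assign = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {  // -x caps the iterations
            for (int p = 0; p < points.length; p++)
                assign[p] = nearest(points[p], centroids);
            // Recompute each centroid as the mean of its members.
            double[][] sums = new double[k][points[0].length];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assign[p]]++;
                for (int d = 0; d < points[p].length; d++)
                    sums[assign[p]][d] += points[p][d];
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    for (int d = 0; d < sums[c].length; d++)
                        centroids[c][d] = sums[c][d] / counts[c];
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] docs = {
            {1, 0.1, 0}, {0.9, 0.2, 0},   // two docs pointing one way
            {0, 0.1, 1}, {0, 0.2, 0.9},   // two docs pointing another way
        };
        int[] assign = cluster(docs, 2, 10, 42L);
        // Docs with the same direction land in the same cluster.
        System.out.println(assign[0] == assign[1] && assign[2] == assign[3]);
    }
}
```

Because the distance ignores magnitude, the two long/short document pairs above cluster by direction, mirroring what CosineDistanceMeasure does for the TF-IDF vectors.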
View the clustering results:
bin/mahout clusterdump -d wjMahout/www.163.com/vectors/dictionary.file-0 -dt sequencefile -s wjMahout/www.163.com/kmeans-clusters/clusters-2-final/part-r-00000 -n 20 -b 100 -p wjMahout/www.163.com/kmeans-clusters/clusteredPoints > clusterdump-result
Code 1: dump each record of the Nutch parse_text data file, writing all records to one "all" file and each record's value to its own file (named by its byte position):
import java.io.File;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

String data = "/home/hadoop/program/apache-nutch-1.4-bin/runtime/local/crawl/www.163.com/segments/20120515154453/parse_text/part-00000/data";
String dataExtracted = "/home/hadoop/Downloads/parse_text";

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(data), conf);
Path path = new Path(data);

// Output directory: one "all" file plus one file per record.
File outputDir = new File(dataExtracted);
File outputFileAll = new File(dataExtracted + "/all");
if (outputDir.mkdir()) {
    if (!outputFileAll.exists()) {
        outputFileAll.createNewFile();
    }
}
PrintWriter printWriterAll = new PrintWriter(new FileWriter(outputFileAll));

SequenceFile.Reader reader = null;
try {
    reader = new SequenceFile.Reader(fs, path, conf);
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    long position = reader.getPosition();
    while (reader.next(key, value)) {
        String syncSeen = reader.syncSeen() ? "*" : "";
        // Echo the record and append it to the "all" file.
        System.out.printf("[%s%s]\n%s\n%s\n", position, syncSeen, key, value);
        printWriterAll.printf("[%s%s]\n%s\n%s\n", position, syncSeen, key, value);
        // Also write this record's text to its own file, named by its byte position.
        PrintWriter printWriterSingle = new PrintWriter(new FileWriter(dataExtracted + "/" + position));
        printWriterSingle.print(value);
        printWriterSingle.close();
        position = reader.getPosition(); // beginning of next record
    }
    printWriterAll.close();
} finally {
    IOUtils.closeStream(reader);
}