Clustering NetEase News Crawled by Nutch, Using Mahout
Crawl news from 163 (NetEase):
bin/nutch crawl urls/ -dir crawl/www.163.com/ -depth 5 -topN 1000 -threads 5
Convert the extracted plain-text files into a SequenceFile:
bin/mahout seqdirectory -c UTF-8 -i wjMahout/www.163.com/parse_text/single -o wjMahout/www.163.com/seqfile
Inspect the SequenceFile:
bin/mahout seqdumper -s wjMahout/www.163.com/seqfile/chunk-0
Then convert the SequenceFile into vectors:
bin/mahout seq2sparse -i wjMahout/www.163.com/seqfile/ -o wjMahout/www.163.com/vectors/ -ow -chunk 5 -x 90 -seq -ml 50 -n 2 -nv
The -a option of seq2sparse specifies the Lucene analyzer class; the default is org.apache.lucene.analysis.standard.StandardAnalyzer.
Here I use the IK analyzer instead, by adding: -a org.wltea.analyzer.lucene.IKAnalyzer
(This uses the default analyzer and default TF-IDF weighting. -n 2 applies the L2 norm, which suits the cosine distance we use below for clustering and similarity; -x 90 means that a token appearing in more than 90% of the documents is treated as a stop word; -nv produces named vectors, which makes the downstream data files easier to inspect.)
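Why L2 normalization (-n 2) pairs well with cosine distance can be sketched in a few lines of plain Java (this is an illustrative toy, not Mahout's implementation): once vectors are L2-normalized, the dot product of two vectors *is* their cosine similarity.

```java
// Toy sketch (not Mahout code): after L2 normalization, cosine distance
// between two vectors reduces to 1 minus their dot product.
public class CosineSketch {
    // Return an L2-normalized copy of v.
    static double[] normalize(double[] v) {
        double norm = 0;
        for (double x : v) norm += x * x;
        norm = Math.sqrt(norm);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
        return out;
    }

    // Cosine distance = 1 - cosine similarity; for unit vectors this is
    // simply 1 - dot(a, b).
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0;
        for (int i = 0; i < a.length; i++) dot += a[i] * b[i];
        return 1.0 - dot;
    }

    public static void main(String[] args) {
        // Toy TF-IDF vectors over a three-term dictionary.
        double[] doc1 = normalize(new double[] {1.0, 2.0, 0.0});
        double[] doc2 = normalize(new double[] {2.0, 4.0, 0.0}); // same direction as doc1
        double[] doc3 = normalize(new double[] {0.0, 0.0, 5.0}); // orthogonal to doc1
        System.out.println(cosineDistance(doc1, doc2)); // ~0.0 (same topic direction)
        System.out.println(cosineDistance(doc1, doc3)); // 1.0 (no shared terms)
    }
}
```

Note that document length drops out: doc2 is doc1 scaled by 2, yet their distance is ~0, which is exactly the behavior we want when comparing news articles of different lengths.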
bin/mahout kmeans -i wjMahout/www.163.com/vectors/tfidf-vectors -c wjMahout/www.163.com/kmeans-centroids -cl -o wjMahout/www.163.com/kmeans-clusters -k 5 -ow -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure
(If both -c and -k are specified, kmeans writes k random seed vectors into the -c directory; if -c is given without -k, the -c directory is treated as input and each vector in it seeds a cluster. -cl tells kmeans to also assign the input document vectors to clusters at the end of the run and write them to wjMahout/www.163.com/kmeans-clusters/clusteredPoints; without -cl, the documents are not assigned to clusters.)
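The seeding and assignment behavior described above can be sketched as a minimal plain-Java k-means loop (a stand-in for illustration only; Mahout's RandomSeedGenerator and CosineDistanceMeasure are far more elaborate): pick k random points as seeds (-k), iterate assign/recompute up to a maximum number of passes (-x), and return the final per-point assignment (-cl).

```java
import java.util.Random;

// Minimal k-means sketch (not Mahout): random seeding, iterative refinement
// with cosine distance, and a final cluster assignment per point.
public class KMeansSketch {
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Index of the centroid nearest to v under cosine distance.
    static int nearest(double[] v, double[][] centroids) {
        int best = 0;
        double bestD = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = cosineDistance(v, centroids[c]);
            if (d < bestD) { bestD = d; best = c; }
        }
        return best;
    }

    // Returns each point's final cluster index (the -cl step).
    static int[] cluster(double[][] points, int k, int maxIter, long seed) {
        Random rnd = new Random(seed);
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++)  // random seed vectors, like -c with -k
            centroids[c] = points[rnd.nextInt(points.length)].clone();
        int[] assign = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {  // -x caps the iterations
            for (int p = 0; p < points.length; p++)
                assign[p] = nearest(points[p], centroids);
            // Recompute each centroid as the mean of its members.
            double[][] sums = new double[k][points[0].length];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assign[p]]++;
                for (int d = 0; d < points[p].length; d++)
                    sums[assign[p]][d] += points[p][d];
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    for (int d = 0; d < sums[c].length; d++)
                        centroids[c][d] = sums[c][d] / counts[c];
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] docs = {
            {1, 0.1, 0}, {0.9, 0.2, 0},   // two docs pointing one way
            {0, 0.1, 1}, {0, 0.2, 0.9},   // two docs pointing another way
        };
        int[] assign = cluster(docs, 2, 10, 42L);
        // Docs with the same direction land in the same cluster.
        System.out.println(assign[0] == assign[1] && assign[2] == assign[3]);
    }
}
```

Because the distance ignores magnitude, the two long/short document pairs above cluster by direction, mirroring what CosineDistanceMeasure does for the TF-IDF vectors.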
View the clustering results:
bin/mahout clusterdump -d wjMahout/www.163.com/vectors/dictionary.file-0 -dt sequencefile -s wjMahout/www.163.com/kmeans-clusters/clusters-2-final/part-r-00000 -n 20 -b 100 -p wjMahout/www.163.com/kmeans-clusters/clusteredPoints > clusterdump-result
Code 1: dump each record of the Nutch parse_text data file, writing all records to one "all" file and each record's value to its own file (named by its byte position):
import java.io.File;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

String data = "/home/hadoop/program/apache-nutch-1.4-bin/runtime/local/crawl/www.163.com/segments/20120515154453/parse_text/part-00000/data";
String dataExtracted = "/home/hadoop/Downloads/parse_text";

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(data), conf);
Path path = new Path(data);

// Output directory: one "all" file plus one file per record.
File outputDir = new File(dataExtracted);
File outputFileAll = new File(dataExtracted + "/all");
if (outputDir.mkdir()) {
    if (!outputFileAll.exists()) {
        outputFileAll.createNewFile();
    }
}
PrintWriter printWriterAll = new PrintWriter(new FileWriter(outputFileAll));

SequenceFile.Reader reader = null;
try {
    reader = new SequenceFile.Reader(fs, path, conf);
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    long position = reader.getPosition();
    while (reader.next(key, value)) {
        String syncSeen = reader.syncSeen() ? "*" : "";
        // Echo the record and append it to the "all" file.
        System.out.printf("[%s%s]\n%s\n%s\n", position, syncSeen, key, value);
        printWriterAll.printf("[%s%s]\n%s\n%s\n", position, syncSeen, key, value);
        // Also write this record's text to its own file, named by its byte position.
        PrintWriter printWriterSingle = new PrintWriter(new FileWriter(dataExtracted + "/" + position));
        printWriterSingle.print(value);
        printWriterSingle.close();
        position = reader.getPosition(); // beginning of next record
    }
    printWriterAll.close();
} finally {
    IOUtils.closeStream(reader);
}