Clustering NetEase news crawled by Nutch with Mahout
bin/nutch crawl urls/ -dir crawl/ -depth 5 -topN 1000 -threads 5
bin/mahout seqdirectory -c UTF-8 -i wjMahout/ -o wjMahout/
bin/mahout seqdumper -s wjMahout/
bin/mahout seq2sparse -i wjMahout/ -o wjMahout/ -ow -chunk 5 -x 90 -seq -ml 50 -n 2 -nv
Here I use the IK analyzer for Chinese word segmentation, so I also add: -a org.wltea.analyzer.lucene.IKAnalyzer
(Without -a, this uses the default analyzer, and the weighting defaults to TFIDF. -n 2 applies L2 normalization, which works well with the cosine distance used here for clustering and similarity. -x 90 means that a token appearing in more than 90% of the documents is treated as a stop word. -nv produces named vectors, which makes the later data files easier to inspect.)
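The note above packs two ideas into one sentence: why L2 normalization pairs well with cosine distance, and how a document-frequency cutoff such as -x 90 behaves. The following is a minimal Java sketch of both, not Mahout's code. The class `VectorizeSketch` and all of its method names are hypothetical, and the cutoff rule (drop a token when its document frequency is strictly above the percentage) is an assumption about the intended semantics.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class for illustration only; it is not part of Mahout.
class VectorizeSketch {

    /** Dot product of two vectors of equal length. */
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Returns a copy of v scaled to unit L2 length; a zero vector is returned unchanged. */
    static double[] l2Normalize(double[] v) {
        double norm = Math.sqrt(dot(v, v));
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = (norm == 0.0) ? v[i] : v[i] / norm;
        }
        return out;
    }

    /** Cosine distance: 1 minus the cosine of the angle between a and b. */
    static double cosineDistance(double[] a, double[] b) {
        double na = Math.sqrt(dot(a, a));
        double nb = Math.sqrt(dot(b, b));
        if (na == 0.0 || nb == 0.0) {
            return 1.0; // convention chosen for this sketch
        }
        return 1.0 - dot(a, b) / (na * nb);
    }

    /** Tokens whose document frequency is above maxDfPercent percent of all documents. */
    static Set<String> tooFrequent(List<List<String>> docs, int maxDfPercent) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // Count each token once per document, however often it occurs.
            for (String token : new HashSet<>(doc)) {
                df.merge(token, 1, Integer::sum);
            }
        }
        Set<String> result = new HashSet<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            if (e.getValue() * 100.0 / docs.size() > maxDfPercent) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

Once vectors are L2-normalized, the cosine distance between two of them reduces to one minus their dot product, so normalizing once at vectorization time keeps later distance computations simple.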
bin/mahout kmeans -i wjMahout/ -c wjMahout/ -cl -o wjMahout/ -k 5 -ow -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure
(If both -c and -k are specified, kmeans writes k random seed vectors into the -c directory. If -c is given without -k, the -c directory is treated as input and kmeans uses each vector in it to seed the clustering. -cl tells kmeans to also assign the input document vectors to clusters at the end of the process and to put them in reuters-kmeans-clusters/clusteredPoints, that is, in clusteredPoints under the -o output directory. If -cl is not specified, the documents are not assigned to clusters.)
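To make the roles of -k, -x and -cl concrete, here is a small in-memory Java sketch of the same flow: pick k random input vectors as seeds (or pass in your own seeds, as with -c alone), run at most a fixed number of iterations, then assign every point to its nearest final centroid. It illustrates the general k-means procedure with cosine distance and is not Mahout's implementation. The class `KMeansSketch`, its method names and the convergence threshold are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch for illustration only; it is not Mahout's implementation.
class KMeansSketch {

    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        if (na == 0.0 || nb == 0.0) {
            return 1.0; // convention chosen for this sketch
        }
        return 1.0 - dot / Math.sqrt(na * nb);
    }

    /** Like -k: choose k distinct input vectors at random as the initial centroids. */
    static double[][] randomSeeds(double[][] points, int k, Random rng) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, rng);
        double[][] seeds = new double[k][];
        for (int c = 0; c < k; c++) {
            seeds[c] = points[indices.get(c)].clone();
        }
        return seeds;
    }

    /** Index of the centroid closest to p; ties go to the lowest index. */
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (cosineDistance(p, centroids[c]) < cosineDistance(p, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Like -x: run at most maxIter iterations, updating centroids in place. */
    static void iterate(double[][] points, double[][] centroids, int maxIter) {
        int dim = points[0].length;
        for (int iter = 0; iter < maxIter; iter++) {
            double[][] sums = new double[centroids.length][dim];
            int[] counts = new int[centroids.length];
            for (double[] p : points) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int j = 0; j < dim; j++) {
                    sums[c][j] += p[j];
                }
            }
            boolean changed = false;
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int j = 0; j < dim; j++) {
                    double mean = sums[c][j] / counts[c];
                    if (Math.abs(mean - centroids[c][j]) > 1e-9) {
                        changed = true;
                    }
                    centroids[c][j] = mean;
                }
            }
            if (!changed) {
                break; // converged before reaching maxIter
            }
        }
    }

    /** Like -cl: assign each input point to its nearest final centroid. */
    static int[] assign(double[][] points, double[][] centroids) {
        int[] labels = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            labels[i] = nearest(points[i], centroids);
        }
        return labels;
    }
}
```

A typical call sequence would be `randomSeeds(points, 5, new Random())`, then `iterate(points, seeds, 10)`, then `assign(points, seeds)`, mirroring -k 5, -x 10 and -cl in the command above. Skipping the last call corresponds to running without -cl: the centroids exist, but no document is labeled with a cluster.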
bin/mahout clusterdump -d wjMahout/ -dt sequencefile -s wjMahout/ -n 20 -b 100 -p wjMahout/ > clusterdump-result
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// The class name is arbitrary; the original note shows only the method body.
public class ExtractParseText {
    public static void main(String[] args) throws IOException {
        // Path kept as given in the original note. SequenceFile.Reader opens a single
        // file, so in practice it has to point at a SequenceFile rather than a directory.
        String data = "/home/hadoop/program/apache-nutch-1.4-bin/runtime/local/crawl/";
        String dataExtracted = "/home/hadoop/Downloads/parse_text";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(data), conf);
        Path path = new Path(data);

        // Make sure the output directory exists; FileWriter creates the files themselves.
        File outputDir = new File(dataExtracted);
        outputDir.mkdirs();
        File outputFileAll = new File(dataExtracted + "/all");
        PrintWriter printWriterAll = new PrintWriter(new FileWriter(outputFileAll));

        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf);
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            long position = reader.getPosition();
            while (reader.next(key, value)) {
                // Mark records that were preceded by a sync point.
                String syncSeen = reader.syncSeen() ? "*" : "";
                System.out.printf("[%s%s]\n%s\n%s\n", position, syncSeen, key, value);
                printWriterAll.printf("[%s%s]\n%s\n%s\n", position, syncSeen, key, value);

                // Write each record's value to its own file, named after the record's offset.
                File outputFileSingle = new File(dataExtracted + "/" + position);
                PrintWriter printWriterSingle = new PrintWriter(new FileWriter(outputFileSingle));
                printWriterSingle.print(value);
                printWriterSingle.close(); // flushes and closes the underlying FileWriter

                position = reader.getPosition(); // beginning of next record
            }
        } finally {
            IOUtils.closeStream(reader);
            printWriterAll.close();
        }
    }
}