Twenty Newsgroups Classification, Task 2: seq2sparse (2)
Continuing from the previous post: the output of SequenceFileTokenizerMapper can be viewed directly in the file /home/mahout/mahout-work-mahout0/20news-vectors/tokenized-documents/part-m-00000, and you can also write code to read this file, adapted from the code used earlier to read the cluster-center file.
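A minimal sketch of such a reader (assuming, based on what SequenceFileTokenizerMapper emits, that the keys are Text document ids and the values are Mahout StringTuple token lists; verify the types against your own file):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.StringTuple;

public class ReadTokenizedDocuments {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path("/home/mahout/mahout-work-mahout0/20news-vectors/"
        + "tokenized-documents/part-m-00000");
    FileSystem fs = FileSystem.get(path.toUri(), conf);
    // Open the SequenceFile and iterate its records: the key is the
    // document id, the value is the list of tokens for that document.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      Text key = new Text();
      StringTuple value = new StringTuple();
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value.getEntries());
      }
    } finally {
      reader.close();
    }
  }
}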
The code below then builds the term dictionary from the word-count output (wordCountPath); it corresponds to the createDictionaryChunks step in Mahout's DictionaryVectorizer:

List<Path> chunkPaths = Lists.newArrayList();
Configuration conf = new Configuration(baseConf);
FileSystem fs = FileSystem.get(wordCountPath.toUri(), conf);

long chunkSizeLimit = chunkSizeInMegabytes * 1024L * 1024L;
int chunkIndex = 0;
Path chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
chunkPaths.add(chunkPath);

SequenceFile.Writer dictWriter =
    new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);
try {
  long currentChunkSize = 0;
  Path filesPattern = new Path(wordCountPath, OUTPUT_FILES_PATTERN);
  int i = 0;
  for (Pair<Writable,Writable> record
      : new SequenceFileDirIterable<Writable,Writable>(
          filesPattern, PathType.GLOB, null, null, true, conf)) {
    // Roll over to a new dictionary chunk once the size limit is exceeded.
    if (currentChunkSize > chunkSizeLimit) {
      Closeables.closeQuietly(dictWriter);
      chunkIndex++;
      chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
      chunkPaths.add(chunkPath);
      dictWriter =
          new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);
      currentChunkSize = 0;
    }
    Writable key = record.getFirst();
    int fieldSize = DICTIONARY_BYTE_OVERHEAD + key.toString().length() * 2 + Integer.SIZE / 8;
    currentChunkSize += fieldSize;
    dictWriter.append(key, new IntWritable(i++));
  }
  maxTermDimension[0] = i;
} finally {
  Closeables.closeQuietly(dictWriter);
}

Here a new SequenceFile.Writer is created, and the word-count file's key/value pairs are iterated, but only the key, i.e. the word, is read. Each word is then given an integer code: the first word is paired with 0, the second with 1, and so on. Whenever the running chunk size exceeds chunkSizeLimit, the current writer is closed and a new dictionary chunk file is started.
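As a made-up illustration: if the word-count output yields the terms "apple", "banana", "cat" in iteration order, the dictionary chunk ends up holding the pairs (apple, 0), (banana, 1), (cat, 2). The fieldSize estimate charges each entry DICTIONARY_BYTE_OVERHEAD, plus 2 bytes per character of the term (Java strings are UTF-16), plus Integer.SIZE / 8 = 4 bytes for the id; "apple" therefore adds DICTIONARY_BYTE_OVERHEAD + 10 + 4 bytes toward chunkSizeLimit.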
Inspecting the dictWriter variable in the debugger, I could not find any field that stores the words and their corresponding ids. So is the write mechanism simply that append() writes the record out? Or did I just not find the right field? To be investigated...
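My working assumption (to be verified against the Hadoop source): SequenceFile.Writer keeps no in-memory collection of records, and each append(key, value) serializes the pair straight to the underlying output stream, which would explain why no word-to-id field shows up in the debugger. One way to check that the mapping really was written is to read a dictionary chunk back; a minimal sketch (the path dictionary.file-0 is an assumption derived from the DICTIONARY_FILE prefix above):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ReadDictionaryChunk {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Assumed chunk location; adjust to wherever seq2sparse wrote it.
    Path path = new Path("/home/mahout/mahout-work-mahout0/20news-vectors/dictionary.file-0");
    FileSystem fs = FileSystem.get(path.toUri(), conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      Text word = new Text();
      IntWritable id = new IntWritable();
      // Each record is one (word, id) pair appended by dictWriter above.
      while (reader.next(word, id)) {
        System.out.println(word + " -> " + id.get());
      }
    } finally {
      reader.close();
    }
  }
}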
Share, enjoy, grow
When reposting, please cite the source: http://blog.csdn.net/fansy1990