
Twenty Newsgroups Classification Task, Step 2: seq2sparse (Part 2)

2013-09-06 

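Continuing from the previous post: the output of SequenceFileTokenizerMapper is written to /home/mahout/mahout-work-mahout0/20news-vectors/tokenized-documents/part-m-00000, and that file can be inspected directly. You can also write code like the following to read it (adapted from the earlier code that read the cluster-center file):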

    // Builds the dictionary: every term in the word-count output gets an integer id,
    // and the (term, id) pairs are written to one or more SequenceFile chunks.
    List<Path> chunkPaths = Lists.newArrayList();

    Configuration conf = new Configuration(baseConf);
    FileSystem fs = FileSystem.get(wordCountPath.toUri(), conf);

    long chunkSizeLimit = chunkSizeInMegabytes * 1024L * 1024L;
    int chunkIndex = 0;
    Path chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
    chunkPaths.add(chunkPath);

    SequenceFile.Writer dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);

    try {
      long currentChunkSize = 0;
      Path filesPattern = new Path(wordCountPath, OUTPUT_FILES_PATTERN);
      int i = 0;  // next term id to assign
      for (Pair<Writable,Writable> record
           : new SequenceFileDirIterable<Writable,Writable>(filesPattern, PathType.GLOB, null, null, true, conf)) {
        // Start a new chunk file once the current one has grown past the size limit.
        if (currentChunkSize > chunkSizeLimit) {
          Closeables.closeQuietly(dictWriter);
          chunkIndex++;
          chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
          chunkPaths.add(chunkPath);
          dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);
          currentChunkSize = 0;
        }

        // Only the key (the term) is used; the value of the record is ignored.
        Writable key = record.getFirst();
        // Estimated entry size: fixed overhead + 2 bytes per character + 4 bytes for the int id.
        int fieldSize = DICTIONARY_BYTE_OVERHEAD + key.toString().length() * 2 + Integer.SIZE / 8;
        currentChunkSize += fieldSize;
        dictWriter.append(key, new IntWritable(i++));
      }
      maxTermDimension[0] = i;  // total number of distinct terms
    } finally {
      Closeables.closeQuietly(dictWriter);
    }
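Here a new Writer is created, and the code then iterates over the key/value records of the input files. Only the key of each record, i.e. the word, is read. Each word is then encoded with an integer id: the first word maps to 0, the second word maps to 1, and so on.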
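The id assignment and chunk rollover described above can be sketched in plain Java, without Hadoop or Mahout. This is only an illustration under stated assumptions: the class `DictionaryChunkSketch`, the method `assignIds`, and the overhead value of 4 bytes are hypothetical stand-ins, in-memory maps take the place of the SequenceFile chunks, and the input terms are assumed to be distinct.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DictionaryChunkSketch {
  // Assumed per-entry overhead in bytes; a stand-in for DICTIONARY_BYTE_OVERHEAD.
  static final int ENTRY_OVERHEAD_BYTES = 4;

  // Assigns consecutive ids (0, 1, 2, ...) to the terms in iteration order and
  // groups the (term, id) pairs into chunks, mirroring the loop in the post.
  static List<Map<String, Integer>> assignIds(List<String> terms, long chunkSizeLimitBytes) {
    List<Map<String, Integer>> chunks = new ArrayList<>();
    Map<String, Integer> current = new LinkedHashMap<>();
    chunks.add(current);

    long currentChunkSize = 0;
    int nextId = 0;
    for (String term : terms) {
      // Roll over to a new chunk once the current one has exceeded the limit.
      if (currentChunkSize > chunkSizeLimitBytes) {
        current = new LinkedHashMap<>();
        chunks.add(current);
        currentChunkSize = 0;
      }
      // Same size estimate as the original: overhead + 2 bytes per char + 4 bytes for the id.
      currentChunkSize += ENTRY_OVERHEAD_BYTES + term.length() * 2 + Integer.SIZE / 8;
      current.put(term, nextId++);
    }
    return chunks;
  }

  public static void main(String[] args) {
    List<String> terms = Arrays.asList("apple", "banana", "cherry");
    // Large limit: everything fits in one chunk.
    System.out.println(assignIds(terms, 1024L)); // [{apple=0, banana=1, cherry=2}]
    // Tiny limit: the third term lands in a second chunk, but ids keep counting up.
    System.out.println(assignIds(terms, 20L));   // [{apple=0, banana=1}, {cherry=2}]
  }
}
```

Note that the ids are global across chunks: starting a new chunk resets only the size counter, not the id counter, which matches how `i` is used in the loop above.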

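Inspecting the dictWriter variable used in the code above, I could not find any field that stores the words and their ids. So does the writer simply write each record to the file as soon as append is called? Or have I just not found the right field? To be investigated...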

Share, enjoy, grow


When reposting, please credit the source: http://blog.csdn.net/fansy1990

