Implementing custom multi-file output in MapReduce

2012-08-08


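An ordinary MapReduce job usually has two phases, map and reduce. With no extra configuration, the results are written as multiple files named part-000*, one file per reduce task, and the content format of those files cannot be freely customized. This makes downstream processing of the results awkward.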
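To make the default naming concrete, here is a small plain-Java sketch with no Hadoop dependency. It mimics how the new-API FileOutputFormat names reducer output: the base name part, the task-type letter r, and a five-digit partition number (the old API omits the -r- segment, giving names such as part-00000). The class and method names are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class DefaultPartNames {
    // Hypothetical helper: mimics the default new-API reducer file name.
    static String defaultPartName(int partition) {
        return String.format(Locale.ROOT, "part-r-%05d", partition);
    }

    // One file name per reduce task, which is why the file count equals the reducer count.
    static List<String> namesFor(int numReduceTasks) {
        List<String> names = new ArrayList<>();
        for (int p = 0; p < numReduceTasks; p++) {
            names.add(defaultPartName(p));
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : namesFor(12)) {
            System.out.println(name);
        }
    }
}
```

With 12 reduce tasks this lists 12 names, none of which says anything about the data inside, which is the limitation the rest of the post works around.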

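In Hadoop, a reducer can write to multiple outputs and the output file names can be controlled: extend the MultipleTextOutputFormat class and override its generateFileNameForKeyValue method. If all you need is control over the output file names, implement your own LogNameMultipleTextOutputFormat class and call jobconf.setOutputFormat(LogNameMultipleTextOutputFormat.class); that is enough. This approach, however, only works with the old Hadoop API (org.apache.hadoop.mapred). If you want to use the new API, or you have further requirements such as a custom format for the output content, you have to rewrite some of the Hadoop API classes yourself.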

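First, write your own MultipleOutputFormat class that extends FileOutputFormat (note: this is the FileOutputFormat in the org.apache.hadoop.mapreduce.lib.output package). A job-specific subclass then overrides generateFileNameForKeyValue to choose the output file name for each record.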
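The core idea behind such a class is a record writer that keeps one underlying writer per generated file name and routes each record to the right one. The following is a Hadoop-free sketch of that routing only, under the assumption that in-memory buffers stand in for real output files. MultiFileWriterSketch and its methods are hypothetical names, not part of Hadoop or of the original code.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

class MultiFileWriterSketch {
    // Maps a record key to the name of the file it should go to.
    private final Function<String, String> fileNameForKey;
    // One buffer per file name; a real implementation would hold open file writers here.
    private final Map<String, StringBuilder> files = new TreeMap<>();

    MultiFileWriterSketch(Function<String, String> fileNameForKey) {
        this.fileNameForKey = fileNameForKey;
    }

    // Look up (or lazily create) the buffer for this record's file name and append the record.
    void write(String record) {
        String name = fileNameForKey.apply(record);
        files.computeIfAbsent(name, n -> new StringBuilder())
             .append(record)
             .append('\n');
    }

    // Snapshot of what each "file" contains so far.
    Map<String, String> contents() {
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, StringBuilder> e : files.entrySet()) {
            out.put(e.getKey(), e.getValue().toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed key layout for the demo: the second comma-separated field names the file.
        MultiFileWriterSketch out = new MultiFileWriterSketch(k -> k.split(",")[1]);
        out.write("a,20120808");
        out.write("b,20120809");
        out.write("c,20120808");
        System.out.println(out.contents().keySet());
    }
}
```

In a real output format the lazily created entries would be file writers opened under the task's output directory and closed together when the task finishes; the lookup-by-name structure stays the same.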

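// Requires: org.apache.hadoop.conf.Configuration, org.apache.hadoop.io.Text, org.apache.hadoop.io.NullWritable
public static class VVLogNameMultipleTextOutputFormat extends MultipleOutputFormat<Text, NullWritable> {

    // Use the second comma-separated field of the key as the output file name.
    @Override
    protected String generateFileNameForKeyValue(Text key, NullWritable value, Configuration conf) {
        String[] sp = key.toString().split(",");
        // Guard against keys with no second field, which would otherwise throw.
        if (sp.length < 2) {
            return "000000000000";
        }
        String filename = sp[1];
        try {
            Long.parseLong(sp[1]);
        } catch (NumberFormatException e) {
            // Non-numeric field: fall back to a default file name.
            filename = "000000000000";
        }
        return filename;
    }
}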
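The file-name rule can be checked without a cluster by lifting it into a plain static method. This is a sketch under assumptions: the class LogFileNames and the sample keys are hypothetical, and the length check for keys without a second field is a defensive addition for the sketch.

```java
class LogFileNames {
    // Default file name used when the key has no usable numeric second field.
    static final String FALLBACK = "000000000000";

    // Returns the second comma-separated field if it parses as a long, else FALLBACK.
    static String fileNameForKey(String key) {
        String[] sp = key.split(",");
        if (sp.length < 2) {
            return FALLBACK; // no second field at all
        }
        try {
            Long.parseLong(sp[1]);
            return sp[1];
        } catch (NumberFormatException e) {
            return FALLBACK; // second field is empty or not numeric
        }
    }

    public static void main(String[] args) {
        // Sample keys are made up for the demo.
        System.out.println(fileNameForKey("uid,201208081200,x"));
        System.out.println(fileNameForKey("uid,abc"));
        System.out.println(fileNameForKey("nocomma"));
    }
}
```

Keeping the rule in a pure function like this makes the edge cases (missing field, empty field, non-numeric field) easy to unit test before wiring it into the output format.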

Finally, set the output format when configuring the job:

Configuration conf = getConf();
Job job = new Job(conf);
job.setNumReduceTasks(12);
......
job.setMapperClass(VVEtlMapper.class);
job.setReducerClass(EtlReducer.class);
job.setOutputFormatClass(VVLogNameMultipleTextOutputFormat.class); // use the custom multi-file output format
FileInputFormat.setInputPaths(job,new Path(args[0]));
FileOutputFormat.setOutputPath(job,new Path(args[1]));
FileOutputFormat.setCompressOutput(job, true); // compress the output
FileOutputFormat.setOutputCompressorClass(job, LzopCodec.class); // use LZO (lzop) compression for the output

That's it: you now have a MapReduce job with custom multi-file output that works with the new Hadoop API.

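#1 yao710452238, yesterday 23:54

With multiple reducers, the output files end up containing less data; with a single reducer they don't. Have you tested this?

Re: liuzhoulong, 5 hours ago: Replying to yao710452238: I am running 12 reducers here and no data goes missing. Why do you say that?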
