A First Crack at Hadoop: Dumping Data
Hi all. I recently picked up a requirement to dump data from Yunti (云梯, Taobao's Hadoop environment) into TAIR (Taobao's cache system). The original design, roughly: extract the needed subset from the raw Yunti table, import it into a single MySQL table with DataX (a data synchronization tool), then run a single-machine multi-threaded job over that table (which took more than 8 hours). My simple thought: since the data already lives on Yunti, why not run a MapReduce job directly on Yunti? So that is what I set out to do...

First, a Hive job extracts the needed fields into an external table:

set mapred.reduce.tasks=45;  -- controls the number of files Hive produces
set hive.job.hooks.autored.enable=false;

create external table if not exists ecrm_buyer_benefit (
    buyer_id  string,
    seller_id string,
    end_time  string
)
comment 'stat_buyer_benefit'
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as TEXTFILE;

INSERT OVERWRITE TABLE ecrm_buyer_benefit
select buyer_id, seller_id, max(end_time)
from r_ecrm_buyer_benefit b
where b.pt = 20121126000000
  and b.end_time > '2012-11-27 00:00:00'
  and b.promotion_type = '3'
  and (b.status = '1' or b.status = '0')
group by buyer_id, seller_id;
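The MapReduce job then reads the files behind this external table and writes each row into TAIR. Below is a minimal sketch of what the map side could look like, assuming a hypothetical TairClient (its connect/put calls, the tair.config.id property, and the buyer_id_seller_id key layout are all illustrative, not the real TAIR client API); lines that fail to write are handed to the recordError helper shown next:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class DumpMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {

    private TairClient tair; // hypothetical TAIR client handle

    @Override
    public void configure(JobConf job) {
        // hypothetical: open one TAIR connection per map task
        tair = TairClient.connect(job.get("tair.config.id"));
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<LongWritable, Text> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        String[] fields = line.split(","); // buyer_id, seller_id, end_time
        try {
            // assumed key layout: buyer_id + "_" + seller_id -> end_time
            tair.put(fields[0] + "_" + fields[1], fields[2]);
        } catch (Exception e) {
            recordError(output, key, line, reporter); // helper shown below
        }
    }

    // recordError(...) is defined below
}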
Lines that fail to write are counted and re-emitted through a small helper:

private void recordError(OutputCollector<LongWritable, Text> output,
                         LongWritable key, String line, Reporter reporter) {
    // bump a custom counter so failures show up in the job counters
    reporter.getCounter("counter_group", "error_counter").increment(1L);
    try {
        output.collect(key, new Text(line));
    } catch (IOException e) {
        e.printStackTrace();
        System.out.println("output error, line: " + line);
    }
}

The reducer does nothing but pass the error records through to the output:
public void reduce(LongWritable key, Iterator<Text> values,
                   OutputCollector<LongWritable, Text> output, Reporter reporter)
        throws IOException {
    // simply forward every error record to the job output
    while (values.hasNext()) {
        output.collect(key, values.next());
    }
}

Results: over 1.2 billion records, with 45 mappers running in parallel and TAIR capped at 10k TPS, the job ran for eight and a half hours. There is of course still plenty of room for optimization, for example running each mapper with multiple threads (the default is single-threaded), as sketched below.
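With the old mapred API, one way to get multi-threaded map is MultithreadedMapRunner. A minimal sketch of the driver-side configuration (the thread count of 10 is an arbitrary example; DumpStarter is the job's main class from the pom below); note that the Mapper must be thread-safe, since all threads share a single instance:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

// in the job driver:
JobConf conf = new JobConf(DumpStarter.class);
// replace the default single-threaded MapRunner with the multi-threaded one
conf.setMapRunnerClass(MultithreadedMapRunner.class);
// number of threads per map task
conf.setInt("mapred.map.multithreadedrunner.threads", 10);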
Packaging uses the maven-assembly-plugin to build a single jar with all dependencies:

<plugins>
  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <configuration>
      <source>1.6</source>
      <target>1.6</target>
      <encoding>UTF-8</encoding>
    </configuration>
  </plugin>
  <plugin>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
      <archive>
        <manifest>
          <mainClass>com.taobao.ump.dump.DumpStarter</mainClass>
        </manifest>
      </archive>
      <descriptorRefs>
        <descriptorRef>jar-with-dependencies</descriptorRef>
      </descriptorRefs>
    </configuration>
  </plugin>
</plugins>

Package with mvn clean compile assembly:single; a roughly 10 MB jar appears under target/, which can then be submitted to Hadoop and run.
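Since the manifest already names the main class, submission is a one-liner (the jar file name here is illustrative; the actual name depends on the pom's artifactId and version):

hadoop jar target/dump-1.0-jar-with-dependencies.jar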