Hadoop Study Notes (3): An Example
1. Helper classes: GenericOptionsParser, Tool, and ToolRunner
The previous chapter used the GenericOptionsParser class, which parses the common Hadoop command-line options and, where needed, sets the corresponding values on a Configuration object. GenericOptionsParser is usually not used directly; the more convenient approach is to implement the Tool interface and run it through ToolRunner, which internally still ends up calling GenericOptionsParser.
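As a reminder of how these pieces fit together, here is a minimal driver sketch (the class name MyDriver is made up for illustration) that implements Tool and leaves the option parsing to ToolRunner:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Configured supplies getConf()/setConf(); ToolRunner fills the Configuration
// with the generic options (-D, -fs, -jt, -files, ...) by calling
// GenericOptionsParser internally before invoking run().
public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();  // already populated by ToolRunner
        // ... build and submit the job here using the remaining args ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options and passes the rest to run()
        System.exit(ToolRunner.run(new MyDriver(), args));
    }
}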
JobContext provides read-only properties and has two members, a JobConf and a JobID. Apart from the job ID, which is kept in the JobID member, all of its read-only properties are read from the JobConf, including the following (see the sketch after this list):
1.mapred.reduce.tasks, default value 1
2.mapred.working.dir, the working directory of the file system
3.mapred.job.name, the job name set by the user
4.mapreduce.map.class
5.mapreduce.inputformat.class
6.mapreduce.combine.class
7.mapreduce.reduce.class
8.mapreduce.outputformat.class
9.mapreduce.partitioner.class
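In practice these values are read back through the task Context, which is a JobContext. The following sketch (illustrative only, not code from the original post) prints a few of them from inside a map task:

import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative only: dumps some of the read-only JobContext properties
// listed above from inside a map task.
public class PropertyDumpMapper extends Mapper<Object, Object, Object, Object> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The job id comes from the JobID member, everything else from the JobConf
        System.out.println("job id       : " + context.getJobID());
        System.out.println("job name     : " + context.getJobName());          // mapred.job.name
        System.out.println("reduce tasks : " + context.getNumReduceTasks());   // mapred.reduce.tasks
        System.out.println("working dir  : " + context.getWorkingDirectory()); // mapred.working.dir
        try {
            System.out.println("mapper class : " + context.getMapperClass());  // mapreduce.map.class
            System.out.println("reducer class: " + context.getReducerClass()); // mapreduce.reduce.class
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        }
    }
}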
Job: "The job submitter's view of the Job. It allows the user to configure the job, submit it, control its execution, and query the state. The set methods only work until the job is submitted, afterwards they will throw an IllegalStateException." In other words, Job is what you use to configure a job, submit it, control its execution, and query its state.
It has two members: a JobClient and a RunningJob.
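A typical use of Job in a driver looks roughly like the sketch below. KeywordCountDriver, KeywordMapper, and KeywordReducer are hypothetical names standing in for your own classes (a mapper along these lines is sketched further below), and the input and output paths are taken from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeywordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "keyword count");   // configure while not yet submitted
        job.setJarByClass(KeywordCountDriver.class);
        job.setMapperClass(KeywordMapper.class);    // hypothetical mapper class
        job.setReducerClass(KeywordReducer.class);  // hypothetical reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // After submission, the set* methods above would throw IllegalStateException
        boolean ok = job.waitForCompletion(true);   // submit, then poll and print progress
        System.exit(ok ? 0 : 1);
    }
}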
Next, let's look at Mapper:
Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
This is the Mapper class; it defines a nested Context class.
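As an illustration (the field layout here is an assumption, not the real SearchLog/AccessLog schema used later), a mapper that emits each search keyword with a count of 1 might look like this, using the nested Context to write its output:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// KEYIN = byte offset, VALUEIN = one log line,
// KEYOUT = keyword, VALUEOUT = the count 1
public class KeywordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text keyword = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assume a tab-separated access-log line whose second field is the keyword
        String[] fields = value.toString().split("\t");
        if (fields.length > 1 && !fields[1].isEmpty()) {
            keyword.set(fields[1]);
            context.write(keyword, ONE);  // the Context hands output back to the framework
        }
    }
}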
Finally, the Pig script for the example itself; it counts how many times each search keyword was queried in one day's access log:

register /home/app_admin/scripts/piglib/*.jar
/** The SearchLog class is in here */
register /home/app_admin/scripts/loadplainlog/loadplainlog.jar
rmf /result/keyword/$logdate
a = LOAD '/log/$logdate/access/part-m-*' USING com.searchlog.AccessLogPigLoader;
/** Then filter: the current page's host is www.****.com, the keyword is not empty, it is the first results page,
    there are direct-zone results or ordinary search results, and the current page URL is a specific path */
a_all = FILTER a BY location.host == 'www.****.com' and keyword is not null and keyword != '' and pageno == 1 ;
/** Group by keyword */
b_all = GROUP a_all BY keyword;
/** Compute a count for each group */
c_all = FOREACH b_all GENERATE group, COUNT(a_all.id) as keywordSearchCount, MAX(a_all.vs) as vs;
/** Order the results by search count */
d_all = ORDER c_all by keywordSearchCount DESC;
result = FOREACH d_all GENERATE group,keywordSearchCount,vs;
/** Store the result to a file */
store result into '/result/keyword/$logdate/' USING PigStorage();
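To run it, the $logdate parameter can be supplied on the command line with Pig's -param option, e.g. pig -param logdate=20120815 keyword.pig (the script file name and the date value here are just examples).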