Hadoop study notes: Pig installation and usage examples
Pig overview:
Pig can be thought of as client-side software for Hadoop: it connects to a Hadoop cluster to do data analysis, and provides a scripting language for exploring large data sets.
Pig is a dataflow language layered on top of HDFS and MapReduce. It translates a dataflow into a series of map and reduce functions, providing a higher level of abstraction that frees programmers from low-level MapReduce coding; users who are not comfortable with Java can process data in Pig Latin, a relatively simple, SQL-like, dataflow-oriented language.
Pig consists of two parts: the language used to describe dataflows, called Pig Latin, and the execution environment that runs Pig Latin programs.
A Pig Latin program is a series of operations and transformations: sorting, filtering, summing, grouping, joining and other common operations, plus user-defined functions. It is a lightweight scripting language aimed at data analysis. Each operation or transformation processes its input and produces output, and taken together the operations describe a dataflow. Internally, Pig converts these transformations into a series of MapReduce jobs.
In other words, Pig can be seen as a mapper from Pig Latin to MapReduce.
Pig is not suited to every data-processing task. Like MapReduce, it is designed for batch processing: if you only want to query a small portion of a large data set, Pig will not perform well, because it scans the whole data set or most of it.
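As an illustration, a typical Pig Latin dataflow loads a file, filters, groups and aggregates it, then writes the result back out; the file name 'logs.txt', the output path and the field names below are hypothetical, and the sketch only shows the shape of such a dataflow:
records = LOAD 'logs.txt' USING PigStorage('\t') AS (user:chararray, bytes:int);
big     = FILTER records BY bytes > 1024;                  -- keep large requests only
grouped = GROUP big BY user;                               -- group by user
totals  = FOREACH grouped GENERATE group, SUM(big.bytes);  -- aggregate per group
STORE totals INTO 'totals_out';                            -- Pig turns these steps into MapReduce jobs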
Pig installation:
1. Download and extract
Download URL: http://mirror.bjtu.edu.cn/apache/pig/pig-0.9.2/
[grid@gc ~]$ tar xzvf pig-0.9.2.tar.gz
[grid@gc ~]$ pwd
/home/grid
[grid@gc ~]$ ls
abcd Desktop eclipse hadoop hadoop-0.20.2 hadoop-code hbase-0.90.5 input javaer.log javaer.log~ pig-0.9.2 workspace
2. Configuring Pig in local mode
All files and all execution stay on the local machine; this mode is generally used for testing programs.
--Edit the environment variables
[grid@gc ~]$ vi .bash_profile
PATH=$PATH:$HOME/bin:/usr/java/jdk1.6.0_18/bin:/home/grid/pig-0.9.2/bin
JAVA_HOME=/usr #note: this is the directory one level above the java directory
export PATH
export LANG=zh_CN
[grid@gc ~]$ source .bash_profile
--Enter the grunt shell
[grid@gc ~]$ pig -x local
2013-01-09 13:29:10,959 [main] INFO org.apache.pig.Main - Logging error messages to: /home/grid/pig_1357709350959.log
2013-01-09 13:29:13,080 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt>
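In local mode, paths refer to the local file system rather than HDFS, so any local file can be used for a quick smoke test; a minimal sketch using /etc/passwd as a conveniently available input:
grunt> A = LOAD '/etc/passwd' USING PigStorage(':') AS (user:chararray);
grunt> B = FOREACH A GENERATE user;
grunt> DUMP B;   -- prints each user name as a single-field tuple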
3. Configuring Pig in MapReduce mode
This is the mode used for real work.
--Edit the environment variables
PATH=$PATH:$HOME/bin:/usr/java/jdk1.6.0_18/bin:/home/grid/pig-0.9.2/bin:/home/grid/hadoop-0.20.2/bin
export JAVA_HOME=/usr
export PIG_CLASSPATH=/home/grid/pig-0.9.2/conf
export PATH
export LANG=zh_CN
--Enter the grunt shell
[grid@gc ~]$ pig
2013-01-09 13:55:42,303 [main] INFO org.apache.pig.Main - Logging error messages to: /home/grid/pig_1357710942292.log
2013-01-09 13:55:45,432 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://gc:9000
2013-01-09 13:55:47,409 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: gc:9001
grunt>
Note: because Pig operates on HDFS, make sure Hadoop is already running before starting the grunt shell.
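If in doubt, start Hadoop and check its daemons first; a minimal sketch assuming the standard hadoop-0.20.2 scripts are on the PATH and the JDK's jps is available:
[grid@gc ~]$ start-all.sh   # starts the HDFS and MapReduce daemons
[grid@gc ~]$ jps            # NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker should be listed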
4. Ways to run Pig
- Pig script: write the program into a .pig file (see the sketch after this list)
- Grunt: the interactive shell for running Pig commands
- Embedded mode
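A script is just the same Pig Latin statements saved in a file and passed to pig on the command line; a sketch, where the script name iplist.pig and the output path are hypothetical:
[grid@gc ~]$ cat iplist.pig
a = load '/user/grid/access/access.log' using PigStorage(' ') as (ip,a1,a3,a4,a5,a6,a7,a8);
b = foreach a generate ip;
store b into '/user/grid/iplist_out';
[grid@gc ~]$ pig iplist.pig   # runs in map-reduce mode; use pig -x local iplist.pig for local mode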
The Grunt approach:
--Tab completion
grunt> s --press the Tab key
set split store
grunt> l --press the Tab key
load long ls
--The autocomplete file
--The PigPen Eclipse plugin
5. Common Pig commands
grunt> ls
hdfs://gc:9000/user/grid/input/Test_1<r 3> 328
hdfs://gc:9000/user/grid/input/Test_2<r 3> 134
hdfs://gc:9000/user/grid/input/abcd<r 2> 11
grunt> copyToLocal Test_1 ttt
grunt> quit
[grid@gc ~]$ ll ttt
-rwxrwxrwx 1 grid hadoop 328 01-11 05:53 ttt
Pig operation example:
--First create an access directory in HDFS and upload the access_log.txt file into it
grunt> ls
hdfs://gc:9000/user/grid/.Trash <dir>
hdfs://gc:9000/user/grid/input <dir>
hdfs://gc:9000/user/grid/out <dir>
hdfs://gc:9000/user/grid/output <dir>
hdfs://gc:9000/user/grid/output2 <dir>
grunt> pwd
hdfs://gc:9000/user/grid
grunt> mkdir access
grunt> cd access
grunt> copyFromLocal /home/grid/access_log.txt access.log
grunt> ls
hdfs://gc:9000/user/grid/access/access.log<r 2> 7118627
--Load the log file into relation a
grunt> a = load '/user/grid/access/access.log'
>> using PigStorage(' ')
>> as (ip,a1,a3,a4,a5,a6,a7,a8);
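Since no types are given in the AS clause, every field defaults to bytearray; types can be declared explicitly if later operations need them, as in this sketch (only ip is typed, the other column names are kept as placeholders):
grunt> a = load '/user/grid/access/access.log'
>> using PigStorage(' ')
>> as (ip:chararray,a1,a3,a4,a5,a6,a7,a8);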
--Project a, keeping only the ip field
grunt> b = foreach a generate ip;
--Group b by ip
grunt> c = group b by ip;
--Count the rows in each ip group
grunt> d = foreach c generate group,COUNT($1);
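After the GROUP, each tuple of c holds the grouping key in the field named group and a bag of the matching b tuples in the second field, so COUNT($1) counts the requests per IP; an equivalent form that names the bag explicitly is:
grunt> d = foreach c generate group,COUNT(b);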
--Display the result:
grunt> dump d;
2013-01-12 12:07:51,482 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY
2013-01-12 12:07:51,827 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for a: $1, $2, $3, $4, $5, $6, $7
2013-01-12 12:07:54,727 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2013-01-12 12:07:54,775 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
2013-01-12 12:07:55,003 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2013-01-12 12:07:55,007 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2013-01-12 12:07:56,316 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2013-01-12 12:07:56,683 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2013-01-12 12:07:56,701 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job9027177661900375605.jar
2013-01-12 12:08:12,923 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job9027177661900375605.jar created
2013-01-12 12:08:13,040 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2013-01-12 12:08:13,359 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=7118627
2013-01-12 12:08:13,360 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 1
2013-01-12 12:08:13,616 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2013-01-12 12:08:14,164 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2013-01-12 12:08:19,125 [Thread-21] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2013-01-12 12:08:19,154 [Thread-21] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2013-01-12 12:08:19,231 [Thread-21] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2013-01-12 12:08:30,207 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201301091247_0001
2013-01-12 12:08:30,208 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://gc:50030/jobdetails.jsp?jobid=job_201301091247_0001
2013-01-12 12:10:28,459 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 6% complete
2013-01-12 12:10:34,050 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 29% complete
2013-01-12 12:10:38,567 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2013-01-12 12:11:28,357 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2013-01-12 12:11:28,367 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.2 0.9.2 grid 2013-01-12 12:07:56 2013-01-12 12:11:28 GROUP_BY
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_201301091247_0001 1 1 58 58 58 25 25 25 a,b,c,d GROUP_BY,COMBINER hdfs://gc:9000/tmp/temp-1148213696/tmp-241551689,
Input(s):
Successfully read 28134 records (7118627 bytes) from: "/user/grid/access/access.log"
Output(s):
Successfully stored 476 records (14039 bytes) in: "hdfs://gc:9000/tmp/temp-1148213696/tmp-241551689"
Counters:
Total records written : 476
Total bytes written : 14039
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201301091247_0001
2013-01-12 12:11:28,419 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2013-01-12 12:11:28,760 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2013-01-12 12:11:28,761 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(127.0.0.1,2)
(1.59.65.67,2)
(112.4.2.19,9)
(112.4.2.51,80)
(60.2.99.33,42)
(69.28.58.5,1)
(69.28.58.6,9)
(69.28.58.8,5)
(1.193.3.227,3)
(1.202.221.3,6)
(117.136.9.4,6)
(121.31.62.3,26)
(182.204.8.4,59)
(183.9.112.2,25)
(221.12.37.6,25)
(223.4.16.88,2)
(27.9.110.75,122)
(61.189.63.2,24)
(69.28.58.12,3)
(111.161.72.7,1)
(117.136.12.6,61)
(117.136.19.9,4)
(117.136.2.98,1)
(117.136.20.3,1)
(117.136.20.9,1)
(117.136.3.46,5)
(117.136.4.18,5)
(117.136.4.19,1)
(117.136.5.39,9)
(117.136.5.70,1)
(117.136.5.73,17)
(117.136.7.67,5)
(117.136.8.11,32)
(117.136.8.48,1)
(117.136.8.52,1)
(117.136.8.97,2)
(117.136.9.52,2)
(117.136.9.68,7)
(117.24.22.57,2)
(121.28.95.48,1597)
(122.96.75.15,6)
(124.42.114.9,1)
(125.46.45.78,236)
(125.88.73.21,1)
(182.48.112.2,870)
(183.12.74.40,31)
(207.46.13.95,2)
(210.51.195.5,77)
(218.1.115.55,1)
(218.5.72.173,3)
(27.188.55.59,96)
(27.9.230.128,25)
(59.41.62.100,339)
(61.213.92.56,2)
(61.55.185.38,25)
(61.55.185.61,25)
(61.55.186.17,14)
(61.55.186.20,14)
(61.55.186.22,14)
(65.52.108.66,2)
(71.45.41.139,45)
(72.14.202.80,4)
(72.14.202.81,2)
(72.14.202.82,4)
(72.14.202.83,3)
(72.14.202.84,2)
(72.14.202.85,5)
(72.14.202.86,2)
(72.14.202.87,3)
(72.14.202.89,2)
(72.14.202.90,1)
(72.14.202.91,5)
(72.14.202.92,3)
(72.14.202.93,2)
(72.14.202.94,2)
(72.14.202.95,5)
(74.125.16.80,1)
(89.126.54.40,305)
(99.76.10.239,60)
(1.192.138.149,31)
(110.16.198.88,27)
(110.17.170.72,1)
(111.161.72.31,25)
(112.101.64.91,3)
(112.224.3.119,58)
(112.242.79.68,24)
(112.64.190.54,25)
(112.97.192.54,1)
(112.97.24.116,7)
(112.97.24.178,15)
(112.97.24.243,18)
(114.80.93.215,1)
(114.81.255.37,5)
(116.116.8.161,46)
(116.228.79.74,1)
(116.30.81.181,111)
(116.48.155.51,3)
(116.7.101.166,97)
(117.136.1.247,1)
(117.136.10.44,5)
(117.136.10.51,6)
(117.136.10.53,1)
(117.136.12.91,1)
(117.136.14.41,1)
(117.136.14.45,1)
(117.136.14.78,13)
(117.136.15.59,1)
(117.136.15.96,25)
(117.136.19.11,1)
(117.136.2.131,7)
(117.136.2.142,1)
(117.136.2.230,1)
(117.136.2.237,37)
(117.136.20.10,1)
(117.136.20.78,55)
(117.136.20.86,17)
(117.136.22.31,3)
(117.136.23.47,7)
(117.136.24.85,32)
(117.136.24.98,1)
(117.136.25.43,9)
(117.136.26.16,1)
(117.136.30.47,9)
(117.136.30.66,1)
(117.136.30.79,8)
(117.136.31.34,1)
(117.136.31.57,5)
(117.136.31.94,4)
(117.136.32.23,30)
(117.136.33.82,1)
(117.136.33.83,4)
(117.136.33.84,1)
(117.136.33.85,1)
(117.136.33.86,1)
(117.136.33.87,1)
(117.136.35.46,1)
(117.136.36.15,1)
(117.136.36.63,1)
(117.136.5.208,5)
(117.136.5.221,9)
(117.136.6.232,8)
(117.136.8.186,24)
(117.136.9.198,9)
(117.136.9.222,1)
(117.26.107.22,4)
(119.164.105.9,53)
(120.197.26.43,9)
(120.68.17.229,26)
(120.84.24.200,773)
(121.14.162.28,125)
(121.14.77.216,1)
(121.14.95.213,11)
(121.41.128.23,27)
(122.118.190.5,1)
(122.241.54.67,1)
(122.89.138.26,54)
(123.125.71.15,1)
(123.125.71.96,1)
(124.115.0.111,7)
(124.115.0.169,13)
(124.72.71.149,1)
(124.74.27.218,84)
(125.77.31.163,25)
(125.89.75.100,5)
(14.217.19.126,80)
(157.55.17.200,3)
(180.95.186.78,26)
(183.13.196.98,11)
(183.60.140.16,1)
(183.60.193.30,2)
(202.194.31.45,1)
(210.72.33.200,3)
(211.137.59.23,4)
(211.137.59.33,43)
(211.139.92.11,40)
(211.140.5.100,34)
(211.140.5.103,43)
(211.140.5.114,9)
(211.140.5.116,6)
(211.140.5.122,1)
(211.140.7.199,4)
(211.140.7.200,4)
(211.141.86.10,9)
(218.1.102.166,27)
(218.16.245.42,24)
(218.19.42.168,181)
(218.20.24.203,4597)
(218.205.245.7,1)
(218.213.137.2,28)
(220.231.59.40,1)
(221.176.4.134,8)
(222.170.20.46,11)
(222.186.17.98,2)
(222.73.191.55,124)
(222.73.75.245,7)
(27.115.124.75,470)
(58.242.249.66,2)
(58.249.34.251,7)
(59.151.120.36,31)
(59.151.120.38,59)
(59.61.141.119,26)
(60.247.116.29,28)
(61.154.14.122,61)
(61.155.206.81,165)
(61.164.72.118,27)
(61.50.174.137,1)
(65.52.109.151,3)
(66.249.71.135,19)
(66.249.71.136,16)
(66.249.71.137,14)
(72.14.199.185,4)
(72.14.199.186,4)
(72.14.199.187,2)
(72.30.142.220,3)
(110.75.174.219,1)
(110.75.174.221,1)
(110.75.174.223,1)
(112.64.188.188,10)
(112.64.188.217,7)
(112.64.190.235,16)
(112.64.190.237,9)
(112.64.191.122,4)
(113.57.218.226,45)
(113.90.101.196,22)
(114.106.216.63,24)
(114.215.28.225,2)
(114.247.10.132,243)
(114.43.237.117,167)
(114.98.146.181,26)
(115.168.51.143,8)
(115.168.76.178,3)
(115.236.48.226,439)
(116.235.194.89,171)
(117.135.129.28,8)
(117.135.129.58,2)
(117.135.129.59,7)
(117.136.10.141,30)
(117.136.10.158,21)
(117.136.10.180,9)
(117.136.10.186,4)
(117.136.11.131,1)
(117.136.11.145,1)
(117.136.11.190,1)
(117.136.12.147,1)
(117.136.12.183,4)
(117.136.12.192,32)
(117.136.12.206,4)
(117.136.12.209,4)
(117.136.15.110,1)
(117.136.15.146,5)
(117.136.16.131,1)
(117.136.16.142,1)
(117.136.16.201,30)
(117.136.16.203,1)
(117.136.19.105,10)
(117.136.19.148,7)
(117.136.19.198,1)
(117.136.23.130,3)
(117.136.23.238,30)
(117.136.23.253,4)
(117.136.24.130,1)
(117.136.24.131,6)
(117.136.24.200,1)
(117.136.24.201,21)
(117.136.26.137,1)
(117.136.27.251,1)
(117.136.30.147,3)
(117.136.30.152,5)
(117.136.31.144,1647)
(117.136.31.147,65)
(117.136.31.149,1)
(117.136.31.150,1)
(117.136.31.152,6)
(117.136.31.158,7)
(117.136.31.177,1)
(117.136.33.188,1)
(117.136.33.206,5)
(117.136.37.132,1)
(118.192.33.111,4)
(119.146.220.12,1850)
(120.204.201.77,5)
(121.14.162.124,124)
(121.28.205.250,42)
(123.120.41.159,2)
(123.124.240.11,1)
(123.147.244.39,37)
(124.115.10.252,1)
(124.207.169.57,1)
(124.207.169.59,3)
(124.238.242.36,13)
(124.238.242.43,18)
(124.238.242.47,26)
(124.238.242.65,13)
(124.238.242.68,13)
(14.153.238.175,2)
(14.213.176.184,133)
(159.226.202.12,2)
(159.226.202.13,2)
(175.136.16.158,2)
(180.153.201.34,12)
(180.153.201.35,9)
(180.153.227.27,3)
(180.153.227.28,4)
(180.153.227.29,5)
(180.153.227.31,2)
(180.153.227.32,3)
(180.153.227.34,2)
(180.153.227.36,2)
(180.153.227.37,4)
(180.153.227.40,2)
(180.153.227.41,3)
(180.153.227.42,1)
(180.153.227.44,3)
(180.153.227.45,1)
(180.153.227.47,1)
(180.153.227.52,1)
(180.153.227.53,5)
(180.153.227.54,1)
(180.153.227.55,6)
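The same analysis can also be packaged into a script and extended to sort by request count and keep only the busiest clients; a sketch, where the script name topips.pig and the output path are hypothetical:
a = load '/user/grid/access/access.log' using PigStorage(' ') as (ip,a1,a3,a4,a5,a6,a7,a8);
b = foreach a generate ip;
c = group b by ip;
d = foreach c generate group, COUNT($1) as hits;
e = order d by hits desc;              -- sort by request count, highest first
f = limit e 10;                        -- keep the 10 busiest IPs
store f into '/user/grid/access/top_ips';
Run it with: [grid@gc ~]$ pig topips.pig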