In my previous blogs, I explored consuming data in a variety of formats in the big data world. Today we will walk through analyzing data with Apache Pig and storing the result in HBase in a few easy steps. In the next post, I will demonstrate how to access the same data from Hive through a simple SQL query.
Let us assume we have the following file, which we will process with Pig:
cust_id,last_name,first_name,age,skill_type,skill_set
--------- ------------ ------------- ---- ------------ ----------
1001,jones,mike,40,Tech Skills,java-perl-ruby-linux-oracle
1006,persons,lee,50,Soft-Managerial Skills,Team Lead-Manager
1002,woods,craig,44,Tech Skills,c,c++,java-sql-oracle-teradata-dwh
Note: my file does not include the header row; it is shown above only for illustration.
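Before loading anything, notice that row 1002 embeds commas inside its skill_set (c,c++,...), while we are about to split on commas. Here is a small Python sketch (hypothetical, not part of the Pig flow) of how PigStorage(',') will carve up that line against our six-field schema:

```python
# Field names mirror the AS schema used in the LOAD statement below.
SCHEMA = ["cust_id", "last_name", "first_name", "age", "skill_type", "skill_set"]

def pig_split(line, num_fields=len(SCHEMA)):
    """Split on every comma, then keep only as many fields as the schema
    declares; Pig silently drops the extra trailing fields."""
    parts = line.split(",")
    return dict(zip(SCHEMA, parts[:num_fields]))

row = pig_split("1002,woods,craig,44,Tech Skills,c,c++,java-sql-oracle-teradata-dwh")
# The embedded commas shift the split: skill_set ends up as just "c", and
# "c++" and "java-sql-oracle-teradata-dwh" are dropped.
print(row["skill_set"])  # "c"
```

Keep this in mind; we will see the consequence when we scan the table in step 4.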
Let us go step by step.
1. Create the HBase table with two column families, cust_info and cust_prof, for the information and profile attributes:
hbase > create 'cust_master','cust_info','cust_prof'
hbase > list
2. Now, load the data into Pig:
grunt> a = LOAD '/user/cloudera/pig_hbase/cust_info.txt' USING PigStorage(',') AS ( cust_id:chararray,
last_name:chararray,
first_name:chararray,
age:int,
skill_type:chararray,
skill_set:chararray );
After the load succeeds, you can verify the relation with a dump:
grunt> dump a;
3. Now store the relation a into the HBase table. HBaseStorage uses the first field of each tuple (cust_id here) as the row key, which is why it does not appear in the column list:
grunt> STORE a INTO 'hbase://cust_master' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'cust_info:last_name
cust_info:first_name
cust_info:age
cust_prof:skill_type
cust_prof:skill_set'
);
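To make the mapping concrete, here is a hypothetical Python sketch of what HBaseStorage does with each tuple: the first field becomes the row key, and the remaining fields are paired, in order, with the family:qualifier columns listed in the STORE statement.

```python
# Column list exactly as given to HBaseStorage, in order.
COLUMNS = ["cust_info:last_name", "cust_info:first_name", "cust_info:age",
           "cust_prof:skill_type", "cust_prof:skill_set"]

def to_hbase_cells(record):
    """record is a Pig-style tuple:
    (cust_id, last_name, first_name, age, skill_type, skill_set).
    Returns (row_key, {family:qualifier -> value})."""
    row_key, values = record[0], record[1:]
    return row_key, dict(zip(COLUMNS, values))

row_key, cells = to_hbase_cells(("1001", "jones", "mike", "40", "Tech Skills",
                                 "java-perl-ruby-linux-oracle"))
print(row_key)                       # "1001"
print(cells["cust_prof:skill_set"])  # "java-perl-ruby-linux-oracle"
```

This is only a model of the semantics; the actual write goes through the HBase client inside the Pig job.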
The data should now be loaded into the HBase table cust_master.
4. Let us go and check in the HBase shell:
hbase(main):004:0> scan 'cust_master'
ROW        COLUMN+CELL
 1001      column=cust_info:age, timestamp=1438109850189, value=40
 1001      column=cust_info:first_name, timestamp=1438109850189, value=mike
 1001      column=cust_info:last_name, timestamp=1438109850189, value=jones
 1001      column=cust_prof:skill_set, timestamp=1438109850189, value=java-perl-ruby-linux-oracle
 1001      column=cust_prof:skill_type, timestamp=1438109850189, value=Tech Skills
 1002      column=cust_info:age, timestamp=1438109850218, value=44
 1002      column=cust_info:first_name, timestamp=1438109850218, value=craig
 1002      column=cust_info:last_name, timestamp=1438109850218, value=woods
 1002      column=cust_prof:skill_set, timestamp=1438109850218, value=c
 1002      column=cust_prof:skill_type, timestamp=1438109850218, value=Tech Skills
 1006      column=cust_info:age, timestamp=1438109850217, value=50
 1006      column=cust_info:first_name, timestamp=1438109850217, value=lee
 1006      column=cust_info:last_name, timestamp=1438109850217, value=persons
 1006      column=cust_prof:skill_set, timestamp=1438109850217, value=Team Lead-Manager
 1006      column=cust_prof:skill_type, timestamp=1438109850217, value=Soft-Managerial Skills
3 row(s) in 0.3510 seconds
Note that row 1002's skill_set is just c: PigStorage(',') split the embedded commas in c,c++,java-sql-oracle-teradata-dwh into separate fields, and the fields beyond the declared schema were dropped. Choosing a delimiter that does not appear in the data would avoid this.
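As a quick sanity check, you can group the scan output back into rows. The helper below (a hypothetical parser, not an HBase API) handles shell lines of the form `<rowkey> column=<family:qualifier>, timestamp=<ts>, value=<value>`:

```python
import re

# Matches one cell line from the HBase shell scan output.
LINE_RE = re.compile(r"^\s*(\S+)\s+column=([^,]+), timestamp=(\d+), value=(.*)$")

def parse_scan(lines):
    """Group scan output cells into {row_key: {family:qualifier -> value}}."""
    rows = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip headers and the row-count footer
        row_key, column, _ts, value = m.groups()
        rows.setdefault(row_key, {})[column] = value
    return rows

rows = parse_scan([
    " 1001 column=cust_info:age, timestamp=1438109850189, value=40",
    " 1001 column=cust_info:first_name, timestamp=1438109850189, value=mike",
    " 1002 column=cust_prof:skill_set, timestamp=1438109850218, value=c",
])
print(len(rows))  # 2 distinct row keys in this sample
```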
In my next blog, we will create a Hive table on top of this HBase table to make the data easy to query with SQL. Stay tuned!