Tuesday, July 28, 2015

HBase - Loading data using Apache Pig

In my previous blogs, I explored consuming data in a variety of formats in the big data world. Today we will walk through analyzing data with Apache Pig and storing it in HBase in a few easy steps. In the next post, I will demonstrate how to access the same data through Hive with a simple SQL query.

Let us assume we have the following file, which we will process with Pig:

cust_id,last_name,first_name,age,skill_type,skill_set
--------- ------------ ------------- ---- ------------ ----------
1001,jones,mike,40,Tech Skills,java-perl-ruby-linux-oracle
1006,persons,lee,50,Soft-Managerial Skills,Team Lead-Manager
1002,woods,craig,44,Tech Skills,c,c++,java-sql-oracle-teradata-dwh


Note:  My file does not include the header; I am showing it here only for demonstration purposes. Also note that row 1002 embeds commas in its skill_set (c,c++,...). Since PigStorage(',') splits on every comma, the fields beyond the declared schema get dropped, so that row's skill_set ends up as just "c" — you will see this again in the HBase scan at the end.


Let us go step by step.


1.  Let us create the HBase table with two column families, cust_info and cust_prof, for the information and profile attributes:

hbase >  create 'cust_master','cust_info','cust_prof'

hbase >  list
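If you want to double-check the layout, the describe command in the HBase shell prints the column families and their settings (output not shown here, as it varies by HBase version):

hbase >  describe 'cust_master'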


2.  Now, let us load the data into Pig:

grunt > a = LOAD '/user/cloudera/pig_hbase/cust_info.txt' USING PigStorage(',') AS ( cust_id:chararray,
last_name:chararray,
first_name:chararray,
age:int,
skill_type:chararray,
skill_set:chararray );

After the load succeeds, you can confirm the schema that Pig inferred, as shown below.
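A quick schema check with describe; the schema line below is what Pig prints for the LOAD statement above:

grunt> describe a;
a: {cust_id: chararray,last_name: chararray,first_name: chararray,age: int,skill_type: chararray,skill_set: chararray}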


You can also verify the actual records:

grunt> dump a;
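If everything went well, dump should print tuples roughly like the ones below (the order may differ, since dump runs a MapReduce job; note how row 1002's skill_set is cut down to "c" by the comma splitting explained in the note above):

(1001,jones,mike,40,Tech Skills,java-perl-ruby-linux-oracle)
(1006,persons,lee,50,Soft-Managerial Skills,Team Lead-Manager)
(1002,woods,craig,44,Tech Skills,c)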


3.  Now store the relation into the HBase table. HBaseStorage maps the tuple's fields, in order, to the listed column names; the first field (cust_id) becomes the row key, so it does not appear in the mapping:

grunt >  STORE a INTO 'hbase://cust_master' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'cust_info:last_name cust_info:first_name cust_info:age cust_prof:skill_type cust_prof:skill_set'
);


The data should now be loaded into the HBase table cust_master.
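As a quick sanity check, you can also read the table back into Pig. This is just a sketch using the same HBaseStorage column mapping as above, with the documented -loadKey option so the row key comes back as the first field:

grunt > b = LOAD 'hbase://cust_master' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'cust_info:last_name cust_info:first_name cust_info:age cust_prof:skill_type cust_prof:skill_set',
'-loadKey true' ) AS ( cust_id:chararray,
last_name:chararray,
first_name:chararray,
age:int,
skill_type:chararray,
skill_set:chararray );

grunt > dump b;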


4.  Let us go and check in the HBase shell:

hbase(main):004:0> scan 'cust_master'

ROW         COLUMN+CELL
 1001       column=cust_info:age, timestamp=1438109850189, value=40
 1001       column=cust_info:first_name, timestamp=1438109850189, value=mike
 1001       column=cust_info:last_name, timestamp=1438109850189, value=jones
 1001       column=cust_prof:skill_set, timestamp=1438109850189, value=java-perl-ruby-linux-oracle
 1001       column=cust_prof:skill_type, timestamp=1438109850189, value=Tech Skills
 1002       column=cust_info:age, timestamp=1438109850218, value=44
 1002       column=cust_info:first_name, timestamp=1438109850218, value=craig
 1002       column=cust_info:last_name, timestamp=1438109850218, value=woods
 1002       column=cust_prof:skill_set, timestamp=1438109850218, value=c
 1002       column=cust_prof:skill_type, timestamp=1438109850218, value=Tech Skills
 1006       column=cust_info:age, timestamp=1438109850217, value=50
 1006       column=cust_info:first_name, timestamp=1438109850217, value=lee
 1006       column=cust_info:last_name, timestamp=1438109850217, value=persons
 1006       column=cust_prof:skill_set, timestamp=1438109850217, value=Team Lead-Manager
 1006       column=cust_prof:skill_type, timestamp=1438109850217, value=Soft-Managerial Skills
3 row(s) in 0.3510 seconds
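To fetch a single customer instead of scanning the whole table, the shell's get command works as usual (output omitted):

hbase(main):005:0> get 'cust_master', '1001'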



In my next blog, we will see how to create a Hive table on top of this HBase table to easily query the data stored in HBase. Stay tuned.
