Tuesday, July 28, 2015

HBase - Querying through Hive Tables


 In my previous blog (link), we discussed how to load and process data using Apache Pig and finally store it in an HBase table.  In this article we are going to explore how to query that HBase table using Hive, which gives the end-users in your organization an easy, SQL-like way to access the data.

 Many of you may ask why store the data in HBase and then query it through Hive, rather than storing the data directly in Hive tables.  The reason is simple: as the data grows huge, ad-hoc queries against plain Hive tables become very slow, and at some point it makes little sense to query them directly.   HBase, on the other hand, promises fast random access to the data by row key.
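
 For a quick illustration of what that random access looks like, a single customer row can be fetched directly by its row key from the HBase shell (assuming the cust_master table built in the previous post):

hbase > get 'cust_master', '1001'

 A lookup like this touches only the cells for that key rather than scanning the whole table, which is why it stays fast as the data grows.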


 Now let's get to the technical part..

 1. Let us create a Hive table using the HBaseStorageHandler as shown below

hive> 
    > 
    > CREATE EXTERNAL TABLE customers(
    >  customer_id string,
    >  last_name string,
    >  first_name string,
    >  age int,
    >  skill_type string,
    >  skill_set string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cust_info:last_name,cust_info:first_name,cust_info:age,cust_prof:skill_type,cust_prof:skill_set')
    > TBLPROPERTIES ('hbase.table.name' = 'cust_master');
OK

Time taken: 14.419 seconds


    The key here is to map the Hive table columns, in order, to the corresponding HBase columns with the right column family and qualifier, as shown above in the  WITH SERDEPROPERTIES  section.  The first entry, :key, maps the first Hive column (customer_id) to the HBase row key, and each following entry maps the next Hive column to a family:qualifier pair.
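
 If you want to double-check how the table is wired up, Hive can print the storage handler and SerDe properties back to you; a quick sanity check (the exact output layout varies by Hive version):

hive> DESCRIBE FORMATTED customers;

 The output should list org.apache.hadoop.hive.hbase.HBaseStorageHandler and echo back the hbase.columns.mapping you supplied.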


2.  Now you are good to query the data..

hive> select first_name,last_name,skill_type from customers where age >40;
OK
craig woods Tech Skills
lee persons Soft-Managerial Skills
Time taken: 0.463 seconds, Fetched: 2 row(s)
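
 Because the first Hive column is mapped to the HBase row key, a lookup on that column is the cheapest possible query; Hive can typically push the key predicate down to HBase instead of scanning the full table. A small sketch using the customer_id column defined above:

hive> select * from customers where customer_id = '1001';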



Hope this helps, see you next time with another interesting snippet.





HBase - Loading data using Apache Pig

In my previous blogs, I have been exploring data consumption with a variety of data formats in the big data world. Today we will explore the process of analyzing the data using Apache Pig and finally storing it in HBase in a few easy steps.  In the next post, I will demonstrate how to access the same data using Hive through a simple SQL query.

Let us assume that we have the following file, which we will process using Pig..

cust_id,last_name,first_name,age,skill_type,skill_set
--------- ------------ ------------- ---- ------------ ----------
1001,jones,mike,40,Tech Skills,java-perl-ruby-linux-oracle
1006,persons,lee,50,Soft-Managerial Skills,Team Lead-Manager
1002,woods,craig,44,Tech Skills,c-c++-java-sql-oracle-teradata-dwh


Note:  My actual file does not include the header row. I am showing it here just for demonstration purposes.


Let us go step by step..


1.  Let us create the HBase table with two column families, cust_info & cust_prof, for the information and profile attributes

hbase >  create 'cust_master','cust_info','cust_prof'

hbase >  list
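
 You can also confirm that both column families were created with the shell's describe command (an optional sanity check):

hbase >  describe 'cust_master'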


2.  Now, let us load the data into Pig

grunt > a = LOAD '/user/cloudera/pig_hbase/cust_info.txt' USING PigStorage(',') AS ( cust_id:chararray,
last_name:chararray,
first_name:chararray,
age:int,
skill_type:chararray,
skill_set:chararray );

 Once the load is successful, you can verify the contents of the relation as shown below..

grunt> dump a;
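
 It is also worth confirming that Pig picked up the schema you declared, since the field order of the relation drives the HBase column mapping in the next step:

grunt> describe a;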


3.  Now store the data from the Pig relation into the HBase table..  Note that the first field of the relation (cust_id) becomes the HBase row key, which is why it does not appear in the column list passed to HBaseStorage.

 grunt >  STORE a INTO 'hbase://cust_master' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'cust_info:last_name
 cust_info:first_name
 cust_info:age
 cust_prof:skill_type
 cust_prof:skill_set'
);
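
 If you also want to do a bit of processing before the load, any Pig transformation can be applied to the relation first and the result stored instead; a minimal sketch, assuming you only wanted the customers with technical skills:

 grunt >  tech = FILTER a BY skill_type == 'Tech Skills';
 grunt >  dump tech;

 The filtered relation tech can then be stored with the same HBaseStorage call shown above.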


  Now the data should be loaded into the HBase table  cust_master


4.  Let's go and check in HBase ..

hbase(main):004:0> scan 'cust_master'

ROW                 COLUMN+CELL
 1001               column=cust_info:age, timestamp=1438109850189, value=40
 1001               column=cust_info:first_name, timestamp=1438109850189, value=mike
 1001               column=cust_info:last_name, timestamp=1438109850189, value=jones
 1001               column=cust_prof:skill_set, timestamp=1438109850189, value=java-perl-ruby-linux-oracle
 1001               column=cust_prof:skill_type, timestamp=1438109850189, value=Tech Skills
 1002               column=cust_info:age, timestamp=1438109850218, value=44
 1002               column=cust_info:first_name, timestamp=1438109850218, value=craig
 1002               column=cust_info:last_name, timestamp=1438109850218, value=woods
 1002               column=cust_prof:skill_set, timestamp=1438109850218, value=c-c++-java-sql-oracle-teradata-dwh
 1002               column=cust_prof:skill_type, timestamp=1438109850218, value=Tech Skills
 1006               column=cust_info:age, timestamp=1438109850217, value=50
 1006               column=cust_info:first_name, timestamp=1438109850217, value=lee
 1006               column=cust_info:last_name, timestamp=1438109850217, value=persons
 1006               column=cust_prof:skill_set, timestamp=1438109850217, value=Team Lead-Manager
 1006               column=cust_prof:skill_type, timestamp=1438109850217, value=Soft-Managerial Skills
3 row(s) in 0.3510 seconds



 In my next blog, we will see how to create a Hive table on top of the HBase table for easy querying of the data stored in HBase...  stay tuned.