Tuesday, July 28, 2015

HBase - Querying through Hive Tables


 In my previous blog (link), we discussed how to load and process data using Apache Pig and finally store it in an HBase table.  In this article we are going to explore how to query that HBase table using Hive, which gives the end-users in your organization an easy, SQL-like way to access the data.

 Many of you may ask why store the data in HBase and then query it through Hive, rather than storing the data directly in Hive tables.  The reason is simple: as the data grows huge, ad-hoc queries against plain Hive tables become very slow, and at some point it makes little sense to query them directly.   HBase, on the other hand, promises fast random access to the data by row key.
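
 For a quick illustration of what that random access looks like, a single customer row can be fetched directly by its row key from the HBase shell (assuming the cust_master table built in the previous post):

hbase > get 'cust_master', '1001'

 A lookup like this touches only the cells for that key rather than scanning the whole table, which is why it stays fast as the data grows.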


 Now let's get to the technical part..

 1. Let us create a Hive table using the HBaseStorageHandler as shown below

hive> 
    > 
    > CREATE EXTERNAL TABLE customers(
    >  customer_id string,
    >  last_name string,
    >  first_name string,
    >  age int,
    >  skill_type string,
    >  skill_set string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cust_info:last_name,cust_info:first_name,cust_info:age,cust_prof:skill_type,cust_prof:skill_set')
    > TBLPROPERTIES ('hbase.table.name' = 'cust_master');
OK

Time taken: 14.419 seconds


    The key here is to map the Hive table columns, in order, to the corresponding HBase columns with the right column family and qualifier, as shown above in the  WITH SERDEPROPERTIES  section.  The first entry, :key, maps the first Hive column (customer_id) to the HBase row key, and each following entry maps the next Hive column to a family:qualifier pair.
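
 If you want to double-check how the table is wired up, Hive can print the storage handler and SerDe properties back to you; a quick sanity check (the exact output layout varies by Hive version):

hive> DESCRIBE FORMATTED customers;

 The output should list org.apache.hadoop.hive.hbase.HBaseStorageHandler and echo back the hbase.columns.mapping you supplied.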


2.  Now you are good to query the data..

hive> select first_name,last_name,skill_type from customers where age >40;
OK
craig woods Tech Skills
lee persons Soft-Managerial Skills
Time taken: 0.463 seconds, Fetched: 2 row(s)
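
 Because the first Hive column is mapped to the HBase row key, a lookup on that column is the cheapest possible query; Hive can typically push the key predicate down to HBase instead of scanning the full table. A small sketch using the customer_id column defined above:

hive> select * from customers where customer_id = '1001';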



Hope this helps, see you next time with another interesting snippet.





HBase - Loading data using Apache Pig

In my previous blogs, I have been exploring data consumption with a variety of data formats in the big data world. Today we will explore the process of analyzing the data using Apache Pig and finally storing it in HBase in a few easy steps.  In the next post, I will demonstrate how to access the same data using Hive through a simple SQL query.

Let us assume that we have the following file, which we will process using Pig..

cust_id,last_name,first_name,age,skill_type,skill_set
--------- ------------ ------------- ---- ------------ ----------
1001,jones,mike,40,Tech Skills,java-perl-ruby-linux-oracle
1006,persons,lee,50,Soft-Managerial Skills,Team Lead-Manager
1002,woods,craig,44,Tech Skills,c-c++-java-sql-oracle-teradata-dwh


Note:  My actual file does not include the header row. I am showing it here just for demonstration purposes.


Let us go step by step..


1.  Let us create the HBase table with two column families, cust_info & cust_prof, for the information and profile attributes

hbase >  create 'cust_master','cust_info','cust_prof'

hbase >  list
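
 You can also confirm that both column families were created with the shell's describe command (an optional sanity check):

hbase >  describe 'cust_master'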


2.  Now, let us load the data into Pig

grunt > a = LOAD '/user/cloudera/pig_hbase/cust_info.txt' USING PigStorage(',') AS ( cust_id:chararray,
last_name:chararray,
first_name:chararray,
age:int,
skill_type:chararray,
skill_set:chararray );

 Once the load is successful, you can verify the contents of the relation as shown below..

grunt> dump a;
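
 It is also worth confirming that Pig picked up the schema you declared, since the field order of the relation drives the HBase column mapping in the next step:

grunt> describe a;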


3.  Now store the data from the Pig relation into the HBase table..  Note that the first field of the relation (cust_id) becomes the HBase row key, which is why it does not appear in the column list passed to HBaseStorage.

 grunt >  STORE a INTO 'hbase://cust_master' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'cust_info:last_name
 cust_info:first_name
 cust_info:age
 cust_prof:skill_type
 cust_prof:skill_set'
);
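
 If you also want to do a bit of processing before the load, any Pig transformation can be applied to the relation first and the result stored instead; a minimal sketch, assuming you only wanted the customers with technical skills:

 grunt >  tech = FILTER a BY skill_type == 'Tech Skills';
 grunt >  dump tech;

 The filtered relation tech can then be stored with the same HBaseStorage call shown above.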


  Now the data should be loaded into the HBase table  cust_master


4.  Let's go and check in HBase ..

hbase(main):004:0> scan 'cust_master'

ROW                 COLUMN+CELL
 1001               column=cust_info:age, timestamp=1438109850189, value=40
 1001               column=cust_info:first_name, timestamp=1438109850189, value=mike
 1001               column=cust_info:last_name, timestamp=1438109850189, value=jones
 1001               column=cust_prof:skill_set, timestamp=1438109850189, value=java-perl-ruby-linux-oracle
 1001               column=cust_prof:skill_type, timestamp=1438109850189, value=Tech Skills
 1002               column=cust_info:age, timestamp=1438109850218, value=44
 1002               column=cust_info:first_name, timestamp=1438109850218, value=craig
 1002               column=cust_info:last_name, timestamp=1438109850218, value=woods
 1002               column=cust_prof:skill_set, timestamp=1438109850218, value=c-c++-java-sql-oracle-teradata-dwh
 1002               column=cust_prof:skill_type, timestamp=1438109850218, value=Tech Skills
 1006               column=cust_info:age, timestamp=1438109850217, value=50
 1006               column=cust_info:first_name, timestamp=1438109850217, value=lee
 1006               column=cust_info:last_name, timestamp=1438109850217, value=persons
 1006               column=cust_prof:skill_set, timestamp=1438109850217, value=Team Lead-Manager
 1006               column=cust_prof:skill_type, timestamp=1438109850217, value=Soft-Managerial Skills
3 row(s) in 0.3510 seconds



 In my next blog, we will see how to create a Hive table on top of the HBase table for easy querying of the data stored in HBase...  stay tuned.