Sunday, September 13, 2015

Using ORC - Hive and SparkSQL

A few days ago, in my previous articles, I discussed the advanced file formats like Parquet and ORC and the performance improvements they bring to the Hadoop ecosystem. We also saw what the Parquet format offers. Today I would like to explore the ORC format in more depth and see how it works with Hive and Spark SQL.


Again, I am going to use the same data set (NYC taxi) and create an ORC_NYCTAXI table so that I can compare it with TXT_NYCTAXI.


I am going to use Spark SQL to create the ORC_NYCTAXI table, just to show how we can integrate Spark SQL with Hive/HCatalog.

Initiate the Spark program with a Hive context as shown below.


import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._

val hiveC = new org.apache.spark.sql.hive.HiveContext(sc)



Create the ORC-format table:

hiveC.sql("create table ORC_NYCTAXI stored as orc as select * from TXT_NYCTAXI")












Now let us see how easily we can access ORC-format data files on HDFS from Spark SQL, without going through the Hive catalog at all.



import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._

val hiveC = new org.apache.spark.sql.hive.HiveContext(sc)
val load_orcnyc = hiveC.orcFile("/apps/hive/warehouse/orc_nyctaxi")   // read the ORC files straight from the warehouse directory
load_orcnyc.registerTempTable("ORCNycTaxi")                           // expose them as a temporary table for SQL
hiveC.sql("select * from ORCNycTaxi").take(5).foreach(println)        // sample the first five rows












This shows how easy and flexible it is to read data in ORC format: since the files carry their own schema, there is no need to remember the schema definition, and you can still work on the data with very few lines of code using Spark SQL.
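For example, you can ask Spark to print the schema it picked up from the ORC files loaded above:

load_orcnyc.printSchema()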




Now, let us examine the storage on HDFS.
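One way to compare the footprint of the two tables from the same Spark shell is through the Hadoop FileSystem API. A minimal sketch, assuming the default warehouse layout shown above (the txt_nyctaxi path is an assumption; adjust both paths to your environment):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
// total bytes under each table's warehouse directory
for (dir <- Seq("/apps/hive/warehouse/txt_nyctaxi", "/apps/hive/warehouse/orc_nyctaxi")) {
  val mb = fs.getContentSummary(new Path(dir)).getLength / (1024.0 * 1024.0)
  println(f"$dir%s : $mb%.1f MB")
}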












The ORC data comes to about 17%-18% of the original size, whereas we saw earlier that the Parquet format came to about 30% of the original size. So, for this data set, ORC achieves a better compression ratio than Parquet.


Now let us look at the query performance.




   It has been observed that Parquet and ORC give roughly the same read performance from Spark SQL, while ORC has the edge in storage compression. The right choice depends on how you query the data: if you are using Hive or Spark SQL, ORC is as good as or better than Parquet; if you are using Impala, Parquet gives better read performance.


Hope this interests you. See you next time with another interesting topic.





Wednesday, August 19, 2015

Accessing Hive Tables using R for Data mining



         Hi friends, today we will be discussing yet another interesting topic. In today's world, analytics has advanced to the point where most organisations are investing in data mining and predictive/prescriptive analytics. Many data scientists use R as their preferred language for data mining. In this big data world, most of the data is processed in the Hadoop ecosystem. Now, let us see how we can connect to Hive tables from R and perform some modeling.


Make sure you have installed R and the relevant packages for JDBC connections.

If you have already installed R, make sure the Java configuration for R is up to date:

    # R CMD javareconf


Now, launch the R shell and install the 'rJava' and 'RJDBC' packages, which are needed for JDBC connections.

install.packages("rJava")

install.packages("RJDBC",dep=TRUE)


Once that completes, you are good to start accessing data from Hive tables as shown below.


options( java.parameters = "-Xmx8g" )

library(DBI)
library(rJava)
library(RJDBC)


The 'RJDBC' library depends on 'DBI', so we load that package as well.


Set up the classpath with the jar files required for Hadoop and Hive, assign the HiveDriver, and then define the JDBC connection as shown below (the jar location is only an example; point it at the Hive JDBC driver shipped with your cluster):

     # example location of the Hive JDBC driver jar; adjust to your installation
     hive.jdbc.jar <- "/usr/lib/hive/lib/hive-jdbc-standalone.jar"

     # assign the HiveDriver
     drv <- JDBC("org.apache.hive.jdbc.HiveDriver", classPath = hive.jdbc.jar, identifier.quote = "`")

     # define the JDBC connection
     conn <- dbConnect(drv, "jdbc:hive2://<hostname>:10000/default", "username", "password")




Query the data from the Hive table and assign the result to a variable; RJDBC returns it as a regular R data frame, so we can carry out any R operations on it as usual.






I have call center data, which contains the length of each call and the number of units sold on the call. Let us run a regression model against this data.






Hope this keeps things interesting.

Thanks
Venkat

Tuesday, August 11, 2015

Using Parquet - Hive, Impala and SparkSQL

Hello friends, we discussed some of the new file formats in the Hadoop ecosystem in my previous blog. One of them is Parquet, which has great potential to change the landscape of Hadoop storage while accelerating read and write performance. Let's explore it, compare it, and see how effective it is. We will also access the same data across Hive, Impala and Spark SQL.

For this example, I am using Cloudera's sandbox and the NYC taxi dataset, which is about 75 MB with 250k records.

To keep the comparison simple, I created tables with different storage options:

TXT_NYCTAXI  with text format
PAR_NYCTAXI  with Parquet format
H_NYCTAXI    with Hbase/HFiles format

Note: All three tables are registered in HCatalog and available for querying, but with different storage formats.


Table with Text Format

CREATE  TABLE TXT_NYCTAXI ( 
id  string,
rev  string,
dropoff_datetime  string,
dropoff_latitude  string,
dropoff_longitude  string,
hack_license  string,
medallion  string,
passenger_count  int,
pickup_datetime  string,
pickup_latitude  string,
pickup_longitude  string,
rate_code  string,
store_and_fwd_flag  string,
trip_distance  string,
trip_time_in_secs  string,
vendor_id  string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Loading the data into the table from the CSV file took about 53 seconds, and the table occupies the same storage space as the original file: 75 MB.

LOAD DATA INPATH "/user/cloudera/pig_hbase/nyctaxi.csv" INTO TABLE TXT_NYCTAXI;





Now let us create a table using the Parquet format.

CREATE EXTERNAL TABLE PAR_NYCTAXI (
id  string,
rev  string,
dropoff_datetime  string,
dropoff_latitude  string,
dropoff_longitude  string,
hack_license  string,
medallion  string,
passenger_count  int,
pickup_datetime  string,
pickup_latitude  string,
pickup_longitude  string,
rate_code  string,
store_and_fwd_flag  string,
trip_distance  string,
trip_time_in_secs  string,
vendor_id  string)

  STORED AS parquetfile;


Loading the data into this table is very fast compared to the text format, and it also takes much less space (21 MB vs 75 MB).






I also created an HBase table and integrated it with Hive to compare against HFile storage. Please refer to my previous blog on integrating Hive with HBase.

Hbase Table

hbase(main):001:0> create 'nyc_cab','taxidetails'


Load the HBase table using Pig:

STORE nycab_data INTO 'hbase://nyc_cab' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage (
 'taxidetails:rev,  taxidetails:dropoff_datetime,  taxidetails:dropoff_latitude,
 taxidetails:dropoff_longitude,  taxidetails:hack_license, taxidetails:medallion,
 taxidetails:passenger_count, taxidetails:pickup_datetime, taxidetails:pickup_latitude,
 taxidetails:pickup_longitude, taxidetails:rate_code, taxidetails:store_and_fwd_flag,
 taxidetails:trip_distance, taxidetails:trip_time_in_secs, taxidetails:vendor_id'

);

The loading took about 42 seconds.


Create the Hive table H_NYCTAXI on top of the HBase table. The storage used by the HBase table is far higher: about 2x the text format, i.e. roughly 8x the Parquet format.


    Overall, the Parquet format takes less space on HDFS and less time to load the data (5 vs 42 vs 53 seconds) compared to the HFile and text formats.



Before jumping into measuring read performance, let us make sure that all three tables are available in HCatalog so that we can access them from Hive, Impala and Spark SQL.
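For example, from the Spark shell you can list what HCatalog knows about (a sketch; running show tables in the Hive or Impala shells gives the same information):

import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
// TXT_NYCTAXI, PAR_NYCTAXI and H_NYCTAXI should all appear in this listing
hiveCtx.sql("show tables").collect.foreach(println)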




Querying from Hive..




  Text Storage





Querying from Impala..

Text format



HFile format


Parquet format




Accessing from Spark..

   You can also access the data from Spark SQL and process it further, as shown below.


import org.apache.spark.sql.hive._
import org.apache.spark.sql._


val sqlC = new org.apache.spark.sql.SQLContext(sc)
import sqlC._

// read the Parquet files directly from the warehouse directory
val sc_taxiDF = sqlC.parquetFile("/user/hive/warehouse/par_nyctaxi/")

// register them as a temporary table so they can be queried with SQL
sc_taxiDF.registerTempTable("sc_nyctaxi")

val pass_count = sqlC.sql("select count(id),sum(passenger_count) from sc_nyctaxi")

pass_count.collect.foreach(println)
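Since PAR_NYCTAXI is also registered in HCatalog, the same aggregation can be run through a HiveContext by table name instead of pointing at the files directly. A minimal sketch:

import org.apache.spark.sql.hive.HiveContext

// the HiveContext picks up the metastore, so the table name is enough
val hc = new HiveContext(sc)
hc.sql("select count(id), sum(passenger_count) from PAR_NYCTAXI").collect.foreach(println)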



Parquet format




We observed that the Parquet format performs better and takes less storage than the others. Now, let us conclude with the observations.





  The recommendation is to use Impala in a Cloudera environment for reporting tools that connect through JDBC; for general processing, Spark SQL is the best option and is more flexible than MapReduce.



Thanks again, and I hope this helps. Next time we will discuss another important topic.



Monday, August 10, 2015

Advanced File Formats in Big Data


From the early days of Hadoop, there have been several enhancements for processing data in various file formats and with various compression techniques. Some of these formats include sequential file formats (text, XML and JSON) and non-sequential formats like Avro. The days of XML and JSON are largely past, though some major applications still use them. Avro was a big step forward and has been very useful because it always carries the schema along with the data. However, in 2014/2015 we have been witnessing further storage enhancements that promise greater compression rates and blazing read performance. The new kids on the block are the 'Parquet' and 'ORC' (Optimized Row Columnar) formats.
   

Both are columnar data formats and provide great options, but the choice often depends on the distribution you have in place. Parquet was primarily developed by Twitter and Cloudera, and Cloudera has invested heavily in the technology and continues to do so. Hortonworks, on the other hand, focuses on the ORC format, which was developed as part of the Stinger initiative to replace the RCFile format. ORC is equally promising, with the same or better compression ratios than Parquet. It is difficult to compare the two on exactly the same terms, but we will try to do so with a few examples in my coming blog posts.



Parquet paves the way for better Hadoop data storage     

Hadoop was built for managing large sets of data, so a columnar store is a natural complement.  Most Hadoop projects can read and write data to and from Parquet; the Hive, Pig, Spark and Apache Drill projects already do this, as well as conventional MapReduce.
Parquet implements column-level compression, which gives great compression rates to reduce storage space while also accelerating performance. Cloudera, one of its progenitors, uses Parquet as a native storage format for Impala.

Parquet can be integrated with existing type systems and processing frameworks:
  • Pig
  • Impala
  • Thrift for M/R, Cascading and Scalding
  • Avro
  • Hive
  • Spark

For more information on Parquet: http://parquet.io



ORC, An Intelligent Big Data File Format

Hortonworks, in parallel, has developed the ORC file format as part of its Stinger initiative. ORC goes beyond RCFile and uses specific encoders for different column data types to improve compression further. ORC introduces lightweight indexing that enables skipping complete blocks of rows that do not match a query. Each file, with its columnar layout, is optimized for compression and for skipping columns, which reduces the read and decompression load.

Data stored in ORCFile can be read or written through HCatalog, so any Pig or MapReduce process can play along seamlessly. Spark-based data processing programs can also benefit, as Spark SQL can be integrated with HCatalog through its HiveContext; we will go through an example of this in a coming blog post. The comparison of various formats and their compression ratios is depicted in the diagram below.




Hive will handle all the details of the conversion to ORCFile, and you are free to delete the old table to free up a lot of space. When you create an ORC Hive table, there are a number of table properties you can use to further tune the way ORC works, as sketched below.
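For illustration, here is a sketch of such a table definition submitted through Spark SQL's HiveContext from the Spark shell; the table name is hypothetical and the property values are examples, not recommendations:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// orc.compress, orc.stripe.size and orc.row.index.stride are standard ORC table properties
hiveContext.sql("""
  CREATE TABLE orc_tuned_example (id string, amount double)
  STORED AS ORC
  TBLPROPERTIES ("orc.compress" = "SNAPPY",
                 "orc.stripe.size" = "67108864",
                 "orc.row.index.stride" = "10000")
""")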




File formats compared

   The following table summarizes the performance of the Avro, ORC and Parquet formats with metrics such as storage space (LOAD_MB_WRITTEN), LOAD_TIME, ANALYZE_TIME and QUERY_TIME on a 400 GB volume of data.
  




Hope this keeps things interesting for you; next time we will discuss another important topic.

Tuesday, July 28, 2015

HBase - Querying through Hive Tables


 In my previous blog post, we discussed how to load and process data using Apache Pig and finally store it in an HBase table. In this article we are going to explore how to query that HBase table using Hive. This provides an easy way to give end users in your organization access to the data.

 Many of you may ask why we store the data in HBase and then query it through Hive, rather than storing it directly in Hive tables. The reason is simple: when the data grows huge, ad-hoc queries against plain Hive tables become very slow, and in some cases it makes no sense to query them directly. HBase, on the other hand, promises high-speed random access to the data.


 Now let's get to the technical part.

 1. Let us create a Hive table using the HBaseStorageHandler as shown below.

hive> 
    > 
    > CREATE EXTERNAL TABLE customers(
    >  custmer_id string,
    >  last_name string,
    >  first_name string,
    >  age int,
    >  skill_type string,
    >  skill_set string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key, cust_info:last_name, cust_info:first_name, cust_info:age, cust_prof:skill_type, cust_prof:skill_set')
    > TBLPROPERTIES ('hbase.table.name' = 'cust_master');
OK

Time taken: 14.419 seconds


    The key here is to map the Hive table columns to the corresponding HBase columns, with the right column family and qualifier, as shown above in the WITH SERDEPROPERTIES clause.


2. Now you are good to query the data:

hive> select first_name,last_name,skill_type from customers where age >40;
OK
craig woods Tech Skills
lee persons Soft-Managerial Skills
Time taken: 0.463 seconds, Fetched: 2 row(s)



Hope this helps. See you next time with another interesting snippet.





HBase - Loading data using Apache Pig

In my previous blogs, I have been exploring data consumption with a variety of data formats in the big data world. Today we will explore how to analyze data using Apache Pig and finally store it in HBase in a few easy steps. In the next post, I will demonstrate how to access the same data from Hive through a simple SQL query.

Let us assume that we have the following file, which we will process using Pig.

cust_id,last_name,first_name,age,skill_type,skill_set
--------- ------------ ------------- ---- ------------ ----------
1001,jones,mike,40,Tech Skills,java-perl-ruby-linux-oracle
1006,persons,lee,50,Soft-Managerial Skills,Team Lead-Manager
1002,woods,craig,44,Tech Skills,c,c++,java-sql-oracle-teradata-dwh


Note: My file does not include the header row; I am showing it here just for demonstration purposes.


Let us go step by step.


1.  Let us create the HBase table with two column families, cust_info and cust_prof, for information and profile attributes.

hbase >  create 'cust_master','cust_info','cust_prof'

hbase >  list


2.  Now, let us load the data into Pig.

grunt > a = LOAD '/user/cloudera/pig_hbase/cust_info.txt' USING PigStorage(',') AS ( cust_id:chararray,
last_name:chararray,
first_name:chararray,
age:int,
skill_type:chararray,
skill_set:chararray );

 After the load succeeds, you should see output like the one below.


You can also verify it as shown below:

grunt> dump a;


3.  Now store the data that we loaded into Pig into the HBase table.

 grunt >  STORE a INTO 'hbase://cust_master' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'cust_info:last_name
 cust_info:first_name
 cust_info:age
 cust_prof:skill_type
 cust_prof:skill_set'
);


  The data should now be loaded into the HBase table cust_master.


4.  Let's go and check in HBase.

hbase(main):004:0> scan 'cust_master'

ROW                                     COLUMN+CELL                                                                                                      
 1001    column=cust_ai:age, timestamp=1438109850189, value=40
 1001    column=cust_ai:first_name, timestamp=1438109850189, value=mike
 1001    column=cust_ai:last_name, timestamp=1438109850189, value=jones
 1001    column=cust_prof:skill_set, timestamp=1438109850189, value=java-perl-ruby-linux-oracle
 1001    column=cust_prof:skill_type, timestamp=1438109850189, value=Tech Skills
 1002    column=cust_ai:age, timestamp=1438109850218, value=44
 1002    column=cust_ai:first_name, timestamp=1438109850218, value=craig
 1002    column=cust_ai:last_name, timestamp=1438109850218, value=woods
 1002    column=cust_prof:skill_set, timestamp=1438109850218, value=c
 1002    column=cust_prof:skill_type, timestamp=1438109850218, value=Tech Skills
 1006    column=cust_ai:age, timestamp=1438109850217, value=50
 1006    column=cust_ai:first_name, timestamp=1438109850217, value=lee
 1006    column=cust_ai:last_name, timestamp=1438109850217, value=persons
 1006    column=cust_prof:skill_set, timestamp=1438109850217, value=Team Lead-Manager
 1006    column=cust_prof:skill_type, timestamp=1438109850217, value=Soft-Managerial Skills
3 row(s) in 0.3510 seconds



 In my next blog, we will see how to create a Hive table on top of the HBase table for easily querying the data stored in HBase... stay tuned.