Wednesday, August 19, 2015

Accessing Hive Tables Using R for Data Mining



         Hi friends, today we will discuss yet another interesting topic. Analytics has advanced to the point where most organisations are investing in data mining and predictive/prescriptive analytics, and many data scientists use R as their preferred language for these needs. In this Big Data world, most of the data is processed in the Hadoop ecosystem. Now, let us see how we can connect to Hive tables from R and perform some modeling.


Make sure you have installed R and the relevant packages for JDBC connections.

If you have already installed R, make sure the Java settings are up to date:

    # R CMD javareconf


Now, launch the R shell and install the 'rJava' and 'RJDBC' packages, which are needed for JDBC connections:

install.packages("rJava")

install.packages("RJDBC",dep=TRUE)


Once the installation is complete, you are ready to start accessing data from Hive tables as shown below.


options( java.parameters = "-Xmx8g" )

library(DBI)
library(rJava)
library(RJDBC)


The 'RJDBC' library depends on 'DBI', so we load that package as well.


Set up the necessary JAR files required for Hadoop and Hive.
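A minimal sketch of one way to do this, assuming Cloudera-style install paths ('/usr/lib/hadoop' and '/usr/lib/hive/lib' are placeholders; point them at wherever your distribution keeps its client libraries):

# Collect the Hadoop and Hive client JARs and start the JVM with them on
# the classpath so the Hive JDBC driver can be loaded.
hadoop_jars <- list.files("/usr/lib/hadoop", pattern = "\\.jar$", full.names = TRUE)
hive_jars   <- list.files("/usr/lib/hive/lib", pattern = "\\.jar$", full.names = TRUE)
cp <- c(hadoop_jars, hive_jars)
.jinit(classpath = cp)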


Assign the HiveDriver.
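A sketch of the driver assignment; 'org.apache.hive.jdbc.HiveDriver' is the standard HiveServer2 driver class matching the jdbc:hive2:// URL used below, and 'cp' is the classpath assembled above:

# Create the JDBC driver object for HiveServer2.
drv <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",
            classPath = cp,
            identifier.quote = "`")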


Define the JDBC connection as shown
 
     conn <- dbConnect(drv, "jdbc:hive2://<hostname>:10000/default", "username", "password")




Query the Hive table from R and assign the result to a variable. RJDBC returns the result as an R data frame, so we can carry out any R operations on it as usual.
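For example (a sketch; the table name call_center and its columns call_length and units_sold are illustrative assumptions, so substitute your own table):

# Pull the Hive table into R; the result is an ordinary data frame.
calls <- dbGetQuery(conn, "SELECT call_length, units_sold FROM default.call_center")
str(calls)
summary(calls)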






I have call center data, which contains the length of each call and the number of units sold on the call. Let us run a regression model against this data.
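Something along these lines (a sketch that reuses the hypothetical 'calls' data frame queried above):

# Fit a simple linear regression of units sold on call length.
fit <- lm(units_sold ~ call_length, data = calls)
summary(fit)                      # coefficients, R-squared, p-values
plot(calls$call_length, calls$units_sold,
     xlab = "Call length", ylab = "Units sold")
abline(fit, col = "red")          # overlay the fitted regression line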






I hope you find this interesting.

Thanks
Venkat

Tuesday, August 11, 2015

Using Parquet - Hive, Impala and SparkSQL

Hello friends, we discussed some of the new file formats in the Hadoop ecosystem in my previous blog. One of them is Parquet, which has high potential to change the landscape of Hadoop storage while accelerating read and write performance. Let's explore and compare to see how effective it is. We will also access the same data from Hive, Impala and SparkSQL.

For this example, I am using Cloudera's sandbox and the NYC taxi dataset, which is about 75 MB with 250k records.

To keep it simple, I created three tables with different storage options:

TXT_NYCTAXI  with text format
PAR_NYCTAXI  with Parquet format
H_NYCTAXI    with HBase/HFile format

Note: all three tables are registered in HCatalog and available for querying, but with different underlying storage.


Table with Text Format

CREATE  TABLE TXT_NYCTAXI ( 
id  string,
rev  string,
dropoff_datetime  string,
dropoff_latitude  string,
dropoff_longitude  string,
hack_license  string,
medallion  string,
passenger_count  int,
pickup_datetime  string,
pickup_latitude  string,
pickup_longitude  string,
rate_code  string,
store_and_fwd_flag  string,
trip_distance  string,
trip_time_in_secs  string,
vendor_id  string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Loading the data into the table from the CSV file took about 53 seconds, and the table occupies the same storage space as the file: 75 MB.

LOAD DATA INPATH "/user/cloudera/pig_hbase/nyctaxi.csv" INTO TABLE TXT_NYCTAXI;





Now let us create a table using the Parquet format:

CREATE EXTERNAL TABLE PAR_NYCTAXI (
id  string,
rev  string,
dropoff_datetime  string,
dropoff_latitude  string,
dropoff_longitude  string,
hack_license  string,
medallion  string,
passenger_count  int,
pickup_datetime  string,
pickup_latitude  string,
pickup_longitude  string,
rate_code  string,
store_and_fwd_flag  string,
trip_distance  string,
trip_time_in_secs  string,
vendor_id  string)

  STORED AS parquetfile;


Loading the data into this table is very fast compared to the text format, and it also takes far less space (21 MB vs 75 MB).
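One simple way to populate it (a sketch; the exact load behind the timing may have differed) is to insert the rows from the text table and let Hive write them out as Parquet:

INSERT OVERWRITE TABLE PAR_NYCTAXI
SELECT * FROM TXT_NYCTAXI;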






I had also created the HBase table and integrated it with Hive to compare against HFile storage. Please refer to my previous blog on integrating Hive with HBase.

Hbase Table

hbase(main):001:0> create 'nyc_cab','taxidetails'


Load the HBase table using Pig (assuming the relation nycab_data was already loaded from the taxi CSV, with the row key as its first field):

STORE nycab_data INTO 'hbase://nyc_cab' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage (
 'taxidetails:rev,  taxidetails:dropoff_datetime,  taxidetails:dropoff_latitude,
 taxidetails:dropoff_longitude,  taxidetails:hack_license, taxidetails:medallion,
 taxidetails:passenger_count, taxidetails:pickup_datetime, taxidetails:pickup_latitude,
 taxidetails:pickup_longitude, taxidetails:rate_code, taxidetails:store_and_fwd_flag,
 taxidetails:trip_distance, taxidetails:trip_time_in_secs, taxidetails:vendor_id'

);

The loading took about 42 seconds.


Create the Hive table H_NYCTAXI on top of the HBase table. The storage consumed by the HBase table is far higher: about 2x the text format, i.e. roughly 8x the Parquet format.
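A sketch of the Hive-on-HBase mapping using the standard HBaseStorageHandler (abbreviated to a few columns here; see the earlier post for the full mapping):

CREATE EXTERNAL TABLE H_NYCTAXI (
id  string,
rev  string,
passenger_count  int,
trip_distance  string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
    ":key,taxidetails:rev,taxidetails:passenger_count,taxidetails:trip_distance")
TBLPROPERTIES ("hbase.table.name" = "nyc_cab");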


    Overall, the Parquet format takes less space on HDFS and less time to load the data (5 vs 42 vs 53 seconds) compared to the text and HFile formats.



Before jumping into measuring read performance, let us make sure that all three tables are available in HCatalog so we can access them using Hive, Impala and SparkSQL.




Querying from Hive..




  Text Storage
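The query used for the read comparison was a simple aggregate along these lines (an assumption, modelled on the SparkSQL example further down):

select count(id), sum(passenger_count) from TXT_NYCTAXI;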





Querying from Impala..
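In impala-shell the catalog has to be refreshed before the Hive-created tables show up; after that the same aggregate can be run against each of the three tables (a sketch):

INVALIDATE METADATA;
select count(id), sum(passenger_count) from TXT_NYCTAXI;
select count(id), sum(passenger_count) from H_NYCTAXI;
select count(id), sum(passenger_count) from PAR_NYCTAXI;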

Text format



HFile format


Parquet format




Accessing from Spark..

   You can also access the data from SparkSQL and continue processing it, as shown below.


import org.apache.spark.sql.hive._
import org.apache.spark.sql._


val sqlC = new org.apache.spark.sql.SQLContext(sc)
import sqlC._

val sc_taxiDF = sqlC.parquetFile("/user/hive/warehouse/par_nyctaxi/")

sc_taxiDF.registerTempTable("sc_nyctaxi")

val pass_count = sqlC.sql("select count(id),sum(passenger_count) from sc_nyctaxi")


pass_count.collect.foreach(println)



Parquet format




We observed that the Parquet format performs better and takes less storage than the others. Now, let us conclude with the observations.





  In a Cloudera environment, Impala is recommended for any reporting tools connecting through JDBC; otherwise, for processing, SparkSQL is the best option and is more flexible than MapReduce.



Thanks again, and I hope this helps. Next we will discuss another important topic.



Monday, August 10, 2015

Advanced File Formats in Big Data


From the early days of Hadoop, there have been several enhancements for processing data in various file formats and with various compression techniques. Some of the formats include sequential file formats (text, XML and JSON) and non-sequential formats like Avro. The days of XML and JSON are largely past, though some major applications still use them. Avro has been very useful because it always carries the schema along with the data. However, in recent years (2014/2015) we are witnessing more enhancements in storage that promise greater compression rates and even blazing read performance. The new kids on the block are the 'Parquet' and 'ORC' (Optimized Row Columnar) formats.
   

Both are columnar data formats and provide great options, but the choice depends on the distribution you have in place. Parquet was developed primarily by Twitter and Cloudera, and Cloudera has invested heavily in the technology and continues to do so. Hortonworks, on the other hand, focuses on the ORC format, which was developed as part of the Stinger initiative to replace the RCFile format. ORC is equally promising, with the same or better compression ratios than Parquet. It is very difficult to compare the two on the same terms, but we will try to do so with a few examples in my coming blog posts.



Parquet paves the way for better Hadoop data storage     

Hadoop was built for managing large sets of data, so a columnar store is a natural complement.  Most Hadoop projects can read and write data to and from Parquet; the Hive, Pig, Spark and Apache Drill projects already do this, as well as conventional MapReduce.
Parquet implements column-level compression, which gives great compression rates to decrease storage space and at the same time accelerates performance. Cloudera, one of its progenitors, uses Parquet as the native storage format for Impala.

Parquet can be integrated with existing type systems and processing frameworks:
  • Pig
  • Impala
  • Thrift (for M/R, Cascading and Scalding)
  • Avro
  • Hive
  • Spark

For more information on Parquet: http://parquet.io



ORC, An Intelligent Big Data File Format

Hortonworks, in parallel, developed the ORC file format as part of its Stinger initiative. ORC goes beyond RCFile and uses type-specific encoders for different column data types to improve compression further. ORC also introduces lightweight indexing that enables complete blocks of rows that do not match a query to be skipped. Each file's columnar layout is optimized for compression and for skipping columns, reducing the read and decompression load.

Data stored in ORCFile can be read or written through HCatalog, so any Pig or MapReduce process can work with it seamlessly. Spark-based data processing programs can also benefit, as SparkSQL integrates with HCatalog through its HiveContext. We will go through this with an example in a coming blog post. The comparison of the various formats and their compression ratios is depicted in the diagram below.




Hive will handle all the details of the conversion to ORCFile, and you are free to delete the old table to free up space. When you create an ORC Hive table, there are a number of table properties you can use to further tune the way ORC works.
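For instance (a minimal sketch with an assumed source table my_table; the property values are just illustrations of the knobs available):

CREATE TABLE my_table_orc
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY",
               "orc.create.index" = "true",
               "orc.row.index.stride" = "10000")
AS SELECT * FROM my_table;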




File formats compared

   The following table summarizes the performance of the Avro, ORC and Parquet formats with metrics such as storage space (LOAD_MB_WRITTEN), LOAD_TIME, ANALYZE_TIME and QUERY_TIME on a 400 GB volume of data.
  




I hope you found this interesting; next time we will discuss another important topic.