Hello Friends, we discussed some of the new file formats in the Hadoop ecosystem in my previous blog. One of them is Parquet, which has high potential to change the landscape of Hadoop storage while accelerating read and write performance. Let's explore and compare to see how effective it is. We will also access the same data across Hive, Impala, and Spark.
For this example, I am using the Cloudera sandbox and the NYC taxi dataset, which is about 75 MB with 250k records.
To keep it simple, I created tables with different storage options:
TXT_NYCTAXI with text format
PAR_NYCTAXI with Parquet format
H_NYCTAXI with HBase/HFile format
Note: All three tables are registered in HCatalog and available for querying, each with a different storage format.
Table with Text Format
CREATE TABLE TXT_NYCTAXI (
id string,
rev string,
dropoff_datetime string,
dropoff_latitude string,
dropoff_longitude string,
hack_license string,
medallion string,
passenger_count int,
pickup_datetime string,
pickup_latitude string,
pickup_longitude string,
rate_code string,
store_and_fwd_flag string,
trip_distance string,
trip_time_in_secs string,
vendor_id string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Loading the data into the table from the CSV file took about 53 seconds, and the table occupies the same storage space as the source file: 75 MB.
LOAD DATA INPATH "/user/cloudera/pig_hbase/nyctaxi.csv" INTO TABLE TXT_NYCTAXI;
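A quick row count verifies the load (a simple sanity check, not shown in the original post):
SELECT COUNT(*) FROM TXT_NYCTAXI;  -- expect roughly 250k rows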
Now let us create a table using the Parquet format.
CREATE EXTERNAL TABLE PAR_NYCTAXI (
id string,
rev string,
dropoff_datetime string,
dropoff_latitude string,
dropoff_longitude string,
hack_license string,
medallion string,
passenger_count int,
pickup_datetime string,
pickup_latitude string,
pickup_longitude string,
rate_code string,
store_and_fwd_flag string,
trip_distance string,
trip_time_in_secs string,
vendor_id string)
STORED AS parquetfile;
Loading the data into this table is very fast compared to the text format, and it also takes far less space (21 MB vs. 75 MB).
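The load statement for the Parquet table is not shown here; one common way to populate it, assumed in this sketch, is to insert from the text table:
-- Hive rewrites the rows into Parquet files as it inserts them
INSERT OVERWRITE TABLE PAR_NYCTAXI
SELECT * FROM TXT_NYCTAXI;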
I have also created an HBase table and integrated it with Hive to compare against HFile storage. Please refer to my previous blog on integrating Hive with HBase.
HBase Table
hbase(main):001:0> create 'nyc_cab','taxidetails'
Load the HBase table using Pig:
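The alias nycab_data must be defined before the STORE below; a minimal sketch, assuming the same CSV and the field order of the Hive schema (the first field, id, becomes the HBase row key):
-- Assumed source and schema; HBaseStorage uses the first field as the row key
nycab_data = LOAD '/user/cloudera/pig_hbase/nyctaxi.csv' USING PigStorage(',')
    AS (id, rev, dropoff_datetime, dropoff_latitude, dropoff_longitude,
        hack_license, medallion, passenger_count, pickup_datetime,
        pickup_latitude, pickup_longitude, rate_code, store_and_fwd_flag,
        trip_distance, trip_time_in_secs, vendor_id);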
STORE nycab_data INTO 'hbase://nyc_cab' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage (
'taxidetails:rev, taxidetails:dropoff_datetime, taxidetails:dropoff_latitude,
taxidetails:dropoff_longitude, taxidetails:hack_license, taxidetails:medallion,
taxidetails:passenger_count, taxidetails:pickup_datetime, taxidetails:pickup_latitude,
taxidetails:pickup_longitude, taxidetails:rate_code, taxidetails:store_and_fwd_flag,
taxidetails:trip_distance, taxidetails:trip_time_in_secs, taxidetails:vendor_id'
);
The loading took about 42 seconds.
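A quick sanity check on the row count can be run from the HBase shell (not shown in the original post):
hbase(main):002:0> count 'nyc_cab'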
Next, create a Hive table H_NYCTAXI on top of the HBase table. The storage used by the HBase table is far higher: about 2x the text format, i.e., roughly 8x the Parquet format.
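For reference, the Hive-on-HBase mapping for H_NYCTAXI looks roughly like the sketch below; the column mapping mirrors the Pig STORE above, with id as the row key (see my previous blog for the exact DDL):
CREATE EXTERNAL TABLE H_NYCTAXI (
id string,
rev string,
dropoff_datetime string,
dropoff_latitude string,
dropoff_longitude string,
hack_license string,
medallion string,
passenger_count int,
pickup_datetime string,
pickup_latitude string,
pickup_longitude string,
rate_code string,
store_and_fwd_flag string,
trip_distance string,
trip_time_in_secs string,
vendor_id string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,taxidetails:rev,taxidetails:dropoff_datetime,taxidetails:dropoff_latitude,taxidetails:dropoff_longitude,taxidetails:hack_license,taxidetails:medallion,taxidetails:passenger_count,taxidetails:pickup_datetime,taxidetails:pickup_latitude,taxidetails:pickup_longitude,taxidetails:rate_code,taxidetails:store_and_fwd_flag,taxidetails:trip_distance,taxidetails:trip_time_in_secs,taxidetails:vendor_id")
TBLPROPERTIES ("hbase.table.name" = "nyc_cab");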
Overall, the Parquet format takes less space on HDFS and less time to load the data than text and HFile storage (5 vs. 42 vs. 53 seconds for Parquet, HBase, and text respectively).
Before jumping into measuring read performance, let us make sure that all three tables are available in HCatalog and can be accessed using Hive, Impala, and Spark SQL.
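As an example, the same aggregate used in the Spark snippet below can be run against each table from the Hive or Impala shell (the exact queries behind the original screenshots are assumed to be along these lines):
SELECT COUNT(id), SUM(passenger_count) FROM TXT_NYCTAXI;
SELECT COUNT(id), SUM(passenger_count) FROM PAR_NYCTAXI;
SELECT COUNT(id), SUM(passenger_count) FROM H_NYCTAXI;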
Querying from Hive:
Text format
Querying from Impala:
Text format
HFile format
Parquet format
Accessing from Spark:
You can also access the data from Spark SQL for further processing, as shown below:
import org.apache.spark.sql._
// Create a SQLContext from the spark-shell's existing SparkContext (sc)
val sqlC = new org.apache.spark.sql.SQLContext(sc)
// Read the Parquet files written by Hive, straight from the warehouse directory
val sc_taxiDF = sqlC.parquetFile("/user/hive/warehouse/par_nyctaxi/")
// Register the result as a temporary table so it can be queried with SQL
sc_taxiDF.registerTempTable("sc_nyctaxi")
val pass_count = sqlC.sql("select count(id), sum(passenger_count) from sc_nyctaxi")
pass_count.collect.foreach(println)
Parquet format
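Since all three tables are registered in HCatalog, a HiveContext can also query them by name instead of pointing at the warehouse path (a sketch):
import org.apache.spark.sql.hive.HiveContext
// HiveContext resolves table names through the Hive metastore / HCatalog
val hc = new HiveContext(sc)
hc.sql("select count(id), sum(passenger_count) from par_nyctaxi").collect.foreach(println)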
We observed that the Parquet format performs better and takes less storage than the others. Let us conclude with the observations.
In a Cloudera environment, Impala is recommended for reporting tools connecting through JDBC; for general processing, Spark SQL is the best option and is more flexible than MapReduce.
Thanks again, and I hope this helps. Next we will discuss another important topic.