Since the early days of Hadoop, there have been several enhancements for processing data in various file formats and with various compression techniques. These formats include sequential file formats (text, XML and JSON) and non-sequential formats like Avro. The days of XML and JSON are past now, though some major applications still use them. Avro was a big step forward and remains very useful because it always carries the schema along with the data.
However, in recent years (2014/2015) we have been seeing more enhancements in storage that promise greater compression rates and much faster read performance. The new kids on the block are ‘Parquet’ and ‘ORC’ (Optimized Row Columnar).
Both are columnar data formats and provide great options, but which one fits best depends on the distribution that you have in place.
Parquet was developed primarily by Twitter and Cloudera, and Cloudera has invested heavily in the technology and continues to do so. Hortonworks, on the other hand, focuses on the ORC format, which was developed as part of the Stinger initiative to replace the RCFile format. ORC is equally promising, with the same or better compression ratios as Parquet. It is very difficult to compare the two on equal terms, but we will try to do so with a few examples in coming blog posts.
Parquet paves the way for better Hadoop data storage
Hadoop was built for managing large sets of data, so a
columnar store is a natural complement. Most Hadoop projects can read and
write data to and from Parquet; the Hive, Pig,
Spark and Apache Drill projects already do this, as well as
conventional MapReduce.
Parquet compresses data column by column, which gives high compression rates, reduces storage space and at the same time speeds up reads. Cloudera, one of its progenitors, uses Parquet as a native storage format for Impala.
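To make the column-by-column idea concrete, here is a minimal sketch in plain Python. It is not Parquet's actual file layout or API. It only assumes that each column is serialized and compressed as its own chunk, with `zlib` and JSON standing in for Parquet's real encodings. The sample rows and the field names (`user_id`, `country`, `amount`) are made up for illustration.

```python
import json
import zlib

# Hypothetical sample rows; the field names are made up for this sketch.
rows = [
    {"user_id": i, "country": ["US", "IN", "DE"][i % 3], "amount": round(i * 1.5, 2)}
    for i in range(1000)
]

def to_columnar(rows):
    """Pivot a list of row dicts into one compressed chunk per column."""
    columns = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        # Each column is serialized and compressed on its own.
        columns[name] = zlib.compress(json.dumps(values).encode("utf-8"))
    return columns

def read_column(columns, name):
    """Decompress a single column, leaving the other chunks untouched."""
    return json.loads(zlib.decompress(columns[name]).decode("utf-8"))

columns = to_columnar(rows)
countries = read_column(columns, "country")

# A query that needs only "country" touches a fraction of the stored bytes.
bytes_read = len(columns["country"])
bytes_total = sum(len(chunk) for chunk in columns.values())
print(f"read {bytes_read} of {bytes_total} compressed bytes")
```

Because values of the same type sit next to each other, a compressor sees more regular input, and a query that touches one column never has to decompress the others.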
Parquet can be integrated with existing type systems and processing frameworks:
- Pig
- Impala
- Thrift for M/R, Cascading and Scalding
- Avro
- Hive
- Spark
For more information on Parquet: http://parquet.io
ORC, An Intelligent Big Data File Format
Hortonworks, in parallel, developed the ORC file format as part of its Stinger project. ORC goes beyond RCFile and uses specific encoders for different column data types to improve compression further. ORC also introduces lightweight indexing that enables skipping entire blocks of rows that do not match a query. Each file uses a columnar layout that is optimized for compression and for skipping columns, which reduces read and decompression load.
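The two ideas in this paragraph, type-specific encoding and lightweight indexing, can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not ORC's on-disk format. It uses run-length encoding as the example encoder for an integer column and keeps a min/max pair per row group as the "index". The group size of 4 and the sample column are chosen only to keep the output readable.

```python
from itertools import groupby

ROW_GROUP_SIZE = 4  # tiny on purpose; chosen only for this sketch

def rle_encode(values):
    """Run-length encode a list into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]

def build_row_groups(values, size=ROW_GROUP_SIZE):
    """Split a column into row groups, keeping min/max next to each encoded group."""
    groups = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        groups.append({"min": min(chunk), "max": max(chunk), "data": rle_encode(chunk)})
    return groups

def scan_equal(groups, target):
    """Return matching values and the number of row groups actually decoded."""
    matches, decoded = [], 0
    for group in groups:
        if target < group["min"] or target > group["max"]:
            continue  # the index proves no row here can match, so skip decoding
        decoded += 1
        matches.extend(v for v in rle_decode(group["data"]) if v == target)
    return matches, decoded

# Hypothetical integer column with long runs, as in a sorted or low-cardinality field.
column = [1, 1, 1, 1, 2, 2, 2, 3, 7, 7, 7, 7]
groups = build_row_groups(column)
matches, decoded = scan_equal(groups, 7)
print(matches, decoded, len(groups))  # [7, 7, 7, 7] 1 3
```

Looking for the value 7 decodes only one of the three row groups; the other two are ruled out by their min/max entries alone.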
Data stored in ORCFile can be read or written through HCatalog, so any Pig or Map/Reduce process can play along seamlessly. Spark-based data processing programs can benefit too, as Spark SQL can integrate with HCatalog through its HiveContext. We will go through this with an example in a coming blog post. The comparison of various formats and their compression ratios is depicted in the diagram below.
Hive will handle all the details of the conversion to ORCFile, and you are free to delete the old table to free up loads of space. When you create an ORC Hive table, there are a number of table properties you can use to further tune the way ORC works.
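As a rough illustration of what such a conversion can look like, the Python helper below builds a Hive `CREATE TABLE ... AS SELECT` statement that stores the copy as ORC. The helper `orc_ctas` and the table names `web_logs_text` and `web_logs_orc` are hypothetical. The property keys shown (`orc.compress`, `orc.stripe.size`, `orc.row.index.stride`) are ORC table properties in Hive, but the values are examples rather than tuning advice.

```python
def orc_ctas(source_table, target_table, properties):
    """Build a CREATE TABLE ... AS SELECT statement that stores the copy as ORC."""
    props = ", ".join(f'"{key}"="{value}"' for key, value in properties.items())
    return (
        f"CREATE TABLE {target_table} STORED AS ORC "
        f"TBLPROPERTIES ({props}) "
        f"AS SELECT * FROM {source_table}"
    )

# Table names are hypothetical; the values are examples, not tuning advice.
statement = orc_ctas(
    "web_logs_text",
    "web_logs_orc",
    {
        "orc.compress": "SNAPPY",         # NONE, ZLIB or SNAPPY
        "orc.stripe.size": "67108864",    # bytes per stripe
        "orc.row.index.stride": "10000",  # rows between index entries
    },
)
print(statement)
```

The printed statement can be run in the Hive shell. Once the new table has been verified, the old one can be dropped.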
File formats compared
The following table summarizes the performance of the Avro, ORC and Parquet formats on metrics such as storage space (LOAD_MB_WRITTEN), LOAD_TIME, ANALYZE_TIME and QUERY_TIME on a 400 GB volume of data.
Hope this was interesting for you. Next time we will discuss another important topic.