Monday, August 10, 2015

Advanced File Formats in Big Data

Since the early days of Hadoop, there have been several enhancements for processing data in various file formats and with various compression techniques. These formats include sequential file formats (text, XML and JSON) and non-sequential formats such as Avro. The days of XML and JSON are largely past, though some major applications still use them. Avro was great and very useful, as it always carries the schema along with the data. However, in recent years (2014/2015) we have been seeing further enhancements in storage that promise higher compression rates and much faster reads. The new kids on the block are the 'Parquet' and 'ORC' (Optimized Row Columnar) formats.

Both are columnar data formats and both provide great options, but which one suits you depends on the distribution you have in place. Parquet was developed primarily by Twitter and Cloudera, and Cloudera has invested heavily in the technology and continues to do so, whereas Hortonworks focuses on the ORC format, which was developed as part of the Stinger initiative to replace the RCFile format. ORC is equally promising, with the same or better compression ratios as Parquet. It is difficult to compare the two on equal terms, but we will try to do so with a few examples in coming blog posts.

Parquet paves the way for better Hadoop data storage     

Hadoop was built for managing large sets of data, so a columnar store is a natural complement. Most Hadoop projects can read and write Parquet data; Hive, Pig, Spark and Apache Drill already do this, as does conventional MapReduce.
Parquet compresses data column by column, which gives high compression rates, reduces storage space and at the same time speeds up reads. Cloudera, one of its progenitors, uses Parquet as a native storage format for Impala.
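
To see why storing values column by column helps compression, here is a minimal sketch in plain Python. It is not Parquet's actual implementation; it only pivots some made-up rows into columns and applies run-length encoding, one of the simple encodings that columnar formats can use. The sample rows and the function names `to_columns` and `run_length_encode` are assumptions made for illustration.

```python
from itertools import groupby

# Hypothetical sample rows: (user_id, country, status)
rows = [
    (1, "US", "active"),
    (2, "US", "active"),
    (3, "US", "inactive"),
    (4, "IN", "active"),
    (5, "IN", "active"),
    (6, "IN", "active"),
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    return [list(column) for column in zip(*records)]


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


user_ids, countries, statuses = to_columns(rows)

# Values of the same column sit next to each other, so repeats form long runs.
encoded_countries = run_length_encode(countries)
encoded_statuses = run_length_encode(statuses)

print(encoded_countries)  # [('US', 3), ('IN', 3)]
print(encoded_statuses)   # [('active', 2), ('inactive', 1), ('active', 3)]
```

In a row-oriented layout the country values are interleaved with ids and statuses, so runs like these never form. In a columnar layout six country values shrink to two pairs.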

Parquet can be integrated with existing type systems and processing frameworks:
  • Pig
  • Impala
  • Thrift for M/R, Cascading and Scalding
  • Avro
  • Hive
  • Spark

For more information on Parquet  :  Http://

ORC, An Intelligent Big Data File Format

Hortonworks, in parallel, has developed the ORC file format as part of its Stinger project. ORC goes beyond RCFile and uses type-specific encoders for different column data types to improve compression further. ORC also introduces lightweight indexing that makes it possible to skip entire blocks of rows that do not match a query. The columnar layout of each file is optimized for compression and for skipping columns, which reduces the read and decompression load.
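
The idea behind skipping blocks of rows can be sketched in a few lines of Python. This is a simplified illustration, not ORC's real reader: it keeps a minimum and maximum value for each block of rows and skips any block whose range cannot contain the value being searched for. The block size, the sample data and the function names `build_index` and `scan_equal` are assumptions made for the example.

```python
def build_index(values, stride):
    """Record (start, end, min, max) for each block of `stride` rows."""
    index = []
    for start in range(0, len(values), stride):
        block = values[start:start + stride]
        index.append((start, start + len(block), min(block), max(block)))
    return index


def scan_equal(values, index, target):
    """Return matching row positions and the number of blocks actually read."""
    matches, blocks_read = [], 0
    for start, end, low, high in index:
        if target < low or target > high:
            continue  # the block's range cannot contain target, so skip it
        blocks_read += 1
        matches.extend(i for i in range(start, end) if values[i] == target)
    return matches, blocks_read


# Hypothetical column of order amounts, stored in roughly sorted order
amounts = [5, 7, 9, 12, 15, 18, 21, 25, 30, 31, 35, 40]
index = build_index(amounts, stride=4)
matches, blocks_read = scan_equal(amounts, index, target=25)

print(index)        # [(0, 4, 5, 12), (4, 8, 15, 25), (8, 12, 30, 40)]
print(matches)      # [7]
print(blocks_read)  # 1
```

Only one of the three blocks is read here. The more the data is clustered on the filtered column, the more blocks such an index lets a reader skip.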

Data stored in ORC files can be read or written through HCatalog, so any Pig or MapReduce process can play along seamlessly. Spark-based data processing programs can also benefit, as Spark SQL can integrate with HCatalog through its HiveContext. We will go through this with an example in a coming blog post. The comparison of the various formats and their compression ratios is depicted in the diagram below.

Hive handles all the details of the conversion to ORC, and you are then free to delete the old table to free up loads of space. When you create an ORC Hive table, there are a number of table properties you can use to further tune the way ORC works.
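
As a rough sketch of what such a conversion might look like, the small Python helper below assembles a Hive `CREATE TABLE ... AS SELECT` statement that stores the copy as ORC with a couple of table properties. The helper `orc_ctas` and the table names `sales_text` and `sales_orc` are hypothetical. `orc.compress` and `orc.row.index.stride` are ORC table properties documented for Hive, but the values shown are only examples, so check the defaults for your Hive version before relying on them.

```python
def orc_ctas(new_table, source_table, properties):
    """Build a CREATE TABLE ... AS SELECT statement that stores the copy as ORC."""
    props = ", ".join(f'"{key}"="{value}"' for key, value in properties.items())
    return (
        f"CREATE TABLE {new_table} STORED AS ORC "
        f"TBLPROPERTIES ({props}) "
        f"AS SELECT * FROM {source_table}"
    )


# Hypothetical table names; the property values are example settings only.
ddl = orc_ctas(
    "sales_orc",
    "sales_text",
    {"orc.compress": "SNAPPY", "orc.row.index.stride": "10000"},
)
print(ddl)
```

The generated statement can be run from the Hive shell or any Hive client. Once the new table has been verified, the old text table can be dropped.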

File formats compared

The following table summarizes the performance of the Avro, ORC and Parquet formats on metrics such as storage space (LOAD_MB_WRITTEN), LOAD_TIME, ANALYZE_TIME and QUERY_TIME on a 400 GB volume of data.

Hope this was interesting for you; next time we will discuss another important topic.


  1. Deep machine learning is the way to understand machine learning in a better way. By this, you can verify machine learning solution provider’s quality in many different terms. You need to be alert and updated to avoid any kind of misleading.

  2. It was interesting to read this article, thanks!
