From the early days of
Hadoop, there have been several
enhancements for processing data in
various file formats and with various compression techniques. Some of these formats include sequential file formats (text, XML
and JSON) and non-sequential formats like Avro. The days of XML and JSON are largely past now,
though some major applications still use them. Avro has been great and very useful because it
always carries the schema along with the data.
However, in recent years (2014/2015) we are witnessing further
enhancements in storage formats that promise greater compression rates
and blazing read performance. The new kids on the block are the 'Parquet'
and 'ORC' (Optimized Row Columnar)
formats.
Both are columnar data formats and provide great options, but the choice often depends on the distribution you have in place. Parquet was primarily developed by Twitter and Cloudera, and Cloudera has invested heavily in the technology and continues to do so. Hortonworks, on the other hand, focuses on the ORC format, which was developed as part of the Stinger initiative to replace the RCFile format. ORC is equally promising, with the same or better compression ratios as Parquet. It is very difficult to compare the two on the same lines, though; we will try to do so with a few examples in my coming blog posts.
Parquet paves the
way for better Hadoop data storage
Hadoop was built for managing large sets of data, so a
columnar store is a natural complement. Most Hadoop projects can read and
write data to and from Parquet; the Hive, Pig,
Spark and Apache Drill projects already do this, as does
conventional MapReduce.
Parquet implements per-column compression, which gives great
compression rates to decrease
storage space while at the same time accelerating read performance.
Cloudera, one of its progenitors, uses Parquet as the
native storage format for its Impala engine.
Parquet can be
integrated with existing type systems and processing frameworks (a short Spark sketch follows this list):
- Pig
- Impala
- Thrift for M/R, Cascading and Scalding
- Avro
- Hive
- Spark
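As a concrete illustration, here is a minimal PySpark sketch of writing and reading Parquet with a compression codec. The session setup, DataFrame contents and output path are all made up for the example, and it uses the current Spark DataFrame API rather than the 1.x API that was around when Parquet first appeared; the point is simply that the resulting files can be read by any Parquet-aware engine.

```python
from pyspark.sql import SparkSession

# Assumed: a local session for illustration; in practice this runs on a cluster.
spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

# A small, made-up DataFrame standing in for real data.
df = spark.createDataFrame(
    [(1, "alice", 42.0), (2, "bob", 17.5)],
    ["id", "name", "score"],
)

# Write as Parquet with a compression codec (snappy here; gzip trades
# more CPU time for smaller files).
df.write.mode("overwrite") \
    .option("compression", "snappy") \
    .parquet("/tmp/demo_parquet")

# Any Parquet-aware engine (Hive, Impala, Drill, MapReduce) can read the
# same directory; here we simply read it back with Spark.
spark.read.parquet("/tmp/demo_parquet").show()
```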
For more information on Parquet: http://parquet.io
ORC, An Intelligent
Big Data File Format
Hortonworks, in
parallel, has developed the ORC file format as part of its Stinger initiative. ORC goes beyond RCFile and uses type-specific encoders for different
column data types to improve compression further. ORC also introduces
lightweight indexing that enables skipping of complete blocks of rows that do not match a query. Each file's columnar layout is optimized
for compression and for skipping columns that are not needed,
reducing read and decompression load.
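To see that skipping behaviour from the reader's side, here is a hedged PySpark sketch (paths, column names and data are hypothetical). With ORC filter pushdown enabled, Spark can hand the predicate to the ORC reader, which can then use the per-stripe min/max indexes to avoid decompressing blocks of rows that cannot match.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orc-skipping-demo")
    # Assumed setting: lets Spark push filters down to the ORC reader.
    .config("spark.sql.orc.filterPushdown", "true")
    .getOrCreate()
)

# Hypothetical data; in practice this would be a large table.
events = spark.createDataFrame(
    [(i, "type_%d" % (i % 5), i * 1.5) for i in range(1000)],
    ["event_id", "event_type", "amount"],
)

events.write.mode("overwrite").orc("/tmp/demo_orc")

# The filter below can be evaluated against ORC's lightweight indexes,
# so whole blocks of rows that cannot match are skipped.
spark.read.orc("/tmp/demo_orc") \
    .where("event_id > 900") \
    .show()
```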
Data stored in ORCFile can be read or written through
HCatalog, so any Pig or MapReduce process can play along seamlessly. Spark-based
data processing programs
can also benefit, since Spark SQL can integrate with HCatalog through its HiveContext.
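A hedged sketch of that integration, using the Spark 1.x HiveContext API that was current at the time. The table name is hypothetical, and a Hive metastore (the same catalog HCatalog exposes) must be reachable from Spark for this to work.

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="orc-hive-demo")

# HiveContext talks to the Hive metastore, so Hive-managed ORC tables
# become visible to Spark SQL queries.
hive_ctx = HiveContext(sc)

# 'sales_orc' is a hypothetical Hive table stored as ORC.
top_customers = hive_ctx.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM sales_orc
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""")

top_customers.show()
```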
We will also walk through this with an example in a coming blog post. The comparison of various formats and their
compression ratios is depicted in the diagram below.
Hive will handle all the details of the conversion to ORCFile, and you are free to delete the old
table to free up loads of space. When
you create an ORC Hive table, there are a
number of table properties you can use to
further tune the way ORC works.
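Here is a hedged sketch of that conversion, driven from PySpark for consistency with the earlier examples (the same statements can be run directly in Hive). The table names and columns are hypothetical, and the TBLPROPERTIES shown, such as orc.compress and orc.stripe.size, are just a few of the knobs ORC exposes; check the Hive/ORC documentation for the full list.

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="orc-convert-demo")
hive_ctx = HiveContext(sc)

# Hypothetical target table stored as ORC, with a few ORC table
# properties set for tuning.
hive_ctx.sql("""
    CREATE TABLE IF NOT EXISTS sales_orc (
        customer_id STRING,
        amount      DOUBLE,
        sale_date   STRING
    )
    STORED AS ORC
    TBLPROPERTIES (
        "orc.compress"     = "ZLIB",
        "orc.stripe.size"  = "67108864",
        "orc.create.index" = "true"
    )
""")

# Hive rewrites the data into ORC as part of the insert; afterwards the
# old text-format table can be dropped to reclaim space.
hive_ctx.sql("INSERT OVERWRITE TABLE sales_orc SELECT * FROM sales_text")
```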
File formats
compared
The following table summarizes the performance
among the Avro, ORC and Parquet formats
using metrics such as storage space (LOAD_MB_WRITTEN), LOAD_TIME, ANALYZE_TIME and QUERY_TIME on a 400 GB volume of data.
Hope this was interesting for you; next time we will discuss another important topic.