Wednesday, December 3, 2014

JSON Data processing using Apache Spark

Apache Spark is an open-source cluster computing engine built around speed, ease of use, and sophisticated analytics. Originally developed at UC Berkeley’s AMPLab, it can process data stored in Hadoop and can cache datasets in memory for interactive analysis: extract a working set, cache it, and query it repeatedly. With such a tool available for processing large amounts of data, let us explore how to load a JSON dataset and query it using Spark.

JSON (JavaScript Object Notation) has become a very popular format, in part because it is usually more compact than alternatives such as XML. It is used primarily to transmit data between a server and a web application. It is human-readable and can be processed in most programming languages.

The following example shows a possible JSON representation describing a person.
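
A minimal illustrative sketch of such a record is given here; the field names and values are hypothetical and chosen only to show nested objects, arrays, numbers, and booleans:

```json
{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "isAlive": true,
  "address": {
    "city": "New York",
    "state": "NY"
  },
  "phoneNumbers": [
    { "type": "home", "number": "212 555-1234" },
    { "type": "office", "number": "646 555-4567" }
  ]
}
```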

Let us take a dataset in JSON format and process it using Apache Spark. For this article I have taken a stocks dataset (stocks.json). Load the JSON file into HDFS.
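
The upload is usually done from a terminal with `hdfs dfs -put <local file> <HDFS destination>`. The same step can be sketched in Scala with the Hadoop FileSystem API. Both paths below are hypothetical placeholders, and the sketch assumes the Hadoop client libraries and cluster configuration files are on the classpath:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Picks up core-site.xml / hdfs-site.xml from the classpath (assumption).
val fs = FileSystem.get(new Configuration())

// Hypothetical local source and HDFS destination; adjust to your environment.
val localFile = new Path("stocks.json")
val hdfsTarget = new Path("/user/hypothetical/stocks.json")

// Copy the local file into HDFS, leaving the local copy in place.
fs.copyFromLocalFile(localFile, hdfsTarget)
```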

Start the Spark shell and create the Spark SQL context.
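
The shell is typically launched with `bin/spark-shell` from the Spark installation directory. A minimal sketch of creating the context, assuming the Spark 1.x API that the rest of this post uses (`sqlContext` is the name the later query relies on):

```scala
// Inside spark-shell, `sc` (the SparkContext) is already defined.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
```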

Load the stocks.json file into a Spark RDD.
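
A minimal sketch of the load step, assuming Spark 1.x, where `jsonFile` reads a file containing one JSON object per line and infers the schema. The HDFS path is a hypothetical placeholder; `jsonData` is the name used in the commands that follow:

```scala
// Hypothetical HDFS path; replace with wherever stocks.json was uploaded.
val jsonData = sqlContext.jsonFile("/user/hypothetical/stocks.json")
```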

The file is loaded as a SchemaRDD. You can print its inferred schema with the following command:

jsonData.printSchema

The next step is to register the SchemaRDD as a table in Spark:

jsonData.registerTempTable("stocks")

Now we are ready to query the data. Let us write a SQL statement against the registered table "stocks".

sqlContext.sql("select Ticker,Volume,Price,PEG,Sector from stocks limit 10").collect.foreach(println)

The query returns only 10 rows because of the limit clause in the select statement. Without an order by clause these are simply the first 10 rows Spark reads, not a ranked top 10.

[A,1847978,50.44,2.27,Healthcare]
[AA,14600992,9.02,2.06,Basic Materials]
[AADR,6660,36.4,null,Financial]
[AAIT,250,29.95,null,Financial]
[AAMC,543,597.98,null,Financial]
[AAME,17754,4.03,null,Financial]
[AAN,1190872,30.15,2.14,Services]
[AAOI,12203,12.28,null,Technology]
[AAON,43938,27.57,3.48,Industrial Goods]
[AAP,220670,98.92,1.39,Services]
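
The commands above use the Spark 1.x API. In Spark 2.0 and later, `SchemaRDD` has been replaced by `DataFrame`, `jsonFile` by `read.json`, and `registerTempTable` by `createOrReplaceTempView`. A rough equivalent of the same steps on a newer release might look like the sketch below; the application name and the HDFS path are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell 2.x+ a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().appName("StocksJson").getOrCreate()

// Hypothetical HDFS path; the file is expected to hold one JSON object per line.
val jsonData = spark.read.json("/user/hypothetical/stocks.json")

// Print the inferred schema, then expose the data to SQL under the name "stocks".
jsonData.printSchema()
jsonData.createOrReplaceTempView("stocks")

// Same query as above, run through the session instead of a SQLContext.
spark.sql("select Ticker, Volume, Price, PEG, Sector from stocks limit 10")
  .collect()
  .foreach(println)
```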

Hope this is useful.