Wednesday, August 19, 2015

Accessing Hive Tables using R for Data Mining



Hi friends, today we will be discussing yet another interesting topic. Analytics has advanced to the point where most organisations are investing in data mining and predictive/prescriptive analytics. Many data scientists use R as their preferred language for data mining, while in the Big Data world most of the data is processed in the Hadoop ecosystem. So let us see how we can connect to Hive tables from R and do some modeling.


Make sure you have installed R and the relevant packages for JDBC connections.

If R is already installed, make sure its Java settings are up to date:

    # R CMD javareconf


Now launch the R shell and install the 'rJava' and 'RJDBC' packages, which are needed for JDBC connections:

install.packages("rJava")

install.packages("RJDBC", dependencies = TRUE)


Once that completes, you are ready to start accessing data from Hive tables as shown below.


options( java.parameters = "-Xmx8g" )

library(DBI)
library(rJava)
library(RJDBC)


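The 'RJDBC' package depends on 'DBI', so we load that package as well. Note that the java.parameters option has to be set before 'rJava' is loaded, because the JVM heap size cannot be changed once the JVM has started.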


Set up the necessary jar files required for Hadoop and Hive.
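
As a rough sketch of this step, the jars can be collected into a classpath and handed to the JVM through 'rJava'. The two directories below are hypothetical placeholders, not paths from this post; point them at wherever the Hive and Hadoop client jars live on your machine.

```r
library(rJava)

# Hypothetical install locations (assumptions); adjust for your cluster or client machine.
hive_lib   <- "/usr/lib/hive/lib"
hadoop_lib <- "/usr/lib/hadoop"

# Collect every jar file found under the two directories.
cp <- c(
  list.files(hive_lib,   pattern = "\\.jar$", full.names = TRUE),
  list.files(hadoop_lib, pattern = "\\.jar$", full.names = TRUE, recursive = TRUE)
)

# Start the JVM (if it is not already running) and add the jars to its classpath.
.jinit()
.jaddClassPath(cp)
```

Calling .jclassPath() afterwards lists what the JVM can currently see, which is a quick way to confirm the jars were picked up.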


Assign the Hive driver.
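
A minimal sketch of the driver assignment, assuming the HiveServer2 JDBC driver class org.apache.hive.jdbc.HiveDriver (which matches the jdbc:hive2:// URL used in the connection). The jar path is a hypothetical placeholder, not taken from this post.

```r
library(RJDBC)

# Hypothetical path to the Hive JDBC jar (assumption); replace with your own.
hive_jar <- "/path/to/hive-jdbc-standalone.jar"

# Create the driver object that dbConnect() will use.
drv <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",
            classPath = hive_jar,
            identifier.quote = "`")
```

Setting identifier.quote to a backtick is a choice made here because Hive quotes identifiers with backticks; it can be left at its default if you never need quoted names.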


Define the JDBC connection as shown
 
     conn <- dbConnect(drv, "jdbc:hive2://<hostname>:10000/default", "username", "password")




Query the Hive table from R and assign the result to a variable. The result comes back as an R data frame, so we can carry on with any R operations as usual.
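
Something along these lines should work, assuming conn is the connection created with dbConnect above. The table and column names (call_center, call_length, units_sold) are made up for illustration and are not the actual names used in this post.

```r
library(RJDBC)

# `conn` is the connection object returned by dbConnect() earlier.
# Table and column names are hypothetical (assumptions).
calls <- dbGetQuery(conn, "SELECT call_length, units_sold FROM call_center")

# The result is an ordinary data frame, so the usual tools apply.
str(calls)
summary(calls)

# Close the connection once you are finished with it.
dbDisconnect(conn)
```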






I have call center data containing the length of each call and the number of units sold during the call. Let us run a regression model against this data.
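
Here is a sketch of what such a regression could look like. Since the actual table is not available here, the data frame below is simulated stand-in data, and the column names and the choice of minutes as the unit are assumptions. With real data you would use the data frame returned by the Hive query instead.

```r
# Simulated stand-in for the call center data (NOT the real data from Hive).
set.seed(42)
calls <- data.frame(call_length = round(runif(100, min = 1, max = 30), 1))
calls$units_sold <- rpois(100, lambda = 0.5 + 0.2 * calls$call_length)

# Simple linear regression: units sold as a function of call length.
fit <- lm(units_sold ~ call_length, data = calls)
summary(fit)

# Predict units sold for a few hypothetical call lengths (assumed to be in minutes).
predict(fit, newdata = data.frame(call_length = c(5, 15, 25)))
```

The summary shows the estimated slope for call_length, which is the quantity of interest if you want to know how sales move with call duration.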






Hope you found this interesting.

Thanks
Venkat
