Saturday, 30 April 2016

Installing Spark/PySpark


In this post, I will summarize the steps to install Spark and call its libraries from a standalone Python application. Hopefully this guide will help those who are just starting out and save them time on the problems I ran into during the installation. The installation was done on Ubuntu 14. Below are the steps:

1 - Installing Anaconda Python (optional): Ubuntu ships with Python 2.7. I usually use Anaconda, as it comes with many Python modules. If you decide to install Anaconda, make sure to change the PATH and PYTHONPATH environment variables to point to the Anaconda directory, for example as shown below.
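A minimal sketch, assuming Anaconda 2 was installed under your home directory (the anaconda2 path below is an assumption; adjust it to your setup):

     echo 'export PATH=$HOME/anaconda2/bin:$PATH' >> ~/.bashrc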

2 - Installing Java: I used Oracle Java from here. Download the archive and extract it into /usr/local/lib using this command:

     sudo tar -zxvf jre-8u77-linux-x64.tar.gz -C /usr/local/lib

Set the JAVA_HOME and PATH variables to point to the Java folder. Note that the archive extracts to a directory named after the JRE version (jre1.8.0_77 here), not after the archive name. Also note that sudo echo ... >> /etc/profile does not work, because the redirection is performed by your non-root shell; piping through sudo tee -a does:

echo 'export JAVA_HOME=/usr/local/lib/jre1.8.0_77' | sudo tee -a /etc/profile

echo 'export PATH=$PATH:$JAVA_HOME/bin' | sudo tee -a /etc/profile

Log out and log in again to activate the profile file.

Note: You need to download the correct version of Java for your system; otherwise you might get errors.
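To verify that the right Java is picked up, open a new shell and check the version:

     java -version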

3 - Installing Scala: Download Scala and extract it into /usr/local/lib using the following command:

     sudo tar -zxvf scala-2.11.8.tgz -C /usr/local/lib

Add SCALA_HOME to the profile file (SPARK_HOME is set in the next step):

echo 'export SCALA_HOME=/usr/local/lib/scala-2.11.8' | sudo tee -a /etc/profile

4 - Installing Spark: Download from here. Extract it into /usr/local/lib as below:

     sudo tar -zxvf spark-1.6.1-bin-hadoop2.6.tgz -C /usr/local/lib

Set SPARK_HOME and add Spark's bin and Python directories to $PATH and $PYTHONPATH. Note that the pyspark package lives under python/ (not lib/python), and importing it from a standalone Python process also needs the bundled py4j (Spark 1.6.1 ships py4j-0.9-src.zip; check python/lib if your version differs):

    echo 'export SPARK_HOME=/usr/local/lib/spark-1.6.1-bin-hadoop2.6' | sudo tee -a /etc/profile
    echo 'export PATH=$PATH:$SPARK_HOME/bin' | sudo tee -a /etc/profile
    echo 'export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip' | sudo tee -a /etc/profile

That is it! You can now start the PySpark shell from the command line by running pyspark.
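As a quick smoke test, the shell creates a SparkContext for you as sc, so a trivial job should run straight away:

    >>> sc.parallelize(range(10)).map(lambda x: x * x).sum()
    285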

5 - Setting up IPython for PySpark: It is quite convenient to use interactive Python (IPython) for testing pieces of code. Assuming you have IPython installed, you can create an IPython profile for Spark. A profile is a directory containing configuration and runtime files, such as logs, connection info for parallel apps, and your IPython command history.

ipython profile create pyspark

To test, start IPython with the new profile:

ipython --profile=pyspark
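On its own, the new profile knows nothing about Spark. One way to wire it up (a minimal sketch, assuming the paths from step 4; the file name 00-pyspark-setup.py is just a convention) is a startup file inside the profile that puts Spark's Python libraries on sys.path:

# ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py
import os
import sys

# Locate Spark; fall back to the install path used in step 4
spark_home = os.environ.get('SPARK_HOME', '/usr/local/lib/spark-1.6.1-bin-hadoop2.6')

# Make the pyspark package and the bundled py4j importable
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))

With this in place, import pyspark should work in any session started with this profile.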

6 - Installing the MySQL connector for Java: Download the MySQL connector for Java from here and extract the jar into /usr/share/java. You also need to edit Spark's config file conf/spark-defaults.conf (create it from conf/spark-defaults.conf.template if it does not exist), as follows. Note that the extracted jar is usually versioned (e.g. mysql-connector-java-5.1.xx-bin.jar), so adjust or symlink the name accordingly:

spark.driver.extraClassPath        /usr/share/java/mysql-connector-java.jar
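Alternatively, instead of editing spark-defaults.conf, the same class path can be passed on the command line when submitting an application (app.py is a hypothetical file name):

    spark-submit --driver-class-path /usr/share/java/mysql-connector-java.jar app.py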
 


7 - Simple Python application: Below is a simple Python application that calls the Spark modules. It opens a JDBC connection to a local MySQL database and retrieves a table as a Spark DataFrame.


from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext


def get_table(details):
    # Run Spark locally with a single worker thread
    conf = SparkConf().setMaster('local').setAppName('Test')
    sc = SparkContext(conf=conf)
    sqlc = SQLContext(sc)
    # Read the table over JDBC; load() is what actually builds the DataFrame
    table = sqlc.read.format("jdbc").options(
        url=details["url"],
        driver="com.mysql.jdbc.Driver",
        dbtable=details["table"],
        user=details["user"],
        password=details["password"]).load()
    return table


if __name__ == '__main__':
    conn_details = {"url": "jdbc:mysql://localhost:3306/employees",
                    "table": "employees",
                    "user": "user",
                    "password": "password"}
    table = get_table(conn_details)
    table.show()
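If the script is saved as, say, app.py (a hypothetical name), it can be run with spark-submit, which sets up the PySpark environment for you; with the PYTHONPATH from step 4 in place, running it with plain python should work as well:

    spark-submit app.py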
   


That is it!