In this post, I will summarize the steps to install Spark and call its libraries from a standalone Python application. Hopefully this guide will help those who are just starting out and save them time solving the problems I ran into during installation. The installation was done on Ubuntu 14. Below are the steps:
1 - Installing Anaconda Python (optional): Ubuntu ships with Python 2.7. I usually use Anaconda for Python, as it comes with many Python modules. If you decide to install Anaconda, make sure to change the PATH and PYTHONPATH environment variables to point to the Anaconda directory.
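As a sketch of that variable change, assuming Anaconda 2 was installed to the default location in your home directory (ANACONDA_HOME and the python2.7 subdirectory below are assumptions; adjust them to your actual install path):

```shell
# Assumed Anaconda install location -- change if you installed elsewhere
export ANACONDA_HOME="$HOME/anaconda2"
# Put Anaconda's python first on PATH so it shadows the system Python
export PATH="$ANACONDA_HOME/bin:$PATH"
# Make Anaconda's packages importable (python2.7 subdir is an assumption)
export PYTHONPATH="$ANACONDA_HOME/lib/python2.7/site-packages:$PYTHONPATH"
```

Add the same lines to your shell profile to make the change permanent.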
2 - Installing Java: I used Oracle Java from here. Download the file and extract it into /usr/local/lib using this command:
sudo tar -zxvf jre-8u77-linux-x64.tar.gz
Set the PATH and JAVA_HOME variables to point to the Java folder. Note that sudo echo "..." >> /etc/profile does not actually work, because the redirection runs in your own (non-root) shell; pipe through sudo tee -a instead, and use single quotes so $PATH is written literally rather than expanded at write time:
echo 'export PATH=$PATH:/usr/local/lib/jre-8u77-linux-x64/bin' | sudo tee -a /etc/profile
echo 'export JAVA_HOME=/usr/local/lib/jre-8u77-linux-x64' | sudo tee -a /etc/profile
Log out and log in again (or run source /etc/profile) to activate the profile changes.
Note: you need to download the correct Java build for your system; otherwise you might get errors.
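To check which build you need before downloading, you can look at your machine architecture:

```shell
# Print the machine architecture so you can pick the matching Java build
# (x86_64 -> the linux-x64 tarball used above; i686 -> the 32-bit build)
uname -m
```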
3 - Installing Scala: Download Scala, place it in /usr/local/lib, and extract the tar file using the following command:
sudo tar -zxvf scala-2.11.8.tgz
Add SCALA_HOME to the profile file (again via sudo tee):
echo 'export SCALA_HOME=/usr/local/lib/scala-2.11.8' | sudo tee -a /etc/profile
4 - Installing Spark: Download it from here and extract it into /usr/local/lib as below:
sudo tar -zxvf spark-1.6.1-bin-hadoop2.6.tgz
Add the Spark bin and Python directories to $PATH and $PYTHONPATH (via sudo tee, since /etc/profile is owned by root):
echo 'export PATH=$PATH:/usr/local/lib/spark-1.6.1-bin-hadoop2.6/bin' | sudo tee -a /etc/profile
echo 'export PYTHONPATH=$PYTHONPATH:/usr/local/lib/spark-1.6.1-bin-hadoop2.6/python' | sudo tee -a /etc/profile
That's it! You can now start the Spark Python shell from the command line with pyspark.
5 - Setting up IPython for pyspark: It is quite convenient to use interactive Python (IPython) for testing pieces of code. Assuming you have IPython installed, you can create an IPython profile for Spark. A profile is a directory containing configuration and runtime files, such as logs, connection info for parallel apps, and your IPython command history.
ipython profile create pyspark
To test it, start IPython with:
ipython --profile=pyspark
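For the profile to actually set up Spark, a common approach is to drop a startup script into the profile's startup directory, which IPython runs on every launch. The sketch below writes such a script; the file name, the Spark path (taken from the layout above), and the py4j zip name are assumptions to adjust for your install:

```shell
# Create the startup directory of the pyspark profile (created by
# `ipython profile create pyspark`) and add a Spark bootstrap script
STARTUP_DIR="$HOME/.ipython/profile_pyspark/startup"
mkdir -p "$STARTUP_DIR"
cat > "$STARTUP_DIR/00-pyspark-setup.py" <<'EOF'
# Runs at IPython startup for this profile: expose a SparkContext as `sc`
import os
import sys

spark_home = '/usr/local/lib/spark-1.6.1-bin-hadoop2.6'  # adjust to your install
sys.path.insert(0, os.path.join(spark_home, 'python'))
# py4j zip name varies with the Spark version -- check python/lib/
sys.path.insert(0, os.path.join(spark_home, 'python', 'lib', 'py4j-0.9-src.zip'))

from pyspark import SparkConf, SparkContext
sc = SparkContext(conf=SparkConf().setMaster('local').setAppName('ipython'))
EOF
```

After this, ipython --profile=pyspark starts with sc already defined.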
6 - Installing the MySQL connector for Java: Download the MySQL Connector/J jar from here and extract it into /usr/share/java. You also need to change Spark's config file spark-defaults.conf, as follows:
spark.driver.extraClassPath /usr/share/java/mysql-connector-java.jar
7 - A simple Python application: Below is a simple Python application that calls Spark modules. It opens a connection to a local database and retrieves a table.
from pyspark import SparkConf, SparkContext, SQLContext

def get_table(details):
    conf = SparkConf().setMaster('local').setAppName('Test')
    sc = SparkContext(conf=conf)
    sqlc = SQLContext(sc)
    # .options(...) only configures the reader; .load() actually reads the table
    table = sqlc.read.format("jdbc").options(
        url=details["url"],
        driver="com.mysql.jdbc.Driver",
        dbtable=details["table"],
        user="user",
        password="password").load()
    return table

if __name__ == '__main__':
    conn_details = {'url': "jdbc:mysql://localhost:3306/employees", 'table': "employees"}
    table = get_table(conn_details)
That's it!