pySpark 3 Ubuntu 20.04 Installation
Nov 15, 2020
A quick note for the upcoming pySpark 3 series
Dependencies
- Java (version 11.x)
sudo apt install default-jdk
- Scala (version 2.x)
sudo apt install scala
- spark package (version 3.0.x, hadoop 3.2)
wget https://apache.mirror.digitalpacific.com.au/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
(side note: EMR now supports Hadoop 3.2.1 as well)
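Before moving on to Spark itself, it is worth confirming the dependencies landed on your PATH. A small sketch (the loop and messages below are my own, not from the apt packages):

```shell
# Check that java and scala are reachable before setting up Spark.
for tool in java scala; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found at $(command -v "$tool")"
  else
    echo "$tool: NOT FOUND - install it before continuing"
  fi
done
```

If the apt packages above installed cleanly, `java -version` should report 11.x and `scala -version` should report 2.x.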
Setup and environment variables
set up your Spark path:
tar xvf spark-3.0.1-bin-hadoop3.2.tgz
(adjust the folder name to match the version you downloaded)
sudo mv spark-3.0.1-bin-hadoop3.2 /opt/spark
echo "export SPARK_HOME=/opt/spark" >> ~/.bashrc
echo "export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin" >> ~/.bashrc
echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.bashrc
source ~/.bashrc
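After these commands, ~/.bashrc should end with lines like the following. Note that the PATH line must be echoed in single quotes so `$PATH` and `$SPARK_HOME` stay literal and are expanded at login time, not at the moment you run echo (SPARK_HOME is not yet set in the shell that runs the echo):

```
# Appended to ~/.bashrc by the echo commands above
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=/usr/bin/python3
```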
Test run pySpark
pyspark
- open http://localhost:4040/ to see the Spark web UI
- to quit:
quit()
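Once the shell comes up, a quick sanity check you can type at the `>>>` prompt (the `spark` SparkSession object is pre-created by the pyspark shell; `range` and `count` are standard DataFrame API calls):

```python
>>> df = spark.range(5)   # DataFrame with a single `id` column, values 0..4
>>> df.count()
5
```

If this returns 5 and the job shows up at http://localhost:4040/, the installation is working.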