1. Sample startup repository
https://github.com/zdjohn/spark-setup-workshop
2. Setup dev dependencies
- python3
- pip3
- install tox
pip install tox
(ref: https://tox.readthedocs.io/en/latest/index.html)
tox is not just a testing tool. It also helps you set up a dev environment with common development libraries. Of course, you can update or change them based on your own preference.
- run
tox -e dev
Dev dependencies are configured inside the tox.ini file: https://github.com/zdjohn/spark-setup-workshop/blob/master/tox.ini
- source the tox dev virtual environment:
source .tox/dev/bin/activate
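Once the dev venv is activated, a quick sanity check from the Python REPL confirms you are really on it (just a sketch; the exact path depends on where you ran tox):
import sys

# after `source .tox/dev/bin/activate` the interpreter comes from the dev venv
print(sys.prefix)      # expect a path ending in .tox/dev
print(sys.executable)  # expect .../.tox/dev/bin/python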
3. Setup project dependencies
note: please make sure your dev venv is now activated
- install dependencies:
pip install -r requirements.txt
- add the dev virtual environment to jupyter notebook:
python -m ipykernel install --user --name=pyspark-sample
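With the kernel registered, a first notebook cell is a good place to confirm the notebook runs on the dev venv and that pyspark imports cleanly (assuming pyspark is one of the packages pinned in requirements.txt):
# run in a notebook cell using the pyspark-sample kernel
import sys
import pyspark

print(sys.executable)      # should point into the .tox/dev virtual environment
print(pyspark.__version__)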
4. Spark test run
jupyter notebook
spark-submit --master local[*] --deploy-mode client helloworld.py
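helloworld.py comes from the sample repository; if you want to roll your own, a minimal submittable script looks roughly like this (app name and data are just placeholders):
from pyspark.sql import SparkSession

# minimal sketch of a script runnable via spark-submit
spark = SparkSession.builder.appName("helloworld").getOrCreate()

df = spark.createDataFrame([("hello", 1), ("world", 2)], ["word", "count"])
df.show()

spark.stop()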
5. Package your project
- run
tox -e pack
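What pack does is defined in the same tox.ini. Assuming it produces a zip of your project and its dependencies (the file name below is hypothetical), one way to make that archive importable on the executors is from the driver code itself:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("packed-job").getOrCreate()

# hypothetical artifact name - point this at whatever `tox -e pack` actually generates
spark.sparkContext.addPyFile("dist/pyspark_sample_deps.zip")
Passing the same zip to spark-submit via --py-files achieves the same thing at submit time.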
A quick note for the upcoming pySpark 3 series
Dependencies
- Java (version 11.x)
sudo apt install default-jdk
- Scala (version 2.x)
sudo apt install scala
- spark package (version 3.0.x, hadoop 3.2)
wget https://apache.mirror.digitalpacific.com.au/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
(side note: EMR now supports Hadoop 3.2.1)
Setup and environment variables
set up your spark path:
tar xvf spark-3.0.1-bin-hadoop3.2.tgz
(check the version that you downloaded)
sudo mv spark-3.0.1-bin-hadoop3.2 /opt/spark
echo "export SPARK_HOME=/opt/spark" >> ~/.bashrc
echo "export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin" >> ~/.bashrc
echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.bashrc
source ~/.bashrc
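A quick way to confirm the new shell picked up the variables is from python3 itself:
import os

# these should match what was appended to ~/.bashrc above
print(os.environ.get("SPARK_HOME"))      # expect /opt/spark
print(os.environ.get("PYSPARK_PYTHON"))  # expect /usr/bin/python3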
Test run pySpark
pyspark
- open http://localhost:4040/
- to quit:
quit()
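Before you quit(), it is worth running a tiny job so the UI on port 4040 has something to show; spark here is the SparkSession the pyspark shell creates for you:
# typed at the pyspark >>> prompt
df = spark.range(1000).selectExpr("sum(id) as total")
df.show()  # the completed job then appears under http://localhost:4040/jobs/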