1. Sample startup repository

https://github.com/zdjohn/spark-setup-workshop

2. Setup dev dependency

tox is not just a testing tool. It also helps you set up a dev environment with common development libraries. Of course, you can update or change these based on your own preferences.
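
As an illustration, the dev environment in tox.ini could look roughly like the sketch below. This is a hedged, minimal example, not the exact configuration in the sample repository; the environment name and the tool list (pytest, flake8, black) are assumptions.

    [tox]
    envlist = dev
    skipsdist = true

    [testenv:dev]
    # hypothetical dev environment: one virtualenv with the project
    # requirements plus common development tools
    basepython = python3
    deps =
        -rrequirements.txt
        pytest
        flake8
        black
    commands =
        pytest {posargs}

Running tox -e dev (or just tox, given the envlist) creates a reusable virtualenv under .tox/dev/, which can be activated like any other venv and is what the next step refers to as the dev venv.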

3. Setup project dependency

note: please make sure your dev venv is activated before running the commands below

  • install dependencies: pip install -r requirements.txt (a sample requirements.txt sketch follows this list)
  • add the dev virtual environment to Jupyter Notebook: python -m ipykernel install --user --name=pyspark-sample
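
For orientation, a requirements.txt for this kind of project could list something like the lines below. The exact packages and pins in the sample repository may differ; this is only an assumed minimal set that the commands above rely on (pyspark for the jobs, jupyter plus ipykernel for the notebook kernel).

    pyspark==3.0.1
    jupyter
    ipykernel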

4. spark test run

  • jupyter notebook
  • spark-submit --master local[*] --deploy-mode client helloworld.py
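
For reference, a helloworld.py along these lines is enough to exercise the local setup. This is a sketch of what such a script could contain, not necessarily the file shipped in the sample repository:

    from pyspark.sql import SparkSession

    # create (or reuse) a local SparkSession
    spark = SparkSession.builder.appName("helloworld").getOrCreate()

    # build a tiny DataFrame and show it, just to prove the whole pipeline works
    df = spark.createDataFrame([("hello", 1), ("world", 2)], ["word", "count"])
    df.show()

    spark.stop()

The same script runs unchanged inside the notebook kernel registered above or through the spark-submit command shown in the list.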

5. package your project

  • run tox -e pack
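
The pack environment itself is defined in the repository's tox.ini. As a rough idea of what such an environment might do (an assumption, not the repository's actual configuration), it could build an artifact that spark-submit can ship to the cluster via --py-files:

    [testenv:pack]
    # hypothetical packaging environment: builds a wheel of the project
    # that can be handed to spark-submit with --py-files
    skip_install = true
    deps =
        build
    commands =
        python -m build --wheel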

--

A quick note for the upcoming pySpark 3 series

Dependency

  • Java (version 11.x): sudo apt install default-jdk
  • Scala (version 2.x): sudo apt install scala
  • Spark package (version 3.0.x, Hadoop 3.2): wget https://apache.mirror.digitalpacific.com.au/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz (side note: EMR now supports Hadoop 3.2.1)

Setup and environment variables

set up your Spark path:

  • tar xvf spark-3.0.1-bin-hadoop3.2.tgz (adjust the filename to the version you downloaded)
  • sudo mv spark-3.0.1-bin-hadoop3.2 /opt/spark
  • echo "export SPARK_HOME=/opt/spark" >> ~/.bashrc
  • echo "export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin" >> ~/.bashrc
  • echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.bashrc
  • source ~/.bashrc

Test run pySpark
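
Once the environment variables above are in place, a quick smoke test from the pyspark shell (or via spark-submit) confirms the install. A minimal check, assuming the default local master, could look like this:

    from pyspark.sql import SparkSession

    # start a local SparkSession and print the version to confirm that
    # SPARK_HOME and PYSPARK_PYTHON are picked up correctly
    spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
    print(spark.version)           # expect 3.0.1 for this download
    print(spark.range(5).count())  # expect 5

    spark.stop()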

Setup Databricks Community Accounts

https://databricks.com/try-databricks
