Developing PySpark with Jupyter Notebook

John Di Zhang
1 min read · Jan 4, 2021


Jupyter Notebook brings a number of benefits to data engineering development:

  1. An interactive development environment
  2. Developing against mocked data, then translating notebook code into package code or tests
  3. Profiling, tuning, and understanding your Spark job interactively

1. Prerequisites

  • Install Jupyter
  • Install findspark to locate and load your local Spark libraries
  • Add your virtual environment to your notebook as a kernel (see the sketch below)
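
A minimal setup sketch, assuming you are working inside an activated virtual environment (the kernel name my-venv is just a placeholder):

pip install jupyter findspark ipykernel

# register the current virtual environment as a named Jupyter kernel
python -m ipykernel install --user --name my-venv --display-name "my-venv (pyspark)"

After this, the kernel shows up in the notebook's kernel picker under the display name you chose.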

2. Start your Jupyter notebook

Run: jupyter notebook

First, we need to locate your PySpark installation with findspark:

pip install findspark
# or add findspark to your requirements.in file

import findspark

findspark.init()  # locates your Spark installation and makes pyspark importable
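
Once findspark.init() has run, pyspark can be imported like any other package. Here is a minimal sketch for spinning up a local SparkSession in the notebook (the app name is arbitrary):

import findspark

findspark.init()  # must run before importing pyspark

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")       # run Spark locally on all available cores
    .appName("notebook-dev")
    .getOrCreate()
)
print(spark.version)          # quick sanity check that the session is up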

Import sibling packages from your project:

Depending on where you run your Jupyter notebook, you may come across the following error: No module named 'your_package_name'. The cause can be a long story, but fortunately the solution is simple:

import sys

sys.path.insert(0, '..')  # add the project root (one level up) to the module search path
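
With the project root on sys.path, sibling package imports should now resolve; for example (the package and module names below are hypothetical):

from my_package.jobs import my_etl_job  # replace with your own package and module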

If you would like to understand more about how Python package imports work, check out this SO post: Sibling package imports

Now you should be able to use the Jupyter notebook on your local machine.

See more examples in the code repository:

zdjohn/spark-setup-workshop

3. Tune and profile your code with magic commands

Useful Jupyter notebook magic commands:

  • %time
  • %prun
  • %memit (requires the memory_profiler IPython extension)

(check https://ipython.readthedocs.io/en/stable/interactive/magics.html for more)
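
As a quick sketch, assuming a DataFrame named df already exists in the session (these magics profile the driver-side Python code, and %memit comes from the memory_profiler package):

%load_ext memory_profiler     # provides %memit (pip install memory_profiler)

%time df.count()              # wall-clock time of a single Spark action
%prun df.count()              # Python-level profile of the driver-side code
%memit df.toPandas()          # peak driver memory while collecting the result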

4. Understand your DataFrames and ETL jobs with the following methods (see the example below)

  • .show()
  • .printSchema()
  • .explain()
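
For example, on a small local DataFrame (a self-contained sketch; the column names and data are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("inspect-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

df.show()          # print the first rows as a formatted table
df.printSchema()   # print column names, types, and nullability
df.explain()       # print the physical plan Spark will execute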
