Developing PySpark with Jupyter Notebook

John Di Zhang
1 min read · Jan 4, 2021


Jupyter Notebook brings a number of benefits to data engineering development:

  1. An interactive development environment
  2. Developing against mocked data, then translating notebook code into package code or tests
  3. Profiling, tuning, and understanding your Spark job interactively

1. Prerequisites

  • Install Jupyter
  • Install findspark to locate and load your local Spark libraries
  • Add your virtual environment to your notebook as a kernel (see the sketch below)
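
A minimal setup sketch, assuming you are working inside an activated virtual environment (the kernel name my-venv is just a placeholder):

pip install jupyter findspark ipykernel

# register the current virtual environment as a named Jupyter kernel
python -m ipykernel install --user --name my-venv --display-name "my-venv (pyspark)"

After this, the kernel shows up in the notebook's kernel picker under the display name you chose.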

2. Start your Jupyter notebook

Run: jupyter notebook

First, we need to locate your PySpark installation with findspark:

pip install findspark
# or add findspark to your requirements.in file

import findspark

findspark.init()  # locates your Spark installation and makes pyspark importable
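
Once findspark.init() has run, pyspark can be imported like any other package. Here is a minimal sketch for spinning up a local SparkSession in the notebook (the app name is arbitrary):

import findspark

findspark.init()  # must run before importing pyspark

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")       # run Spark locally on all available cores
    .appName("notebook-dev")
    .getOrCreate()
)
print(spark.version)          # quick sanity check that the session is up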

Import sibling packages from your project:

Depending on where you run your Jupyter notebook, you may come across the following error: No module named 'your_package_name'. The cause can be a long story, but fortunately the solution is simple:

import sys

sys.path.insert(0, '..')  # add the project root (one level up) to the module search path
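
With the project root on sys.path, sibling package imports should now resolve; for example (the package and module names below are hypothetical):

from my_package.jobs import my_etl_job  # replace with your own package and module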

If you would like to understand more about how Python package imports work, check out this SO post: Sibling package imports

Now you should be able to use the Jupyter notebook on your local machine.

See more examples in the code repository:

zdjohn/spark-setup-workshop

3. Tune and profile your code with magic commands

Useful Jupyter notebook magic commands:

  • %time
  • %prun
  • %memit (requires the memory_profiler IPython extension)

(check https://ipython.readthedocs.io/en/stable/interactive/magics.html for more)
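
As a quick sketch, assuming a DataFrame named df already exists in the session (these magics profile the driver-side Python code, and %memit comes from the memory_profiler package):

%load_ext memory_profiler     # provides %memit (pip install memory_profiler)

%time df.count()              # wall-clock time of a single Spark action
%prun df.count()              # Python-level profile of the driver-side code
%memit df.toPandas()          # peak driver memory while collecting the result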

4. Understand your DataFrames and ETL jobs with the following methods (see the example below)

  • .show()
  • .printSchema()
  • .explain()
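
For example, on a small local DataFrame (a self-contained sketch; the column names and data are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("inspect-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

df.show()          # print the first rows as a formatted table
df.printSchema()   # print column names, types, and nullability
df.explain()       # print the physical plan Spark will execute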
