Developing PySpark with Jupyter Notebook
Jan 4, 2021
A Jupyter notebook brings a number of benefits to data engineering development:
- An interactive development environment
- Develop with mocked data, then translate the code into package code or tests
- Profile, tune, and understand your Spark jobs interactively
1. Prerequisites
- Install Jupyter
- Install and load the Spark libraries
- Add your virtual environment to your notebook as a kernel (see the sketch below)
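One common way to make a virtual environment available in Jupyter is to register it as a kernel with ipykernel. A minimal sketch, assuming the environment is already activated (the kernel name and display name below are placeholders):
pip install ipykernel
python -m ipykernel install --user --name my-venv --display-name "Python (my-venv)"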
2. Start your Jupyter notebook
run: jupyter notebook
First, we need to locate your PySpark installation path with findspark:
pip install findspark
# or add findspark to your requirements.in file

import findspark
findspark.init()
# todo: code here
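After findspark.init() has located Spark, you can create a SparkSession in the notebook as usual. A minimal sketch for a local development session (the master setting and app name are assumptions):
from pyspark.sql import SparkSession

# local SparkSession for interactive development; the app name is illustrative
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("notebook-dev")
    .getOrCreate()
)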
Import sibling packages from your project:
Depending on where you run your Jupyter notebook, you may come across the following error: No module named 'your_package_name'.
The cause can be a long story, but fortunately the solution is simple:
import sys
# make the parent directory (the project root) importable from the notebook
sys.path.insert(0, '..')
If you would like to understand more about how Python package imports work, check out this SO post: Sibling package imports.
Now, you should be able to use the Jupyter notebook with your project code on your local machine.
See more examples in the code repository:
3. Tune and profile your code with magic commands
Jupyter notebook magic commands:
%time — measure how long a statement takes to run
%prun — profile a statement with the Python profiler
%memit — measure the memory used by a statement (from the memory_profiler extension)
(check https://ipython.readthedocs.io/en/stable/interactive/magics.html for more)
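A minimal sketch of how these magics can be used against a toy DataFrame, assuming a SparkSession named spark and that the memory_profiler package is installed (the DataFrame below is illustrative):
%load_ext memory_profiler

df = spark.range(1_000_000)

%time df.count()   # wall-clock and CPU time of a single run
%prun df.count()   # cProfile breakdown of the driver-side Python calls
%memit df.count()  # peak driver memory used while running the statement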
4. Understand your DataFrames and ETL jobs with the following methods
.show() — preview rows of a DataFrame
.printSchema() — print column names and types
.explain() — print the query plan Spark will execute
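A minimal sketch on a small mocked DataFrame, assuming an existing SparkSession named spark (the column names and values are illustrative):
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("alice", 3), ("bob", 5), ("alice", 2)],
    ["name", "score"],
)
result = df.groupBy("name").agg(F.sum("score").alias("total_score"))

result.show()         # preview the resulting rows
result.printSchema()  # inspect column names and types
result.explain()      # review the plan Spark will execute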