PySpark 3 Installation on Ubuntu 20.04

John Di Zhang
Nov 15, 2020

--

A quick setup note for the upcoming PySpark 3 series

Dependencies

  • Java (version 11.x): sudo apt install default-jdk
  • Scala (version 2.x): sudo apt install scala
  • Spark package (version 3.0.x, built for Hadoop 3.2): wget https://apache.mirror.digitalpacific.com.au/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz (side note: EMR now supports Hadoop 3.2.1 too)

Setup and environment variables

Set up your Spark path:

  • tar xvf spark-3.0.1-bin-hadoop3.2.tgz (match the version you downloaded)
  • sudo mv spark-3.0.1-bin-hadoop3.2 /opt/spark
  • echo "export SPARK_HOME=/opt/spark" >> ~/.bashrc
  • echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.bashrc (single quotes here, so $PATH and $SPARK_HOME expand when the shell starts, not when you run echo)
  • echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.bashrc
  • source ~/.bashrc

Test run PySpark

If the environment variables are set correctly, running pyspark from any directory should drop you into an interactive shell showing the Spark 3.0.1 banner.

Set up a Databricks Community account

https://databricks.com/try-databricks

--


Written by John Di Zhang

a dad, a codesmith, a phd in process, a master of none
