Quickly Make a PySpark Session in Google Colab

Aaron Lee
3 min read · Oct 21, 2020
Photo by Markus Winkler on Unsplash

This is a quick start example of how to create a basic PySpark session in Google Colab. This is not a tutorial on using PySpark or Colab, just a quick example to get you up and running.

Why I use Google Colab for PySpark projects:

  • PySpark can be challenging to set up on some computers. I found Colab far easier to get ‘up and running’ than other options like Docker.
  • All the major libraries are already installed and ready to be imported. (Scikit-learn, Matplotlib, TensorFlow etc.)
  • You can run bash commands (just add ! before your command); see the short example after this list.
  • If you are already familiar with using ipynb files in Jupyter Notebook, you’ll feel right at home.
  • Google Colab gives you access to a GPU.
  • It’s free!
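
For example, any shell command works in a Colab cell once you prefix it with ! (the package name below is just a placeholder):

!ls /content                  # list the files in your Colab workspace
!pip install -q some-package  # install an extra library for this session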

Some downsides of Google Colab:

  • Since you are using a VM on a server, files are not persistent. You have to reload your data files for each session (or after a period of inactivity), and you have to reinstall your libraries each time you start a session. That’s a small cost for so much upside, though; I find my projects run faster and more reliably in Colab compared to Jupyter Notebook.
  • It is a little extra work to use a Colab file with GitHub. However, if you are using Google Drive, it is made fairly easy.

My Quickstart Code

The link below is a project which gets a simple PySpark session up and running.

Below is the explanation of the code contained in the link.

1) Open a new Google Colab notebook.

You can simply make a copy of the Colab project in the link above if you’d like. Colab files are saved in Google Drive.

By default, Google Colab sets you up to use a CPU. If you want to take advantage of a GPU (best for Deep Learning or AI), simply click on Edit>Notebook Settings>GPU inside Colab. It’s that easy.

2) Run bash commands inside Colab to install Java, Spark, and Findspark.

To use Google Colab, we have to install packages before we start each session. It’s quick and easy. The code below uses bash commands directly from Colab to install Apache Spark 2.4.7, Java 8, and Findspark. (Note: this is an older version of Spark that works well with my setup; you can use a newer version like 3.0.0 by altering the bash commands, as sketched after the snippet below.)

!apt-get install openjdk-8-jdk-headless -qq > /dev/null   # install Java 8
!wget -q https://downloads.apache.org/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz   # download Spark 2.4.7
!tar xf spark-2.4.7-bin-hadoop2.7.tgz   # unpack it into /content
!pip install -q findspark   # install findspark
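
If you want a newer release, the same download-and-unpack lines work with the version string swapped out. A rough sketch for Spark 3.0.0 (I haven’t pinned the exact filename; check the Apache download page, since older builds sometimes move to the archive):

!wget -q https://downloads.apache.org/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz   # newer Spark build
!tar xf spark-3.0.0-bin-hadoop2.7.tgz
# then point SPARK_HOME at /content/spark-3.0.0-bin-hadoop2.7 in step 3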

3) Set environment variables

This code tells Python where your Java and Spark installs live. Make sure the versions match what you installed above. If you have problems, look at the Files tab in Colab (or list the directories from a cell, as sketched below) to identify where each package ended up.

import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-1.8.0-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"

4) Start Findspark

Findspark is an aptly named library that lets Python easily find Spark. It just makes our lives easier, so we use it.

import findspark

findspark.init()
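
If init() ever has trouble locating Spark, findspark also accepts the install path as an argument (an optional fallback; not needed when SPARK_HOME is set as above):

findspark.init("/content/spark-2.4.7-bin-hadoop2.7")   # explicit path instead of relying on SPARK_HOME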

5) Make a SparkSession

This is the big step that actually creates the PySpark session in Google Colab. It builds a local Spark context on the Colab VM and wraps it in a SparkSession assigned to the variable spark.

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext('local[*]')   # run Spark locally, using all available cores
spark = SparkSession(sc)        # wrap the context in a SparkSession
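
As a quick sanity check (optional, and the tiny DataFrame below is just made-up sample data), you can confirm the session works:

print(spark.version)   # should print 2.4.7, or whichever version you installed
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])   # toy data
df.show()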

That’s it. You now have a working Spark session. You can now upload your data and start using Spark for machine learning; a common pattern is sketched below. The link has a few extra lines to help you get started.
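
For example, one way to pull a CSV into a Spark DataFrame is to upload it through Colab’s file widget and read it with spark.read (my_data.csv is just a placeholder name):

from google.colab import files
files.upload()   # opens a file picker; uploads land in /content

df = spark.read.csv("my_data.csv", header=True, inferSchema=True)   # placeholder filename
df.printSchema()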

Good luck!
