In my previous post, I showed you how to set up Docker and run your first container quickly. I hope the short cheat sheet with the most necessary commands allowed you to play around and gain a first foothold in the container world. In this post, I aim to show you how easy it is to deploy a refined data science setup in a few minutes.
Remember Susan? I introduced her a while ago. Her team, our analytics wizards, got a new project to handle. A large retail chain wants a tailored AI suite that manages its assortment automatically and changes prices dynamically. It is a big challenge, not only from a business perspective but also technically. The initial analysis involves extensive preprocessing and exploration of massive transactional datasets, and traditional approaches fail at that volume of data. After multiple meetings and calls with the customer, Susan understands the technical landscape and the business requirements. She decides to use Apache Spark for both the preprocessing and the modeling parts of the project.
To deliver the first results quickly, Sue asks a DevOps engineer to set up a prototype environment for the data scientists. What would a guy like Will do? Yes, Docker is the right way to go!
Which image to choose?
Will knows he can deliver promptly if he finds the right image with a ready-made environment, so the team can start working on the data immediately. He uses the map of existing images created by Jupyter.
Source: www.jupyter-docker-stacks.readthedocs.io
Folks on Susan's team come from different backgrounds: some prefer to do the initial statistical reasoning in R before writing scalable code in Python, and some love to query the data with SQL without writing scripts. That would bring any DevOps engineer to all-spark-notebook.
The above image includes Python, R, and Apache Toree to meet the data science team's expectations.
Apache Toree is a kernel for the Jupyter Notebook platform that provides interactive access to Apache Spark. It also ships a set of magics that improve the experience of working with data coming from Spark tables.
How to?
Run docker pull jupyter/all-spark-notebook in your command line terminal. It should pull and install the all-spark-notebook image in your Docker environment.
Verify that the image is available by running docker images:
To access the Jupyter notebook with the relevant kernel, we first need to expose the Jupyter port and some Spark-specific ports:
-p 4040:4040 - open the SparkUI; this option maps port 4040 inside the Docker container to port 4040 on the host machine;
-p 4041:4041 - expose the next SparkUI port in line, which Spark uses if a second context is started while port 4040 is already taken;
-p 8888:8888 - open the Jupyter notebook.
$ docker run -d -p 8888:8888 -p 4040:4040 -p 4041:4041 jupyter/all-spark-notebook
The notebook server can now be accessed at http://localhost:8888/.
The required token can be found by accessing the running container directly. Navigate to the Docker Dashboard and open the CLI of the all-spark-notebook container.
Run the jupyter notebook list command and copy the displayed token into the Jupyter login page (in this example, 4ae1d1d372501bda543180715c1b2b090d15a3637557917b).
After logging in, you have access to the empty 'work' folder and to the list of available kernels you can base your notebooks on.
Now let's create a test notebook and run a straightforward piece of code to check whether Apache Spark works as expected.
from pyspark.sql import SparkSession
# Create a local Spark session inside the container
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
# Run a trivial SQL query to confirm that Spark executes jobs
spark.sql('SELECT "Test" as c1').show()
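With the session up, printing the Spark version bundled with the image is another quick check; this simply reuses the spark session created above:
# Optional check: print the Spark version that ships with the image
print(spark.version)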
Now you can check the SparkUI (http://localhost:4040) for the SQL query above; you will see it in the Completed jobs list. The default user in all Jupyter-based Docker images is 'jovyan'.
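From here, the team can start looking at the actual data. The snippet below is only a minimal sketch: it assumes a CSV extract named transactions.csv with a store_id column has been copied into the 'work' folder (both the file name and the column name are hypothetical, used purely for illustration).
from pyspark.sql import SparkSession
# Reuse (or create) the local Spark session inside the container
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
# Load a hypothetical sample of the transactional data from the work folder
df = spark.read.csv("/home/jovyan/work/transactions.csv", header=True, inferSchema=True)
# First look at the data: the inferred schema and a simple aggregation
df.printSchema()
df.groupBy("store_id").count().show(5)  # 'store_id' is an assumed column name
Reading the file with inferSchema and running the aggregation both trigger Spark jobs, so they show up in the SparkUI as well.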
Susan's team can now easily access the infrastructure and go ahead with the initial analysis and EDA. A lot of time was saved - there was no need to install separately:
an operating system;
JVM;
Scala;
Python;
Jupyter.
Bravo, DevOps team!