Data wrangling is a crucial stage in the data science workflow – or in any workflow that starts from raw data and hopes to arrive at business insight, and perhaps at well-trained machine learning models. Data wrangling encompasses many steps and activities: gathering raw data, exploring, validating, filtering, filling in missing values, joining, enriching, aggregating, reshaping and unifying. There may be no clear break between gathering, preparing, exploring, visualizing and modelling the data; these are iterative steps, with the data wrangler moving back and forth in the process.
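To make those steps concrete, here is a minimal pandas sketch – with made-up data – that fills in missing values, filters, enriches via a join and aggregates:

```python
import pandas as pd

# Hypothetical raw data; one amount is missing
orders = pd.DataFrame({
    "customer": ["ann", "bob", "ann", "cat"],
    "amount": [10.0, None, 25.0, 5.0],
})
customers = pd.DataFrame({
    "customer": ["ann", "bob", "cat"],
    "region": ["EU", "US", "EU"],
})

# Fill missing values, filter out zero-value rows,
# enrich by joining in the region, then aggregate
orders["amount"] = orders["amount"].fillna(0.0)
orders = orders[orders["amount"] > 0]
enriched = orders.merge(customers, on="customer")
per_region = enriched.groupby("region")["amount"].sum()
print(per_region.to_dict())  # {'EU': 40.0}
```

In a real notebook each of these lines would typically live in its own cell, with exploration and visualization interleaved.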
One of the most used data wrangling workbenches is the Jupyter Notebook – most commonly powered by a Python engine and leveraging the many libraries and frameworks that make the Python ecosystem such an attractive place for data professionals. The number of resources outlining the use of Jupyter Notebooks and demonstrating its power is vast. So I won’t go there. I will however discuss how you can quickly get going with your very own Jupyter Notebook environment, running in a Docker container either locally on your laptop or on a generic container platform located anywhere you want.
Note: this article is used as instruction for participants in the Conclusion Data Wrangling Meetup on February 21st.
Note 2: as an alternative to your own installation of Jupyter Notebook, you can make use of this free Katacoda scenario that takes you by the hand through the installation steps and provisions an environment in the cloud, accessible from the comfort of your own browser. This is probably the best balance between having full control and not having to do too much yourself.
Note 3: instead of running your own Jupyter Notebook environment, you can also make use of https://jupyter.org/try – and get a temporary Jupyter server just for you, running on mybinder.org. This is a great playground. Unfortunately, for a workshop it cannot really be relied upon: when demand peaks, environments are simply not available.
Sources are in the GitHub repo: https://github.com/lucasjellema/DataAnalytics--IntroductionDataWrangling-JupyterNotebooks
In this article, I will assume that you can start with a Docker host – an environment where you can run Docker containers. My steps will be on Linux; if you are on Windows running Docker for Windows, you will need to translate them to their Windows counterparts. Or, alternatively, like I did, use the combination of VirtualBox and Vagrant to manage a Linux Ubuntu VM with Docker inside and keep your Windows environment uncluttered.
To quickly get going with that combination follow the instructions in this article: https://technology.amis.nl/2018/05/21/rapidly-spinning-up-a-vm-with-ubuntu-and-docker-on-my-windows-machine-using-vagrant-and-virtualbox/.
Note: you may have to install two Vagrant plugins – one to provision docker-compose in the VM and one to allocate a larger-than-default disk size:
- vagrant plugin install vagrant-docker-compose
- vagrant plugin install vagrant-disksize
Running Jupyter on Docker
I will assume that you are at the command line with Linux at your fingertips and Docker running in the background. `docker ps` should return no running containers at this point.
Many container images are available that contain Jupyter Notebooks in some shape or form. I will use the jupyter/scipy-notebook image from Jupyter Docker Stacks. See this article for details on this image and the other images they make available. On the website, it reads:
Jupyter Docker Stacks are a set of ready-to-run Docker images containing Jupyter applications and interactive computing tools. You can use a stack image to do any of the following (and more):
- Start a personal Jupyter Notebook server in a local Docker container
- Run JupyterLab servers for a team using JupyterHub
- Write your own project Dockerfile
The jupyter/scipy-notebook image is a fairly rich image. It contains:
- Minimally-functional Jupyter Notebook server
- Miniconda Python 3.6
- Pandoc and TeX Live for notebook document conversion
- git, emacs, jed, nano, and unzip
- pandas, numexpr, matplotlib, scipy, seaborn, scikit-learn, scikit-image, sympy, cython, patsy, statsmodels, cloudpickle, dill, numba, bokeh, sqlalchemy, hdf5, vincent, beautifulsoup, protobuf, and xlrd packages
- ipywidgets for interactive visualizations in Python notebooks
- Facets for visualizing machine learning datasets
We will add a few other libraries on top of this ‘base’ image.
As a first step, run a Docker container based on the image:
docker run -p 8888:8888 -d --name jupyter jupyter/scipy-notebook:83ed2c63671f
Note: the Docker image tag (id) is not strictly necessary; if you strip it off (jupyter/scipy-notebook) you will get the latest image – which may do everything you need. This particular tag is from early February 2019 and works for me. See all Docker image tags: https://hub.docker.com/r/jupyter/scipy-notebook/tags/ .
The container image is quite sizable – close to 2 GB. Downloading is bound to take a while – depending on the network capacity you can leverage.
When downloading and extracting is complete for all layers, the container will be running. It exposes port 8888. The Jupyter server is accessible at that port.
Access the Jupyter Notebook environment from a browser on your laptop; the endpoint depends on the IP address of the host running the Docker container. In my case, using the Vagrant file in the GitHub repo associated with this article, I will access the Jupyter Notebook at: http://192.168.188.144:8888 .
The Jupyter server will prompt you for a token – to ensure not just anyone can access the environment. To find the token, execute this statement while the container is running:
docker logs jupyter
This will show something like:
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
The token is the value behind `/?token=`. You need that for logging in.
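If you prefer to extract the token programmatically rather than eyeball the log output, a small sketch like this does the trick – the log line and token value below are made up for illustration:

```python
import re

# An example of the URL line printed by `docker logs jupyter`
# (the token value here is fabricated)
log_line = "http://127.0.0.1:8888/?token=4a1b2c3d4e5f60718293a4b5c6d7e8f9"

# The token is the hexadecimal value behind ?token=
match = re.search(r"\?token=([0-9a-f]+)", log_line)
token = match.group(1) if match else None
print(token)  # 4a1b2c3d4e5f60718293a4b5c6d7e8f9
```

You could pipe the real output of `docker logs jupyter` through a script like this, but copy and paste works just as well.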
After pasting the token, click on the Log In button:
At this point, you can start creating your own notebook or upload a notebook from your laptop’s file system. The container currently does not contain any Jupyter Notebooks that we can open and run. We will change that in our next section.
Adding Python Libraries and Sample Notebooks
This section shows you how to add Python libraries to the running Jupyter container. This is not particularly complex, but a useful thing to know how to do. Additionally, you will install a number of packages required for the workshop that this article was originally written for. We will also look at adding a Jupyter Notebook from a GitHub repository into the running container. Again, a useful trick – and a necessary step in our workshop.
You can easily add more notebooks to the container, by cloning them from GitHub straight into the container and subsequently opening them in Jupyter Notebook. For example – to grab the world’s most trivial notebook:
docker exec -it jupyter bash -c 'cd ~/work && git clone https://github.com/Noura/hello-jupyter'
After executing this command, this notebook can be opened in the Jupyter Notebook browser window from the folder `work/hello-jupyter`.
For more extensive manipulation of the Docker container, we can use a script that we copy into the container and then execute inside it – using docker cp (it was new to me that you can copy files from the Docker host into a running container – or vice versa – so easily) and docker exec (to execute a command inside the container).
The GitHub repo for this article has a folder prepareContainer that contains two scripts. You can run runPrep.sh to copy the script prepareContainer.sh into the container and execute it – to install some packages and git clone a few notebooks.
Make sure these two files are available to you – for example through:
git clone https://github.com/lucasjellema/DataAnalytics--IntroductionDataWrangling-JupyterNotebooks
Run the script:
The script prepareContainer.sh is copied into the container and made executable. Then it is executed: it installs various Python packages using pip and git clones two Jupyter Notebook repositories.
When the actions inside the container are done – note: this can take a few minutes – the container is restarted to have the Jupyter Notebook server pick up all changes. You may need a new token from the restarted server to log in to the Jupyter Notebook environment in the browser.
When you next enter the Jupyter Notebook environment in the browser, you will see a number of notebooks that were not there before.
For example: open and run pythonForDataAnalysis.ipynb in the work folder. Or open and run Example_word_clouds.ipynb. Or open the folder learn-pandas/lessons and start with Python_101.ipynb or Cookbook - Select.ipynb.
The folder work/Data-Analysis contains many notebooks created by Will Koehrsen, who writes many great articles about Data Science and uses Jupyter Notebooks frequently (see his GitHub Repo: https://github.com/WillKoehrsen/Data-Analysis ).
A nice advanced feature of Jupyter Notebooks is the interactive widgets. For a quick tour of what these widgets can add to a notebook, open work/widgets/Widgets-Overview.ipynb. The code cell under the Data heading contains an erroneous file reference – or it did when I last checked. Change the contents of the cell to:
df = pd.read_parquet('https://github.com/WillKoehrsen/Data-Analysis/blob/master/medium/data/medium_data_2019_01_26?raw=true')
Now the cell will correctly retrieve and process the data.
This article on Medium introduced the interactive widgets demonstrated in this notebook: Interactive Controls for Jupyter Notebooks.
Trick: How to install a Python package from a GitHub clone
I am by no means a Python expert – I am just a beginner. So when I found a nice Python package for creating word clouds in my notebook, I was not sure how to install it. I could clone the GitHub repo – but how then to install it? It turns out not to be too hard: the command
pip install -e .
will install a Python package from the current directory. The file setup.py in that directory is the essential ingredient: it tells pip what to install. The _config.yml file is not involved – that one configures the repository's GitHub Pages site.
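For reference, a minimal setup.py looks roughly like this – the package name and version here are placeholders, not the actual word cloud package's metadata:

```python
# setup.py -- minimal sketch; name and version are placeholders
from setuptools import setup, find_packages

setup(
    name="my-example-package",
    version="0.1.0",
    packages=find_packages(),
)
```

With a file like this in place, `pip install -e .` installs the package in ‘editable’ mode: changes you make to the cloned sources are picked up without reinstalling.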