Earlier this week, I presented a workshop on Data Analytics. I wanted to provide each of the participants with a fully prepared environment, right on everyone’s own laptop (and optionally in a cloud environment such as Katacoda). The environment consisted of Python 3.7, JupyterLab (for Notebooks), many additional Python libraries (Pandas, Plotly, Chart Studio, Matrix Profile, SAX, Fuzzy Search and many more) and a number of my own GitHub repositories containing the workshop sources. I wondered about the best way of getting everyone started with this environment.
My initial thought was to bake my own custom Docker image and push it to Docker.io. It might have worked, but I ended up struggling with Dockerfiles, uploading huge images that had to be rebuilt every time I changed even a small part of my environment, and spending far more time than I was prepared to invest.
Then I ran into several blog articles describing how libraries can be installed from Jupyter Notebooks (pip install from a cell in the notebook) and how other OS commands can also be run from a Notebook, including a command for git cloning a GitHub repository. At this point, I believed I could do better in terms of environment preparation. The approach I settled on:
- leverage the published Docker container image jupyter/scipy-notebook (with Jupyter Lab, ipython, pandas, git, ipywidgets and a bunch of other libraries; see the Jupyter Docker Stacks documentation for details on this image and others); run a container based on this image
- create a new notebook (the first bootstrap notebook) with one cell that clones a single GitHub repo; run the cell
- open the notebook downloaded from GitHub (the second bootstrap notebook) and run all cells (these will install the required Python libraries as well as clone the required GitHub repositories)
- at this point, everything is ready for workshop action
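To make the idea of the second bootstrap notebook more concrete, here is a minimal sketch of what such a setup cell could look like. It is not the actual content of my environment-setup.ipynb: the package list and the function names (`bootstrap_commands`, `run_bootstrap`) are assumptions made up for illustration. The sketch only builds and prints the commands by default, so it is safe to run anywhere.

```python
import subprocess
import sys

# Illustrative lists only; the real ones live in the workshop's
# environment-setup.ipynb and are not reproduced here.
PACKAGES = ["plotly", "chart-studio"]
REPOS = ["https://github.com/AMIS-Services/20190912-data-analytics-timeseries"]


def bootstrap_commands(packages, repos):
    """Return the setup commands as argument lists, without running them."""
    commands = []
    if packages:
        # sys.executable points at the interpreter behind the running kernel,
        # so pip installs into the environment the notebooks actually use.
        commands.append([sys.executable, "-m", "pip", "install", *packages])
    for url in repos:
        commands.append(["git", "clone", url])
    return commands


def run_bootstrap(packages, repos, dry_run=True):
    """Print every command; also execute it when dry_run is False."""
    for command in bootstrap_commands(packages, repos):
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


# Dry run: prints the commands without installing or cloning anything.
run_bootstrap(PACKAGES, REPOS)
```

Calling `run_bootstrap(PACKAGES, REPOS, dry_run=False)` in a notebook cell would perform the actual installation and cloning.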
Step by Step
The steps in a little more detail:
1. Run a Docker Container based on the jupyter/scipy-notebook image using the following command:
docker run --name timeseries-data-analytics -d -p 8888:8888 jupyter/scipy-notebook
This will pull a sizable image (about 1.5 GB, I believe) and start a container called timeseries-data-analytics in the background (detached). It publishes the container's port 8888 on port 8888 of the Docker host.
2. Check the logs of the container; they show the security token that we need for accessing the Jupyter Notebooks environment from the browser:
docker logs timeseries-data-analytics --follow
3. Open Jupyter Notebooks in the browser on port 8888 of your Docker host, at http://IP-Docker-Host:8888 (replace IP-Docker-Host with the IP of your Docker host), and provide the token from the logs when asked.
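Picking the token out of the log output by eye is easy to get wrong. As a minimal sketch, the helper below extracts it from log text and builds the URL to open. It assumes the logs contain a URL with a `?token=<hex string>` query parameter, which is what I saw; the function names and the sample log line are made up for illustration and are not real output.

```python
import re

# Assumption: the token appears in the logs as a hexadecimal query parameter.
TOKEN_PATTERN = re.compile(r"[?&]token=([0-9a-f]+)")


def extract_token(log_text):
    """Return the first token found in the log text, or None if there is none."""
    match = TOKEN_PATTERN.search(log_text)
    return match.group(1) if match else None


def notebook_url(host, token, port=8888):
    """Build the browser URL for the given Docker host and token."""
    return f"http://{host}:{port}/?token={token}"


# Illustrative log line only.
sample_log = "    http://127.0.0.1:8888/?token=0123abcd4567ef\n"
token = extract_token(sample_log)
print(notebook_url("IP-Docker-Host", token))
```

The log text itself can be captured by running the docker logs command without the follow option and saving its output.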
4. Create a new Jupyter Notebook.
5. Type the following command in the cell and execute the cell (CTRL+Enter):
!git clone https://github.com/AMIS-Services/20190912-data-analytics-timeseries
This command clones the repository into the Docker container. The repository contains the real bootstrap notebook, called environment-setup.ipynb.
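One thing to be aware of: running git clone a second time fails when the target directory already exists. A small sketch of a re-runnable variant is shown below; it picks between clone and pull depending on whether the repository is already present. The helper names are hypothetical, and the sketch only returns the command instead of executing it.

```python
from pathlib import Path


def repo_dir_name(url):
    """Directory name git clone uses by default: last URL segment minus '.git'."""
    name = url.rstrip("/").rsplit("/", 1)[-1]
    return name[:-4] if name.endswith(".git") else name


def clone_or_update_command(url, base_dir="."):
    """Return a pull command if the repo already exists locally, else a clone."""
    target = Path(base_dir) / repo_dir_name(url)
    if (target / ".git").is_dir():
        # Already cloned: fast-forward the existing working copy instead.
        return ["git", "-C", str(target), "pull", "--ff-only"]
    return ["git", "clone", url, str(target)]


url = "https://github.com/AMIS-Services/20190912-data-analytics-timeseries"
print(" ".join(clone_or_update_command(url)))
```

In a notebook cell, the returned list can be passed to `subprocess.run(..., check=True)` to actually execute it.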
6. Execute the bootstrap notebook. Open environment-setup.ipynb in Jupyter Notebook and execute all cells.
7. Done. Ready for action. All libraries have been installed and all Notebooks have been loaded from GitHub.
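To confirm that the installation step really worked, a final cell can check whether the expected libraries are importable. The sketch below is an assumption about how such a check could look, not part of my bootstrap notebook; the module list is illustrative, and the check is limited to top-level module names.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found by the kernel."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Illustrative list; adjust it to the libraries the workshop actually needs.
REQUIRED = ["pandas", "plotly"]

missing = missing_modules(REQUIRED)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

An empty result means every listed module can be imported from the notebook kernel.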
One environment where these instructions worked out very well for my students was Katacoda. Simply open the Docker playground environment https://www.katacoda.com/courses/docker/playground, start the scenario and walk your way through these steps. In a short time, you will have a prepared Jupyter Notebook environment, primed for the action in my workshop.