The Notebook Image
A SageMaker image is a file that identifies the kernels, language packages, and other dependencies required to run a Jupyter notebook in Amazon SageMaker Studio. These images are used to create an environment that you then run Jupyter notebooks from. Amazon SageMaker provides many built-in images for you to use. For the list of built-in images, see Available Amazon SageMaker Images.
If you need different functionality, you can bring your own custom images to Studio. You can create images and image versions, and attach image versions to your domain or shared space, using the SageMaker control panel, the AWS SDK for Python (Boto3), and the AWS Command Line Interface (AWS CLI). You can also create images and image versions using the SageMaker console, even if you haven't onboarded to a SageMaker domain. SageMaker provides sample Dockerfiles to use as a starting point for your custom SageMaker images in the SageMaker Studio Custom Image Samples repository.
The following topics explain how to bring your own image using the SageMaker console or AWS CLI, then launch the image in Studio. For a similar blog article, see Bringing your own R environment to Amazon SageMaker Studio. For notebooks that show how to bring your own image for use in training and inference, see Amazon SageMaker Studio Container Build CLI.
Image version: An image version of a SageMaker image represents a Docker image and is stored in an Amazon ECR repository. Each image version is immutable. These image versions can be attached to a domain or shared space and used with Studio.
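As a rough sketch of the SDK route mentioned above, the function below chains the Boto3 SageMaker calls create_image, create_image_version, create_app_image_config and update_domain. The function name and all argument values (image name, role ARN, ECR URI, domain ID, kernel name) are placeholders chosen for illustration, not values from this article. The client is passed in as an argument so the flow can be read, or dry-run with a stand-in object, without AWS credentials.

```python
def attach_custom_image(sm, *, image_name, role_arn, ecr_uri, domain_id, kernel_name):
    """Register an ECR-hosted image with SageMaker and attach it to a domain.

    `sm` is expected to behave like boto3.client("sagemaker").
    Returns the name of the app image config that was created.
    """
    # 1. Create the SageMaker image, which is a named container for versions.
    sm.create_image(ImageName=image_name, RoleArn=role_arn)

    # 2. Create an image version that points at the Docker image in Amazon ECR.
    sm.create_image_version(ImageName=image_name, BaseImage=ecr_uri)

    # 3. Describe which kernel Studio should start from the image.
    config_name = f"{image_name}-config"
    sm.create_app_image_config(
        AppImageConfigName=config_name,
        KernelGatewayImageConfig={
            "KernelSpecs": [{"Name": kernel_name, "DisplayName": kernel_name}],
        },
    )

    # 4. Attach the image to the domain so that users can select it in Studio.
    sm.update_domain(
        DomainId=domain_id,
        DefaultUserSettings={
            "KernelGatewayAppSettings": {
                "CustomImages": [
                    {"ImageName": image_name, "AppImageConfigName": config_name}
                ]
            }
        },
    )
    return config_name


# Usage (requires boto3 and AWS credentials; every value is a placeholder):
#   import boto3
#   attach_custom_image(
#       boto3.client("sagemaker"),
#       image_name="custom-r",
#       role_arn="arn:aws:iam::111122223333:role/ExampleRole",
#       ecr_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/custom-r:latest",
#       domain_id="d-example",
#       kernel_name="ir",
#   )
```

Image versions are created asynchronously, so a real script would probably wait for the version to reach the CREATED status (for example by polling describe_image_version) before attaching it in step 4.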
A code cell can also be used to embed images. To display an image, create an Image object from the IPython.display module and pass it to the display() function. You can also specify the width and height of the image.
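A minimal sketch of such a code cell follows. It assumes IPython is installed, which is the case inside a Jupyter kernel. To stay self-contained it builds a tiny single-colour PNG in memory with a helper, solid_png, that is purely illustrative; with a real file you would pass filename= instead of data=.

```python
import struct
import zlib

from IPython.display import Image, display


def _chunk(kind, payload):
    """Build one PNG chunk: length, type, payload, CRC."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)


def solid_png(width, height, rgb):
    """Return the bytes of a single-colour 8-bit RGB PNG (stand-in for a real file)."""
    header = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    row = b"\x00" + bytes(rgb) * width  # filter byte, then the pixels of one row
    return (
        b"\x89PNG\r\n\x1a\n"
        + _chunk(b"IHDR", header)
        + _chunk(b"IDAT", zlib.compress(row * height))
        + _chunk(b"IEND", b"")
    )


# width and height set the displayed size in pixels, not the size of the data.
img = Image(data=solid_png(4, 4, (30, 144, 255)), width=320, height=240)
display(img)
```

In a notebook the image is rendered below the cell; outside a notebook, display() falls back to printing a text representation of the object.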
In the prior post in this series I described the steps required to run the Jupyter Notebook images supplied by the Jupyter Project developers. When run, these notebook images provide an empty workspace with no initial notebooks to work with. Depending on the image used, they would include a range of pre-installed Python packages, but they may not have all packages installed that a user needs.
Although it was possible to attach a persistent volume so that the notebooks and data files, along with any of the changes made, were retained across a restart of the container, any additional Python packages would need to be reinstalled each time. This was necessary as the Python packages are installed into directories outside of the persistent volume.
To combat this, a user could create a custom Docker-formatted image themselves, which builds on the base image, but this means that they have to know how to create a Docker-formatted image and have the tools available to do it.
When using OpenShift, an alternative that can make life much easier for users is to Source-to-Image (S2I) enable the base images that users would commonly extend. This is something OpenShift does for common programming languages such as Java, Node.js, PHP, Python and Ruby.
S2I builder images work by being run against a copy of a designated Git repository containing a user's files. An assemble script within the image does whatever is required to process those files and create a runnable image for an application. When that image is then started, a run script in the image starts up the application.
Using S2I, a user doesn't need to know what steps are necessary to convert their files into a runnable application. Instead all the smarts are embodied in the assemble and run scripts of the S2I builder image.
Although the S2I builder images are typically used to take application source code and create a runnable image for a web application, they can also be used to take any sort of input files and combine them with an existing application.
In our case, we can use an S2I enabled image to perform two tasks for us. The first is to install any Python packages required by a set of notebook files; the second is to copy the notebook files and data files into the image. This makes it very easy for a user to create a custom image containing everything they need. When the Jupyter Notebook instance is started up, everything will be pre-populated and they can start working straight away.
This sort of system is especially useful in a teaching environment as a way of distributing course material. This is because you know that students are going to have the correct versions of software and Python packages that are required by the notebooks.
To create an S2I enabled version of the Jupyter Notebook images for users, we are going to create a custom Docker-formatted image. To do this we start out with a Dockerfile. This will include a number of customisations, so we will go through them one at a time to understand what they do.
First up we need to indicate what base image we are building on top of. We are going to use the jupyter/minimal-notebook image. We start out with this image rather than the scipy-notebook image, as we will rely on the S2I build process to add additional packages that are needed, rather than bundling them all into the base image to begin with. This keeps the final image as small as possible, so it isn't bloated by Python packages which are installed but never used.
Next we install the operating system packages we need. In this case we install the rsync package so that the OpenShift oc rsync command can be run to copy files from a local system into a running container if necessary. The libav-tools package is also installed. This is a system package which gets installed as part of the scipy-notebook image. Because we want to allow all the same Python packages that the scipy-notebook image has pre-installed to be installed using the S2I enabled version of the minimal-notebook image, we install it here.
We then add a set of labels to the image. These labels serve a number of purposes. This includes providing a name and description which can be displayed by OpenShift in the web console and on the command line when using the S2I enabled image. We also let OpenShift know which ports the running image will expose and, most importantly, where the assemble and run scripts are installed in the image.
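To make the labels concrete, here is a small Python sketch that renders a Dockerfile LABEL instruction from a dictionary. The label keys are the standard ones that OpenShift and S2I read. The values are assumptions for illustration: the display name, description and scripts directory are hypothetical, and port 8888 is simply the Jupyter Notebook default rather than something taken from the original Dockerfile.

```python
# Label keys understood by OpenShift and S2I; all values are illustrative.
S2I_LABELS = {
    "io.k8s.display-name": "Jupyter Notebook (minimal)",
    "io.k8s.description": "S2I builder for Jupyter notebooks.",
    "io.openshift.expose-services": "8888:http",
    # Hypothetical location of the assemble and run scripts inside the image.
    "io.openshift.s2i.scripts-url": "image:///opt/app-root/s2i/bin",
}


def render_label_instruction(labels):
    """Render a mapping of labels as a single multi-line Dockerfile LABEL."""
    pairs = []
    for key, value in labels.items():
        # Escape backslashes and double quotes so the value stays one token.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        pairs.append(f'{key}="{escaped}"')
    return "LABEL " + " \\\n      ".join(pairs)


print(render_label_instruction(S2I_LABELS))
```

The io.openshift.s2i.scripts-url label is the one the S2I build process consults to find the assemble and run scripts, which is why it matters most.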
We also set the default command for the image to the run script. This isn't strictly needed, as when the S2I build process is run to generate an image, it will again explicitly set the command to be the run script. We add it here so that if the image is itself run as an application, rather than as a builder image, it will still start up using our run script, rather than the original command set up in the minimal-notebook image. This enables us to still perform additional steps in this case before the Jupyter Notebook is started.
When the assemble script is run, the contents of the Git repository that the S2I builder image is run against will already have been copied into the /tmp/src directory. For our S2I builder this would comprise the notebooks and data files, along with a Python requirements.txt file listing the Python packages that need to be installed.
We copy the files rather than move them into place so they will be merged with anything already present in the directory. As we copy them, we remove the original contents of /tmp/src so we do not have duplicates of the files wasting space in the final image.
If there is a requirements.txt file, we now install any Python packages listed in it. As the Jupyter Project images use the Anaconda Python distribution rather than that from the Python Software Foundation, we use the conda package manager rather than pip. This will result in any packages being installed from the conda package index rather than PyPI.
At the end we remove the requirements.txt file. This is so that it will not interfere with anything if the resultant image is in turn used as an S2I builder. In other words, with S2I builders one can create layered builds just like with normal Docker builds. The requirements.txt file is removed so that such a subsequent build doesn't try to reinstall all the packages if no requirements.txt file was provided in the Git repository for the subsequent build.
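The assemble steps described above can be sketched in Python as follows. This is an illustrative rendering of the logic, not the script from the image itself. The function name and the injectable install parameter are my own, and the destination directory in the usage comment is inferred from the mount point mentioned later in this post.

```python
import shutil
import subprocess
from pathlib import Path


def assemble(src, dest, install=subprocess.check_call):
    """Merge the uploaded files into `dest` and install listed packages."""
    src, dest = Path(src), Path(dest)

    # Copy rather than move, so the files merge with anything already in dest.
    shutil.copytree(src, dest, dirs_exist_ok=True)

    # Remove the originals so the final image does not hold two copies.
    shutil.rmtree(src)

    requirements = dest / "requirements.txt"
    if requirements.exists():
        # conda reads one package specification per line from the file.
        install(["conda", "install", "--yes", "--file", str(requirements)])

        # Drop the file so a later layered build does not reinstall from it.
        requirements.unlink()


# Usage inside the builder image (destination path is an assumption):
#   assemble("/tmp/src", "/home/jovyan/work")
```

Passing the installer in as a parameter is only a convenience for trying the flow out without conda; the dirs_exist_ok argument needs Python 3.8 or later.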
The original base image used a script located at /usr/local/bin/start-notebook.sh. The run script wraps the execution of this script to make it easier to set a password for the Jupyter Notebook via an environment variable passed in by OpenShift.
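A minimal sketch of such a wrapper, written in Python for illustration, might look like the following. The environment variable name JUPYTER_NOTEBOOK_PASSWORD is hypothetical, and the sketch assumes both that start-notebook.sh forwards extra arguments to the notebook server and that the classic Notebook package (which provides notebook.auth.passwd and the --NotebookApp.password option) is what the image ships.

```python
import os

# Start-up script provided by the base image, as described above.
START_SCRIPT = "/usr/local/bin/start-notebook.sh"


def build_command(environ, hash_password):
    """Return the argv for starting the notebook, with a password if one is set."""
    command = [START_SCRIPT]
    password = environ.get("JUPYTER_NOTEBOOK_PASSWORD")  # hypothetical name
    if password:
        # The server expects a hashed password, never the plain text.
        command.append("--NotebookApp.password=" + hash_password(password))
    return command


def main():
    """Entry point for the run script; call this from the script itself."""
    # Assumption: the classic Notebook package is installed in the image.
    from notebook.auth import passwd

    command = build_command(os.environ, passwd)
    # Replace the current process, as `exec` would in a shell script.
    os.execv(command[0], command)
```

Keeping the command construction separate from main() means the password handling can be checked without starting a server.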
Do note that, as described in the previous post, the Jupyter Project images will not work when run with an assigned user ID. The changes we have made do not change that. As such, you still need to allow the service account that the images are run as to run images as any UID, by having a system administrator run:
In the previous post it was demonstrated how a persistent volume could be used in conjunction with a Jupyter Notebook image. To do this a persistent volume claim was made and the volume mounted at /home/jovyan/work inside the container. If we do that with the result of running the S2I enabled image, the volume will be mounted on top of the directory containing the notebooks which were copied into the image.
To still be able to use a persistent volume, so that we can pre-populate an image with notebooks and data files and then work on them without losing any work, a different strategy is needed. In the next post I will explain how we can enhance the S2I builder to be aware of a persistent volume, copy any notebooks and data files into the persistent volume the first time the image is run, and then use the copies in the volume from then on.
Amazon SageMaker Studio Lab provides no-cost access to a machine learning (ML) development environment to everyone with an email address. Like the fully featured Amazon SageMaker Studio, Studio Lab allows you to customize your own Conda environment and create CPU- and GPU-scalable JupyterLab version 3 notebooks, with easy access to the latest data science productivity tools and open-source libraries. Moreover, Studio Lab free accounts include a minimum of 15 GB of persistent storage, enabling you to continuously maintain and expand your projects across multiple sessions, instantly pick up where you left off, and even share your ongoing work and work environments with others.