While doing research, it is often necessary to install your research environment on various computers. A good first step is to use Conda to manage your Python environment.
Even with Conda, it can happen that exactly the same installation
instructions yield different computing environments. In our group, for
instance, some installation instructions stopped working when a new
version of ASTRA-toolbox was released with support for a newer version
of CUDA. This version of CUDA was not yet installed on all host machines,
and new installations of ASTRA-toolbox defaulted to the newest version
of cudatoolkit. Therefore, the exact same installation instructions
could fail on one computer and succeed on another. These installation
instructions were not reproducible, since they did not result in the
same installed environment in all circumstances.
Reproducible installation instructions should yield the same installed computing environment even when the surrounding ecosystem changes. These changes include the publication of new versions of a package, or differences in the host computer (where you install the environment).
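To make the failure mode concrete, here is a toy sketch in Python. It is not how conda's solver works; the resolve function and the version lists are made up for illustration. It only shows why an unpinned package name can resolve to something different once a new release is published, while a pinned one keeps resolving to the same version.

```python
# Toy model of version resolution. This is NOT conda's actual solver;
# the function names and the version lists below are hypothetical.

def parse_version(version):
    """Turn '10.0' into (10, 0) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def resolve(spec, available):
    """Pick a version for `spec` from the versions published so far.

    A spec 'name' (no pin) takes the newest version. A spec 'name=X.Y'
    takes the newest version equal to X.Y or starting with 'X.Y.'.
    """
    name, _, pin = spec.partition("=")
    candidates = [
        v for v in available[name]
        if not pin or v == pin or v.startswith(pin + ".")
    ]
    if not candidates:
        raise LookupError(f"no version of {name} matches {spec!r}")
    return max(candidates, key=parse_version)


# Made-up channel contents before and after a new release.
before = {"cudatoolkit": ["9.2", "10.0"]}
after = {"cudatoolkit": ["9.2", "10.0", "10.1"]}

print(resolve("cudatoolkit", before))       # 10.0
print(resolve("cudatoolkit", after))        # 10.1 (same instructions, new result)
print(resolve("cudatoolkit=10.0", before))  # 10.0
print(resolve("cudatoolkit=10.0", after))   # 10.0 (the pin survives the release)
```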
Conda can make it really easy to create reproducible installation instructions. This does require a few extra steps and best practices though.
The first step is to add an environment.yml file to your research
repository. This file contains the name of your conda environment and
indicates which packages you want to have installed from which
channels. Below, I show an example of such an environment.yml file:
name: my-research-environment-name
channels:
  # The order of the channels is important! It indicates the relative
  # priority of one channel over the other. When a package is
  # available in multiple channels, the package from the first channel
  # in this list is picked. I prefer to use defaults before installing
  # something from conda-forge.
  - defaults  # Default conda channel
  - conda-forge
  - astra-toolbox/label/dev  # You can indicate a channel label in the usual way
dependencies:
  - astra-toolbox=1.9.0.dev11
  - libastra=1.9.0.dev11
  - cone_balls=0.3.1
  - cudatoolkit=10.0
  - pip:
    - snakemake  # Any package from pypi can be listed here.
    # You can also install from a git repository like this:
    - git+https://github.com/ahendriksen/sacred_utils
    # And from a specific branch like this (the 'develop' branch in this
    # case)
    - git+https://github.com/ahendriksen/tomosipo.git@develop
To install this environment, run this command from the root directory of your repository:
conda env create -f environment.yml
If the environment already exists, conda will signal an error. You can
force the creation of the environment, replacing the existing one, by
using:
conda env create --force -f environment.yml
Using an environment.yml file does not solve the reproducibility
problem yet, but it does bring us closer to the solution. In the initial
stages of your research, you will probably need to edit the
environment.yml file frequently to install additional packages. Once you
reach a more stable phase, you start running the experiments for your
paper and do not want anything package-related to break your
system. At that point, it is advisable to create a lock file to really
lock your environment down.
Locking your dependencies is inspired by the approach taken to manage dependencies in the Rust programming ecosystem. There, you specify in broad terms which packages you want in one file (Cargo.toml), and the exact versions that were resolved are recorded in another file, the lock file (Cargo.lock).
To create a lock file using conda, execute the following command:
conda env export -n my-research-environment-name --file environment_lock.yml
This command writes the exact package specifications of the current
environment to the environment_lock.yml file. This includes packages
that you have not explicitly installed, but that were installed as a
dependency of another package. I have included a shortened example
below:
name: my-research-environment-name
channels:
  - defaults
  - conda-forge
  - astra-toolbox/label/dev
dependencies:
  - astra-toolbox=1.9.0.dev11=np115py36_0
  - cudatoolkit=10.0.130=0
  - libastra=1.9.0.dev11=h28bbb66_0
  - python=3.6.9=h265db76_0
  - pip:
    - snakemake==5.7.4
    - git+https://github.com/ahendriksen/sacred_utils
    - git+https://github.com/ahendriksen/tomosipo.git@develop
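If you want to double-check a lock file before committing it, something like the following Python sketch could help. It is not part of conda: the helper names (conda_specs, is_fully_pinned, check_lock) are my own, and the parser only understands the flat one-item-per-line layout shown above, not YAML in general. It reports conda entries in the lock file that lack the name=version=build form, and packages from environment.yml that do not appear in the lock file at all. Pip packages are ignored.

```python
import re


def conda_specs(yaml_text):
    """Return the conda (non-pip) entries listed under 'dependencies:'.

    Assumes one '- spec' item per line, as in the files shown above.
    This is a line-based sketch, not a general YAML parser.
    """
    specs, in_deps, pip_indent = [], False, None
    for raw in yaml_text.splitlines():
        line = raw.split(" #", 1)[0].rstrip()  # drop trailing comments
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        indent = len(line) - len(line.lstrip())
        if not stripped.startswith("-"):
            # A top-level key such as 'channels:' or 'dependencies:'.
            in_deps = stripped == "dependencies:"
            pip_indent = None
            continue
        if not in_deps:
            continue
        item = stripped[1:].strip()
        if item == "pip:":
            pip_indent = indent
        elif pip_indent is not None and indent > pip_indent:
            continue  # nested under 'pip:'; not checked in this sketch
        else:
            pip_indent = None
            specs.append(item)
    return specs


def package_name(spec):
    """Extract the package name from a spec such as 'python=3.6.9'."""
    return re.match(r"[A-Za-z0-9_.-]+", spec).group(0).lower()


def is_fully_pinned(spec):
    """True for 'name=version=build', the form a lock entry should have."""
    pattern = r"[A-Za-z0-9_.-]+=[^=<>!\s]+=[^=<>!\s]+"
    return re.fullmatch(pattern, spec) is not None


def check_lock(env_text, lock_text):
    """Return (unpinned lock entries, environment.yml specs not in the lock)."""
    lock = conda_specs(lock_text)
    unpinned = [s for s in lock if not is_fully_pinned(s)]
    locked_names = {package_name(s) for s in lock}
    missing = [
        s for s in conda_specs(env_text)
        if package_name(s) not in locked_names
    ]
    return unpinned, missing


# Abbreviated copies of the two example files from this post.
ENV = """\
name: my-research-environment-name
channels:
  - defaults
dependencies:
  - astra-toolbox=1.9.0.dev11
  - cone_balls=0.3.1
  - cudatoolkit=10.0
  - pip:
    - snakemake
"""
LOCK = """\
name: my-research-environment-name
dependencies:
  - astra-toolbox=1.9.0.dev11=np115py36_0
  - cudatoolkit=10.0.130=0
  - python=3.6.9=h265db76_0
  - pip:
    - snakemake==5.7.4
"""

print(check_lock(ENV, LOCK))  # ([], ['cone_balls=0.3.1'])
```

Here cone_balls is reported only because the lock file example is shortened. On real files, you would pass in the contents of environment.yml and environment_lock.yml and expect both lists to be empty.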
To install your environment from the environment_lock.yml file,
execute:
conda env create --force -f environment_lock.yml
This restores the exact environment from the lock file. Note that this command does overwrite the current environment, if it exists.
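To confirm that an installed environment really matches the lock file, you could compare the two. The sketch below is one possible way to do that, not an official conda feature. The functions installed_specs and drift are hypothetical helpers; the only conda call used is conda list -n NAME --export, which prints one name=version=build line per installed package. The "installed" values in the demo are invented.

```python
import subprocess


def installed_specs(env_name):
    """Ask conda for the installed packages as 'name=version=build' lines."""
    result = subprocess.run(
        ["conda", "list", "-n", env_name, "--export"],
        check=True, capture_output=True, text=True,
    )
    return [
        line.strip()
        for line in result.stdout.splitlines()
        if line.strip() and not line.startswith("#")  # skip header comments
    ]


def drift(locked, installed):
    """Map package name -> (locked spec, installed spec) where they differ.

    Only packages named in the lock file are checked. A package that is
    locked but not installed shows up with None as its installed spec.
    """
    def by_name(specs):
        return {spec.split("=", 1)[0]: spec for spec in specs}

    lock, inst = by_name(locked), by_name(installed)
    return {
        name: (lock[name], inst.get(name))
        for name in sorted(lock)
        if lock[name] != inst.get(name)
    }


# Conda entries taken from the lock file example above.
locked = ["cudatoolkit=10.0.130=0", "python=3.6.9=h265db76_0"]
# Invented state of a machine whose cudatoolkit has drifted.
installed = ["cudatoolkit=10.1.243=h6bb024c_0", "python=3.6.9=h265db76_0"]

print(drift(locked, installed))
# {'cudatoolkit': ('cudatoolkit=10.0.130=0', 'cudatoolkit=10.1.243=h6bb024c_0')}
```

On a real machine, you would replace the invented list with installed_specs("my-research-environment-name"). An empty result means every locked conda package is installed with exactly the locked version and build.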
By maintaining an environment.yml and environment_lock.yml file,
you tackle two problems. First of all, the environment.yml file
keeps track of the dependencies of your project in broad strokes. This
prevents you from forgetting what packages are required to run your
project. Secondly, the environment_lock.yml file makes sure that
your computing environment is reproducible. This way, you will not be
unpleasantly surprised by surreptitious updates of packages you depend
on. I hope these best practices can help improve your research!
Thanks to Francien Bossema and Richard Schoonhoven for reading drafts of this blog post.