Reproducible computing environments using Conda

Mar 9, 2020

Allard Hendriksen

While doing research, it is often necessary to install your research environment on various computers. A good first step is to use Conda to manage your Python environment.

Even with Conda, exactly the same installation instructions can yield different computing environments. In our group, for instance, some installation instructions stopped working when a new version of ASTRA-toolbox was released with support for a newer version of CUDA. This version of CUDA was not yet installed on all host machines, and new installations of ASTRA-toolbox defaulted to the newest version of cudatoolkit. Therefore, the exact same installation instructions could fail on one computer and succeed on another. These installation instructions were not reproducible, since they did not result in the same installed environment in all circumstances.

Reproducible installation instructions should yield the same installed computing environment even when the surrounding ecosystem changes. These changes include the publication of new versions of a package, or differences in the host computer (where you install the environment).

Conda can make it really easy to create reproducible installation instructions. This does require a few extra steps and best practices though.

The first step is to add an environment.yml file to your research repository. This file contains the name of your conda environment and indicates which packages you want to have installed from which channels. Below, I show an example of such an environment.yml file:

name: my-research-environment-name
channels:
  # The order of the channels is important! It indicates the relative
  # priority of one channel over the other. When a package is
  # available in multiple channels, the package from the first channel
  # in this list is picked. I prefer to use defaults before installing
  # something from conda-forge.
  - defaults                    # Default conda channel
  - conda-forge
  - astra-toolbox/label/dev     # You can indicate a channel label in the usual way
dependencies:
  - astra-toolbox=1.9.0.dev11
  - libastra=1.9.0.dev11
  - cone_balls=0.3.1
  - cudatoolkit=10.0
  - pip:
    - snakemake                   # Any package from pypi can be listed here.
    # You can also install from a git repository like this:
    - git+https://github.com/ahendriksen/sacred_utils
    # And from a specific branch like this (the 'develop' branch in this
    # case)
    - git+https://github.com/ahendriksen/tomosipo.git@develop

To install this environment, run this command from the root directory of your repository:

conda env create -f environment.yml

If the environment already exists, conda will signal an error. You can force the creation of the environment by using:

conda env create --force -f environment.yml
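
Recreating the environment from scratch is not the only option. As a rough sketch, the hypothetical helper below (the name sync_env is my own, not part of conda) creates the environment when it is missing and otherwise updates it in place with conda env update. It assumes that the environment name is given on the name: line of the file and that the environment lives in conda's default environment directory.

```shell
# Hypothetical helper: create the environment if it is missing,
# otherwise update it in place. Not part of conda itself.
sync_env() {
    env_file="${1:-environment.yml}"
    # Read the environment name from the first `name:` line of the file.
    env_name="$(sed -n 's/^name:[[:space:]]*//p' "$env_file" | head -n 1)"
    # The first column of `conda env list` holds the environment names.
    if conda env list | awk '{print $1}' | grep -xF "$env_name" > /dev/null; then
        # --prune asks conda to remove packages no longer listed in the file.
        conda env update -f "$env_file" --prune
    else
        conda env create -f "$env_file"
    fi
}
```

Calling sync_env environment.yml from the root of the repository would then cover both the first installation and later edits of the file.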

Using an environment.yml file does not solve the reproducibility problem yet, but it does bring us closer to the solution. In the initial stages of your research, you will probably need to edit the environment.yml file frequently to install additional packages. Once you reach a more stable phase, you start running the experiments for your paper and do not want anything package-related to break your system. At that point, it is advisable to create a lock file to really lock your environment down.

Locking your dependencies is inspired by the approach taken to manage dependencies in the Rust programming ecosystem. There, you specify in broad terms which packages you want to install in one file (Cargo.toml), and the exact resolved dependencies are recorded in another file, the lock file (Cargo.lock).

To create a lock file using conda, execute the following command:

conda env export -n my-research-environment-name --file environment_lock.yml

This command writes the exact package specifications of the current environment to the environment_lock.yml file. This includes packages that you have not explicitly installed, but that were installed as dependencies of other packages. I have included a shortened example below:

name: my-research-environment-name
channels:
  - defaults
  - conda-forge
  - astra-toolbox/label/dev
dependencies:
  - astra-toolbox=1.9.0.dev11=np115py36_0
  - cudatoolkit=10.0.130=0
  - libastra=1.9.0.dev11=h28bbb66_0
  - python=3.6.9=h265db76_0
  - pip:
    - snakemake==5.7.4
    - git+https://github.com/ahendriksen/sacred_utils
    - git+https://github.com/ahendriksen/tomosipo.git@develop
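
The full output of conda env export also ends with a prefix: line that records where the environment lives on the exporting machine. Since that path is specific to one computer, it can be left out of a lock file that is shared through the repository. The following is a minimal sketch of how this could be done; the function name write_lock_file is hypothetical and not part of conda.

```shell
# Hypothetical helper: export the environment to a lock file while
# dropping the machine-specific `prefix:` line.
write_lock_file() {
    env_name="$1"
    lock_file="${2:-environment_lock.yml}"
    conda env export -n "$env_name" | grep -v '^prefix:' > "$lock_file"
}
```

For the environment in this post, the call would be write_lock_file my-research-environment-name.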

To install your environment from the environment_lock.yml file, execute

conda env create --force -f environment_lock.yml

This restores the exact environment from the lock file. Note that this command does overwrite the current environment, if it exists.
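
To check that an installed environment still agrees with the lock file, one option is to export it again and compare the result with the file in the repository. The sketch below does this with diff; the function name env_matches_lock is hypothetical. It ignores the machine-specific prefix: line on both sides, and it assumes the lock file is an unedited export, so lines that were adjusted by hand (for example pip entries that point at a git repository) may show up as differences.

```shell
# Hypothetical helper: succeed only if the installed environment matches
# the lock file, ignoring the machine-specific `prefix:` line.
env_matches_lock() {
    env_name="$1"
    lock_file="${2:-environment_lock.yml}"
    current="$(mktemp)"
    conda env export -n "$env_name" | grep -v '^prefix:' > "$current"
    # In the diff output, '<' lines come from the installed environment
    # and '>' lines come from the lock file.
    if grep -v '^prefix:' "$lock_file" | diff "$current" -; then
        status=0
    else
        status=1
    fi
    rm -f "$current"
    return "$status"
}
```

Running env_matches_lock my-research-environment-name before starting a long experiment prints any packages that have drifted and returns a non-zero exit status in that case.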

By maintaining both an environment.yml and an environment_lock.yml file, you tackle two problems. First, the environment.yml file keeps track of the dependencies of your project in broad strokes. This prevents you from forgetting which packages are required to run your project. Second, the environment_lock.yml file makes sure that your computing environment is reproducible. This way, you will not be unpleasantly surprised by surreptitious updates of packages you depend on. I hope these best practices can help improve your research!

Thanks to Francien Bossema and Richard Schoonhoven for reading drafts of this blog post.

Allard Hendriksen | CI @ CWI