Conda¶
Conda helps install and manage software packages in a separate environment that is isolated from the rest of the system. You install conda once, and from there you use conda to install lots of other software, without needing root access. Typically, you will create an environment for each project. That environment has everything in it that you need for that project (for example, maybe Python and R and various packages for those languages). Importantly, that version of R and that version of Python are completely independent of any other environments you might have.
Another major advantage is that you can generate a list of packages and send that list to someone else, allowing them to install the same exact packages on their machine. Since different versions of tools often give different results, this aspect is very important for reproducibility.
Note
For NIH users, see the NIH HPC Python documentation for specific points about installing and using conda on Biowulf. One of the main points is that on Biowulf, each person should have their own miniconda installation in their /data/$USER directory. Don't use your home directory, because the 16 GB quota can actually be too small after installing a lot of packages.
First, install conda. See the note above if you are working on NIH’s Biowulf cluster.
Then set up the channels like this, which follows the bioconda docs:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
Optional: Much of the bioinformatics community is moving to mamba, a faster drop-in replacement for conda. So typically the next step is to install mamba into your base environment. You only have to do this once:
conda install -n base mamba
From now on, instead of conda install you can use mamba install to make it go faster. Instead of conda create, use mamba create. And so on. There are also some nice troubleshooting tools that come with mamba that can come in handy.
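For example, the commands behave the same way once mamba is installed (the package and environment names below are only illustrative):
mamba install samtools        # instead of: conda install samtools
mamba create -n proj python   # instead of: conda create -n proj python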
Warning
If you have a newer Mac with an M1 (ARM64) chip, note that Bioconda does not yet support that architecture. All packages need to be rebuilt for ARM64, which is a rather large task. There are plans to do this, but the packages are not available yet.
What is an environment?¶
The purpose of conda is to create environments. We create environments and then activate them to use them.
What is an environment?
Briefly: an environment is a directory containing executables and supporting files for those executables, and an environment is activated when its directory of executables has been prepended to the $PATH.
What does that mean?
First, let’s understand the $PATH variable. For example, if you’re on a Mac or Linux machine, you have access to the command ls that lists the contents of directories. ls is an executable program. When you type ls on the command line, the command-line interpreter needs to figure out what you mean. It will look through its list of possible locations to try to find an executable called ls. The first one it finds, it runs.
How does it know where to look?
By convention, the shell uses the $PATH variable. This is a variable that, again by convention, is separated by colon (:) characters. Here is a typical PATH from a fresh Linux machine, which we can see by running echo $PATH:
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
So with this PATH, typing ls (and then Enter) means that the shell will first look in /usr/local/sbin for the executable program ls. If the file /usr/local/sbin/ls does not exist, then it will next check for /usr/local/bin/ls. If that doesn’t work, it keeps going. On this machine, eventually /bin/ls is found (it happens to be the last place it looked) and it is that ls that runs.
If ls can’t be found at any of those locations, we’ll get an ls: command not found error.
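As a quick sketch of how to watch this lookup happen yourself (the exact paths will differ from machine to machine), you can ask the shell where it would find ls:
type -a ls   # every ls the shell can see, in PATH order
which ls     # just the first match, e.g. /bin/ls
echo $PATH   # the colon-separated list of directories that gets searched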
We can edit our PATH variable to add new locations. This is how we “install” programs on Linux. For example, imagine we made a new, amazing version of ls that we wanted to be called any time we typed ls on the command line. We don’t have root access to this machine, so we can’t put our new ls in any of the paths in our PATH variable (they are all system-wide and owned by root). Instead, we modify the PATH variable. Say we were keeping the new ls in our home directory, at ~/tools/ls. Then we would modify PATH like this:
export PATH="~/tools:$PATH"
Here is an annotated version of that:
export PATH="~/tools:$PATH"
^      ^     ^       ^
|      |     |       |
|      |     |       |__ expand out the existing contents of PATH
|      |     |           and insert the results here
|      |     |
|      |     |__ Add the directory with the program I made
|      |         to the beginning of the new PATH
|      |
|      |__ Overwrite the existing PATH
|
|__ Make the new PATH available to child processes,
    not just this one
After running that command, we check the new value of $PATH:
echo $PATH
~/tools:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# ^^^^^
# the directory has been prepended
If we wanted that to be permanent, so that we had it every time we started a new shell, we would put that export line in the ~/.bashrc file, which is executed every time bash starts up.
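As a minimal sketch of making this permanent (assuming the tools live in $HOME/tools; adjust to wherever you keep them):
# Append the export line to ~/.bashrc so every new bash shell picks it up
echo 'export PATH="$HOME/tools:$PATH"' >> ~/.bashrc

# Re-read ~/.bashrc in the current shell so it takes effect immediately
source ~/.bashrc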
So now hopefully the following statement makes more sense: “an activated environment is a directory containing executables that has been prepended to the $PATH.”
Activating and deactivating environments¶
Activating an environment with conda activate will add the environment’s directory containing executables to the $PATH. This is the bin directory of the environment. For example, if we create a simple environment with just Python:
mamba create -p ./env python
and look inside it with ls env/, we see this:
├── bin
├── conda-meta
├── include
├── lib
├── man
├── share
└── ssl
If we look inside the bin directory, we’ll see lots of files. One of them is python. If we activate the environment with conda activate ./env and check the path, the bin directory of the environment has been prepended to the path. So if we use which python, it should point to the Python installation in that directory, because the shell found python at the first place it looked: the bin directory of the environment. As long as this environment is activated, any time we call python it will use that Python.
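As a quick check you can run yourself (the exact directory reported by which python depends on where the environment was created):
conda activate ./env
echo $PATH         # the environment's bin directory now comes first
which python       # should point into ./env/bin/
python --version   # runs the Python installed in this environment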
If we wanted to, we could avoid using conda activate and just manually add things to the path. Or we could explicitly call that Python with ./env/bin/python. But conda activate ends up being more convenient.
Deactivating an environment with conda deactivate removes the path from the $PATH. So in this case, after deactivating the environment, calling python will find a different installation of Python. Typically it will find the Python in the base environment, or, after running conda deactivate again to deactivate the base environment, a location like /usr/bin/python (in the case of macOS).
Difference between a named environment and a path environment¶
If we create a new environment like this:
conda create -n proj python
then it will create the environment directory wherever we have installed our copy of conda. Others might not have access to that directory. We need to remember the name of the environment, or otherwise run conda env list and study the list to remember which one we should use. We would activate it like this:
conda activate proj
If we instead create a new environment like this, say, after changing to our project directory:
conda create -p ./env python
then it will create the environment in a directory called env in the current directory, and we would instead activate it like this:
conda activate ./env
The ./ is important. We can alternatively use env/. The point is that conda needs to see that / indicating that it’s a directory, not an environment name, that should be activated. If we used conda activate env then it would look for an environment named env, which we might not have created.
A path environment is very helpful when working in a shared directory. Anyone with access to the directory can activate the environment and use the exact same set of packages as anyone else. This makes it easier for someone else to jump in and help troubleshoot immediately, rather than having to worry about matching dependencies and doing lots of installation work before they can even start to reproduce the thing they’re trying to troubleshoot. If we maintain a consistent naming convention, then it’s very clear which environment should be used for the project.
Conventions for project-specific environments¶
In BSPC, we have the convention that each project directory should have at least an env directory, at the top level of the project, containing the conda environment to be used by that project.
Some projects may have a separate env-r directory, or may have multiple environments, either for historical reasons (like keeping a copy of an env from a previous version of the analysis) or for logistical reasons (like splitting R and non-R packages into separate envs to save time). But in general, having an obvious environment directory name makes it easy for others to find.
Creating an environment¶
There are three ways to specify what should go into an environment:
1. Directly on the command line. Not advisable, because it's harder to track what's in there.
2. A plain text file, one package per line (by convention called requirements.txt).
3. An environment file in YAML format (by convention called env.yml).
Directly on the command line:
mamba create -p ./env python
Using a plain text file called requirements.txt with the following contents (one line per requirement):
python
would be:
mamba create -p ./env --file requirements.txt
Using an environment file in YAML format called env.yml with the following contents:
channels:
- conda-forge
- bioconda
dependencies:
- python
would be:
mamba env create -p ./env --file env.yml
That is, use create for a text file, and env create for a YAML file.
Updating and managing environments¶
In BSPC, we have the policy that anything added to the environment should be recorded in a file which is then used to update the environment. That way, the environment file is the authoritative source of what was put into the environment.
If you need to add something to the environment, add it to the requirements first (either requirements.txt or env.yml), and then, with the environment activated, install from the entire requirements file. For example:
conda activate ./env
mamba install --file requirements.txt
This will only install packages (and dependencies) that have not already been installed, and requirements.txt remains an accurate record of the packages that were installed.
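If the environment is described by a YAML file instead, the equivalent update (a sketch, assuming the file is called env.yml as above) uses env update rather than install:
conda env update -p ./env --file env.yml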
Conda envs cannot be moved¶
Due to the way that libraries (typically C and C++) are handled in conda, the absolute path to an environment is written into many of the executable files at install time. This means that if the environment is moved to another location, those absolute paths will no longer be pointing to the paths where the libraries are, which breaks the environment.
Recording installed packages¶
If you have been rigorous about maintaining the contents of the requirements file, that should be sufficient for someone else to build the same environment. Otherwise, or if you want to be sure, you can export the environment.
conda env export --no-builds > env_export.yaml
This will include all dependencies, in a YAML-format file ready to be used by conda env create --file. It will also include dependencies that you didn’t explicitly install. For example, building an environment with just Python in it will also install lots of other things that Python needs (like pip, setuptools, sqlite, tk, wheel, ca-certificates, and more). These will also be included in the export.
The --no-builds part is helpful for maintaining reproducibility; see below for more on this.
There is a spectrum of how precisely you can specify an environment. Least reproducible (but may still be perfectly fine!) is a hand-written file with package names only:
python
pandas
Or, assuming you know that you need features from pandas that were added in version 1.5.1:
python
pandas>=1.5.1
Those files must be hand-written based on what you know your codebase requires.
Using conda env export allows you to report everything that got installed (dependencies of dependencies of dependencies of…) in the environment.
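To rebuild an environment from that export somewhere else, the command would look something like this (a sketch; env_export.yaml is the file generated above, and ./env is wherever the new environment should live):
conda env create -p ./env --file env_export.yaml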
Installing a previously-exported env.yaml and dealing with version conflicts¶
If you re-create an environment from an env.yaml within a short amount of time (say, a few months), then it is likely to work with no problems. However, over time, packages may get fixed and rebuilt, which can cause issues when re-creating the environment.
This primarily happens when there are build numbers included in the env.yml. To understand this, first take a look at a typical conda package name:
zlib-1.2.12-h5eee18b_3
^^^^ ^^^^^^ ^^^^^^^^^^
|    |      |
|    |      |__ build string
|    |
|    |__ package version
|
|__ package name
Here, zlib is the package name (it’s used by MANY other packages to handle file compression, so there’s a good chance it’s in your environments). The 1.2.12 is the version of zlib. The h5eee18b_3 is called the build number or sometimes the build string. Technically, the h5eee18b part is a hash of the packages and versions that this package depends on and that are pinned to specific versions by the build infrastructure (that is, conda-forge or bioconda). In other words, it’s a string that will change if a version changes in any of the packages it depends on. The _3 part means that this is the fourth time (the number is zero-indexed) that zlib version 1.2.12 has been rebuilt using this same collection of underlying packages.
Do not expect build numbers to be stable over time. For example, a packager might realize that they forgot to copy over a file, and the issue wasn’t caught until later. Or a packager included large amounts of supplementary data in a package and was asked to remove it to avoid very long download times. In both cases, the package version doesn’t change; it’s just other parts around it that change. This is reflected in changes to the build number.
Excluding build numbers is useful because it allows packages to “float” to the most recent available build, while still keeping the package version the same. There are cases where the channel (like conda-forge or bioconda) removes a particular build because it is known to be broken. If an environment yaml happened to contain that broken build, recreating that environment would fail because it wouldn’t be found.
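One way to see the difference in your own environment (a sketch; zlib is used here only because it is such a common dependency, and the exact version and build string will vary):
conda env export | grep zlib
#   - zlib=1.2.12=h5eee18b_3     <- pinned to one exact build (example output)

conda env export --no-builds | grep zlib
#   - zlib=1.2.12                <- version pinned, build free to float (example output)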
Completely remove defaults channels¶
If you get an error like this:
RuntimeError: Multi-download failed. Reason: Transfer finalized, status: 403 [https://repo.anaconda.com/pkgs/r/noarch/repodata.json] 4020 bytes
then you can fix it by completely removing the default channels from conda by adding this to your ~/.condarc:
default_channels: []
Then run:
conda config --remove channels defaults
While you’re at it, you may want to set strict channel priorities, as recommended by the bioconda docs.
So a working .condarc looks like the following:
channels:
- conda-forge
- bioconda
channel_priority: strict
default_channels: []
Installation on Helix/Biowulf¶
On NIH’s Helix/Biowulf cluster, trying to install miniconda can result in the installation directory having only a conda.exe file in it, and you also get warnings about libraries. This appears to be an issue with how temp files are handled on the system.
In general, the latest info is on Biowulf docs on conda. Here is a summary of that section showing how to use a new temp directory:
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
TMPDIR=/scratch/$USER/temp bash \
Miniconda3-latest-Linux-x86_64.sh \
-p /data/$USER/miniconda3 \
-b
Biowulf staff also recommend NOT activating your base environment by default. Why?
activating an environment runs conda, which runs Python
Python touches a lot of files when starting up
If you run thousands of jobs on the cluster, each job will activate the base environment (and therefore run Python), which will possibly touch hundreds of thousands of files before any computational work is done, potentially causing I/O lag on the cluster.
There are a few ways around this. The one I have found most convenient is to first run conda init bash, which adds lines to your ~/.bashrc file that look like this:
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/data/$USER/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/data/$USER/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/data/$USER/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/data/$USER/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
Edit your .bashrc and wrap that newly-added-by-conda-init code in a function. Here, the function is called c just because it’s easy to type, but it can be whatever you want. I also added conda activate $1 to the end of it. So I converted those lines to something that looks like this in my .bashrc (added lines emphasized):
function c() {
    # >>> conda initialize >>>
    # !! Contents within this block are managed by 'conda init' !!
    __conda_setup="$('/data/$USER/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
    if [ $? -eq 0 ]; then
        eval "$__conda_setup"
    else
        if [ -f "/data/$USER/miniconda3/etc/profile.d/conda.sh" ]; then
            . "/data/$USER/miniconda3/etc/profile.d/conda.sh"
        else
            export PATH="/data/$USER/miniconda3/bin:$PATH"
        fi
    fi
    unset __conda_setup
    # <<< conda initialize <<<
    conda activate $1
}
Now, I can either activate my base environment with c, or activate an environment with c ./env…but my base environment is not activated at the start of every session, thus reducing the I/O burden on the cluster.
source activate vs conda activate¶
The “old” way of activating an environment was source activate env. This should still work. The “new” way of activating an environment is conda activate env. The new way requires a one-time setup, conda init bash, which adds a bunch of stuff to your .bashrc.
However, if you try conda activate within a script, you’ll get an error because the script does not source .bashrc. The solution is to change conda activate ./env to:
eval "$(conda shell.bash hook)"
conda activate ./env
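For example, a minimal sketch of a job script that uses this (the env path and the final command are hypothetical placeholders):
#!/bin/bash
set -e

# Make "conda activate" work in this non-interactive shell
eval "$(conda shell.bash hook)"
conda activate ./env

# Whatever should run inside the activated environment (placeholder)
python run_analysis.py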
Note that if you inspect what conda init bash adds to your .bashrc, it’s basically doing the same thing.
Detailed troubleshooting example¶
Here are some notes on a recent conda troubleshooting session that may provide some useful tools for future cases.
This issue started when cutadapt in the original env was giving CRC errors, apparently because it thought the gzipped fasta files were corrupt. I verified the files themselves in various ways, and it didn’t look like the input files were causing the problem. So I started to suspect the tool.
I created a quick script that would perform the test, both to isolate the issue and to give me a quick way of seeing whether a possible fix worked, minimizing downtime between tests.
OK, first thing to check: maybe it’s something with the env. Tried module load cutadapt on Biowulf, to use the version installed by Biowulf staff. Ran the test script, and it worked.
OK, maybe it’s a version thing? The Biowulf module was using cutadapt 3.0, and the original env was using cutadapt 3.3. So I made a new env with cutadapt 3.0 (mamba create -p ./env-cutadapt3.0 cutadapt=3.0). It worked. I also verified that cutadapt 3.4 worked in another environment created similarly. So using 3.0 or 3.4 worked; that’s good.
To verify that it’s in fact a version thing, I created a fresh env with just cutadapt 3.3, expecting it to fail. But it worked! Oh no. Is this some sort of strange filesystem thing that is now magically resolved? I went back to the original env that had 3.3, and verified that yes, it’s still failing. So there must be something in particular about that original env.
To test that idea, I created a new version of the original environment, but with cutadapt 3.4, to see what other packages would be brought in. I probably should have done this with cutadapt 3.3; this was not a properly controlled experiment! Anyway, this new environment did work.
So the signs were pointing to the fact that something else, brought in as a dependency of cutadapt in that original environment, was the cause. I needed to figure out what was different between the original env and the new one with cutadapt 3.4 (besides cutadapt itself, of course).
To do this, I used conda env export on the original and newly-created envs, and then studied the diff between the two files. After chasing a couple of false leads, I eventually saw that python-isal was at version 0.8.0 in the original environment…but 0.9.0 in the newly working one. I looked up what python-isal was: it’s a compression library. The original problem I was trying to solve was that error with gzip, so this was a promising lead. After checking the GitHub page, I did find a closed issue, https://github.com/pycompression/python-isal/issues/60, that described the problem and showed that it was fixed in 0.8.1. Note that this GitHub issue did not come up when I was searching for the original error!
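The comparison itself was along these lines (a sketch; the environment paths and output file names here are hypothetical):
conda env export -p ./env             > original.yaml   # the failing env
conda env export -p ./env-cutadapt3.4 > working.yaml    # the newly-created, working env
diff original.yaml working.yaml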
So the solution was to pin python-isal>0.8.0 in the requirements.txt in the original environment…and that worked!
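In other words, the fix amounted to something like this (a sketch; the rest of requirements.txt is unchanged and omitted here):
# In requirements.txt, add (or update) a line forcing python-isal past the broken release:
#   python-isal>0.8.0
# Then, with the original environment activated, re-install from the file:
conda activate ./env
mamba install --file requirements.txt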