Understanding and Improving Conda's performance


Lately, we have been responding to issues about Conda’s speed.  We’re working on it and we wanted to explain a few of the facets that we’re looking at to solve the problem.  

TL;DR: make it faster

Are you:

  • Using conda-forge?
  • Using bioconda?
  • Specifying very broad package specs?
    • Be more specific.  Letting conda filter more candidates makes it faster.  Instead of “numpy”, think “numpy=1.15” or even better, “numpy=1.15.4”
    • If you are using R, specifying a particular implementation makes the problem much simpler.  Instead of specifying only “r-essentials”, specify “r-base=3.5 r-essentials”
  • Feeling frustrated with “verifying transaction” and also feeling lucky?
    • conda config --set safety_checks disabled
  • Getting strange mixtures of defaults and conda-forge?
    • conda config --set channel_priority strict
    • This also makes things go faster by eliminating possible mixed solutions.
  • Observing that an Anaconda or Miniconda installation is getting slower over time?
    • Create a fresh environment.  As environments grow, they become harder and harder to solve.  Working with small, dedicated environments can be much faster.
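
To make the “be more specific” advice concrete, here is a minimal sketch of how a narrower spec shrinks the list of candidates. It is not conda’s matching code: the matches helper and the version list are hypothetical and made up for illustration. It only mimics the behavior of a spec such as “numpy=1.15”, which accepts any 1.15.x release.

```python
# Toy illustration: a narrower spec leaves fewer candidates for the solver.
# `matches` is a hypothetical helper, not part of conda's API.

def matches(spec_version, candidate_version):
    """Return True if candidate_version starts with the components of spec_version.

    Mimics a spec such as "numpy=1.15", which accepts 1.15.0, 1.15.4, and so
    on, but not 1.16.0.
    """
    if spec_version is None:  # bare "numpy": every version is a candidate
        return True
    wanted = spec_version.split(".")
    have = candidate_version.split(".")
    return have[:len(wanted)] == wanted

# Made-up version list, for illustration only.
available = ["1.14.6", "1.15.0", "1.15.4", "1.16.0", "1.16.2"]

for spec in (None, "1.15", "1.15.4"):
    kept = [v for v in available if matches(spec, v)]
    print(spec or "(any)", "->", len(kept), "candidate(s):", kept)
```

Every candidate that survives this kind of filtering is one more thing the later stages have to reason about, which is why the more specific spec is cheaper.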

How Conda works

To understand the above recommendations, let’s take a look at how Conda works by examining how a package is installed:

  1. Downloading and processing index metadata
  2. Reducing the index
  3. Expressing the package data and constraints as a SAT problem
  4. Running the solver
  5. Downloading and extracting packages
  6. Verifying package contents
  7. Linking packages from package cache into environments

Downloading and processing index metadata

Metadata is currently fed into conda from JSON files (either repodata.json or its compressed form, repodata.json.bz2).  Unlike many package managers, Anaconda’s repositories generally don’t filter or remove old packages from the index. This is good, in that old environments can easily be recreated.  However, it does mean that the index metadata is always growing, and thus conda becomes slower as the number of packages increases. Either packages need to move to archive channels over time, or the server needs to present a reduced view of all available packages.

Conda-metachannel is a community project started by Marius van Niekerk that tries to reduce the size of the repodata that gets fed into conda.  It is effectively a server-side solve that is cached and reused for multiple people, giving the solve task on the user’s computer less potential work.  Because all of this happens behind the scenes, and conda-metachannel provides a repodata.json file that isn’t any different, it does not require special support from conda.  These ideas will be critical for future developments on repositories offered by Anaconda, Inc., because they allow the ecosystem to grow without continuing to slow down conda. There is no specific development to point to right now aside from conda-metachannel, but these ideas will be part of future development.

After downloading metadata, conda loads the JSON file into memory, and creates objects representing each package.  There’s some sanitizing of data that happens at this stage, to prevent things from choking in unexpected and undesirable ways down the line.  This loading can be costly, but it is cached, so you often are not paying this cost for a given install. For any scheme that reduces the index upstream of this step, such as conda-metachannel, this caching won’t work as well, because the upstream will change more often or represent an incomplete set of the current request.  Further development will be needed to cache entries on some finer level.
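
As a rough sketch of the loading step, the snippet below parses a tiny, made-up stand-in for repodata.json and builds one object per package. The PackageRecord type and the load_records function are hypothetical names for illustration, and the “sanitizing” here (skipping entries without a name or version) is only a placeholder for the checks conda actually performs.

```python
import json
from collections import namedtuple

# Hypothetical record type; conda's real record classes carry many more fields.
PackageRecord = namedtuple(
    "PackageRecord", "fn name version build build_number depends"
)

# A tiny, made-up stand-in for the contents of a repodata.json file.
REPODATA_TEXT = """
{
  "info": {"subdir": "linux-64"},
  "packages": {
    "zlib-1.2.11-h0_0.tar.bz2": {
      "name": "zlib", "version": "1.2.11", "build": "h0_0",
      "build_number": 0, "depends": []
    },
    "python-3.7.2-h1_0.tar.bz2": {
      "name": "python", "version": "3.7.2", "build": "h1_0",
      "build_number": 0, "depends": ["zlib >=1.2.11"]
    },
    "broken-0.1-0.tar.bz2": {"version": "0.1"}
  }
}
"""

def load_records(text):
    """Parse repodata JSON and build one record per usable package entry."""
    records = []
    for fn, meta in json.loads(text)["packages"].items():
        # Placeholder sanitizing: skip entries missing the fields we rely on.
        if "name" not in meta or "version" not in meta:
            continue
        records.append(PackageRecord(
            fn=fn,
            name=meta["name"],
            version=meta["version"],
            build=meta.get("build", ""),
            build_number=meta.get("build_number", 0),
            depends=tuple(meta.get("depends", ())),
        ))
    return records

records = load_records(REPODATA_TEXT)
print([r.fn for r in records])
```

With a real channel the "packages" mapping holds a very large number of entries, so both the JSON parse and the object creation add up. That is the cost the cache avoids.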

For especially large channels, such as conda-forge and bioconda, this step can take a lot of time.  For example, consider creating a simple environment without a cached index:

conda clean -iy && CONDA_INSTRUMENTATION_ENABLED=1 conda create -n test_env --dry-run python numpy

Adding in the conda-forge and bioconda channels dramatically increases the time spent creating the index, while using conda-metachannel reclaims much of that increase:

 
Channel configuration            Metadata collection time
Defaults only                    2.80s
Conda-forge                      8.33s
Bioconda + Conda-forge           10.23s
Conda-forge with metachannel     3.41s

These benchmarks were run on a win-64 system with conda 4.6.7.

Reducing the index

The repodata we have at this point probably contains a lot of package data that is not used in the solving stage.  The next stage, expressing the package data and constraints as a boolean satisfiability problem, is fairly costly, and filtering out unnecessary packages can save some time.  Conda starts out with only the explicit specs provided by the user. Conda then recurses through dependencies of these explicit specs to build the complete set of packages that might possibly be used in any final solution.  Packages that are not involved either explicitly or as a dependency are pruned in this step. This step is why it is beneficial to be as specific as possible in your package specifications. Simply listing a version for each of your specs may dramatically reduce the packages that are considered after this step.
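
The recursion described above can be sketched as a simple graph walk. This is a toy model, not conda’s implementation: the index is a made-up mapping from package names to dependency names, versions are ignored entirely, and reduce_index is a hypothetical function name.

```python
from collections import deque

# Made-up index: package name -> names of its dependencies (versions ignored).
INDEX = {
    "numpy": ["python", "mkl"],
    "python": ["zlib", "openssl"],
    "mkl": [],
    "zlib": [],
    "openssl": [],
    "r-base": ["zlib"],
    "unrelated-tool": ["python"],
}

def reduce_index(index, explicit_specs):
    """Keep only packages reachable from the explicit specs."""
    keep = set()
    queue = deque(explicit_specs)
    while queue:
        name = queue.popleft()
        if name in keep or name not in index:
            continue  # already visited, or unknown to this index
        keep.add(name)
        queue.extend(index[name])  # recurse into dependencies
    return {name: deps for name, deps in index.items() if name in keep}

reduced = reduce_index(INDEX, ["numpy"])
print(sorted(reduced))
```

Here “r-base” and “unrelated-tool” are pruned because nothing the user asked for depends on them. In the real index each name stands for many versions and builds, so a version constraint on the explicit spec prunes far more than this name-only sketch can show.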

One of the optimizations that was made in conda 4.6 was to make this pruning more aggressive.  In conda’s recursion through these dependencies, specs are not allowed to “broaden” from constraints that are expressed in explicit dependencies.  For example, the anaconda metapackage is made up of all exact constraints (version and build string are both specified). The zlib dependency of that metapackage would look something like “zlib 1.2.11 habc1234_0”.

  • Dependency constraints that express a version requirement of any sort imply that all dependency constraints of that package name must also have a version requirement
  • Similarly, dependency constraints that express a build string requirement of any sort imply that all dependency constraints of that package name must also have a build string requirement

With our anaconda metapackage and zlib example, if some other dependency of anaconda expressed a zlib dependency, that zlib dependency would be ignored for expanding the collection of repodata, unless that zlib dependency also had a version and build string specified.

This is a delicate balance: if conda filters too aggressively, users may not see the solutions they’re expecting, and the reasons why they don’t see those solutions may be obscure.  For this reason, we make two careful considerations:

  • We make no attempt to filter actual values of version ranges, nor build strings.  We only require that some constraint is present.
  • We sort the dependency collection by descending version order, such that the newest packages have the most impact on how much broadening is or is not allowed.
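
A simplified sketch of the “no broadening” rule follows. It assumes a hypothetical spec representation, a tuple of (name, version constraint, build constraint) where None means no constraint, and the helper names are ours, not conda’s. It checks only whether a constraint is present, never its value, and it leaves out the descending-version sort mentioned above.

```python
# Hypothetical spec representation: (name, version_constraint, build_constraint),
# where None means "no constraint of that kind".

def note_constraints(spec, seen):
    """Remember which kinds of constraint have been seen for this package name."""
    name, version, build = spec
    kinds = seen.setdefault(name, set())
    if version is not None:
        kinds.add("version")
    if build is not None:
        kinds.add("build")

def may_expand(spec, seen):
    """Return True if this dependency spec may add records to the reduced index.

    Only the presence of a constraint is checked, never its value.
    """
    name, version, build = spec
    kinds = seen.get(name, set())
    if "version" in kinds and version is None:
        return False  # would broaden past an earlier version constraint
    if "build" in kinds and build is None:
        return False  # would broaden past an earlier build string constraint
    return True

seen = {}
note_constraints(("zlib", "1.2.11", "habc1234_0"), seen)  # from the metapackage

print(may_expand(("zlib", None, None), seen))             # bare "zlib"
print(may_expand(("zlib", ">=1.2", None), seen))          # version only
print(may_expand(("zlib", "1.2.8", "h0000000_1"), seen))  # version and build
print(may_expand(("openssl", None, None), seen))          # nothing seen yet
```

Once the exact zlib spec has been seen, only zlib specs that carry both a version and a build string are allowed to widen the collection, which matches the anaconda metapackage example.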

By making this more aggressive, we have decreased the solve time for metapackages, such as anaconda, down to less than 10 seconds in our benchmarks.  There is room for improvement, however, as this style of filtering still allows irrelevant versions and builds (such as packages built for mismatching python versions) into the solver.  Filtering these can be expensive, and may be best done at the level of the server providing the repodata.

Expressing the package data and constraints as a SAT problem

At its heart, Conda relies on something called a SAT solver.  It expresses the collection of package metadata, and how that metadata relates to constraints, as a Boolean satisfiability (SAT) problem.  Notably, SAT is NP-complete, and was in fact the first problem proven to be NP-complete.  Conda doesn’t just want satisfiability, though: we want a particular solution, often the one with the newest packages.  Conda thus assigns scores to packages with the same name. These scores account for channel priority, version differences, build number, and timestamp. The scores are fed into a clause generator, which each SAT implementation can customize to emit clauses in its native format.  The solver runs through several stages. At each stage, clauses are altered or added to prioritize particular things.  We’ll talk about those in the next step.
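
To give a feel for what “expressing packages as a SAT problem” means, here is a toy encoding with three made-up packages. Each package build becomes a Boolean variable (installed or not), and each requirement becomes a clause. The brute-force search below simply stands in for a real SAT solver. It is not how conda or pycosat work internally, and all names are hypothetical.

```python
from itertools import product

# One Boolean variable per package build (made-up data).
VARIABLES = ["app-1.0", "lib-1.0", "lib-2.0"]

# Clauses in conjunctive normal form. Each literal is (variable, wanted_value);
# a clause is satisfied when at least one of its literals holds.
CLAUSES = [
    [("app-1.0", True)],                                          # user asked for app
    [("app-1.0", False), ("lib-1.0", True), ("lib-2.0", True)],   # app needs some lib
    [("lib-1.0", False), ("lib-2.0", False)],                     # at most one lib
]

def satisfies(assignment, clauses):
    """Check every clause against a {variable: bool} assignment."""
    return all(
        any(assignment[var] == wanted for var, wanted in clause)
        for clause in clauses
    )

def all_solutions(variables, clauses):
    """Brute force: try every assignment (only feasible for toy inputs)."""
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if satisfies(assignment, clauses):
            yield {var for var in variables if assignment[var]}

solutions = list(all_solutions(VARIABLES, CLAUSES))
for solution in solutions:
    print(sorted(solution))
```

Two assignments satisfy every clause, one with each version of lib. Satisfiability alone cannot choose between them, which is why the scoring and the staged prioritization are needed.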

Running the solver

People sometimes wonder why Conda returns a particular unintuitive result.  Trust us, Conda is not crazy, though the package metadata might be. When you have something undesirable happening, it helps to think through the steps that conda takes when running the solver:

  1. Test for satisfiability of initial specs
  2. Prioritize solutions that require removal of the fewest existing packages
  3. Prioritize solutions that maximize versions of explicit specs
  4. Prioritize solutions that minimize the number of “track_features” entries in the environment.  Track_features are legacy metadata once used for lining up mkl and debug packages.  They are deprecated in general, but are still sometimes used to effectively implement a default variant of a package by “weighing down” alternatives with track_features.
  5. Prioritize solutions that maximize the number of packages that have “features”.  This is a counterpart to 4.  Once a track_feature is active, conda tries very hard to use packages that have that “feature”.  In the past, this has caused confusion as it can have more weight than channel priority. Strict channel priority has largely fixed that confusion.  In general, packages with “features” are no longer being created. “track_features” remain as mentioned above as a means of prioritizing otherwise equivalent packages.
  6. Prioritize solutions that maximize build numbers of explicit specs
  7. Prioritize solutions which install fewer optional packages
  8. Prioritize solutions that minimize upgrades necessary to existing packages
  9. Prioritize solutions by version, then build, of indirect dependencies of explicit specs
  10. Prune any unnecessary packages
  11. Check if converged.  If we have more than one “equivalent” solution at this stage, prioritize solutions by maximizing package timestamps.  Packages that conda considers similar (variants where versions are similar, or where totally different indirect deps otherwise render the above steps equivalent) need a tiebreaker.  This step is where conda frequently gets tied up. If conda is getting hung up on something, it is helpful to look at why conda needs to use a timestamp for the tiebreaker. Often, it is best to add additional metadata to one or more packages to break the equivalency.
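
One way to picture this ordering is as a lexicographic comparison: an earlier criterion always outweighs every later one, and the timestamp only matters when everything before it is tied. The sketch below illustrates that idea with four made-up candidate solutions and only a handful of the criteria. Conda does not enumerate solutions like this (it runs successive optimization passes inside the solver), and the field names are hypothetical.

```python
# Each candidate solution is summarised by made-up numbers for a few criteria.
# For "explicit_version_rank", 0 means the newest version of the explicit spec.
candidates = [
    {"label": "A", "removed": 0, "explicit_version_rank": 1,
     "track_features": 0, "timestamp": 100},
    {"label": "B", "removed": 0, "explicit_version_rank": 0,
     "track_features": 1, "timestamp": 300},
    {"label": "C", "removed": 1, "explicit_version_rank": 0,
     "track_features": 0, "timestamp": 400},
    {"label": "D", "removed": 0, "explicit_version_rank": 0,
     "track_features": 1, "timestamp": 200},
]

def priority_key(candidate):
    """Tuple compared left to right; lower is better for every entry."""
    return (
        candidate["removed"],                # fewest removals first
        candidate["explicit_version_rank"],  # then newest explicit specs
        candidate["track_features"],         # then fewest track_features
        -candidate["timestamp"],             # final tiebreaker: newest timestamp
    )

best = min(candidates, key=priority_key)
print(best["label"])
```

In this toy data, B and D are tied on every criterion except the timestamp, so the timestamp decides between them. A is passed over despite having no track_features, because the version of the explicit spec is considered first.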

This order of operations and prioritization has been developed slowly and painfully over time.  It is only with great caution that we even think about changing that order, because doing so has such dramatic, unpredictable effects with so many packages out there in our ecosystem.

When it comes to actually running these steps, not all solvers are created equal.  For many years, conda has been well-served by the pycosat library, which wraps the picosat solver.  SAT solving is a competitive area of research, though, and conda should be able to benefit from recent developments in this research.  Thanks to efforts by conda contributors, the solver part of conda is abstracted, and any solver can be used by writing interface glue code.  By using cryptominisat, for example, one can immediately see a small speedup in conda:

CONDA_INSTRUMENTATION_ENABLED=1 conda create --dry-run -n conda_forge_r -c conda-forge r-essentials

Time results from instrumentation record (~/.conda/instrumentation-record.csv) for conda.resolve.Resolve.solve and conda.core.solve.Solver._run_sat

Solver               Time
Pycosat              34.73 sec
Pycryptosat 5.6.6    32.82 sec
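
If you want to total up timings like these yourself, a small script along the following lines may help. It rests on an assumption we have not verified here: that each row of the instrumentation record is a function name followed by elapsed seconds, separated by a comma. The sample rows are made up, and total_seconds is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# ASSUMPTION: each row is "<function name>,<elapsed seconds>".
# These sample rows are made up; they are not real measurements.
SAMPLE = """\
conda.resolve.Resolve.solve,1.5
conda.core.solve.Solver._run_sat,2.0
conda.resolve.Resolve.solve,0.5
"""

def total_seconds(lines):
    """Sum elapsed time per function name from an iterable of CSV lines."""
    totals = defaultdict(float)
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        name, seconds = row
        totals[name] += float(seconds)
    return dict(totals)

totals = total_seconds(io.StringIO(SAMPLE))
for name, seconds in sorted(totals.items()):
    print("%-40s %.2f sec" % (name, seconds))
```

To run it on a real record, pass an open file handle for ~/.conda/instrumentation-record.csv instead of the in-memory sample, and adjust the parsing if the file turns out to be laid out differently.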

Cryptominisat is an award-winning SAT implementation whose developer has actively worked with conda developers to make it work for conda.  We’ll also continue to explore different solver implementation options (including other classes of solvers, such as Microsoft’s Z3 SMT solver) to keep the conda user experience as quick as possible.  We value the consistency of environments, and we’ll continue pursuing improvements in metadata to help solvers make decisions accurately, and also improvements in solvers to help make those accurate decisions quickly.

Downloading and extracting packages

Conda has been based around .tar.bz2 files since its inception.  The actual file format doesn’t matter much to conda. Really only the relative paths within the container are what matter.  Thanks to this simplifying assumption, Anaconda has been developing a new file format that will speed up conda. The new file format (.conda) consists of an outer, uncompressed ZIP-format container, with two inner compressed .tar files.  These inner compressed tar files are free to use any compression that libarchive is built to support. For the inaugural .conda files that Anaconda is creating, the zstandard compression format is being used. The zstandard-compressed tarballs can be significantly smaller than their bzip2 equivalents.  In addition, they decompress much more quickly.

Package name   Size (.tar.bz2) / extract time   Size (.conda) / extract time
python         17.3 MB / 3.613 sec              14.5 MB / 2.021 sec
mkl            178 MB / 16.972 sec              99.8 MB / 2.385 sec
pytorch        464 MB / 38.271 sec              320 MB / 4.785 sec

Package metadata (the “info” folder) makes up one of the inner tarballs, while the actual package contents (everything but the “info” folder) make up the other.  By separating the metadata this way, we take advantage of the indexed nature of zip files for rapid access to package metadata. Tasks that involve accessing the package metadata now scale simply with the number of packages being processed, rather than the number and size of packages being processed.  Indexing a collection of packages, for example, is now much faster, although the time to create packages in the new format can outweigh and even dwarf these gains.
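
The benefit of splitting metadata from payload can be sketched with the standard library alone. The container built below imitates the layout (an uncompressed ZIP holding two inner compressed tarballs), but it is not a real .conda file: it uses gzip instead of zstandard so that no third-party module is needed, and the inner member names info.tar.gz and pkg.tar.gz are placeholders we made up.

```python
import io
import json
import tarfile
import zipfile

def make_tar_gz(files):
    """Return the bytes of a gzip-compressed tar holding {path: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path, data in files.items():
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def make_container(info_files, pkg_files):
    """Build an uncompressed ZIP holding two inner compressed tarballs."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("info.tar.gz", make_tar_gz(info_files))  # placeholder name
        zf.writestr("pkg.tar.gz", make_tar_gz(pkg_files))    # placeholder name
    return buf.getvalue()

def read_index_json(container_bytes):
    """Read metadata by opening only the small inner info tarball."""
    with zipfile.ZipFile(io.BytesIO(container_bytes)) as zf:
        inner = zf.read("info.tar.gz")  # the payload tarball is never read
    with tarfile.open(fileobj=io.BytesIO(inner), mode="r:gz") as tar:
        return json.load(tar.extractfile("info/index.json"))

container = make_container(
    {"info/index.json": json.dumps({"name": "demo", "version": "1.0"}).encode()},
    {"lib/demo.txt": b"x" * 100000},
)
meta = read_index_json(container)
print(meta)
```

Because the ZIP directory records where each member starts, the reader can jump straight to the metadata tarball. With a single .tar.bz2, reaching the “info” folder means decompressing the stream until those files appear.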

While the performance of the new package format is exciting, it will take time to be broadly supported.  In particular, the codebase behind anaconda.org does not currently support hosting packages in the new format.  Channels that benefit from Anaconda’s edge-cache mirroring (presently, conda-forge) will benefit from conversion to the new package format.  Other channels will need to wait for broader support.

Verifying package contents

Conda 4.3 added many features around safety. At the time, conda was suffering from pip writing into the package cache, which then made future conda installations very unpredictable. This release added sanity checks so that conda could warn you if anything was out of sorts. Conda also grew support for transactions, so that it could roll back a change if it found a problem. These checks are unfortunately costly. In conda 4.6, we relaxed them from computing a sha256 checksum of every file to only checking file size. The old sha256-all-files behavior is still there; it’s just behind a configuration variable (extra_safety_checks in condarc).

It’s difficult to say what the correct behavior is here. Some users value the peace of mind that the extra time for file verification buys them. Others do not. We enable file verification (but not full sha256 verification) by default because having conda report corrupted files is more helpful than having conda blindly install broken packages. It’s a lot easier to figure out what’s going on when conda tells you not to trust the integrity of the files than it is to hit unexpected behavior and have to work out that files have been corrupted or overwritten somehow. If you’re impatient or otherwise in the mood to shout YOLO, you may disable these checks with a setting in condarc:

safety_checks: disabled

If you’d like to see more about conda’s verification checks, please see the code at https://github.com/conda/conda/blob/4.6.7/conda/core/path_actions.py#L275-L362
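
The trade-off between the two kinds of check can be illustrated with a short sketch. This is not the code linked above: size_matches and sha256_matches are hypothetical helpers, and the file contents are made up. The size check needs only a stat call, while the checksum has to read every byte.

```python
import hashlib
import os
import tempfile

def size_matches(path, expected_size):
    """Cheap check: one stat call, no file contents are read."""
    return os.path.getsize(path) == expected_size

def sha256_matches(path, expected_hex):
    """Thorough check: reads and hashes the whole file."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex

original = b"print('hello')\n"
tampered = b"print('hullo')\n"  # same length, different content

# What the package metadata would record for the original file.
expected_size = len(original)
expected_sha = hashlib.sha256(original).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "module.py")
    with open(path, "wb") as fh:
        fh.write(tampered)
    size_ok = size_matches(path, expected_size)
    sha_ok = sha256_matches(path, expected_sha)

print("size check passed:", size_ok)
print("sha256 check passed:", sha_ok)
```

A same-length modification slips past the size check but not the checksum. That is the gap you accept in exchange for the faster default, and the reason the stricter behavior remains available.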

Linking packages from package cache into environments

There’s not a whole lot to this.  Conda tries to use hard links whenever possible.  Hard links increase the reference count on disk to a specific file, without actually copying its contents.  Files that require prefix replacement (files that have hard-coded paths in them) can’t be just hard-linked and must be copied.  

Where conda fails to create a hard link, it may fall back to either a symlink or a copy.  Hard links may fail due to a permissions error, or because the destination is on a different volume than the package cache.  Hard links only work within a volume. Pay special attention to how your folders are mounted, as the fallback to copying is a big speed hit.
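
The link-then-fall-back pattern looks roughly like the sketch below. It is a simplification under our own assumptions: link_or_copy is a hypothetical helper, it goes straight from hard link to copy without trying a symlink, and it does nothing about prefix replacement.

```python
import os
import shutil
import tempfile

def link_or_copy(src, dst):
    """Try a hard link first; fall back to a full copy if linking fails."""
    try:
        os.link(src, dst)  # fails across volumes or without permission
        return "hardlink"
    except OSError:
        shutil.copy2(src, dst)  # slower: every byte is written again
        return "copy"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "cache_file.txt")  # stands in for the package cache
    dst = os.path.join(tmp, "env_file.txt")    # stands in for the environment
    with open(src, "w") as fh:
        fh.write("payload")

    method = link_or_copy(src, dst)
    with open(dst) as fh:
        content = fh.read()
    same_file = os.path.samefile(src, dst)

print(method, "| same underlying file:", same_file)
```

When the hard link succeeds, both paths refer to the same data on disk, so nothing is copied. If your package cache and environments sit on different volumes, every file takes the copy branch instead.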

One other interesting thing that happens at package link time is the remapping of noarch files into the appropriate place for the destination installation.  Noarch python packages store their contents in an intermediate path that isn’t actually valid for any python at all. At install time, conda uses information about the destination (which python version, which platform) to determine the appropriate destination, and links files appropriately.  Conda then compiles bytecode for the installed files. This was recently optimized to happen in batches, rather than one file at a time, which dramatically improved the time required for installing noarch python packages.
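
To illustrate why batching helps, the sketch below compiles several files with a single interpreter start-up rather than one start-up per file. It assumes that process start-up is the cost being saved, and it uses the standard compileall module. Neither detail is confirmed to be what conda does internally, and compile_batch is a hypothetical helper.

```python
import importlib.util
import os
import subprocess
import sys
import tempfile

def compile_batch(python_exe, paths):
    """Compile many files to bytecode with a single interpreter start-up."""
    subprocess.check_call([python_exe, "-m", "compileall", "-q", *paths])

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(3):
        path = os.path.join(tmp, "mod%d.py" % i)
        with open(path, "w") as fh:
            fh.write("VALUE = %d\n" % i)
        paths.append(path)

    compile_batch(sys.executable, paths)  # one process for all three files

    # cache_from_source gives the bytecode path this interpreter would use.
    compiled = [
        os.path.exists(importlib.util.cache_from_source(p)) for p in paths
    ]

print(compiled)
```

The one-file-at-a-time equivalent would call compile_batch once per path, paying the interpreter start-up cost each time. For a noarch package with hundreds of modules, that overhead adds up.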

How to helpfully report speed issues

We get a lot of reports on the conda issue tracker about speed.  These are mostly not very helpful, because they only very generally describe the total time conda is taking.  They often don’t even say what conda is taking so long to do, just that it’s slow. If you want to file an issue report about conda’s speed, we ask that you take the time to isolate exactly what is slow, and what you think is a reasonable amount of time for that operation to take (preferably by comparison with another tool that is performing a similar task).  We have an issue template on GitHub that helps us gather information about why conda is slow: https://github.com/conda/conda/issues/new?template=Speed_complaint.md

Please fill in all information on that template when raising an issue about speed.

How to help speed things up

So you want to help?  That’s great! There’s an excellent outline of potential improvements at https://github.com/conda/conda/issues/7700. If you have other ideas, please open an issue on the conda issue tracker to discuss them.  Aside from the contents of that issue, several people have recommended different solvers to try.  Efforts to plug those into Conda would be most welcome.  We’re definitely open to any idea that maintains the current functionality while going faster.  If the current functionality needs to change to enable a significant speedup, we can discuss it, but a lot of people depend on conda and we have to provide smooth transitions when functionality changes dramatically.