Modern data science has revolutionized practically every imaginable facet of life. It has been enabled by the development of languages and programs that make manipulating large amounts of data both intuitive and fast. The intuitive part comes from the language side: high-level tools such as Python are easy to work with. However, these high-level tools are often slow by nature, so compiled libraries that extend and accelerate them are necessary. Binary compatibility among these compiled libraries is therefore critical in the Python data analytics ecosystem, since incompatibility between packages can result in broken software. It would be catastrophic, for instance, if a new release of numpy caused all numpy-dependent software to stop functioning. Package managers must therefore be careful to install only packages that are compatible with each other. PyPI is a great resource, but efforts to standardize binaries there have been lackluster. The more distributed model of Anaconda.org allows any given channel to achieve binary consistency, but historically no single channel has had enough software to cover most use cases. Conda-forge alleviated this problem by ensuring that all builds within its ecosystem use consistent toolchains and build options, but ABI changes in new releases of critical packages make compatibility a rapidly moving target, threatening to tear conda-forge apart at the seams or consign our stack to being out of date. This problem has been exacerbated by Anaconda’s new compilers, which added support for newer compiled language standards but in the process caused conda-forge to become incompatible with the official Anaconda packages, leading to broken software when users mix the two channels. One of the major advantages of using conda-forge over individual user channels is its consistent build environment, which is meant to ensure that software works without much effort from the end user. Unfortunately, users commonly need to mix conda-forge software with defaults software, and in light of these binary incompatibility challenges, the promise of things simply working is harder to keep.

These issues could be solved by simply rebuilding conda-forge’s packages, but that would mean rebuilding every package in the tree that depends on a specific ABI. This is not only prohibitively costly, it is also extremely difficult to coordinate across conda-forge’s distributed collection of packages, each with its own maintainers. Conda-forge therefore requires some form of automation to handle this. With its ability to create pull requests that update recipes, the conda-forge autotick bot, which was originally used for version bumping, is a great tool for performing this rebuild process. My work this summer added new functionality to the bot to bring conda-forge closer to making the switch to the new compilers, so that its packages can support newer compiled language features.

When conda-forge switches to the new compilers, every package that uses compilers needs to be rebuilt in the proper order. To do this, the bot can open a pull request against each pertinent feedstock that increases the build number by one. This can be accomplished by searching the meta.yaml for the pattern number:\s*[0-9]+ and increasing the number by one; however, this method fails on the many recipes that use a Jinja variable to store the build number, e.g.

{% set build = 0 %}
...
build:
  number: {{ build }}

In this case, the bot should update the Jinja variable rather than the number: {{ build }} line. My first contribution of the summer (regro/cf-scripts#148) made sure the bot updates the Jinja variable in such recipes. This PR was made when the bot’s only functionality was version bumping, so there the build number is reset to 0, but the same technique for updating the Jinja variable was used when I wrote the rebuild Migrator (regro/cf-scripts#246), which increases the build number of a recipe by one.
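
A minimal sketch of this bumping logic (illustrative only, not the bot’s actual implementation) might look like:

import re

def bump_build_number(meta_yaml_text):
    # Hedged sketch: bump "{% set build = N %}" when the recipe stores the
    # build number in a Jinja variable, otherwise bump the plain "number: N"
    # entry found by the regex described above.
    jinja = re.search(r"{%\s*set build\s*=\s*(\d+)\s*%}", meta_yaml_text)
    if jinja:
        new = int(jinja.group(1)) + 1
        return re.sub(r"{%\s*set build\s*=\s*\d+\s*%}",
                      "{% set build = " + str(new) + " %}",
                      meta_yaml_text)

    def bump(match):
        return "number: " + str(int(match.group(1)) + 1)

    return re.sub(r"number:\s*([0-9]+)", bump, meta_yaml_text, count=1)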

All Migrator objects have the methods migrate and filter. The migrate method performs the desired update to a recipe, while filter determines whether a package needs to be migrated. In the rebuild migrator, migrate increases the build number by one, while filter checks whether the package is ready to be rebuilt. Since a package should only be rebuilt after all of its dependencies are rebuilt, filter checks all of a package’s dependencies and filters the package out if any dependency that requires rebuilding has not yet been rebuilt. This way the bot only creates a PR once a package is ready, ensuring that rebuilds are triggered in the correct order.
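
In rough outline, the rebuild migrator looks something like the sketch below. The class, method arguments, and node attributes are illustrative rather than the bot’s exact API, and the graph is assumed to have edges pointing from each dependency to the packages that require it:

import os

class RebuildMigrator:
    # Illustrative sketch of the rebuild migrator's two methods.

    def migrate(self, recipe_dir):
        # bump the build number in the feedstock's meta.yaml by one
        path = os.path.join(recipe_dir, "meta.yaml")
        with open(path) as f:
            text = f.read()
        with open(path, "w") as f:
            f.write(bump_build_number(text))  # sketch defined above

    def filter(self, package, graph):
        # filter out (skip) a package if any of its dependencies that also
        # need rebuilding have not been rebuilt yet; the "needs_rebuild"
        # and "rebuilt" node attributes are assumptions for illustration
        for dep in graph.predecessors(package):
            attrs = graph.nodes[dep]
            if attrs.get("needs_rebuild") and not attrs.get("rebuilt"):
                return True  # not ready: a dependency still awaits rebuild
        return False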

With the rebuild migrator, the bot can create PRs bumping the build number of every package that needs to be built with the new compilers, but this migrator does not update a package’s build requirements, so the package would simply be rebuilt with the old compilers! Conda-build 3 introduced a new Jinja2 function, compiler(), which allows compiler packages to be specified dynamically. If a recipe requires a C compiler, for instance, it would include

requirements:
  build:
    - {{ compiler('c') }}

The package that this expression resolves to is defined in a conda_build_config.yaml file. If the recipe does not define compilers in its own conda_build_config.yaml, it uses the packages defined in conda-forge’s central pinning file at https://github.com/conda-forge/conda-forge-pinning-feedstock/blob/master/recipe/conda_build_config.yaml. Currently conda-forge’s pinning file points at the old compilers, so any package using conda-build 3’s Jinja2 syntax is still built with the old compilers; but by changing the definitions in the pinning file, different compilers, including those used by Anaconda, could be swapped in. Therefore, before switching to Anaconda’s toolchains, recipes should first be adapted to use this Jinja2 syntax. This can be accomplished with a migrator that runs the update-cb3 command from conda-smithy on recipes that need the new syntax, so I added this migrator to the bot (regro/cf-scripts#185).
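
To see what {{ compiler('c') }} currently resolves to, one can read the central pinning file directly. A rough sketch, relying on the fact that conda-build 3 looks up compiler('c') through the c_compiler variant key (and cxx_compiler for C++):

import requests
import yaml

# fetch conda-forge's central pinning file and print the compiler packages
# that the compiler() Jinja2 function currently maps to
URL = ("https://raw.githubusercontent.com/conda-forge/"
       "conda-forge-pinning-feedstock/master/recipe/conda_build_config.yaml")
pinnings = yaml.safe_load(requests.get(URL).text)
print(pinnings.get("c_compiler"))
print(pinnings.get("cxx_compiler"))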

Because there are over 1500 recipes that need the new compiler syntax, this migration will take a long time to complete. One plan would be to wait until every package has switched to the new syntax before starting the compiler migration, but this would likely take far too long: some recipe maintainers are inactive, and despite how capable the bot is, certain edge cases cause it to miss feedstocks or fail to update them. Identifying and handling all of these edge cases would itself consume developer time better spent on other important issues. But since packages that already use the new syntax can be rebuilt as soon as all of their dependencies are, the compiler migration can start before the syntax migration is finished.

It then became important to identify packages on which many other packages depend, since they must be built with the new compilers before anything that requires them. This way we can make sure they are ready to be rebuilt, so that large sections of the graph are not blocked waiting on a handful of packages. To help with this I wrote code to find the longest path from a source node to each package (regro/cf-scripts#161). A package cannot depend on anything with a longer path to the source node, so packages with shorter paths should likely be given higher priority. By performing syntax migrations in topological sort order, packages with shorter paths to the compilers are made ready for migration to the new compilers as soon as possible (regro/cf-scripts#225).
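
A rough sketch of that path computation, assuming the graph has already had its cycle-closing edges removed (networkx’s topological order makes longest paths easy on a DAG; names here are illustrative):

import networkx as nx

def longest_path_lengths(graph, source):
    # length (in edges) of the longest path from `source` to every reachable
    # node; assumes `graph` is a DAG
    dist = {source: 0}
    for node in nx.topological_sort(graph):
        if node not in dist:
            continue
        for succ in graph.successors(node):
            dist[succ] = max(dist.get(succ, -1), dist[node] + 1)
    return dist

Packages with small values here sit close to the compilers, so migrating them first unblocks the largest portions of the graph.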

Topological sort only works on an acyclic graph, though, and conda-forge’s dependency graph contains some cycles, so certain edges would need to be excised to remove all cycles. This introduces its own difficulty: removing edges up front may make certain nodes unreachable from the compiler nodes. The solution is to ignore cycle-completing edges while performing the topological sort itself.

from copy import deepcopy

def cyclic_topological_sort(graph, sources):
    # topologically sort `graph` starting from `sources`, ignoring any edge
    # that would complete a cycle
    g2 = deepcopy(graph)
    order = []
    for source in sources:
        _visit(g2, source, order)
    return reversed(order)

def _visit(graph, node, order):
    # depth-first post-order traversal; the visited flag both prevents
    # revisiting nodes and causes cycle-completing edges to be skipped
    if graph.node[node].get("visited", False):
        return
    graph.node[node]["visited"] = True
    for n in graph.neighbors(node):
        _visit(graph, n, order)
    order.append(node)

The above code performs a depth-first topological sort starting from the input source nodes, but instead of failing when a cycle is encountered, it simply ignores the edge completing the cycle, effectively removing it from the graph.
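
For instance, a rebuild order for the whole ecosystem could be produced with a call like the one below, where gx and the source package names are placeholders for the conda-forge dependency graph and its compiler nodes:

# hypothetical usage: gx is the conda-forge dependency graph and the sources
# are the compiler packages that everything downstream ultimately depends on
rebuild_order = list(cyclic_topological_sort(gx, ["toolchain", "gcc"]))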

With all of the migrations that need to be done in order to switch to the new compilers, conda-forge will need to run a lot of builds on Travis CI, CircleCI, and AppVeyor. There is little that can be done to curb our CI usage besides limiting the number of migrations we issue, but it does help for packages to be built as noarch. A noarch package only needs to be built once (on CircleCI), freeing up CI resources for other builds. This does not apply to packages built with compilers, but it reduces the CI load from standard version-bump builds. Building as many packages as possible as noarch would drastically reduce CI usage, which is especially important while the syntax and compiler migrations are occurring. Therefore, I wrote a migrator to add noarch to all eligible packages (regro/cf-scripts#199).
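
Conceptually the change this migrator makes to a recipe is small. A naive sketch of inserting a noarch: python entry under the build section is shown below; this is illustrative text manipulation, not the bot’s actual implementation, and it omits the checks needed to decide whether a package can really be noarch:

import re

def add_noarch(meta_yaml_text):
    # naive sketch: add "noarch: python" directly under the build: section,
    # assuming the recipe does not already declare noarch
    if re.search(r"^\s*noarch:", meta_yaml_text, flags=re.MULTILINE):
        return meta_yaml_text
    return re.sub(r"(^build:\s*\n)", r"\1  noarch: python\n",
                  meta_yaml_text, count=1, flags=re.MULTILINE)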

In addition to the new Jinja2 syntax for compilers, conda-build 3 introduces other new features. One of these is the ability to specify multiple locations for a package’s source code. When the bot bumps the version of a feedstock, the source URLs also change to point at the new version, and the bot then needs to update the checksum for each of these URLs. Initially, the bot assumed there was a single URL per feedstock, so it would find a URL, hash the downloaded source, and replace every checksum in the recipe with that one hash. To make the bot compatible with recipes containing multiple URLs, I gave it the ability to update each checksum with the correct hash (regro/cf-scripts#163 and regro/cf-scripts#182). With this new functionality, the bot helps keep conda-forge up to date as more recipes take advantage of conda-build 3’s new features.
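
The core of that per-URL work is simply downloading each new source archive and computing its hash; a minimal sketch (the function name and the choice of sha256 are illustrative):

import hashlib
import requests

def sha256_of_url(url):
    # stream the source archive at `url` and return its sha256 hex digest
    h = hashlib.sha256()
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=2 ** 16):
            h.update(chunk)
    return h.hexdigest()

# each source URL in the recipe is then paired with its own fresh hash, e.g.
# new_hashes = {url: sha256_of_url(url) for url in source_urls}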

There are many challenges in maintaining a package ecosystem, but automation can alleviate some of the pain of dealing with them. Automation also makes the process easy to reproduce, so my work this summer will help conda-forge not just in switching to Anaconda’s new toolchains, but whenever a subsection of the graph needs to be rebuilt for binary compatibility. While the bot is mostly ready for the compiler switch, there is still some work to be done before it can start. Conda-smithy needs to be updated so that packages are built using both the old and new compilers, with the results pushed to different channels; a pull request for this is currently in progress (conda-forge/conda-smithy#836). While the compiler migration can start before all of the syntax migration is finished, eventually every package will need to be rebuilt and will need the new syntax. Many packages have syntax-migration PRs that have not been merged yet, and many others the bot is unable to update; these should be identified so that conda-forge can proceed with rebuilding them and anything that depends on them. It is also important to identify which of these packages are most pressing to update (those with short paths to the compilers and many dependents) so that they are given priority. Even once everything is built with the new compilers, conda-forge is not done with rebuilding: whenever a pinning in conda-forge’s central pinning file changes, the packages that use that pinning should be rebuilt. Most of the infrastructure for this is now in place, but it still requires creating a subgraph, instantiating a rebuild Migrator to act on that subgraph, and adding that code to the bot. Since the bot already inspects feedstocks to find new versions, it should eventually be able to inspect conda-forge-pinning-feedstock, identify pinning changes, and automatically start a rebuild Migrator.

Overall, the bot is a great benefit to conda-forge’s ecosystem. With such a large number of packages and contributors, it is difficult to make sure every package is kept up to date with changes to both its source code and the conda-forge ecosystem. Keeping the entire ecosystem centrally managed might help organize large-scale efforts such as rebuilding dependency chains, but it would make it difficult to maintain all of the packages on conda-forge. Central management could also dissuade newcomers from contributing to the project, and would likely make conda-forge more selective about which packages it allows on the channel, introducing the problems present in Anaconda’s default channel. But if the system is completely distributed, it is difficult to make sure that packages managed by different recipe maintainers play well with each other. The bot strikes a balance between these two extremes: the ecosystem remains distributed, with many different recipe maintainers, while conda-forge’s core team gains a tool to communicate standards (pinning changes, updated Jinja2 syntax, etc.) by issuing PRs that maintainers can accept, reject, or change. The advances made to the bot this summer help conda-forge with the ever-present challenge of ensuring binary compatibility, and provide conda-forge’s core team with the tools to communicate to recipe maintainers the steps necessary to preserve compatibility as ABI changes are introduced to the ecosystem.