Publishing Proprietary Python Packages on PyPI Using Poetry

Here at Aotu, we’re preparing to open-source our BrainFrame Client (more news on this to come soon). BrainFrame is our platform for performing deep learning on a wide variety of video sources, using containerized models and algorithms, and our client serves as the interface to the system for configuration and display of results. 

Once the client is open-sourced, we want our users to be able to not just use our pre-built executables, but also to develop and build the client themselves on their own machines. Unfortunately, the client currently has a dependency on Gstly, our internal, closed-source library that we use to manage our GStreamer streaming logic. It’s used by both our backend and our client, and needs to remain private. 

In the future we’d like to remove the client’s dependency on it, replacing it with simpler, more lightweight streaming logic. This would allow our client to be completely open-source, but it’s not on the table quite yet. One of the hurdles with open-sourcing previously closed-source software is that it’s often treated as an all-or-nothing move. This is our attempt at making that transition a bit more piecemeal.

We needed a way for our users to be able to use Gstly without having access to its source code. We came up with the idea to use Cython to obfuscate the code before distributing the library using PyPI (the official third-party Python software repository). This allows the BrainFrame Client to use it as a normal project dependency without us having to distribute its source.

Maybe you too want to publish a proprietary package to PyPI, distributing it while keeping the source private. This can be especially useful if you want an otherwise open-source project to use closed-source packages as dependencies. In this tutorial, we’ll learn how to obfuscate a library using Cython, and build/publish it using Poetry.

Cython and Poetry

Cython is a tool most commonly used to speed up the execution of Python programs through a secondary compilation step. It works by transpiling Python code into C code, and then compiling that into machine code binaries, which can then be imported in other Python scripts as if they were normal Python modules. While we don’t really care about the execution speedup for our use-case here, the compilation will instead serve as an easy method to obfuscate an otherwise hard-to-obfuscate language.

At Aotu, we’ve standardized on using Poetry as our Python dependency management tool—we love its simplicity and ease of use, and have high hopes for its continued progression. However, that simplicity comes at the expense of not covering as many unconventional use-cases, so we’re going to have to get hacky to make it do what we need. 

In this tutorial, we’re going to be making use of an undocumented feature (as of Poetry 1.1.4). If you’re not using Poetry, or want a solution that’s a bit more stable, you should be able to adapt what we do here to use other tools and methods, such as a plain setup.py file and twine, with only a bit more work.

Poetry Configuration

Poetry has a (currently undocumented) feature that will allow us to bypass its standard wheel build procedure and substitute it with our own script. We’ll add the following section to our pyproject.toml to enable it:

[tool.poetry.build]
script = "build.py"
generate-setup-file = false

Typically, Poetry would use your project’s pyproject.toml to generate a behind-the-scenes setup.py file which it would then execute to perform tasks such as “build” and “install”. However, here we tell Poetry to not generate a setup.py file when building packages and to instead use our own build script, which we’ll call build.py. Poetry will execute this script whenever we build our release wheel (but not during a user’s package installation). In the next section, we’ll go over what it needs to do and how it works. 

This build script is going to use Cython to compile our code, so we’ll need to make sure that it’s added to our Poetry development dependencies. This ensures that only developers of our package will need Cython, and not the end-users of the package. We can either run poetry add --dev Cython, or manually add the desired version to the tool.poetry.dev-dependencies section and then run poetry update.

[tool.poetry.dev-dependencies]
Cython = "^0.29.21"  # Latest version at time of publishing

Lastly, as the whole point of this procedure is to prevent the distribution of our source, we need to make sure the built package does not include any Python code. Poetry automatically adds all files in the project’s source directory to the built package, including the very Python files that we’re trying to keep private! We need to explicitly configure it to not add these files. To do so we’ll add an exclude key to our tool.poetry section. This uses Path.glob matching, so read the pathlib documentation on glob patterns if the syntax is unfamiliar.

Additionally, if you have ignored .so files (or Mac/Windows binary equivalents) in your project’s .gitignore file, you’ll need to manually include them in your pyproject.toml, as Poetry automatically leaves out any files that are ignored by the project’s VCS.

[tool.poetry]
# ...
exclude = ["SRC/**/*.py"]  # replace SRC with the root of your source
include = ["SRC/**/*.so"]  # And/or Windows/Mac equivalents
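Before trusting these patterns, it can help to preview what they match. The following is a minimal sketch that applies Path.glob to a throwaway directory tree; the preview_glob helper and the file names are hypothetical and only for illustration, and Poetry’s own matching may differ in edge cases.

```python
import tempfile
from pathlib import Path


def preview_glob(project_root: Path, pattern: str) -> list:
    """Return files under project_root matching pattern, as sorted relative POSIX paths."""
    return sorted(
        path.relative_to(project_root).as_posix()
        for path in project_root.glob(pattern)
        if path.is_file()
    )


# Build a throwaway project tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "SRC" / "sub").mkdir(parents=True)
    for name in ("SRC/__init__.py", "SRC/sub/stream.py", "SRC/sub/stream.so", "build.py"):
        (root / name).touch()

    # The same patterns as in the pyproject.toml snippet above.
    excluded = preview_glob(root, "SRC/**/*.py")
    included = preview_glob(root, "SRC/**/*.so")

print("excluded:", excluded)
print("included:", included)
```

Note that the ** segment also matches zero directories, so files directly inside SRC are covered, while build.py at the project root is not.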

The build.py Build Script

Now that we’re done with Poetry configuration, we’ll get to the meat and potatoes of the build process, our build.py script. This script is actually fairly straightforward and is responsible for only a few small tasks:

  • Collecting all the Python files
  • Cythonizing the Python files into binary blobs
  • Copying the binaries back to our source tree for Poetry to later collect

First, we’ll import the modules we’ll use and set up some constants for later use. Make sure to change the value of SOURCE_DIR to match the structure of your project.

import multiprocessing
from pathlib import Path
from typing import List

from setuptools import Extension, Distribution

from Cython.Build import cythonize
from Cython.Distutils.build_ext import new_build_ext as cython_build_ext

SOURCE_DIR = Path("SRC")  # replace SRC with the root of your source
BUILD_DIR = Path("cython_build")

Next, we’ll write a function that collects all of the Python files and converts them into Distutils/Setuptools Extension objects. This is a common object type that most Python build scripts use, and Cython is no exception.

def get_extension_modules() -> List[Extension]:
    """Collect all .py files and construct Setuptools Extensions"""
    extension_modules: List[Extension] = []

    for py_file in SOURCE_DIR.rglob("*.py"):

        # Get path (not just name) without .py extension
        module_path = py_file.with_suffix("")

        # Convert path to dotted module name. Joining the path parts
        # (rather than replacing "/") also works on Windows.
        module_path = ".".join(module_path.parts)

        extension_module = Extension(
            name=module_path,
            sources=[str(py_file)]
        )

        extension_modules.append(extension_module)

    return extension_modules
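The path-to-module-name step is the easiest part of this function to get wrong, so here is a small standalone sketch of it. The module_name_from_path helper and the example paths are hypothetical; the pure path classes are used only so that both separator styles can be tried on any machine.

```python
from pathlib import PurePosixPath, PureWindowsPath


def module_name_from_path(py_file) -> str:
    """Turn a relative source path such as SRC/sub/stream.py into SRC.sub.stream."""
    # Drop the .py suffix, then join the remaining path components with dots.
    return ".".join(py_file.with_suffix("").parts)


posix_name = module_name_from_path(PurePosixPath("SRC/sub/stream.py"))
windows_name = module_name_from_path(PureWindowsPath(r"SRC\sub\stream.py"))

print(posix_name)
print(windows_name)
```

Both calls produce the same dotted name, which is what Setuptools expects for the Extension name regardless of the platform the build runs on.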

Next, we’ll create a function that takes in these Extension objects, and performs the Cython compilation step (i.e. the Cythonization). We’ll make use of Cython’s cythonize function and configure it with some arguments.

def cythonize_helper(extension_modules: List[Extension]) -> List[Extension]:
    """Cythonize all Python extensions"""

    return cythonize(
        module_list=extension_modules,

        # Don't build in source tree (this leaves behind .c files)
        build_dir=BUILD_DIR,

        # Don't generate an .html output file. Would contain source.
        annotate=False,

        # Parallelize our build
        nthreads=multiprocessing.cpu_count() * 2,

        # Tell Cython we're using Python 3. Becomes default in Cython 3
        compiler_directives={"language_level": "3"},

        # (Optional) Always rebuild, even if files untouched
        force=True,
    )

Finally, we’ll add the code that puts it all together. We’ll make use of the Setuptools Distribution object to handle the orchestration of the build. This is very similar to how a standard setup.py file that builds Cython code executes.

# Collect and cythonize all files
extension_modules = cythonize_helper(get_extension_modules())


# Use Setuptools to collect files
distribution = Distribution({
    "ext_modules": extension_modules,
    "cmdclass": {
        "build_ext": cython_build_ext,
    },
})

# Grab the build_ext command and copy all files back to source dir.
# Done so Poetry grabs the files during the next step in its build.
distribution.run_command("build_ext")
build_ext_cmd = distribution.get_command_obj("build_ext")
build_ext_cmd.copy_extensions_to_source()
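Once the binaries are copied back, each one should sit next to its source file under a platform-specific name. As a rough sketch of what to look for, the snippet below guesses that name from the standard library’s importlib.machinery.EXTENSION_SUFFIXES. The expected_binary_path helper and the example path are hypothetical, and we are assuming the build uses the current interpreter’s first (most specific) suffix.

```python
from importlib.machinery import EXTENSION_SUFFIXES
from pathlib import PurePosixPath


def expected_binary_path(py_file: str) -> str:
    """Guess the in-tree path of the compiled extension for a source file.

    Assumption: the build uses the first entry of EXTENSION_SUFFIXES,
    which is the most specific suffix for the running interpreter.
    """
    source = PurePosixPath(py_file)
    return str(source.with_name(source.stem + EXTENSION_SUFFIXES[0]))


binary_path = expected_binary_path("SRC/sub/stream.py")
print(binary_path)
```

The exact suffix depends on the interpreter and platform, which is also why the include pattern in pyproject.toml needs to cover the binary extension used on each platform you build on.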

Building the wheel

We’re done! Simply run poetry build --format wheel to create a .whl for your package. If you open it up with a .zip extractor, you should find binary versions of all your Python files inside.

Make sure you do not distribute an sdist package, as it will contain the uncompiled source code. We specify --format wheel so that only the wheel is built and there is no risk of accidentally publishing an sdist to PyPI. You can also make the build more verbose with -vvv, which reports which files are being built and which are being added to the .whl.
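Rather than inspecting the wheel by hand every time, the check can be automated. Here is a minimal sketch that treats the wheel as the zip archive it is and reports any .py files inside; the find_leaked_sources helper and the fake wheel built for the demo are hypothetical, and in practice you would point the function at the file Poetry writes to dist/.

```python
import tempfile
import zipfile
from pathlib import Path


def find_leaked_sources(wheel_path: Path) -> list:
    """Return the names of any .py files packaged inside the wheel."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return sorted(name for name in wheel.namelist() if name.endswith(".py"))


# Demo: build a fake wheel that deliberately contains one stray source file.
with tempfile.TemporaryDirectory() as tmp:
    fake_wheel = Path(tmp) / "example-0.1.0-cp38-cp38-linux_x86_64.whl"
    with zipfile.ZipFile(fake_wheel, "w") as wheel:
        wheel.writestr("SRC/stream.cpython-38-x86_64-linux-gnu.so", b"binary")
        wheel.writestr("SRC/leaky.py", "print('oops')")

    leaked = find_leaked_sources(fake_wheel)

print(leaked)
```

A non-empty result is a good reason to fail a release job before anything reaches PyPI.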

Publishing the wheel to PyPI

Now that we have the wheel file, we can publish it to PyPI. This is a common task through Poetry and requires nothing special for our workflow here. There are a number of available tutorials online for doing so, such as this one. All will make use of the poetry publish command.

Final Notes

One downside of our wheel files is that you’ll need to generate one for each version of Python and each platform (OS + architecture) you plan to support. Because they use pre-compiled code, they aren’t portable like Python source is. If you take a peek at other projects on PyPI, such as TensorFlow, you’ll find that this is fairly common. At Aotu, we use CI/CD build matrices to easily automate the generation of the wheels for each of our target platforms. 
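The Python version and platform a wheel targets are encoded in the last three dash-separated fields of its filename, which is a quick way to confirm that a build matrix produced what you expected. A small sketch follows; the parse_wheel_tags helper and the example filename are hypothetical.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_tags(filename: str) -> WheelTags:
    """Extract the Python, ABI, and platform tags from a wheel filename."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    stem = filename[: -len(".whl")]
    # The tags are always the final three dash-separated fields.
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return WheelTags(python_tag, abi_tag, platform_tag)


tags = parse_wheel_tags("example-0.1.0-cp38-cp38-manylinux2014_x86_64.whl")
print(tags)
```

A pure-Python wheel would instead carry tags like py3-none-any, which is exactly the portability we give up by shipping compiled code.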

Finally, if you encounter any confusion or trouble with this tutorial, we’ve set up an example repository on our GitHub here. Feel free to browse around, or even open an issue if you have trouble. We’ll be happy to help. While you’re there, feel free to check out our other public repositories, and stay tuned for the upcoming BrainFrame Client source release.

— Bryce Beagle (Github)

© 2021 Aotu. All rights reserved.