triNNity is a header-only C++17 template library with over 80 DNN convolution algorithms. It’s a collaborative effort with several other people in our research group to collect as many DNN convolution algorithms as possible in one place, and to give them clean, simple, and performant implementations. It also serves as a testbed for DNN convolution algorithm design.
The library implements normal dense convolution (both direct and GEMM-based), strided convolution, dilated convolution, group convolution, sparse convolution, Winograd convolution, FFT convolution, and more, including highly optimized specialized algorithms for cases such as 1x1 convolution.
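To make the distinction between direct and GEMM-based dense convolution concrete, here is a minimal sketch in plain C++17. It does not use triNNity's API: `ConvShape`, `conv_direct`, `im2col`, and `conv_im2col_gemm` are hypothetical names chosen for illustration, and the sketch assumes a single image in CHW layout, stride 1, and no padding. The naive triple loop stands in for the optimized GEMM routine that a BLAS back end would provide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical shape descriptor for illustration; not part of triNNity.
struct ConvShape {
  std::size_t C, H, W;  // input channels, height, width
  std::size_t M, K;     // number of kernels, kernel size (K x K)
  std::size_t outH() const { return H - K + 1; }
  std::size_t outW() const { return W - K + 1; }
};

// Direct convolution (cross-correlation, as is conventional in DNNs).
// Input is CHW, weights are MCKK, output is M x outH x outW.
std::vector<float> conv_direct(const ConvShape& s,
                               const std::vector<float>& in,
                               const std::vector<float>& w) {
  assert(in.size() == s.C * s.H * s.W);
  assert(w.size() == s.M * s.C * s.K * s.K);
  const std::size_t oh = s.outH(), ow = s.outW();
  std::vector<float> out(s.M * oh * ow, 0.0f);
  for (std::size_t m = 0; m < s.M; ++m)
    for (std::size_t y = 0; y < oh; ++y)
      for (std::size_t x = 0; x < ow; ++x) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < s.C; ++c)
          for (std::size_t ky = 0; ky < s.K; ++ky)
            for (std::size_t kx = 0; kx < s.K; ++kx)
              acc += in[(c * s.H + y + ky) * s.W + x + kx] *
                     w[((m * s.C + c) * s.K + ky) * s.K + kx];
        out[(m * oh + y) * ow + x] = acc;
      }
  return out;
}

// im2col: copy every K x K input patch into one column of a
// (C*K*K) x (outH*outW) matrix, stored row-major.
std::vector<float> im2col(const ConvShape& s, const std::vector<float>& in) {
  const std::size_t oh = s.outH(), ow = s.outW(), cols = oh * ow;
  std::vector<float> patches(s.C * s.K * s.K * cols);
  for (std::size_t c = 0; c < s.C; ++c)
    for (std::size_t ky = 0; ky < s.K; ++ky)
      for (std::size_t kx = 0; kx < s.K; ++kx) {
        const std::size_t row = (c * s.K + ky) * s.K + kx;
        for (std::size_t y = 0; y < oh; ++y)
          for (std::size_t x = 0; x < ow; ++x)
            patches[row * cols + y * ow + x] =
                in[(c * s.H + y + ky) * s.W + x + kx];
      }
  return patches;
}

// GEMM-based convolution: out (M x cols) = w (M x rows) * patches (rows x cols).
std::vector<float> conv_im2col_gemm(const ConvShape& s,
                                    const std::vector<float>& in,
                                    const std::vector<float>& w) {
  assert(in.size() == s.C * s.H * s.W);
  assert(w.size() == s.M * s.C * s.K * s.K);
  const std::size_t rows = s.C * s.K * s.K;
  const std::size_t cols = s.outH() * s.outW();
  const std::vector<float> patches = im2col(s, in);
  std::vector<float> out(s.M * cols, 0.0f);
  for (std::size_t m = 0; m < s.M; ++m)
    for (std::size_t r = 0; r < rows; ++r) {
      const float wv = w[m * rows + r];
      for (std::size_t j = 0; j < cols; ++j)
        out[m * cols + j] += wv * patches[r * cols + j];
    }
  return out;
}
```

Both functions compute the same output tensor. The GEMM route spends extra memory on the patch matrix in exchange for handing the arithmetic to a single large matrix multiplication, which is one reason the two approaches suit different layer shapes.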
Many libraries and frameworks present algorithms like
fft, and others, as monolithic operations, but there are in fact dozens of algorithmic variants of these approaches, each better suited to some kinds of convolution than others. Our paper at ASAP 2017 details many of these algorithms.
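As a small illustration of how one family splits into variants, the sketch below implements the one-dimensional Winograd variant usually written F(2,3): two outputs of a 3-tap filter computed with four multiplications instead of the six a direct evaluation needs. This is a textbook formulation written for this page, not code taken from triNNity, and the function names `direct_f2_3` and `winograd_f2_3` are hypothetical. Other Winograd variants use different tile and filter sizes, each with its own trade-off between multiplication count, extra additions, and numerical accuracy.

```cpp
#include <array>

// Direct evaluation: 2 outputs x 3 taps = 6 multiplications.
std::array<float, 2> direct_f2_3(const std::array<float, 4>& d,
                                 const std::array<float, 3>& g) {
  return {d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
          d[1] * g[0] + d[2] * g[1] + d[3] * g[2]};
}

// Winograd F(2,3): the same 2 outputs from 4 multiplications.
// The filter-side terms (gp, gm) depend only on g, so in a DNN they
// can be computed once per kernel and reused for every input tile.
std::array<float, 2> winograd_f2_3(const std::array<float, 4>& d,
                                   const std::array<float, 3>& g) {
  const float gp = (g[0] + g[1] + g[2]) * 0.5f;  // transformed filter
  const float gm = (g[0] - g[1] + g[2]) * 0.5f;  // transformed filter
  const float m1 = (d[0] - d[2]) * g[0];
  const float m2 = (d[1] + d[2]) * gp;
  const float m3 = (d[2] - d[1]) * gm;
  const float m4 = (d[1] - d[3]) * g[2];
  return {m1 + m2 + m3, m2 - m3 - m4};
}
```

The saving only applies to small, stride-1 filters of the matching size, which is one concrete sense in which a variant is suited to some convolutions and not others.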
Under the hood, the library uses BLAS, OpenMP multithreading, SIMD vectorization, and more, without any programmer intervention required. It can also run completely standalone, with none of these components enabled, or with only a subset of them. We currently support
aarch64, but support for more platforms is planned. Since the library is released as header-only C++, all that’s really required to bring up a new platform is a working compiler supporting the C++17 standard.
We have working, well-tested integration with the Intel MKL, OpenBLAS, ARM Compute Library, FFTW, and libxsmm, among others, as back-end libraries providing specific functionality (such as optimized GEMM routines).
The library is released under the BSD 3-Clause license, and is accompanied by an extensive performance benchmark suite.
triNNity DNN compiler and optimizer
We’ve developed an ahead-of-time optimization framework for DNNs, based on the partitioned Boolean quadratic programming (PBQP) formulation. It uses profiled layer timings from performance benchmarking to build a cost model, then statically chooses from among the 70+ convolution algorithms in the primitive library to produce an instantiation of a full CNN that is provably optimal under that cost model.
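The selection problem can be sketched on a simplified network. The code below is not the triNNity optimizer and does not implement a general PBQP solver: it assumes the network is a plain chain of layers, in which case the same objective (per-layer algorithm cost plus a transition cost between the choices made for adjacent layers) can be minimized exactly with dynamic programming. The names `Selection` and `select_primitives` and all cost values are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Costs = std::vector<double>;                // cost of each algorithm for one layer
using Matrix = std::vector<std::vector<double>>;  // transition cost [a][b] between layers

struct Selection {
  double total;                     // minimal total cost
  std::vector<std::size_t> choice;  // chosen algorithm index per layer
};

// node[l][a]    : profiled cost of running layer l with algorithm a
// edge[l][a][b] : cost of passing data from layer l (algorithm a)
//                 to layer l+1 (algorithm b), e.g. a layout conversion
Selection select_primitives(const std::vector<Costs>& node,
                            const std::vector<Matrix>& edge) {
  assert(!node.empty() && !node[0].empty());
  assert(edge.size() + 1 == node.size());
  const std::size_t L = node.size();
  std::vector<Costs> best(L);
  std::vector<std::vector<std::size_t>> from(L);
  best[0] = node[0];
  for (std::size_t l = 1; l < L; ++l) {
    best[l].assign(node[l].size(), std::numeric_limits<double>::infinity());
    from[l].assign(node[l].size(), 0);
    for (std::size_t b = 0; b < node[l].size(); ++b)
      for (std::size_t a = 0; a < node[l - 1].size(); ++a) {
        const double c = best[l - 1][a] + edge[l - 1][a][b] + node[l][b];
        if (c < best[l][b]) {
          best[l][b] = c;
          from[l][b] = a;
        }
      }
  }
  // Pick the cheapest final state, then walk the back-pointers.
  std::size_t b = 0;
  for (std::size_t i = 1; i < best[L - 1].size(); ++i)
    if (best[L - 1][i] < best[L - 1][b]) b = i;
  Selection s{best[L - 1][b], std::vector<std::size_t>(L)};
  for (std::size_t l = L; l-- > 0;) {
    s.choice[l] = b;
    if (l > 0) b = from[l][b];
  }
  return s;
}
```

With node costs `{{1, 3}, {4, 1}, {2, 2}}` and transition matrices `{{0, 5}, {5, 0}}` and `{{0, 1}, {1, 0}}`, picking the fastest algorithm for each layer in isolation starts with algorithm 0 and then pays a transition cost of 5, whereas the joint optimum uses algorithm 1 throughout. Capturing this interaction between neighbouring choices is what the edge costs in a PBQP formulation are for; real networks with branches and joins need a general PBQP solver rather than this chain-only shortcut.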
Our paper on the DNN optimizer appeared at CGO 2018.
We’ve run some performance comparisons with Intel’s native