Most mainstream machine learning (ML) frameworks (e.g., TensorFlow, PyTorch, MXNet), which use NVIDIA GPUs to accelerate performance, eventually call into well-optimized CUDA kernels in libraries such as cuBLAS and cuDNN. One way to view these frameworks is as wrappers that provide APIs in a high-level programming language (e.g., Python) around operator implementations in low-level CUDA code (e.g., matrix multiplication in cuBLAS, dropout in cuDNN). As a result, the operators available in these frameworks are completely determined by those software libraries, and programmers are restricted to composing these operators when implementing their ML models. However, this is not always the most efficient approach. One example is operator fusion: instead of implementing a new operator at the Python level by composing multiple existing operators, the operators are fused at the CUDA kernel level and the resulting kernel is bubbled up to the Python level as a new operator (see the CUDA sketch below). Previous work by Jeremy Appleyard et al. has shown that operator fusion can improve performance through better cache locality and reduced kernel launch overhead. More broadly, compiler optimization techniques can be applied to dynamically generate kernels tailored to a given ML model for better performance.

This presentation will cover recent trends in ML compilers, including Halide, TVM, Tensor Comprehensions, and TensorFlow XLA. We begin with motivating examples of why ML compilers can provide performance benefits over legacy frameworks. We then discuss each state-of-the-art compiler in detail, starting with Halide, which first came out in 2013 as a language and compiler targeting image processing pipelines. Finally, we wrap up the presentation with possible future directions.
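To make the idea of operator fusion concrete, the sketch below contrasts an unfused elementwise pipeline (a bias add followed by a ReLU, launched as two CUDA kernels) with a single fused kernel that does both in one pass. This is a minimal illustrative example: the kernel names, the bias-add/ReLU pair, and the tensor size are assumptions for the sketch, not code drawn from cuBLAS, cuDNN, or any of the compilers mentioned above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Unfused version: two kernels, each making a full round trip through
// global memory and each paying its own launch overhead.
__global__ void bias_add(float* x, const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += b[i];
}

__global__ void relu(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = fmaxf(x[i], 0.0f);
}

// Fused version: one kernel, one launch, and the intermediate result
// stays in registers instead of being written back to global memory.
__global__ void bias_add_relu(float* x, const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = fmaxf(x[i] + b[i], 0.0f);
}

int main() {
    const int n = 1 << 20;  // toy tensor size, chosen arbitrarily
    float *x, *b;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Unfused pipeline: two launches, x is read and written twice.
    bias_add<<<blocks, threads>>>(x, b, n);
    relu<<<blocks, threads>>>(x, n);

    // Fused pipeline: one launch, x is read and written once.
    bias_add_relu<<<blocks, threads>>>(x, b, n);

    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(b);
    return 0;
}
```

In the fused kernel the intermediate tensor never leaves registers, so the data makes a single trip through global memory and only one kernel launch is paid for, which is exactly the cache-locality and launch-overhead argument made above; an ML compiler aims to perform this kind of fusion automatically rather than by hand.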