Most mainstream machine learning (ML) frameworks (e.g., TensorFlow, PyTorch, MXNet), which use NVIDIA GPUs to accelerate performance, eventually call into well-optimized CUDA kernels in libraries such as cuBLAS and cuDNN. One way to view these frameworks is as wrappers that provide APIs in a high-level programming language (e.g., Python) around operator implementations in low-level CUDA code (e.g., matrix multiplication in cuBLAS, dropout in cuDNN). As a result, the operators available in these frameworks are completely determined by those software libraries, and programmers are restricted to composing these operators when implementing their ML models. However, this is not always the most efficient approach. One example is operator fusion: instead of implementing a new operator at the Python level by composing multiple existing operators, the operators are fused at the CUDA kernel level and the resulting kernel is bubbled up to the Python level as a new operator (see the CUDA sketch below). Previous work by Jeremy Appleyard et al. has shown that operator fusion can improve performance through better cache locality and reduced kernel launch overhead. More broadly, compiler optimization techniques can be applied to dynamically generate kernels tailored to a given ML model for better performance.

This presentation will cover recent trends in ML compilers, including Halide, TVM, Tensor Comprehensions, and TensorFlow XLA. We begin with motivating examples of why ML compilers can provide performance benefits over legacy frameworks. We then discuss each state-of-the-art compiler in detail, starting with Halide, which first came out in 2013 as a language and compiler targeting image processing pipelines. Finally, we wrap up the presentation with possible future directions.
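To make the idea of operator fusion concrete, the sketch below contrasts an unfused elementwise pipeline (a bias add followed by a ReLU, launched as two CUDA kernels) with a single fused kernel that does both in one pass. This is a minimal illustrative example: the kernel names, the bias-add/ReLU pair, and the tensor size are assumptions for the sketch, not code drawn from cuBLAS, cuDNN, or any of the compilers mentioned above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Unfused version: two kernels, each making a full round trip through
// global memory and each paying its own launch overhead.
__global__ void bias_add(float* x, const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += b[i];
}

__global__ void relu(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = fmaxf(x[i], 0.0f);
}

// Fused version: one kernel, one launch, and the intermediate result
// stays in registers instead of being written back to global memory.
__global__ void bias_add_relu(float* x, const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = fmaxf(x[i] + b[i], 0.0f);
}

int main() {
    const int n = 1 << 20;  // toy tensor size, chosen arbitrarily
    float *x, *b;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Unfused pipeline: two launches, x is read and written twice.
    bias_add<<<blocks, threads>>>(x, b, n);
    relu<<<blocks, threads>>>(x, n);

    // Fused pipeline: one launch, x is read and written once.
    bias_add_relu<<<blocks, threads>>>(x, b, n);

    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(b);
    return 0;
}
```

In the fused kernel the intermediate tensor never leaves registers, so the data makes a single trip through global memory and only one kernel launch is paid for, which is exactly the cache-locality and launch-overhead argument made above; an ML compiler aims to perform this kind of fusion automatically rather than by hand.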