1) Extending MLIR Dialects for Deep Learning Compilers - Charitha Saumya, Jianhui Li
2) Unlocking High Performance in Mojo through User-Defined Dialects - Mathieu Fehr, Jeff Niu
3) Speeding up Intel Gaudi deep-learning accelerators using an MLIR-based compiler - Jayaram Bobba
4) Quidditch: An End-to-End Deep Learning Compiler for Occamy using IREE & xDSL - Markus Böck, Sasha Lopoukhine
5) Atomic Reduction Operations - Gonzalo Brito Gadeschi
1) Extending MLIR Dialects for Deep Learning Compilers - Charitha Saumya, Jianhui Li
This talk discusses the design of XeTile, a dialect developed for expressing and compiling deep learning kernels. XeTile demonstrates that, with a few critical extensions, MLIR dialects can serve as building blocks for deep learning compilers that generate high-performance code. With the "Tile" data type and a handful of operations, the XeTile dialect greatly simplifies the lowering of dense operations: any tile-based GEMM-like algorithm can be expressed in a few lines of code, including advanced optimizations such as cooperative load/prefetch, K-slicing, and software pipelining.
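A minimal sketch of what such a kernel can look like, using op names from the public XeTile proposal (xetile.init_tile, xetile.load_tile, xetile.tile_mma, xetile.update_tile_offset); the exact syntax is an approximation, not code from the talk:

  // One work-group's 64x64 block of C = A x B, accumulated over K in 64-wide steps.
  func.func @gemm_block(%A: memref<4096x4096xf16>, %B: memref<4096x4096xf16>,
                        %C: memref<4096x4096xf32>, %m: index, %n: index) {
    %c0 = arith.constant 0 : index
    %c64 = arith.constant 64 : index
    %c4096 = arith.constant 4096 : index
    %zero = arith.constant dense<0.0> : vector<64x64xf32>
    // Tile views anchored at this block's slice of A and B.
    %a0 = xetile.init_tile %A[%m, %c0] : memref<4096x4096xf16> -> !xetile.tile<64x64xf16>
    %b0 = xetile.init_tile %B[%c0, %n] : memref<4096x4096xf16> -> !xetile.tile<64x64xf16>
    %r:3 = scf.for %k = %c0 to %c4096 step %c64
        iter_args(%a = %a0, %b = %b0, %acc = %zero)
        -> (!xetile.tile<64x64xf16>, !xetile.tile<64x64xf16>, vector<64x64xf32>) {
      %av = xetile.load_tile %a : !xetile.tile<64x64xf16> -> vector<64x64xf16>
      %bv = xetile.load_tile %b : !xetile.tile<64x64xf16> -> vector<64x64xf16>
      %mma = xetile.tile_mma %av, %bv, %acc
               : vector<64x64xf16>, vector<64x64xf16>, vector<64x64xf32> -> vector<64x64xf32>
      // Slide the tile views along K for the next iteration.
      %a1 = xetile.update_tile_offset %a, [%c0, %c64] : !xetile.tile<64x64xf16>
      %b1 = xetile.update_tile_offset %b, [%c64, %c0] : !xetile.tile<64x64xf16>
      scf.yield %a1, %b1, %mma : !xetile.tile<64x64xf16>, !xetile.tile<64x64xf16>, vector<64x64xf32>
    }
    %ct = xetile.init_tile %C[%m, %n] : memref<4096x4096xf32> -> !xetile.tile<64x64xf32>
    xetile.store_tile %r#2, %ct : vector<64x64xf32>, !xetile.tile<64x64xf32>
    return
  }

Optimizations such as cooperative load/prefetch or software pipelining layer onto this same structure rather than requiring the kernel to be rewritten.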
2) Unlocking High Performance in Mojo through User-Defined Dialects - Mathieu Fehr, Jeff Niu
Traditionally, a clear separation exists between language libraries and compiler intermediate representations (IRs): libraries are typically limited to API calls that the compiler cannot reason about, while IRs consist of instructions that only the compiler can analyze and transform. Embedded DSLs blur this line, as they often use the host language's introspection mechanisms, or its macro system, to embed their own compiler. In this talk, we will present how we merge the concepts of libraries and embedded DSLs by providing first-class support in Mojo for extending its MLIR-based compiler.
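Mojo's surface syntax for user-defined dialects is not reproduced here; as a rough MLIR-side analogue of the idea, the upstream IRDL dialect lets a dialect and its operations be declared as data rather than C++ (whether the talk's mechanism resembles this is an assumption):

  // Declares a hypothetical dialect with one verified op: fma(f32, f32, f32) -> f32.
  irdl.dialect @mymath {
    irdl.operation @fma {
      %f32 = irdl.is f32
      irdl.operands(%f32, %f32, %f32)
      irdl.results(%f32)
    }
  }

The payoff in either setting is the same: user code gains operations that the compiler can verify, analyze, and transform like any built-in IR.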
3) Speeding up Intel Gaudi deep-learning accelerators using an MLIR-based compiler - Jayaram Bobba
Middle-end optimizations play a critical role in generating high-performance code for deep learning accelerators. In this talk, we will present an MLIR-based fusing compiler that generates optimized LLVM IR from a high-level graph IR; that LLVM IR is then compiled by an LLVM backend for execution on the tensor processing cores of Intel Gaudi deep learning (DL) accelerators. This compiler has been in use for the past three generations of Gaudi products and provides around 54% average performance improvement at the model level. The talk will cover the lowering pipeline, how we leverage upstream MLIR dialects, and some key optimizations and learnings from compiling deep learning workloads for Gaudi.
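As a hypothetical illustration of the kind of fusion such a middle end performs (shapes and ops invented for illustration, not taken from the talk), a bias-add followed by a ReLU can be collapsed into a single upstream linalg.generic, which can then be tiled and lowered to LLVM IR:

  #map = affine_map<(d0) -> (d0)>
  // relu(x + bias) in one pass over the data instead of two kernels.
  %fused = linalg.generic
      {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]}
      ins(%x, %bias : tensor<1024xf32>, tensor<1024xf32>)
      outs(%init : tensor<1024xf32>) {
  ^bb0(%xe: f32, %be: f32, %out: f32):
    %zero = arith.constant 0.0 : f32
    %sum = arith.addf %xe, %be : f32
    %relu = arith.maximumf %sum, %zero : f32
    linalg.yield %relu : f32
  } -> tensor<1024xf32>

Fusion at this level avoids materializing the intermediate tensor in memory, which is the usual source of such gains.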
4) Quidditch: An End-to-End Deep Learning Compiler for Occamy using IREE & xDSL - Markus Böck, Sasha Lopoukhine
We present Quidditch, a neural network compiler and runtime that provides an end-to-end workflow from a high-level network description to high-performance code running on ETH Occamy, one of the first chiplet-based AI research hardware accelerators. Quidditch builds on IREE, an AI compiler and runtime focused on GPUs, and on a micro-kernel compiler for RISC-V-based accelerators built with xDSL.
5) Atomic Reduction Operations - Gonzalo Brito Gadeschi
Atomic reductions are atomic read-modify-write operations that do not return a value, enabling them to leverage hardware support in CPU architectures like Arm and x86 as well as in GPU ISAs such as NVIDIA PTX. Despite the significant performance improvements they offer, these operations are not currently exposed in LLVM IR. This talk introduces atomic reduction operations, explores their performance benefits, explains why optimizing atomicrmw into atomic reductions is, in general, unsound, and discusses how to provide first-class exposure for them in LLVM IR.
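A sketch in MLIR's LLVM dialect makes the distinction concrete; llvm.atomicrmw below is a real op, while llvm.atomic_reduce is an invented name standing in for the first-class form the talk proposes:

  // Today: atomicrmw always yields the prior value, even if nothing uses it.
  %old = llvm.atomicrmw fadd %ptr, %x monotonic : !llvm.ptr, f32

  // Hypothetical value-less form: with no result to return, the backend is
  // free to emit a one-way hardware reduction (e.g. a PTX red instruction)
  // instead of a full read-modify-write.
  llvm.atomic_reduce fadd %ptr, %x monotonic : !llvm.ptr, f32

Note that simply rewriting the first form into the second whenever %old is dead is exactly the optimization the talk argues is unsound in general.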