1) Clacc 2023: OpenACC, C++/Kokkos, and the end of ECP - Joel E. Denny
2) LLVM-IR centric Large Language Models: The Devil is in the IR - Ludger Paehler
3) The ComPile: A Dataset of LLVM-IR for ML in Compilers, and Large-Scale Statistical Introspection of LLVM - Aiden Grossman
4) A Compilation Framework for High-Performance Fast Fourier Transform Code Generation - Yifei He
5) Improving the outer-loop vectorization in LLVM - Etienne Renault, Lou Knauer
Clacc 2023: OpenACC, C++/Kokkos, and the end of ECP - Joel E. Denny
Clacc has developed OpenACC compiler, runtime, and profiling support for C and C++ by extending Clang and LLVM under the Exascale Computing Project (ECP). OpenACC support in Clang and LLVM can facilitate the programming of GPUs and other accelerators in HPC applications running on heterogeneous computing architectures. Clacc is hosted publicly on GitHub (https://github.com/llvm-doe-org/llvm-project/wiki). The purpose of this poster is to present the latest developments in the Clacc project as well as future plans, especially considering the end of ECP later this year. We will cover the following topics: recent Clacc support for OpenACC in C++, support for Kokkos’s new OpenACC backend, and a general summary of Clacc's current OpenACC feature support. We will also invite the community to give feedback on their interest in seeing OpenACC support in Clang going forward.
LLVM-IR centric Large Language Models: The Devil is in the IR - Ludger Paehler
We present a study exploring the importance of IR-centric tokenization schemes for large language models for IR. Taking advantage of ComPile, a novel dataset of IR for machine learning and statistical analysis, we explore various tokenization schemes, ranging from treating IR as text to baking the entirety of our knowledge about LLVM IR into a custom tokenization scheme, and evaluate the performance of these schemes on small- to medium-sized Transformer models. To evaluate the performance of our models and understand the tokenization’s importance to downstream model performance, we focus on a set of compiler-relevant machine learning tasks including, but not limited to, the prediction of code size on x86. In addition, we will outline future applications and the work needed to enable a larger array of compiler-relevant downstream tasks, focussing particularly on potential future uses for performance analysis.
The ComPile: A Dataset of LLVM-IR for ML in Compilers, and Large-Scale Statistical Introspection of LLVM - Aiden Grossman
We present ComPile, a massive 2.4TB dataset of LLVM-IR constructed from production code across multiple package ecosystems. It is intended for the training of compiler-related ML models, but is also of high interest to compiler work outside machine learning. We outline the methodology used to construct ComPile, including our process for gathering packages from packaging ecosystems, how we leverage the LLVM ecosystem to build these packages, and the steps we take to deduplicate the dataset. We furthermore present a previously impossible prototypical statistical analysis of several IR features, including comparisons between optimization pipelines and language frontends, showing the usefulness of the dataset for cross-language statistical analysis of the usage of LLVM.
A Compilation Framework for High-Performance Fast Fourier Transform Code Generation - Yifei He
Fast Fourier Transform (FFT) libraries are one of the most critical HPC software components. Current popular FFT libraries, such as FFTW, were introduced decades ago, which makes it hard for them to benefit from modern compiler infrastructures, such as LLVM and Multi-Level Intermediate Representation (MLIR), and they do not support emerging hardware architectures, such as GPUs. We present the design and development of FFTc, a compilation framework based on MLIR and LLVM to generate FFT libraries for CPUs and GPUs seamlessly. FFTc consists of a Domain-Specific Language for defining FFT algorithms, abstractions, and MLIR and LLVM transformations. We leverage MLIR dialects to optimize FFTc with improved data layout for complex-value arrays, vectorization, and sparsification, and to port FFTc to GPU systems. We show that, on CPUs, the performance of the FFTc-generated FFT is comparable to FFTW performance, and we also present the initial performance results for FFTc on GPUs.
Improving the outer-loop vectorization in LLVM - Lou Knauer, Etienne Renault
Nowadays, LLVM's vectorization path (VPlan infrastructure) produces highly optimized code but is incapable of outer-loop vectorization. Outer-loop vectorization is only supported in LLVM through the VPlan-native path, which requires the -enable-vplan-native-path option to be set explicitly.
The VPlan-native path is an alternative vectorization code path that is purely pragma/metadata driven and has neither memory-dependency checks nor a cost model. Consequently, it generates poor-quality code because it vectorizes everything (even instructions that compute uniform values) and uses gathers/scatters for every memory access.
This poster presents transformations in the VPlan-native path that enhance code quality in two ways:
- Produce consecutive or uniform memory accesses in order to avoid gathers/scatters
- Compute the induction variables of both the outer loop (the one being vectorized) and the inner loop with scalar instructions
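
As an illustration of the kind of loop nest this applies to, here is a minimal sketch in C. It is not taken from the poster: the function name column_sums and the data layout are assumptions chosen for the example. The outer loop carries the vectorization pragma, the loads a[j * n + i] are consecutive along the outer index i, and the inner index j has the same value in every vector lane.

```c
#include <stddef.h>

/* Hypothetical example: sums each column of an m-by-n row-major matrix.
 * a[j * n + i] is element (row j, column i); out[i] receives the sum of
 * column i. */
void column_sums(const float *a, float *out, size_t n, size_t m) {
  /* Request vectorization of the OUTER loop. With Clang this only takes
   * effect on a loop nest when the VPlan-native path is enabled, e.g.
   * with -mllvm -enable-vplan-native-path. Other compilers ignore the
   * pragma and run the scalar loop. */
#pragma clang loop vectorize(enable)
  for (size_t i = 0; i < n; ++i) {
    float sum = 0.0f;
    /* j is uniform across lanes of i, so it can stay a scalar
     * induction variable. */
    for (size_t j = 0; j < m; ++j) {
      /* For adjacent values of i these addresses are adjacent, so a
       * wide load could replace a gather. */
      sum += a[j * n + i];
    }
    out[i] = sum; /* consecutive stores along i */
  }
}
```

Compiling with something like clang -O2 -mllvm -enable-vplan-native-path and inspecting the emitted IR or assembly is one way to see whether gathers/scatters are generated for the accesses in the inner loop.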