Date & Time
Wednesday, October 29, 2025, 3:15 PM - 4:15 PM
Name
Poster Session
Session Type
Poster
Description

Abstract/s

An Automated Semantic Validation/Evaluation Framework for Processing-in-Memory Systems - Akash Madhu
We propose an automated static semantic validation framework for programs targeting processing-in-memory (PIM) systems, specifically the commercial UPMEM PIM architecture. Our framework leverages multiple LLVM tools, including Clang LibTooling, the Clang Static Analyzer, LLVM IR passes, and a custom-built ThreadSanitizer tailored to the UPMEM system. Together, these tools enable novice programmers to write efficient, safe, and domain-specific programs while adhering to the constraints and resource limitations of the UPMEM architecture. We have developed static validation rules and verified their correctness against the PIM system. Our approach can also help extend the aforementioned LLVM tools for PIM-specific optimizations. This work is inspired by similar efforts in GPU programming, which assist developers unfamiliar with hardware-specific performance pitfalls. Processing-in-memory is an emerging technology with significant potential for accelerating memory-bound workloads, and our framework aims to lower the barrier to adoption for programmers.
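As an illustration of the kind of rule such a framework can express (not code from the framework itself; the checker name, matcher choice, and the 256-byte stack budget below are hypothetical), a minimal Clang LibTooling tool might flag large stack-allocated arrays, since each UPMEM tasklet has only a small WRAM-resident stack:

// Hypothetical Clang LibTooling checker (illustrative only): warn on large
// stack-allocated arrays, which can overflow a tasklet's WRAM-resident stack.
#include "clang/ASTMatchers/ASTMatchFinder.h"
#include "clang/ASTMatchers/ASTMatchers.h"
#include "clang/Basic/Diagnostic.h"
#include "clang/Tooling/CommonOptionsParser.h"
#include "clang/Tooling/Tooling.h"
#include "llvm/Support/CommandLine.h"

using namespace clang;
using namespace clang::ast_matchers;
using namespace clang::tooling;

static llvm::cl::OptionCategory PimCheckCategory("pim-check options");

class LargeLocalArrayCheck : public MatchFinder::MatchCallback {
public:
  void run(const MatchFinder::MatchResult &Result) override {
    const auto *Var = Result.Nodes.getNodeAs<VarDecl>("local");
    if (!Var || !Var->getType()->isConstantArrayType())
      return;
    CharUnits Size = Result.Context->getTypeSizeInChars(Var->getType());
    if (Size.getQuantity() <= 256)  // illustrative per-tasklet stack budget
      return;
    DiagnosticsEngine &Diag = Result.Context->getDiagnostics();
    unsigned ID = Diag.getCustomDiagID(
        DiagnosticsEngine::Warning,
        "local array of %0 bytes may overflow the tasklet stack");
    Diag.Report(Var->getLocation(), ID) << (unsigned)Size.getQuantity();
  }
};

int main(int argc, const char **argv) {
  auto Options = CommonOptionsParser::create(argc, argv, PimCheckCategory);
  if (!Options) {
    llvm::errs() << Options.takeError();
    return 1;
  }
  ClangTool Tool(Options->getCompilations(), Options->getSourcePathList());
  LargeLocalArrayCheck Check;
  MatchFinder Finder;
  // Match every variable with local (stack) storage; the callback filters
  // for fixed-size arrays above the budget.
  Finder.addMatcher(varDecl(hasLocalStorage()).bind("local"), &Check);
  return Tool.run(newFrontendActionFactory(&Finder).get());
}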

Optimizing IREE to Match llama.cpp: An Introduction to IREE Optimization for Newbies Through a Benchmark Journey - Uiseop Eom
Deploying efficient machine learning inference engines often involves rigorous benchmark comparisons to achieve optimal performance. In this talk, we present a benchmark analysis comparing the inference performance of IREE against llama.cpp, focusing specifically on executing the open-source llama3 model.

Attendees will learn about the performance gaps initially observed between IREE and llama.cpp, and the targeted optimizations we implemented within the iree-compiler to bridge these gaps. The session will introduce common performance bottlenecks faced by new users of iree-compiler and iree-runtime, along with practical profiling tips. We will demonstrate practical MLIR optimizations and how to implement them. This talk aims to be especially valuable for newcomers looking to understand and enhance performance when leveraging IREE for model inference tasks.

Porting Linear Algebra Inference Kernels to GPUs using ALPAKA and Unified BLAS: Towards LLVM-Compatible Backend Abstractions - S Akash
I present my work on extending TMVA-SOFIE's machine learning inference pipeline with GPU acceleration using ALPAKA, a C++ abstraction library for portable parallel programming. On top of ALPAKA, we build a unified BLAS interface that dynamically dispatches to cuBLAS, rocBLAS, or CPU backends. While LLVM and MLIR provide powerful frontend and intermediate representations, our work complements them by offering a backend runtime layer capable of executing common ML kernels, such as GEMM, portably and efficiently. This approach aims to make the LLVM-based inference stack hardware-agnostic and easier to deploy across HPC platforms.
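To make the dispatch idea concrete, here is a minimal sketch of a backend-selecting GEMM entry point. The names, row-major layout, and host-side signature are assumptions for illustration only, not the TMVA-SOFIE/ALPAKA implementation:

// Illustrative sketch (hypothetical names): one GEMM entry point,
// C = alpha*A*B + beta*C, routed to a CPU fallback or a GPU BLAS backend.
#include <cstddef>
#include <stdexcept>

enum class Backend { Cpu, Cuda, Rocm };

// Naive row-major CPU fallback; reference semantics only, no blocking.
static void gemm_cpu(std::size_t m, std::size_t n, std::size_t k,
                     float alpha, const float *A, const float *B,
                     float beta, float *C) {
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (std::size_t p = 0; p < k; ++p)
        acc += A[i * k + p] * B[p * n + j];
      C[i * n + j] = alpha * acc + beta * C[i * n + j];
    }
}

// Unified entry point: in a real build, the Cuda/Rocm branches would call
// cublasSgemm / rocblas_sgemm on device buffers behind this same signature.
void gemm(Backend backend, std::size_t m, std::size_t n, std::size_t k,
          float alpha, const float *A, const float *B,
          float beta, float *C) {
  switch (backend) {
  case Backend::Cpu:
    gemm_cpu(m, n, k, alpha, A, B, beta, C);
    return;
  case Backend::Cuda:
  case Backend::Rocm:
    // Device paths are omitted in this sketch.
    throw std::runtime_error("GPU backend not compiled into this sketch");
  }
}

In the actual work, ALPAKA's accelerator abstractions would supply the device selection and memory handling that this sketch elides.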

Better multithreading with LLVM - Jameson Nash
While no sane compiler would optimize atomics, we aren't always particularly sane. We'll look at three ways that LLVM could interact better with threads: thread-related static analysis warnings, new optimizations for atomic update loops, and a work-stealing runtime for the LLVM backend.
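For readers unfamiliar with the pattern, the following self-contained C++ snippet shows a typical atomic update loop (an atomic maximum built from a compare-exchange retry loop). It is an illustrative example of the idiom such optimizations target, not code from the talk:

// An atomic maximum implemented with a compare-exchange retry loop: the
// kind of "atomic update loop" a compiler could, in principle, recognize
// and restructure (e.g., skip the RMW when the stored value cannot change).
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

void atomic_max(std::atomic<int> &target, int value) {
  int current = target.load(std::memory_order_relaxed);
  // Retry until we either install `value` or observe a larger element;
  // `current` is refreshed by compare_exchange_weak on failure.
  while (current < value &&
         !target.compare_exchange_weak(current, value,
                                       std::memory_order_relaxed)) {
  }
}

int main() {
  std::atomic<int> best{0};
  std::vector<std::thread> workers;
  for (int t = 0; t < 4; ++t)
    workers.emplace_back([&, t] {
      for (int i = 0; i < 1000; ++i)
        atomic_max(best, t * 1000 + i);
    });
  for (auto &w : workers)
    w.join();
  std::printf("max = %d\n", best.load());  // expect 3999
  return 0;
}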

XeGPU: A High-Performance MLIR Dialect for Intel GPU Programming - Chao Chen, Jianhui Li
We present XeGPU, the official MLIR dialect for programming Intel GPUs. Built on experience from the XeTile prototype introduced last year, XeGPU brings a simplified, layout-guided programming model tailored for tile-based GPU kernel development. By representing tile decomposition through layout annotations instead of multiple explicit tiled loops, XeGPU produces IR that is both more concise and easier to reason about and optimize. In contrast to Triton and CUTE, which employ general-purpose layout algebra, XeGPU introduces a straightforward nested block layout abstraction. This design effectively captures common patterns such as transpose, broadcast, reduction, and matrix-matrix multiply (MMA), enabling concise and performant kernel construction. 

XeGPU allows developers to define high-level workgroup operations using layout-annotated types, which guide hierarchical lowering using upstream MLIR infrastructure. The lowering process includes workgroup-to-subgroup distribution, blocking, and subgroup-to-workitem decomposition, all expressed through MLIR’s transformation pipelines. The dialect lowers to Intel GPU ISA through LLVM-based code generation and enables mapping to hardware features such as shared local memory and matrix instructions.

We evaluate XeGPU on a range of matrix multiplication (GEMM) workloads and compare it against hand-written reference kernels. XeGPU achieves competitive performance while maintaining a compact and composable intermediate representation. This work demonstrates how a domain-specific layout abstraction can simplify GPU programming without compromising performance, and how MLIR can serve as a foundation for building production-grade AI compilers.

Location Name
California Ballroom Salons 1-4