Overview
TorchFuser is (going to be) a compiler and runtime framework designed to enhance PyTorch training performance by fusing common transformer operations. Drawing inspiration from projects like MLC-LLM, which leverages TVM for optimization, TorchFuser aims to integrate seamlessly into PyTorch workflows, utilizing MLIR-based compilation strategies.
Motivation
Training large-scale models, such as LLMs and other transformers, is computationally intensive, and standard implementations leave significant performance on the table. While solutions like FlashAttention have optimized specific components, there's a need for a more generalized approach that:
- Integrates seamlessly with PyTorch, possibly through a `@torchfuser.jit` decorator.
- Leverages MLIR for backend optimizations, similar to Triton's approach.
- Supports diverse hardware, including NVIDIA GPUs, AMD GPUs, and Apple’s MLX.
- Achieves tangible performance gains, targeting at least a 10% improvement over standard PyTorch operations.
Technical Approach
- MLIR Integration:
  - Utilize `torch-mlir` as a foundation, benefiting from its alignment with PyTorch's ecosystem (see the lowering sketch after this list).
  - Draw insights from Triton's MLIR-based backend, which compiles Python-decorated functions into optimized GPU kernels.
- Decorator-Based API:
  - Introduce a `@torchfuser.jit` decorator, allowing users to annotate functions for optimization, akin to Numba's approach (see the API sketch after this list).
- Hardware Abstraction:
  - Design the compiler to generate optimized code for various hardware backends, ensuring broad compatibility.
- Performance Optimization:
  - Implement techniques inspired by FlashAttention-3, such as overlapping computation and data movement and leveraging low-precision computations (e.g., FP8); the stream-overlap sketch after this list illustrates the idea with stock PyTorch APIs.
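To make the MLIR Integration item concrete, here is a minimal sketch of lowering a small PyTorch module through `torch-mlir`. Note the entry point has changed across releases (older releases expose `torch_mlir.compile`; newer ones go through the FX importer), so treat the exact call as version-dependent:

```python
# Minimal sketch: lower a tiny module to the linalg dialect via torch-mlir.
# Assumes an older torch-mlir release that ships `torch_mlir.compile`;
# check your installed version for the current entry point.
import torch
import torch_mlir

class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.fc(x))

module = torch_mlir.compile(
    TinyMLP(),
    torch.randn(4, 16),                 # example input for shape inference
    output_type="linalg-on-tensors",    # lower the Torch dialect to linalg
)
print(module)  # MLIR text that downstream fusion passes could transform
```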
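The decorator API could look roughly like the following. Everything here is hypothetical — `torchfuser.jit` does not exist yet, and the compile pipeline is stubbed out with an eager fallback — but it shows the intended shape: trace once per input signature, compile, cache, and dispatch by device:

```python
# Hypothetical sketch of the proposed @torchfuser.jit decorator.
# All names are placeholders for the design, not an existing API.
import functools
import torch

_COMPILED_CACHE = {}

def jit(fn):
    """Compile `fn` once per (shape, dtype, device) signature and cache it."""
    @functools.wraps(fn)
    def wrapper(*args):
        # Keying on device.type is where per-backend (CUDA/ROCm/MLX)
        # dispatch would plug in.
        key = (fn, tuple((a.shape, a.dtype, a.device.type) for a in args))
        if key not in _COMPILED_CACHE:
            # Placeholder for the real pipeline: export the graph, run
            # MLIR fusion passes, generate backend code. Eager fallback here.
            _COMPILED_CACHE[key] = fn
        return _COMPILED_CACHE[key](*args)
    return wrapper

@jit
def fused_mlp(x, w1, w2):
    # The kind of matmul + elementwise chain TorchFuser would fuse.
    return torch.relu(x @ w1) @ w2
```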
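As a flavor of the Performance Optimization item: the compute/copy overlap that FlashAttention-3 performs at the kernel level can be sketched at the framework level with stock PyTorch CUDA streams (this sketch assumes a CUDA device is available):

```python
# Overlap host-to-device copies with compute using a side CUDA stream.
import torch

assert torch.cuda.is_available(), "sketch assumes a CUDA device"

copy_stream = torch.cuda.Stream()
weight = torch.randn(1024, 1024, device="cuda")
# Pinned host memory enables truly asynchronous H2D copies.
batches = [torch.randn(1024, 1024, pin_memory=True) for _ in range(4)]

results = []
for cpu_batch in batches:
    with torch.cuda.stream(copy_stream):
        # Copy on the side stream; it can run while the previous
        # iteration's matmul is still executing on the default stream.
        gpu_batch = cpu_batch.to("cuda", non_blocking=True)
    # Default stream waits for this batch's copy before computing on it.
    torch.cuda.current_stream().wait_stream(copy_stream)
    results.append(gpu_batch @ weight)
torch.cuda.synchronize()
```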
Roadmap
- Phase 1: Research & Design
  - Analyze existing solutions like MLC-LLM and Triton to inform design decisions.
  - Define the MLIR dialects and transformations required for TorchFuser.
- Phase 2: Prototype Development
  - Develop initial prototypes focusing on fusing transformer components such as attention mechanisms and MLPs.
  - Benchmark performance against standard PyTorch implementations (a harness sketch follows this list).
- Phase 3: Community Feedback & Iteration
  - Share prototypes here in this thread to gather feedback.
  - Iterate on the design and implementation based on that feedback.
- Phase 4: Production Readiness
  - Finalize the API and backend implementations.
  - Provide comprehensive documentation and tutorials to facilitate adoption.
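For the Phase 2 benchmarking step, a harness along these lines would compare candidates against the eager PyTorch baseline. It uses the real `torch.utils.benchmark` API; the "torchfuser" candidate is a stand-in (eager today) for a compiled function, and a CUDA device is assumed:

```python
# Benchmark harness sketch: eager baseline vs. a (stubbed) fused candidate.
import torch
import torch.utils.benchmark as benchmark

def eager_mlp(x, w1, w2):
    # Eager baseline: the unfused matmul + ReLU + matmul chain.
    return torch.relu(x @ w1) @ w2

# Stand-in for a TorchFuser-compiled candidate (hypothetical; eager today).
candidate_mlp = eager_mlp

x, w1, w2 = (torch.randn(512, 512, device="cuda") for _ in range(3))

for sub_label, fn in [("eager", eager_mlp), ("torchfuser (stub)", candidate_mlp)]:
    timer = benchmark.Timer(
        stmt="fn(x, w1, w2)",
        globals={"fn": fn, "x": x, "w1": w1, "w2": w2},
        label="fused MLP",
        sub_label=sub_label,
    )
    print(timer.timeit(100))  # Timer handles CUDA synchronization internally
```

This is also where the 10%+ improvement target from the Motivation section would be measured.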
Call to Action
We’re seeking feedback on:
- The feasibility and design of integrating MLIR-based optimizations into PyTorch.
- Strategies for broad hardware support, including AMD GPUs and Apple’s MLX.
- Potential challenges in achieving the targeted performance improvements.
Your expertise and insights would be very much appreciated as we develop this project.