SwiftFusion: Scalable Sequence Parallelism for Distributed Inference of Diffusion Transformers on GPUs

Jiacheng Yang (University of Toronto), Jun Wu (Amazon), Yaoyao Ding (NVIDIA/University of Toronto), Zhiying Xu (Amazon Web Services), Yida Wang (Amazon), Gennady Pekhimenko (NVIDIA/University of Toronto)

System Optimization & Efficiency

SwiftFusion is a sequence parallelism system for distributed diffusion transformer inference that reduces latency by optimizing inter-GPU communication patterns and fusing all-to-all operations with attention computation. It enables high-resolution image and long video generation to scale efficiently across multi-GPU and multi-node configurations.

Presentation

Talk

Paper Session 3: Systems Efficiency

Wednesday, May 27 · 4:20 PM – 4:30 PM

Bayshore Ballroom

Poster

Wednesday, May 27 · 5:15 PM – 6:45 PM

Carmel / Monterey

Abstract

Diffusion Transformers (DiTs) have gained increasing adoption in high-quality image and video generation. As demand for higher-resolution images and longer videos increases, single-GPU inference becomes inefficient due to increased latency and large activation sizes. Current frameworks employ sequence parallelism (SP) techniques such as Ulysses Attention and Ring Attention to scale inference. However, these implementations have three primary limitations: (1) suboptimal communication patterns for network topologies on modern GPU machines, (2) latency bottlenecks from all-to-all operations in inter-machine communication, and (3) GPU sender-receiver synchronization and computation overheads from using two-sided communication libraries. To address these issues, we present SwiftFusion, a topology-aware efficient DiT serving engine. SwiftFusion incorporates three key innovations: (1) a topology-aware sequence parallelism technique that accounts for inter- and intra-machine bandwidth differences, (2) Torus Attention, a novel SP technique enabling overlapping of inter-machine all-to-all operations with computation, and (3) a one-sided communication implementation that minimizes GPU sender-receiver synchronization and computation overheads. Our experiments demonstrate that SwiftFusion outperforms the state-of-the-art approach by an average of 1.35× (up to 1.77×).
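
For readers unfamiliar with the all-to-all step that the abstract refers to, the following is a minimal sketch of the layout change commonly associated with Ulysses-style sequence parallelism: each rank starts with a slice of the sequence for all attention heads, and after the exchange it holds the full sequence for a slice of the heads. It is a single-process simulation in plain Python and is not SwiftFusion's implementation; it does not model Torus Attention, topology awareness, or one-sided communication. The function names and the toy sizes are hypothetical, and the sketch assumes that the sequence length and head count are both divisible by the number of ranks.

```python
# Illustrative simulation only: "ranks" are plain Python lists, not GPUs.
# Each element (s, h) stands for the activation of token s under head h.

def shard_by_sequence(seq_len, num_heads, world_size):
    """Rank r holds a contiguous chunk of the sequence, with all heads."""
    chunk = seq_len // world_size  # assumes seq_len % world_size == 0
    return [
        [[(s, h) for h in range(num_heads)]
         for s in range(r * chunk, (r + 1) * chunk)]
        for r in range(world_size)
    ]

def all_to_all_seq_to_head(shards, num_heads):
    """Simulated all-to-all: afterwards rank r holds the full sequence
    for its own contiguous slice of heads."""
    world_size = len(shards)
    heads_per_rank = num_heads // world_size  # assumes num_heads % world_size == 0
    out = []
    for dst in range(world_size):
        lo, hi = dst * heads_per_rank, (dst + 1) * heads_per_rank
        rows = []
        for src in range(world_size):  # dst receives one block from every rank
            for token in shards[src]:
                rows.append(token[lo:hi])
        out.append(rows)
    return out

seq_len, num_heads, world_size = 8, 4, 2
before = shard_by_sequence(seq_len, num_heads, world_size)
after = all_to_all_seq_to_head(before, num_heads)

# Per-rank shape as (tokens held, heads held).
print(len(before[0]), len(before[0][0]))  # 4 4
print(len(after[0]), len(after[0][0]))    # 8 2
```

Every rank sends a block to every other rank in this exchange, which is why its cost depends on the slowest link between machines. That is the kind of inter-machine all-to-all latency the abstract names as a bottleneck.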

Artifacts & Links

Paper (ACM Digital Library)