Time Compute in Linear Diffusion Transformer

TL;DR: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer

Abstract: This paper presents SANA-1.5, a linear Diffusion Transformer for efficient scaling in text-to-image generation. Building upon SANA-1.0, we introduce three key innovations: (1) Efficient Training Scaling: A depth-growth paradigm that enables scaling from 1.6B to 4.8B parameters with significantly reduced computational resources, combined with a memory-efficient 8-bit optimizer. (2) Model Depth Pruning: A block importance analysis technique for efficient model compression to arbitrary sizes with minimal quality loss. (3) Inference-time Scaling: A repeated sampling strategy that trades computation for model capacity, enabling smaller models to match larger model quality at inference time. Through these strategies, SANA-1.5 achieves a text-image alignment score of 0.72 on GenEval, which can be further improved to 0.80 through inference scaling, establishing a new state of the art on the GenEval benchmark. These innovations enable efficient model scaling across different compute budgets while maintaining high quality, making high-quality image generation more accessible.
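The repeated sampling idea in item (3) can be illustrated with a generic best-of-N loop: draw several candidates for one prompt and keep the one a scorer rates highest. The sketch below is only an illustration under stated assumptions. The abstract does not say how candidates are ranked, so `best_of_n`, `toy_generate`, and `toy_score` are hypothetical names, and the toy functions stand in for a real image generator and a real text-image alignment scorer.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    generate: Callable[[str, int], object],
    score: Callable[[str, object], float],
    prompt: str,
    n: int,
    base_seed: int = 0,
) -> Tuple[object, float]:
    """Draw n candidates for one prompt and keep the highest-scoring one.

    `generate` and `score` are hypothetical callables (assumptions, not the
    paper's API): generate(prompt, seed) returns a sample, and
    score(prompt, sample) returns a number where higher means better.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    best_sample, best_score = None, float("-inf")
    for i in range(n):
        # Each candidate uses a different seed, so extra compute buys diversity.
        sample = generate(prompt, base_seed + i)
        s = score(prompt, sample)
        if s > best_score:
            best_sample, best_score = sample, s
    return best_sample, best_score


def toy_generate(prompt: str, seed: int) -> float:
    """Toy stand-in for an image generator: returns a number in [0, 1)."""
    return random.Random(seed).random()


def toy_score(prompt: str, sample: float) -> float:
    """Toy stand-in for an alignment scorer: the sample is its own score."""
    return sample


if __name__ == "__main__":
    for n in (1, 4, 16):
        _, s = best_of_n(toy_generate, toy_score, "a red cube on a table", n)
        print(f"n={n:2d}  best score={s:.3f}")
```

Because the candidate seeds for a small `n` are a prefix of those for a larger `n`, the best score in this toy setup can only stay the same or rise as `n` grows, which mirrors the trade of extra inference compute for quality.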

Lay Summary: Creating high-quality images from text descriptions typically requires massive computing power, putting advanced AI image generation out of reach for many researchers and developers. We developed SANA-1.5, an AI system that achieves top-tier image generation while being dramatically more efficient. Our key breakthroughs include: (1) a "growing" training method that builds larger models using 60% less computing power; (2) a compression technique that shrinks models with minimal quality loss; and (3) a sampling technique that lets smaller models spend extra computation at generation time to boost their output quality. These innovations allow SANA-1.5 to match or exceed the performance of systems like Stable Diffusion XL while being more accessible. On standard tests, it achieves record-breaking accuracy in matching images to text descriptions (an 80% alignment score when using our sampling boost). By making advanced image generation more efficient, SANA-1.5 helps democratize AI creativity, enabling more researchers to experiment with the technology and developers to integrate it into applications without needing expensive hardware.
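The compression step can be pictured as scoring each block of a deep model and dropping the least important ones. The sketch below is a toy illustration, not the authors' procedure. The text above does not define the importance metric, so this example assumes one common proxy: a block matters more when its output differs more in direction from its input (one minus cosine similarity). All names here (`block_importance`, `prune_depth`, the toy blocks) are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def block_importance(blocks: Sequence[Block], calib: Sequence[Vector]) -> List[float]:
    """Average (1 - cosine(input, output)) per block over calibration inputs.

    Assumed proxy: a block whose output points the same way as its input
    changes little and is treated as less important.
    """
    scores = [0.0] * len(blocks)
    for x in calib:
        h = list(x)
        for i, block in enumerate(blocks):
            out = block(h)
            scores[i] += 1.0 - cosine(h, out)
            h = out  # feed the output forward, as in a sequential model
    return [s / len(calib) for s in scores]


def prune_depth(blocks: Sequence[Block], calib: Sequence[Vector], keep: int) -> List[int]:
    """Return indices of the `keep` most important blocks, in original order."""
    scores = block_importance(blocks, calib)
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy 2-D "model": a shear, an identity, a 90-degree rotation, an identity.
toy_blocks: List[Block] = [
    lambda v: [v[0] + v[1], v[1]],
    lambda v: list(v),
    lambda v: [v[1], -v[0]],
    lambda v: list(v),
]
toy_calib: List[Vector] = [[1.0, 0.0], [0.0, 1.0]]

if __name__ == "__main__":
    print(block_importance(toy_blocks, toy_calib))
    print(prune_depth(toy_blocks, toy_calib, keep=2))
```

In this toy setup the two identity blocks score near zero and are the first to go, while the rotation and shear blocks are kept. A real pruning pipeline would typically fine-tune the smaller model afterwards to recover quality.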

Link To Code: https://github.com/NVlabs/Sana

Primary Area: Deep Learning->Generative Models and Autoencoders

Keywords: Diffusion Model, Inference Scaling, Linear Attention

Submission Number: 1929
