Scaling Down Text Encoders of Text-to-Image Diffusion Models

¹Jingdong Explore Academy, ²Georgia Institute of Technology

The scaling down pattern

We distill T5-XXL into a series of smaller T5 models and evaluate their performance in guiding image synthesis along three key dimensions: image quality, semantic understanding, and text rendering, treating T5-XXL's performance as the baseline. Our findings indicate that while image quality and semantic understanding remain largely intact as the encoder shrinks, text rendering is more sensitive to reductions in model size. Our distilled T5-Base achieves 50x memory compression and, on consumer-grade GPUs, a 2.7x latency speedup over T5-XXL running with CPU offload.
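For illustration, a minimal sketch of dropping the distilled encoder into a diffusers FLUX pipeline. The checkpoint path is a placeholder for the released weights, and we assume the released encoder already produces features at the width the FLUX transformer expects:

```python
import torch
from transformers import T5EncoderModel
from diffusers import FluxPipeline

# Placeholder path: substitute the released distilled-T5-Base checkpoint.
# We assume it already outputs features at the width FLUX expects.
text_encoder_2 = T5EncoderModel.from_pretrained(
    "path/to/distilled-t5-base", torch_dtype=torch.bfloat16
)

# FLUX keeps its T5 encoder in the `text_encoder_2` slot; the CLIP encoder
# (`text_encoder`) is left unchanged. With the small encoder, no CPU offload
# is needed on consumer GPUs.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=text_encoder_2,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("An airplane in front of a clock.", num_inference_steps=28).images[0]
image.save("airplane_clock.png")
```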

Visualization of the scaling down pattern.


Abstract

Text encoders in diffusion models have evolved rapidly, transitioning from CLIP to T5-XXL. Although this evolution has significantly enhanced the models' ability to understand complex prompts and generate text, it also brings a substantial increase in parameter count. Moreover, although the T5 series of encoders is trained on the C4 natural-language corpus, which includes a significant amount of non-visual data, diffusion models with a T5 encoder do not respond to such non-visual prompts, indicating redundancy in representational power. This raises an important question: "Do we really need such a large text encoder?" In pursuit of an answer, we employ vision-based knowledge distillation to train a series of T5 encoder models. To fully inherit T5-XXL's capabilities, we construct our dataset based on three criteria: image quality, semantic understanding, and text rendering. Our results demonstrate a scaling-down pattern: the distilled T5-Base model generates images of quality comparable to those produced by T5-XXL while being 50 times smaller. This reduction in model size significantly lowers the GPU requirements for running state-of-the-art models such as FLUX and SD3, making high-quality text-to-image generation more accessible.
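The abstract does not spell out the training objective, but one plausible reading of vision-based distillation is that the student encoder is supervised through a frozen diffusion backbone rather than on text features alone. A minimal sketch under that assumption; `teacher`, `student`, `proj`, and `denoiser` are hypothetical handles, not the paper's actual code:

```python
import torch
import torch.nn.functional as F

def distill_step(batch, teacher, student, proj, denoiser, optimizer):
    """One hypothetical distillation step (our reading, not the paper's exact recipe).

    teacher  - frozen T5-XXL encoder
    student  - trainable small T5 encoder
    proj     - learned linear map from the student width (e.g. 768) to 4096
    denoiser - frozen diffusion backbone providing the vision-based signal
    """
    tokens, noisy_latents, timesteps = batch
    with torch.no_grad():
        ctx_teacher = teacher(tokens).last_hidden_state
        eps_teacher = denoiser(noisy_latents, timesteps, ctx_teacher)

    ctx_student = proj(student(tokens).last_hidden_state)
    eps_student = denoiser(noisy_latents, timesteps, ctx_student)

    # Match the teacher both in text-feature space and, more importantly,
    # in the denoiser's output space, where visual fidelity is measured.
    loss = F.mse_loss(ctx_student, ctx_teacher) + F.mse_loss(eps_student, eps_teacher)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```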

Compatibility with auxiliary modules

The distilled T5-Base is compatible with ControlNet, which conditions image generation on additional inputs such as edge or depth maps. The checkpoint can be found here.
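A sketch of pairing the distilled encoder with a FLUX ControlNet in diffusers; the Canny ControlNet repo ID and the condition image are assumptions, and `text_encoder_2` is the distilled T5-Base loaded as in the snippet above:

```python
import torch
from diffusers import FluxControlNetModel, FluxControlNetPipeline
from diffusers.utils import load_image

# Assumed public Canny ControlNet for FLUX; swap in any compatible checkpoint.
controlnet = FluxControlNetModel.from_pretrained(
    "InstantX/FLUX.1-dev-Controlnet-Canny", torch_dtype=torch.bfloat16
)
pipe = FluxControlNetPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    controlnet=controlnet,
    text_encoder_2=text_encoder_2,  # distilled T5-Base from the earlier snippet
    torch_dtype=torch.bfloat16,
).to("cuda")

control_image = load_image("canny_edges.png")  # placeholder edge map
image = pipe(
    "Haunting dark fantasy illustration of an ancient, twisted statue.",
    control_image=control_image,
    controlnet_conditioning_scale=0.6,
).images[0]
```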

In addition, we assess its compatibility with LoRA, a method that efficiently adapts the base diffusion model through low-rank weight updates. We use prithivMLmods/Canopus-LoRA-Flux-Anime.
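Loading the LoRA on top of the same pipeline needs no extra plumbing; a brief sketch using diffusers' standard LoRA loader:

```python
# Apply the anime LoRA on top of the pipeline that already uses the
# distilled T5-Base (`pipe` from the first snippet).
pipe.load_lora_weights("prithivMLmods/Canopus-LoRA-Flux-Anime")
pipe.fuse_lora(lora_scale=0.9)  # optionally bake the LoRA into the base weights

image = pipe("Portrait of a stylish young woman, anime style.").images[0]
```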

We also examine whether T5-Base is compatible with a step-distilled model. We use FLUX.1-schnell.
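The encoder swap is identical for the step-distilled pipeline; a sketch assuming the standard diffusers setup for FLUX.1-schnell, which samples in about four steps without classifier-free guidance:

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    text_encoder_2=text_encoder_2,  # distilled T5-Base, as before
    torch_dtype=torch.bfloat16,
).to("cuda")

# Schnell is step-distilled: few steps, guidance disabled.
image = pipe(
    "A robot displaying the word 'efficiency'.",
    num_inference_steps=4,
    guidance_scale=0.0,
).images[0]
```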

More comparisons between our T5-Base and T5-XXL



[For each prompt below: two images generated with our T5-Base, followed by two generated with T5-XXL.]

Portrait of a stylish young woman wearing a futuristic golden bodysuit that creates a metallic, mirror-like effect. She is wearing large, reflective blue-tinted aviator sunglasses. Over her head, she wears headphones with metallic accents.

Haunting dark fantasy illustration of an ancient, twisted statue standing atop a steep cliff, overseeing a decaying metropolis shrouded in mist. The sky churns with ominous clouds and flashes of lightning.

An airplane in front of a clock.

Four people gathered for a picnic.

A robot displaying the word 'efficiency'.

A tree displaying the word 'i'm lost'.
