MobileDiffusion: Rapid text-to-image generation on-device

Posted by Yang Zhao, Senior Software Engineer, and Tingbo Hou, Senior Staff Software Engineer, Core ML

Text-to-image diffusion models have shown exceptional capabilities in generating high-quality images from text prompts. However, leading models feature billions of parameters and are consequently expensive to run, requiring powerful desktops or servers (e.g., Stable Diffusion, DALL·E, and Imagen). While recent advancements in inference solutions on Android via MediaPipe and iOS via Core ML have been made in the past year, rapid (sub-second) text-to-image generation on mobile devices has remained out of reach.

To that end, in “MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices”, we introduce a novel approach with the potential for rapid text-to-image generation on-device. MobileDiffusion is an efficient latent diffusion model specifically designed for mobile devices. We also adopt DiffusionGAN, which fine-tunes a pre-trained diffusion model while leveraging a GAN to model the denoising step, to achieve one-step sampling during inference. We have tested MobileDiffusion on iOS and Android premium devices, and it can generate a 512×512 high-quality image in half a second. Its comparatively small model size of just 520M parameters makes it uniquely suited for mobile deployment.

Rapid text-to-image generation on-device.

Background

The relative inefficiency of text-to-image diffusion models arises from two primary challenges. First, the inherent design of diffusion models requires iterative denoising to generate images, necessitating multiple evaluations of the model. Second, the complexity of the network architecture in text-to-image diffusion models involves a substantial number of parameters, regularly reaching into the billions and resulting in computationally expensive evaluations. As a result, despite the potential benefits of deploying generative models on mobile devices, such as enhancing user experience and addressing emerging privacy concerns, it remains relatively unexplored within the current literature.

The optimization of inference efficiency in text-to-image diffusion models has been an active research area. Previous studies predominantly concentrate on the first challenge, seeking to reduce the number of function evaluations (NFEs). By leveraging advanced numerical solvers (e.g., DPM) or distillation techniques (e.g., progressive distillation, consistency distillation), the number of necessary sampling steps has been reduced significantly, from several hundred to single digits. Some recent techniques, like DiffusionGAN and Adversarial Diffusion Distillation, even reduce sampling to a single step.

However, on mobile devices, even a small number of evaluation steps can be slow due to the complexity of the model architecture. Thus far, the architectural efficiency of text-to-image diffusion models has received comparatively less attention. A handful of earlier works briefly touch upon this matter, for example by removing redundant neural network blocks (e.g., SnapFusion). However, these efforts lack a comprehensive analysis of each component within the model architecture, and so fall short of providing a holistic guide for designing highly efficient architectures.

MobileDiffusion

Effectively overcoming the challenges imposed by the limited computational power of mobile devices requires an in-depth and holistic exploration of the model’s architectural efficiency. In pursuit of this objective, our research undertakes a detailed examination of each constituent and computational operation within Stable Diffusion’s UNet architecture. We present a comprehensive guide for crafting highly efficient text-to-image diffusion models, culminating in MobileDiffusion.

The design of MobileDiffusion follows that of latent diffusion models. It contains three components: a text encoder, a diffusion UNet, and an image decoder. For the text encoder, we use CLIP-ViT/L14, which is a small model (125M parameters) suitable for mobile. We then turn our focus to the diffusion UNet and image decoder.

Diffusion UNet

As illustrated in the figure below, diffusion UNets commonly interleave transformer blocks and convolution blocks. We conduct a comprehensive investigation of these two fundamental building blocks. Throughout the study, we control the training pipeline (e.g., data, optimizer) to study the effects of different architectures.

In classic text-to-image diffusion models, a transformer block consists of a self-attention layer (SA) for modeling long-range dependencies among visual features, a cross-attention layer (CA) to capture interactions between text conditioning and visual features, and a feed-forward layer (FF) to post-process the output of attention layers. These transformer blocks hold a pivotal role in text-to-image diffusion models, serving as the primary components responsible for text comprehension. However, they also pose a significant efficiency challenge, given the computational expense of the attention operation, which is quadratic in the sequence length. We follow the idea of the UViT architecture, which places more transformer blocks at the bottleneck of the UNet. This design choice is motivated by the fact that the attention computation is less resource-intensive at the bottleneck due to its lower dimensionality.

Our UNet architecture incorporates more transformers in the middle, and skips self-attention (SA) layers at higher resolutions.
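A rough count makes the quadratic cost of attention concrete. The sketch below is only an illustration: the feature-map sizes (a 64×64 latent downsampled by 2× and 4×) and the channel width of 320 are assumed values, not the configuration of MobileDiffusion, and the function counts only the two matrix products that scale with the square of the token count.

```python
def self_attention_score_cost(height: int, width: int, channels: int) -> int:
    """Rough multiply-add count for QK^T and (attention @ V).

    Projections, softmax, and multi-head bookkeeping are ignored; only the
    terms that grow quadratically with the number of tokens are counted.
    """
    tokens = height * width
    # QK^T costs tokens * tokens * channels; attention @ V costs the same.
    return 2 * tokens * tokens * channels


# Hypothetical sizes: a 64x64 latent, then 2x and 4x downsampled feature maps.
CHANNELS = 320  # illustrative width, held fixed to isolate the resolution effect
costs = {side: self_attention_score_cost(side, side, CHANNELS) for side in (64, 32, 16)}

for side, cost in costs.items():
    print(f"{side}x{side}: {cost / 1e9:.2f} G multiply-adds")
```

Under these assumptions, halving the side length cuts the token count by 4× and the quadratic term by 16×, which is consistent with concentrating transformer blocks at the bottleneck and skipping self-attention at higher resolutions.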

Convolution blocks, in particular ResNet blocks, are deployed at each level of the UNet. While these blocks are instrumental for feature extraction and information flow, the associated computational costs, especially at high-resolution levels, can be substantial. One proven approach in this context is separable convolution. We observed that replacing regular convolution layers with lightweight separable convolution layers in the deeper segments of the UNet yields similar performance.
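The saving from separable convolution can be seen by counting weights. The sketch below assumes the common depthwise-separable form (a depthwise k×k convolution followed by a 1×1 pointwise convolution); the post does not spell out the exact variant, and the channel sizes are illustrative rather than taken from the model.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def separable_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative layer size (an assumption, not a value from the paper).
C_IN, C_OUT, K = 1280, 1280, 3
regular = conv_params(C_IN, C_OUT, K)
separable = separable_conv_params(C_IN, C_OUT, K)
print(f"regular={regular:,} separable={separable:,} ratio={regular / separable:.2f}")
```

For wide layers the pointwise term dominates, so the reduction approaches k² (9× for 3×3 kernels), which is why the substitution pays off most in the deeper, wider segments of the UNet.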

In the figure below, we compare the UNets of several diffusion models. Our MobileDiffusion exhibits superior efficiency in terms of FLOPs (floating-point operations) and number of parameters.

Comparison of some diffusion UNets.

Image decoder

In addition to the UNet, we also optimized the image decoder. We trained a variational autoencoder (VAE) to encode an RGB image into an 8-channel latent variable whose spatial size is 8× smaller than that of the image. The decoder maps a latent variable back to an image that is 8× larger in spatial size. To further enhance efficiency, we design a lightweight decoder architecture by pruning the original’s width and depth. The resulting lightweight decoder leads to a significant performance boost, with nearly 50% latency improvement and better quality. For more details, please refer to our paper.
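As a quick check of the sizes involved, the sketch below computes the latent shape for a 512×512 RGB image. It assumes the 8× factor applies to each spatial side, as in standard latent diffusion VAEs; the helper name `latent_shape` is hypothetical.

```python
def latent_shape(height: int, width: int, downsample: int = 8, latent_channels: int = 8):
    """Shape (H, W, C) of the latent for an RGB image of the given size."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (height // downsample, width // downsample, latent_channels)


image_shape = (512, 512, 3)
z_shape = latent_shape(image_shape[0], image_shape[1])

image_values = image_shape[0] * image_shape[1] * image_shape[2]
latent_values = z_shape[0] * z_shape[1] * z_shape[2]
print(z_shape, image_values // latent_values)
```

Under this assumption the UNet operates on a 64×64×8 tensor, which holds 24× fewer values than the 512×512×3 image it represents.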

VAE reconstruction. Our VAE decoders have better visual quality than SD (Stable Diffusion).

Decoder	#Params (M)	PSNR↑	SSIM↑	LPIPS↓
SD	49.5	26.7	0.76	0.037
Ours	39.3	30.0	0.83	0.032
Ours-Lite	9.8	30.2	0.84	0.032
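For readers unfamiliar with the PSNR column, the sketch below shows the standard definition, 10·log10(MAX² / MSE), on a handful of made-up 8-bit pixel values. It is not the evaluation code behind the table, and the sample pixels are invented for illustration.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: no reconstruction error
    return 10.0 * math.log10(max_value ** 2 / mse)


# Made-up pixel values, flattened to one dimension for simplicity.
reference = [10, 50, 90, 130, 170, 210]
reconstruction = [12, 49, 91, 128, 171, 209]
print(f"{psnr(reference, reconstruction):.2f} dB")
```

Higher PSNR means lower mean squared error against the reference, which is why the column is marked with an upward arrow.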

One-step sampling

In addition to optimizing the model architecture, we adopt a DiffusionGAN hybrid to achieve one-step sampling. Training DiffusionGAN hybrid models for text-to-image generation encounters several intricacies. Notably, the discriminator, a classifier distinguishing real data from generated data, must make judgments based on both texture and semantics. Moreover, the cost of training text-to-image models can be extremely high, particularly in the case of GAN-based models, where the discriminator introduces additional parameters. Purely GAN-based text-to-image models (e.g., StyleGAN-T, GigaGAN) confront similar complexities, resulting in highly intricate and expensive training.

To overcome these challenges, we use a pre-trained diffusion UNet to initialize the generator and discriminator. This design enables seamless initialization with the pre-trained diffusion model. We postulate that the internal features within the diffusion model contain rich information about the intricate interplay between textual and visual data. This initialization strategy significantly streamlines the training.

The figure below illustrates the training procedure. After initialization, a noisy image is sent to the generator for one-step diffusion. The result is evaluated against ground truth with a reconstruction loss, similar to diffusion model training. We then add noise to the output and send it to the discriminator, whose result is evaluated with a GAN loss, effectively adopting the GAN to model a denoising step. By using pre-trained weights to initialize the generator and the discriminator, the training becomes a fine-tuning process, which converges in fewer than 10K iterations.
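The data flow of one generator update can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the toy generator and discriminator stand in for the UNet-initialized networks, the variance-preserving noise mix and the non-saturating GAN loss are common choices that the post does not specify, and text conditioning is omitted.

```python
import math
import random


def add_noise(x, noise_level, rng):
    """Mix a signal with Gaussian noise: sqrt(1 - s) * x + sqrt(s) * eps."""
    keep, mix = math.sqrt(1.0 - noise_level), math.sqrt(noise_level)
    return [keep * v + mix * rng.gauss(0.0, 1.0) for v in x]


def toy_generator(noisy, scale):
    """Hypothetical stand-in for the one-step denoising UNet."""
    return [scale * v for v in noisy]


def toy_discriminator(x, weight):
    """Hypothetical stand-in for the discriminator; returns one real/fake logit."""
    return weight * sum(x) / len(x)


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def generator_losses(clean, noise_level, scale, weight, rng):
    """Return (reconstruction loss, adversarial loss) for one training sample."""
    noisy_input = add_noise(clean, noise_level, rng)
    prediction = toy_generator(noisy_input, scale)       # one-step diffusion
    reconstruction = sum((p - c) ** 2 for p, c in zip(prediction, clean)) / len(clean)
    renoised = add_noise(prediction, noise_level, rng)   # noise is added to the output
    logit = toy_discriminator(renoised, weight)          # discriminator sees the noised output
    adversarial = softplus(-logit)                       # non-saturating GAN loss (assumed form)
    return reconstruction, adversarial


rng = random.Random(0)
clean_latent = [rng.uniform(-1.0, 1.0) for _ in range(16)]
rec, adv = generator_losses(clean_latent, noise_level=0.5, scale=1.0, weight=1.0, rng=rng)
print(f"reconstruction={rec:.3f} adversarial={adv:.3f}")
```

In a real setup both losses would be combined with some weighting and backpropagated through the generator, while the discriminator is trained in alternation on noised real and generated samples.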

Illustration of DiffusionGAN fine-tuning.

Results

Below we show example images generated by MobileDiffusion with DiffusionGAN one-step sampling. With such a compact model (520M parameters in total), MobileDiffusion can generate high-quality, diverse images across various domains.

Images generated by our MobileDiffusion

We measured the performance of our MobileDiffusion on both iOS and Android devices, using different runtime optimizers. The latency numbers are reported below. We see that MobileDiffusion is very efficient and can run within half a second to generate a 512×512 image. This lightning speed potentially enables many interesting use cases on mobile devices.

Latency measurements (s) on mobile devices.

Conclusion

With superior efficiency in terms of latency and size, MobileDiffusion has the potential to be a very friendly option for mobile deployments, given its capability to enable a rapid image generation experience while typing text prompts. We will ensure that any application of this technology is in line with Google’s responsible AI practices.

Acknowledgments

We would like to thank our collaborators and contributors who helped bring MobileDiffusion on-device: Zhisheng Xiao, Yanwu Xu, Jiuqiang Tang, Haolin Jia, Lutz Justen, Daniel Fenner, Ronald Wotzlaw, Jianing Wei, Raman Sarokin, Juhyun Lee, Andrei Kulik, Chuo-Ling Chang, and Matthias Grundmann.

Artificial Intelligence (AI), 17 March 2024
