Abstract. In the real world, sounds inherently vary in length, spanning a broad spectrum of durations. Extending audio length beyond what was covered during training, also known as extrapolation, is specifically challenging for audio generative models. We particularly aim to address the challenge of generating variable-length audio in text-to-audio (TTA) diffusion models. The existing TTA diffusion model design notoriously suffers from the problem of generation with such flexibility. Therein, the design of TTA diffusion models do not accommodate to the change of positional information. In this work, we present a TTA diffusion model capable of producing audio of any arbitrary length, requiring no further training and additional conditions beyond the initial text prompts. In particular, we introduce a novel framework based on relative positional embeddings, which is specifically designed to support the flexibility without fundamental changes to the current diffusion pipeline. Our proposed method is \textit{training-free}, thus reducing the training cost for an unseen length. Moreover, our approach enables training with shorter audio durations, enjoying reduced training costs while maintaining performance levels comparable to those achieved with longer durations. Empirically, we demonstrate exceptional performance against the existing state-of-the-arts on audio generation benchmarks with a significantly lower model size compared to the counterparts. In variable audio length generation, our approach consistently outperforms existing methods by a large margin.
High-level overview of FleXounDiT in training and testing, and the DiT block architecture.
We compare the audio generated by FleXounDiT to prior Text-to-Audio works.
                  Input                        | FleXounDiT (ours) | Make-An-Audio2 | Audio-LDM2 | TANGO2 | Stable Audio Open |
---|---|---|---|---|---|
Middle aged man speaks foreign language while water is trickling | |||||
Wind blowing with ducks quacking and birds chirping | |||||
A man speaking followed by a faucet turning on and off while pouring water twice proceeded by water draining down a pipe | |||||
A car engine revs and then shuts off | |||||
Man speaking while insects buzz around | |||||
Man speaks followed by whistling | |||||
An emergency siren ringing with car horn honking | |||||
Birds chirp and sing in the background, while an adult male speaks and crunching footfalls occur | |||||
A sewing machine sews followed by a man talking | |||||
Static and beeping | |||||
A man speaking as a stream of water sprays and splashes | |||||
A man speaks, and a sewing machine sews material | |||||
A dog barks and rustles with some clicking | |||||
A man speaks followed by a vehicle revving | |||||
A duck quacks followed by a man talking |
FleXounDiT can generate audios of variable-length with preserved performance levels.
                  Input                        | Duration | FleXounDiT (ours) | Make-An-Audio2 | Audio-LDM2 | TANGO2 | Stable Audio Open |
---|---|---|---|---|---|---|
Pigeons cooing followed by camera muffling | 5 seconds | |||||
Bells ringing repeatedly | 5 seconds | |||||
Middle aged man speaks foreign language while water is trickling | 5 seconds | |||||
A motor vehicle engine is revving | 5 seconds | |||||
A tractor driving by as a car horn honks while wind blows into a microphone | 5 seconds | |||||
Pigeons cooing followed by camera muffling | 10 seconds | |||||
Bells ringing repeatedly | 10 seconds | |||||
Middle aged man speaks foreign language while water is trickling | 10 seconds | |||||
A motor vehicle engine is revving | 10 seconds | |||||
A tractor driving by as a car horn honks while wind blows into a microphone | 10 seconds | |||||
Pigeons cooing followed by camera muffling | 20 seconds | |||||
Bells ringing repeatedly | 20 seconds | |||||
Middle aged man speaks foreign language while water is trickling | 20 seconds | |||||
A motor vehicle engine is revving | 20 seconds | |||||
A tractor driving by as a car horn honks while wind blows into a microphone | 20 seconds | |||||
Pigeons cooing followed by camera muffling | 30 seconds | |||||
Bells ringing repeatedly | 30 seconds | |||||
Middle aged man speaks foreign language while water is trickling | 30 seconds | |||||
A motor vehicle engine is revving | 30 seconds | |||||
A tractor driving by as a car horn honks while wind blows into a microphone | 30 seconds |