Zero-Shot Text-to-Video Synthesis with
LLM-Driven Dynamic Scene Syntax

¹ReLER Lab, University of Technology Sydney, ²CCAI, Zhejiang University

Zero-shot text-to-video generation. We present a new framework for text-to-video generation with exceptional temporal coherence, featuring realistic object movements, transformations, and background motion within the generated videos.


Text-to-video (T2V) generation is a rapidly growing research area that aims to translate the scenes, objects, and actions described in a complex text prompt into a sequence of coherent visual frames. We present FlowZero, a novel framework that combines Large Language Models (LLMs) with image diffusion models to generate temporally coherent videos. FlowZero uses LLMs to understand complex spatio-temporal dynamics from text: the LLM generates a comprehensive dynamic scene syntax (DSS) containing scene descriptions, object layouts, and background motion patterns. These DSS elements then guide the image diffusion model to generate videos with smooth object motion and frame-to-frame coherence. Moreover, FlowZero incorporates an iterative self-refinement process that improves the alignment between the spatio-temporal layouts and the textual prompt. To enhance global coherence, we propose enriching the initial noise of each frame with motion dynamics, which adaptively controls background movement and camera motion. By using spatio-temporal syntaxes to guide the diffusion process, FlowZero improves zero-shot video synthesis, generating coherent videos with vivid motion. Our code will be open-sourced at: https://github.com/aniki-ly/FlowZero
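To make the three DSS components concrete, the following is a minimal Python sketch of how a per-frame syntax could be represented. It is not the authors' schema: the class names `ObjectLayout` and `FrameSyntax`, the normalized box convention, the direction vocabulary, the speed unit, and the helper `toy_balloon_dss` with its hand-written values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectLayout:
    """One foreground object in one frame (hypothetical structure)."""
    name: str
    # Bounding box as (x_min, y_min, x_max, y_max), normalized to [0, 1],
    # with y growing downward. This convention is an assumption.
    box: tuple[float, float, float, float]


@dataclass
class FrameSyntax:
    """The three DSS components named in the abstract, for a single frame."""
    description: str                                   # scene description
    objects: list[ObjectLayout] = field(default_factory=list)  # object layout
    bg_direction: str = "static"                       # background motion (assumed vocabulary)
    bg_speed: int = 0                                  # background motion (assumed unit)


def toy_balloon_dss(num_frames: int = 4) -> list[FrameSyntax]:
    """Hand-written stand-in for what an LLM might return for
    "A balloon floats up into the sky". The values are illustrative only."""
    frames = []
    for i in range(num_frames):
        top = 0.6 - 0.15 * i  # the box moves upward as the frame index grows
        frames.append(
            FrameSyntax(
                description="A balloon floats up into the sky",
                objects=[ObjectLayout("balloon", (0.4, top, 0.6, top + 0.3))],
                bg_direction="down",
                bg_speed=1,
            )
        )
    return frames
```

In this sketch each frame carries its own description, layout, and background motion, so a downstream sampler could read one `FrameSyntax` per generated frame.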


Overview of FlowZero: Starting from a video prompt, we first instruct the LLM (i.e., GPT-4) to generate a frame-by-frame syntax comprising scene descriptions, foreground layouts, and background motion patterns. We then apply an iterative self-refinement process to improve the generated spatio-temporal layouts: in a feedback loop, the LLM autonomously verifies and corrects spatial and temporal errors in the initial layouts. The loop continues until the confidence score \(C\) for the modified layouts exceeds a predefined threshold \(\lambda\). Next, we perform motion-guided noise shifting (MNS) to obtain the initial noise for each frame \(i\) by shifting the first frame's noise according to the predicted background motion direction \(d_{i}\) and speed \(s_{i}\). Finally, a U-Net with cross-attention, gated attention, and cross-frame attention produces \(N\) coherent video frames.
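The refinement loop and the noise-shifting step described in the overview can be sketched as follows. This is a simplified illustration under stated assumptions, not the released implementation: the `critique` callable stands in for the LLM verification call, `max_rounds` is an added safety cap, the direction vocabulary in `DIRECTION_STEPS` is hypothetical, and the choice to accumulate shifts across frames with wrap-around (`np.roll`) is one plausible reading of "shifting the first noise".

```python
from typing import Callable

import numpy as np

# Assumed direction vocabulary mapped to (dy, dx) unit steps on the latent grid.
DIRECTION_STEPS = {
    "static": (0, 0),
    "left": (0, -1),
    "right": (0, 1),
    "up": (-1, 0),
    "down": (1, 0),
}


def self_refine(layouts, critique: Callable, threshold: float, max_rounds: int = 5):
    """Repeat the critique step until the confidence C exceeds the threshold.

    `critique(layouts)` is a hypothetical stand-in for the LLM call; it must
    return `(revised_layouts, confidence)`. `max_rounds` is an assumed cap so
    the loop always terminates.
    """
    confidence = 0.0
    for _ in range(max_rounds):
        layouts, confidence = critique(layouts)
        if confidence > threshold:
            break
    return layouts, confidence


def motion_guided_noise_shift(first_noise: np.ndarray, directions, speeds) -> np.ndarray:
    """Build per-frame initial noise by shifting the first frame's noise.

    first_noise: array of shape (C, H, W).
    directions[i], speeds[i]: background motion d_i and s_i for frame i,
    with integer speeds measured in latent pixels (an assumption).
    Shifts accumulate over frames and wrap around the borders, so every
    frame reuses exactly the same noise values (both are assumptions).
    """
    frames = []
    offset_y = offset_x = 0
    for direction, speed in zip(directions, speeds):
        dy, dx = DIRECTION_STEPS[direction]
        offset_y += dy * speed
        offset_x += dx * speed
        frames.append(np.roll(first_noise, shift=(offset_y, offset_x), axis=(1, 2)))
    return np.stack(frames)  # shape (N, C, H, W)
```

For example, calling `motion_guided_noise_shift(noise, ["static", "right", "right"], [0, 1, 2])` returns three noise maps in which the second is the first rolled by one column and the third by three columns. Because every frame starts from correlated noise, the denoised backgrounds tend to move consistently with the predicted direction and speed.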



"A butterfly leaving a flower"
"A horse is running from right to left in an open field"
"A caterpillar is crawling on a branch, and then it transforms into a butterfly, then it flies away"
"A man and a woman running towards each other, and hugging together"
"sun rises from the sea"
"Three birds flying from right to left across the sky"
"A volcano first dormant, then it erupts with smoke and fire"
"A man is waiting at a bus stop, and after the bus arrives"
"A jogger is running in the field, then a dog joins him"
"A panda is climbing the tree from bottom to top"
"Ironman is surfing on a surfboard in the sea from left to right"
"A soccer player kicks a ball towards another player"
"A bird rests on a tree, then fly away"
"A balloon floats up into the sky"
"A plane ascends into the sky"
"A girl is reading a book in a garden as two butterflies flutter in the side and a cloud moves across the sky"


If you use our work in your research, please cite our publication:

    title={FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax},
    author={Yu Lu and Linchao Zhu and Hehe Fan and Yi Yang},
    journal={arXiv preprint arXiv:2311.15813},