Generating complete songs from lyrics is hard because vocals, accompaniment, and musical style all have to stay coordinated over several minutes of audio. YuE tackles this by scaling dual-track music generation with a two-stage language model architecture, and it supports zero-shot style transfer through audio in-context learning. It is positioned as an open foundation-model alternative to commercial systems such as Suno.ai and is released under the permissive Apache 2.0 license.
What YuE does and how it works
YuE is an open-source foundation model series designed to generate full songs (vocals plus accompaniment) directly from lyrics. It positions itself as an open alternative to proprietary models such as Suno.ai, using a two-stage architecture that produces music segment by segment (for example verse, then chorus) while keeping the flow between segments coherent.
Under the hood, YuE employs a 7-billion parameter language model to generate audio tokens in a chain-of-thought manner across multiple song segments. This large language model (LM) handles the broad structure and content of the song. Then, a smaller 1-billion parameter LM refines the output, improving audio quality and coherence.
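The two-stage flow can be sketched structurally. This is a toy illustration only: `toy_stage1` and `toy_stage2` are hypothetical stand-ins for the 7B and 1B models, the integer "tokens" are placeholders, and running the refinement once over the whole sequence is a simplifying assumption. None of it reflects YuE's actual API or token layout.

```python
from typing import Callable, List

Tokens = List[int]


def generate_song(
    segments: List[str],
    stage1_generate: Callable[[str, Tokens], Tokens],
    stage2_refine: Callable[[Tokens], Tokens],
) -> Tokens:
    """Generate coarse tokens segment by segment, then refine the result."""
    coarse: Tokens = []
    for lyrics in segments:
        # Stage 1 sees the current lyrics plus everything generated so far,
        # so later segments can stay consistent with earlier ones.
        coarse.extend(stage1_generate(lyrics, coarse))
    # Stage 2 (assumption: a single pass over the full coarse sequence).
    return stage2_refine(coarse)


# Hypothetical stand-ins for the two language models (not YuE code).
def toy_stage1(lyrics: str, context: Tokens) -> Tokens:
    # One placeholder token per word, offset by the context length.
    return [len(context) + i for i, _ in enumerate(lyrics.split())]


def toy_stage2(coarse: Tokens) -> Tokens:
    # Expand each coarse token into two finer placeholder tokens.
    return [t * 2 + k for t in coarse for k in (0, 1)]


if __name__ == "__main__":
    song = generate_song(
        ["[verse] city lights fade", "[chorus] we run"],
        toy_stage1,
        toy_stage2,
    )
    print(len(song), song[:6])
```

The point of the sketch is the control flow: the large model is called once per segment with growing context, while the small model only ever sees tokens the large model has already produced.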
A key feature is YuE’s dual-track in-context learning (ICL). By providing a reference song, the model can clone vocal styles or musical genres, enabling zero-shot style transfer without additional training. This makes the model flexible for creative applications where users want to match specific voices or genres.
The codebase is Python-based, and its dependencies are listed in a requirements.txt file that the quickstart command fetches directly from the repository at install time. It also supports LoRA finetuning, a parameter-efficient technique for adapting large models to specific styles or datasets without retraining from scratch.
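To see why LoRA counts as parameter-efficient, compare the trainable parameters for one weight matrix when it is updated directly versus through a rank-r pair of factors. The matrix size (4096 x 4096) and rank (16) below are illustrative assumptions, not values taken from YuE's configuration.

```python
from typing import Tuple


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Return (full, lora) trainable parameter counts for one weight matrix."""
    full = d_out * d_in           # updating W (d_out x d_in) directly
    lora = rank * (d_in + d_out)  # updating A (rank x d_in) and B (d_out x rank)
    return full, lora


# Illustrative dimensions only (assumed, not from the YuE repo).
full, lora = lora_param_counts(4096, 4096, 16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

With these assumed numbers the adapter trains 131,072 parameters instead of 16,777,216 for that matrix, which is under 1% of the original count.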
Technical strengths and design tradeoffs
YuE’s two-stage LM architecture is its most distinguishing technical trait. Using a large 7B parameter LM for initial token generation allows capturing long-range dependencies and complex musical phrasing. The subsequent 1B parameter LM refines the output, addressing quality bottlenecks that often occur in generative audio models.
This design balances model scale and inference cost. The 7B LM is resource-intensive, so handing refinement to a much smaller model is a pragmatic way to improve output quality without running a second full-size model.
Another strong point is the dual-track audio in-context learning, which lets the model adapt vocal style and genre dynamically based on example prompts. This approach avoids heavier fine-tuning cycles and makes style transfer more accessible.
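Conceptually, audio in-context learning means placing tokens from a reference clip in front of the generation prompt. The sketch below shows one way that could look. The function name, the frame-by-frame interleaving of the two tracks, and the ordering of reference tokens before lyrics are all assumptions made for illustration, not YuE's documented prompt format.

```python
from typing import List, Sequence


def build_icl_prompt(
    ref_vocal: Sequence[int],
    ref_instrumental: Sequence[int],
    lyrics_tokens: Sequence[int],
    start: int,
    end: int,
) -> List[int]:
    """Prepend a window of reference audio tokens to the lyrics prompt."""
    if len(ref_vocal) != len(ref_instrumental):
        raise ValueError("reference tracks must be time-aligned")
    if not 0 <= start < end <= len(ref_vocal):
        raise ValueError("invalid reference window")
    prompt: List[int] = []
    # Assumption: interleave the two tracks frame by frame so both are
    # visible to the model at every time step of the reference window.
    for v, a in zip(ref_vocal[start:end], ref_instrumental[start:end]):
        prompt.extend((v, a))
    prompt.extend(lyrics_tokens)
    return prompt


# Placeholder token ids, chosen only to make the layout easy to read.
print(build_icl_prompt([1, 2, 3], [10, 20, 30], [7, 8], start=1, end=3))
```

Because the reference only appears in the prompt, no weights change, which is what makes the style transfer "zero-shot".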
However, the tradeoff includes significant generation latency. According to the README, an H800 GPU takes 150 seconds to generate 30 seconds of audio, whereas an RTX 4090 takes about 360 seconds. This means real-time or low-latency applications are currently out of reach.
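Those README timings translate into a real-time factor (generation seconds per second of audio). The snippet below does the arithmetic and extrapolates to a three-minute song. The extrapolation assumes cost scales linearly with audio length, which is a simplification and may understate the time for long contexts.

```python
def real_time_factor(gen_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute needed per second of generated audio."""
    return gen_seconds / audio_seconds


# Timings quoted from the README: seconds to generate 30 s of audio.
README_TIMINGS = {"H800": 150.0, "RTX 4090": 360.0}

for gpu, secs in README_TIMINGS.items():
    rtf = real_time_factor(secs, 30.0)
    # Assumption: linear scaling with audio length.
    minutes_for_song = rtf * 180.0 / 60.0
    print(f"{gpu}: {rtf:.0f}x slower than real time, "
          f"~{minutes_for_song:.0f} min for a 3-minute song")
```

That works out to 5x real time on the H800 and 12x on the RTX 4090, or roughly 15 and 36 minutes for three minutes of audio under the linear assumption.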
The repo addresses VRAM constraints with optimized Gradio UIs that work on 8GB GPUs and community contributions like YuE-exllamav2 for lighter deployments. Still, running the full 7B model comfortably requires high-end hardware.
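A rough lower bound on memory helps explain why the full model is demanding. The estimate below covers weights only, using the nominal 7B and 1B parameter counts. It ignores activations, the KV cache, and any audio codec, and the precisions listed are generic options rather than YuE's documented settings.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1}


def weight_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


# Nominal parameter counts from the article; real checkpoints may differ.
for name, n_params in (("7B stage", 7e9), ("1B stage", 1e9)):
    for precision, nbytes in BYTES_PER_PARAM.items():
        print(f"{name} @ {precision}: {weight_gib(n_params, nbytes):.1f} GiB")
```

At 16-bit precision the 7B stage alone needs about 13 GiB for weights, which is why 8GB cards depend on the optimized or quantized community builds.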
The code is pragmatic, with a clear separation between the large and small LM stages and modular support for LoRA finetuning. Model files are managed via git-lfs because of their size. The Python stack matches typical ML research tooling, so it is accessible to practitioners familiar with the PyTorch and Hugging Face ecosystems.
Installation and quickstart
The repository provides platform-specific quickstart instructions:
Windows users
- For a one-click installer, use Pinokio.
- To use Gradio with Docker, see: YuE-for-Windows
Linux/WSL users
- For a quick start, watch the video tutorial by Fahd.
- If you’re new to machine learning or the command line, this video is highly recommended.
To use a GUI/Gradio interface, check out:
- YuE-exllamav2-UI
- YuEGP
- YuE-Interface
1. Install environment and dependencies
Make sure FlashAttention 2 is installed correctly; it reduces VRAM usage.
# install cuda >= 11.8
conda install pytorch torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install -r <(curl -sSL https://raw.githubusercontent.com/multimodal-art-projection/YuE/main/requirements.txt)
# Make sure you have git-lfs installed (https://git-lfs.com)
These steps set up the Python environment with the required ML libraries. git-lfs is needed as well because the large model files are distributed through it.
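After installing, a quick sanity check can confirm the pieces mentioned above are visible. The script below is a hypothetical helper, not part of the YuE repository. It only checks that the `torch` and `flash_attn` packages can be found, that a `git-lfs` executable is on the PATH, and whether PyTorch reports a usable CUDA device.

```python
import importlib.util
import shutil
from typing import Dict


def check_environment() -> Dict[str, bool]:
    """Report which prerequisites are visible from this Python interpreter."""
    status = {
        "torch": importlib.util.find_spec("torch") is not None,
        "flash_attn": importlib.util.find_spec("flash_attn") is not None,
        "git-lfs": shutil.which("git-lfs") is not None,
        "cuda": False,
    }
    if status["torch"]:
        try:
            import torch

            status["cuda"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken install can fail at import time; treat it as no CUDA.
            status["cuda"] = False
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:<10} {'ok' if ok else 'MISSING'}")
```

A "MISSING" entry for `flash_attn` does not necessarily prevent generation, but per the note above it is the piece that keeps VRAM usage down.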
Verdict
YuE is a solid open-source foundation model for researchers and developers interested in lyrics-to-song generation with flexible style transfer. Its two-stage architecture is a practical tradeoff balancing model size and output quality.
However, hardware requirements and generation speed limit its use to experimental and non-real-time scenarios unless you have access to high-end GPUs. The LoRA finetuning support and community-driven lighter interfaces improve accessibility.
If you want to experiment with generative music models beyond black-box commercial APIs, YuE offers a transparent and extensible starting point. It’s particularly relevant for ML practitioners comfortable with Python and model finetuning, and those curious about audio in-context learning techniques.
For real-time or production deployment, expect to face challenges in latency and resource consumption. But for research, prototyping, and creative exploration, YuE’s architecture and open license make it worth exploring.
→ GitHub Repo: multimodal-art-projection/YuE ⭐ 6,233 · Python