UniverSR Demo

UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching

ICASSP 2026
Yonsei University

Abstract

In this paper, we present a vocoder-free framework for audio super-resolution that employs a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients. Unlike conventional two-stage diffusion-based approaches that predict a mel-spectrogram and then rely on a pre-trained neural vocoder to synthesize waveforms, our method directly reconstructs waveforms via the inverse Short-Time Fourier Transform (iSTFT), thereby eliminating the dependence on a separate vocoder. This design not only simplifies end-to-end optimization but also overcomes a critical bottleneck of two-stage pipelines, where the final audio quality is fundamentally constrained by vocoder performance. Experiments show that our model consistently produces high-fidelity 48 kHz audio across diverse upsampling factors, achieving state-of-the-art performance on both speech and general audio datasets.
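The vocoder-free idea in the abstract can be illustrated with a minimal NumPy sketch: once a complex spectrogram is available, the waveform follows from the iSTFT alone, with no learned vocoder in between. The FFT size, hop length, Hann window, and the two-channel (real, imaginary) layout are illustrative assumptions rather than settings taken from the paper, and the "prediction" here is simply the STFT of a known signal standing in for model output.

```python
import numpy as np

N_FFT, HOP = 512, 128  # assumed STFT settings, not taken from the paper


def _window(n_fft):
    # Periodic Hann window
    return np.hanning(n_fft + 1)[:-1]


def stft(x, n_fft=N_FFT, hop=HOP):
    """Complex STFT with shape (frames, n_fft // 2 + 1)."""
    win = _window(n_fft)
    x = np.pad(x, n_fft)  # zero-pad both ends so every sample is fully overlapped
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec, n_fft=N_FFT, hop=HOP):
    """Weighted overlap-add inverse of stft() above."""
    win = _window(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * win
    n_frames = frames.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop:i * hop + n_fft] += frames[i]
        norm[i * hop:i * hop + n_fft] += win ** 2
    out /= np.maximum(norm, 1e-8)  # undo the squared-window weighting
    return out[n_fft:]  # drop the leading padding


rng = np.random.default_rng(0)
sr = 48_000
wave = rng.standard_normal(sr // 10)  # 0.1 s of noise as a stand-in signal

spec = stft(wave)
# Stand-in for a model output: real and imaginary parts as two real channels
# (this channel layout is an assumption for illustration).
pred = np.stack([spec.real, spec.imag])

# Vocoder-free reconstruction: recombine the channels and apply the iSTFT.
recon = istft(pred[0] + 1j * pred[1])[:len(wave)]
max_err = float(np.max(np.abs(recon - wave)))
```

Because the transform pair is invertible, any error in the final audio would come from the predicted coefficients themselves rather than from a separate synthesis stage.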


Pipeline of UniverSR

[Figure: UniverSR pipeline]

Overview of UniverSR: (a) training stage, (b) inference stage, (c) vector field estimator architecture, and (d) feature encoder architecture. The ODE solver consists of the feature encoder and the vector field estimator.
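The inference stage can be pictured as numerically integrating an ODE whose velocity is given by the vector field estimator. The sketch below is a toy illustration under stated assumptions: a Gaussian noise prior, a straight-line probability path, a fixed-step Euler solver, and ten steps are all choices made here, not details confirmed by this page. The function `toy_vector_field` is hypothetical. It is an oracle that is handed the target directly, whereas a trained estimator would only see encoded features of the low-resolution input.

```python
import numpy as np


def euler_sample(vector_field, x0, cond, n_steps=10):
    """Integrate dx/dt = vector_field(x, t, cond) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = k * h
        x = x + h * vector_field(x, t, cond)
    return x


def toy_vector_field(x, t, cond):
    # Hypothetical oracle for a straight-line path x_t = (1 - t) * x0 + t * x1:
    # the velocity that carries x to x1 by t = 1 is (x1 - x) / (1 - t).
    # Here cond plays the role of x1; a learned estimator would not have access to it.
    return (cond - x) / (1.0 - t)


rng = np.random.default_rng(0)
# Assumed layout: (real/imag channels, frequency bins, frames)
target = rng.standard_normal((2, 257, 8))
noise = rng.standard_normal(target.shape)  # sample from the assumed Gaussian prior

sample = euler_sample(toy_vector_field, noise, target, n_steps=10)
```

With this oracle field the Euler iterates land on the target at the final step, which makes the sketch easy to check. With a learned estimator, the number of steps would trade off speed against accuracy.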


Audio Super Resolution in Speech Domain

References: Ground Truth, Ground Truth (Vocoded), each shown with its spectrogram
Upsampling settings: 8 → 48 kHz, 12 → 48 kHz, 16 → 48 kHz, 24 → 48 kHz
Systems compared: Input, AudioSR, FlashSR, UniverSR (Proposed)

Audio Super Resolution in Music Domain

References: Ground Truth, Ground Truth (Vocoded), each shown with its spectrogram
Upsampling settings: 8 → 48 kHz, 12 → 48 kHz, 16 → 48 kHz, 24 → 48 kHz
Systems compared: Input, AudioSR, FlashSR, UniverSR (Proposed)

Audio Super Resolution in Sound Effect Domain

References: Ground Truth, Ground Truth (Vocoded), each shown with its spectrogram
Upsampling settings: 8 → 48 kHz, 12 → 48 kHz, 16 → 48 kHz, 24 → 48 kHz
Systems compared: Input, AudioSR, FlashSR, UniverSR (Proposed)

Comparison with Speech Super Resolution Models

References: Ground Truth, Ground Truth (Vocoded), each shown with its spectrogram
Upsampling settings: 8 → 48 kHz, 12 → 48 kHz, 16 → 48 kHz, 24 → 48 kHz
Systems compared: Input, Fre-Painter, FlowHigh, NU-Wave2, UDM+, UniverSR (Proposed)