UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching

ICASSP 2026

Woongjib Choi, Sangmin Lee, Hyungseob Lim, Hong-Goo Kang

Yonsei University

Abstract

In this paper, we present a vocoder-free framework for audio super-resolution that employs a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients. Unlike conventional two-stage diffusion-based approaches that predict a mel-spectrogram and then rely on a pre-trained neural vocoder to synthesize waveforms, our method directly reconstructs waveforms via the inverse Short-Time Fourier Transform (iSTFT), thereby eliminating the dependence on a separate vocoder. This design not only simplifies end-to-end optimization but also overcomes a critical bottleneck of two-stage pipelines, where the final audio quality is fundamentally constrained by vocoder performance. Experiments show that our model consistently produces high-fidelity 48 kHz audio across diverse upsampling factors, achieving state-of-the-art performance on both speech and general audio datasets.
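To make the vocoder-free idea concrete, the following minimal sketch shows that complex-valued STFT coefficients can be mapped back to a waveform with a plain iSTFT, with no learned vocoder involved. It is an illustration under assumptions, not the authors' implementation: the FFT size, hop length, Hann window, and the function names `stft` and `istft` are hypothetical choices made here, and the paper's actual STFT configuration is not stated in this section.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Complex STFT of a 1-D signal (assumed settings, not the paper's)."""
    window = np.hanning(n_fft + 1)[:-1]           # periodic Hann window
    x = np.pad(x, n_fft // 2, mode="reflect")     # centre the frames
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)            # shape: (frames, n_fft // 2 + 1)

def istft(spec, n_fft=512, hop=128, length=None):
    """Inverse STFT by windowed overlap-add with squared-window normalisation."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    n_frames = spec.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop:i * hop + n_fft] += frames[i]
        norm[i * hop:i * hop + n_fft] += window ** 2
    out = out / np.maximum(norm, 1e-8)            # guard against zero overlap weight
    out = out[n_fft // 2:]                        # undo the centring pad
    return out if length is None else out[:length]

# Toy signal: 0.1 s of a 440 Hz tone sampled at 48 kHz.
sr = 48000
t = np.arange(4800) / sr
wave = np.sin(2 * np.pi * 440.0 * t)

spec = stft(wave)                                 # complex spectral coefficients
recon = istft(spec, length=len(wave))             # waveform recovered by iSTFT alone
max_err = np.max(np.abs(recon - wave))
```

In this toy setting the round trip is lossless up to floating-point error, which is why a model that generates the complex coefficients directly can rely on the iSTFT for waveform synthesis instead of a separately trained vocoder.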

Pipeline of UniverSR

Overall illustration of UniverSR, showing (a) the training stage, (b) the inference stage, (c) the vector field estimator architecture, and (d) the feature encoder architecture. Specifically, the ODE solver consists of the feature encoder and the vector field estimator.
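The inference stage in (b) can be pictured as numerically integrating an ODE from noise towards the target spectral coefficients. The sketch below is a hedged toy version of that loop. It assumes a fixed-step Euler solver and a linear probability path, neither of which is confirmed by this section. The names `euler_sample` and `toy_vector_field` are hypothetical, and the closed-form vector field stands in for the learned estimator, with the conditioning argument standing in for the feature encoder output.

```python
import numpy as np

def euler_sample(vector_field, x0, cond, n_steps=10):
    """Integrate dx/dt = v(x, t, cond) from t = 0 to t = 1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt                                # t never reaches 1 inside the loop
        x = x + dt * vector_field(x, t, cond)
    return x

def toy_vector_field(x, t, cond):
    """Stand-in for a learned estimator (assumption).

    For the linear path x_t = (1 - t) * x0 + t * x1, the velocity that
    carries x_t to x1 is (x1 - x_t) / (1 - t). Here `cond` is simply the
    target itself; a real model would only see features of the
    low-resolution input.
    """
    return (cond - x) / (1.0 - t)

rng = np.random.default_rng(0)
shape = (38, 257)                                 # (frames, frequency bins), arbitrary
target = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

x1_hat = euler_sample(toy_vector_field, noise, cond=target, n_steps=10)
```

Because the toy field is exact for the linear path, the Euler loop lands on the target at t = 1. With a learned estimator the result would only approximate the target, and the number of steps would trade speed against accuracy.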