WAV2LEV: Predicting Levenshtein Edit Operation Sequences For Fine-Grained Estimation of Automatic Speech Recognition Error

Published in ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026

Recommended citation: Harvey Donnelly, Ken Shi and Gerald Penn, "WAV2LEV: Predicting Levenshtein Edit Operation Sequences For Fine-Grained Estimation of Automatic Speech Recognition Error," ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2026, pp. 15022-15026, doi: 10.1109/ICASSP55912.2026.11462338. https://ieeexplore.ieee.org/abstract/document/11462338

The predominant method for scoring the quality of automatic speech recognition (ASR) transcripts when ground-truth labels are not available is to predict the word error rate (WER) from the corresponding audio segment. We propose WAV2LEV, a novel paradigm for WER estimation which predicts the underlying sequences of Levenshtein edit operations (substitutions, deletions, insertions and matches) from which the WER can be computed. This approach offers more fine-grained, token-level error estimation than previous work without compromising on WER estimation performance. To support this investigation, we present Mini-CNoiSY (Miniature Clean-Noisy Speech from YouTube), a bespoke 354-hour noisy speech corpus which ensures confidence in ground-truth labeling and captures a diverse range of noise artifacts that degrade ASR performance. Our results show that WAV2LEV achieves near state-of-the-art performance for the task of WER estimation with a root mean square error (RMSE) of 0.1488 and a Pearson correlation coefficient (PCC) of 89.71%, while generating predictions of ASR error that are more informative and fine-grained than those of direct WER estimators.
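To make the link between edit operation sequences and WER concrete, here is a minimal Python sketch. It is not the authors' implementation: WAV2LEV predicts the operation sequence from audio, whereas this sketch derives one valid alignment from a known reference and hypothesis using the standard dynamic-programming Levenshtein algorithm. The function names `edit_ops` and `wer_from_ops` and the single-letter labels `M`, `S`, `D`, `I` are illustrative assumptions. WER is taken as the usual (S + D + I) / (S + D + M), where the denominator is the reference length.

```python
from collections import Counter


def edit_ops(ref, hyp):
    """Return one minimum-cost Levenshtein alignment of hyp against ref.

    Labels (hypothetical naming): 'M' match, 'S' substitution,
    'D' deletion (reference word missing), 'I' insertion (extra word).
    """
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )

    # Backtrace from the bottom-right corner to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                ops.append("M" if cost == 0 else "S")
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    ops.reverse()
    return ops


def wer_from_ops(ops):
    """Compute WER from an edit operation sequence."""
    counts = Counter(ops)
    ref_len = counts["M"] + counts["S"] + counts["D"]
    if ref_len == 0:
        raise ValueError("reference is empty; WER is undefined")
    return (counts["S"] + counts["D"] + counts["I"]) / ref_len


ref = "the cat sat on the mat".split()
hyp = "the cat sit on mat today".split()
ops = edit_ops(ref, hyp)
print(ops, wer_from_ops(ops))  # WER here is 0.5 (3 errors over 6 reference words)
```

Several minimum-cost alignments can exist for the same pair, so the exact operation sequence depends on the tie-breaking order in the backtrace, while the resulting WER does not. The operation sequence also shows where the errors fall and of which type, which is the extra information a single WER value cannot provide.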

Download paper here
