LIP2: Lip-to-Speech Synthesis

Comparison of multi-speaker (LRS3) and single-speaker voice-converted (LRS3-VC-LJS) training with an AV-HuBERT encoder and a Flow decoder, evaluated with the UnivNet vocoder on the LRS3 test set (1321 samples).

Method             WER ↓    SECS ↑    ESTOI ↑
Literature: LRS3-TED (AV-HuBERT)
RESOUND '25        20.1%    0.777     0.423
V2SFlow-A '25      28.5%    0.851
V2SFlow-V '25      28.5%    0.664
IntelL2S '23       27.7%    0.750     0.396
Ours: LRS3-TED (AV-HuBERT + Flow, GL vocoder)*
LIP2 (LRS3)        28.1%    0.679     0.383
LIP2 (VC-LJS)      28.1%    0.810†    0.432†

* Scores computed with Griffin-Lim vocoder for fair comparison with literature. Demo audio uses UnivNet vocoder for better perceptual quality.
† SECS and ESTOI for VC-LJS are computed against voice-converted (single-speaker) reference audio, not original multi-speaker audio. These values are not directly comparable to other rows.
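
The † note turns on what SECS is measured against. SECS is commonly computed as the cosine similarity between speaker-encoder embeddings of the synthesized and reference audio. The sketch below is a minimal illustration under that assumption; the speaker encoder used for this page is not stated, and the three-dimensional vectors are made-up stand-ins for real embeddings, not measured values.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (illustrative only, not real data).
synth_vc = [0.9, 0.1, 0.0]      # speech synthesized by the VC-LJS model
ref_vc = [1.0, 0.0, 0.0]        # voice-converted (single-speaker) reference
ref_original = [0.2, 0.9, 0.4]  # original multi-speaker reference

# The same synthesized audio scores differently depending on the reference.
secs_vs_vc = cosine_similarity(synth_vc, ref_vc)
secs_vs_original = cosine_similarity(synth_vc, ref_original)
```

Because the score depends on which reference embedding is used, a SECS value measured against voice-converted references sits on a different footing from one measured against the original speakers.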

Good: Low WER Samples (0% error at noise scale 0.1)
"I FORGIVE YOU AND I DO NOT HATE YOU"
8iSpNVj5KMw/00002 · 2.6s
Noise Scale 0.1
Ground Truth
LRS3: WER 0% · SECS 0.70 · ESTOI 0.301
VC-LJS: WER 0% · SECS 0.80 · ESTOI 0.311
Noise Scale 1.0
LRS3: WER 0% · SECS 0.66 · ESTOI 0.255
VC-LJS: WER 0% · SECS 0.86 · ESTOI 0.230
"I THINK WHAT THAT MEANS IS THAT PEOPLE JUST COULDN'T SEE WHAT WAS IN FRONT OF THEM"
rP7nmdDA1Fg/00006 · 4.0s
Noise Scale 0.1
Ground Truth
LRS3: WER 0% · SECS 0.63 · ESTOI 0.321
VC-LJS: WER 0% · SECS 0.91 · ESTOI 0.403
Noise Scale 1.0
LRS3: WER 0% · SECS 0.72 · ESTOI 0.207
VC-LJS: WER 0% · SECS 0.89 · ESTOI 0.367
"PROGRESSIVE MOVEMENTS ARE GROWING AND RESISTING WITH TREMENDOUS COURAGE"
JSSc7hYKstI/00008 · 4.9s
Noise Scale 0.1
Ground Truth
LRS3: WER 0% · SECS 0.65 · ESTOI 0.399
VC-LJS: WER 0% · SECS 0.89 · ESTOI 0.576
Noise Scale 1.0
LRS3: WER 0% · SECS 0.75 · ESTOI 0.362
VC-LJS: WER 0% · SECS 0.91 · ESTOI 0.495
"WE CAN CREATE A DECENTRALIZED DATABASE THAT HAS THE SAME EFFICIENCY OF A MONOPOLY"
RplnSVTzvnU/00003 · 5.7s
Noise Scale 0.1
Ground Truth
LRS3: WER 0% · SECS 0.69 · ESTOI 0.472
VC-LJS: WER 0% · SECS 0.91 · ESTOI 0.518
Noise Scale 1.0
LRS3: WER 7% · SECS 0.79 · ESTOI 0.372
VC-LJS: WER 7% · SECS 0.91 · ESTOI 0.463
Medium: Moderate WER Samples (1-3 errors at noise scale 0.1)
"BUT IT'S NOT ABOUT FIRE AND BRIMSTONE EITHER"
ROgFmb3oTLo/00005 · 2.7s
Noise Scale 0.1
Ground Truth
LRS3: WER 13% · SECS 0.61 · ESTOI 0.492
VC-LJS: WER 25% · SECS 0.88 · ESTOI 0.482
Noise Scale 1.0
LRS3: WER 38% · SECS 0.73 · ESTOI 0.363
VC-LJS: WER 25% · SECS 0.89 · ESTOI 0.412
"BUT THE SECOND OF THE TRANSFORMATIONS THE CLIMATE TRANSFORMATIONS WE HAVE TO DECIDE TO DO"
Gmai4zkKNcM/00006 · 5.2s
Noise Scale 0.1
Ground Truth
LRS3: WER 20% · SECS 0.82 · ESTOI 0.508
VC-LJS: WER 33% · SECS 0.82 · ESTOI 0.502
Noise Scale 1.0
LRS3: WER 33% · SECS 0.77 · ESTOI 0.422
VC-LJS
"TEXTING HAS A 100 PERCENT OPEN RATE"
LiUClSItcy0/00001 · 3.2s
Noise Scale 0.1
Ground Truth
LRS3: WER 29% · SECS 0.60 · ESTOI 0.391
VC-LJS: WER 43% · SECS 0.88 · ESTOI 0.588
Noise Scale 1.0
LRS3: WER 57% · SECS 0.69 · ESTOI 0.290
VC-LJS: WER 43% · SECS 0.89 · ESTOI 0.514
Hard: High WER Samples (>3 errors at noise scale 0.1)
"SO WE NEED THE SOLUTIONS AND THESE PEOPLE PLAYING THE GAME THEY ARE TRYING OUT"
qYUmI5kGsYk/00001 · 6.0s
Noise Scale 0.1
Ground Truth
LRS3: WER 33% · SECS 0.70 · ESTOI 0.303
VC-LJS: WER 33% · SECS 0.88 · ESTOI 0.444
Noise Scale 1.0
LRS3: WER 33% · SECS 0.78 · ESTOI 0.239
VC-LJS: WER 40% · SECS 0.90 · ESTOI 0.352
"MORE TRUST IS NOT AN INTELLIGENT AIM IN THIS LIFE"
1PNX6MSdVsk/00005 · 3.7s
Noise Scale 0.1
Ground Truth
LRS3: WER 90% · SECS 0.57 · ESTOI 0.294
VC-LJS: WER 90% · SECS 0.81 · ESTOI 0.282
Noise Scale 1.0
LRS3: WER 70% · SECS 0.67 · ESTOI 0.256
VC-LJS: WER 80% · SECS 0.85 · ESTOI 0.231
"WE'RE GOING TO START PUTTING AN ENTIRE LAYER OF DIGITAL INFORMATION ON THE REAL WORLD"
H9ZOpQzjukY/00001 · 4.6s
Noise Scale 0.1
Ground Truth
LRS3: WER 40% · SECS 0.74 · ESTOI 0.391
VC-LJS: WER 20% · SECS 0.90 · ESTOI 0.410
Noise Scale 1.0
LRS3: WER 13% · SECS 0.80 · ESTOI 0.338
VC-LJS: WER 20% · SECS 0.91 · ESTOI 0.338
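
The per-sample WER figures and the Good / Medium / Hard groupings above can be related through a word-level edit distance. The sketch below is a minimal illustration, assuming WER is the number of word substitutions, insertions, and deletions between the reference transcript and an ASR transcript of the synthesized audio, divided by the reference length. The ASR system is not named on this page, the function names are hypothetical, and the hypothesis string in the example is invented for illustration.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance between two transcripts."""
    ref = reference.upper().split()
    hyp = hypothesis.upper().split()
    # At the start of iteration i, prev[j] is the distance
    # between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a fraction of the reference length."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference transcript is empty")
    return word_errors(reference, hypothesis) / n_ref


def bucket(errors: int) -> str:
    """Group a sample by error count, mirroring the section headings."""
    if errors == 0:
        return "Good"
    if errors <= 3:
        return "Medium"
    return "Hard"


reference = "BUT IT'S NOT ABOUT FIRE AND BRIMSTONE EITHER"
hypothesis = "BUT IT'S NOT ABOUT FIRE AND BRIMSTONE"  # invented: last word dropped
errors = word_errors(reference, hypothesis)
rate = wer(reference, hypothesis)
group = bucket(errors)
```

With this eight-word reference, one error gives 1/8 = 12.5%, which is consistent with the 13% shown for that sample, and an error count between 1 and 3 places it in the Medium group.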