A subtitle generator for Japanese Adult Videos.
Transformer-based ASR architectures like Whisper suffer significant performance degradation when applied to the spontaneous and noisy domain of JAV. This degradation is driven by specific acoustic and temporal characteristics that defy the statistical distributions of standard training data.
Features
- Non-Verbal Vocalisations (NVVs): A high density of physiological sounds (heavy breathing, gasps, sighs) and "obscene sounds" that lack clear harmonic structure.
- Spectral Mimicry: These vocalizations often possess "curve-like spectrum features" that mimic the formants of fricative consonants or Japanese syllables (e.g., fu), acting as accidental adversarial examples that trick the model into recognizing words where none exist.
- Extreme Dynamics: Volatile shifts in audio intensity, ranging from faint whispers (sasayaki) to high-decibel screams, which confuse standard gain control and attention mechanisms.
- Linguistic Variance: The prevalence of theatrical onomatopoeia and Role Language (Yakuwarigo) containing exaggerated intonations and slang absent from standard corpora.
- Acoustic Filtering: Deploys scene-based segmentation and VAD clamping under the hypothesis that distinct scenes possess uniform acoustic characteristics, ensuring the model processes coherent audio environments rather than mixed streams [1-3].
- Linguistic Adaptation: Normalizes domain-specific terminology and preserves onomatopoeia, specifically correcting dialect-induced tokenization errors (e.g., in Kansai-ben) that standard BPE tokenizers fail to parse [4, 5].
- Defensive Decoding: Tunes log-probability thresholding and no_speech_threshold to systematically discard low-confidence outputs (hallucinations), while utilizing regex filters to clean non-lexical markers (e.g., (moans)) from the final subtitle track [6, 7].
Categories
Speech RecognitionLicense
MIT LicenseFollow WhisperJAV
Rate This Project
Login To Rate This Project
User Reviews
-
good subtitle generator