Transcription format for forced alignment, segmentation, diarization

Speech Recognition Toolkit

Forum: Help

Created: 2021-12-08

Updated: 2021-12-08

Leo - 2021-12-08

Hi, I am wondering if anyone could advise on the best format for manually transcribing audio files with two speakers (about 250 recordings of ~20 min each) so that we can ultimately run forced alignment, segmentation at speaker turns, and speaker diarization in CMUSphinx. Is it OK to transcribe utterances verbatim in a Microsoft Word document and label each speaker? Do we need to indicate timings at all in these text files?
For context, the purpose of the forced alignment, segmentation, and diarization is ultimately to examine vocal prosody and acoustic characteristics of speech in both speakers using open-source algorithms (currently both speakers are on one audio track). We are about to start manually transcribing these recordings and hope to make things as easy as possible for ourselves later, when we're ready to pre-process the recordings. Thank you for any guidance!
