Leo - 2021-12-08

Hi, I am wondering if anyone can advise on the best format for manually transcribing audio files with two speakers (about 250 recordings of ~20 min each), so that we can ultimately use forced alignment, segmentation at speaker turns, and speaker diarization in CMUSphinx. Is it OK to transcribe utterances verbatim in a Microsoft Word document and label each speaker? Do we need to indicate timings at all in these text files?
For context, the purpose of the forced alignment, segmentation, and diarization is ultimately to examine vocal prosody and acoustic characteristics of speech in both speakers using open-source algorithms (currently both speakers are on one audio track). We are about to start manually transcribing these recordings and hope to make things as easy as possible for ourselves later, when we are ready to pre-process the recordings. Thank you for any guidance!