File         Date        Author          Commit
lib          2016-03-31  Matthew Purver  [0ef166] POI updated 3.6 -> 3.14
src          2016-03-31  Matthew Purver  [0ef166] POI updated 3.6 -> 3.14
.gitignore   2015-11-13  Matthew Purver  [f3f883] nov 2015 update
README.txt   2015-11-13  Matthew Purver  [f3f883] nov 2015 update
build.sh     2014-02-27  Matthew Purver  [361cca] remove unnecessary stuff, although you're welco...
corpus.sh    2015-11-13  Matthew Purver  [f3f883] nov 2015 update
license.txt  unknown
run.sh       2014-02-27  Matthew Purver  [361cca] remove unnecessary stuff, although you're welco...

Read Me

1. Import your corpus. Either:

   a) create a new subclass of qmul.corpus.DialogueCorpus. The constructor
   should read in the data from your external source, and use the existing
   addDialogue(), addTurn(), addSent() methods to build the corpus. The main()
   method should call this, and save the corpus to file using the existing
   saveToFile() method. See BNCCorpus, SwitchboardCorpus, DCPSECorpus for
   examples.

   b) put your external data into a simple text format like that used by
   qmul.corpus.TranscriptCorpus or qmul.corpus.SwitchboardTranscriptCorpus, and
   use that class to read it in (and save it to file, via main()). This only
   works for corpora without syntactic information - these text formats don't
   understand trees. TranscriptCorpus expects text files with one sentence per
   line in the following format:
   
     DialogueActTag SpeakerID_StartTime_EndTime [transcript]

   If you're doing it this way, you can run the TranscriptCorpus class from the
   command line:

     ./corpus.sh CORPUS_DATA_DIR CORPUS_NAME CORPUS_GENRE
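
If you are preparing files for option (b), it can help to check them before
import. The following is a minimal Python sketch of a checker for the line
format above; it is not part of the toolkit and does not reproduce what
TranscriptCorpus does internally. The names TranscriptLine and
parse_transcript_line are hypothetical. It assumes the fields are separated by
whitespace, that times can be kept as plain text, and that the square brackets
around the transcript may or may not appear literally (one enclosing pair is
stripped if present). Check these assumptions against TranscriptCorpus itself.

```python
from typing import NamedTuple


class TranscriptLine(NamedTuple):
    """One sentence from a transcript file (illustrative type, not a toolkit class)."""
    dialogue_act: str
    speaker: str
    start: str  # kept as text: the README does not specify the time format
    end: str
    text: str


def parse_transcript_line(line: str) -> TranscriptLine:
    """Split 'DialogueActTag SpeakerID_StartTime_EndTime [transcript]' into fields."""
    parts = line.strip().split(None, 2)
    if len(parts) < 2:
        raise ValueError(f"expected at least two fields: {line!r}")
    tag, meta = parts[0], parts[1]
    text = parts[2] if len(parts) == 3 else ""
    # Split from the right so a speaker ID that itself contains '_' stays intact.
    try:
        speaker, start, end = meta.rsplit("_", 2)
    except ValueError:
        raise ValueError(f"expected SpeakerID_StartTime_EndTime, got {meta!r}") from None
    # Assumption: the brackets may appear literally; drop one enclosing pair.
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1]
    return TranscriptLine(tag, speaker, start, end, text.strip())
```

For example, parse_transcript_line("sd A_0.0_1.5 [hello there]") gives a record
with speaker "A" and text "hello there" (the tag, times and words here are
made-up sample values).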

2. If you are interested in syntactic similarity, but your corpus doesn't
   already contain syntax trees, you'll need to parse it. The
   qmul.corpus.CorpusParser class can do this, via the Stanford, C&C or RASP
   parsers. Its methods are static; see the main() method in
   qmul.corpus.BNCCorpus for an example of how to call it. Remember to save the
   corpus to file again once you've parsed it.

3. Decide on the similarity measure you want to use. The existing measures in
   qmul.align provide general lexical and syntactic similarity measures; if
   that's all you want, you don't need to do anything here except decide which
   of them to use. If you want something else, though, you'll need to define a
   new subclass of qmul.util.similarity.SimilarityMeasure.
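
To give a feel for what a lexical similarity measure computes, here is a small
Python sketch of cosine similarity over word counts. This is an illustration
only: it is not claimed to be the measure implemented in qmul.align, and the
function name lexical_cosine is hypothetical. A real measure for this toolkit
would be a Java subclass of SimilarityMeasure, whose required methods you
should take from that class.

```python
import math
from collections import Counter
from typing import Sequence


def lexical_cosine(a: Sequence[str], b: Sequence[str]) -> float:
    """Cosine similarity between the word-count vectors of two token lists."""
    ca, cb = Counter(a), Counter(b)
    # Only words present in both lists contribute to the dot product.
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    norm = norm_a * norm_b
    # An empty token list has no direction; treat its similarity as 0.
    return dot / norm if norm else 0.0
```

Two utterances with no words in common score 0, and two utterances with
identical word counts score 1 (up to floating-point rounding).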

4. Do you want to compare your data's observed similarity to a random baseline?
   We like to do this. If so, decide on the kind of randomisation of your corpus
   you want to compare to. The qmul.corpus.RandomCorpus class will do this for
   you, if you're happy with one of the randomisation methods it uses - see
   definitions and comments in that class. You can randomise the order of all
   utterances in the corpus; randomise the order of just one speaker's
   utterances; randomise pairing of speakers while keeping utterance order; and
   more. Again, remember to save the corpus to file once you've done this.
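
Two of the randomisations mentioned above can be sketched in a few lines of
Python. This is an illustration of the idea, not the RandomCorpus
implementation: the (speaker, text) tuple representation and the function
names shuffle_all and shuffle_speaker are assumptions made for the example,
and RandomCorpus may differ in detail (for instance, in whether it shuffles
within a dialogue or across the whole corpus).

```python
import random
from typing import List, Tuple

Utterance = Tuple[str, str]  # (speaker, text); illustrative representation


def shuffle_all(utts: List[Utterance], rng: random.Random) -> List[Utterance]:
    """Return a copy with the order of all utterances randomised."""
    out = list(utts)
    rng.shuffle(out)
    return out


def shuffle_speaker(utts: List[Utterance], speaker: str,
                    rng: random.Random) -> List[Utterance]:
    """Permute one speaker's utterances among that speaker's own positions.

    Every other speaker's utterances stay exactly where they were, so the
    turn-taking pattern of the dialogue is preserved.
    """
    slots = [i for i, (s, _) in enumerate(utts) if s == speaker]
    picked = [utts[i] for i in slots]
    rng.shuffle(picked)
    out = list(utts)
    for i, utt in zip(slots, picked):
        out[i] = utt
    return out
```

Passing a seeded generator such as random.Random(0) makes a baseline
reproducible from run to run.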

5. Now that you have an existing .corpus (or .corpus.gz) file containing your
   data, use the main() method in qmul.align.AlignmentTester to run a moving
   window across your data, calculating similarity between windows. You can
   define the size of your window; compare windows with previous windows for the
   same speaker; or with the equivalent window for the other speaker. Window
   units (for size and step) can be defined in either speaker turns or sentences
   (where a turn can consist of multiple sentences - this will depend on your
   data). You can set these options within the code or as command-line arguments
   - see the main() method and/or the run.sh script.
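
The moving-window idea can be sketched as follows in Python. This is a
simplified illustration, not the AlignmentTester code: the function names are
hypothetical, each unit (a turn or a sentence) is assumed to be a list of
tokens, speakers are ignored, and each window is simply compared with the
window before it. Jaccard overlap stands in for whichever similarity measure
you chose.

```python
from typing import Callable, List, Sequence

Unit = Sequence[str]  # one turn or sentence, as a list of tokens


def windows(units: Sequence[Unit], size: int, step: int) -> List[Sequence[Unit]]:
    """Return successive windows of `size` units, advancing by `step` units."""
    if size < 1 or step < 1:
        raise ValueError("size and step must be positive")
    return [units[i:i + size] for i in range(0, len(units) - size + 1, step)]


def window_similarities(units: Sequence[Unit], size: int, step: int,
                        sim: Callable[[List[str], List[str]], float]) -> List[float]:
    """Score each window against the one immediately before it."""
    # Flatten every window into a single token list before comparing.
    flat = [[tok for unit in w for tok in unit] for w in windows(units, size, step)]
    return [sim(prev, cur) for prev, cur in zip(flat, flat[1:])]


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Shared vocabulary divided by combined vocabulary (stand-in measure)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0
```

To compare windows for one speaker only, you could filter the units down to
that speaker's turns before calling window_similarities.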