Does anyone know exactly what formulas are used for computing the evaluation measures in Corpus Quality Assurance (CQA) and in the Corpus Benchmark Tool (CBT)? I tried to get recall, precision, and f-measure values for a small corpus I annotated, and I got different values from the two tools. This is what I know/understand so far (I might be wrong) ...
- both CBT and CQA seem to use macro-averaging (the final measure for the corpus is the average of the per-file values)
- CBT always computes a single averaged measure, while CQA computes strict, lenient, or average depending on what the user selects (see the sketch after this list for how I understand these)
- the corpus-level precision (or any measure, really) of an annotation type is the average of that measure for that type across the files
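To make my assumptions concrete, here is a minimal Python sketch of how I understand the per-file measures and the macro-averaging, assuming the usual correct/partial/missing/spurious counts from an annotation diff. The function names and the reading of "average" as half credit per partial match are my own guesses, not something I have confirmed in either tool:

```python
# A minimal sketch of my current understanding, not the actual tool code.
# Per-file, per-type counts (my assumption) come from an annotation diff:
#   correct  = exact span and type match
#   partial  = overlapping span, same type
#   missing  = key annotation with no matching response
#   spurious = response annotation with no matching key

def precision(correct, partial, spurious, mode="strict"):
    denom = correct + partial + spurious
    if denom == 0:
        return 0.0
    if mode == "strict":      # only exact matches count as correct
        return correct / denom
    if mode == "lenient":     # partial matches count as fully correct
        return (correct + partial) / denom
    # "average": mean of strict and lenient, i.e. half credit per partial
    return (correct + 0.5 * partial) / denom

def recall(correct, partial, missing, mode="strict"):
    denom = correct + partial + missing
    if denom == 0:
        return 0.0
    if mode == "strict":
        return correct / denom
    if mode == "lenient":
        return (correct + partial) / denom
    return (correct + 0.5 * partial) / denom

def macro_average(per_file_values):
    # What I believe both tools do at corpus level: plain mean over files.
    return sum(per_file_values) / len(per_file_values)

# Example: two files with invented counts for one annotation type.
files = [dict(correct=8, partial=1, spurious=1),
         dict(correct=3, partial=2, spurious=0)]
for mode in ("strict", "lenient", "average"):
    corpus_p = macro_average([precision(mode=mode, **f) for f in files])
    print(mode, round(corpus_p, 3))
```

Note that for precision, strict and lenient share the same denominator, so averaging the two is identical to giving half credit to each partial match. If CBT's averaged measure works like that, I would expect it to coincide with CQA's "average" per annotation type, which matches what I see below.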
My results from CQA and CBT are the following (I'll use precision to explain):
- file-level precision for an annotation type might be the same between CBT and CQA, but I'm not sure: I don't have the formulas, and CQA does not report this value
- overall precision per file is always higher in CBT than in CQA, regardless of strict, lenient, or average ... I inferred the formula CBT uses, but I don't know how CQA calculates it such that it yields lower values
- corpus-level precision for an annotation type is the same between CBT and CQA; it looks like the same formula is used
- overall precision for the corpus is always higher in CBT than in CQA, again regardless of strict, lenient, or average. I have a strong feeling this follows from the higher overall precision per file. One hypothesis for the gap is sketched below.
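The only explanation I can come up with that fits all four observations is a difference in how the "overall" row is aggregated across annotation types: if one tool pools the raw counts over all types before dividing (micro-averaging) while the other averages the per-type precisions (macro-averaging), the per-type numbers can agree exactly while the overall numbers still differ. I don't know which tool, if either, does which; the counts in this sketch are invented purely to show the mechanism:

```python
# Invented counts to illustrate one hypothesis: the per-type precisions
# agree, but the "overall" value differs depending on aggregation style.
counts = {
    "Person": dict(correct=9, partial=0, spurious=1),  # P = 0.90
    "Date":   dict(correct=1, partial=0, spurious=3),  # P = 0.25
}

def prec(c):
    return c["correct"] / (c["correct"] + c["partial"] + c["spurious"])

# Style 1: average the per-type precisions (macro over types).
macro = sum(prec(c) for c in counts.values()) / len(counts)

# Style 2: pool the raw counts across types, then divide (micro/pooled).
pooled = sum(c["correct"] for c in counts.values()) / sum(
    c["correct"] + c["partial"] + c["spurious"] for c in counts.values())

print(f"macro over types: {macro:.3f}")   # 0.575
print(f"pooled counts:    {pooled:.3f}")  # 0.714
```

With these invented counts the pooled value comes out higher, which would match CBT reporting higher overall precision if CBT pools and CQA macro-averages over types, but that is pure speculation on my part.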
Can anyone help me clarify this? The whole reason I switched from CBT to CQA was to get higher values: lenient measures were supposed to be higher. Instead I ended up with lower values for precision and recall, even though the f-measure went up very slightly.