Software
The scripts that will be used to evaluate system submissions against the gold standard are available on GitHub. The following releases are currently available:
To use any of the included scripts, first download the release, unpack it into a directory, and change to that directory. All scripts should be run from that directory.
The instructions below cover only the basic usage of each script. For further details on how to use a script, run it with --help as a command-line argument.
Validating Anafora XML
Before evaluating your system output, you should check to make sure it is valid Anafora XML according to the THYME schema. To test this, download the THYME schema to the same directory where you unpacked the anaforatools release and then run:
python -m anafora.validate -s thyme-schema.xml -i <system-dir>
where <system-dir> is the directory containing the Anafora XML files that you would like to validate. If the validator identifies any problems in your XML files, it will print messages such as:
WARNING:...: invalid annotation type 'XXX'
WARNING:...: invalid value 'YYY' for property 'Type' of annotation type 'EVENT'
WARNING:...: missing required property 'Source' of annotation type 'TLINK'
Evaluating System Output
To evaluate the Anafora XML files produced by your system against the manually annotated Anafora XML files, run:
python -m anafora.evaluate -r <reference-dir> -p <system-dir>
where <reference-dir> is the directory of manually annotated Anafora XML files and <system-dir> is the directory of your system-predicted XML files. Note that <reference-dir> and <system-dir> should have the same directory structure. The command prints a variety of evaluation measures that look something like:
                          ref  pred  corr     P     R    F1
TIMEX3                   1874  1031     0 0.000 0.000 0.000
TIMEX3:<span>            1874  1031   890 0.863 0.475 0.613
TIMEX3:Class             1874  1031   879 0.853 0.469 0.605
TIMEX3:Class:DATE        1340   795   677 0.852 0.505 0.634
TIMEX3:Class:DURATION     186    74    54 0.730 0.290 0.415
TIMEX3:Class:PREPOSTEXP   160   134   126 0.940 0.787 0.857
TIMEX3:Class:QUANTIFIER    56     8     6 0.750 0.107 0.188
TIMEX3:Class:SET           72    18    14 0.778 0.194 0.311
TIMEX3:Class:TIME          60     2     2 1.000 0.033 0.065
The columns identify the number of reference items, the number of system-predicted items, the number of system-predicted items that were correct (i.e., in the reference items), and the precision (P), recall (R) and F1-measure.
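As a quick sanity check on how the last three columns relate to the first three, the following sketch recomputes P, R and F1 from the counts in the TIMEX3:<span> row of the sample output. The function name precision_recall_f1 is made up for this illustration and is not part of anaforatools; returning 0.0 when a denominator is zero is also an assumption, not necessarily what the scripts do.

```python
def precision_recall_f1(n_ref, n_pred, n_corr):
    """Return (P, R, F1) from counts of reference, predicted and correct items.

    Hypothetical helper for illustration; zero denominators yield 0.0 by assumption.
    """
    precision = n_corr / n_pred if n_pred else 0.0
    recall = n_corr / n_ref if n_ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts taken from the TIMEX3:<span> row of the sample output.
p, r, f1 = precision_recall_f1(1874, 1031, 890)
print(f"{p:.3f} {r:.3f} {f1:.3f}")  # prints "0.863 0.475 0.613"
```

Precision divides the correct count by the number of predicted items, recall divides it by the number of reference items, and F1 is the harmonic mean of the two.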
A line starting with a single token, like the TIMEX3 line above, gives the performance on exactly matching entire annotations, including their type, spans and all of their properties. A line like TIMEX3:<span> gives the performance on just matching the entity type and the character offsets (ignoring the properties). The remaining lines show performance on entity properties, either overall, as in the case of TIMEX3:Class, or separated out by the different values of a property, as in TIMEX3:Class:DATE, TIMEX3:Class:DURATION, etc.
TLINKs have only been annotated on clinical notes, not on pathology notes, so you should restrict the evaluation to those notes when evaluating TLINKs. Run:
python -m anafora.evaluate -r <reference-dir> -p <system-dir> -x "(?i).*clin.*[.]xml$" --include TLINK
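To see which files the -x pattern in the command above would select, the sketch below applies the same regular expression to a few file names. The file names are invented for illustration, and the sketch assumes the pattern is applied to each file name with Python's re module; it does not reproduce how anaforatools walks the directories.

```python
import re

# The pattern passed to -x in the command above: case-insensitive,
# must contain "clin" and end in ".xml".
pattern = re.compile(r"(?i).*clin.*[.]xml$")

# Hypothetical file names, made up for this example only.
names = [
    "doc001_clinic_001.Temporal.gold.completed.xml",
    "doc001_path_001.Temporal.gold.completed.xml",
    "DOC002_CLINIC_003.xml",
    "doc003_clinic_002.txt",
]

selected = [name for name in names if pattern.search(name)]
for name in selected:
    print(name)
```

Only the first and third names are printed: the second does not contain "clin", and the fourth does not end in ".xml".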
For TLINKs, it is also possible to run a form of evaluation that takes temporal inferences into account, applying temporal closure on the reference annotations when calculating precision, and applying temporal closure on the predicted annotations when calculating recall. For such an evaluation, run:
python -m anafora.evaluate -r <reference-dir> -p <system-dir> -x "(?i).*clin.*[.]xml$" --include TLINK:Type --temporal-closure
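The idea behind closure-based scoring can be illustrated with a deliberately simplified sketch. It is not the anaforatools implementation: it assumes a single transitive relation (written here as BEFORE pairs), whereas real temporal closure must handle several relation types and their interactions. All names and the example relations are hypothetical.

```python
def transitive_closure(relations):
    """Add (a, d) whenever (a, b) and (b, d) are present, until nothing changes."""
    closure = set(relations)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def closure_precision_recall(reference, predicted):
    """Precision uses the closed reference; recall uses the closed prediction."""
    ref_closed = transitive_closure(reference)
    pred_closed = transitive_closure(predicted)
    precision = len(predicted & ref_closed) / len(predicted) if predicted else 0.0
    recall = len(reference & pred_closed) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical BEFORE relations: each pair (x, y) means "x BEFORE y".
reference = {("e1", "e2"), ("e2", "e3")}
predicted = {("e1", "e3")}

print(closure_precision_recall(reference, predicted))  # prints (1.0, 0.0)
```

In this toy case the predicted relation is not annotated directly but follows from the reference by transitivity, so it counts as correct for precision. Neither reference relation can be inferred from the prediction, so recall stays at zero.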
Troubleshooting
If you have trouble running any of the scripts, please ask for assistance on the Clinical TempEval Google Group.
If you find a bug in the scripts, please file an issue on the anaforatools issue tracker.