The scripts that will be used to evaluate system submissions against the gold standard are available on GitHub. The following releases are currently available:

To use any of the included scripts, first download the release, unpack it into a directory, and change to that directory. All scripts should be run from that directory.

The instructions below cover only the basic usage of each script. For further details on a script, run it with --help as a command-line argument.

Validating Anafora XML

Before evaluating your system output, check that it is valid Anafora XML according to the THYME schema. To do so, download the THYME schema to the directory where you unpacked the anaforatools release and then run:

python -m anafora.validate -s thyme-schema.xml -i <system-dir>

where <system-dir> is the directory containing the Anafora XML files that you would like to have validated. If the validator identifies any problems in your XML files, it will print messages such as:

WARNING:...: invalid annotation type 'XXX'
WARNING:...: invalid value 'YYY' for property 'Type' of annotation type 'EVENT'
WARNING:...: missing required property 'Source' of annotation type 'TLINK'
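
When a large system directory produces many such messages, it can help to group them by kind. The following is a minimal sketch, not part of anaforatools: summarize_warnings is a hypothetical helper, and it assumes that each message line starts with "WARNING:" and that the human-readable text follows the last ": " on the line, as in the samples above.

```python
import re
from collections import Counter

def summarize_warnings(lines):
    """Count validator warnings by kind, ignoring the quoted specifics.

    Assumption: the message text follows the last ": " on each line.
    """
    counts = Counter()
    for line in lines:
        if not line.startswith("WARNING:"):
            continue  # skip anything that is not a warning line
        message = line.rstrip("\n").rsplit(": ", 1)[-1]
        # Replace quoted names and values so similar warnings group together
        counts[re.sub(r"'[^']*'", "'*'", message)] += 1
    return counts

# Sample lines modeled on the messages shown above
sample = [
    "WARNING:...: invalid annotation type 'XXX'",
    "WARNING:...: invalid annotation type 'ZZZ'",
    "WARNING:...: missing required property 'Source' of annotation type 'TLINK'",
]
for kind, count in summarize_warnings(sample).most_common():
    print(count, kind)
```

The lines would come from the validator output that you captured yourself; how you capture it depends on your shell.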

Evaluating System Output

To evaluate the Anafora XML files produced by your system against the manually annotated Anafora XML files, run:

python -m anafora.evaluate -r <reference-dir> -p <system-dir>

where <reference-dir> is the directory of manually annotated Anafora XML files and <system-dir> is the directory of your system-predicted XML files. Both directories should have the same directory structure. The command prints a variety of evaluation measures that look something like:

                              	 ref 	pred 	corr 	  P  	  R  	 F1  
TIMEX3                        	 1874	 1031	    0	0.000	0.000	0.000
TIMEX3:<span>                 	 1874	 1031	  890	0.863	0.475	0.613
TIMEX3:Class                  	 1874	 1031	  879	0.853	0.469	0.605
TIMEX3:Class:DATE             	 1340	  795	  677	0.852	0.505	0.634
TIMEX3:Class:DURATION         	  186	   74	   54	0.730	0.290	0.415
TIMEX3:Class:PREPOSTEXP       	  160	  134	  126	0.940	0.787	0.857
TIMEX3:Class:QUANTIFIER       	   56	    8	    6	0.750	0.107	0.188
TIMEX3:Class:SET              	   72	   18	   14	0.778	0.194	0.311
TIMEX3:Class:TIME             	   60	    2	    2	1.000	0.033	0.065

The columns identify the number of reference items, the number of system-predicted items, the number of system-predicted items that were correct (i.e., in the reference items), and the precision (P), recall (R) and F1-measure.
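
As a quick check on how the last three columns follow from the first three, the sketch below recomputes them for the TIMEX3:<span> row above. The function name prf is illustrative and not part of anaforatools. It assumes the standard definitions: P = corr / pred, R = corr / ref, and F1 as the harmonic mean of P and R.

```python
def prf(ref, pred, corr):
    """Return (precision, recall, F1) from raw counts.

    Assumes the standard definitions; a zero denominator yields 0.0.
    """
    precision = corr / pred if pred else 0.0
    recall = corr / ref if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Counts from the TIMEX3:<span> row: ref=1874, pred=1031, corr=890
p, r, f1 = prf(1874, 1031, 890)
print(f"{p:.3f}\t{r:.3f}\t{f1:.3f}")  # 0.863	0.475	0.613
```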

A line starting with a single token, like the TIMEX3 line above, gives the performance on exactly matching entire annotations, including their type, spans and all of their properties. A line like TIMEX3:<span> gives the performance on just matching the entity type and the character offsets (ignoring the properties). The remaining lines show performance on entity properties, either overall, as in the case of TIMEX3:Class, or separated out by the different values of a property as in TIMEX3:Class:DATE, TIMEX3:Class:DURATION, etc.

TLINKs have only been annotated on _clinic_ notes, not on _path_ notes, so you should restrict the evaluation to the _clinic_ notes when evaluating TLINKs. Run:

python -m anafora.evaluate -r <reference-dir> -p <system-dir> -x ".*_clinic_.*[.]xml$" --include TLINK
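
To see which files the -x pattern would select, you can try the same regular expression in Python. This is only an illustration: the file names below are hypothetical, and it assumes the pattern is tested against each XML file name; how anaforatools applies it internally is not described here.

```python
import re

# The pattern passed to -x in the command above
pattern = re.compile(r".*_clinic_.*[.]xml$")

# Hypothetical file names, for illustration only
names = [
    "doc001_clinic_001.system.xml",
    "doc001_path_002.system.xml",
    "doc002_clinic_003.txt",
]

# Keep only names that contain "_clinic_" and end in ".xml"
clinic_files = [name for name in names if pattern.match(name)]
print(clinic_files)  # ['doc001_clinic_001.system.xml']
```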

For TLINKs, it is also possible to run a form of evaluation that takes temporal inferences into account, applying temporal closure on the reference annotations when calculating precision, and applying temporal closure on the predicted annotations when calculating recall. For such an evaluation, run:

python -m anafora.evaluate -r <reference-dir> -p <system-dir> -x ".*_clinic_.*[.]xml$" --include TLINK:Type --temporal-closure
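
The effect of temporal closure can be seen in a toy example. The sketch below is not the anaforatools implementation: it handles only the transitivity of BEFORE links, the entity names A, B, and C are hypothetical, and the helper names are made up for illustration. It follows the description above by closing the reference links for precision and the predicted links for recall.

```python
from itertools import product

def before_closure(links):
    """Transitive closure of a set of (source, target) BEFORE links."""
    closed = set(links)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so that closed can grow inside the loop
        for (a, b), (c, d) in product(list(closed), repeat=2):
            if b == c and a != d and (a, d) not in closed:
                closed.add((a, d))
                changed = True
    return closed

def closure_scores(reference, predicted):
    """Precision against the closed reference, recall against the closed prediction."""
    precision = (len(predicted & before_closure(reference)) / len(predicted)
                 if predicted else 0.0)
    recall = (len(reference & before_closure(predicted)) / len(reference)
              if reference else 0.0)
    return precision, recall

reference = {("A", "B"), ("B", "C")}  # A BEFORE B, B BEFORE C
predicted = {("A", "C")}              # A BEFORE C, implied by the reference
print(closure_scores(reference, predicted))  # (1.0, 0.0)
```

Without closure, the predicted link would count as wrong because it is not annotated explicitly; with closure on the reference, it counts as correct for precision, while recall stays at zero because neither reference link can be inferred from the prediction.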


If you have trouble running any of the scripts, please ask for assistance on the Clinical TempEval Google Group.

If you have found a bug in the scripts, please file an issue on the anaforatools issue tracker.