Evaluation Criteria < SemEval-2017 Task 7

Evaluation Criteria

Subtask 1: Pun detection

The evaluation for this subtask will be carried out in two simultaneous phases, one for the homographic data set and one for the heterographic data set. Systems may participate in either or both phases.

Systems participating in a given phase must classify all contexts in the data set. Contexts must be classified as either containing or not containing a pun.

The classification results for each phase must be submitted in a delimited text file named answer.txt. Each line consists of two fields separated by horizontal whitespace (a single tab or space character). The first field is the ID of a context from the data set. The second field is either 1 if the text contains a pun, or 0 if the text does not contain a pun. Sample data and results files are available in the trial data.

To submit the results, place answer.txt in a ZIP file (in the top-level directory), and then upload it to CodaLab according to the instructions at Participating in a Competition.
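As an illustration of the file layout described above, the following Python sketch writes a Subtask 1 answer.txt and packs it at the top level of a ZIP archive. The function name, the archive name submission.zip, and the context IDs in the usage note are assumptions made for the example; only the name answer.txt and the two-field line format come from the task description.

```python
import zipfile
from pathlib import Path


def write_detection_submission(predictions, out_dir="."):
    """Write answer.txt and wrap it in a ZIP archive.

    predictions: mapping of context ID -> bool (True if a pun is predicted).
    Returns the path of the ZIP file. The archive name is an arbitrary choice.
    """
    out_dir = Path(out_dir)
    answer_path = out_dir / "answer.txt"

    # One line per context: ID, a single tab, then 1 (pun) or 0 (no pun).
    with open(answer_path, "w", encoding="utf-8", newline="\n") as handle:
        for context_id, has_pun in predictions.items():
            handle.write(f"{context_id}\t{1 if has_pun else 0}\n")

    # arcname keeps answer.txt in the top-level directory of the archive.
    zip_path = out_dir / "submission.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(answer_path, arcname="answer.txt")
    return zip_path
```

Calling write_detection_submission({"ctx_1": True, "ctx_2": False}) with these made-up IDs would produce an answer.txt containing two lines, one per context. The real context IDs should be taken from the data set.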

Systems will be scored using the standard precision, recall, accuracy, and F1 measures as used in classification:

precision: # of true positives ÷ (# of true positives + # of false positives)
recall: # of true positives ÷ (# of true positives + # of false negatives)
accuracy: (# of true positives + # of true negatives) ÷ (# of true positives + # of true negatives + # of false positives + # of false negatives)
F1: (2 × precision × recall) ÷ (precision + recall)
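The four measures can be computed from the confusion counts as in the sketch below. This is not the official scorer: the function name is hypothetical, and returning 0.0 when a denominator is zero is an assumption, since the task description does not say how that case is handled. It also assumes that every context in the gold standard has a prediction, as Subtask 1 requires.

```python
def detection_scores(gold, predicted):
    """Compute precision, recall, accuracy, and F1 for pun detection.

    gold, predicted: mappings of context ID -> 1 (pun) or 0 (no pun).
    Every ID in gold is assumed to be present in predicted.
    """
    tp = sum(1 for cid in gold if gold[cid] == 1 and predicted[cid] == 1)
    tn = sum(1 for cid in gold if gold[cid] == 0 and predicted[cid] == 0)
    fp = sum(1 for cid in gold if gold[cid] == 0 and predicted[cid] == 1)
    fn = sum(1 for cid in gold if gold[cid] == 1 and predicted[cid] == 0)

    def ratio(numerator, denominator):
        # Assumption: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}
```

For instance, with one true positive, one true negative, one false positive, and one false negative, all four measures come out to 0.5.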

Subtask 2: Pun location

Systems participating in a given phase may provide a single guess for any or all of the contexts in the data set.

The results for each phase must be submitted in a delimited text file named answer.txt. Each line consists of two fields separated by horizontal whitespace (a single tab or space character). The first field is the ID of a context from the data set. The second field is the ID of the one word in that context which is a pun. Sample data and results files are available in the trial data.

To submit the results, place answer.txt in a ZIP file (in the top-level directory), and then upload it to CodaLab according to the instructions at Participating in a Competition.

Systems will be scored using the standard coverage, precision, recall, and F1 measures as used in word sense disambiguation:

coverage: # of guesses ÷ # of contexts
precision: # of correct guesses ÷ # of guesses
recall: # of correct guesses ÷ # of contexts
F1: (2 × precision × recall) ÷ (precision + recall)
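Because a system may skip contexts in this subtask, precision and recall use different denominators. The sketch below shows one way to compute the four measures; it is not the official scorer. The function name and the word IDs in the usage note are hypothetical, and two behaviours are assumptions: guesses for IDs absent from the gold standard are ignored, and an undefined ratio is reported as 0.0.

```python
def location_scores(gold, guesses):
    """Compute coverage, precision, recall, and F1 for pun location.

    gold: mapping of context ID -> ID of the word that is the pun.
    guesses: mapping of context ID -> guessed word ID; contexts the
    system chose to skip are simply absent.
    """
    # Assumption: guesses for unknown context IDs do not count.
    scored = {cid: word for cid, word in guesses.items() if cid in gold}

    n_contexts = len(gold)
    n_guesses = len(scored)
    n_correct = sum(1 for cid, word in scored.items() if gold[cid] == word)

    def ratio(numerator, denominator):
        # Assumption: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    coverage = ratio(n_guesses, n_contexts)
    precision = ratio(n_correct, n_guesses)
    recall = ratio(n_correct, n_contexts)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {"coverage": coverage, "precision": precision,
            "recall": recall, "f1": f1}
```

With four contexts, two guesses, and one correct guess, this gives coverage 0.5, precision 0.5, and recall 0.25. A system that guesses every context has coverage 1, and its precision and recall coincide.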

Subtask 3: Pun interpretation

Systems participating in a given phase may provide a single guess for any or all of the contexts in the data set.

The results for each phase must be submitted in a delimited text file named answer.txt. Each line of the text file consists of three fields separated by horizontal whitespace (a single tab or space character). The first field is the ID of a pun word from the data set. The second field is a semicolon-delimited list of WordNet 3.1 sense keys that match one meaning of the pun. The third field is a semicolon-delimited list of WordNet 3.1 sense keys that match the other meaning of the pun. Sample data and results files are available in the trial data.
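The three-field line format can be produced and read back with a pair of small helpers such as the following. Both function names are hypothetical, and the parser assumes that sense keys contain no whitespace; the pun word ID and sense keys in the usage note are taken from the example in this document.

```python
def format_interpretation_line(pun_word_id, senses_a, senses_b):
    """Build one answer.txt line: ID, then two semicolon-delimited sense lists."""
    return "\t".join([pun_word_id, ";".join(senses_a), ";".join(senses_b)])


def parse_interpretation_line(line):
    """Split one answer.txt line into (ID, sense list, sense list).

    Fields may be separated by a tab or a space, so split on any whitespace.
    """
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, found {len(fields)}: {line!r}")
    pun_word_id, first, second = fields
    return pun_word_id, first.split(";"), second.split(";")
```

For example, format_interpretation_line("t_1_17", ["propane%1:27:00::"], ["profane%3:00:00::"]) yields the ID and the two sense keys joined by single tabs, and parse_interpretation_line recovers the ID and both lists from that line.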

To submit the results, place answer.txt in a ZIP file (in the top-level directory), and then upload it to CodaLab according to the instructions at Participating in a Competition.

Systems will be scored using the standard coverage, precision, recall, and F1 measures as used in word sense disambiguation:

coverage: # of guesses ÷ # of contexts
precision: # of correct guesses ÷ # of guesses
recall: # of correct guesses ÷ # of contexts
F1: (2 × precision × recall) ÷ (precision + recall)

A guess is considered to be "correct" if one of its sense lists is a non-empty subset of one of the sense lists from the gold standard, and the other of its sense lists is a non-empty subset of the other sense list from the gold standard. That is, the order of the two sense lists is not significant, nor is the order of the sense keys within each list. If the gold standard sense lists contain multiple senses, then it is sufficient for the system to correctly guess only one sense from each list.

For example, take the following gold standard key:

t_1_17	propane%1:27:00::	profane%3:00:00::;profane%3:00:00:unholy:00

Any of the following system guesses would be considered correct:

t_1_17	propane%1:27:00::	profane%3:00:00::;profane%3:00:00:unholy:00
t_1_17	propane%1:27:00::	profane%3:00:00:unholy:00;profane%3:00:00::
t_1_17	propane%1:27:00::	profane%3:00:00::
t_1_17	propane%1:27:00::	profane%3:00:00:unholy:00
t_1_17	profane%3:00:00::;profane%3:00:00:unholy:00	propane%1:27:00::
t_1_17	profane%3:00:00:unholy:00;profane%3:00:00::	propane%1:27:00::
t_1_17	profane%3:00:00::	propane%1:27:00::
t_1_17	profane%3:00:00:unholy:00	propane%1:27:00::
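The correctness rule amounts to a subset test tried in both pairings of the sense lists. A minimal Python sketch of that rule follows; it is an illustration of the definition above, not the official scorer, and the function name is hypothetical.

```python
def is_correct_guess(guess, gold):
    """Apply the Subtask 3 correctness rule to one pun.

    guess, gold: pairs of sense lists, e.g. (["a"], ["b", "c"]).
    Order of the two lists, and of keys within a list, is ignored.
    """
    guess_1, guess_2 = (set(senses) for senses in guess)
    gold_1, gold_2 = (set(senses) for senses in gold)

    def non_empty_subset(candidate, reference):
        return bool(candidate) and candidate <= reference

    # Try the lists in the given pairing, then with the guess lists swapped.
    straight = non_empty_subset(guess_1, gold_1) and non_empty_subset(guess_2, gold_2)
    swapped = non_empty_subset(guess_1, gold_2) and non_empty_subset(guess_2, gold_1)
    return straight or swapped
```

Under this sketch, all eight guesses listed above are accepted for t_1_17. A guess that adds a sense key not in the corresponding gold list, or that leaves one list empty, is rejected because that list is no longer a non-empty subset.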

SemEval-2017 Task 7

Detection and Interpretation of English Puns