Data < SemEval-2015 Task 4

Data

Evaluation Data: Test and Gold data

The evaluation data consists of 3 sets of documents annotated with event mentions, together with a set of 38 target entities. Each set contains 30 documents from Wikinews, for a total of around 30,000 tokens.

The evaluation data can be obtained by filling in the following form: Download ME!

Trial Data

The trial data consists of a set of 30 documents about Apple Inc. collected from Wikinews (http://en.wikinews.org). A set of target entities (the input) and the corresponding ordered lists of events (the output timelines) are provided with the set of documents.

The trial data have been annotated with the extents of event mentions.

Download:

Documents in CAT labelled format (version 1.2): Corpus-trial-data-task4_v1.2.zip
Documents in TimeML format: Corpus-trial-data_task4_TimeML_v1.2.zip
Documents in CAT labelled format (version with only the events that can appear in the expected timelines): Corpus-trial-data_task4_events_timelines.zip
Documents in TimeML format (version with only the events that can appear in the expected timelines): Corpus-trial-data_task4_events_timelines_TimeML.zip
Target entities: Target_entities_task4.txt
TimeLines (version 1.2): TimeLines_trial_data_task4_v1.2.zip
Updated TimeLine of Steve Jobs (version 2014-11-28): steve_jobs.txt

We also provide separately the 3 files used to compute agreement on the annotation of event mentions, and the two TimeLines built from these files for the agreement. The 3 files are also included in the whole corpus, but the TimeLines are not. The annotation and the TimeLines have been reviewed.

Download the 3 agreement files in CAT labelled format: Corpus-agreement-data_task4.zip
Download TimeLines built by using the 3 agreement files: TimeLines_agreement_data_task4_v1.0.zip

No training data have been provided in addition to the trial data.

Format

Documents. The documents are available in two formats: the CAT (Content Annotation Tool) labelled format (Bartalesi Lenzi et al., 2012) and a format that mimics the TimeML format (http://timeml.org/site/publications/specs.html).

The CAT labelled format is an XML-based stand-off format in which the different annotation layers are stored in separate document sections and are related to each other and to the source data through pointers. The trial data are annotated with event mentions and the document creation time, so each document contains 2 sections: one with the tokens and one with the markables.

The XSD schema of the annotated documents in CAT labelled format is available here.

In the TimeML-like format, events are annotated using only the EVENT element (and not the MAKEINSTANCE element, as in TimeML). Elements have been added to mark the sentences (s) and to associate each of them with a unique id. The text is tokenized.
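As an illustration of how the TimeML-like documents might be read, the following minimal sketch groups event mentions by sentence. Only the EVENT and s element names come from the description above; the root element name (Document), the attribute names (id, eid) and the sample sentences are assumptions made for the example and should be checked against the actual files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: root element, attribute names and text are assumed,
# only the EVENT and s element names are taken from the format description.
SAMPLE = """<Document>
<s id="1">Apple <EVENT eid="e1">announced</EVENT> a new phone .</s>
<s id="2">Sales <EVENT eid="e2">rose</EVENT> sharply .</s>
</Document>"""


def events_by_sentence(xml_text):
    """Map each sentence id to the list of event mention strings it contains."""
    root = ET.fromstring(xml_text)
    result = {}
    for sent in root.iter("s"):
        # itertext() collects the full text of the mention, even if nested.
        result[sent.get("id")] = ["".join(ev.itertext()) for ev in sent.iter("EVENT")]
    return result


if __name__ == "__main__":
    print(events_by_sentence(SAMPLE))
```

Since the text is already tokenized, a mention string can be split on whitespace to recover its tokens.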

TimeLine. One file per TimeLine must be created. The first line contains the target entity.
The file name must be the mention of the target entity in lower case, followed by the extension “.txt”. For a multi-word entity, the tokens are separated by underscores.
E.g.: steve_jobs.txt
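The naming rule can be sketched as a small helper. The function name timeline_filename is hypothetical, and splitting the mention on whitespace is an assumption about how the tokens are delimited.

```python
def timeline_filename(entity):
    """Build a TimeLine file name from a target entity mention.

    Assumption: the tokens of the mention are separated by whitespace.
    """
    tokens = entity.strip().lower().split()
    return "_".join(tokens) + ".txt"


if __name__ == "__main__":
    print(timeline_filename("Steve Jobs"))
```

For the entity mention "Steve Jobs" this yields steve_jobs.txt, which matches the example above.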

Set of target entities. For each set of documents, one file is provided containing the list of target entities, one per line.
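A minimal sketch of reading such a list, assuming the file is plain text and that blank lines can be ignored. The function name parse_target_entities and the sample entity "iPhone 4" are hypothetical.

```python
def parse_target_entities(text):
    """Return the target entities listed one per line, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Hypothetical file content, for illustration only.
    sample = "Steve Jobs\niPhone 4\n"
    for entity in parse_target_entities(sample):
        print(entity)
```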
