SemEval-2015 Task 7: Diachronic Text Evaluation

 

INTRODUCTION

Ye knowe ek that in forme of speeche is chaunge
You know also that in (the) form of speech (there) is change

(Geoffrey Chaucer, Troilus and Criseyde, late 14th century)

 

Language changes over time, even over relatively short periods. For example, because the main intent of publishing newspapers is to disseminate information to the population of a whole country, there is an objective pressure to impose a standard and to smooth over dialectal differences. Nevertheless, since the late 1600s, each generation has read pieces of news containing new words, borrowed or invented, exhibiting new drifts in the meanings of old words, printed with different spellings, etc.

While it is relatively easy for humans to notice the language differences between two texts, and even to determine fairly accurately the period when a piece of news was written, this task is challenging for computational systems. On the other hand, with the availability of large time-tagged corpora, a computational system can perform various analyses and extract correlations that humans cannot know beforehand or acquire through manual inspection of information scattered over huge collections of texts.

The interesting question is whether it is possible to automatically determine the period when a text was written. For this task, all aspects of language change may be taken into account: references to well-known people or events, phrases used predominantly in a certain period, epoch-specific syntax, word senses, morphology, etc. The sentences below exemplify some of these time-related features occurring in texts.

(1) “Dictator Saddam Hussein ordered his troops to march into Kuwait. After the invasion is condemned by the UN Security Council, the US has forged a coalition with allies. Today American troops are sent to Saudi Arabia in Operation Desert Shield, protecting Saudi Arabia from possible attack.” circa 1990
(2) "We have cabled the English house from which we get it and expect a reply to-morrow." circa 1900
(3) “Occasional selfies are acceptable, but uploading a new picture of yourself every day is not necessary.” circa 2014

 

 

We propose to tackle the task of automatically identifying the time period when a piece of news was written. We provide a corpus of news fragments for both training and testing. The length of a fragment is a few hundred words. The system has to choose the correct time period (e.g. 1700-1750, …, 1900-1910, …) from a given set of contiguous intervals that cover the whole period considered, i.e. from 1700 to 2014.
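As a minimal sketch of what "contiguous intervals covering 1700 to 2014" can mean in practice, the snippet below maps a year to the interval that contains it. The cut points in `BOUNDARIES`, the half-open interval convention, and the function name `interval_for` are illustrative assumptions, not the official interval set of the task.

```python
from bisect import bisect_right

# Hypothetical cut points: the task text only says that the intervals are
# contiguous and cover 1700 to 2014. These values are for illustration.
BOUNDARIES = [1700, 1750, 1800, 1850, 1900, 1910, 1920, 2015]


def interval_for(year, boundaries=BOUNDARIES):
    """Return the (start, end) interval containing `year` (end exclusive)."""
    if not boundaries[0] <= year < boundaries[-1]:
        raise ValueError(f"year {year} is outside the covered period")
    # Index of the last boundary that is <= year.
    i = bisect_right(boundaries, year) - 1
    return boundaries[i], boundaries[i + 1]


if __name__ == "__main__":
    print(interval_for(1905))
```

Treating each interval as half-open avoids assigning a boundary year such as 1750 to two intervals at once; a real system would take the intervals from the distributed data instead.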

 

 

On the basis of the types of information that can potentially be used, we identified three subtasks. A system may participate in any of the subtasks separately, but we encourage participation in all three. All types of approaches are welcome.

 

 

 

TASK DESCRIPTION

 

Subtask 1 – texts with clear reference to time anchors
Each fragment of a piece of news may contain clear time anchors and/or explicit references to famous persons or events. The fragments do not necessarily state their own date, but time information may be obtained from external knowledge bases, as in example (1) above.

Data Set Example
<text no=“1700-1710” yes=“1985-1995” no=“2000-2010”>
Dictator Saddam Hussein ordered his troops to march into Kuwait. After the invasion is condemned by the UN Security Council, the US has forged a coalition with allies. Today American troops are sent to Saudi Arabia in Operation Desert Shield, protecting Saudi Arabia from possible attack.
</text>
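The example header repeats the `no` attribute and uses typographic quotes, so a strict XML parser would likely reject it. The following is a minimal sketch of reading one such block with regular expressions instead, assuming every fragment follows exactly the layout shown above. The function name `parse_fragment` and the returned structure are hypothetical.

```python
import re

# Matches name="value" pairs written with straight or typographic quotes.
ATTR = re.compile(r'(\w+)\s*=\s*["“”]([^"“”]*)["“”]')


def parse_fragment(block):
    """Split one <text ...>...</text> block into candidate intervals and body.

    Returns (candidates, body), where candidates is a list of
    ((start_year, end_year), is_correct) pairs in header order.
    """
    m = re.search(r'<text\b([^>]*)>(.*?)</text>', block, re.DOTALL)
    if m is None:
        raise ValueError("no <text> element found")
    header, body = m.groups()
    candidates = []
    for name, value in ATTR.findall(header):
        start, end = (int(y) for y in value.split("-"))
        candidates.append(((start, end), name == "yes"))
    return candidates, body.strip()


if __name__ == "__main__":
    sample = ('<text no=“1700-1710” yes=“1985-1995” no=“2000-2010”>\n'
              'Dictator Saddam Hussein ordered his troops to march into Kuwait.\n'
              '</text>')
    print(parse_fragment(sample))
```

Keeping the candidates as an ordered list rather than a dictionary preserves the repeated `no` entries, which a name-keyed mapping would silently overwrite.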


Subtask 2 – texts with specific time language usage
In this subtask, references to named entities and events may be incidentally present in the body of the fragment, but they are unlikely to be found in other resources. However, other clues are available to determine the time period, as in example (2) above.

Data Set Example
<text yes=“1895-1905” no=“1970-1980” no=“2004-2014”>
We have cabled the English house from which we get it and expect a reply to-morrow.
</text>

 

Subtask 3 – recognizing time specific phrases

Some phrases are particularly relevant for time detection within a short period of time, while others are irrelevant. The task consists of responding with “yes” or “no” according to whether a marked phrase carries time information. For example, in (3), “selfie” is considered to carry time information with respect to the recent period, but “is not necessary” is not. Unlike the two previous subtasks, the goal here is not to determine when the fragment was most likely written, but to decide whether the marked phrases are indicative features for a certain period. In this subtask, the period of interest is specified both in training and in testing.

Data Set Example
<text yes=“selfie” no=“is not necessary” period=“2000-2014”>
Occasional selfies are acceptable, but uploading a new picture of yourself every day is not necessary.
</text>
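For this subtask the header carries the marked phrases and the period of interest rather than candidate intervals. Here is a minimal sketch of extracting them, assuming the layout shown in the example. The function name `phrase_labels` and the character-offset lookup are illustrative additions, not part of the task definition.

```python
import re

# Matches name="value" pairs written with straight or typographic quotes.
ATTR = re.compile(r'(\w+)\s*=\s*["“”]([^"“”]*)["“”]')


def phrase_labels(block):
    """Return (period, labels) for one Subtask 3 block.

    labels is a list of (phrase, is_time_bearing, offset) triples, where
    offset is the first case-insensitive position of the phrase in the
    fragment body, or -1 if the phrase does not occur there.
    """
    m = re.search(r'<text\b([^>]*)>(.*?)</text>', block, re.DOTALL)
    if m is None:
        raise ValueError("no <text> element found")
    header, body = m.groups()
    body = body.strip()
    period, labels = None, []
    for name, value in ATTR.findall(header):
        if name == "period":
            period = value
        elif name in ("yes", "no"):
            offset = body.lower().find(value.lower())
            labels.append((value, name == "yes", offset))
    return period, labels


if __name__ == "__main__":
    sample = ('<text yes=“selfie” no=“is not necessary” period=”2000-2014”>\n'
              'Occasional selfies are acceptable, but uploading a new picture '
              'of yourself every day is not necessary.\n'
              '</text>')
    print(phrase_labels(sample))
```

The substring search lets the marked phrase “selfie” match the inflected form “selfies” in the body; a real system might prefer token-level or lemma-level matching.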
 

REFERENCES

Mihalcea, R. and Nastase, V. (2012). "Word epoch disambiguation: Finding how words change over time". In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju Island, Korea.

 

Popescu, O. and Strapparava, C. (2013). "Behind the Times: Detecting Epoch Changes using Large Corpora". In Proceedings of the IJCNLP, Nagoya, Japan.

 

Popescu, O. and Strapparava, C. (2014). "Time corpora: Epochs, opinions and changes". In Journal of Knowledge-Based Systems, Elsevier.

 

Wang, C., Blei, D., and Heckerman, D. (2008). "Continuous time dynamic topic models". In Proceedings of the International Conference on Machine Learning.

Contact Info

Organizers

  • Octavian Popescu, IBM Research, US
  • Carlo Strapparava, FBK-irst, Italy

googlegroups: semeval2015task7-diachronic@googlegroups.com

Other Info

Announcements

  • 27/08/2014 - the training corpus (3,000 pieces of news from 1700-2010) is on-line
  • 30/05/2014 - the trial is on-line
  • 08/05/2014 - the task description is available
  • 05/06/2014 - the trial is on-line again, with small changes
  • 05/06/2014 - first questions are coming in; Google group created