Task Description: L2 writing assistant
[2014-04-08 - The evaluation results are now available and the gold standard has been released. Thanks for your participation! All data, tools, and github repository]
[2014-03-24 - You can submit up to three runs per language pair and per configuration type (best/oof)]
[2014-03-23 - The test data has been released! Download it from the data & tools page! The evaluation period starts now and ends on Sunday March 30, 2014, 23:59 PST (Monday March 31, 07:00 GMT)]
[2014-03-04 - Download the new version of the evaluation tool on the data & tools page!]
[2014-02-27 - Updated schedule for SemEval task 5: L2 Writing Assistant - Test period starts Sunday the 23rd of March 12:00 GMT]
We offer a new cross-lingual and application-oriented task for SemEval that finds itself in the area where techniques from Word Sense Disambiguation and Machine Translation meet.
The task concerns the translation of L1 fragments, i.e. words or phrases, in an L2 context. This type of translation can be applied in writing assistance systems for language learners in which users write in their target language, but are allowed to occasionally back off to their native L1 when they are uncertain of the proper word or expression in L2. These L1 fragments are subsequently translated into L2 fragments, taking the surrounding L2 context into account.
Thus, participants are asked to build a translation/writing assistance system that translates specifically marked L1 fragments, in an L2 context, to their proper L2 translation.
The task finds itself on the boundary of Cross-Lingual Word Sense Disambiguation and Machine Translation. Full-on machine translation typically concerns the translation of whole sentences or texts from L1 to L2. This task, in contrast, focuses on smaller fragments, sidestepping the problem of full word reordering.
In this task we focus on the following language combinations of L1 and L2 pairs: English-German, English-Spanish, French-English, and Dutch-English. Task participants may participate for all language pairs or any subset thereof.
2. Task Description
In this task we ask the participants to build a translation assistance system rather than a full machine translation system. The general form of such a translation assistance system allows a translator or L2 language student to write in L2, while backing off to L1 wherever he or she is uncertain of the correct lexical or grammatical form in L2. The L1 expression, a word or phrase, is translated by the system to L2, given the L2 context already present, including right-side context if available. The aim here, as in all translation, is to carry the semantics of the L1 fragment over to L2 and find the most suitable L2 expression given the already present L2 context.
The task essentially addresses a core problem of WSD, with cross-lingual context, and a sub-problem of Phrase-based Machine Translation: that of finding the most suitable translation of a word or phrase. In MT this would be modelled by the translation model. Our task does not address the full complexity of sentential translation, thus evading problems associated with reordering and syntax. Instead it emphasizes the local semantic aspect of phrasal or word translation in context. The user group we have in mind is that of intermediate and advanced language learners, whom you generally want to encourage to use their target language as much as possible, but who may often feel the need to fall back to their native language.
Currently, language learners are forced to fall back to a bilingual dictionary when in doubt. Such dictionaries do not take the L2 context into account and are generally constrained to single words or short expressions. The proposed application automatically generates context-dependent translation suggestions as writing progresses. The task tests how effectively participating systems accomplish this.
The following example sentence pairs illustrate the idea:
- Input (L1=English,L2=Spanish): “Hoy vamos a the swimming pool”.
- Desired output: “Hoy vamos a la piscina”
- Input (L1=English, L2=German): “Das Wetter ist wirklich abominable”.
- Desired output: “Das Wetter ist wirklich ekelhaft”
- Input (L1=French,L2=English): “I rentre à la maison because I am tired”
- Desired output: “I return home because I am tired”.
- Input (L1=Dutch, L2=English): “Workers are facing a massive aanval op their employment and social rights .”
- Desired output: “Workers are facing a massive attack on their employment and social rights”
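The examples above can be read as a left L2 context, a marked L1 fragment, and a right L2 context. The following is a minimal sketch of that view in Python. The class name `Instance` and its fields are hypothetical and chosen for illustration only; they are not the official XML format or library of this task.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One test sentence: an L1 fragment embedded in L2 context (hypothetical structure)."""
    left: str      # L2 context preceding the fragment (may be empty)
    fragment: str  # the marked L1 fragment to be translated
    right: str     # L2 context following the fragment (may be empty)

    def fill(self, translation: str) -> str:
        """Return the full L2 sentence with the fragment replaced by a translation."""
        parts = (self.left, translation, self.right)
        # Skip empty context parts so no stray spaces are produced.
        return " ".join(p for p in parts if p)


# The English-Spanish and Dutch-English examples from the text above.
spanish = Instance(left="Hoy vamos a", fragment="the swimming pool", right="")
dutch = Instance(
    left="Workers are facing a massive",
    fragment="aanval op",
    right="their employment and social rights .",
)

print(spanish.fill("la piscina"))
print(dutch.fill("attack on"))
```

A participating system would supply the translation passed to `fill`, ideally choosing it with both `left` and `right` in view.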
The L2 writing assistant task can be related to two tasks that were offered in previous years of SemEval: Cross-Lingual Lexical Substitution (Mihalcea et al., 2010) and Cross-Lingual Word Sense Disambiguation (Lefever and Hoste, 2010, 2013). Compared to the Cross-Lingual Word Sense Disambiguation task, one notable difference is that this task concerns not just words, but also phrases. Another essential difference is the nature of the context: the context is in L2 instead of L1.
Several metrics are available for automatic evaluation. First, we measure the absolute accuracy a = c/n, where c is the number of fragment translations from the system output that precisely match the corresponding fragments in the reference translation, and n is the total number of translatable fragments, including those for which no translation was found. We also introduce a word-based accuracy, which, unlike the absolute accuracy, still gives some credit to mismatches that show partial overlap with the reference translation. The system with the highest word-based accuracy wins the competition.
A recall metric simply measures the number of fragments for which the system generated a translation, as a proportion of the total number of fragments. As no selection is made in the L1 words or phrases that may appear in an L2 context, and due to the way evaluation is conducted, it is important that participating systems produce output for as many words and phrases as possible, and thus achieve a high recall.
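As an illustration of the two metrics that are fully defined above, the following sketch computes the absolute accuracy a = c/n and the recall over a list of fragments. It is not the official evaluation script. The function names and the convention of using `None` for "no translation produced" are assumptions made for this example. Word-based accuracy is left out because its exact formula is not given here.

```python
from typing import Optional, Sequence


def absolute_accuracy(outputs: Sequence[Optional[str]], references: Sequence[str]) -> float:
    """a = c / n: exact, case-sensitive matches over all translatable fragments."""
    if len(outputs) != len(references):
        raise ValueError("outputs and references must be aligned one-to-one")
    n = len(references)
    if n == 0:
        return 0.0
    # Fragments without a translation (None) still count towards n.
    c = sum(1 for out, ref in zip(outputs, references) if out is not None and out == ref)
    return c / n


def recall(outputs: Sequence[Optional[str]]) -> float:
    """Proportion of fragments for which any translation was produced."""
    n = len(outputs)
    if n == 0:
        return 0.0
    return sum(1 for out in outputs if out) / n


# Reference fragments taken from the example sentences in this description;
# the system outputs are made up for illustration.
references = ["la piscina", "ekelhaft", "return home", "attack on"]
outputs = ["la piscina", "schrecklich", None, "attack on"]

print(absolute_accuracy(outputs, references))  # 2 of 4 fragments match
print(recall(outputs))                         # 3 of 4 fragments have output
```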
In addition to these task-specific metrics, standard MT metrics such as BLEU, NIST and METEOR, and error rates such as WER, PER and TER, are included in the evaluation script as well. Scores such as BLEU will generally be high (> 0.95), as a large portion of the sentence is already translated and only a specific fragment remains to be evaluated. Nevertheless, these generic metrics have proven to follow the same trend as the more task-specific evaluation metrics.
It regularly occurs that multiple translations are possible. In the creation of the test set we take this into account by explicitly encoding valid alternatives. A match with an alternative counts as a valid match. Likewise, a translation assistance system may output multiple alternatives as well. We therefore allow two different types of runs, following the example of the Cross-Lingual Lexical Substitution and Cross-Lingual Word Sense Disambiguation tasks:
- Best - The system must output its best guess;
- Out of Five - The system may output up to five alternatives.
Up to three runs may be submitted per language pair and evaluation type. If you participate for all language pairs and all evaluation types, this amounts to 3 runs x 4 language pairs x 2 evaluation types = 24 runs in total.
An evaluation script that implements all these measures will be made available to the participants. This same script will be used to compute the final evaluation of this task.
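The two run types and the handling of reference alternatives can be sketched as follows. This is only an illustration, not the official script. It assumes that in an Out of Five run a fragment counts as correct when any of the first five system candidates matches any valid reference alternative. The names `fragment_matches` and `accuracy`, the mode labels `"best"` and `"oof"`, and the alternative "abscheulich" are all hypothetical.

```python
from typing import Sequence, Set


def fragment_matches(candidates: Sequence[str], valid: Set[str], mode: str = "best") -> bool:
    """Check one fragment against the set of valid reference translations."""
    if mode not in ("best", "oof"):
        raise ValueError("mode must be 'best' or 'oof'")
    # Best: only the first candidate counts. Out of Five: up to five candidates.
    limit = 1 if mode == "best" else 5
    return any(c in valid for c in candidates[:limit])


def accuracy(system: Sequence[Sequence[str]], gold: Sequence[Set[str]], mode: str = "best") -> float:
    """Share of fragments with a valid match under the given run type."""
    if len(system) != len(gold):
        raise ValueError("system and gold must be aligned one-to-one")
    if not gold:
        return 0.0
    correct = sum(fragment_matches(c, v, mode) for c, v in zip(system, gold))
    return correct / len(gold)


# Gold alternatives and system candidates are illustrative only.
gold = [{"ekelhaft", "abscheulich"}, {"la piscina"}]
system = [["schrecklich", "abscheulich"], ["la piscina"]]

print(accuracy(system, gold, mode="best"))  # first fragment fails on its top candidate
print(accuracy(system, gold, mode="oof"))   # second candidate rescues the first fragment
```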
4. Data sets
We provide material for the following L1 and L2 pairs:
- English-German
- English-Spanish
- French-English
- Dutch-English
Both trial and test data will be offered in a clear and simple XML format. The test data will be delivered in tokenised format. This tokenisation is done using ucto. System output is expected to adhere to this same XML format so it can be automatically evaluated. Output should not be detokenised. It should, however, respect case, as evaluation will be case-sensitive. We do not provide any training data for this task. Participants are free to use any suitable training material such as parallel corpora, wordnets or bilingual lexica.
Participants are encouraged to participate in as many of the four language pairs as possible, but may also choose any subset.
The task organizers produce a test data set of 500 sentences for each of the selected language pairs. In the selection of test data we aim for realism, by selecting words and phrases that may prove challenging for language learners. To achieve this, we gather language learning exercises with gaps and cloze tests, as well as learner corpora with annotated errors, to act as the basis for our test set. Once we have gathered L2 sentences with such marked error fragments, or with gaps specifically designed to test a specific aspect of language ability, we manually translate these fragments into L1, effectively forming a sentence pair for the test set. Note that the test sentences will not contain other L2 learner errors. We only use the errors of the L2 learners to find more natural places to insert the L1 phrases.
We also provide trial data for the selected language pairs, consisting of 500 sentences as well. This trial data is semi-automatically generated using a parallel corpus, namely the Europarl corpus (Koehn, 2005). We performed a manual selection to get sentences that contain translations of appropriate words or phrases that mimic the L2 writing assistant task as naturally as possible. It has to be noted that this trial set is less optimised for realism than the test data. Nevertheless, it suffices to measure relative system performance, and it is a sufficiently large set.
Please go to the Data & Tools page to obtain the trial data and tools to work with it.
5. Available software
An evaluation script implementing all the above-mentioned evaluation measures will be provided (implemented in Python). This same script will be used to compute the final scores on the test set of this task.
We will provide a context-insensitive baseline for the trial data, computed by using the phrase-translation table generated on a particular training set from a parallel corpus, and selecting simply the most probable translation.
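The idea behind such a context-insensitive baseline can be sketched as follows: look up the L1 fragment in a phrase-translation table and return the target phrase with the highest probability, ignoring the L2 context entirely. This is not the baseline code of the task. The table below is a toy dictionary with made-up probabilities, and the function name `baseline_translate` is hypothetical; a real table would be learned from a parallel corpus.

```python
from typing import Dict, Optional

# Toy phrase table: source phrase -> {target phrase: P(target | source)}.
# All entries and probabilities are invented for illustration.
PhraseTable = Dict[str, Dict[str, float]]

TOY_TABLE: PhraseTable = {
    "the swimming pool": {"la piscina": 0.8, "el estanque": 0.2},
    "aanval op": {"attack on": 0.7, "assault on": 0.3},
}


def baseline_translate(fragment: str, table: PhraseTable) -> Optional[str]:
    """Return the most probable translation, or None if the fragment is unknown."""
    candidates = table.get(fragment)
    if not candidates:
        return None  # no output for this fragment; this lowers recall
    # The surrounding L2 context plays no role in the choice.
    return max(candidates, key=candidates.get)


print(baseline_translate("the swimming pool", TOY_TABLE))
print(baseline_translate("abominable", TOY_TABLE))
```

Because the context is ignored, this baseline always returns the same translation for a given fragment, which is exactly the limitation a context-sensitive system is meant to overcome.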
A small Python library for handling the XML file format will be provided, facilitating system construction for participants using Python.
A mailing list is available for questions and discussion regarding this task: subscribe here.
- van Gompel, M. (2010). UvT-WSD1: A cross-lingual word sense disambiguation system. In Proceedings of the 5th international workshop on semantic evaluation (SemEval 2010) (pp. 238–241). Morristown, NJ, USA: Association for Computational Linguistics.
- van Gompel, M. (2013). WSD2: Parameter optimisation for memory-based cross-lingual word-sense disambiguation. In Proceedings of the 7th international workshop on semantic evaluation (SemEval 2013), in conjunction with the second joint conference on lexical and computational semantics.
- Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In MT Summit (Vol. 5).
- Lefever, E., & Hoste, V. (2013). SemEval-2013 Task 10: Cross-Lingual Word Sense Disambiguation. In Proceedings of the 7th international workshop on semantic evaluation (SemEval 2013), in conjunction with the second joint conference on lexical and computational semantics.
- Mihalcea, R., Sinha, R., & McCarthy, D. (2010). SemEval-2010 Task 2: Cross-lingual lexical substitution. In Proceedings of the 5th international workshop on semantic evaluation (SemEval 2010). Uppsala, Sweden.