Team Z: Wiktionary as a L2 Writing Assistant

This paper presents a word-for-word translation approach using Wiktionary for SemEval-2014 Task 5. The language pairs attempted for this task were English-Spanish and English-German. Since this approach did not take context into account, it performed poorly.


Introduction
The objective of SemEval-2014 Task 5 is to translate a few words or a phrase from one language (L1) into another (L2). More specifically, a sentence written primarily in L2 but containing a few L1 words is provided, and the task is to translate the L1 words into L2. This task is similar to the previous cross-linguistic SemEval tasks involving lexical substitution (Mihalcea et al., 2010) and word-sense disambiguation (Lefever and Hoste, 2013).
For example, consider the following sentence, written entirely in German except for one English word: "Aber auf diesem Schiff wollen wir auch Ruderer sein, wir sitzen im selben Boot und wollen mit Ihnen row." Here, the word "row" is polysemous and can be translated as the verb "rudern" or as the noun "Reihe" depending on context. The words to be translated can also form an idiomatic expression, such as "in exchange" in "die 1967 eroberten arabischen Gebiete in exchange gegen Frieden". These examples reveal that this is not a straightforward task, as word-for-word translation may give inaccurate results.
Wiktionary is a multilingual dictionary containing word senses, examples, sample quotations, collocations, usage notes, proverbs and translations (Torsten et al., 2008; Meyer and Gurevych, 2012). Since Wiktionary data have previously been used for translations (Orlandi and Passant, 2010), it was chosen for looking up the translations of source language (L1) words. However, the translation approach was word-for-word and ignored the target language (L2) context, i.e., the context in which the text fragment to be translated is found. Although four language pairs were provided in this shared task, the Wiktionary-based solution covers only English-to-Spanish and English-to-German translation.

This work is licensed under a Creative Commons Attribution 4.0 International Licence. Page numbers and proceedings footer are added by the organisers. Licence details: http://creativecommons.org/licenses/by/4.0/

Wiktionary
For a given word, the English version of Wiktionary gives not only its definition but also possible translations. The translations are divided by part of speech (PoS) and word sense, and at times also encode gender and number information; this is the case, for example, for the German and Spanish translations of the English word book.

The Wiktionary dump¹ is an XML file containing the word in the <title> tag and its description under the <text> tag. A translation of the word is indicated by {{t| or {{t+| followed by two letters denoting the target language (es for Spanish and de for German). This is followed by the translation itself and, in the case of nouns, gender information.
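The template format described above can be illustrated with a short extraction sketch. This is not the system's actual code: the regular expression, the function name `extract_translations`, and the sample markup are assumptions made for illustration, and real Wiktionary entries contain template variants that this pattern does not cover.

```python
import re

# Matches {{t|xx|word}} or {{t+|xx|word|g}} for Spanish (es) and German (de).
# Assumption: any arguments after the translation are separated by "|".
TRANSLATION = re.compile(r"\{\{t\+?\|(es|de)\|([^|{}]+)((?:\|[^|{}]*)*)\}\}")

def extract_translations(wikitext):
    """Return (language, translation, gender) triples found in one <text> body."""
    results = []
    for lang, word, rest in TRANSLATION.findall(wikitext):
        args = [a for a in rest.split("|") if a]
        # Keep the first argument that looks like a gender code, if any.
        gender = next((a for a in args if a in {"m", "f", "n"}), None)
        results.append((lang, word.strip(), gender))
    return results

# Hypothetical snippet in the style of a translation section for "book".
sample = "* German: {{t+|de|Buch|n}}\n* Spanish: {{t+|es|libro|m}}"
print(extract_translations(sample))
```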
The information in Wiktionary was converted into a multidimensional hash table with English words as keys and their PoS and translations in Spanish and German as values. This table was used to look up the translations for the task.

¹ For this task, the 17 Dec 2013 version was used.
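One plausible layout for such a multidimensional hash table is a nested dictionary indexed by English word, PoS, and target language. The sketch below is an assumption about the shape of the table, not a description of the original implementation; `build_table` and the sample entries are hypothetical.

```python
from collections import defaultdict

def build_table(entries):
    """Build table[word][pos][lang] -> list of (translation, gender) pairs.

    `entries` is an iterable of (word, pos, lang, translation, gender) tuples,
    e.g. as produced by a parser over the Wiktionary dump.
    """
    table = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for word, pos, lang, translation, gender in entries:
        candidates = table[word][pos][lang]
        # Preserve extraction order and skip exact duplicates.
        if (translation, gender) not in candidates:
            candidates.append((translation, gender))
    return table

# Hypothetical entries for illustration only.
entries = [
    ("book", "noun", "de", "Buch", "n"),
    ("book", "noun", "es", "libro", "m"),
    ("book", "verb", "es", "reservar", None),
    ("book", "noun", "es", "libro", "m"),  # duplicate, ignored
]
table = build_table(entries)
print(table["book"]["noun"]["es"])
```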
Wiktionary also contains lists of the 10000 most frequent words in Spanish and of the 2000 most frequent words in German. This information was used to sort the target language words in the hash table in decreasing order of frequency. The translations absent from these frequency lists were kept in the hash table in the order in which they were extracted from Wiktionary.

The TreeTagger (Schmid, 1994) was used to tag the English (L1) phrases, yielding the PoS and lemma of each word. The PoS tags returned by the TreeTagger were mapped to the PoS labels used in Wiktionary as shown in Table 1. Each word was looked up in the hash table together with its PoS. If no translation was found, the lemma and its PoS were looked up instead. If the lemma lookup also failed, the phrase was not translated.
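The frequency ordering and the word-then-lemma fallback described above can be sketched as follows. The function names and the simplified single-language table (word, then PoS, then a list of translations) are assumptions for illustration; the TreeTagger step and the mapping of Table 1 are not reproduced here.

```python
def sort_by_frequency(candidates, frequency_list):
    """Order candidates by their rank in a most-frequent-first word list.

    Words missing from the list go last; sorted() is stable, so they keep
    the order in which they were extracted.
    """
    rank = {}
    for i, word in enumerate(frequency_list):
        rank.setdefault(word, i)  # keep the first (best) rank of a word
    return sorted(candidates, key=lambda w: rank.get(w, len(rank)))

def lookup(table, word, lemma, pos):
    """Try the surface form first, then the lemma; None means 'not translated'."""
    for key in (word, lemma):
        translations = table.get(key, {}).get(pos)
        if translations:
            return translations
    return None

# Hypothetical data for illustration only.
ranked = sort_by_frequency(["incógnita", "pregunta", "duda"], ["duda", "pregunta"])
toy_table = {"row": {"verb": ["rudern"], "noun": ["Reihe"]}}
print(ranked)
print(lookup(toy_table, "rows", "row", "noun"))
```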
Once the L2 words were obtained for all the L1 words in the phrase, the L2 words were matched based on the gender and number information provided. For example, for the phrase this question, Wiktionary offered este|m and esta|f as Spanish translations of this, and interrogante|m pregunta|f duda|f cuestión|f incógnita|f for question. The translations were paired based on gender agreement rules (e.g. este interrogante, where both are masculine, and esta pregunta, where both are feminine) and provided as solutions.
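The gender-agreement pairing for this example can be sketched as below. The sketch handles gender only and omits number agreement; the function name `pair_by_gender` and the treatment of words without gender information (compatible with anything) are assumptions, not details taken from the system.

```python
from itertools import product

def pair_by_gender(*candidate_lists):
    """Combine one candidate per word, keeping combinations whose genders agree.

    Each candidate is a (translation, gender) pair; a gender of None is
    treated as compatible with any other gender.
    """
    phrases = []
    for combo in product(*candidate_lists):
        genders = {g for _, g in combo if g is not None}
        if len(genders) <= 1:
            phrases.append(" ".join(w for w, _ in combo))
    return phrases

this = [("este", "m"), ("esta", "f")]
question = [("interrogante", "m"), ("pregunta", "f"), ("duda", "f"),
            ("cuestión", "f"), ("incógnita", "f")]
print(pair_by_gender(this, question))
```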

Rules for English-to-Spanish Translation
Wiktionary only provides translations for the citation form of a word (even though other forms exist in Wiktionary as valid entries), which is problematic when translating plural nouns or conjugated (finite) verbs. The lack of this inflectional information degraded the overall performance of both English-to-Spanish and English-to-German translations. Two rules were included in an attempt to improve the English-to-Spanish translations: (1) plural nouns and adjectives were formed by adding -s or -es, and (2) where a noun was preceded by an adjective in an L1 phrase, the positions of the noun and the adjective were switched after translation to respect the noun-adjective word order that is more common in Spanish.

Table 2 shows the performance of the system for the English-to-Spanish and English-to-German translations. The approach in bold was submitted for evaluation. Accuracy refers to the percentage of fragments that were predicted correctly, whereas word accuracy also credits partially correct solutions. For each fragment, up to 5 translations could be submitted, with one considered the best answer and the rest regarded as alternatives. The best evaluation considered only the best answers, whereas the oof (out-of-five) evaluation also considered the alternative answers when the best answer was incorrect.
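The two English-to-Spanish rules described in this section can be sketched as follows. The text does not state when -s rather than -es was chosen, so the condition used here (-s after a vowel, -es otherwise) is an assumption, as are the function names and the PoS labels; spelling changes such as z to c and accent shifts are ignored.

```python
VOWELS = "aeiouáéíóú"

def pluralize(word):
    """Rule (1), simplified: add -s after a final vowel, -es otherwise (assumed)."""
    return word + ("s" if word[-1].lower() in VOWELS else "es")

def reorder(tagged):
    """Rule (2): move an adjective that directly precedes a noun to after it.

    `tagged` is a list of (translation, pos) pairs in L1 word order.
    """
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "adjective" and out[i + 1][1] == "noun":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return out

print(pluralize("libro"), pluralize("color"))
print(reorder([("rojo", "adjective"), ("libro", "noun")]))
```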

Results and Conclusions
A context-independent, word-for-word translation approach to the L2 Writing Assistant task was proposed. Its mediocre performance stems from the simplicity of the approach. The system could be significantly improved by using the Spanish and German versions of Wiktionary to make up for the translations missing from the English version and by considering the L2 context provided. One resource for the latter in the German Wiktionary is the {{Charakteristische Wortkombinationen}} tag, which lists possible collocations. For example, one of the German translations of the English word exchange is Austausch, which is most often collocated with im or als. Also, using a tool like JWKTL would improve the quality of the information extracted from Wiktionary.