Evaluation

Experimental setting

Submitted systems

Final ranking of the systems

Detailed evaluation results


Evaluation Measures


Structural Measures:

  • |V|: number of distinct vertices;
  • |E|: number of distinct edges;
  • #c.c.: number of connected components;
  • #i.i.: number of intermediate nodes, i.e. |V| - |L|, where L is the set of leaf nodes;
  • cycles: YES = the taxonomy contains cycles, NO = the taxonomy is a Directed Acyclic Graph (DAG).
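
Under the assumption that a taxonomy is given as a list of (term, hypernym) edges, all of these structural measures can be computed with a short sketch in pure Python (function and key names are illustrative, not the official scorer):

```python
from collections import defaultdict

def structural_measures(edges):
    """Compute |V|, |E|, #c.c., #i.i. and cyclicity for a taxonomy
    given as (term, hypernym) edge pairs."""
    edges = set(edges)
    vertices = {v for e in edges for v in e}
    # Leaves: vertices that never appear as a hypernym (no incoming ISA edge).
    hypernyms = {h for _, h in edges}
    leaves = vertices - hypernyms
    # Connected components on the undirected version of the graph.
    adj = defaultdict(set)
    for t, h in edges:
        adj[t].add(h)
        adj[h].add(t)
    seen, components = set(), 0
    for v in vertices:
        if v not in seen:
            components += 1
            stack = [v]
            while stack:
                u = stack.pop()
                if u not in seen:
                    seen.add(u)
                    stack.extend(adj[u] - seen)
    # Cycle detection: iterative DFS with white/grey/black colouring.
    out = defaultdict(set)
    for t, h in edges:
        out[t].add(h)
    WHITE, GREY, BLACK = 0, 1, 2
    color = {v: WHITE for v in vertices}
    def has_cycle_from(s):
        color[s] = GREY
        stack = [(s, iter(out[s]))]
        while stack:
            u, it = stack[-1]
            for w in it:
                if color[w] == GREY:      # back edge -> cycle
                    return True
                if color[w] == WHITE:
                    color[w] = GREY
                    stack.append((w, iter(out[w])))
                    break
            else:
                color[u] = BLACK
                stack.pop()
        return False
    cyclic = any(color[v] == WHITE and has_cycle_from(v) for v in vertices)
    return {"|V|": len(vertices), "|E|": len(edges),
            "#c.c.": components, "#i.i.": len(vertices) - len(leaves),
            "cycles": cyclic}
```

For instance, the two-edge taxonomy {cat -> animal, dog -> animal} has three vertices, one connected component, one intermediate node (animal) and no cycles.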

  • Comparison against gold standard:

  • P = | {edges in common with the gold standard taxonomy} | / |{system edges}|
  • R = | {edges in common with the gold standard taxonomy} | / |{gold standard edges}|
  • F = 2(P*R)/(P+R)
  • Cumulative Fowlkes & Mallows Measure (F&M): a cumulative measure of the similarity of two taxonomies.

  • Manual quality assessment of novel edges

  • correct ISA = ISA AND domain specific AND not over-generic
  • P = |correct ISA| / |sample|
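
Treating both taxonomies as sets of (term, hypernym) edges, the comparison against the gold standard reduces to set intersections (a minimal sketch of the P, R and F definitions above, not the official scorer):

```python
def edge_prf(system_edges, gold_edges):
    """Precision, recall and F-score over taxonomy edges."""
    system, gold = set(system_edges), set(gold_edges)
    common = system & gold  # edges in common with the gold standard
    p = len(common) / len(system) if system else 0.0
    r = len(common) / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

E.g. a system that returns {cat -> animal, car -> machine} against the gold standard {cat -> animal, dog -> animal} shares one edge with it, so P = R = F = 0.5.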

  • Baseline

    A basic string-inclusion approach that captures relations between compound terms and their heads, such as network science -> science.
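
Such a baseline can be sketched as follows (illustrative only; the official baseline's exact matching rules may differ): a term proposes as hypernym any other term in the list that matches its trailing tokens.

```python
def string_inclusion_baseline(terms):
    """Propose (term, hypernym) edges whenever another term from the list
    occurs as the trailing tokens of a compound term,
    e.g. 'network science' -> 'science'. Sketch of a string-inclusion baseline."""
    edges = set()
    for term in terms:
        tokens = term.split()
        for other in terms:
            if other == term:
                continue
            o_tokens = other.split()
            # Head match: 'other' equals the trailing tokens of 'term'.
            if len(o_tokens) < len(tokens) and tokens[-len(o_tokens):] == o_tokens:
                edges.add((term, other))
    return edges
```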


    Gold Standard

    The gold standard taxonomies (.taxo) are tab-separated fields:
    relation_id <TAB> term <TAB> hypernym
    where:
    - relation_id: a relation identifier;
    - term: a term of the taxonomy;
    - hypernym: a hypernym of the term.
    e.g.:
    0<TAB>cat<TAB>animal
    1<TAB>dog<TAB>animal
    2<TAB>car<TAB>vehicle
    ....
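
Reading a .taxo file into (term, hypernym) pairs is a one-liner per row (sketch; the function name is illustrative):

```python
import csv

def read_taxo(path):
    """Parse a tab-separated .taxo file into (term, hypernym) pairs,
    dropping the relation_id column."""
    with open(path, encoding="utf-8") as fh:
        return [(term, hypernym)
                for _rel_id, term, hypernym in csv.reader(fh, delimiter="\t")]
```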

    Gold Standard Structure

    Language Domain |V| |E| #i.i. #c.c. Cycles
    English Environment (Eurovoc) 261 261 60 1 no
    Food 1556 1587 70 1 no
    Food (Wordnet) 1486 1576 302 1 no
    Science 453 465 54 1 no
    Science (Eurovoc) 125 124 31 1 no
    Science (Wordnet) 429 452 117 1 no
    Dutch Environment (Eurovoc) 267 267 59 1 no
    Food 1429 1446 66 3 no
    Food (Wordnet) 1299 1340 259 3 no
    Science 445 449 54 1 no
    Science (Eurovoc) 125 124 32 1 no
    Science (Wordnet) 399 399 105 1 no
    French Environment (Eurovoc) 267 266 61 1 no
    Food 1418 1441 64 1 no
    Food (Wordnet) 1329 1358 263 2 no
    Science 449 451 54 1 no
    Science (Eurovoc) 125 124 31 1 no
    Science (Wordnet) 390 389 101 1 no
    Italian Environment (Eurovoc) 267 266 59 1 no
    Food 1274 1304 60 3 no
    Food (Wordnet) 1277 1332 254 1 yes
    Science 442 444 54 1 no
    Science (Eurovoc) 125 124 32 1 no
    Science (Wordnet) 396 396 105 1 no


    Submissions

    (with author-provided descriptions)


    The following archive contains the submissions received from the five participating systems.

     

    JUNLP

    The system is based on two hypernym detection modules. The first deals with the semantic relations already available for a term: instead of analysing the huge Wikipedia dump for pattern-based hypernym discovery, we significantly reduce execution time by extracting Wikipedia-based hypernym relations from BabelNet, a rich semantic network that connects concepts and named entities through a very large set of semantic relations, grouped into Babel synsets. The second module tries to find the subterm(s) present in the term list (a subterm can be another term from the list, or an overlap of multiple terms) that can serve as a possible hypernym for the term.

     

    TAXI - TAXonomy Induction

    The TAXonomy Induction system (TAXI) relies on two sources of evidence: substring matching and Hearst-like lexico-syntactic patterns. The Hearst patterns for all languages are extracted from Wikipedia and from focused crawls seeded with Wikipedia pages. For English, we additionally rely on several corpora: GigaWord, ukWaC, a news corpus and CommonCrawl. For French, Italian and Dutch the method is completely unsupervised and relies on a k-NN approach; for English, we train an SVM classifier on the trial data. For all languages the features are the same: substrings and ISA relations extracted with lexico-syntactic patterns. No databases or linguistic resources beyond the trial data and the raw text corpora mentioned above were used.
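
The kind of Hearst-style lexico-syntactic pattern mentioned here can be approximated with a simple regular expression over raw text (an illustrative sketch, not TAXI's actual extraction code; a real extractor would work over parsed or chunked text):

```python
import re

# One classic Hearst pattern: "X such as Y(, Z)* and W".
# Known limitation: the word-level matching may pull extra context words
# into the hypernym slot when the sentence has a longer prefix.
SUCH_AS = re.compile(r"([\w ]+?)\s+such as\s+([\w ,]+?)(?=[.;]|$)")

def hearst_isa(text):
    """Extract (hyponym, hypernym) candidates from 'X such as Y, Z and W'."""
    pairs = []
    for m in SUCH_AS.finditer(text):
        hypernym = m.group(1).strip()
        hyponyms = re.split(r"\s*,\s*|\s+and\s+|\s+or\s+", m.group(2))
        pairs.extend((h.strip(), hypernym) for h in hyponyms if h.strip())
    return pairs
```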

     

     

    NUIG-UNLP

    The system implements a semi-supervised method that finds hypernym candidates for the provided noun phrases by representing them as distributional vectors. Roughly, the method assumes that a hypernym can be induced by adding a vector offset [1,2] to the corresponding hyponym representation, generated by GloVe over a Wikipedia dump. The vector offset is obtained as the average offset over 200 hyponym-hypernym pairs in the same vector space.

    [1] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of HLT-NAACL, pages 746-751.

    [2] Marek Rei and Ted Briscoe. 2014. Looking for Hyponyms in Vector Space. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 68-77.
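
The vector-offset idea above can be sketched with plain NumPy on toy vectors (hypothetical 2-d embeddings for illustration; training GloVe itself is out of scope, and the functions are not NUIG-UNLP's actual code):

```python
import numpy as np

def hypernym_offset(train_pairs, vectors):
    """Average hyponym -> hypernym offset over training pairs
    (the system description uses 200 such pairs in GloVe space)."""
    return np.mean([vectors[hyper] - vectors[hypo]
                    for hypo, hyper in train_pairs], axis=0)

def predict_hypernym(term, vectors, offset):
    """Nearest neighbour (cosine) of term_vector + offset in the vocabulary."""
    query = vectors[term] + offset
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    candidates = [(w, cos(query, v)) for w, v in vectors.items() if w != term]
    return max(candidates, key=lambda x: x[1])[0]
```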

     

    USAAR

    Multi-word hyponyms are often endocentric constructions: compounds containing a head word that fulfils the same function as the whole construction. E.g. an "apple pie" is essentially a "pie". We explore how many multi-word terms are endocentric in English and whether this endocentric property can be used to generate entity links connecting terms in Wikipedia's lists of lists.

     

    QASSIT

    We use a semi-supervised methodology for the acquisition of lexical taxonomies based on genetic algorithms. It builds on the theory of pretopology, which offers a powerful formalism for modelling semantic relations, and transforms a list of terms into a structured term space by combining different discriminant criteria. In particular, rare but accurate pieces of knowledge are used to parameterise the different criteria defining the pretopological term space. A structuring algorithm then transforms the pretopological space into a lexical taxonomy.


    Final Ranking


    Overall Ranking

    Subtask Measure JUNLP TAXI NUIG-UNLP USAAR QASSIT

    Monolingual (EN)

    Taxonomy Construction

    Cyclicity 3 1 4 1 2
    Structure (F&M) 3 2 4 5 1
    Categorisation (i.i.) 1 3 2 4 5
    Connectivity (c.c.) 3 1 2 4 1
    Gold standard comparison (Fscore)* 4 1 5 2 3
    Domains* 1 1 2 1 2
    Manual Evaluation (Precision)* 4 2 5 1 3
    Total 19 11 24 18 17
    Ranking 4 1 5 3 2

    Monolingual (EN)

    Hypernym Identification*

    Total* 9 4 12 4 8
    Ranking* 3 1 4 1 2

    Multilingual (NL,FR,IT)

    Taxonomy Construction

     

    Cyclicity 1 1 n.a. n.a. n.a.
    Structure (F&M) 2 1
    Categorisation (i.i.) 1 2
    Connectivity (c.c.) 2 1
    Gold standard comparison (Fscore)^ 2 1
    Manual Evaluation (Precision)^ 2 1
    Total 10 7
    Ranking 2 1
    Multilingual (EN)

    Hypernym Identification^

    Total^ 4 2
    Ranking^ 2 1

    Only measures marked with * and ^ were used for ranking in the hypernym identification subtasks.


    Overall scores

    Subtask Measure Baseline JUNLP TAXI NUIG-UNLP USAAR QASSIT

    Monolingual

    (EN)

    Cyclicity 0 3 0 4 0 1
    Structure (F&M) 0.0046 0.1498 0.2908 0.0410 0.0013  0.4064
    Categorisation (i.i.) 77.67 377 104.5 213 96.33 59.5
    Connectivity (c.c.) 36.83 53.17 1 44.75 76.67 1
    Gold standard comparison (Fscore) 0.33 0.20 0.32 0.19 0.26 0.22
    Domains 6 6 6 4 6 4
    Manual Evaluation (Precision) n.a. 0.09 0.20 0.07 0.49 0.10

    Multilingual

    (NL,FR,IT)

    Cyclicity 0 0 0 n.a. n.a. n.a.
    Structure (F&M) 0.0087 0.0155  0.1885
    Categorisation (i.i.) 64.28 178.22 64.94
    Connectivity (c.c.) 40.5 34.89 1
    Gold standard comparison (Fscore) 0.3133 0.1921 0.2815
    Manual Evaluation (Precision) n.a. 0.2983 0.6252

    These scores are obtained by averaging the results over domains (environment, science, food) and languages (NL, FR, IT) for the multilingual setting.



    Detailed Evaluation Results


    Structural Evaluation

    Language Domain Measure Baseline JUNLP TAXI NUIG-UNLP USAAR QASSIT
    English Environment (Eurovoc) |V| 123 321 148 312 57 261
    |E| 112 463 207 456 47 365
    #i.i. 27 123 50 176 10 88
    #c.c. 17 19 1 58 10 1
    cycles no no no yes no no
    Food |V| 636 1802 781 n.a 3716 n.a.
    |E| 627 3015 1118 4347
    #i.i. 130 581 132 323
    #c.c. 40 48 1 217
    cycles no yes no no
    Food (Wordnet) |V| 826 1748 1122 n.a. 675 n.a.
    |E| 812 3607 2067 540
    #i.i. 205 866 259 146
    #c.c. 79 123 1 135
    cycles no yes no no
    Science |V| 232 602 294 595 371 452
    |E| 214 1046 418 1656 312 708
    #i.i. 41 255 73 409 60 58
    #c.c. 28 24 1 99 59 1
    cycles no no no yes no yes
    Science (Eurovoc) |V| 50 186 100 97 37 125
    |E| 42 342 139 218 30 164
    #i.i. 11 133 25 72 7 25
    #c.c. 9 15 1 13 7 1
    cycles no yes no yes no no
    Science (Wordnet) |V| 217 424 290 251 136 370
    |E| 174 690 459 929 104 647
    #i.i. 52 304 88 195 32 67
    #c.c. 48 90 1 9 32 1
    cycles no no no yes no no
    Dutch Environment (Eurovoc) |V| 116 317 85 n.a. n.a. n.a.
    |E| 100 379 92
    #i.i. 24 73 24
    #c.c. 20 8 1
    cycles no no no
    Food |V| 459 1616 350
    |E| 399 1974 395
    #i.i. 84 329 85
    #c.c. 68 71 1
    cycles no no no
    Food (Wordnet) |V| 610 1433 515
    |E| 500 1868 573
    #i.i. 147 320 145
    #c.c. 122 63 1
    cycles no no no
    Science |V| 192 143 46
    |E| 166 145 49
    #i.i. 40 40 15
    #c.c. 30 26 1
    cycles no no no
    Science (Eurovoc) |V| 40 561 193
    |E| 31 774 203
    #i.i. 11 163 40
    #c.c. 10 21 1
    cycles no no no
    Science (Wordnet) |V| 213 427 221
    |E| 169 452 230
    #i.i. 55 89 59
    #c.c. 50 52 1
    cycles no no no
    French Environment (Eurovoc) |V| 130 327 128 n.a. n.a. n.a.
    |E| 117 415 181
    #i.i. 23 79 37
    #c.c. 17 7 1
    cycles no no no
    Food |V| 531 1649 522
    |E| 500 2224 699
    #i.i. 109 352 125
    #c.c. 49 53 1
    cycles no no no
    Food (Wordnet) |V| 712 1533 707
    |E| 679 2157 964
    #i.i. 172 373 193
    #c.c. 68 45 1
    cycles no no no
    Science |V| 201 587 249
    |E| 181 885 298
    #i.i. 43 186 58
    #c.c. 24 18 1
    cycles no no no
    Science (Eurovoc) |V| 48 145 83
    |E| 38 158 113
    #i.i. 11 45 24
    #c.c. 10 21 1
    cycles no no no
    Science (Wordnet) |V| 208 419 265
    |E| 169 458 336
    #i.i. 54 89 76
    #c.c. 43 40 1
    cycles no no no
    Italian Environment (Eurovoc) |V| 102 341 70 n.a. n.a. n.a.
    |E| 91 474 69
    #i.i. 18 87 14
    #c.c. 13 4 1
    cycles no no no
    Food |V| 459 1490 328
    |E| 417 1864 332
    #i.i. 103 314 66
    #c.c. 49 58 1
    cycles no no no
    Food (Wordnet) |V| 644 1505 471
    |E| 589 2053 486
    #i.i. 153 354 107
    #c.c. 73 53 1
    cycles no no no
    Science |V| 211 564 197
    |E| 194 773 199
    #i.i. 42 172 35
    #c.c. 23 22 1
    cycles no no no
    Science (Eurovoc) |V| 56 144 54
    |E| 45 149 55
    #i.i. 13 46 12
    #c.c. 12 24 1
    cycles no no no
    Science (Wordnet) |V| 211 430 208
    |E| 165 463 209
    #i.i. 55 97 54
    #c.c. 48 42 1
    cycles no no no

     

    Gold Standard Evaluation

    Language Domain Measure Baseline JUNLP TAXI NUIG-UNLP USAAR QASSIT
    English Environment (Eurovoc) Precision 0.5 0.1296 0.3382 0.1579 0.8085 0.1479
    Recall 0.2146 0.2299 0.2682 0.2759 0.1456 0.2069
    Fscore 0.3003 0.1658 0.2992 0.2008 0.2468 0.1725
    F&M 0.0 0.0814 0.2384 0.0007 0.0007 0.4349
    Food Precision 0.4705 0.1320 0.3372 n.a. 0.0603 n.a.
    Recall 0.1859 0.2508 0.2376 0.1651
    Fscore 0.2665 0.1730 0.2787 0.0883
    F&M 0.0019 0.2608 0.2021 0.0
    Food (Wordnet) Precision 0.5 0.1475 0.2583 n.a. 0.7056 n.a.
    Recall 0.2576 0.3376 0.3388 0.2418
    Fscore 0.34 0.2053 0.2932 0.3601
    F&M 0.0022 0.1925 0.3260 0.0021
    Science Precision 0.6262 0.1377 0.3876 0.0984 0.3814 0.1794
    Recall 0.2882 0.3097 0.3484 0.3505 0.2559 0.2731
    Fscore 0.3947 0.1906 0.3669 0.1537 0.3063 0.2165
    F&M 0.0163 0.1774 0.3634 0.0090 0.0020 0.5757
    Science (Eurovoc) Precision 0.6190 0.1316 0.2950 0.1330 0.6333 0.2134
    Recall 0.2097 0.3629 0.3306 0.2339 0.1532 0.2823
    Fscore 0.3133 0.1931 0.3118 0.1696 0.2468 0.2431
    F&M 0.0056 0.1373 0.3893 0.1517 0.0023 0.3893
    Science (Wordnet) Precision 0.6897 0.2058 0.3747 0.1755 0.8173 0.2025
    Recall 0.2655 0.3142 0.3805 0.3606 0.1881 0.2898
    Fscore 0.3834 0.2487 0.3776 0.2361 0.3058 0.2384
    F&M 0.0016 0.0494 0.2255 0.0027 0.0008 0.2255
    Dutch Environment (Eurovoc) Precision 0.53 0.1425 0.5543 n.a. n.a. n.a.
    Recall 0.1985 0.2022 0.1910
    Fscore 0.2888 0.1672 0.284
    F&M 0.0007 0.0097 0.1910
    Food Precision 0.5363 0.1292 0.4608
    Recall 0.1470 0.1763 0.1259
    Fscore 0.2320 0.1491 0.1977
    F&M - 0.0 0.1226
    Food (Wordnet) Precision 0.526 0.1601 0.3857
    Recall 0.1963 0.2231 0.1649
    Fscore 0.2859 0.1864 0.2311
    F&M 0.0008 0.0009 0.1165
    Science Precision 0.6024 0.1655 0.5306
    Recall 0.2227 0.1935 0.2097
    Fscore 0.3252 0.1784 0.3006
    F&M 0.0057 0.0206 0.2215
    Science (Eurovoc) Precision 0.7742 0.1486 0.4778
    Recall 0.1935 0.2561 0.2160
    Fscore 0.3097 0.1881 0.2975
    F&M 0.0 0.0 0.1987
    Science (Wordnet) Precision 0.5976 0.2257 0.4739
    Recall 0.2531 0.2556 0.2732
    Fscore 0.3556 0.2397 0.3466
    F&M 0.0026 0.0020  0.1699
    French Environment (Eurovoc) Precision 0.5043 0.1373 0.2928 n.a. n.a. n.a.
    Recall 0.2218 0.2143 0.1992
    Fscore 0.3081 0.1674 0.2371
    F&M 0.0051 0.0110 0.1836
    Food Precision 0.466 0.1210 0.3433
    Recall 0.1619 0.1867 0.1666
    Fscore 0.2401 0.1468 0.2243
    F&M - 0.0 0.1398
    Food (Wordnet) Precision 0.4153 0.1312 0.2562
    Recall 0.2077 0.2084 0.1819
    Fscore 0.2769 0.1610 0.2127
    F&M 0.0006 0.0006 0.2068
    Science Precision 0.5856 0.1435 0.3993
    Recall 0.2350 0.2816 0.2639
    Fscore 0.3354 0.1901 0.3178
    F&M 0.0114 0.0748 0.3042
    Science (Eurovoc) Precision 0.8684 0.2089 0.3363
    Recall 0.2661 0.2661 0.3065
    Fscore 0.4074 0.2340 0.3207
    F&M 0.0 0.0 0.3192
    Science (Wordnet) Precision 0.6568 0.2664 0.3780
    Recall 0.2853 0.3136 0.3265
    Fscore 0.3979 0.2881 0.3503
    F&M 0.0022 0.1462 0.2597
    Italian Environment (Eurovoc) Precision 0.5604 0.0970 0.7536 n.a. n.a. n.a.
    Recall 0.1917 0.1729 0.1955
    Fscore 0.2857 0.1243 0.3104
    F&M 0.0011 0.0011 0.1776
    Food Precision 0.4365 0.1019 0.4277
    Recall 0.1396 0.1457 0.1089
    Fscore 0.2115 0.1199 0.1736
    F&M - 0.0 0.1311
    Food (Wordnet) Precision 0.4363 0.1305 0.4218
    Recall 0.1929 0.2012 0.1539
    Fscore 0.2676 0.1583 0.2255
    F&M 0.0868 0.0 0.0868
    Science Precision 0.5670 0.0095 0.5176
    Recall 0.2477 0.1552 0.2320
    Fscore 0.3448 0.2703 0.3204
    F&M 0.0095 0.0095 0.1933
    Science (Eurovoc) Precision 0.7556 0.2282 0.6
    Recall 0.2742 0.2742 0.2661
    Fscore 0.4024 0.2491 0.3687
    F&M 0.0034 0.0033 0.1976
    Science (Wordnet) Precision 0.6182 0.2224 0.5024
    Recall 0.2576 0.2601 0.2652
    Fscore 0.3636 0.2398 0.3471
    F&M 0.0  0.0 0.1735

     

    Gold Standard Evaluation (Average results for each language across domains)

    Language Measure Baseline JUNLP TAXI NUIG-UNLP USAAR QASSIT
    English Average Precision 0.57 0.15 0.33 0.14 0.57 0.19
    Average Recall 0.24 0.30 0.32 0.30 0.19 0.26
    Average Fscore 0.33 0.20 0.32 0.19 0.26 0.22
    Dutch Average Precision 0.59 0.16 0.48 n.a. n.a. n.a.
    Average Recall 0.20 0.22 0.20
    Average Fscore 0.30 0.19 0.28
    French Average Precision 0.58 0.17 0.33
    Average Recall 0.23 0.25 0.24
    Average Fscore 0.33 0.20 0.28
    Italian Average Precision 0.56 0.13 0.54
    Average Recall 0.22 0.20 0.20
    Average Fscore 0.31 0.19 0.29

    Overall Multilingual

    (Other than English)

    Average Precision 0.58 0.15 0.45
    Average Recall 0.22 0.22 0.21
    Average Fscore 0.31 0.19 0.28

     

    Manual Evaluation

    (Precision over a sample of at most 100 random novel relations per system)

    Language Domain JUNLP TAXI NUIG-UNLP USAAR QASSIT
    English Environment (Eurovoc) 0.02 0.11 0.08 0.22 0.07
    Food 0.2 0.36 n.a. 0.73 n.a.
    Food (Wordnet) 0.18 0.32 n.a. 0.81 n.a.
    Science 0.06 0.14 0.09 0.71 0.07
    Science (Eurovoc) 0.02 0.02 0.04 0.0 0.05
    Science (Wordnet) 0.06 0.22 0.05 0.47 0.22
    Dutch Environment (Eurovoc) 0.27 0.24 n.a. n.a. n.a.
    Food 0.22 0.69
    Food (Wordnet) 0.28 0.71
    Science 0.41 0.8
    Science (Eurovoc) 0.26 0.43
    Science (Wordnet) 0.18 0.88
    French Environment (Eurovoc) 0.24 0.23
    Food 0.21 0.46
    Food (Wordnet) 0.23 0.58
    Science 0.32 0.58
    Science (Eurovoc) 0.53 0.6
    Science (Wordnet) 0.66 0.56
    Italian Environment (Eurovoc) 0.18 0.41
    Food 0.18 0.83
    Food (Wordnet) 0.32 0.69
    Science 0.39 0.90
    Science (Eurovoc) 0.25 0.82
    Science (Wordnet) 0.24 0.84

     

    Contact Info

    Organizers

    • Georgeta Bordea - Insight, Centre for Data Analytics, National University of Ireland, Galway
    • Els Lefever - LT3 language and translation team at the Faculty of Arts and Philosophy at Ghent University
    • Paul Buitelaar - Insight, Centre for Data Analytics, National University of Ireland, Galway

    email :

    Other Info

    Announcements

    • The deadline for submitting system description papers was extended to March 4, 2016
    • Final rankings made available on February 5, 2016
    • English, Dutch, French and Italian test data released on December 15, 2015
    • The task is organised in an unsupervised setting; therefore, no training data will be provided
    • Updated Dutch trial taxonomy and released Italian and French trial terms and taxonomy on July 31, 2015
    • Dutch trial terms and taxonomy released on July 12, 2015
    • Trial data and tools released on June 30, 2015