Evaluation

Experimental setting

Submitted systems

Final ranking of the systems

Detailed evaluation results

Evaluation Measures

Structural Measures:

|V|: number of distinct vertices;

|E|: number of distinct edges;

#c.c.: number of connected components;

#i.i.: number of intermediate nodes = |V| - |L|, where L is the set of leaves;

cycles: YES = the taxonomy contains cycles, NO = the taxonomy is a Directed Acyclic Graph (DAG).
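As an illustration, the structural measures above can be computed from a list of (term, hypernym) edges roughly as follows. This is a minimal sketch, not the task's official evaluation script: the function name `structural_measures` and the returned keys are hypothetical, and it assumes that a leaf is a vertex with no hyponyms and that connected components ignore edge direction.

```python
from collections import defaultdict


def structural_measures(edges):
    """Compute |V|, |E|, #c.c., #i.i. and cyclicity for (term, hypernym) pairs."""
    edges = set(edges)                       # keep distinct edges only
    vertices = {v for edge in edges for v in edge}

    hyponyms = defaultdict(set)              # hypernym -> its direct hyponyms
    hypernyms = defaultdict(set)             # term -> its direct hypernyms
    for term, hyper in edges:
        hyponyms[hyper].add(term)
        hypernyms[term].add(hyper)

    # Assumption: a leaf is a vertex that is nobody's hypernym.
    leaves = {v for v in vertices if not hyponyms[v]}

    # Connected components, ignoring edge direction (iterative DFS).
    seen, components = set(), 0
    for start in vertices:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            v = stack.pop()
            for w in hyponyms[v] | hypernyms[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)

    # Cycle check (Kahn's algorithm): repeatedly remove vertices with no
    # remaining hyponyms; vertices that can never be removed imply a cycle.
    pending = {v: len(hyponyms[v]) for v in vertices}
    queue = [v for v in vertices if pending[v] == 0]
    removed = 0
    while queue:
        v = queue.pop()
        removed += 1
        for h in hypernyms[v]:
            pending[h] -= 1
            if pending[h] == 0:
                queue.append(h)

    return {
        "V": len(vertices),
        "E": len(edges),
        "cc": components,
        "ii": len(vertices) - len(leaves),
        "cycles": removed != len(vertices),
    }


measures = structural_measures(
    [("cat", "animal"), ("dog", "animal"), ("animal", "entity")]
)
```
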

Comparison against gold standard:

P = | {edges in common with the gold standard taxonomy} | / |{system edges}|

R = | {edges in common with the gold standard taxonomy} | / |{gold standard edges}|

F = 2(P*R)/(P+R)
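The three formulas above translate directly into set operations on edges. The sketch below is illustrative only; the function name `edge_prf` is hypothetical, and it assumes that edges are compared as exact (term, hypernym) pairs.

```python
def edge_prf(system_edges, gold_edges):
    """Edge-level precision, recall and F-score against a gold standard."""
    system, gold = set(system_edges), set(gold_edges)
    common = system & gold                   # edges shared with the gold standard
    precision = len(common) / len(system) if system else 0.0
    recall = len(common) / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


gold = [("cat", "animal"), ("dog", "animal"), ("apple", "fruit"), ("fruit", "food")]
system = [("cat", "animal"), ("apple", "food")]
p, r, f = edge_prf(system, gold)
```

As a consistency check on the formula, plugging in the baseline's reported precision and recall for English Environment (0.5 and 0.2146) gives F = 2(0.5*0.2146)/(0.5+0.2146), which is about 0.3003, the Fscore listed in the detailed results.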

Cumulative Fowlkes & Mallows Measure (F&M): a cumulative measure of the similarity of two taxonomies.

Manual quality assessment of novel edges

correct ISA = ISA AND domain specific AND not over-generic

P = |correct ISA| / |sample|

Baseline

A basic string inclusion approach that covers relations between compound terms such as network science -> science.
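A string inclusion baseline of this kind could look like the sketch below. The exact matching rule used by the organisers is not stated here, so the sketch assumes whole-word, case-insensitive matching: a shorter term is proposed as hypernym whenever its words occur contiguously inside a longer term. The function name `string_inclusion_edges` is hypothetical.

```python
def string_inclusion_edges(terms):
    """Propose (term, hypernym) edges when a shorter term is contained in a longer one."""
    terms = sorted(set(terms))
    tokens = {t: t.lower().split() for t in terms}
    edges = []
    for term in terms:
        words = tokens[term]
        for candidate in terms:
            cand_words = tokens[candidate]
            # Skip empty terms, the term itself and candidates that are not shorter.
            if not cand_words or len(cand_words) >= len(words):
                continue
            n = len(cand_words)
            # Whole-word containment: candidate words appear contiguously in the term.
            if any(words[i:i + n] == cand_words for i in range(len(words) - n + 1)):
                edges.append((term, candidate))
    return edges


edges = string_inclusion_edges(
    ["network science", "science", "computer science", "physics"]
)
```

Single-word terms such as "physics" never receive a hypernym under this rule, which is consistent with the low recall and many connected components reported for the baseline.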

Gold Standard

The gold standard taxonomy files (.taxo) contain one relation per line, with three tab-separated fields:
relation_id <TAB> term <TAB> hypernym
where:
- relation_id: is a relation identifier;
- term: is a term of the taxonomy;
- hypernym: is a hypernym for the term.
e.g.
0<TAB>cat<TAB>animal
1<TAB>dog<TAB>animal
2<TAB>car<TAB>vehicle
....
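A file in this format can be read with a few lines of code. The following is a minimal sketch under the assumption that every non-empty line holds exactly three tab-separated fields; the function name `parse_taxo` is hypothetical.

```python
def parse_taxo(lines):
    """Parse .taxo lines into (relation_id, term, hypernym) tuples."""
    relations = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue                         # tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 fields, got {len(fields)}")
        relation_id, term, hypernym = (field.strip() for field in fields)
        relations.append((relation_id, term, hypernym))
    return relations


sample = ["0\tcat\tanimal\n", "1\tdog\tanimal\n"]
relations = parse_taxo(sample)
edges = [(term, hypernym) for _, term, hypernym in relations]
```

To read from disk, pass an open file object instead of a list, for example `parse_taxo(open(path, encoding="utf-8"))`, where `path` is whatever .taxo file is being evaluated.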

Gold Standard Structure

Language	Domain	|V|	|E|	#i.i.	#c.c.	Cycles
English	Environment (Eurovoc)	261	261	60	1	no
	Food	1556	1587	70	1	no
	Food (Wordnet)	1486	1576	302	1	no
	Science	453	465	54	1	no
	Science (Eurovoc)	125	124	31	1	no
	Science (Wordnet)	429	452	117	1	no
Dutch	Environment (Eurovoc)	267	267	59	1	no
	Food	1429	1446	66	3	no
	Food (Wordnet)	1299	1340	259	3	no
	Science	445	449	54	1	no
	Science (Eurovoc)	125	124	32	1	no
	Science (Wordnet)	399	399	105	1	no
French	Environment (Eurovoc)	267	266	61	1	no
	Food	1418	1441	64	1	no
	Food (Wordnet)	1329	1358	263	2	no
	Science	449	451	54	1	no
	Science (Eurovoc)	125	124	31	1	no
	Science (Wordnet)	390	389	101	1	no
Italian	Environment (Eurovoc)	267	266	59	1	no
	Food	1274	1304	60	3	no
	Food (Wordnet)	1277	1332	254	1	yes
	Science	442	444	54	1	no
	Science (Eurovoc)	125	124	32	1	no
	Science (Wordnet)	396	396	105	1	no

Submissions

(with author-provided descriptions)

The following archive contains the submissions received from the 6 participating systems.

JUNLP

The system is based on two hypernym detection modules. The first one deals with the semantic relations that are already available for a term. Instead of analysing the huge Wikipedia dump for pattern-based hypernym discovery, we opted for a significant reduction of execution time by extracting Wikipedia-based hypernym relations from BabelNet (a rich semantic network that connects concepts and named entities, represented as Babel synsets, in a very large network of semantic relations). The second module searches the term list for subterms of a term (a subterm can be another term from the list or an overlap of multiple terms) that are possible hypernyms of that term.

TAXI - TAXonomy Induction

The methods used in the TAXonomy Induction system (TAXI) rely on two sources of evidence: substring matching and Hearst-like patterns. The Hearst patterns for all languages are extracted from Wikipedia and from focused crawls whose seed pages are Wikipedia pages. For English, we also rely on several additional corpora: GigaWord, ukWaC, a news corpus and the CommonCrawl. For French, Italian and Dutch the method is completely unsupervised and relies on a KNN approach. For English, we train an SVM classifier on the trial data. For all languages the features are the same: substrings and ISA relations extracted with lexico-syntactic patterns. No databases or linguistic resources beyond the trial data and the raw text corpora mentioned above were used.
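To make the notion of a Hearst-like pattern concrete, the sketch below extracts ISA candidates with a single "X such as Y" pattern restricted to a known term list. It is only an illustration and not TAXI's actual pattern set or implementation: the function name `hearst_such_as`, the crude plural handling and the example sentences are all assumptions.

```python
import re


def hearst_such_as(sentences, terms):
    """Extract (hyponym, hypernym) pairs matching '<hypernym> such as <hyponym>'."""
    edges = set()
    for hyper in terms:
        for hypo in terms:
            if hyper == hypo:
                continue
            # Allow a simple plural suffix on both terms and an optional comma
            # or determiner around "such as"; real systems use richer patterns.
            pattern = re.compile(
                rf"\b{re.escape(hyper)}(?:s|es)?,? such as "
                rf"(?:an? |the )?{re.escape(hypo)}(?:s|es)?\b",
                re.IGNORECASE,
            )
            if any(pattern.search(sentence) for sentence in sentences):
                edges.add((hypo, hyper))
    return edges


sentences = [
    "Sciences such as physics rely on experiments.",
    "He likes fruits, such as apples.",
]
candidates = hearst_such_as(sentences, ["science", "physics", "fruit", "apple"])
```

Candidates found this way would then serve as one source of evidence, alongside substring matches, for whatever classifier or ranking step decides on the final edges.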

NUIG-UNLP

The system implements a semi-supervised method that finds hypernym candidates for the provided noun phrases by representing them as distributional vectors. Roughly, this method assumes that hypernyms may be induced by adding a vector offset [1,2] to the corresponding hyponym representation generated by GloVe over a Wikipedia dump. The vector offset is obtained as the average offset between 200 hyponym-hypernym pairs in the same vector space.

[1] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic Regularities in Continuous Space Word Representations. In HLT-NAACL, pages 746–751.

[2] Marek Rei and Ted Briscoe. 2014. Looking for Hyponyms in Vector Space. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 68–77.
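The vector offset idea can be sketched with toy two-dimensional vectors as follows. This is not the NUIG-UNLP implementation: the real system uses GloVe vectors and 200 training pairs, while the vectors, function names and the use of cosine similarity to pick the nearest candidate are assumptions made for the example.

```python
import math


def mean_offset(pairs, vectors):
    """Average (hypernym - hyponym) vector over the training pairs."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for hypo, hyper in pairs:
        for i in range(dim):
            total[i] += vectors[hyper][i] - vectors[hypo][i]
    return [component / len(pairs) for component in total]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_hypernym(term, offset, vectors):
    """Return the vocabulary item closest to vector(term) + offset."""
    target = [x + o for x, o in zip(vectors[term], offset)]
    others = [word for word in vectors if word != term]
    return max(others, key=lambda word: cosine(target, vectors[word]))


# Toy vectors, invented for illustration only.
vectors = {
    "cat": [1.0, 0.0], "dog": [0.9, 0.1], "animal": [1.0, 1.0],
    "apple": [-1.0, 0.0], "fruit": [-1.0, 1.0],
}
offset = mean_offset([("cat", "animal"), ("apple", "fruit")], vectors)
prediction = predict_hypernym("dog", offset, vectors)
```
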

USAAR

Multi-word hyponyms are often endocentric constructions, i.e. constructions that contain a word fulfilling the same function as the whole expression. E.g. an "apple pie" is essentially a "pie". We explore how many multi-word terms in English are endocentric, and whether this endocentric property can be used to generate entity links that connect terms in Wikipedia's list of lists.

QASSIT

We use a semisupervised methodology for the acquisition of lexical taxonomies based on genetic algorithms. It is based on the theory of pretopology that offers a powerful formalism to model semantic relations and transforms a list of terms into a structured term space by combining different discriminant criteria. In particular, rare but accurate pieces of knowledge are used to parameterize the different criteria defining the pretopological term space. Then, a structuring algorithm is used to transform the pretopological space into a lexical taxonomy.

Final Ranking

Overall Ranking

Subtask	Measure	JUNLP	TAXI	NUIG-UNLP	USAAR	QASSIT
Monolingual (EN) Taxonomy Construction	Cyclicity	3	1	4	1	2
	Structure (F&M)	3	2	4	5	1
	Categorisation (i.i.)	1	3	2	4	5
	Connectivity (c.c.)	3	1	2	4	1
	Gold standard comparison (Fscore)*	4	1	5	2	3
	Domains*	1	1	2	1	2
	Manual Evaluation (Precision)*	4	2	5	1	3
	Total	19	11	24	18	17
	Ranking	4	1	5	3	2
Monolingual (EN) Hypernym Identification*	Total*	9	4	12	4	8
Monolingual (EN) Hypernym Identification*	Ranking*	3	1	4	1	2
Multilingual (NL,FR,IT) Taxonomy Construction	Cyclicity	1	1	n.a.	n.a.	n.a.
	Structure (F&M)	2	1
	Categorisation (i.i.)	1	2
	Connectivity (c.c.)	2	1
	Gold standard comparison (Fscore)^	2	1
	Manual Evaluation (Precision)^	2	1
	Total	10	7
	Ranking	2	1
Multilingual (NL,FR,IT) Hypernym Identification^	Total^	4	2
Multilingual (NL,FR,IT) Hypernym Identification^	Ranking^	2	1

Only the measures marked with * (monolingual) and ^ (multilingual) were used for ranking in the hypernym identification subtasks.
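The Total and Ranking rows can be reproduced from the per-measure ranks in the table above. The sketch below does this for the monolingual (EN) setting; it assumes that totals are plain sums of ranks and that tied totals share a rank (dense ranking), which is consistent with the figures shown. The variable and function names are hypothetical.

```python
SYSTEMS = ["JUNLP", "TAXI", "NUIG-UNLP", "USAAR", "QASSIT"]

# Per-measure ranks copied from the monolingual (EN) rows of the table.
RANKS = {
    "Cyclicity": [3, 1, 4, 1, 2],
    "Structure (F&M)": [3, 2, 4, 5, 1],
    "Categorisation (i.i.)": [1, 3, 2, 4, 5],
    "Connectivity (c.c.)": [3, 1, 2, 4, 1],
    "Gold standard comparison (Fscore)": [4, 1, 5, 2, 3],
    "Domains": [1, 1, 2, 1, 2],
    "Manual Evaluation (Precision)": [4, 2, 5, 1, 3],
}

# The starred measures, used for the hypernym identification subtask.
STARRED = [
    "Gold standard comparison (Fscore)",
    "Domains",
    "Manual Evaluation (Precision)",
]


def totals(measures):
    """Sum the ranks of each system over the given measures."""
    return [sum(RANKS[m][i] for m in measures) for i in range(len(SYSTEMS))]


def dense_rank(scores):
    """Rank scores in ascending order; equal scores share the same rank."""
    ordered = sorted(set(scores))
    return [ordered.index(score) + 1 for score in scores]


taxonomy_totals = totals(RANKS)              # all seven measures
taxonomy_ranking = dense_rank(taxonomy_totals)
hypernym_totals = totals(STARRED)
hypernym_ranking = dense_rank(hypernym_totals)
```
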

Overall scores

Subtask	Measure	Baseline	JUNLP	TAXI	NUIG-UNLP	USAAR	QASSIT
Monolingual (EN)	Cyclicity	0	3	0	4	0	1
	Structure (F&M)	0.0046	0.1498	0.2908	0.0410	0.0013	0.4064
	Categorisation (i.i.)	77.67	377	104.5	213	96.33	59.5
	Connectivity (c.c.)	36.83	53.17	1	44.75	76.67	1
	Gold standard comparison (Fscore)	0.33	0.20	0.32	0.19	0.26	0.22
	Domains	6	6	6	4	6	4
	Manual Evaluation (Precision)	n.a.	0.09	0.20	0.07	0.49	0.10
Multilingual (NL,FR,IT)	Cyclicity	0	0	0	n.a.	n.a.	n.a.
	Structure (F&M)	0.0087	0.0155	0.1885
	Categorisation (i.i.)	64.28	178.22	64.94
	Connectivity (c.c.)	40.5	34.89	1
	Gold standard comparison (Fscore)	0.3133	0.1921	0.2815
	Manual Evaluation (Precision)	n.a.	0.2983	0.6252

These scores are obtained by averaging the results over domains (environment, science, food) and, for the multilingual setting, also over languages (NL, FR, IT).

Detailed Evaluation Results

Structural Evaluation

Language	Domain	Measure	Baseline	JUNLP	TAXI	NUIG-UNLP	USAAR	QASSIT
English	Environment (Eurovoc)	|V|	123	321	148	312	57	261
		|E|	112	463	207	456	47	365
		#i.i.	27	123	50	176	10	88
		#c.c.	17	19	1	58	10	1
		cycles	no	no	no	yes	no	no
	Food	|V|	636	1802	781	n.a.	3716	n.a.
		|E|	627	3015	1118		4347
		#i.i.	130	581	132		323
		#c.c.	40	48	1		217
		cycles	no	yes	no		no
	Food (Wordnet)	|V|	826	1748	1122	n.a.	675	n.a.
		|E|	812	3607	2067		540
		#i.i.	205	866	259		146
		#c.c.	79	123	1		135
		cycles	no	yes	no		no
	Science	|V|	232	602	294	595	371	452
		|E|	214	1046	418	1656	312	708
		#i.i.	41	255	73	409	60	58
		#c.c.	28	24	1	99	59	1
		cycles	no	no	no	yes	no	yes
	Science (Eurovoc)	|V|	50	186	100	97	37	125
		|E|	42	342	139	218	30	164
		#i.i.	11	133	25	72	7	25
		#c.c.	9	15	1	13	7	1
		cycles	no	yes	no	yes	no	no
	Science (Wordnet)	|V|	217	424	290	251	136	370
		|E|	174	690	459	929	104	647
		#i.i.	52	304	88	195	32	67
		#c.c.	48	90	1	9	32	1
		cycles	no	no	no	yes	no	no
Dutch	Environment (Eurovoc)	|V|	116	317	85	n.a.	n.a.	n.a.
		|E|	100	379	92
		#i.i.	24	73	24
		#c.c.	20	8	1
		cycles	no	no	no
	Food	|V|	459	1616	350
		|E|	399	1974	395
		#i.i.	84	329	85
		#c.c.	68	71	1
		cycles	no	no	no
	Food (Wordnet)	|V|	610	1433	515
		|E|	500	1868	573
		#i.i.	147	320	145
		#c.c.	122	63	1
		cycles	no	no	no
	Science	|V|	192	143	46
		|E|	166	145	49
		#i.i.	40	40	15
		#c.c.	30	26	1
		cycles	no	no	no
	Science (Eurovoc)	|V|	40	561	193
		|E|	31	774	203
		#i.i.	11	163	40
		#c.c.	10	21	1
		cycles	no	no	no
	Science (Wordnet)	|V|	213	427	221
		|E|	169	452	230
		#i.i.	55	89	59
		#c.c.	50	52	1
		cycles	no	no	no
French	Environment (Eurovoc)	|V|	130	327	128	n.a.	n.a.	n.a.
		|E|	117	415	181
		#i.i.	23	79	37
		#c.c.	17	7	1
		cycles	no	no	no
	Food	|V|	531	1649	522
		|E|	500	2224	699
		#i.i.	109	352	125
		#c.c.	49	53	1
		cycles	no	no	no
	Food (Wordnet)	|V|	712	1533	707
		|E|	679	2157	964
		#i.i.	172	373	193
		#c.c.	68	45	1
		cycles	no	no	no
	Science	|V|	201	587	249
		|E|	181	885	298
		#i.i.	43	186	58
		#c.c.	24	18	1
		cycles	no	no	no
	Science (Eurovoc)	|V|	48	145	83
		|E|	38	158	113
		#i.i.	11	45	24
		#c.c.	10	21	1
		cycles	no	no	no
	Science (Wordnet)	|V|	208	419	265
		|E|	169	458	336
		#i.i.	54	89	76
		#c.c.	43	40	1
		cycles	no	no	no
Italian	Environment (Eurovoc)	|V|	102	341	70	n.a.	n.a.	n.a.
		|E|	91	474	69
		#i.i.	18	87	14
		#c.c.	13	4	1
		cycles	no	no	no
	Food	|V|	459	1490	328
		|E|	417	1864	332
		#i.i.	103	314	66
		#c.c.	49	58	1
		cycles	no	no	no
	Food (Wordnet)	|V|	644	1505	471
		|E|	589	2053	486
		#i.i.	153	354	107
		#c.c.	73	53	1
		cycles	no	no	no
	Science	|V|	211	564	197
		|E|	194	773	199
		#i.i.	42	172	35
		#c.c.	23	22	1
		cycles	no	no	no
	Science (Eurovoc)	|V|	56	144	54
		|E|	45	149	55
		#i.i.	13	46	12
		#c.c.	12	24	1
		cycles	no	no	no
	Science (Wordnet)	|V|	211	430	208
		|E|	165	463	209
		#i.i.	55	97	54
		#c.c.	48	42	1
		cycles	no	no	no

Gold Standard Evaluation

Language	Domain	Measure	Baseline	JUNLP	TAXI	NUIG-UNLP	USAAR	QASSIT
English	Environment (Eurovoc)	Precision	0.5	0.1296	0.3382	0.1579	0.8085	0.1479
		Recall	0.2146	0.2299	0.2682	0.2759	0.1456	0.2069
		Fscore	0.3003	0.1658	0.2992	0.2008	0.2468	0.1725
		F&M	0.0	0.0814	0.2384	0.0007	0.0007	0.4349
	Food	Precision	0.4705	0.1320	0.3372	n.a.	0.0603	n.a.
		Recall	0.1859	0.2508	0.2376		0.1651
		Fscore	0.2665	0.1730	0.2787		0.0883
		F&M	0.0019	0.2608	0.2021		0.0
	Food (Wordnet)	Precision	0.5	0.1475	0.2583	n.a.	0.7056	n.a.
		Recall	0.2576	0.3376	0.3388		0.2418
		Fscore	0.34	0.2053	0.2932		0.3601
		F&M	0.0022	0.1925	0.3260		0.0021
	Science	Precision	0.6262	0.1377	0.3876	0.0984	0.3814	0.1794
		Recall	0.2882	0.3097	0.3484	0.3505	0.2559	0.2731
		Fscore	0.3947	0.1906	0.3669	0.1537	0.3063	0.2165
		F&M	0.0163	0.1774	0.3634	0.0090	0.0020	0.5757
	Science (Eurovoc)	Precision	0.6190	0.1316	0.2950	0.1330	0.6333	0.2134
		Recall	0.2097	0.3629	0.3306	0.2339	0.1532	0.2823
		Fscore	0.3133	0.1931	0.3118	0.1696	0.2468	0.2431
		F&M	0.0056	0.1373	0.3893	0.1517	0.0023	0.3893
	Science (Wordnet)	Precision	0.6897	0.2058	0.3747	0.1755	0.8173	0.2025
		Recall	0.2655	0.3142	0.3805	0.3606	0.1881	0.2898
		Fscore	0.3834	0.2487	0.3776	0.2361	0.3058	0.2384
		F&M	0.0016	0.0494	0.2255	0.0027	0.0008	0.2255
Dutch	Environment (Eurovoc)	Precision	0.53	0.1425	0.5543	n.a.	n.a.	n.a.
		Recall	0.1985	0.2022	0.1910
		Fscore	0.2888	0.1672	0.284
		F&M	0.0007	0.0097	0.1910
	Food	Precision	0.5363	0.1292	0.4608
		Recall	0.1470	0.1763	0.1259
		Fscore	0.2320	0.1491	0.1977
		F&M	-	0.0	0.1226
	Food (Wordnet)	Precision	0.526	0.1601	0.3857
		Recall	0.1963	0.2231	0.1649
		Fscore	0.2859	0.1864	0.2311
		F&M	0.0008	0.0009	0.1165
	Science	Precision	0.6024	0.1655	0.5306
		Recall	0.2227	0.1935	0.2097
		Fscore	0.3252	0.1784	0.3006
		F&M	0.0057	0.0206	0.2215
	Science (Eurovoc)	Precision	0.7742	0.1486	0.4778
		Recall	0.1935	0.2561	0.2160
		Fscore	0.3097	0.1881	0.2975
		F&M	0.0	0.0	0.1987
	Science (Wordnet)	Precision	0.5976	0.2257	0.4739
		Recall	0.2531	0.2556	0.2732
		Fscore	0.3556	0.2397	0.3466
		F&M	0.0026	0.0020	0.1699
French	Environment (Eurovoc)	Precision	0.5043	0.1373	0.2928	n.a.	n.a.	n.a.
		Recall	0.2218	0.2143	0.1992
		Fscore	0.3081	0.1674	0.2371
		F&M	0.0051	0.0110	0.1836
	Food	Precision	0.466	0.1210	0.3433
		Recall	0.1619	0.1867	0.1666
		Fscore	0.2401	0.1468	0.2243
		F&M	-	0.0	0.1398
	Food (Wordnet)	Precision	0.4153	0.1312	0.2562
		Recall	0.2077	0.2084	0.1819
		Fscore	0.2769	0.1610	0.2127
		F&M	0.0006	0.0006	0.2068
	Science	Precision	0.5856	0.1435	0.3993
		Recall	0.2350	0.2816	0.2639
		Fscore	0.3354	0.1901	0.3178
		F&M	0.0114	0.0748	0.3042
	Science (Eurovoc)	Precision	0.8684	0.2089	0.3363
		Recall	0.2661	0.2661	0.3065
		Fscore	0.4074	0.2340	0.3207
		F&M	0.0	0.0	0.3192
	Science (Wordnet)	Precision	0.6568	0.2664	0.3780
		Recall	0.2853	0.3136	0.3265
		Fscore	0.3979	0.2881	0.3503
		F&M	0.0022	0.1462	0.2597
Italian	Environment (Eurovoc)	Precision	0.5604	0.0970	0.7536	n.a.	n.a.	n.a.
		Recall	0.1917	0.1729	0.1955
		Fscore	0.2857	0.1243	0.3104
		F&M	0.0011	0.0011	0.1776
	Food	Precision	0.4365	0.1019	0.4277
		Recall	0.1396	0.1457	0.1089
		Fscore	0.2115	0.1199	0.1736
		F&M	-	0.0	0.1311
	Food (Wordnet)	Precision	0.4363	0.1305	0.4218
		Recall	0.1929	0.2012	0.1539
		Fscore	0.2676	0.1583	0.2255
		F&M	0.0868	0.0	0.0868
	Science	Precision	0.5670	0.0095	0.5176
		Recall	0.2477	0.1552	0.2320
		Fscore	0.3448	0.2703	0.3204
		F&M	0.0095	0.0095	0.1933
	Science (Eurovoc)	Precision	0.7556	0.2282	0.6
		Recall	0.2742	0.2742	0.2661
		Fscore	0.4024	0.2491	0.3687
		F&M	0.0034	0.0033	0.1976
	Science (Wordnet)	Precision	0.6182	0.2224	0.5024
		Recall	0.2576	0.2601	0.2652
		Fscore	0.3636	0.2398	0.3471
		F&M	0.0	0.0	0.1735

Gold Standard Evaluation (Average results for each language across domains)

Language	Measure	Baseline	JUNLP	TAXI	NUIG-UNLP	USAAR	QASSIT
English	Average Precision	0.57	0.15	0.33	0.14	0.57	0.19
	Average Recall	0.24	0.30	0.32	0.30	0.19	0.26
	Average Fscore	0.33	0.20	0.32	0.19	0.26	0.22
Dutch	Average Precision	0.59	0.16	0.48	n.a.	n.a.	n.a.
	Average Recall	0.20	0.22	0.20
	Average Fscore	0.30	0.19	0.28
French	Average Precision	0.58	0.17	0.33
	Average Recall	0.23	0.25	0.24
	Average Fscore	0.33	0.20	0.28
Italian	Average Precision	0.56	0.13	0.54
	Average Recall	0.22	0.20	0.20
	Average Fscore	0.31	0.19	0.29
Overall Multilingual (Other than English)	Average Precision	0.58	0.15	0.45
	Average Recall	0.22	0.22	0.21
	Average Fscore	0.31	0.19	0.28

Manual Evaluation

(Precision on a random sample of at most 100 novel relations)

Language	Domain	JUNLP	TAXI	NUIG-UNLP	USAAR	QASSIT
English	Environment (Eurovoc)	0.02	0.11	0.08	0.22	0.07
	Food	0.2	0.36	n.a.	0.73	n.a.
	Food (Wordnet)	0.18	0.32	n.a.	0.81	n.a.
	Science	0.06	0.14	0.09	0.71	0.07
	Science (Eurovoc)	0.02	0.02	0.04	0.0	0.05
	Science (Wordnet)	0.06	0.22	0.05	0.47	0.22
Dutch	Environment (Eurovoc)	0.27	0.24	n.a.	n.a.	n.a.
	Food	0.22	0.69
	Food (Wordnet)	0.28	0.71
	Science	0.41	0.8
	Science (Eurovoc)	0.26	0.43
	Science (Wordnet)	0.18	0.88
French	Environment (Eurovoc)	0.24	0.23
	Food	0.21	0.46
	Food (Wordnet)	0.23	0.58
	Science	0.32	0.58
	Science (Eurovoc)	0.53	0.6
	Science (Wordnet)	0.66	0.56
Italian	Environment (Eurovoc)	0.18	0.41
	Food	0.18	0.83
	Food (Wordnet)	0.32	0.69
	Science	0.39	0.90
	Science (Eurovoc)	0.25	0.82
	Science (Wordnet)	0.24	0.84

SemEval-2016 Task 13