Priberam: A Turbo Semantic Parser with Second Order Features

This paper presents our contribution to the SemEval-2014 shared task on Broad-Coverage Semantic Dependency Parsing. We employ a feature-rich linear model, including scores for first- and second-order dependencies (arcs, siblings, grandparents and co-parents). Decoding is performed in a global manner by solving a linear relaxation with alternating directions dual decomposition (AD³). Our system achieved the top score in the open challenge, and the second highest score in the closed track.


Introduction
The last decade saw considerable progress in statistical modeling for dependency syntactic parsing (Kübler et al., 2009). Models that incorporate rich global features are typically more accurate, even if pruning is necessary or decoding needs to be approximate (McDonald et al., 2006; Bohnet and Nivre, 2012; Martins et al., 2009, 2013). This paper applies the same rationale to semantic dependency parsing, in which the output variable is a semantic graph, rather than a syntactic tree. We extend a recently proposed dependency parser, TurboParser (Martins et al., 2010, 2013), to be able to perform semantic parsing using any of the three formalisms considered in this shared task (DM, PAS, and PCEDT). The result is TurboSemanticParser, which we release as open-source software (http://labs.priberam.com/Resources/TurboSemanticParser).
We describe here a second-order model for semantic parsing (§2). We follow prior work in semantic role labeling (Toutanova et al., 2005; Johansson and Nugues, 2008; Das et al., 2012; Flanigan et al., 2014), by adding constraints and modeling interactions among arguments within the same frame; however, we go beyond such sibling interactions to consider more complex grandparent and co-parent structures, effectively correlating different predicates. We formulate parsing as a global optimization problem and solve a relaxation through AD³, a fast dual decomposition algorithm in which several simple local subproblems are solved iteratively (§3). Through a rich set of features (§4), we arrive at top accuracies at parsing speeds around 1,000 tokens per second, as described in the experimental section (§5).
Figure 1: Example of a semantic graph in the DM formalism (sentence #22006003). We treat top nodes as a special semantic role TOP whose predicate is a dummy root symbol.
This work is licensed under a Creative Commons Attribution 4.0 International Licence. Page numbers and proceedings footer are added by the organisers. Licence details: http://creativecommons.org/licenses/by/4.0/

A Second Order Model for Parsing
Figure 1 depicts a sentence and its semantic graph. We cast semantic parsing as a structured prediction problem. Let x be a sentence and Y(x) the set of possible dependency graphs. We assume each candidate graph y ∈ Y(x) can be represented as a set of substructures (called parts) in an underlying set S (e.g., predicates, arcs, pairs of adjacent arcs). We design a score function f which decomposes as a sum over these substructures, $f(x, y) := \sum_{s \in S} f_s(x, y_s)$. We parametrize this function using a weight vector w, and write each atomic function as $f_s(x, y_s) := w \cdot \phi_s(x, y_s)$, where $\phi_s(x, y_s)$ is a vector of local features. The decoding problem consists in obtaining the best-scored semantic graph $\hat{y}$ given a sentence x:
$$\hat{y} = \arg\max_{y \in Y(x)} f(x, y). \tag{1}$$
Our choice of parts is given in Figure 2. The second-order parts are inspired by prior work in syntactic parsing, modeling interactions for pairs of (unlabeled) dependency arcs, such as grandparents (Carreras, 2007) and siblings (Smith and Eisner, 2008; Martins et al., 2009). The main novelty is co-parent parts, which, to the best of our knowledge, were never considered before, as they only make sense when multiple parents are allowed.
Figure 2: Parts considered in this paper. The top row illustrates the basic parts, representing the event that a word is a predicate, or the existence of an arc between a predicate and an argument, possibly labeled with a semantic role. Our second-order model looks at some pairs of arcs: arcs bearing a grandparent relationship, arguments of the same predicate, predicates sharing the same argument, and consecutive versions of these two.
If all parts were basic, decoding could be done independently for each predicate p, as illustrated in Algorithm 1. The total runtime, for a sentence with L words, is $O(L^2 |R|)$, where R is the set of semantic roles. Adding consecutive siblings still permits independent decoding for each predicate, but dynamic programming is necessary to decode the best argument frame, increasing the runtime to $O(L^3 |R|)$. The addition of consecutive co-parents, grandparents, and arbitrary siblings and co-parents breaks this independence and calls for approximate decoding. Even without second-order parts, the inclusion of hard constraints (such as requiring some roles to be unique, see §3) also makes the problem harder.
Algorithm 1: Decoding in an Arc-Factored Model
    input: predicate scores $\sigma_P(p)$, arc scores $\sigma_A(p \to a)$, labeled arc scores $\sigma_{LA}(p \xrightarrow{r} a)$
    initialize semantic graph $G \leftarrow \emptyset$
    for p = 0 to L do
        initialize $\sigma \leftarrow \sigma_P(p)$, frame $A(p) \leftarrow \emptyset$
        for a = 1 to L do
            set $r \leftarrow \arg\max_{r'} \sigma_{LA}(p \xrightarrow{r'} a)$
            if $\sigma_A(p \to a) + \sigma_{LA}(p \xrightarrow{r} a) > 0$ then
                $A(p) \leftarrow A(p) \cup \{\langle p, a, r \rangle\}$
                $\sigma \leftarrow \sigma + \sigma_A(p \to a) + \sigma_{LA}(p \xrightarrow{r} a)$
            end if
        end for
        if $\sigma > 0$ then set $G \leftarrow G \cup \{\langle p, A(p) \rangle\}$
    end for
    output: semantic graph G
Rather than looking for a model in which exact decoding is tractable, which could be even more stringent for parsing semantic graphs than for dependency trees, we embrace approximate decoding strategies. Namely, our approach is based on dual decomposition, a class of optimization techniques that tackle the dual of combinatorial problems in a modular and extensible manner (Komodakis et al., 2007). We employ alternating directions dual decomposition (AD³; Martins et al., 2011). Like the subgradient algorithm for dual decomposition, AD³ splits the original problem into local subproblems, and seeks an agreement on the overlapping variables. The difference is that the AD³ subproblems have an additional quadratic term to accelerate consensus, achieving a faster convergence rate both in theory and in practice (Martins et al., 2013). For several factors (such as logic factors representing AND, OR and XOR constraints, budget constraints, and binary pairwise factors), these quadratic subproblems can be solved efficiently. For dense or structured factors, the quadratic subproblems can be solved as a sequence of local Viterbi decoding steps, via an active set method (Martins, 2014); this local decoding operation is the same one that needs to be performed in the subgradient algorithm. We describe these subproblems in detail in the next section.

Solving the Subproblems
Predicate and Arc-Factored Parts. We capture all the basic parts with a single component. As stated in §2, local decoding in this component has a runtime of $O(L^2 |R|)$, by using Algorithm 1.
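The arc-factored decoding of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the TurboSemanticParser implementation: the function name, the list/dictionary layout of the scores, and the toy numbers are all assumptions made for the example. Position 0 plays the role of the dummy root.

```python
from typing import Dict, List, Tuple


def decode_arc_factored(
    pred_scores: List[float],
    arc_scores: List[List[float]],
    labeled_scores: List[List[Dict[str, float]]],
) -> Dict[int, List[Tuple[int, str]]]:
    """Greedy, per-predicate decoding with basic parts only.

    pred_scores[p]          : score of p being a predicate (hypothetical layout)
    arc_scores[p][a]        : score of the unlabeled arc p -> a
    labeled_scores[p][a][r] : score of labeling the arc p -> a with role r
    Returns {predicate: [(argument, role), ...]}.
    """
    n = len(pred_scores)  # positions 0..L, where 0 is the dummy root
    graph: Dict[int, List[Tuple[int, str]]] = {}
    for p in range(n):
        total = pred_scores[p]
        frame: List[Tuple[int, str]] = []
        for a in range(1, n):
            role_scores = labeled_scores[p][a]
            if not role_scores:
                continue
            # Best role for this arc, chosen independently of all other arcs.
            r = max(role_scores, key=role_scores.get)
            gain = arc_scores[p][a] + role_scores[r]
            if gain > 0:
                frame.append((a, r))
                total += gain
        # Keep the predicate only if it pays off together with its frame.
        if total > 0:
            graph[p] = frame
    return graph


# Toy sentence with L = 2 words; all scores are made up.
pred_scores = [0.5, 1.0, -3.0]
arc_scores = [
    [0.0, 0.2, -1.0],
    [0.0, -9.0, 0.7],
    [0.0, 0.3, -9.0],
]
labeled_scores = [
    [{}, {"TOP": 0.4}, {"TOP": 0.1}],
    [{}, {"ARG1": 0.0}, {"ARG1": 0.5, "ARG2": 0.9}],
    [{}, {"ARG1": 0.2}, {"ARG1": 0.0}],
]
graph = decode_arc_factored(pred_scores, arc_scores, labeled_scores)
print(graph)
```

The two nested loops over positions, with a maximization over roles inside, give the $O(L^2 |R|)$ runtime quoted above. In the toy input, word 2 is dropped as a predicate because its one positive arc does not compensate for its negative predicate score.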
Unique Roles. We assume some roles are unique, i.e., they can occur at most once for the same predicate. To cope with unique roles, we add hard constraints of the kind
$$\sum_{a} y_{p \xrightarrow{r} a} \le 1, \quad \forall p, \; \forall r \in R^{\mathrm{uniq}},$$
where $y_{p \xrightarrow{r} a} \in \{0, 1\}$ indicates whether the arc from p to a with role r is in the graph, and $R^{\mathrm{uniq}}$ is the set of unique roles. This set is obtained from the training data by looking at the roles that never occur multiple times in the gold argument frames. The constraint above corresponds to an ATMOSTONE factor, which is built into AD³ and can be decoded in linear time (rendering the runtime $O(L^2 |R^{\mathrm{uniq}}|)$ when aggregating all such factors). These have also been used by Das et al. (2012) in frame-semantic parsing.
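As an illustration of how the set of unique roles could be collected from gold argument frames, here is a small Python sketch. The frame representation (a list of (argument, role) pairs per predicate), the function names, and the role labels in the example are assumptions for this sketch only.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Frame = List[Tuple[int, str]]  # hypothetical: (argument position, role) pairs


def find_unique_roles(gold_frames: Iterable[Frame]) -> Set[str]:
    """Roles that never occur more than once within any single gold frame."""
    seen: Set[str] = set()
    repeated: Set[str] = set()
    for frame in gold_frames:
        counts = Counter(role for _, role in frame)
        seen.update(counts)
        repeated.update(role for role, c in counts.items() if c > 1)
    return seen - repeated


def satisfies_at_most_one(frame: Frame, unique_roles: Set[str]) -> bool:
    """Check the at-most-one constraint for one predicate's frame."""
    counts = Counter(role for _, role in frame)
    return all(counts[role] <= 1 for role in unique_roles)


# Toy gold frames; the role names are illustrative.
gold_frames = [
    [(2, "ARG1"), (5, "ARG2")],
    [(1, "ARG1"), (3, "MOD"), (4, "MOD")],
]
unique_roles = find_unique_roles(gold_frames)
print(sorted(unique_roles))
```

Here MOD is excluded because it appears twice in the second frame, so a predicted frame with two MOD arguments remains admissible while one with two ARG1 arguments does not.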
Grandparents, Arbitrary Siblings and Co-parents. The second-order parts in the middle row of Figure 2 all involve the simultaneous inclusion of a pair of arcs, without further dependency on the remaining arcs. We handle each of these parts using a simple pairwise factor (called PAIR in the AD³ toolkit). The total runtime to locally decode these factors is $O(L^3)$.
Predicate Automata. To handle consecutive siblings, we adapt the simple head automaton model (Alshawi, 1996; Smith and Eisner, 2008) to semantic parsing. We introduce one automaton for each predicate p and attachment direction (left or right). We describe right-side predicate automata; their left-side counterparts are analogous. Let $a_0, a_1, \ldots, a_{k+1}$ be the sequence of right modifiers of p, with $a_0$ = START and $a_{k+1}$ = END. Then, we have the following component capturing consecutive siblings:
$$f^{\mathrm{CSIB}}_{p,\to}(p \to a_1, \ldots, p \to a_k) = \sum_{j=1}^{k+1} \sigma^{\mathrm{CSIB}}(p, a_{j-1}, a_j).$$
Maximizing $f^{\mathrm{CSIB}}_{p,\to}$ via dynamic programming has a cost of $O(L^2)$, yielding $O(L^3)$ total runtime.
Argument Automata. For consecutive co-parents, we introduce another automaton which is analogous to the predicate automaton, but where arrows are reversed. Let $p_0, p_1, \ldots, p_{k+1}$ be the sequence of right predicates that take a as argument (the left-side case is analogous), with $p_0$ = START and $p_{k+1}$ = END. We define:
$$f^{\mathrm{CCP}}_{a,\leftarrow}(a \leftarrow p_1, \ldots, a \leftarrow p_k) = \sum_{j=1}^{k+1} \sigma^{\mathrm{CCP}}(a, p_{j-1}, p_j).$$
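The dynamic program behind a right-side predicate automaton can be sketched as below. This is a simplified illustration under stated assumptions: the function and variable names are hypothetical, and each arc carries an extra unary score (`unary`) in addition to the consecutive-sibling scores, which is a modeling choice of this sketch rather than something specified in the text. The same routine applies to argument automata by swapping the roles of predicates and arguments.

```python
from typing import Callable, Dict, Hashable, List, Tuple

START, END = "START", "END"


def decode_consecutive_siblings(
    candidates: List[int],
    unary: Dict[int, float],
    sib_score: Callable[[Hashable, Hashable], float],
) -> Tuple[float, List[int]]:
    """Best left-to-right sequence of modifiers for one predicate and direction.

    candidates : candidate modifier positions, sorted left to right
    unary[a]   : score for including modifier a (assumption of this sketch)
    sib_score  : score of two consecutive modifiers (or START / END markers)
    Runs in O(m^2) time for m candidates.
    """
    best: List[float] = []  # best[i]: best sequence ending in candidates[i]
    back: List[int] = []    # back[i]: index of the previous modifier, or -1
    for i, a in enumerate(candidates):
        score, prev = sib_score(START, a), -1
        for j in range(i):
            cand = best[j] + sib_score(candidates[j], a)
            if cand > score:
                score, prev = cand, j
        best.append(score + unary[a])
        back.append(prev)

    # Close the sequence with END; the empty sequence scores START -> END.
    final, last = sib_score(START, END), -1
    for j in range(len(candidates)):
        cand = best[j] + sib_score(candidates[j], END)
        if cand > final:
            final, last = cand, j

    chosen: List[int] = []
    while last != -1:
        chosen.append(candidates[last])
        last = back[last]
    chosen.reverse()
    return final, chosen


# Toy scores for a predicate with candidate right modifiers 3, 4, 5.
pair_scores = {(START, 3): 0.2, (3, 4): -1.0, (3, 5): 0.5, (4, 5): 0.3, (5, END): 0.1}
unary = {3: 1.0, 4: -0.5, 5: 0.8}
score, chosen = decode_consecutive_siblings(
    [3, 4, 5], unary, lambda u, v: pair_scores.get((u, v), 0.0)
)
print(chosen, round(score, 6))
```

With up to L candidates per predicate the inner double loop costs $O(L^2)$, and running one automaton per predicate and direction gives the $O(L^3)$ total.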

Features
We define binary features for each part represented in Figure 2.
Predicate Features. Our predicate features are:
• PREDWORD, PREDLEMMA, PREDPOS. Lexical form, lemma, and POS tag of the predicate.
• PREDREL. † Syntactic dependency relation between the predicate and its head.
• PREDHEADWORD/POS. † Form and POS tag of the predicate syntactic head, conjoined with the predicate word and POS tag.
• PREDMODWORD/POS/REL. † Form, POS tag, and dependency relation of the predicate syntactic dependents, conjoined with the predicate word and POS tag.
Arc Features. All features above, plus the following (conjoined with arc direction and label):
• ARGWORD, ARGLEMMA, ARGPOS. The lexical form, lemma, and POS tag of the argument.
• ARGREL. † Syntactic dependency relation between the argument and its head.
• LEFTWORD/POS, † RIGHTWORD/POS. † Form/POS tag of the leftmost/rightmost dependent of the argument, conjoined with the predicate word and POS tag.
• LEFTSIBWORD/POS, † RIGHTSIBWORD/POS. † Form/POS tag of the left/right sibling of the argument, conjoined with the predicate tag.
• PREDCONTEXTWORD, PREDCONTEXTPOS, PREDCONTEXTLEMMA. Word, POS, and lemma on the left and right context of the predicate (context size is 2).
• PREDCONTEXTPOSBIGRAM/TRIGRAM. Bigram and trigram of POS tags on the left and right side of the predicate.
• PREDARGPOSCONTEXT. Several features conjoining the POS of words surrounding the predicate and argument (similar to the contextual features in McDonald et al. (2005)).
• Exact and binned arc length (distance between predicate and argument), conjoined with the predicate and argument POS tags.
• POSINBETWEEN, WORDINBETWEEN. POS and forms between the predicate and argument, conjoined with their own POS tags and forms.
• RELPATH, † POSPATH. † Path in the syntactic dependency tree between the predicate and the argument. The path is formed either by dependency relations or by POS tags.
Second Order Features. These involve a predicate, an argument, and a "companion word" (which can be a second argument, in the case of siblings, a second predicate, for co-parents, or the argument of another argument, for grandparents). In all cases, features are of the following kind:
• POSTRIPLET. POS tags of the predicate, the argument, and the companion word.
• UNILEXICAL. One word form (for the predicate/argument/companion) and two POS tags.
• BILEXICAL. One POS tag (for the predicate/argument/companion) and two word forms.
• PAIRWISE. Backed-off pair features for the companion word form/POS tag and the word form/POS of the predicate/argument.
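To make the second-order templates concrete, the sketch below instantiates them as feature strings for one (predicate, argument, companion) triple. The string format, the `kind` prefix distinguishing the part type, and the `Token` type are assumptions for illustration; the actual feature encoding in the system may differ.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    form: str
    pos: str


def second_order_features(kind: str, pred: Token, arg: Token, comp: Token) -> List[str]:
    """Instantiate the four template families for one second-order part.

    kind is a hypothetical tag for the part type (e.g. "SIB", "COPAR", "GRAND").
    """
    toks = {"p": pred, "a": arg, "c": comp}
    feats = [f"{kind}:POSTRIPLET:{pred.pos}|{arg.pos}|{comp.pos}"]

    # UNILEXICAL: one word form, the other two tokens as POS tags.
    for lex in toks:
        fields = [t.form if k == lex else t.pos for k, t in toks.items()]
        feats.append(f"{kind}:UNILEXICAL[{lex}]:" + "|".join(fields))

    # BILEXICAL: one POS tag, the other two tokens as word forms.
    for backoff in toks:
        fields = [t.pos if k == backoff else t.form for k, t in toks.items()]
        feats.append(f"{kind}:BILEXICAL[{backoff}]:" + "|".join(fields))

    # PAIRWISE: companion form/POS paired with predicate or argument form/POS.
    for name, other in (("p", pred), ("a", arg)):
        for c_tag, c_val in (("w", comp.form), ("t", comp.pos)):
            for o_tag, o_val in (("w", other.form), ("t", other.pos)):
                feats.append(f"{kind}:PAIRWISE[c{c_tag},{name}{o_tag}]:{c_val}|{o_val}")
    return feats


# A sibling part: the companion is a second argument of the same predicate.
feats = second_order_features(
    "SIB", Token("wants", "VBZ"), Token("he", "PRP"), Token("go", "VB")
)
for f in feats:
    print(f)
```

Each string would be mapped to an index of the weight vector w, so that the score of the part is the sum of the weights of its active binary features.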

Experimental Results
All models were trained by running 10 epochs of max-loss MIRA with C = 0.01 (Crammer et al., 2006). The cost function takes into account mismatches between predicted and gold dependencies, with a cost $c_P$ on labeled arcs incorrectly predicted (false positives) and a cost $c_R$ on gold labeled arcs that were missed (false negatives). These values were set through cross-validation on the dev set, yielding $c_P = 0.4$ and $c_R = 0.6$ in all runs, except for the DM and PCEDT datasets in the closed track, for which $c_P = 0.3$ and $c_R = 0.7$.
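The asymmetric cost described above can be written down in a few lines. The sketch assumes labeled arcs are represented as (predicate, argument, role) tuples and that a mislabeled arc counts both as a false positive and as a false negative; both choices are assumptions of this illustration.

```python
from typing import Set, Tuple

Arc = Tuple[int, int, str]  # hypothetical: (predicate, argument, role)


def arc_cost(predicted: Set[Arc], gold: Set[Arc], c_p: float = 0.4, c_r: float = 0.6) -> float:
    """Weighted count of false positives (c_p) and false negatives (c_r)."""
    false_positives = len(predicted - gold)
    false_negatives = len(gold - predicted)
    return c_p * false_positives + c_r * false_negatives


gold = {(0, 2, "TOP"), (2, 1, "ARG1"), (2, 3, "ARG2")}
predicted = {(0, 2, "TOP"), (2, 1, "ARG2"), (2, 4, "ARG2")}
cost = arc_cost(predicted, gold)
print(round(cost, 6))
```

Setting $c_R > c_P$ penalizes missed gold arcs more than spurious ones, which pushes the trained model toward higher recall.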
To speed up decoding, we discard arcs whose posterior probability is below $10^{-4}$, according to a probabilistic unlabeled first-order pruner. Table 1 shows a significant reduction of the search space with a very small drop in recall. Table 2 shows our final results on the test set, for a model trained on the train and development partitions. Our system achieved the best score in the open track (an LF score of 86.27%, averaged over DM, PAS, and PCEDT), and the second best in the closed track, after the Peking team. Overall, we observe that the precision and recall in PCEDT are far below those of the other two formalisms, but this difference is much smaller when looking at unlabeled scores. Comparing the results in the closed and open tracks, we observe a consistent improvement in the three formalisms of around 1% in F1 from using syntactic information. While this confirms previous findings that syntactic features are important in semantic role labeling (Toutanova et al., 2005; Johansson and Nugues, 2008), these improvements are less striking than expected. We conjecture this is because our model in the closed track already incorporates a variety of contextual features which are nearly as informative as those extracted from the dependency trees.
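The pruning step mentioned above amounts to thresholding arc posteriors. The sketch below takes the posteriors as given (how the pruner computes them is not covered here) and reports the two quantities one would look at: the fraction of arcs kept and the recall of gold arcs. All names and numbers are illustrative assumptions.

```python
from typing import Dict, Set, Tuple

UnlabeledArc = Tuple[int, int]  # hypothetical: (predicate, argument)


def prune_arcs(posteriors: Dict[UnlabeledArc, float], threshold: float = 1e-4) -> Set[UnlabeledArc]:
    """Keep arcs whose posterior probability is not below the threshold."""
    return {arc for arc, prob in posteriors.items() if prob >= threshold}


def pruning_recall(kept: Set[UnlabeledArc], gold: Set[UnlabeledArc]) -> float:
    """Fraction of gold arcs that survive pruning (1.0 if there are none)."""
    if not gold:
        return 1.0
    return len(kept & gold) / len(gold)


# Made-up posteriors for five candidate arcs.
posteriors = {(0, 2): 0.93, (2, 1): 0.41, (2, 3): 0.00007, (1, 3): 0.0002, (3, 1): 1e-6}
gold_arcs = {(0, 2), (2, 1), (2, 3)}
kept = prune_arcs(posteriors)
print(sorted(kept), len(kept) / len(posteriors), pruning_recall(kept, gold_arcs))
```

Only the surviving arcs, and the second-order parts built from pairs of them, need to be scored and decoded, which is where the speed-up comes from.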
Finally, to assess the importance of the second-order features, Table 3 reports experiments on the dev set that progressively add several groups of features, along with runtimes. We can see that siblings, co-parents, and grandparents all provide valuable information that improves the final scores (with the exception of the PCEDT labeled scores, where the difference is negligible). This comes at only a small cost in terms of runtime, which is around 1,000 tokens per second for the full models.
Table 3: Dev-set experiments progressively adding several groups of features, until the full model is obtained. We report unlabeled/labeled F1 and parsing speeds in tokens per second. Our speeds include the time necessary for pruning, evaluating features, and decoding, as measured on an Intel Core i7 processor @3.4 GHz.

Conclusions
We have described a system for broad-coverage semantic dependency parsing. Our system, which is inspired by prior work in syntactic parsing, implements a linear model with second-order features, and is able to model interactions between siblings, grandparents and co-parents. We have shown empirically that second-order features have an impact on the final scores. Approximate decoding was performed via alternating directions dual decomposition (AD³), yielding fast runtimes of around 1,000 tokens per second.