SemEval-2015 Task 15: A Corpus Pattern Analysis Dictionary-Entry-Building Task

Vít Baisa, Masaryk University
Ismaïl El Maarouf, University of Wolverhampton
Jane Bradbury, University of Wolverhampton
Adam Kilgarriff, Lexical Computing Ltd
Silvie Cinková, Charles University
Octavian Popescu, IBM Research Center

Abstract

This paper describes the first SemEval task to explore the use of Natural Language Processing systems for building dictionary entries, in the framework of Corpus Pattern Analysis. CPA is a corpus-driven technique which provides tools and resources to identify and represent unambiguously the main semantic patterns in which words are used. Task 15 draws on the Pattern Dictionary of English Verbs (www.pdev.org.uk) for the targeted lexical entries, and on the British National Corpus for the input text. Dictionary entry building is split into three subtasks which all start from the same concordance sample: 1) CPA parsing, where arguments and their syntactic and semantic categories have to be identified; 2) CPA clustering, in which sentences with similar patterns have to be clustered; and 3) CPA automatic lexicography, where the structure of patterns has to be constructed automatically. Subtask 1 attracted 3 teams, though none could beat the baseline (a rule-based system). Subtask 2 attracted 2 teams, one of which beat the baseline (a majority-class classifier). Subtask 3 did not attract any participants. The task has produced a major semantic multi-dataset resource which includes data for 121 verbs and about 17,000 annotated sentences, and which is freely accessible.

1 Introduction

It is a central vision of NLP to represent the meanings of texts in a formalised way, amenable to automated reasoning. Since its birth, SemEval (or Senseval as it was then; Kilgarriff and Palmer, 2000) has been part of the programme of enriching NLP analyses of text so that they get ever closer to a meaning representation. In relation to lexical information, this has meant finding a lexical resource which identified the different meanings of words in a way that made high-quality disambiguation possible, and which represented those meanings in ways that were useful for the next steps of building meaning representations. Most lexical resources explored to date have had only limited success on either front. The most obvious candidates, published dictionaries and WordNets, look like they might support the first task, but are very limited in what they offer to the second.

FrameNet moved the game forward a stage. Here was a framework with a convincing account of how the lexical entry might contribute to building the meaning of the sentence, and with enough meat in the lexical entries (e.g. the verb frames) that it might support disambiguation. Papers such as (Gildea and Jurafsky, 2002) looked promising, and in 2007 there was a SemEval task on Frame Semantic Structure Extraction (Baker et al., 2007) and, in 2010, one on Linking Events and Their Participants (Ruppenhofer et al., 2010). While there has been a substantial amount of follow-up work, there are some aspects of FrameNet that make it a hard target. It is organised around frames, rather than words, so inevitably its priority is to give a coherent account of the different verb senses in a frame, rather than the different senses of an individual verb.
This will tend to make it less good for supporting disambiguation. Frames are not data-driven: they are the work of a theorist (Fillmore) doing his best to make sense of the data for a set of verbs. The prospects of data-driven frame discovery are, correspondingly, slim. While FrameNet has worked hard at being systematic in its use of corpus data, FrameNetters looked only for examples showing the verb being used in the relevant sense. From the point of view of a process that could possibly be automated, this is problematic.

An approach which bears many similarities to FrameNet, but which starts from the verb rather than the frame, and is more thoroughgoing in its empiricism, is Hanks's Corpus Pattern Analysis (Hanks and Pustejovsky, 2005; Hanks, 2012; Hanks, 2013).

2 Corpus Pattern Analysis

Corpus Pattern Analysis (CPA) is a new technique of language analysis which produces the main patterns of use of words in text. Figure 1 is a sample lexical entry from the main output of CPA, the Pattern Dictionary of English Verbs (PDEV).

Figure 1: PDEV entry for abolish.
  Pattern 1 (58.8%): Institution or Human abolishes Action or Rule or Privilege
    Implicature: Institution or Human formally declares that Action = Punishment or Rule or Privilege is no longer legal or operative
  Pattern 2 (24.4%): Institution 1 or Human abolishes Institution 2 or Human Role
    Implicature: Institution 1 or Human formally puts an end to Institution 2 or Human Role
  Pattern 3 (14.4%): Process abolishes State of Affairs
    Implicature: Process brings State of Affairs to an end

This tells us that, for the verb abolish, three patterns were found. For each pattern it tells us the percentage of the data that it accounted for, its grammatical structure, and the semantic type (drawn from a shallow ontology of 225 semantic types) of each of the arguments in this structure. For instance, pattern 1 means: i) that the subject is preferably a word referring to [[Human]] or [[Institution]] (a semantic alternation), and ii) that the object is preferably [[Action]], [[Rule]] or [[Privilege]]. It also tells us the implicature (which is similar to a definition in a traditional dictionary) of a sentence exemplifying the pattern: that is, if we have a sentence of the pattern [[Institution | Human]] abolish [[Action = Punishment | Rule | Privilege]], then we know that [[Institution | Human]] formally declares that [[Action = Punishment | Rule | Privilege]] is no longer legal or operative. Abolish has only one sense. For many verbs, there will be multiple senses, each with one or more patterns.

There are currently full CPA entries for more than 1,000 verbs with a total of over 4,000 patterns. For each verb, a random sample of (by default) 250 corpus instances was examined, used to build the lexical entry, and tagged with the senses and patterns they represented. For more common verbs, more corpus lines were examined. The corpus instances were drawn from the written part of the British National Corpus (BNC).

PDEV has been studied from different NLP perspectives, mainly involving Word Sense Disambiguation and semantic analysis (Cinková et al., 2012a; Holub et al., 2012; El Maarouf et al.; El Maarouf and Baisa, 2013; Kawahara et al.; Popescu, 2013; Popescu et al.; Pustejovsky et al., 2004; Rumshisky et al.). For example, Popescu (2013) described experiments in modelling finite state automata on a set of 721 verbs taken from PDEV, and reports an accuracy of over 70% in pattern disambiguation. Holub et al. (2012) trained several statistical classifiers on a modified subset of 30 PDEV entries (Cinková et al., 2012c) using morpho-syntactic as well as semantic features, and obtained over 80% accuracy. On a smaller set of 20 high-frequency verbs, El Maarouf and Baisa (2013) reached a similar 0.81 overall F1 score with a supervised SVM classifier based on dependency parsing and named entity recognition features.
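To make the kind of information carried by an entry such as Figure 1 concrete, the following minimal sketch (our own illustration, not the PDEV release format; all field and class names are invented) shows how one pattern of abolish could be represented as a data structure:

```python
# Illustrative only: one CPA pattern of "abolish" as a plain data structure.
from dataclasses import dataclass

@dataclass
class CPAPattern:
    verb: str
    number: int
    coverage: float        # share of the annotated sample, e.g. 0.588 for 58.8%
    subject_types: list    # semantic alternation allowed for the subject
    object_types: list     # semantic alternation allowed for the object
    implicature: str       # gloss of what a sentence matching the pattern means

abolish_p1 = CPAPattern(
    verb="abolish",
    number=1,
    coverage=0.588,
    subject_types=["Institution", "Human"],
    object_types=["Action=Punishment", "Rule", "Privilege"],
    implicature=("[[Institution | Human]] formally declares that "
                 "[[Action = Punishment | Rule | Privilege]] is no longer "
                 "legal or operative"),
)
print(abolish_p1)
```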
The goals of Task 15 at SemEval 2015 are i) to explore in more depth the mechanics of corpus-based semantic analysis, and ii) to provide a high-quality standard dataset, as well as baselines, for the advancement of semantic processing. Given the complexity and wealth of PDEV, a major issue was to select relevant subtasks and subsets. The task was eventually split into three essential steps in building a CPA lexical entry, which systems could tackle separately:

1. CPA parsing: all sentences in the dataset to be syntactically and semantically parsed.

2. CPA clustering: all sentences in the dataset to be grouped according to their similarities.

3. CPA lexicography: all verb patterns found in the dataset to be described in terms of their syntactic and semantic properties.

3 Task Description

In order to encourage participants to design systems which could successfully tackle all three subtasks, all tasks were to be evaluated on the same set of verbs. As opposed to previous experiments on PDEV, it was decided that the set of verbs in the test dataset would be different from the set of verbs given in the training set. This was meant to avoid limiting the tasks to supervised approaches and to encourage innovative approaches, perhaps using patterns learnt in an unsupervised manner from very large corpora and other resources. This also implied that the dataset would be constructed so as to make it possible for systems to generalize from the behaviour and description of one set of verbs to a set of unseen verbs used in similar structures, as human language learners do. Although this obviously makes the task harder, it was hoped that this would put us in a better position to evaluate the current limits of automatic semantic analysis.

3.1 Subtask 1: CPA Parsing

The CPA parsing subtask focuses on the detection and classification (syntactic and semantic) of the arguments of the verb. The subtask is similar to Semantic Role Labelling (Carreras and Marquez, 2004), except that arguments are identified in the dependency parsing paradigm (Buchholz and Marsi, 2006), using head words instead of phrases. The syntactic tagset (Table 1) was designed specially for this subtask and kept to a minimum, and the semantic tagset was based on the CPA Semantic Ontology.

Table 1: Syntactic tagset used for subtask 1.
  subj     Subject
  obj      Object
  iobj     Indirect Object
  advprep  Adverbial Preposition or other Adverbial/Verbal Link
  acomp    Adverbial or Verb Complement
  scomp    Noun or Adjective Complement

In Example (1), this would mean identifying government as subject of abolish, from the [[Institution]] type, and tax as object, belonging to [[Rule]]. The expected output is represented in XML format in Example (2).

(1) In 1981 the Conservative government abolished capital transfer tax and replaced it with inheritance tax.

(2) In 1981 the Conservative <entity syn="subj" sem="Institution">government</entity> <entity syn="v" sem="-">abolished</entity> capital transfer <entity syn="obj" sem="Rule">tax</entity> and replaced it with inheritance tax.

The only dependency relations shown are those involving the node verb. Thus, for example, the dependency relation between Conservative and government is not shown. Also, only the relations in Table 1 are shown; the relation between abolished and replaced is not shown, as it is not one of the targeted dependency relations. The input text consisted of individual sentences, one word per line, with both ID and FORM fields, and in which only the target verb token was pre-tagged.
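As a concrete illustration of this output format, here is a minimal sketch (our own code, not the official task tooling; the tag names follow Table 1, everything else is invented for the example) that renders a per-token annotation of Example (1) in the inline style of Example (2):

```python
# Subtask-1 style annotation for Example (1): (token, syntactic tag, semantic
# type); None means the token is not an argument of the node verb.
from xml.sax.saxutils import escape

sentence = [
    ("In", None, None), ("1981", None, None), ("the", None, None),
    ("Conservative", None, None),
    ("government", "subj", "Institution"),
    ("abolished", "v", "-"),
    ("capital", None, None), ("transfer", None, None),
    ("tax", "obj", "Rule"),
    ("and", None, None), ("replaced", None, None), ("it", None, None),
    ("with", None, None), ("inheritance", None, None), ("tax", None, None),
]

def render(tokens):
    """Wrap tagged head words in <entity> elements; leave other tokens as-is."""
    parts = []
    for form, syn, sem in tokens:
        if syn is None:
            parts.append(escape(form))
        else:
            parts.append('<entity syn="%s" sem="%s">%s</entity>'
                         % (syn, sem, escape(form)))
    return " ".join(parts)

print(render(sentence))
```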
3.2 Subtask 2: CPA Clustering

The CPA clustering subtask is similar to a Word Sense Discrimination task, in which systems have to predict which pattern a verb instance belongs to. With respect to abolish (Figure 1), it would involve identifying all sentences containing the verb abolish which belonged to the same pattern (one of the patterns in Figure 1) and tagging them with the same number.

3.3 Subtask 3: CPA Automatic Lexicography

The CPA automatic lexicography subtask aims to evaluate how systems can approach the design of a lexicographical entry within CPA's framework. The input was, as for the other tasks, plain text with the node verb identified. The output format was a variant of that shown in Figure 1, simplified to a form which would be more tractable for systems while still being a relevant representation from the lexicographical perspective. Specifically, contextual roles were discarded and semantic alternations were decomposed into semantic strings (see Bradbury and El Maarouf, 2013), so that pattern 1 in Figure 1 would give rise to six strings (with V for the verb, here abolish):

[[Human]] V [[Action]]
[[Human]] V [[Rule]]
[[Human]] V [[Privilege]]
[[Institution]] V [[Action]]
[[Institution]] V [[Rule]]
[[Institution]] V [[Privilege]]

This transformation from the PDEV format of Figure 1 was done automatically and checked manually. These strings are different from (and generally more numerous than) the patterns evaluated in subtask 2. The goal of this subtask was to generalize over the sentence examples for each verb and create a list of possible semantic strings. This subtask was autonomous with respect to the other subtasks in that participants did not have to return the set of sentences which matched their candidate patterns; patterns were evaluated independently.
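The decomposition of a pattern's semantic alternations into semantic strings is a simple cross product. The following minimal sketch (our own illustration, not the task's conversion script) shows how pattern 1 of abolish yields the six strings listed above:

```python
# Decompose a pattern with subject/object semantic alternations into the
# flat [[X]] V [[Y]] semantic strings used in subtask 3.
from itertools import product

def semantic_strings(subject_types, object_types, verb="V"):
    return ["[[%s]] %s [[%s]]" % (s, verb, o)
            for s, o in product(subject_types, object_types)]

# Pattern 1 of "abolish": 2 subject alternates x 3 object alternates = 6 strings.
for string in semantic_strings(["Human", "Institution"],
                               ["Action", "Rule", "Privilege"]):
    print(string)
```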
4 Task Data

4.1 The Microcheck and Wingspread Datasets

All subtasks (except the first) include two setups and their associated datasets: the number of patterns for each verb is disclosed in the first dataset but not in the second. This setup was created to see whether it would influence the results. The two datasets were also created in the hope that system development would start on the first small and carefully crafted dataset (Microcheck) and only then be tested on a larger and more varied subset of verbs (Wingspread). The datasets, as well as the systems' outputs, will soon be made publicly available on the task website.

4.2 Annotation Process

Both Microcheck and Wingspread start from data extracted from PDEV and the manually pattern-tagged BNC. We took only verbs declared as complete and started by the same lexicographer, so that each verb had been checked twice: once by the lexicographer who compiled the entry and once by the editor-in-chief. Some tagging errors may have slipped in, but the tagging is generally of high quality (Cinková et al., 2012a; Cinková et al., 2012b). Additional checks have been performed on Microcheck, since this was the dataset chosen for subtask 1, for which data had to be created. This section describes the annotation process.

PDEV contains only one kind of link between a given pattern and a given corpus instance: each verb token found in the sample is tagged with a pattern identifier, and the pattern then specifies syntactic roles and their semantic types. The job in subtask 1 annotation consists of tagging the arguments of each token in the sample, both syntactically and semantically (see Table 1 for tagsets of each layer). The syntactic information was the same as for subtask 3, except that category names were shortened and pairs of categories were merged in two places (see task15/index.php?id=appendices).

The annotation was carried out by 4 annotators, with 3 for the training data and 3 for the test data, and 2 annotators annotating both training and test data, one of them being an expert PDEV annotator. Annotators could ask for feedback on the task at any moment, and any doubts were cleared by the expert annotator. Each pair of annotators annotated one share of the dataset, and their annotation was double-checked by the expert annotator. The agreement was not very high in some cases (e.g. Annotator 2, see Table 2), so the double-check by the expert annotator was crucial. Table 2 reports the agreement in terms of F-score and Cohen's Kappa (Cohen, 1960) between each annotator and the expert annotator. (The expert did not start from scratch, but from the other annotators' work. Since his target was the conformity of the tagging with the guidelines as well as with CPA's principles, we maintain that the expert would have produced a very similar output had he not started from the product of the other annotators, who themselves used the output of a system to speed up their work.)

Table 2: Inter-annotator figures, where annotators are compared to the expert (annotator 4) who reviewed all the annotations (Microcheck, Task 1); for the syntactic and semantic layers, the table reports each annotator's dataset, number of observations and categories, Cohen's Kappa and F-score against the expert.
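Agreement figures of the kind reported in Table 2 can be computed with standard tooling; the sketch below (our own illustration with made-up labels, not the task's evaluation code) computes Cohen's Kappa and a micro-averaged F-score between one annotator and the expert on aligned argument tokens, using scikit-learn:

```python
# Toy agreement computation between one annotator and the expert;
# the label sequences are invented for illustration.
from sklearn.metrics import cohen_kappa_score, f1_score

annotator = ["subj", "obj", "obj", "advprep", "subj", "scomp"]
expert    = ["subj", "obj", "iobj", "advprep", "subj", "scomp"]

print("Cohen's Kappa:", cohen_kappa_score(annotator, expert))
print("F-score (micro):", f1_score(expert, annotator, average="micro"))
```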
4.3 Statistics on the Data

Strict rules were implemented to develop a high-quality and consistent dataset:

1. PDEV discriminates exploited uses of a pattern (an exploitation corresponds to an anomalous use of a pattern, as in a figurative use) using a different tag; these were left aside for the CPA task.

2. For the test set, when patterns contained at least one semantic type or grammatical category which was not covered in the training set, they were discarded.

3. Only patterns which contained more than 3 examples were kept in the final dataset.

Applying these filters led to the Microcheck dataset, containing 28 verbs (train: 21; test: 7) and 378 patterns (train: 306; test: 72) with 4,529 annotated sentences (train: 3,249; test: 1,280), and to the Wingspread dataset, containing 93 verbs (train: 73; test: 20) and 856 patterns (train: 652; test: 204) with 12,440 annotated sentences (train: 10,017; test: 2,423). More detailed figures for the test datasets are provided in Tables 3 and 4.

Table 3: Statistics on the Wingspread test dataset, with V standing for verb, P for patterns, I for instances, IMP for instances of the majority pattern, and %MP for the proportion of the majority pattern (test verbs: boo, ascertain, teeter, totter, begrudge, tense, avert, belch, breeze, attain, wing, avoid, brag, adapt, sue, advise, bluff, ask, afflict, bludgeon).

Table 4: Statistics on the Microcheck test dataset; abbreviations as for Table 3 (test verbs: appreciate, apprehend, crush, decline, continue, undertake, operate).

4.4 Metrics

The final score for all subtasks is the average of F-scores over all verbs (Eq. 1). What varies across subtasks is the way Precision and Recall are defined.

F1_{verb} = \frac{2 \cdot \text{Precision}_{verb} \cdot \text{Recall}_{verb}}{\text{Precision}_{verb} + \text{Recall}_{verb}}, \qquad \text{Score}_{Task} = \frac{\sum_{i=1}^{n_{verb}} F1_{verb_i}}{n_{verb}}    (1)

Subtask 1. Equation 2 illustrates that Precision and Recall are computed on all tags, both syntactic and semantic. To count as correct, tags had to be set on the same token as in the gold standard.

\text{Precision} = \frac{\text{Correct tags}}{\text{Retrieved tags}}, \qquad \text{Recall} = \frac{\text{Correct tags}}{\text{Reference tags}}    (2)

Subtask 2. Clustering is known to be difficult to evaluate. Subtask 2 used the B-cubed definition of Precision and Recall, first used for coreference (Bagga and Baldwin, 1999) and later extended to cluster evaluation (Amigó et al., 2009). Both measures are averages of the precision and recall over all instances. To calculate the precision of each instance, we count all correct pairs associated with this instance and divide by the number of actual pairs in the candidate cluster that the instance belongs to. Recall is computed by interchanging the Gold and Candidate clusterings (Eq. 3).

\text{Precision}_i = \frac{\text{Pairs}_i \text{ in Candidate found in Gold}}{\text{Pairs}_i \text{ in Candidate}}, \qquad \text{Recall}_i = \frac{\text{Pairs}_i \text{ in Gold found in Candidate}}{\text{Pairs}_i \text{ in Gold}}    (3)

Subtask 3. This task was evaluated as a slot-filling exercise (Makhoul et al., 1999), so the scores were computed by taking into account the kinds of errors that systems make over the 9 slots: errors of Insertion, Substitution and Deletion. Equation 4 formulates how Precision and Recall are computed.

\text{Precision} = \frac{\text{Correct}}{\text{Correct} + \text{Subst} + \text{Ins}}, \qquad \text{Recall} = \frac{\text{Correct}}{\text{Correct} + \text{Subst} + \text{Del}}    (4)

In order not to penalize systems, the best match was computed for each Candidate pattern, and one Candidate pattern could match more than one Gold pattern. When a given slot was filled both in the Gold data and the Candidate data, this counted as a match; when not, it was a Deletion. If a slot was filled in the run but not in the gold, it was counted as an Insertion. When a match (aligned slots) was also a semantic type match, it was Correct (1 point); when not, it was a Substitution. The CPA ontology was used to allow for partial matches, allowing hypernyms and hyponyms. For that particular task, the maximum number of Candidate patterns was limited to 150% with respect to the number
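To make the B-cubed measures of Equation (3) concrete, here is a minimal sketch (our own code, not the official scorer) that computes them on a toy clustering; the instance ids and labels are invented for the example:

```python
# B-cubed precision/recall (Eq. 3) for subtask 2, illustrated on toy data.
# Clusterings are dicts mapping an instance id to a cluster/pattern label.
from collections import defaultdict

def bcubed_precision(candidate, gold):
    clusters = defaultdict(set)
    for inst, label in candidate.items():
        clusters[label].add(inst)
    per_instance = []
    for inst, label in candidate.items():
        members = clusters[label]
        # Pairs of inst with members of its candidate cluster that share
        # inst's gold pattern (an instance pairs with itself, as in Amigó et al.).
        correct = sum(1 for other in members if gold[other] == gold[inst])
        per_instance.append(correct / len(members))
    return sum(per_instance) / len(per_instance)

def bcubed_recall(candidate, gold):
    # Recall is precision with the Gold and Candidate clusterings swapped.
    return bcubed_precision(gold, candidate)

gold = {1: "p1", 2: "p1", 3: "p2", 4: "p2"}   # gold pattern per instance
cand = {1: "a", 2: "a", 3: "a", 4: "b"}       # a system's clustering
print(bcubed_precision(cand, gold), bcubed_recall(cand, gold))
```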