Summary: Train-O-Matic: Supervised Word Sense Disambiguation with no (manual) effort
Authors: Tommaso Pasini, Roberto Navigli, in Artificial Intelligence
Volume 279, February 2020
The summary below was automatically created by Flexudy.
Word Sense Disambiguation (WSD) is the task of associating the correct meaning with a word in a given context. WSD provides explicit semantic information that is beneficial to several downstream applications, such as question answering, semantic parsing and hypernym extraction. Unfortunately, WSD suffers from the well-known knowledge acquisition bottleneck problem: it is very expensive, in terms of both time and money, to acquire semantic annotations for a large number of sentences. To address this blocking issue we present Train-O-Matic, a knowledge-based and language-independent approach that is able to provide millions of training instances annotated automatically with word meanings. The approach is fully automatic, i.e., no human intervention is required, and the only type of human knowledge used is a task-independent WordNet-like resource. Moreover, as the sense distribution in the training set is pivotal to boosting the performance of WSD systems, we also present two unsupervised and language-independent methods that automatically induce a sense distribution when given a simple corpus of sentences. We show that, when the learned distributions are taken into account for generating the training sets, the performance of supervised methods is further enhanced. Experiments have proven that Train-O-Matic on its own, and also coupled with word sense distribution learning methods, leads a supervised system to achieve state-of-the-art performance consistently across gold standard datasets and languages. Importantly, we show how our sense distribution learning techniques aid Train-O-Matic in scaling well over domains, without any extra human effort.

Introduction

Word Sense Disambiguation is the task of associating a meaning, selected from a fixed set of concepts, with a word in a given context.
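The task definition above can be illustrated with a minimal sketch. The sense inventory, glosses and overlap heuristic below are invented simplifications (a Lesk-style overlap), not the method proposed in the paper:

```python
# Minimal, hypothetical illustration of the WSD task: choose, from a fixed
# sense inventory, the sense whose gloss overlaps most with the context
# (a simplified Lesk-style heuristic; inventory and glosses are invented).

SENSES = {
    "plane#1": "an aircraft that flies through the air",
    "plane#2": "a carpenter's tool for smoothing wood surfaces",
}

def disambiguate(context: str) -> str:
    """Return the sense whose gloss shares the most words with the context."""
    ctx = set(context.lower().split())
    return max(SENSES, key=lambda s: len(ctx & set(SENSES[s].split())))

print(disambiguate("the flight was delayed due to trouble with the plane"))
# picks plane#1: its gloss shares more words with the aviation context
```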
It is a key task in both computational lexical semantics and artificial intelligence more broadly, inasmuch as it addresses the lexical ambiguity of text by making the meaning of words occurring in a given context explicit. Indeed, removing the ambiguity of words in a text can be beneficial to a variety of different NLP applications, such as: machine translation, where the meaning information for the target word can be helpful in choosing the correct translation; hypernym extraction, which requires a disambiguation step in order to pick the right hypernym of a word that might otherwise have different hypernyms depending on the context; and semantic search, which, in contrast to lexical search, takes advantage of synonymy and polysemy to provide more accurate results. The NLP community conducts two main lines of research in WSD: knowledge-based and supervised WSD.

We present two different ways of computing the probability of a sense s appearing in a given sentence σ containing the target word w. We then compute the probability distribution over the senses of a target word in a sentence by assigning each sense a score defined by the cosine similarity between the sentence vector and the sense vectors, and renormalising the scores to form a distribution. It has been estimated that providing a number of examples for a sense proportional to its frequency in the test set would dramatically increase performance by more than 10 points. Thus, while a great deal of effort in recent years has been devoted to developing deep neural networks to improve the state of the art on the SemEval benchmarks, only a few works have focused on injecting different types of knowledge, i.e., the word sense distribution, into the training data in order to reflect the sense distribution of the target text. Even if it is not always possible to know the sense distribution of the target word, it can still be estimated in various ways, e.g., as done by Bennett et al.
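The sentence scoring step described above (cosine similarity plus renormalisation) can be sketched as follows. This is a minimal illustration under assumed inputs: the sentence and sense vectors are taken as given, whereas in the paper they come from PPR or pretrained embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sense_distribution(sentence_vec, sense_vecs):
    """Score each candidate sense of the target word by its cosine
    similarity to the sentence vector, then renormalise the scores
    into a probability distribution over the senses."""
    scores = {s: max(cosine(sentence_vec, v), 0.0) for s, v in sense_vecs.items()}
    total = sum(scores.values())
    if total == 0.0:  # no signal at all: fall back to a uniform distribution
        return {s: 1.0 / len(scores) for s in scores}
    return {s: sc / total for s, sc in scores.items()}

# Toy 2-d vectors for the two senses of "plane" and one sentence.
dist = sense_distribution([1.0, 0.2], {"plane#1": [0.9, 0.1], "plane#2": [0.1, 0.9]})
```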
Given our interest in scaling over multiple languages, we now present two methods that we previously introduced: EnDi (Entropy-based Distribution learning) and DaD (Domain-aware Distribution learning), two methods that learn the word sense distribution given an input corpus of raw sentences.

[Figure: a sense distribution computation example for the word plane, over sentences such as "Two people on the plane died.", "The flight was delayed due to trouble with the plane.", "Only one plane landed successfully." and "The cabinetmaker used a plane for the finish work.", with estimated sense probabilities 0.73, 0.10 and 0.17.]

To obtain the sentence-level sense distribution γ_σ,w, both methods apply the Sentence Scoring phase of Train-O-Matic (which in what follows we call Train-O-Matic-1) to all the sentences in the input raw corpus. Note that at this stage we have a different sense distribution for each target word and for each sentence it appears in, computed as explained in Section 2.2.2; now, instead, our aim is to build a single sense distribution for each distinct lexeme (i.e., lemma and part of speech). To build such distributions, EnDi applies an entropy-based filter to the sentence-level sense distributions in order to discard all those that have a uniform shape, and then averages the remaining ones. DaD, on the other hand, exploits a graph-based algorithm and a mapping between knowledge-graph nodes and domain labels: it first estimates the distribution of the domains in the input corpus and then propagates their probabilities over the graph's synsets.
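EnDi's aggregation step (filter near-uniform sentence-level distributions, then average) can be sketched as follows; the entropy threshold and the exact filtering rule are assumptions for illustration, not the paper's precise formulation:

```python
import math

def entropy(dist):
    """Shannon entropy of a {sense: probability} distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def endi_lexeme_distribution(sentence_dists, max_entropy_ratio=0.9):
    """Aggregate sentence-level sense distributions for one lexeme:
    discard distributions whose entropy is close to the maximum
    (i.e., near-uniform, hence uninformative), average the rest and
    renormalise. The 0.9 threshold is an assumed parameter."""
    kept = []
    for dist in sentence_dists:
        max_h = math.log(len(dist))  # entropy of the uniform distribution
        if max_h == 0.0 or entropy(dist) / max_h < max_entropy_ratio:
            kept.append(dist)
    if not kept:  # everything looked uniform: keep all rather than none
        kept = sentence_dists
    senses = {s for d in kept for s in d}
    avg = {s: sum(d.get(s, 0.0) for d in kept) / len(kept) for s in senses}
    total = sum(avg.values())
    return {s: p / total for s, p in avg.items()}
```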
The first method takes as input a lexicon L containing all the words for which we need to compute the sense distributions, together with the output of Train-O-Matic-1, i.e., the sets of sentence-level sense distributions. We take advantage of these methods (EnDi and DaD) for two main purposes: 1) to obtain an estimate of the most common sense baseline that also reflects the potential domain bias of the target text, and 2) to build the training set by leveraging the sense order in the distribution, i.e., assigning to each word sense a number of examples proportional to its position in the estimated word sense distribution. These results confirm our intuition that excluding syntagmatic relations (such as those contained in Wikipedia and thus in BabelNet) limits the performance of Train-O-Matic. In Table 4 we show the results of IMS trained on Train-O-Matic when using Wikipedia as the underlying corpus instead of the United Nations corpus. We now move to comparing the performance of Train-O-Matic, which is fully automatic, against corpora which are annotated manually (SemCor) and semi-automatically (OMSTI).

[Example sentence: "the main runway can be used by planes with up to around 180 passengers [. . . ]"]

[Table 7: number of unique and total tokens comprised in each dataset for which at least one training example is provided by OMSTI, EuroSense, SenseDefs and Train-O-Matic for each English dataset.]

Performance is reported on each dataset in the evaluation framework and on all datasets merged together (last column), when IMS is trained with the various corpora described above. As can be seen, Train-O-Matic_Wiki obtains higher performance than OMSTI (up to 5.5 points above) on 3 out of 5 datasets, scoring 1 point above OMSTI overall. The MFS is in the same ballpark as T-O-M_Wiki, performing better on some datasets and worse on others. We note that IMS trained on T-O-M_Wiki succeeds in surpassing or matching the results of IMS trained on SemCor on SemEval-15 and SemEval-13.
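The rank-proportional allocation in purpose 2) can be sketched as follows; the Zipf-like decay and the exponent z are assumptions for illustration, not the paper's exact allocation scheme:

```python
def allocate_examples(ranked_senses, total_examples, z=2.0):
    """Assign each sense a share of the training examples that decays
    with its rank in the estimated sense distribution, so the first
    (most probable) sense receives the most examples. The Zipf-like
    weights and the exponent z are hypothetical choices."""
    weights = [1.0 / (rank ** z) for rank in range(1, len(ranked_senses) + 1)]
    total_w = sum(weights)
    return {sense: round(total_examples * w / total_w)
            for sense, w in zip(ranked_senses, weights)}

# 100 examples split over three senses ordered by estimated probability.
print(allocate_examples(["plane#1", "plane#2", "plane#3"], 100))
```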
We view this as a significant achievement given that no manual annotations were involved in the creation of our corpus. Because overall T-O-M_Wiki outperforms T-O-M_UN, in what follows we report all the results with T-O-M_Wiki, except for the domain-oriented evaluation (see Section 6.7). We now compare the vanilla Train-O-Matic against its neural variant, in which pretrained embeddings (WE) are used instead of PPR to perform the sentence scoring step (see Section 2.2.2). T-O-M_WE is semi-supervised, as it exploits sense annotations in order to build a vector for each synset. Thus, it is interesting to see that the vanilla version is still able to outperform its alternative on SemEval-13 and SemEval-15 by 1.6 and 3 points, while scoring only 0.6 points lower on the concatenation of all datasets (ALL). We also note that T-O-M_WE beats OMSTI (which is also semi-supervised) on all the datasets but Senseval-2, and gets very close to SemCor, especially on Senseval-3 and SemEval-07, where the vanilla version of T-O-M, instead, performs worse. Together with all these baselines, we also report the performance achieved by IMS when we use the WordNet sense distribution and the two distributions predicted by EnDi and DaD. Finally, as the word sense distribution has proven to play an important role in classifier performance, we investigated it further: Table 10 reports the evaluation of IMS with the Oracle distribution when random senses are assigned to each word occurrence. We now compare the results attained by IMS trained on Train-O-Matic_Wiki when the standard WordNet MFS and sense distribution were used (see Tables 11 and 12) against those in Tables 13 and 14, where IMS is trained on three different datasets. In Table 17 we show results in all the languages of SemEval-13 and SemEval-15 achieved by IMS trained on vanilla Train-O-Matic when our MFS is used as the backoff strategy.
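The backoff strategy mentioned last can be sketched as follows, assuming a hypothetical classifier interface and a learned lexeme-level distribution (e.g., from EnDi or DaD):

```python
def disambiguate_with_backoff(word, context, classifier, sense_distributions):
    """Tag `word` in `context` with the supervised classifier; when the
    classifier has no answer (e.g., no training examples for the word),
    back off to the most frequent sense of the learned distribution.
    `classifier` and `sense_distributions` are hypothetical interfaces:
    a callable returning a sense or None, and a lexeme -> {sense: prob} map."""
    sense = classifier(word, context)
    if sense is not None:
        return sense
    dist = sense_distributions.get(word)
    if not dist:
        return None  # no information at all for this word
    return max(dist, key=dist.get)  # estimated most frequent sense
```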
The main difference from our approach lies in its need for a manually annotated dataset to start the label propagation algorithm. A comparison with this system would have been interesting, but neither the proprietary training data nor the code were available at the time of writing. Other interesting approaches which also aim at augmenting the information available in the network include the one introduced by Luo et al. In order to generalize effectively, these supervised systems require large numbers of training instances annotated with senses for each target word occurrence. Approaches include variants of Personalised PageRank and densest-subgraph approximation algorithms which, thanks to the availability of multilingual resources such as BabelNet, can easily be extended to perform WSD in arbitrary languages. We used a topic-modelling-based approach for word sense distribution learning to induce the distribution of word senses, while elsewhere we introduced two unsupervised approaches for learning word sense distributions. Our experiments proved that it is possible to boost multilingual WSD performance by exploiting just the learned sense distribution and, moreover, that it is possible to automatically adapt the training set to any domain-specific document without any human intervention. As future work we aim to use new resources such as SyntagNet and VerbAtlas to integrate the additional knowledge coming from word collocations and verb frames into Train-O-Matic, hence refining the quality of the generated training set and adding new sentences for words and senses that are still not covered, i.e., those with other POS tags such as verbs, adjectives and adverbs.