SensEmBERT

Context-Enhanced Sense Embeddings for Multilingual Word Sense Disambiguation

SensEmBERT is a knowledge-based approach that brings together the expressive power of language modelling and the vast amount of knowledge contained in a semantic network to produce high-quality latent semantic representations of word meanings in multiple languages. Our vectors lie in a space comparable with that of BERT contextualized word embeddings, thus allowing a word occurrence to be easily linked to its meaning by applying a simple nearest neighbour approach. We release vectors for all WordNet nominal senses in 5 languages (English, Italian, Spanish, French and German).
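
The nearest-neighbour linking step can be illustrated with a short, self-contained sketch. This is not the released code: the sense keys, vectors and the disambiguate helper below are hypothetical toy values chosen only to show how an occurrence embedding would be matched against sense embeddings by cosine similarity; real vectors would come from the released files and from BERT, and must share the same dimensionality.

import numpy as np

def disambiguate(word_embedding: np.ndarray, candidate_senses: dict) -> str:
    """Return the candidate sense whose vector is closest (cosine) to the word embedding."""
    best_sense, best_score = None, -np.inf
    for sense_id, sense_vec in candidate_senses.items():
        score = np.dot(word_embedding, sense_vec) / (
            np.linalg.norm(word_embedding) * np.linalg.norm(sense_vec)
        )
        if score > best_score:
            best_sense, best_score = sense_id, score
    return best_sense

# Toy example: two made-up vectors for two senses of "bank" and a made-up
# contextual embedding of one occurrence of "bank".
senses = {
    "bank%1:14:00::": np.array([0.9, 0.1, 0.0]),   # financial institution
    "bank%1:17:01::": np.array([0.1, 0.8, 0.3]),   # sloping land by a river
}
occurrence = np.array([0.85, 0.2, 0.05])
print(disambiguate(occurrence, senses))             # -> bank%1:14:00::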

SensEmBERT was supported by the ERC Consolidator Grant MOUSSE No. 726487 under the European Union’s Horizon 2020 research and innovation programme.


Reference Paper

Abstract

Contextual representations of words derived by neural language models have proven to effectively encode the subtle distinctions that might occur between different meanings of the same word. However, these representations are not tied to a semantic network, hence they leave the word meanings implicit and thereby neglect the information that can be derived from the knowledge base itself. In this paper, we propose SensEmBERT, a knowledge-based approach that brings together the expressive power of language modelling and the vast amount of knowledge contained in a semantic network to produce high-quality latent semantic representations of word meanings in multiple languages. Our vectors lie in a space comparable with that of contextualized word embeddings, thus allowing a word occurrence to be easily linked to its meaning by applying a simple nearest neighbour approach. We show that, whilst not relying on manual semantic annotations, SensEmBERT is able to either achieve or surpass state-of-the-art results attained by most of the supervised neural approaches on the English Word Sense Disambiguation task. When scaling to other languages, our representations prove to be equally effective as their English counterpart and outperform the existing state of the art on all the Word Sense Disambiguation multilingual datasets.

Reference

Bianca Scarlini, Tommaso Pasini and Roberto Navigli
SensEmBERT: Context-Enhanced Sense Embeddings for Multilingual Word Sense Disambiguation
Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence
@inproceedings{scarlinietal:2020,
  title={{SensEmBERT: Context-Enhanced Sense Embeddings for Multilingual Word Sense Disambiguation}},
  author={Scarlini, Bianca and Pasini, Tommaso and Navigli, Roberto},
  booktitle={Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence},
  publisher={Association for the Advancement of Artificial Intelligence},
  year={2020}
}

Download

The package contains 6 files named "sensembert_<LANG-CODE>_<METHOD>.txt", where <LANG-CODE> identifies the language and <METHOD> is either kb (knowledge-based) or supervised, depending on the version of the embeddings.
Each file is in the word2vec text format:

  • the first line contains two space-separated values: the size (dimensionality) of the vectors and the number of vectors in the file;
  • each subsequent line represents a single vector, with elements separated by spaces: the first element is the sense id and the remaining elements are the vector components (a minimal parsing sketch follows below).
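
The following sketch shows one way to load such a file into a dictionary mapping sense ids to NumPy vectors. The file name "sensembert_EN_kb.txt" and the helper load_sense_embeddings are assumptions for illustration, not part of the released package.

import numpy as np

def load_sense_embeddings(path: str) -> dict:
    """Parse a word2vec-style text file into a {sense_id: vector} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        f.readline()                              # header line: the two values described above
        for line in f:
            parts = line.rstrip().split(" ")
            sense_id, features = parts[0], parts[1:]
            embeddings[sense_id] = np.asarray(features, dtype=np.float32)
    return embeddings

if __name__ == "__main__":
    vectors = load_sense_embeddings("sensembert_EN_kb.txt")
    print(len(vectors), "sense vectors loaded")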



SensEmBERT is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License.


Contacts

Bianca Scarlini

PhD Student @ Sapienza

scarlini[at]di.uniroma1.it

Tommaso Pasini

Postdoc @ Sapienza

pasini[at]di.uniroma1.it

Roberto Navigli

Full Professor @ Sapienza

navigli[at]di.uniroma1.it