Unsupervised Neologism Normalization Using Embedding Space Mapping

January 1, 2019

Abstract

This paper presents an approach for detecting and normalizing neologisms in social media content. Neologisms are recent expressions that are specific to certain entities or events and are increasingly used by the public, but have not yet been accepted into mainstream language. Automated methods for handling neologisms are important for natural language understanding and normalization, especially for informal genres with user-generated content. We present an unsupervised approach for detecting neologisms and then normalizing them to canonical words without relying on parallel training data. Our approach builds on the text normalization literature and introduces adaptations to fit the specifics of this task, including phonetic and etymological considerations. We evaluate the proposed techniques on a dataset of Reddit comments, with detected neologisms and corresponding normalizations.

Publication Type

Paper

Conference / Journal Name

EMNLP 2019 Workshop on Noisy User Text

Authors

Nasser Zalmout

Aasish Pappu

Kapil Thadani

BibTeX


@inproceedings{zalmout2019unsupervised,
    author = {Nasser Zalmout and Aasish Pappu and Kapil Thadani},
    title = {Unsupervised Neologism Normalization Using Embedding Space Mapping},
    booktitle = {Proceedings of EMNLP 2019 Workshop on Noisy User Text},
    year = {2019}
}