TNT: Text Normalization Based Pre-Training of Transformers for Content Moderation

September 15, 2020
Abstract

In this paper, we present TNT (Text Normalization based pre-training of Transformers), a new language pre-training model for content moderation. Inspired by the masking strategy and text normalization, TNT learns language representations by training transformers to reconstruct text corrupted by four action types typically seen in text manipulation: substitution, swap, deletion, and insertion. Furthermore, the normalization task involves predicting both action types and token labels, enabling TNT to learn from more challenging tasks than standard masked-word prediction. Experiments demonstrate that TNT outperforms strong baselines on the hate speech classification task, and additional text normalization experiments and case studies show that TNT performs well at misspelling correction.
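
Since the abstract describes the pre-training objective only at a high level, the following is a minimal sketch of how such corruption-based training data might be generated. All names (corrupt, ACTION_TYPES), the "keep" label, and the alignment choice for deletions are illustrative assumptions, not the paper's implementation.

import random

# A minimal sketch (not the paper's released code) of generating
# TNT-style normalization training data: each position in the corrupted
# sequence gets an action label and, where applicable, the original
# token to reconstruct.

ACTION_TYPES = ["substitute", "swap", "delete", "insert"]

def corrupt(tokens, vocab, p=0.15, rng=random):
    corrupted = []  # noised sequence the transformer sees
    actions = []    # per-position action label to predict
    targets = []    # per-position original token to reconstruct (or None)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if rng.random() >= p:
            corrupted.append(tok)
            actions.append("keep")
            targets.append(tok)
            i += 1
            continue
        op = rng.choice(ACTION_TYPES)
        if op == "substitute":
            # Replace the token with a random vocabulary word.
            corrupted.append(rng.choice(vocab))
            actions.append("substitute")
            targets.append(tok)
            i += 1
        elif op == "swap" and i + 1 < len(tokens):
            # Exchange the token with its right neighbour.
            corrupted.extend([tokens[i + 1], tok])
            actions.extend(["swap", "swap"])
            targets.extend([tok, tokens[i + 1]])
            i += 2
        elif op == "delete" and i + 1 < len(tokens):
            # Drop the token; tag the following position so the model
            # can learn that something was removed before it.
            corrupted.append(tokens[i + 1])
            actions.append("delete")
            targets.append(tok)
            i += 2
        else:
            # Insert a spurious token the model should learn to remove.
            corrupted.append(rng.choice(vocab))
            actions.append("insert")
            targets.append(None)
            corrupted.append(tok)
            actions.append("keep")
            targets.append(tok)
            i += 1
    return corrupted, actions, targets

A transformer with a per-token classification head could then be trained on the (corrupted, actions, targets) triples, jointly predicting the action type and the normalized token at each position.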

Publication Type: Paper
Conference / Journal Name: EMNLP 2020

BibTeX


@inproceedings{tnt2020,
    author = {},
    title = {TNT: Text Normalization Based Pre-Training of Transformers for Content Moderation},
    booktitle = {Proceedings of EMNLP 2020},
    year = {2020}
}