Content-Based Email Classification at Scale
Understanding the content of email messages can enable new features that highlight what matters to users, making email a more useful tool for people to manage their lives. We present work from a consumer email platform to build multilabel models to classify messages according to a mail-specific, content-based taxonomy that represents the topic, type, and objective of an email. While state-of-the-art Transformer-based language models can achieve impressive results for text classification, these models are too costly to deploy at the scale of email. Using a knowledge distillation framework, we first build a complex, accurate teacher model from limited human-labeled training data and then use a large amount of teacher-labeled data to train lightweight student models that are suitable for deployment. The student models retain up to 91% of the predictive performance of the teacher model while reducing inference cost by three orders of magnitude. Deployed to production in Yahoo Mail, these models classify billions of emails every day and power features that help people tackle their inboxes.
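The teacher-student flow described above can be illustrated with a minimal sketch. It is not the models or data used in this work: the "teacher" here is a small random nonlinear network standing in for an expensive Transformer, the "student" is a linear multilabel classifier, and all names (`teacher_predict`, `train_student`, `student_predict`) and sizes are assumptions chosen for illustration. The point is only the shape of the pipeline: the teacher assigns per-label probabilities to a large unlabeled pool, and the cheap student is fit to those soft labels with a binary cross-entropy objective, one independent sigmoid output per label.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def teacher_predict(X, params):
    """Hypothetical stand-in for a costly teacher: per-label probabilities."""
    W_hidden, W_out = params
    return sigmoid(np.tanh(X @ W_hidden) @ W_out)

def train_student(X, soft_labels, epochs=200, lr=0.5):
    """Fit a linear multilabel student to teacher probabilities.

    Uses full-batch gradient descent on binary cross-entropy, where the
    targets are the teacher's soft labels rather than human labels.
    """
    n, d = X.shape
    k = soft_labels.shape[1]
    W = np.zeros((d, k))
    b = np.zeros(k)
    for _ in range(epochs):
        probs = sigmoid(X @ W + b)
        grad_logits = probs - soft_labels  # d(BCE)/d(logits)
        W -= lr * (X.T @ grad_logits) / n
        b -= lr * grad_logits.mean(axis=0)
    return W, b

def student_predict(X, W, b):
    return sigmoid(X @ W + b)

# Toy setup (all sizes are illustrative assumptions).
rng = np.random.default_rng(0)
n_features, n_hidden, n_labels = 20, 16, 3
teacher_params = (
    rng.normal(size=(n_features, n_hidden)),
    rng.normal(size=(n_hidden, n_labels)),
)

# Step 1: the teacher labels a large pool of unlabeled examples.
X_unlabeled = rng.normal(size=(5000, n_features))
soft_labels = teacher_predict(X_unlabeled, teacher_params)

# Step 2: the lightweight student is trained on the teacher-labeled data.
W_student, b_student = train_student(X_unlabeled, soft_labels)

# Step 3: measure how often student and teacher agree on held-out inputs.
X_test = rng.normal(size=(1000, n_features))
teacher_bits = teacher_predict(X_test, teacher_params) > 0.5
student_bits = student_predict(X_test, W_student, b_student) > 0.5
agreement = float(np.mean(teacher_bits == student_bits))
print(f"label-level agreement with teacher: {agreement:.3f}")
```

In this toy setting the student cannot match the teacher exactly because it is strictly less expressive, which mirrors the trade-off in the abstract: some predictive performance is given up in exchange for a much cheaper model at inference time.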