Hadoop-MTA: A System for Multi Data-Center Trillion Concepts Auto-ML Atop Hadoop
The ever-growing computation capability dis- tributed infrastructure brings tremendous opportunities for mining and analysis of data that was impossible otherwise. Meanwhile, the inherent computation model of distributed system also brings unique and non-trivial challenges for traditional Auto- ML, including the explosion of data dimensions, the expected absence of features, and the heterogeneity of information. This is especially the case in modern Internet enterprises, where data in the scale of trillions are stored in multiple data centers, and the discovery of subtle signals could incur significant impact in revenue and welfare. How can we best harness the large scale distributed machine learning, but without keeping engineers constantly in the loop? In this work, we present Hadoop-MTA, a system for Multi Data-center, Trillion Concepts, Auto-ML on top of the Hadoop distributed computation environment that leverages sparsity aware heterogeneous knowledge graph representation and dimensionality agnostic parallel learning. Through multiple large scale experiments, we find that Hadoop- MTA significantly output-performs competitive state of the art distributed learning algorithms and scales well to trillion scale data-sets. Our model is rolled out to Hadoop serving infrastruc- ture in Yahoo covering billions of unique identities and shows improvements 129.5% accuracy and 106.5 % weighted F1-score (more than 2x) on key targeting use cases.