Communications and Signal Processing Seminar
Compressing the Internet Using Deep Learning: Numerical Characterization of the Internet Host Population
Global scans of the IPv4 address space have provided researchers with large amounts of raw information about discoverable hosts on the public Internet. However, a growing number of connected machines, such as Internet of Things (IoT) devices, underlines the need for scalable methods to enable real-time decision making. In this study, we explore techniques for extracting universal numerical representations of raw records obtained from global scans of the public Internet across many ports/protocols. We first propose a semi-automated technique for learning the schema of tree-like (e.g., JSON) documents that are typically used to store parsed information obtained from network scans, and generate high-dimensional binary feature vectors capturing data from all observed fields in inspected documents. We then apply an unsupervised deep generative model, namely a variational autoencoder, to obtain lightweight latent embeddings of our binary feature vectors. We show that by capturing data from many protocols, such as services offered by a web server, we can construct universal embeddings of Internet hosts that, when fed to machine learning models, are applicable to a wide range of learning tasks. We evaluate the proposed framework for detecting/forecasting maliciousness of hosts and inferring masked attributes of hosts, and further discuss the interpretability of trained models. The reduced computational and memory requirements for processing these numerical embeddings, and the ability to reuse available measurements across many applications, allow real-time monitoring of large collections of hosts without the need to design and collect custom network probes for each specific task.
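As a rough illustration of the first step in the abstract, the sketch below learns a flat schema from a few tree-like records and turns each record into a binary feature vector. It is a simplified stand-in, not the speaker's method: it encodes only whether a field is present, whereas the talk describes features that capture the data in each field, and it leaves out the variational autoencoder stage entirely. The function names (`flatten_paths`, `learn_schema`, `to_binary_vector`) and the sample records are hypothetical.

```python
def flatten_paths(doc, prefix=""):
    """Yield a dotted field path for every leaf value in a nested JSON-like object."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from flatten_paths(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(doc, list):
        for item in doc:
            # List indices are dropped so repeated items share one path.
            yield from flatten_paths(item, prefix)
    else:
        yield prefix


def learn_schema(docs):
    """Return the sorted list of all field paths observed across the documents."""
    paths = set()
    for doc in docs:
        paths.update(flatten_paths(doc))
    return sorted(paths)


def to_binary_vector(doc, schema):
    """Encode a document as 0/1 flags: 1 if the schema field is present, else 0."""
    present = set(flatten_paths(doc))
    return [1 if path in present else 0 for path in schema]


# Hypothetical scan records (addresses from the documentation range 192.0.2.0/24).
records = [
    {"ip": "192.0.2.1", "p80": {"http": {"server": "nginx"}}},
    {
        "ip": "192.0.2.2",
        "p22": {"ssh": {"banner": "OpenSSH"}},
        "p80": {"http": {"server": "Apache"}},
    },
]

schema = learn_schema(records)
vectors = [to_binary_vector(record, schema) for record in records]

print(schema)
print(vectors)
```

In a fuller pipeline, vectors like these would be the input to the generative model that produces the low-dimensional embeddings.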
Dr. Armin Sarabi received his MS and PhD degrees in EE:Systems from the University of Michigan in 2013 and 2017, respectively. His research focuses on applications of machine learning and quantitative analysis to inspecting the security of real-world systems. He was a data scientist at Quadmetrics and Fair Isaac Corporation, where he worked on forecasting the cyber-security risk of organizations using network measurements and demographic data. He is currently a postdoctoral researcher in the Department of Electrical Engineering and Computer Science at the University of Michigan.