STARDUST: Sustainable Tools for Analysis and Research on Darknet Unsolicited Traffic
This project aims at maintaining continued operation of the UCSD Network Telescope infrastructure and maximizing its utility to researchers from various disciplines.
Principal Investigators: Alberto Dainotti, Alistair King
Funding source: CNS-1730661. Period of performance: October 1, 2017 - March 21, 2021.
Project Summary
The UCSD Network Telescope (UCSD-NT) is a passive monitoring system that captures unsolicited Internet traffic sent to a large segment of unassigned IPv4 address space. For over a decade, this instrumentation has enabled global visibility into macroscopic Internet phenomena that few other data sources can offer. It has provided data used in a broad set of sub-disciplines in Computer & Information Science & Engineering (CISE) and beyond: from network and systems security and stability, to machine learning and big data processing techniques, and, most recently, to studies of cyberwarfare and political repression of communication. In 2011, we enhanced the Telescope instrumentation to enable access to raw and live telescope traffic data, thus expanding the scope of possible research questions and the circle of researchers using the data. As of January 2017, we were aware of over 100 publications without UCSD co-authors that used UCSD-NT data (a lower bound). Yet the infrastructure was lagging behind increasing demands for storage, computing resources, and system administration. These issues hindered our ability to continue sharing UCSD-NT data with researchers and required compromises that limited the availability of this unique resource.
The STARDUST project will help extend and sustain operation of the UCSD-NT infrastructure. We will upgrade and modernize the current infrastructure to handle the predicted growth in traffic, leverage virtualization and NSF-funded HPC platforms at the San Diego Supercomputer Center for computational data analysis, and introduce meta-data semantics to simplify many tasks researchers typically want to do with UCSD-NT data. The proposed modifications will leave researchers more time (and available HPC resources) to focus on their specific scientific questions. Moreover, the project will forge an interdisciplinary collaboration between researchers from the field of computer networks and HPC scientists and engineers to experiment with novel approaches for research on live traffic analysis.
The stabilized and enhanced infrastructure capabilities will better serve a diverse range of academic researchers, the vast majority of whom have no access to any other source of global Internet traffic data. The proposed enhancements will support invaluable hands-on experience in operationally relevant network security and traffic analysis research, engaging a wide audience of computer science faculty and students in the use of our tools and data. Project results will contribute to advancing knowledge in diverse CISE disciplines, e.g., by facilitating the development of efficient strategies for early detection and mitigation of cyber attacks, supporting macroscopic Internet performance and reliability assessments, and opening a new domain for the application of live streaming big data analysis and in situ machine learning techniques.
Project Milestones
- Task 1: Upgrade and modernize the UCSD-NT infrastructure (Years 1 and 2);
- Task 2: Transition the data analysis infrastructure to use NSF HPC resources (Years 1, 2, and 3);
  - Task 2.1: Deploy cloud-compute support using novel virtualization features on the Comet supercomputer (Years 1 and 2);
  - Task 2.2: Develop and deploy live packet capture and distribution software (Years 1 and 2);
  - Task 2.3: Support dynamically provisioned specialized Big Data environments (Years 1 and 2);
- Task 3: Reduce processing complexity and simplify data analysis (Years 1, 2, and 3);
- Task 4: Communal activities (Years 1, 2, and 3).
Project Timeline
Subtask | Description | Date | Status |
---|---|---|---|
4.1 | Open project web site | Oct 2017 | done |
4.2 | Start a mailing list of STARDUST users | Sep 2019 | done |
4.3 | Create internal project wiki | Nov 2017 | done |
1.1 | Purchase and deploy a high performance 10 Gbps capture card with accurate time stamping | Dec 2017 | done |
1.2 | Upgrade connected device interfaces (NP-router, storage server) to 10 Gbps | Dec 2017 | done |
3.1 | Extend Corsaro and related libraries to tag FlowTuple information with meta-data (geolocation, origin AS, spoofed source) | Jun 2018 | done |
4.4 | Organize and host the first DUST Workshop | Sep 2019 | done |
2.1.1 | Provision and deploy a virtualized cloud environment | Mar 2020 | done |
2.2.1 | Customize and extend the WDcap packet capture software to forward traffic over a 10 Gbps management network interface to a CAIDA server | Sep 2018 | done |
1.3 | Purchase and deploy an additional storage server and attached disk array (~200 TB capacity) | Dec 2018 | done |
2.2.2 | Customize and extend the libtrace "RT" format for encapsulation and distribution of captured traffic | Dec 2018 | done |
2.3.1 | Develop and deploy an interface to request resources for processing historical telescope data | Jul 2020 | done |
2.1.2 | Develop and document a pre-configured OS image tailored for telescope data analysis | May 2020 | done |
2.3.2 | Develop helper routines/APIs for Spark and Hadoop to retrieve historical data directly from the archive during processing | Jun 2020 | done |
3.2 | Deploy meta-data tagging system on cloud-compute environment | Jun 2019 | done |
2.1.3 | Develop and deploy management interfaces to export a snapshot of a researcher's VM for archiving | Sep 2019 | done |
2.1.4 | Implement resource accounting on a per-user basis, analyze its applicability | Aug 2020 | done |
2.3.3 | Develop sample analysis scripts and documentation for implementing longitudinal analyses | Jul 2020 | done |
3.3 | Deploy several "operational" analysis VMs to process the telescope traffic and derive multi-level aggregated datasets | Sep 2019 | done |
4.5 | Write and publish AUP for access to the new virtual environments | Aug 2020 | done |
4.6 | Announce new UCSD-NT capabilities online | Jul 2020 | done |
3.4 | Explore efficient indexing of FlowTuple records to enable analysis of traffic with specific meta-data characteristics | Dec 2019 | done |
3.5 | Extend the Corsaro3 FlowTuple plugin to support publication of flows to a Kafka cluster | Mar 2020 | done |
4.7 | Organize and host the second DUST Workshop | Mar 2021 | done |
3.6 | Create streams containing specified subsets of the overall traffic (e.g., one stream per country) | Jun 2020 | done |
4.10 | Refine the upgraded UCSD-NT platform based on users' feedback | Sep 2020 | done |
Additional Content
STARDUST: Sustainable Tools for Analysis and Research on Darknet Unsolicited Traffic: Proposal
An abbreviated version of the original proposal.