Collecting representative Internet measurement data has remained a challenging and often elusive goal for the networking community. Obstacles include the Internet's scale and scope; technical challenges in capturing, filtering, and sampling high data rates; difficulty obtaining measurements across a decentralized network with radically distributed ownership; the cost of building and operating instrumentation; and political hurdles. Despite (or perhaps because of) these obstacles, the demand for and importance of representative Internet data sets is increasing, which is good news for rigorous scientific Internet research. The primary driver of this demand is the now pervasive acknowledgement that we are unable to keep up with cybersecurity threats to various critical and increasingly interdependent infrastructures, and that a primary limiting factor in the escalating arms race is our still surprisingly primitive approach to sharing cyberinfrastructure data.
CAIDA has developed an Internet Measurement Data Catalog (IMDC), an index of information (metadata) about data sets and their availability under various usage policies. This catalog confronted a significant challenge in network science: reducing the cost of searching for data by organizing metadata about accessible Internet data sets into a single repository. We developed the underlying DatCat architecture and prototype software implementation to support the IMDC.
We propose to apply the lessons we have learned during our research, development, and operational experience with the IMDC to expand the underlying software capabilities in support of the cybersecurity research and cyberinfrastructure development communities. Our three primary deployment goals are to: (1) reduce the burden on those contributing data via a streamlined interface and tools for easier indexing, annotation, and navigation of relevant data; (2) convert from a proprietary database backend (Oracle) to a completely open source solution; and (3) expand DatCat's relevance to the cybersecurity and other research communities. This last goal includes outreach activities such as workshops and demonstrations at security-related PI meetings; creating and indexing new data sets (ccTLD DNS zone files), which the cybersecurity community has declared critically lacking; and creating public web forums for discussion of specific and broader data-sharing issues.
The proposed software development activities will support a range of measurable benefits to cyberinfrastructure research: maximizing the re-use of existing Internet data; decreasing the time spent collecting redundant data; reducing the effort needed to start a new study; promoting validation and reproducibility of analyses and results; enabling longitudinal and cross-disciplinary studies of the Internet; and opening up new cross-domain areas of transformative networking research.
The success of the catalog and related workshops will facilitate wide dissemination of Internet measurement data to researchers and security experts across academic, commercial, and government sectors. By including education-oriented data collections in the catalog, this project promises to link research and education, and improve access to Internet research for underrepresented groups in computer science and engineering.
The schedule of work below shows how we plan to accomplish the proposed tasks over the two years of the project.
|Subtask|Description|Schedule|Status|
|Task 1: Expanding DatCat Capabilities to Streamline the User Experience|
|1.1|Implement standalone Collection and Publication objects|Year 1|done|
|1.2|Implement web-based submission interface|Year 1|done|
|Task 2: Migrating the Backend Database to Open Source Software|
|2.1|Replicate the IMDC schema in an open database platform|Year 1 (1st and 2nd quarter)|done|
|2.2|Migrate the data to the newly created database/schema|Year 1 (1st and 2nd quarter)|done|
|2.3|Modify and update the IMDC web application to convert any vendor-specific database connection to the selected database solution|Year 1 (1st and 2nd quarter)|done|
|Task 3: Expanding DatCat's Community to Cybersecurity and Other Fields|
|3.1|Start building zone files from passive DNS data and other sources|Year 1|done|
|3.2|Organize the 1st workshop|Year 1 (3rd quarter)|done|
|3.3|Publish the 1st workshop report|Year 1 (4th quarter)|done|
|3.4|Index metadata and annotations for derived TLD zone files|Year 2|done|
|3.5|Assist researchers cataloging their cybersecurity-related datasets|Year 2|done|
|3.6|Add 3 categories to the DatCat public forum: Dataset Request, Dataset Discussion, and General Discussion|Year 2|done|
|3.7|Organize the 2nd workshop|Year 2 (3rd quarter)|done|
|3.8|Publish the 2nd workshop report|Year 2 (4th quarter)|done|
We will use tangible metrics to evaluate the success of the DatCat software we develop and deploy: feedback from users at workshops and through web-based surveys, and quantitative measures of the number, size, and breadth of data sets indexed by the end of the project.
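As a rough illustration of the quantitative metrics mentioned above, the following sketch computes the number, total size, and breadth of a set of indexed data sets. The `DatasetRecord` type, its field names, and the reading of "breadth" as the count of distinct categories are all assumptions made for this example; they do not reflect the actual DatCat schema or the project's evaluation procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical catalog entry; fields are illustrative, not DatCat's schema."""
    name: str
    size_bytes: int
    category: str  # assumed label such as "dns" or "traffic"


def catalog_metrics(records):
    """Summarize number, total size, and breadth of indexed data sets.

    Breadth is taken here to mean the number of distinct categories,
    which is one possible interpretation among several.
    """
    return {
        "count": len(records),
        "total_bytes": sum(r.size_bytes for r in records),
        "breadth": len({r.category for r in records}),
    }


# Made-up sample records, used only to exercise the function.
sample = [
    DatasetRecord("zone-files-a", 1_000, "dns"),
    DatasetRecord("zone-files-b", 2_500, "dns"),
    DatasetRecord("traces-a", 4_000, "traffic"),
]
metrics = catalog_metrics(sample)
print(metrics)
```

Comparing such a summary at the start and end of the project would give one simple way to track growth of the catalog; the user feedback from workshops and surveys would still need separate, qualitative treatment.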