Metadata Management Software Tools to Support Cybersecurity Research and Development of Sustainable Cyberinfrastructure (DatCat)
Sponsored by:
National Science Foundation (NSF)
This project focuses on enhancing the utility of the Internet Measurement Data Catalog (IMDC) for the cybersecurity and cyberinfrastructure research communities, and on supporting the new NSF Data Sharing Policy.

Funding source: NSF OCI-1127500. Period of performance: August 1, 2011 - July 31, 2014.

Project Summary

Collecting representative Internet measurement data has remained a challenging and often elusive goal for the networking community. Obstacles include the Internet's scale and scope, technical challenges in capturing, filtering and sampling high data rates, difficulty obtaining measurements across a decentralized network with radically distributed ownership, cost of building and operating instrumentation, and political hurdles. Even (or especially) with all these obstacles, the demand for and importance of representative Internet data sets is increasing -- which is good news for rigorous scientific Internet research. The primary driver of this demand is the now pervasive acknowledgement that we are unable to keep up with cybersecurity threats to various critical and increasingly interdependent infrastructures, and that a primary limiting factor in the escalating arms race is our surprisingly still primitive approach to sharing cyberinfrastructure data.

CAIDA has developed an Internet Measurement Data Catalog -- IMDC -- an index of information (metadata) about data sets and their availability under various usage policies. This catalog confronted a significant challenge in network science: reducing the cost of searching for data by organizing metadata about accessible Internet data sets into a single repository. We developed the underlying DatCat architecture and prototype software implementation to support the IMDC.

We propose to integrate the lessons we have learned during our research, development and operational experience with the IMDC to expand the underlying software capabilities to support the cybersecurity research and cyberinfrastructure development communities. Our three primary deployment goals are to: (1) reduce the burden on those contributing data via a streamlined interface and tools for easier indexing, annotation and navigation of relevant data; (2) convert from use of a proprietary database backend (Oracle) to a completely open source solution; and (3) expand DatCat's relevance to the cybersecurity and other research communities. This last goal includes outreach activities such as workshops and demonstrations at security-related PI meetings, creating and indexing new data sets -- ccTLD DNS zone files -- which have been declared critically lacking by the cybersecurity community, and creation of public web forums for discussion of specific and broader data-sharing issues.

The proposed software development activities will support a range of measurable benefits to cyberinfrastructure research: maximizing the re-use of existing Internet data; decreasing the time spent collecting redundant data; reducing the effort needed to start a new study; promoting validation and reproducibility of analyses and results; enabling longitudinal and cross-disciplinary studies of the Internet; and opening up new cross-domain areas of transformative networking research.

The success of the catalog and related workshops will facilitate wide dissemination of Internet measurement data to researchers and security experts across academic, commercial, and government sectors. By including education-oriented data collections in the catalog, this project promises to link research and education, and improve access to Internet research for underrepresented groups in computer science and engineering.

Management Plan

The schedule of work below shows how we plan to accomplish the proposed tasks over two years of the project.

Subtask | Description | Projected Timeline | Status
Task 1: Expanding DatCat capabilities to Streamline the User Experience
1.1 | Implement standalone Collection and Publication objects | Year 1 | done
1.2 | Implement web-based submission interface | Year 1 | done
Task 2: Migrating the backend database to open source software
2.1 | Replicate the IMDC schema in an open database platform | Year 1 (1st and 2nd quarter) | done
2.2 | Migrate the data to the newly created database/schema | Year 1 (1st and 2nd quarter) | done
2.3 | Modify and update the IMDC web application to convert any vendor-specific database connection to the selected database solution | Year 1 (1st and 2nd quarter) | done
Task 3: Expanding DatCat's Community to Cybersecurity and other fields
3.1 | Start building zone files from passive DNS data and other sources | Year 1 | done
3.2 | Organize the 1st workshop | Year 1 (3rd quarter) | done
3.3 | Publish the 1st workshop report | Year 1 (4th quarter) | done
3.4 | Index metadata and annotations for derived TLD zone files | Year 2 | ongoing
3.5 | Assist researchers cataloging their cybersecurity-related datasets | Year 2 | done
3.6 | Add 3 categories to DatCat public forum: Dataset Request, Dataset Discussion, and General Discussion | Year 2 | done
3.7 | Organize the 2nd workshop | Year 2 (3rd quarter) |
3.8 | Publish the 2nd workshop report | Year 2 (4th quarter) |

We will use tangible metrics to evaluate the success of the DatCat software developed and deployed: feedback from users at workshops and on web-based surveys, and quantitative metrics of the number, size, and breadth of data sets indexed by the end of the project.

