Funding source: NSF OCI-1127500. Period of performance: August 1, 2011 - December 31, 2014.
It has become clear that, to keep up with pervasive cybersecurity threats to various critical and often interdependent infrastructures, researchers desperately need to collect and share more cyberinfrastructure data. Yet despite the increasing demand for and importance of Internet data sets, procuring representative Internet measurement data remains a challenging and often elusive goal for the networking community. Obstacles include the Internet's scale and scope; technical challenges in capturing, filtering, sampling, and storing high rates and volumes of data; the difficulty of conducting measurements across a decentralized network with radically distributed ownership; the cost of building and operating instrumentation; and political and legal hurdles.
To maximize the re-use of existing Internet data, CAIDA previously developed an Internet Measurement Data Catalog (IMDC, also referred to as DatCat), an index of information about data sets (metadata) and their availability under various usage policies. Over the course of this project, we streamlined and improved the IMDC and expanded its underlying software capabilities. Most importantly, we reduced the overhead of metadata entry to encourage contributions to the catalog from researchers who collect and curate data and must volunteer their time to index metadata. Reducing this burden is crucial to the success of the catalog. We refined the search capabilities and improved the presentation of search results. We also converted the database backend from proprietary software (Oracle) to a completely open source solution. To engage the community, we regularly showcased DatCat at CAIDA workshops and various relevant meetings, and created a public web forum for discussion of specific and broader data-sharing issues.
By the end of the project, we had seen the beginning of organic use of the IMDC beyond our direct efforts to seed the catalog with our own data sets and those of close collaborators. As a sign of growing community acceptance, the 2015 Internet Measurement Conference Call for Papers announced an award for the paper that contributes a novel dataset to the community, with a requirement that the dataset be made publicly available through DatCat or CRAWDAD. We also explored the possibility of using the DatCat framework for the DHS-funded PREDICT project, and presented DatCat status and updates to the PREDICT community.
Intellectual Merit. The IMDC offers a wide range of benefits to cyberinfrastructure research: it simplifies the process of searching for data by organizing metadata about accessible Internet data sets into a single repository; decreases the time and effort that might otherwise be spent collecting redundant data; lowers the threshold for starting a new study; promotes validation and reproducibility of analyses and results; enables longitudinal and cross-disciplinary studies of the Internet; and facilitates new cross-domain areas of transformative networking research.
Broader Impact. The catalog opens up a wealth of Internet data and statistics to anyone interested in bolstering their expertise in Internet science and technology, including groups underrepresented in computer science and engineering. By indexing education-oriented data sets in DatCat, we make current, operationally relevant Internet data available for use in classrooms and thus efficiently link research and education. The success of our catalog and the related extensive outreach efforts will facilitate wider dissemination of Internet measurement data to researchers and security experts across the academic, commercial, and government sectors.