NSF SDCI Proposal - SDCI Data: Improvement of contribution/curation tools for DatCat


1  Project Summary

One of the most significant problems facing systems and networking researchers is a lack of access to the data necessary to investigate important questions. Researcher access to underlying network infrastructures to collect data remains mired in the complex, non-technical, and in many ways intractable problems of economics, ownership, competitive advantage, security, privacy, and legality. Fixing the Internet measurement data problem is well beyond the research community's - or NSF's - grasp, but we can now increase researcher access to already-collected data by collating information about accessible data sets into a single repository. June 2006 marked the public debut of CAIDA's data catalog that supports this indexing of data sets, the most effective solution to the data sharing problem currently possible. We now propose to build supporting software that will facilitate use of the catalog for its intended purpose: to improve the integrity of network science [1].

1.1  Intellectual Merit

CAIDA began development of DatCat, the Internet Measurement Data Catalog, in mid-2002. The first and most fundamental objective driving IMDC's design and implementation is to facilitate searching for and sharing of data among researchers. Internet measurement data collected at great effort goes underutilized because other researchers have no way of knowing it exists, nor is there any way to share metadata about measurements, such as issues that could bias research results. The current state of Internet and network research is scientifically weak; we do not require nor perform reproducible research. The two greatest obstacles to reproducible Internet research are the inability to share data or work on common datasets, and severe under-documentation of data collection and processing. DatCat provides a mechanism to overcome both obstacles, strengthening the standards of network science by facilitating reproducible research.

After painstaking development of a powerfully extensible annotation system, DatCat opened for public browsing on June 12, 2006 with twelve datasets from two organizations. Since that time, thousands of researchers have visited the catalog, many using the available data. DatCat was enthusiastically received by the network research community, with NSF workshop participants clamoring for us to further expand DatCat's functionality, and in particular asking for automated contribution tools that would reduce the amount of time necessary to thoroughly index data sets. We seek three years of SDCI funding for supporting tools for DatCat in order to respond directly to these community requests.

In response to these workshop recommendations and feedback from beta-testers, we propose to develop software tools to: (1) facilitate data download; (2) expedite metadata collation and contribution; (3) allow metadata and query result extraction; and (4) provide an Application Programming Interface (API) that allows researchers to develop their own tools to enhance the utility of DatCat as a resource for the broader scientific community.

1.2  Broader Impact

Improving access to data via the DatCat Internet Measurement Catalog will enable a wide breadth of scientific projects, including: support for validation of scientific research; development of new measurement technology; evaluation of proposed future Internet architectures, e.g., GENI; empirical answers to questions of critical national security and public policy importance; and research in economics, psychology, physical infrastructure design, bioengineering, and many other fields. CAIDA's decade of experience in collection, curation, and provision of Internet data, our decade of experience in developing and supporting scientific tools for the research community, and our years of experience developing DatCat as a community resource inform our development efforts and enable us to better meet the measurement and analysis needs of the network research community in specific and the scientific research and engineering community as a whole.

SDCI Data: Improvement of data contribution and curation tools for DatCat: the Internet Measurement Data Catalog

Project Description

2  Overview: Internet Measurement Data Catalog


2.1  Motivation to Develop and Support an Internet Measurement Data Catalog

Internet researchers face many daunting challenges, including keeping up with the conditions of ever-changing operational environments, privacy concerns, legal complications, and resource access. One of the most fundamental problems remains access to current operational data on Internet infrastructure. For many projects the relevant datasets simply do not exist, and researchers must go through a laborious process of securing permission and deploying measurement infrastructure before they can begin to study a problem. For others, the necessary data may exist and even be available. Unfortunately, if word-of-mouth has insufficiently propagated the information about the data ownership and access procedures, researchers may waste time and effort creating a new dataset, use a dataset inappropriate for a given research problem, or possibly even abandon the research.

In addition, the dearth of centralized knowledge about the few datasets that are known to exist in the community leads researchers to use these datasets well past their window of representativeness. Correspondingly, lack of awareness of datasets limits longitudinal study of network conditions, since comparable datasets that span months or years are difficult to find.

While the resource, legal, and privacy concerns limiting new Internet data collection efforts remain largely intractable, significant research could be promoted through more widespread use of existing data. To that end, CAIDA began developing an Internet Measurement Data Catalog - an index of existing datasets possibly available for research.

In addition to the obvious utility of locating datasets relevant to Internet research, the focus on detailed indexing of data provides a forum for robust documentation of data collection procedures. This functionality has the potential to dramatically increase the quality of research, since it will inform researchers in advance about aspects of the collection process and resulting artifacts that can bias analysis results. Vern Paxson's "Strategies for Sound Internet Measurement" paper [2] describes in detail both common pitfalls and best practices for data collection efforts, and explicitly recommends a cultural change in the research community toward prioritizing the documentation of metadata relevant to interpretation and further use of a given dataset. Our data catalog directly supports this goal.

Currently, critical experimental design details are largely exempt from the scientific review process, as paper length limits and the perception that data collection minutiae are boring and irrelevant cause data collection details to be elided from papers. In other cases, authors do not know the collection process or history behind the data they are using. An independent repository of dataset information allows researchers sufficient space to describe the data collection process, resulting in better documentation of a dataset's strengths and weaknesses that is accessible to paper reviewers and future users of that dataset. Enhanced ability to determine that the results of a paper reflect the system being studied, rather than an artifact of a data collection process, increases the integrity of the entire discipline of Internet measurement-based research.

When researchers are able to find datasets relevant to the topic they wish to investigate, new scopes of research become possible, including comparison across many sites at a single point in time and trend analysis over long periods of time. Moreover, since such studies would have clearly documented data sources, they promise the heretofore elusive science of reproducible Internet research.

In addition to the initial details of collection and processing that a researcher includes about the dataset they index, we wanted other researchers to be able to annotate datasets and give feedback about both problems and interesting features they discover in datasets, thereby increasing their utility.

We hope that as the number of cataloged datasets increases, standard annotations will allow some research questions to be answered directly using available metadata. For example, some believe that the utility of a network grows as the square of the number of participants [3]. This naturally raises questions about how the amount and nature of traffic on the Internet grows as a function of the number of interconnected end hosts. Using just the annotations in a data catalog, one can easily examine the correlation between the number of packets traversing a link and the number of end hosts transmitting those packets at hundreds of measurement points across several years. Many other interesting questions could be answered using only annotations in the catalog.

In summary, the goals for DatCat are as follows:


  • to facilitate searching for and sharing of data among researchers

  • to enhance documentation of datasets via a public annotation system

  • to advance science by promoting reproducible research

In August 2001, we submitted a proposal to the National Science Foundation  [4] to build a public system for dataset registry and annotation that could incorporate both CAIDA data and any other data available in the networking community. In early summer 2002, we began to build the Internet Measurement Data Catalog (IMDC) in earnest.

2.2  Community Feedback Motivates Proposed Software Development

Once the initial design for DatCat was complete, CAIDA hosted a workshop to ensure that the plans seemed useful to the community it was to serve. On June 3, 2004, thirty people with diverse representation from both the data collection and Internet research communities [5] gathered to view our prototype catalog and development plan and to discuss their existing Internet data sets, their policies for sharing that data, and their methods of dataset management and distribution [6].

The response to the proposed data catalog was overwhelmingly positive. Attendees encouraged us to get the catalog online as soon as possible, as they were anxious to make use of it. They felt that DatCat, if populated with relevant high quality data sets and properly maintained, would greatly benefit the Internet research community. It would advance the reproducibility of analyses and results, enable longitudinal and cross-disciplinary studies of the Internet, and open up new cross-domain areas of networking research. The participants highly encouraged continuation of CAIDA's IMDC work and of NSF support of this project. Workshop attendees made many suggestions that have been incorporated into DatCat, and others that we would like to implement given further support for the DatCat project. Section 4 describes these suggestions in detail.

DatCat was also discussed at the Workshop on Community Oriented Network Measurement Infrastructure, co-chaired by PI Claffy, on March 30, 2005 [7]. The workshop was motivated by the increasing awareness that the Internet's size and scope call for large scale distributed network measurement. Workshop attendees discussed community-based measurement systems for both passive and active data, and they encouraged the development of data catalogs like DatCat to distribute measurements collected via these systems.

Feedback from the DatCat user community in response to beta tests of our data contribution system yielded a number of insights. Contributors loved crl_stats, our tool that automatically generates metadata for pcap-format network trace files, but lamented both the lack of automated contribution tools for other network formats and the time required to generate the necessary metadata manually. This feedback leads us to seek funding for additional tools that minimize the time and effort required for researchers to share data with the community.

3  Infrastructure Development

With our initial NSF funding plus a one-year extension, we developed a flexible database that indexes metadata from diverse scientific measurements. We also developed a portal that allows researchers easy access to available data, including the ability to browse featured data collections and search across all indexed fields to find data with characteristics of interest. DatCat opened for public browsing on June 12, 2006 with 4.8 terabytes of data from two organizations. By October 24, 2006 more than 100 researchers had created accounts, a significant vote of confidence in DatCat's future, as accounts were not required to access any DatCat features. By November 13, 2006 more than 2,000 visitors had browsed the catalog, with 98 going on to view CAIDA data. DatCat currently generates seventy percent of the data requests CAIDA processes. We began alpha tests of the public contribution API in June 2006, and progressed to beta tests on November 7, 2006. DatCat opened to public annotations on October 20, 2006. Prior support for DatCat ended on August 31, 2006.

DatCat, the Internet Measurement Data Catalog, can be viewed at http://imdc.datcat.org/. We invite reviewers of this proposal to visit DatCat and spend a few minutes exploring the portal. While we attempt to describe the capabilities and user interface of the catalog in this section, words are an inadequate substitute for experiencing DatCat first hand.

3.1  DatCat Architecture

The Internet Measurement Data Catalog has six core conceptual objects that make up the catalog metadata: Data, Formats, Collections, Packages, Locations, and Contacts. All catalog objects share some common required fields, including the object's name, its creator, its contributor, a primary contact, the object's creation date, the object's last modified date, a persistent handle for external reference to the object, and a short description. Most objects also have associated keywords that make them easier for users to find in the catalog.
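As a rough illustration, the common required fields could be modeled as follows. This is a hypothetical sketch in Python; the field names are illustrative, not the actual DatCat schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List

# Sketch of the fields shared by all six DatCat object types
# (Data, Format, Collection, Package, Location, Contact).
# Names are invented for illustration only.
@dataclass
class CatalogObject:
    name: str
    creator: str
    contributor: str
    primary_contact: str
    created: date
    modified: date
    handle: str            # persistent handle for external reference
    description: str
    keywords: List[str] = field(default_factory=list)

# A hypothetical Data entry using made-up values:
entry = CatalogObject(
    name="example-trace",
    creator="Example Lab",
    contributor="jdoe",
    primary_contact="jdoe@example.org",
    created=date(2006, 6, 12),
    modified=date(2006, 11, 7),
    handle="hdl:example/0001",
    description="Anonymized packet header trace.",
    keywords=["pcap", "passive"],
)
```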

3.1.1  Using the Catalog

Making a catalog like DatCat user-friendly for locating data is a significant challenge because the catalog contains many corpora of highly similar files. For example, DatCat indexes almost 60,000 macroscopic topology measurements that differ only in the locations and times they were collected. These large sets of highly similar files complicate two areas of the user interface: they make it difficult to capture the variety of results to a specific search, and they make it difficult for the user to get a sense of what is available in the catalog as a whole. We address the former problem through a flexible search result display that allows access to summarized information about objects without leaving the search results page. To address the latter problem, we developed a Browse interface targeted at users who come to DatCat without a specific type of data in mind.

The Browse interface focuses on Collections as semantic groupings of available data. It provides collection summaries for both a rotating set of Featured Data, and a list of collections recently added to the catalog. In the future, we hope to recruit Guest Editors from the community to update the featured data list. The browse interface also displays the keywords that span the breadth of data in the catalog, so users can scan through the list to get a sense of what is available, and click to see the Collections that contain data with that keyword.

For users who come to DatCat with a specific goal in mind, we provide a powerful search interface. For quick access, a Google-like single query field allows users to quickly find the data they need. A more in-depth Advanced search interface allows searching across both the fields of an individual object and the relationships between objects (for example, all data objects with a given keyword that are of a given format).

Once users have selected data, a step-by-step process, complete with a progress bar that shows where they are and which steps remain, helps them gain access to the data of interest.

4  Proposed Work

The tasks we propose under this grant fall into two categories: expansion of the tools DatCat provides to assist researchers in contributing data, and development of tools to export and process information from the catalog for further research use.

This grant focuses on substantively different activities (end user contribution and metadata extraction) than those described in our 2006 CRI proposal. The CRI proposal focuses on maintenance and development of DatCat itself and outreach efforts focused on increasing the volume of data in the Catalog.

As described in Section 2.2, the network community [5] enthusiastically welcomed the idea of a publicly available catalog of Internet measurement data. Workshop participants highly encouraged continuation of CAIDA's IMDC work and NSF support of this project, and they requested the development of additional tools to make it easier for researchers to contribute data and metadata to the catalog.

Relevant recommendations from workshop attendees included:


  • expand automation to curtail the time required to contribute data

  • catalog scripts and tools along with data

  • provide convenient means to export search results with support for XML output

  • release the database code to other groups for their internal use

In response to the workshop recommendations and the feedback from beta-testers of DatCat's contribution interface, we propose to develop: tools to facilitate data download (Section 4.1), tools to expedite metadata collation and contribution (Section 4.2), tools to allow metadata and query result extraction (Section 4.3), and an Application Programming Interface (API) (Section 4.4) that allows researchers to develop their own tools to enhance the utility of the Internet Measurement Data Catalog as a resource for the broader scientific community.

4.1  Tools to Facilitate Data Download

DatCat is a metadata catalog; it provides pointers to the location of many datasets, but it does not serve the datasets themselves. This intentionally limited scope allows DatCat to provide relevant information about the existence of all datasets even though security, privacy, economic, and ownership restrictions prevent some data from being widely distributed.

Because DatCat does not distribute data, researchers who have used the Catalog to find relevant, available datasets must follow links from DatCat to other repositories and manage the data download process themselves. A graphical user interface that lets users select many pointers to data files and datasets in DatCat and "drag and drop" the actual data from various independent repositories into a directory on their local machine will save time and frustration, allowing researchers to focus on what they wish to accomplish with the data rather than the tedious details of data acquisition.

4.2  Tools for Metadata Contribution

CAIDA currently makes available two tools, crl_stats [8] and sk_stats, for automatically identifying generic metadata, such as file size, data format, package format, etc. This automated metadata can be generated recursively on hierarchical directory structures. We also distribute tools for generating more sophisticated format-specific metadata from a number of file formats, including pcap [9], dag [10], CoralReef crl [11], NLANR/MCI crl [12], NLANR Time Sequenced Header (tsh) [13], and arts++ [14]. Information automatically recorded includes the capture length (snaplen), maximum OSI level, link layer type, active measurement source address, minimum TTL, and counts of distinct ICMP types and codes, IPv4 packets, bytes, OSI layer 3 bytes, OSI layer 3 packets, IPv4 addresses, IPv4 source addresses, IPv4 destination addresses, flows, IPv6 packets, IPv6 flows, active measurement traces, complete active measurement traces, destinations probed, destinations responding, and the measurements performed on each hop.
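A minimal sketch of the generic, format-independent pass (a recursive walk recording each file's path, size, and md5 digest) might look like the following. This is only an illustration of the idea; the real crl_stats and sk_stats tools record considerably more, including data and package formats.

```python
import hashlib
import os

def generic_metadata(root):
    """Recursively collect generic metadata (path, size, md5) for
    every file under `root` -- a sketch of the kind of record a
    crl_stats-style tool produces for contribution to a catalog."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            md5 = hashlib.md5()
            with open(path, "rb") as f:
                # hash in chunks so large trace files need not fit in memory
                for chunk in iter(lambda: f.read(65536), b""):
                    md5.update(chunk)
            records.append({
                "path": path,
                "size": os.path.getsize(path),
                "md5": md5.hexdigest(),
            })
    return records
```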

The existing tools have been embraced by the community as they ease the onerous task of collating metadata for measurements users wish to contribute. However many archives of network measurement data are in formats not currently supported by DatCat tools. Attendees at CAIDA's measurement catalog workshop and beta testers of the DatCat contribution interface have enthusiastically requested additional tools to generate metadata and annotations for DatCat entries.

Some of the file types and metadata we would like to support include:


  • RouteViews MRT routing tables and updates

  • DNS data, including query types and volumes

  • Cisco Netflow (versions 5 and 9)

  • syslog

  • IETF IPFIX

  • routing configuration files

  • SNMP MIB data

  • rrdtool/MRTG data (providing longitudinal data spanning years for links on the Internet)

  • continuous database snapshots (datapository, RIPE RIS)

  • NETI@home and NETDIMES

In addition to basic information from a wide variety of file types, we also propose the development of tools to automatically extract interesting features in data. Using existing algorithms, we hope to facilitate automatic identification of such relevant features as packet size distribution, malformed packets, peer-to-peer network usage statistics, streaming media use, data collection outages, denial-of-service attacks, Internet worms, botnets, port scans, DNS workloads (including query types), DNS poisoning, routing loops, and hijacked IP prefixes.

This list identifies features we know are of interest to the broader science and engineering community. Further, because the Internet is critical infrastructure for the global economy, this information has relevance well beyond science and engineering. This metadata will allow investigation of previously unanswerable questions in the areas of economics and public policy, and provide actuarial data on the function of the Internet. Information on the prevalence of various Internet threats is of interest to researchers, engineers, operators, policy makers, investors, and every Internet user, regardless of scientific discipline.

This list is by no means exhaustive. Through workshops, conference presentations, and general solicitation, we will gather feedback from the community to ensure that we identify and document neoteric, relevant dataset features that pique community interest. Current requests are a starting point, rather than a restrictive list, of the data features our tools will automatically identify and index in DatCat.
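As an illustration, one of the simplest such features, the packet size distribution, could be computed along these lines. This is a minimal sketch: the `sizes` list stands in for the output of a real trace reader (e.g., a CoralReef- or pcap-based parser), and the fixed-width binning is an assumption, not a DatCat specification.

```python
from collections import Counter

def packet_size_distribution(sizes, bin_width=64):
    """Bin observed packet sizes into fixed-width buckets and return
    the fraction of packets per bucket, keyed by bucket start --
    a toy version of one feature the proposed extraction tools
    would record as dataset metadata."""
    hist = Counter((s // bin_width) * bin_width for s in sizes)
    total = len(sizes)
    return {low: count / total for low, count in sorted(hist.items())}

# Hypothetical packet lengths as a trace reader might report them:
dist = packet_size_distribution([40, 52, 60, 576, 1500, 1500])
```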

4.2.1  Client-side Tools

Both our automated metadata generation and our feature extraction tools will be available for download by the community. While our motivation is to increase the scope and utility of the Internet Measurement Data Catalog, the tools themselves can be used without DatCat contributions; they provide a summary of relevant information about a dataset to any researcher with access to measurement data. The use of standard tools for feature extraction promotes scientific integrity of research done on those features by providing a standard definition for those features.

4.2.2  Server-side Tools

We will also provide automated metadata generation and feature extraction tools that can be run from our servers on local data collections. This feature removes the often onerous task of software download and execution from Data Catalog users. Any user with files in supported formats will be able to generate metadata effortlessly with the click of a mouse. This mode of operation will use increased bandwidth between DatCat and the user, so it is most appropriate for small datasets. Nonetheless, we feel it will significantly broaden the scope of DatCat contributions, as researchers with small datasets are less likely to want to invest significant time installing software and manually collating metadata just to add a few files to the Catalog.

4.3  Tools for Metadata Extraction

As the volume of indexed and annotated data increases, DatCat evolves from a broker providing pointers to relevant datasets to a rich data source in its own right. As queries from the Catalog can answer research questions, inform public policy, and generally advance a broad range of scientific and engineering objectives, users require more sophisticated methods of exporting catalog data and query results. We propose to develop an XML/XSLT-based interface to DatCat to support further refinement of metadata queries. Once this basic interface is complete, we will focus on extraction tools that allow higher-level conceptual mapping to catalog data. Increasing the scope and reducing the technical complexity of queries will allow researchers outside of computer science increased access to Internet data. For example, social scientists will have access to unique data to study the interpersonal and business relationships reflected in Internet communications.
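To make the idea concrete, the following sketch shows the kind of XML serialization such an export interface might emit. The element names and the dict-based stand-in for catalog objects are illustrative assumptions, not the proposed schema.

```python
import xml.etree.ElementTree as ET

def results_to_xml(results):
    """Serialize query results (plain dicts standing in for catalog
    objects) as XML -- a sketch of the proposed export interface."""
    root = ET.Element("datcat-results")
    for obj in results:
        elem = ET.SubElement(root, "object", type=obj["type"])
        for key, value in obj["fields"].items():
            ET.SubElement(elem, key).text = str(value)
    return ET.tostring(root, encoding="unicode")

# Hypothetical single-object result set:
xml = results_to_xml([
    {"type": "data", "fields": {"name": "example-trace", "size": 1048576}},
])
```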

4.4  Application Programming Interface

In addition to the metadata contribution and metadata extraction tools we provide, we will also provide and document an Application Programming Interface (API) to allow any researcher to create his or her own tools to identify data features or collect specific metadata.

Such an API will facilitate more sophisticated operations, such as automated extraction of metadata from the catalog, feature extraction from the catalog metadata, and contribution of the results back to the catalog for future research access. Examples of queries that could be performed this way include correlating diurnal patterns in data and identifying datasets that match specific patterns, e.g. the prevalence of various work schedules as viewed via 8-hour duration activity windows in Internet data. Because Internet use is a human behavior, information of this type facilitates biological, social science, and behavioral science research.

This mode of operation facilitates research of greater scope, both longitudinally and in the volume of concurrent data. Broad questions that were previously unanswerable because fetching and operating on a large set of raw data was prohibitive can be answered more quickly and easily via previously identified metadata and features.
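Since the API does not yet exist, the following is purely illustrative: a mock, in-memory client demonstrating the query, derive, and contribute-back round trip the API is meant to support. All names here (`DatCatClient`, `query`, `contribute`) are invented for this sketch.

```python
# Mock client standing in for the proposed DatCat API; the real API
# would issue remote calls rather than operate on a local list.
class DatCatClient:
    def __init__(self):
        self._store = []              # stands in for the remote catalog

    def query(self, keyword):
        """Return metadata entries tagged with `keyword`."""
        return [m for m in self._store if keyword in m.get("keywords", [])]

    def contribute(self, metadata):
        """Add a metadata entry; return its (toy) handle."""
        self._store.append(metadata)
        return len(self._store) - 1

client = DatCatClient()
client.contribute({"name": "trace-1", "keywords": ["pcap"], "packets": 1000})
client.contribute({"name": "trace-2", "keywords": ["pcap"], "packets": 3000})

# Derive an annotation from existing metadata and contribute it back:
matches = client.query("pcap")
total = sum(m["packets"] for m in matches)
handle = client.contribute({"name": "pcap-packet-total",
                            "keywords": ["derived"], "value": total})
```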

4.5  Project Plan

Six Months: In the first six months, we will begin development on the DatCat metadata API, identify metadata for BGP MRT data, DNS trace data, and Cisco Netflow versions 5 and 9, and develop client-side tools to automatically generate metadata for these formats. Milestones achieved will include:

  • beta-test client side tools to generate and contribute metadata

Twelve Months: We will expand the DatCat metadata API to support data export and develop XML export tools to allow researchers to offload metadata queries. We will identify relevant metadata and develop client-side tools to automatically import router configuration files, SNMP MIB data, and RRDtool/MRTG data. We will also seek community feedback on our progress and file support priorities via workshop and conference attendance and adjust our Project Plan to ensure that we are meeting community needs. Milestones achieved will include:

  • release client side tools to generate and contribute metadata to DatCat from MRT, DNS trace, and Cisco Netflow data files.

  • release an API along with skeleton code to allow community members to rapidly develop their own tools to automatically contribute extracted features and other metadata from proprietary or not-yet-supported formats

Eighteen Months: We will develop a tool to manage access and download of data that researchers have identified in DatCat. We will identify metadata for syslog files, IETF IPFIX data, continuous database snapshots, and NETDIMES/Neti@home data and develop tools to automatically generate and contribute metadata to DatCat. Milestones achieved will include:

  • release client side tools to generate and contribute metadata to DatCat from router configuration files, SNMP MIB, and rrdtool/MRTG data files.

  • release the query export API along with skeleton code to allow researchers to extract metadata, process it, and contribute newly created metadata back to the Catalog.

Twenty-Four Months: We will integrate all of the previously developed client-side metadata generation tools into a server-side tool to allow researchers to create and contribute metadata without downloading and installing tools locally. We will begin development of automated feature extraction for metadata generation. Milestones achieved will include:

  • release client-side tools to generate and contribute metadata to DatCat from syslog, IETF IPFIX, continuous database snapshots, and NETDIMES/Neti@home data files.

  • release data download tool to enhance researcher acquisition of data indexed in DatCat

Thirty Months: We will continue development of automated feature extraction for metadata generation. We will integrate feature extraction into server-side tools for metadata contribution. We will develop an interface for higher-level conceptual queries to facilitate cross-disciplinary research without requiring a high-level of programming expertise. Milestones achieved will include:

  • release server-side tools to allow creation and contribution of metadata and feature descriptions without researcher download and installation of tool packages

  • release client-side tools for feature extraction and automated metadata contribution

Thirty-Six Months: Milestones completed will include:

  • identify and extract metadata for additional community-relevant data features via both client- and server-side applications

  • debut interface for higher-level conceptual queries of DatCat metadata

4.6  Software Licensing

To encourage further development of contribution tools by all users, we have chosen version two of the GNU General Public License (GPLv2) [15] for the software developed in this effort. GPLv2 requires that any further contributions to this code base be made publicly available for all users, in accordance with our core goal: to increase the utility and usability of DatCat to user communities working on a broad range of scientific research and engineering projects.

4.7  Tangible Metrics for Success

In this section, we describe quantitative measures by which the progress of our software development efforts can be evaluated. As a whole, the success of the tools we propose will be reflected in increased volume and diversity of data indexed in DatCat and an increase in the number of users creating accounts and performing queries on the Internet Measurement Data Catalog.

4.7.1  Tools to Facilitate Data Download

A working prototype of a tool to allow data download will:

  • take as input the results of user selection from a DatCat data query

  • allow the user to specify a destination directory on the local machine for the data

  • provide estimates of download time for the selected data

  • notify the user when the download is complete

  • automatically check for data integrity using the md5 checksums stored in DatCat for all data

Once a working prototype has been developed, we will enhance error reporting and recovery and distribute the tool to DatCat users. The success of this tool can be measured by the number of researchers who use it to access DatCat data.
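The integrity check described above is straightforward to sketch. Assuming DatCat stores an md5 digest for each file (as the bullet list states), the download tool would run something like the following after each transfer; the function name is illustrative.

```python
import hashlib

def verify_download(path, expected_md5, chunk_size=65536):
    """Compare a downloaded file against the md5 checksum stored in
    DatCat -- the automatic integrity check the prototype download
    tool would perform when a transfer completes."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # hash in chunks so large downloads need not fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest() == expected_md5
```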

4.7.2  Tools for Metadata Contribution

Because this category includes many small tools designed to capture the metadata of disparate formats, prototype development will include identification of format-specific metadata and production tools will incorporate additional error checking and performance optimizations. The success of these tools will be measured by the extent to which users adding data in supported formats employ them to automatically create metadata.

4.7.3  Tools for Metadata Extraction

A working prototype for metadata extraction will provide an XML-based output of all of the fields of all of the objects returned by a DatCat query. A production version will include human-readable and machine-parsable display options for the extracted data.
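As a sketch of how such machine-parsable output might be consumed, the fragment below parses a hypothetical XML export of a query result. The element names and fields are illustrative assumptions, not the actual DatCat schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical shape of an exported query result; the real schema may differ.
SAMPLE = """\
<results>
  <file><name>trace-a.pcap</name><size>104857600</size></file>
  <file><name>trace-b.pcap</name><size>94371840</size></file>
</results>"""

def summarize(xml_text):
    """Return (file count, total bytes) for an exported result set."""
    root = ET.fromstring(xml_text)
    sizes = [int(f.findtext("size")) for f in root.findall("file")]
    return len(sizes), sum(sizes)
```

Any standard XML toolchain could process such an export, which is the point of choosing an XML-based format: downstream analysis does not depend on DatCat-specific software.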

4.7.4  Application Programming Interface (API)

A working prototype of the API will drive the tools described above in the data access, metadata contribution, and metadata extraction categories. The production version of the API will include any user-requested features, complete documentation of the interface, and skeleton code to allow rapid development of applications that interface with DatCat.
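To give a sense of the skeleton code envisioned here, the stub below shows one plausible client shape. Every name, endpoint, and required field in it is an assumption for illustration; the methods build requests and validate inputs but do not touch the network.

```python
class DatCatClient:
    """Hypothetical client skeleton; names and endpoints are illustrative."""

    def __init__(self, base_url="https://example.org/datcat-api"):
        self.base_url = base_url  # placeholder, not a real service URL

    def query(self, keywords):
        """Would issue a catalog search; here it only builds the request."""
        return {"url": f"{self.base_url}/search",
                "params": {"q": " ".join(keywords)}}

    def contribute(self, metadata):
        """Would POST a metadata record; here it only checks required fields."""
        missing = {"name", "format", "md5"} - metadata.keys()
        if missing:
            raise ValueError(f"missing required fields: {sorted(missing)}")
        return {"url": f"{self.base_url}/contribute", "body": metadata}
```

Skeletons of this kind let application developers start from working scaffolding rather than from the raw wire protocol.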

4.8  Broader Impact on Science and Engineering

Because the Internet is a critical communications and data acquisition tool for researchers across all scientific and engineering disciplines, as well as government and commercial entities who fund research, investigation into the use and function of the Internet ultimately benefits all sciences.

The Oxford English Dictionary defines the pursuit of science as "A branch of study which is concerned either with a connected body of demonstrated truths or with observed facts systematically classified...". Along those same lines, research is defined as "A search or investigation directed to the discovery of some fact by careful consideration or study of a subject." Because DatCat provides an otherwise unavailable path to observing networked systems, the scope of research and education activities enabled by DatCat is fundamentally every area of computer science research that involves currently deployed Internet infrastructure.

DatCat currently makes metadata for seventeen datasets available to researchers:

  • Day in the Life of the Internet collection [2006] (7,810 files)

  • Day in the Life of the Internet collection [2007] (9,544 files)

  • UCSD SIGCOMM Wireless Traces (25 files)

  • AOL 500k User Session Collection (10 files)

  • Router Adjacency Data (2 files)

  • Autonomous System (AS) Adjacency Data (2,451 files)

  • Autonomous System (AS) Relationships Data (79 files)

  • Autonomous System (AS) Taxonomy Data (113 files)

  • CAIDA Dataset on the Code-Red Worms (14 files)

  • CAIDA Dataset on the Witty Worm [public version] (7 files)

  • Skitter Macroscopic Topology Data (60,478 files)

  • OC48 peering point traces (119 files)

  • CAIDA Dataset on the Witty Worm [restricted version containing raw traffic traces] (132 files)

  • Denial-of-Service Backscatter-TOCS [2001-2004] (231 files)

  • Denial-of-Service Backscatter 2004-2005 (63 files)

  • Denial-of-Service Backscatter 2006 (168 files)

  • DNS RTT Dataset (5,047 files)

In 2005, more than 9,400 researchers downloaded six terabytes of data from CAIDA data servers. Forty-four papers were published by non-CAIDA researchers using this data, with many more currently in progress [16]. We expect significantly increased data usage for 2006, as the number of datasets available grew from seven in 2005 to fourteen in 2006.

In its first three months, more than one hundred researchers created DatCat accounts. More than 2,000 people visited DatCat, with at least 902 using DatCat as a starting point for requesting data from CAIDA. We expect to gain momentum from this auspicious start as the volume of data indexed in DatCat grows and DatCat is publicized to the Internet research community.

Available data will inform projects in most areas of interest, including congestion control modeling (especially for VoIP and real-time streaming protocols), Internet topology modeling, Internet worm spread modeling, Internet worm quarantine and countermeasures, packet and flow sampling techniques, flow size estimation, IP geolocation, anomaly detection, routing and queuing algorithms, BGP convergence, ISP hierarchy, application fingerprinting (including content-based approaches), characterization of online games, traffic matrices, packet reordering, domain name system performance, and protocol performance.

4.8.1  Support for Research through Metadata Provision

As the volume of metadata in DatCat increases, so will the range of analysis that is possible without access to raw data. For passively collected trace data, information on volume and rate of traffic, diversity of communicants, and types of encoding used are all available directly through DatCat. Widespread availability of this summary information can support a broader research agenda than can be supported by a few isolated data sets. For example, validating research into realistic traffic simulation requires information about the characteristics of natively collected traffic. Identifying this ground truth information is a diversion from efforts to create an accurate simulator. When researchers have metadata readily available as background information, they remain focused on the research problems they wish to solve.

Metadata provision also enables another significant, and currently rare, form of scientific inquiry: longitudinal studies of network properties, overall system function, and human behavior. DatCat aids investigators in locating data for longitudinal studies, but the hundreds of terabytes of data that must first be downloaded and processed are often prohibitive for researchers with scarce resources. Incrementally collected and indexed DatCat metadata allows longitudinal study of data characteristics with simple database queries, without access to the raw data. Development of an XML-based export mechanism for DatCat search results is critical to supporting valuable longitudinal studies of networked systems.
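The kind of "simple database query" we have in mind can be illustrated with a toy example. The table layout and numbers below are invented for illustration; the point is that a longitudinal question can be answered from per-trace metadata alone, without touching the raw traces.

```python
import sqlite3

# Toy stand-in for catalog metadata records (illustrative schema and values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trace_meta (collected TEXT, bytes INTEGER, pkts INTEGER)")
conn.executemany(
    "INSERT INTO trace_meta VALUES (?, ?, ?)",
    [("2004-01", 10_000_000, 20_000),
     ("2004-01", 12_000_000, 25_000),
     ("2005-01", 30_000_000, 40_000)],
)

# A longitudinal question answered from metadata alone: how did the mean
# packet size in indexed traces change year over year?
rows = conn.execute(
    "SELECT substr(collected, 1, 4) AS year, "
    "       1.0 * SUM(bytes) / SUM(pkts) AS mean_pkt_size "
    "FROM trace_meta GROUP BY year ORDER BY year"
).fetchall()
```

The same pattern scales to any summary field the catalog indexes: traffic volume, communicant diversity, encoding mix, and so on.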

4.8.2  Support for Research into Future Internet Architectures

As the National Science Foundation considers its role in supporting the research and development of new and innovative Internet architectures [17,18], it behooves us to draw on lessons that we have learned about the current Internet. Indeed, it would be unwise to undertake such an ambitious venture without clearly identifying the problems and successes of the current Internet. Research areas include: validating or refuting [implicit] assumptions about the current network (traffic, naming, routing, security) that are driving its evolution; and applying what we have learned from Internet data to the design of new network architectures, including how to facilitate the types of data collections needed to support architectural goals [19].

DatCat is infrastructure-agnostic - metadata fields relating to particular devices and protocols are annotations that can be dynamically created as the need for them arises. Thus DatCat is poised to support data collected on new hardware and software platforms - even those of entirely new architectures - from its inception.

Municipal and community wireless network engineers will benefit greatly from both metadata contribution tools and data extraction tools. They often work without significant research and data analysis support, due to minimal budgets and a lack of experienced staff, yet information about Internet traffic and basic network function is critical to the success of their efforts. Many community networks explicitly allow research use of data, so streamlined contribution tools will let municipal and community wireless networks collaborate and share the information necessary for their growth and development. DatCat will serve the public interest enormously by facilitating the location and distribution of data among interested public sector networks.

4.8.3  Support for Research Beyond Computer Systems and Networks

In addition to the enhanced understanding and improved Internet performance that systems and network research provides, the tools developed under this proposal will directly benefit researchers working in economics, psychology, physical infrastructure design, bioengineering, security, municipal network engineering, and many other fields.

  • While keen interest in the economics of networked systems exists in universities, think tanks, and government agencies, the data necessary to support such research has been sparse or nonexistent. The tools we develop will both maximize the volume of data available to researchers and reduce the barriers of technical comprehension that impede access by researchers outside the fields of systems and networking.

  • Internet data provides a unique reflection into human behavior, including the structure and evolution of social networks, and even the psychology of Internet addiction. The metadata indexed by the catalog provides unique insight into these areas of current interest.

  • Because networked systems are a ubiquitous and necessary part of both home and business life, they have begun to influence the physical design of structures. Evolution in Internet usage patterns, particularly in current activities, provides critical insight into the design of buildings that must support Internet access in the decades to come.

  • Medical devices are increasingly networked, providing monitoring and variation in treatment to users outside of medical facilities. The tools we design will allow bioengineers to understand which properties of networked systems can be relied upon when human lives depend on robust, redundant infrastructure.

  • Physical security systems engineering requires communication and coordination among sensors and connectivity to external sites. With personal, business, medical, and financial information stored and manipulated on networked systems, research into the reliability and integrity of data remains an area of broad interest. DatCat metadata is particularly well-suited to informing this research, since it provides both datasets containing significant security events and many samples of normal background behavior.

  • Many currently relevant legal, social, and governance/policy questions are starved for data that DatCat could provide. Tools to facilitate contribution of and access to DatCat data will provide critical data for research in these fields. Indeed, the DatCat tools developed under this proposal will support goals and missions of ICANN [20], OARC [21], CERT [22], PREDICT [23], the Federal Trade Commission [24], the Federal Communications Commission [25], the Department of Homeland Security [26], and the National Security Agency [27].

5  Why CAIDA is the most appropriate team for this project

CAIDA is recognized as a world leader in Internet measurement and data analysis, and has provided several landmark studies of Internet performance, workload, and topology issues [28]. CAIDA has years of experience in development, implementation, and evaluation of measurement infrastructure, as well as with anonymization and analysis tools for the gathered data. CAIDA's long-standing trust relationships with many Internet service providers and equipment vendors facilitate monitor deployment and informed analyses. To technical, operational, and policy communities, CAIDA is among the most trusted sources of objective measurement tools and analyses.

CAIDA uses its decade of experience with data collection, curation, and provision in developing an Internet Measurement Data Catalog to support access to network research data. We have already populated DatCat with almost 5 terabytes of data from two organizations, spanning more than 50 keywords. The systems and networking community has enthusiastically embraced DatCat, with more than two thousand researchers visiting DatCat in its first three months of operation.

CAIDA has almost a decade of experience with the development and release of tools to support research in both the systems and networking communities and in broader scientific communities, including the biological sciences, chemistry, cognitive science, psychology, and sociology. CAIDA developed, released, and currently supports more than thirty tools, including software for workload characterization, routing measurement, bandwidth estimation, passive trace collection, six types of realtime report generation, geographic visualization, active network core measurement, multicast, 3D graph visualization and manipulation, diurnal pattern display, geolocation of IP addresses, Domain Name System query and workload characterization, WHOIS data storage and service, Autonomous System outdegree ranking, network path plotting, 2D network graph visualization, and graphical traceroute. CAIDA's CoralReef Software Suite consists of flexible, high-performance Internet traffic data collection and analysis tools. Its components include modules for realtime trace collection and analysis; tools for reading, writing, and converting between a variety of popular trace data formats; drivers for specialized trace collection hardware; and Perl and C programming APIs that interface with trace contents, so that researchers can ignore the details of link encapsulation and data formats and focus on novel analysis. Ongoing development of CoralReef provides researchers with tools for data collection and network monitoring even as the underlying infrastructure evolves to support ever-increasing bandwidth in Internet links.

Housed at SDSC, CAIDA represents a unique combination of relevant experience, talents and facilities necessary to achieve the proposed goals.

6  Community Outreach

The definition of success for a Community Resource project is its ability to meet the needs of the targeted community, both at the project's inception, and as the user base evolves. DatCat's architecture is inherently flexible to ensure that it can change as the data collected and used for research change. Yet knowing what to modify to meet community needs requires maintaining a healthy dialog between developers and users. CAIDA plans to ensure that DatCat meets community needs by attending relevant conferences and by hosting workshops to elicit user feedback and increase and expand DatCat's usability.

6.1  Workshops

Using other sources of funding, CAIDA will host a number of workshops in support of DatCat. First, we will invite researchers with valuable datasets to an annual workshop specifically focused on creating metadata and indexing it in DatCat. A collegial environment of measurement-oriented researchers will promote future collaborations, and CAIDA personnel familiar with creating DatCat entries will field any questions or complications that arise as researchers enter their data.

These workshops will provide a venue for gathering critical feedback on our contribution tools; we intend to use a rapid prototyping process to maximize our responsiveness to community feedback.

During the second year of proposed work, CAIDA will host a workshop to help researchers learn to use DatCat to perform research, identify features of interest in datasets, and use our feature extraction and contribution API to annotate DatCat datasets that contain those features. Workshop material will include:

  • how to create specific queries to find data,

  • how to export query results from DatCat for further processing,

  • how to create new annotation types to describe data features and use the contribution API to add annotations describing these features back into the catalog,

  • how to configure a DatCat account.

Finally, CAIDA intends to host a workshop in the third year of proposed work to provide university professors with experience using DatCat, including providing them with class projects for computer science and engineering classes that use DatCat data and metadata. We expect the tools developed under this proposal to be especially useful in an academic setting, as students have limited time to spend on software development while they strive to complete class projects before the end of the term.

7  Alternative Approaches

A number of databases exist for storing network measurement data and performing various sophisticated queries on it [29,30]. Allowing data queries from the community provides a highly useful service, but this approach is limited to providing access to the data stored in the databases, and it provides no tools targeted at promoting metadata contribution or otherwise sharing query results.

The EU's MoMe project [31] and Dartmouth's NSF-sponsored CRAWDAD [32] both focus on storing and serving Internet data. MoMe provides limited support for generating graphs from the underlying data. CRAWDAD targets only wireless network data. These projects are valuable community resources but both require that the data be stored by the project, and thus the scope of data they can index is ultimately limited by intractable security, privacy, and data ownership concerns.

We know of no other open source or commercial tools that generate metadata for Internet measurement data of various formats. We are aware of no tools designed to collate and export information on Internet measurements in a context-aware manner.

DatCat is the only community-supported catalog that can index Internet measurement metadata without taking possession of the data itself. The tools we propose to develop will further that effort by expanding the scope of metadata that can be contributed to the catalog with minimal effort on the part of the researchers in possession of the data. They will facilitate easy processing and export of catalog metadata for researchers across a broad range of scientific disciplines.

8  Collaboration

The number of potential collaborators greatly exceeds those we have pursued thus far, or even those with whom we could potentially partner in the next three years. Therefore, we focus our efforts in this proposal on the development of tools that will lower the cost in time and effort of participating in DatCat. While we welcome collaboration with data repositories, the easier it is for community members to contribute data and metadata on their own, the more useful DatCat will be.

CAIDA has already developed active relationships with many other data collection and distribution projects. In September 2006, we hosted the CRAWDAD (Community Resource for Archiving Wireless Data At Dartmouth) [32] developers for a workshop to discuss inter-operation between DatCat and CRAWDAD. With a few minor updates to both architectures now complete, we expect the CRAWDAD maintainers to be able to index CRAWDAD data into DatCat by March 2007. We've also begun outreach efforts to include many other projects, including Datapository [29], MoMe [31], the Abilene Observatory [33], CERT [22], RouteViews [34], Vern Paxson's payload-containing enterprise trace [35], and a historical data collection from Bill Cheswick [36]. This is by no means an exhaustive list of those with whom we expect to collaborate; rather, a list of those we've been able to approach, or who have approached us in the four months DatCat has been open to the public. We consider outreach and cooperation with other projects to be critical to DatCat's success and a hallmark of responsible use of scarce infrastructure dollars. When many projects are able to work together to serve the community, everyone wins.

8.1  Integrating Diversity into CAIDA Activities

Based at UC San Diego, CAIDA has a strong record of integrating diversity into our activities. Since July 1999, our 54 paid interns have included 15 female, 30 Asian, and 2 Hispanic students. Our 17 volunteer interns over the same period have included one female and 7 Asian students.

Community resources like DatCat are critical to the success of underrepresented groups in computer science and engineering, including women and minority groups. In most cases, organizing the collection of data requires trust relationships with engineering and management personnel in many organizations. Discovering available data resources by word of mouth also requires a great deal of professional networking. Individuals from underrepresented groups are often at a social disadvantage simply by virtue of their uniqueness. Resources like DatCat, which provide access to information on available data with less dependence on "knowing the right people," level the playing field by giving everyone equal access to the high-quality data necessary to support high-quality research.

9  Results from Prior Support

  1. CAIDA: Cooperative Association for Internet Data Analysis. ANI-9711092. $3,199,580. Sep 1997 - Aug 2002. (Claffy) This collaborative undertaking brings together organizations in the commercial, government, and research sectors. CAIDA provides a neutral framework to support cooperative technical endeavors, and encourages the creation and dissemination of Internet traffic metrics and measurement methodologies. Results of this collaborative research and analytic environment can be seen on published web pages on the CAIDA web site www.caida.org. CAIDA also develops advanced Internet measurement and visualization tools.

  2. Internet Atlas. ANI-99-96248 $468,834. Jan 1999 - Dec 2002. (Claffy) This effort involves developing techniques and tools for mapping the Internet, focusing on Internet topology, performance, workload, and routing data. A gallery that assesses state-of-the-art in this nascent sector is published on the web.

  3. Correlating Heterogeneous Measurement Data to Achieve System-Level Analysis of Internet Traffic Trends. ANI-0137121, $1,013,794. Sep 2002 - Aug 2005. (Claffy and Moore) As it grows, the Internet is becoming more fragile in many ways. The complexity of managing or repairing damage to the system can only be navigated with a sustained understanding of the evolving commercial Internet infrastructure. The research and tools proposed under this effort lead to such insights. In particular, richer access to data will facilitate development of tools for navigation, analysis, and correlated visualization of massive network data sets and path-specific performance and routing data that are critical to advancing both research and operational efforts.

  4. Routing and Peering Analysis for Enhancing Internet Performance and Security. ANI-0221172, $882,999 Oct 2002 - Sep 2005 (Claffy) CAIDA performs topology analysis and characterizes sources of growth and instability of the routing system, applying graph theory and comparing combinatorial approaches for identifying strategic locations in the macroscopic Internet.

  5. Quantitative Network Security Analysis. CCR-0311690, $384,183 Aug 2003 - Jul 2006. (Moore) Much information about the state of large-scale malicious activity on the Internet is anecdotal. Under this grant, we are developing a combination of network analysis techniques and network measurement infrastructure to analyze large-scale Internet security threats, such as denial of service attacks or Internet worms. In addition to our own research and analysis of these events, datasets of interesting events collected by the UCSD Network Telescope have been made available to other researchers.

  6. New Directions in Accounting and Traffic Measurement. ANI-0137102, $649,754 Sep 2002 - Aug 2006 (Moore) As network link bandwidths increase, the ability to measure every single packet meaningfully in an operational setting decreases. To assist, we developed several novel techniques for generating accurate measurement reports which degrade gracefully under adverse network traffic conditions. One of these approaches, Adaptive NetFlow, was designed to be implementable in routers and produce reports which are essentially the same as those typically collected operationally today.

  7. SCI: ITR-(NHS+EVS)-(dmc+SIM): Improving the Integrity of Domain Name System (DNS) Monitoring Trends. SCI-0427144, $3,397,981 Sep 04 - Aug 06 (Claffy) This project helps to address National and Homeland Security recommendations by the President's Critical Infrastructure Protection Board to develop a 'cyberspace network operations center (NOC)'. The long-term mission of this proposal - to provide data needed to support DNS research - also has relevance to the real Internet and how it supports economic prosperity and a vibrant civil society. Indeed, the data, models, communications analysis, and simulation functionalities to be provided have the potential to dramatically improve the quality of the lens with which we view the Internet as a whole.

  8. NeTS-NR Toward Mathematical Rigorous Next-Generation Routing Protocols for Realistic Network Topologies. CNS-0434996, $900,000 Oct 04 - Sep 07 (Claffy and Krioukov) CAIDA proposes to open a new area of research focused on applying key theoretical routing results in distributed computation to extremely practical purposes, i.e. fixing the Internet. Our agenda is ambitious, but firmly justified by a set of several previous results, all spectacularly unexpected, which have revealed a huge gap in our fundamental understanding of data networks. Our agenda has three related and clearly defined tasks: 1) execute the next step on the path toward construction of practically acceptable next-generation routing protocols based on mathematically rigorous routing algorithms; 2) validate the applicability of the above algorithms against several sources of real Internet topology data; 3) build and evaluate a model for Internet topology evolution, which reflects fundamental laws of evolution of large-scale networks.

  9. CRI Community-Oriented Network Measurement Infrastructure. CNS-0551542, $583,900 Sept 06 - Sep 08 (Claffy and Moore) Internet research critically depends on measurement, but effective Internet measurement raises several daunting issues for the research community and funding agencies. There is increasing awareness that obtaining a better understanding of the structure and dynamics of Internet topology, routing, workload, performance, and vulnerabilities calls for large-scale distributed network measurement infrastructure. CAIDA proposes to upgrade both of our current measurement infrastructures (passive and active) to provide the research community data from the wide area Internet that will target the need for validation of current and proposed efforts in large-scale network modeling, simulation, empirical analysis, and architecture development to answer questions of critical national security and public policy importance.

References

[1]
National Academy of Science, "Network science," 2006. https://www.nap.edu/catalog/11516/network-science.

[2]
V. Paxson, "Strategies for sound internet measurement," in Proceedings of the ACM Internet Measurement Conference, Oct. 2004.

[3]
A. Odlyzko and B. Tilly, "A refutation of Metcalfe's Law and a better estimate for the value of networks and network interconnections," Mar. 2005. http://www.dtc.umn.edu/~odlyzko/doc/metcalfe.pdf.

[4]
CAIDA, "Correlating Heterogeneous Measurement Data to Achieve System-Level Analysis of Internet Traffic Trends." https://www.caida.org/projects/trends/.

[5]
CAIDA, "Participants of the ISMA Data Catalog Workshop." https://www.caida.org/workshops/isma/0406/list.

[6]
CAIDA, "ISMA Data Catalog 2004 Workshop Report." https://www.caida.org/workshops/isma/0406/final_report.

[7]
k claffy, M. Crovella, T. Friedman, C. Shannon, and N. Spring, "Community-Oriented Network Measurement Infrastructure (CONMI) Workshop Report," in CONMI Workshop, 2005. https://catalog.caida.org/details/paper/2005_conmi.

[8]
"The crl_stats packet trace metadata extractor." https://www.caida.org/catalog/software/coralreef/doc/doc/applications#crl_stats.

[9]
"The pcap raw packet trace file format." http://www.tcpdump.org/.

[10]
"The dag raw packet trace file format." http://www.endace.com/.

[11]
"The CoralReef raw packet/cell trace format." https://www.caida.org/catalog/software/coralreef/.

[12]
"The NLANR/MCI raw packet/cell trace format." http://pma.nlanr.net/Traces/coral.format.html.

[13]
"The Time Synchronized Header raw packet trace format." http://pma.nlanr.net/Traces/tsh.format.html.

[14]
"The arts++ IPv4 Paths format." https://www.caida.org/tools/utilities/arts/.

[15]
"The GNU General Public License (GPL) Version 2, June 1991." http://opensource.org/licenses/gpl-license.php.

[16]
"Papers Published Using CAIDA Datasets." https://www.caida.org/data/publications/.

[17]
NSF workshop report, "Overcoming barriers to disruptive innovation in networking," tech. rep., January 2005.

[18]
M. Baard, "Net pioneer wants new internet," https://www.wired.com/2005/06/net-pioneer-wants-new-internet/.

[19]
"fima@caida.org, mailing list for discussion of Future Internet Measurement Architectures."

[20]
"Internet Corporation for Assigned Names and Numbers." http://www.icann.org/.

[21]
"ISC Operations, Analysis, and Research Center." https://www.dns-oarc.net/index.php/oarc/faq/general.

[22]
CERT, "Computer emergency response team." http://www.cert.org/.

[23]
Department of Homeland Security, "PREDICT project: Protected Repository for Defense of Infrastructure against Cyber Threats." http://www.predict.org/.

[24]
"Federal Trade Commission." http://www.ftc.gov/.

[25]
"Federal Communications Commission." http://www.fcc.gov/.

[26]
"DHS: Department of Homeland Security." http://www.dhs.gov/.

[27]
"National Security Agency/Central Security Service." http://www.nsa.gov/.

[28]
CAIDA. https://catalog.caida.org/details/paper/.

[29]
David Anderson and Nick Feamster, "The datapository: A collaborative network data analysis and storage facility," 2005. http://www.datapository.net/.

[30]
"RIPE Routing Information Service." https://www.ripe.net/analyse/internet-measurements/routing-information-service-ris/routing-information-service-ris.

[31]
The MOME Project Consortium, "Information Technologies Society - Cluster of European Projects aimed at MOnitoring and MEasurement." http://www.ist-mome.org/.

[32]
"A Community Resource for Archiving Wireless Data At Dartmouth." https://crawdad.org.

[33]
Internet2, "Abilene Network Observatory." http://abilene.internet2.edu/observatory/.

[34]
David Meyer, "University of Oregon Route Views Project." http://www.routeviews.org/.

[35]
M. Allman, M. Bennett, M. Casado, S. Crosby, J. Lee, R. Pang, V. Paxson, and B. Tierney, "LBNL/ICSI Enterprise Tracing Project," 2005. http://www.icir.org/enterprise-tracing/index.html.

[36]
B. Cheswick and H. Burch, "Internet Mapping Project." https://www.bell-labs.com/about/history/#gref, 2000.


Footnotes:

1 In general, Metcalfe's law states that the value of a communication network is proportional to the square of the number of users.

2 Until recently, we were unable to track users downloading CAIDA data after visiting DatCat, so the actual number of users is likely to be at least three times greater.

