Mission Statement: CAIDA investigates both practical and theoretical aspects of the Internet, with particular focus on topics that:
- provide insight into the macroscopic function of Internet infrastructure, behavior, usage, and evolution
- foster a collaborative environment in which data can be acquired, analyzed, and (as appropriate) shared
- improve the integrity of the field of Internet science
- inform science, technology, and communications public policies
- Executive Summary
- Economics and Policy
- Infrastructure Projects
- Web Site Usage
- Organizational Chart
- Funding Sources
- Operating Expenses
This annual report covers CAIDA's activities in 2008, summarizing highlights from our research, infrastructure, and outreach activities. Our current research projects, funded by the U.S. National Science Foundation (NSF) and the Department of Homeland Security (DHS), include several measurement-based studies of the Internet's core infrastructure, with focus on the health and integrity of the global Internet's topology, routing, addressing, and naming systems. We made fundamental advances in several of our research projects this year, supported by increased coverage by our measurement infrastructure, and increased collaborations with colleagues around the world. We completed the first full calendar year of continuous provisioning of the most comprehensive annotated view of IPv4 topology thus far. We determined experimentally which IPv4 topology probing method worked best, and began to integrate and optimize our IP alias resolution techniques for large graphs. We also began to deploy IPv6 Ark nodes, and early IPv6 probe destination lists.
Some of our topology research focused on how different routing approaches in nature are maximally efficient on certain types of peculiarly structured topologies, conveniently, those structured like the Internet AS graph. Further, we found that self-similarity of clustering in real complex networks provides strong empirical evidence that some hidden metric spaces underlie these networks. In trying to model self-similar (scale-free) networks embedded into such a hidden space, we discover that a certain approach to routing -- greedy routing -- is phenomenally successful and efficient in such a model. We are still exploring the ramifications of this intense discovery, and the even more intriguing breakthrough that this hidden space seems to be hyperbolic. Our research into network growth dynamics also yielded two papers with surprising results about different regimes of network growth: (1) that there may be a vast pre-asymptotic regime of complex network growth that gives rise to power-law like effects in degree distribution; (2) a simple customer-provider-based modification of the preferential attachment model can account for Internet topology evolution, including the ISP consolidation toward monopoly.
Per our mission, our infrastructure activities aim to narrow the growing gap that impedes the field of network research, as well as telecommunications policy and infrastructure sustainability: a dearth of available empirical data on the public Internet since the infrastructure privatized in the mid-1990s. In 2008 we continued to maintain a catalog of Internet measurement data sets, contributed to and used the (DHS-funded) PREDICT repository of datasets to support cybersecurity research, and developed and deployed new active and passive measurement infrastructure. We continued expanding our newest active measurement infrastructure, now collecting the most comprehensive set of IPv4 topology measurements ever made available to researchers, enhanced with DNS information. Our data repository includes weekly archives of complete Internet AS-level topologies enriched with AS relationship information, and weekly updates of AS (ISP) rankings. We also coordinated and analyzed another DITL's worth of data, and wrote a few web pages and published a paper in CCR; hopefully someone else will pick up DITL next year since we've not had dedicated funding for it yet.
We also led and participated in tool development to support analysis, indexing, and dissemination of Internet infrastructure data. Highlights include updates to our real-time report generator of passively observed traffic, geographical visualizations of DNS workload to a given set of servers, updates to our IPv4 and IPv6 AScore posters, and visual maps of IPv4 address space consumption.
In 2007 (annual report) CAIDA began to expand its scope to include economics and policy research. Our notable contribution in 2008 was a set of blog entries that became a short Internet research tutorial for lawyers. Finally, we engaged in a variety of outreach activities, including web sites, peer-reviewed papers, technical reports, presentations, blogging, animations, and workshops. Details of our activities are below. CAIDA's program plan for 2007-2010 is available at https://www.caida.org/home/about/progplan/progplan2007/. Please do not hesitate to send comments or questions to info at caida dot org.
CAIDA's topology research agenda is focused on three strategic areas: 1) macroscopic topology measurement; 2) analysis of the observable AS-level and router-level hierarchy; 3) topology modeling in support of routing research.
In 2008 CAIDA made steady progress in all three of its topology research focus areas.
Macroscopic Topology Measurement:
We continued large-scale macroscopic topology measurements using our set of monitors distributed worldwide and coordinated by Archipelago (Ark), our state-of-the-art global measurement platform. We are aware of the gaps in geography and topology coverage -- still not well-quantified by researchers -- induced by the relatively small number of vantage points. Yet with 30 Ark monitors deployed in 21 countries by the end of 2008, we achieve our most comprehensive view of IPv4 topology thus far, completing the first full calendar year of the IPv4 Routed /24 Topology Dataset.
On December 12, 2008, we began to use the Ark infrastructure to collect IPv6 topology data. We released the IPv6 Topology Dataset for researchers to get a view of the nascent IPv6 global topology as seen by six Ark monitors. More IPv6 topology data will be available in 2009.
Led by Matthew Luckie in the WAND research group at the University of Waikato, we conducted experiments to see which traceroute probing methods captured the most topology information and published our results in "Traceroute Probe Method and Forward IP Path Inference" in IMC '08. We also released the corresponding Traceroute Probe Method 2008-08 Dataset.
We implemented a new technique for alias resolution measurements on the Ark platform. Our new tool kapar is a scalable version of APAR developed by M. Gunes and K. Sarac at the University of Texas. Using publicly available data from four networks, we tested the efficiency and veracity of various combinations of alias resolution methods including iffinder, TTL measurements, and kapar. We published our findings as a CAIDA technical report, "IP Alias Resolution Techniques". A paper is in preparation.
Analysis of the Observable Topology:
We continue to annotate the IPv4 topology graph with automated DNS reverse lookups of IP addresses discovered by the probes.
We continued work on techniques to accurately annotate Internet topologies based on observations and inference of Internet structural and commercial characteristics. Our efforts focused on the AS-level Internet with AS links annotated by business relationships between ASes. We infer these relationships, recognizing their bidirectional nature, and annotate each link as a customer-provider or a peer-to-peer (settlement-free interconnection) relationship. The paper "Graph Annotations in Modeling Complex Network Topologies" was accepted for publication in ACM Transactions on Modeling and Computer Simulation (TOMACS).
We created new versions of our popular AS Core Graph visualizations for both IPv4 and IPv6. The 2008 IPv4 graph was the first to make use of the topology data collected on the new Ark platform. Because we did not have enough IPv6 support in the Ark infrastructure, the 2008 IPv6 graph relied on data collected by volunteers responding to a request sent to the North American Network Operators' Group (NANOG) mailing list. We will have semi-automated IPv6 topology discovery running on Ark early in 2009.
Several users of CAIDA's AS relationship inference data asked us why it contained AS relationship cycles, e.g., cases where AS A is a provider of AS B, B is a provider of C, and C is a provider of A, or other cycle types. We published a paper, "On Cycles in AS Relationships" in ACM SIGCOMM Computer Communications Review (CCR), v.38, n.3, p.102-104, 2008 that provides our answers.
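To make the notion of a relationship cycle concrete, the following minimal Python sketch searches a provider-to-customer graph for one cycle using depth-first search. The function name `find_provider_cycle`, the `customers_of` dictionary layout, and the AS labels are hypothetical illustrations; this is not the method used in the paper or in CAIDA's inference tools.

```python
from typing import Dict, List, Optional


def find_provider_cycle(customers_of: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one provider->customer cycle as a list of ASes, or None.

    customers_of maps each provider AS to the ASes inferred to be its
    customers (an assumed input format, for illustration only).
    """
    state: Dict[str, str] = {}   # AS -> "visiting" or "done"
    path: List[str] = []         # current depth-first path

    def visit(asn: str) -> Optional[List[str]]:
        state[asn] = "visiting"
        path.append(asn)
        for customer in customers_of.get(asn, []):
            if state.get(customer) == "visiting":
                # Back edge: the customer is already on the current path.
                return path[path.index(customer):] + [customer]
            if customer not in state:
                cycle = visit(customer)
                if cycle:
                    return cycle
        path.pop()
        state[asn] = "done"
        return None

    for asn in list(customers_of):
        if asn not in state:
            cycle = visit(asn)
            if cycle:
                return cycle
    return None


# A is a provider of B, B of C, and C of A: the example from the text.
print(find_provider_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))
```

The recursive search is adequate for small illustrative graphs; a full AS-level graph would likely call for an iterative traversal.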
We demonstrated that the self-similarity of some scale-free networks with respect to a simple degree-thresholding renormalization scheme finds a natural interpretation in the assumption that network nodes exist in hidden metric spaces. Clustering, i.e., cycles of length three, plays a crucial role in this framework as a topological reflection of the triangle inequality in the hidden geometry. We prove that a class of hidden variable models with underlying metric spaces are able to accurately reproduce the self-similarity properties that we measured in the real networks. Our findings indicate that hidden geometries underlying these real networks are a plausible explanation for their observed topologies and, in particular, for their self-similarity with respect to the degree-based renormalization. We published our results in "Self-similarity of Complex Networks and Hidden Metric Spaces" in Physical Review Letters, v. 100, 078701, 2008.
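As a rough illustration of degree-thresholding renormalization, the sketch below keeps only nodes whose degree exceeds a threshold and compares clustering before and after. The function names, the adjacency-set input format, and the choice of averaging clustering over nodes of degree two or more are assumptions made for this example, not details taken from the paper.

```python
from itertools import combinations
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]


def threshold_subgraph(adj: Graph, k_min: int) -> Graph:
    """Keep nodes whose degree in the full graph exceeds k_min, plus links among them."""
    keep = {node for node, nbrs in adj.items() if len(nbrs) > k_min}
    return {node: adj[node] & keep for node in keep}


def average_clustering(adj: Graph) -> float:
    """Average, over nodes of degree >= 2, of the fraction of neighbor pairs that are linked."""
    coeffs = []
    for nbrs in adj.values():
        if len(nbrs) < 2:
            continue
        pairs = list(combinations(nbrs, 2))
        linked = sum(1 for u, v in pairs if v in adj[u])
        coeffs.append(linked / len(pairs))
    return sum(coeffs) / len(coeffs) if coeffs else 0.0


# Toy graph: a triangle A-B-C with one pendant node D attached to A.
toy = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
core = threshold_subgraph(toy, 1)
print(average_clustering(toy), average_clustering(core))
```

Self-similarity in the sense described above would show up as clustering statistics that stay roughly the same across a range of thresholds on a real network; the toy graph is far too small to exhibit that.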
We studied the paradox associated with networks growing according to super-linear preferential attachment: super-linear preference cannot produce scale-free networks in the thermodynamic (asymptotic) limit, but there are super-linearly growing network models that perfectly match the scale-free structure of some real networks, including the Internet. We demonstrated that a super-linearly growing network model can reproduce, in its pre-asymptotic regime, the structure of a real network, if the model captures some sufficiently strong structural constraints, e.g., rich-club connectivity. These findings suggest that real scale-free networks of finite size may exist in pre-asymptotic regimes of network evolution processes that lead to degenerate network formations in the thermodynamic limit. We published our results in "Scale-free networks as pre-asymptotic regimes of super-linear preferential attachment" in Physical Review E, v.78, 026114, 2008.
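For readers unfamiliar with the mechanism, the following sketch grows a network in which each new node links to one existing node chosen with probability proportional to its degree raised to a power alpha; alpha greater than 1 is the super-linear case. The function name, parameters, and one-link-per-node rule are illustrative assumptions, and the sketch deliberately omits the structural constraints (such as rich-club connectivity) that the paper's model relies on.

```python
import random
from typing import List, Tuple


def grow_network(n_nodes: int, alpha: float, seed: int = 0) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Grow a tree by preferential attachment with kernel degree**alpha.

    Returns (degree list indexed by node id, list of (new_node, target) links).
    """
    rng = random.Random(seed)
    degree = [1, 1]          # start from a single link between nodes 0 and 1
    edges = [(0, 1)]
    for new_node in range(2, n_nodes):
        # Attachment weight of every existing node: its degree to the power alpha.
        weights = [k ** alpha for k in degree]
        target = rng.choices(range(len(degree)), weights=weights, k=1)[0]
        edges.append((new_node, target))
        degree[target] += 1
        degree.append(1)
    return degree, edges


degree, edges = grow_network(200, alpha=1.5, seed=1)
print(max(degree), len(edges))
```

Running the same sketch with increasing `n_nodes` gives a feel for how the largest hub comes to dominate as the network grows, which is the degenerate asymptotic behavior the paragraph above refers to.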
In collaboration with S. Shakkottai from Texas A&M University we refined our model of Internet growth which attempts to explain preferential attachment based on economic realities of the AS-level Internet. We simulated a growing AS-level topology and annotated links in the model graph with AS relationships (customer-provider or peer-to-peer). We compared the degree distributions for all different types of nodes in the simulated topology to those observed in measured Internet topologies, and found that the distributions are similar. To our knowledge, this is the first Internet evolution model that is realistic, analytically tractable, and entirely based on physical, that is, measurable, parameters. The paper was accepted for publication in The First International Conference on Complex Sciences: Theory and Applications (Complex 2009).
- Improving topology measurements
- Led by Matthew Luckie, published the paper, "Traceroute Probe Method and Forward IP Path Inference" in IMC '08 and released the dataset Traceroute Probe Method 2008-08 Dataset.
- Published a technical report, "IP Alias Resolution Techniques".
- Began regular IPv6 topology measurements in December 2008.
- Analysis of topology
- Created new versions of the AS Core Graph visualizations for both IPv4 and IPv6.
- Published a paper, "On Cycles in AS Relationships" in ACM SIGCOMM Computer Communications Review (CCR), v.38, n.3, p.102-104, 2008
- The paper, "Graph Annotations in Modeling Complex Network Topologies" was accepted for publication in ACM Transactions on Modeling and Computer Simulation (to appear in 2009).
- Topology modeling
- Published the paper, "Self-Similarity of Complex Networks and Hidden Metric Spaces" in Physical Review Letters vol. 100, no. 078701, 2008.
- Published the paper, "Scale-free networks as pre-asymptotic regimes of super-linear preferential attachment" in Physical Review E, v.78, 026114, 2008.
- The paper, "Evolution of the Internet AS-level Ecosystem" was accepted for publication in The First International Conference on Complex Sciences: Theory and Applications (Complex 2009).
- CAIDA Visualization of the Internet topology is on display at New York's Museum of Modern Art
- Ongoing data releases
- We made publicly available a number of topology datasets.
- The adjacency matrix of the observed Internet AS-level graph computed daily from Ark measurements;
- AS relationship repository where we archive, on a weekly basis, the complete Internet AS-level topologies enriched with AS relationship information for every pair of AS neighbors;
- bi-weekly updates of AS-ranking data
- daily files of the DNS reverse name lookups for the IPv4 core traceroute data.
A. Jamakovic (TU Delft) continued her work from 2007 on application of the dK-series methodology to study randomness of various complex networks, which we hope to publish in 2009.
This research received support from:
- NSF grant (CRI 05-51542) "Toward Community-Oriented Network Measurement Infrastructure",
- NSF grant (CNS-0722070) "NeTS-FIND: Greedy Routing on Hidden Metric Spaces as a Foundation of Scalable Routing Architectures without Topology Updates",
- NSF grant (NeTS-NR 04-540) "Toward Mathematically Rigorous Next-Generation Routing Protocols for Realistic Network Topologies",
- DHS Science and Technology Directorate contract (N66001-08-C-2029) "Cybersecurity: Leveraging the Science and Technology of Internet Mapping for Homeland Security", and
- a University Research Program gift from Cisco Systems, Inc.
Greedy Routing on Hidden Metric Spaces as a Foundation of Scalable Routing Architectures without Topology Updates
CAIDA's research in Internet routing continued to focus on two related topics: greedy routing based on hidden metric spaces underlying real networks; and the relationship between routing efficiency and the structure of the network topology. Leveraging a decade of CAIDA institutional knowledge of topology discovery, collection, and analysis, we have a bold research objective: a far-reaching solution to the routing scalability problems of today's Internet. But our work in this area has profound implications for network science in other disciplines (physics, biology, chemistry, social sciences).
To foster our research goals, we developed and pursued the following step-by-step program:
- Obtain empirical evidence that hidden spaces do underlie complex networks and that they are metric;
- Identify navigability mechanisms that influence the efficiency of greedy routing in complex networks;
- Find the basic geometrical and topological properties of hidden spaces that make them maximally congruent with respect to the identified navigability mechanisms;
- Obtain empirical evidence that hidden spaces underlying real networks do possess these properties; and
- Find mappings of nodes in real networks to the identified spaces or their models.
We studied the process of routing information through networks as a universal phenomenon existing in both natural and man-made complex systems. In many complex networks found in nature, nodes communicate efficiently even without full knowledge of global network connectivity. We demonstrated that the peculiar structural characteristics of observable complex networks are consistent with maximizing communication efficiency when using greedy routing approaches without global knowledge. We also described a general mechanism that explains this connection between network structure and function. The paper "Navigability of complex networks" was published in Nature Physics online.
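Greedy routing without global knowledge can be sketched in a few lines: each node forwards to whichever of its neighbors is closest to the destination in the hidden space, and delivery fails if no neighbor is closer than the current node. The code below is a minimal illustration under assumed names (`greedy_route`, `line_distance`) and a toy one-dimensional hidden space; it is not the simulation code behind the paper.

```python
from typing import Callable, Dict, Hashable, List, Optional


def line_distance(a: float, b: float) -> float:
    """Hidden distance for the toy example: positions on a line."""
    return abs(a - b)


def greedy_route(
    adj: Dict[Hashable, List[Hashable]],
    coord: Dict[Hashable, float],
    src: Hashable,
    dst: Hashable,
    dist: Callable[[float, float], float] = line_distance,
) -> Optional[List[Hashable]]:
    """Return the greedy path from src to dst, or None if routing gets stuck.

    Each hop uses only the hidden coordinates of the current node's
    neighbors and of the destination, never the global topology.
    """
    path = [src]
    current = src
    while current != dst:
        neighbors = adj.get(current, [])
        if not neighbors:
            return None
        best = min(neighbors, key=lambda n: dist(coord[n], coord[dst]))
        # Stop if the best neighbor is no closer to the destination.
        if dist(coord[best], coord[dst]) >= dist(coord[current], coord[dst]):
            return None
        path.append(best)
        current = best
    return path


adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
coord = {"a": 0.0, "b": 1.0, "c": 2.0, "d": 3.0}
print(greedy_route(adj, coord, "a", "d"))
```

Because the distance to the destination strictly decreases at every hop, the loop always terminates, either at the destination or at a local minimum.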
We began the follow-on work to the above, and submitted for publication a paper, "Efficient Navigation in Scale-Free Networks Embedded in Hyperbolic Metric Spaces". This paper shows that the hierarchical structure of complex networks is congruent with negatively curved geometries hidden beneath observed topologies, i.e., the hidden metric space is hyperbolic. Mapping nodes to these hidden metrics leads to scale-free topologies in the observable network, and even more pleasantly surprising, greedy routing on this embedding can achieve 100% reachability and optimal paths. The question remains as to whether we can find hidden metric spaces to map and better navigate real world networks such as the Internet.
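For reference, one standard way to measure distance in a hyperbolic plane of curvature -1 is the hyperbolic law of cosines in polar coordinates. The report does not state which coordinate model the paper uses, so the polar representation and the symbols below are assumptions chosen for illustration. For two nodes at radial coordinates r_1, r_2 and angular coordinates theta_1, theta_2:

```latex
\cosh d_{12} = \cosh r_1 \cosh r_2 - \sinh r_1 \sinh r_2 \cos\Delta\theta,
\qquad
\Delta\theta = \pi - \bigl|\pi - |\theta_1 - \theta_2|\bigr|
```

Under this assumption, greedy routing would forward each packet to the neighbor with the smallest d to the destination.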
- Research results
- The paper "Navigability of Complex Networks" was published by Nature Physics.
- The paper "Efficient Navigation in Scale-Free Networks Embedded in Hyperbolic Metric Spaces" was posted on arXiv.
- In August 2008, the Santa Fe Institute (SFI) hosted an interdisciplinary "Networks and Navigation" Workshop jointly organized and supported by SFI and CAIDA.
Our routing research received support from:
- NSF grant (CNS-0722070) "NeTS-FIND: Greedy Routing on Hidden Metric Spaces as a Foundation of Scalable Routing Architectures without Topology Updates",
- NSF grant (NeTS-NR 04-540) "Toward Mathematically Rigorous Next-Generation Routing Protocols for Realistic Network Topologies",
- DHS Science and Technology Directorate contract (N66001-08-C-2029) "Cybersecurity: Leveraging the Science and Technology of Internet Mapping for Homeland Security", and
- a University Research Program gift from Cisco Systems, Inc.
CAIDA researchers conduct DNS measurements and develop tools, models, and analysis methodologies for use by DNS operators and researchers.
Measurements of traffic at the DNS Root Servers
During our January 2008 CAIDA/WIDE workshop, participants discussed the Day in the Life of the Internet (DITL) project, reflected on the lessons and results of the 2007 collection event, and compiled a sample list of the top research questions and the corresponding data that researchers would like to procure from the DITL project.
In collaboration with ISC and OARC, we held the third large-scale simultaneous "Day in the Life of the DNS Root Servers" data collection event on March 18-19, 2008 (DITL 2008). We captured tcpdump traces at nearly all anycast instances of the A, C, E, F, H, old-J, K, L, old-L, and M root servers and from two alternative Open Root Server Network (ORSN) servers. In comparison with DITL 2007, the total amount of data doubled. This unique dataset represents the most comprehensive measurements of the root servers to date, and provides researchers with unprecedented insight into the root server workload characteristics and performance. We wrote a summary of the collection event and cataloged the data into DatCat. These data are available to the research community via the DNS-OARC. Academic researchers can participate in the DNS-OARC for free.
We began analysis of the DNS root server data collected during the DITL 2008 and presented our findings at NANOG42 and at the DNS-OARC 2008 DNS Ops Workshop in Brooklyn. We published a paper, "A Day at the Root of the Internet" in the ACM Computer Communications Review, v. 38, pp.41-46, 2008.
Sebastian Castro (a visiting student from Chile) collaborated with CAIDA researchers to analyze DNS data collected during the 2006, 2007, and 2008 DITL events. He focused on the "heavy hitters" analysis, which raised more questions than we answered. The sources of heavy pollution (queries that cannot possibly be appropriate) change over time, often associated with application software that is lazy (laissez-faire) about managing its own DNS traffic. One clear pattern is continuous growth -- there is an order of magnitude more pollution (invalid queries) at the roots than valid queries, and the number of invalid queries grows faster than the number of valid queries. Perhaps more importantly, no organization has both the incentive and the capital to fix this pollution. The root cause of much of it is that lazy software is cheaper to write.
We published a paper "Influence Maps - a novel 2-D visualization of massive geographically distributed data sets" in the Internet Protocol Forum in October, 2008. In this paper, we present a novel visualization technique -- the Influence Map -- which renders a compressed representation of geospatially distributed Internet data.
Other DNS measurements:
To complement the IPv4 Routed /24 Topology Dataset, in March 2008, we began using our custom-built bulk DNS lookup service to resolve the fully-qualified domain names for IP addresses seen by our monitors. We make these names available in the DNS Names for IPv4 Routed /24 Topology Dataset.
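As an illustration of the first step of a bulk reverse lookup, the sketch below turns a list of observed IPv4 addresses into the PTR query names a resolver would ask for, removing duplicates along the way. The function name `ptr_query_names` is hypothetical and the addresses are documentation-range examples; CAIDA's custom-built lookup service is not described in enough detail here to reproduce, so this shows only the name construction using Python's standard `ipaddress` module.

```python
import ipaddress
from typing import Dict, Iterable


def ptr_query_names(addresses: Iterable[str]) -> Dict[str, str]:
    """Map each distinct IP address to its reverse-DNS (PTR) query name."""
    names: Dict[str, str] = {}
    for text in addresses:
        ip = ipaddress.ip_address(text)
        # reverse_pointer yields e.g. "1.2.0.192.in-addr.arpa" for 192.0.2.1
        names[str(ip)] = ip.reverse_pointer
    return names


seen = ["192.0.2.1", "192.0.2.1", "198.51.100.7"]
for address, name in ptr_query_names(seen).items():
    print(address, name)
```

An actual lookup would then send these names to a resolver and record the returned hostnames alongside the probe data.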
Duane Wessels, who became director of DNS-OARC in June 2008, continued our open resolvers survey and posted daily reports that identify open DNS resolvers. These resolvers represent a dangerous vulnerability to Internet users since they allow resource squatting, are easy to poison, and can be used in widespread Distributed Denial of Service (DDoS) attacks.
In collaboration with Prof. N. Brownlee (University of Auckland (UA), New Zealand), we maintained NeTraMet traffic meters installed at various locations in the US, New Zealand, and Japan, and continued monitoring requests to, paired with responses from, root/gTLD servers generated by large campus/enterprise networks. The resulting longitudinal dataset is available for researchers. It contains information useful for evaluating performance conditions and trends on the global Internet, although note that DNS RTTs are influenced by several factors, including remote server load, network congestion, route instability, and local effects such as link or equipment failures.
- DNS Root servers traces
- Co-organized A day in the life of the Internet collection event on March 18-19, 2008.
- Presented a Day In The Life of the Internet 2008 Data Collection Event at NANOG42.
- Presented an analysis of the 2008 data at the DNS-OARC 2008 DNS Ops Workshop in Brooklyn.
- At the CAIDA-WIDE workshop in January 2008, we presented:
- DNS nameserver database at OARC,
- October 2007 survey of open resolvers in the Internet,
- Lessons from DITL 2007 - and what we should do different in 2008,
- Bulk DNS Lookup Service,
- DITL 2007 Collection Summary,
- IPv6 Collection 2008 a View of the IPv6 Networks,
- DNS: comparison of 2006 and 2007 snapshots,
- Comprehensive approach to the analysis of DITL DNS data, and
- Cataloging DITL data for research use.
- Published a paper, "A Day at the Root of the Internet", in ACM CCR, v. 38, pp. 41-46, 2008.
- Published a paper, "Influence Maps - a novel 2-D visualization of massive geographically distributed data sets", in the Internet Protocol Forum in October, 2008.
- Indexed the DITL 2007 and DITL 2008 data into DatCat.
- At the CAIDA-WIDE-CASFI workshop in August 2008, we presented:
- Other DNS measurements
- Began serving DNS lookups for the IPv4 Routed /24 Topology Dataset.
- Data releases
- We make a number of DNS datasets available publicly or by request.
- five years of data on RTTs from several campuses to root/gTLD servers
- daily reports identifying open DNS resolvers
- OARC DNS root traces for January 10-11, 2006, January 9-10 2007, and March 18-19, 2008.
- database of reverse DNS lookups
This research received support from:
- NSF grant SCI-0427144 "Improving the Integrity of Domain Name System (DNS) Monitoring and Protection", and
- a gift from the WIDE Project.
In the face of exhaustion of the Internet Assigned Numbers Authority (IANA) IPv4 address resources in the next several years, CAIDA seeks salient data and objective quantitative analysis that will inform the development of address allocation policies that will accommodate continued growth and innovation of the Internet. We hope to foster discussion of scenarios that accommodate four realities: 1) the current Internet has become critical infrastructure for governments, organizations, and individuals throughout the world; 2) the Internet (on any timeline) requires an upgrade to a more scalable and sustainable addressing solution; 3) any such solution requires an infusion of capital and skilled labor; and 4) the major organizations currently associated with ownership, maintenance, and upgrade of such Internet infrastructure do not currently enjoy resources that would allow for such investments in the required upgrades. All four S's -- security, scalability, sustainability, and stewardship -- in one messy problem.
In 2008, CAIDA worked with ARIN to collect and analyze information on IPv6 uptake. We conducted two surveys: the March 2008 survey went to respondents from the ARIN region, and the September 2008 survey collated responses from all regions. Claffy presented these results at the April and October ARIN Public Policy Meetings, and also presented remotely to the Internet Society's (ISOC) Advisory Council in November.
Claffy chaired a panel at TPRC's 36th Research Conference on Communication, Information, and Internet Policy hosted by the Center for Technology and the Law, George Mason University Law School, Arlington, Virginia in September 2008.
- We published a presentation and cleaner audio track of the October 2008 ARIN talk (below) and posted it on our blog.
- CAIDA/ARIN IPv6 Surveys
- kc claffy chaired a panel at TPRC 2008 36th Research Conference on Communication, Information, and Internet Policy on "Regulating the Next Generation Network."
REU student Jennifer Hsu worked with CAIDA staff to produce A History of Internet Infrastructure Ownership, but gave up on it until we get a better source of available data.
This research received support from a gift made by the American Registry for Internet Numbers (ARIN).
We continued activity on the COMMONS project, mostly in learning what policy changes or support would be required to support such an experiment. Last year we proposed that NLR and/or Internet2 offer measurement technology and connectivity to community networks in exchange for opt-in access to measurement of the resulting interdomain network and its costs. This year we shifted our efforts toward educating enough people in the communications policy community about Internet technology problems so we can have a more interdisciplinary conversation next year.
In collaboration with co-author Sascha Meinrath, we published, "The COMMONS Initiative: Cooperative Measurement and Modeling of Open Networked Systems", in the CommLaw Conspectus: Journal of Communications Law and Policy, Volume 16.2. This article proposes to develop a requirements document and roadmap to support the use of a national OC-192 transit backbone for community wireless networks and other public sector networks to reach each other. This would enable a large-scale, incentive-based network of Internet workload, performance, economic, and behavioral measurement on an unprecedented national, inter-segment, inter-provider scale. First we should talk for a couple of years about how to respect privacy in Internet research.
In March 2008, kc claffy attended a meeting hosted by Google and Stanford Law School - Legal Futures, which inspired a follow-up set of blog postings and a slightly updated pdf on "Ten Things Lawyers Should Know About the Internet".
In its second year, the COMMONS project reported the following major milestones.
- We published, "The COMMONS Initiative: Cooperative Measurement and Modeling of Open Networked Systems", in the CommLaw Conspectus: Journal of Communications Law and Policy, Volume 16.2.
- We blogged the top ten things lawyers should know about the Internet, and edited them into a booklet.
- COMMONS participants
- COMMONS participant ISC continued providing passive data via real-time anonymized traffic reports generated by the CoralReef Report Generator software in exchange for R&E network access, and will provide vetted researchers with anonymized packet headers upon request.
We supervised an undergraduate student, Connie Lyu, who worked with Adobe Illustrator to add graphics and layout to the top ten things lawyers should know about the Internet blog entries to produce the booklet version.
This research was supported by a gift from Cisco Systems, Inc.
The core objective of the "Community-Oriented Network Measurement Infrastructure" (CONMI) project is to provide needed data sets to the scientific community studying the Internet. To accomplish this goal, CAIDA deploys both active and passive infrastructure to measure a wide cross-section of the Internet and collect and distribute the resulting data.
In 2008, we continued to focus our efforts on the two tasks described in the proposal: 1) implementing Archipelago, our state-of-the-art, community-oriented, active measurement infrastructure; and 2) deploying monitors capable of collecting passive traces on Internet links, including new monitors for OC192 backbone links and web pages displaying reports from publicly accessible realtime traffic monitors.
Archipelago (Ark): A Coordination-Oriented Measurement
The Archipelago (Ark) Project made progress in 2008 on monitor deployment and software development. By the end of 2008, CAIDA had 30 Ark monitors deployed in 21 countries, including eight monitors in the US. The topology of Archipelago includes much of that of our previous skitter infrastructure, which used PCs located in networks around the world that sent measurement results via the Internet to a central server located at CAIDA at the San Diego Supercomputer Center. The new design pays a great deal of attention to communication & coordination, software installation & execution environment, and data storage & management. In 2008, the Ark infrastructure was primarily used for ongoing topology measurements in the IPv4 address space; we will expand its research scope in 2009.
We continue to expand the Ark infrastructure, adding 1-2 monitors per month. Increasing the number of monitors is vital for topology measurements since it reduces the gaps in coverage, decreases the cycle times, and allows us to increase the number of traces attempted for each destination /24. The application load gets distributed across the teams and monitors based on resource availability. Locations interested in hosting an Ark monitor should send a message to email@example.com.
We added infrastructure for automated IPv4 Routed /24 AS Links Dataset creation, automated ongoing DNS lookup of IP addresses seen in the Routed /24 Topology traces, and tcpdump captures of DNS query/response traffic.
We released the rb-wartslib library that enables warts data processing from Ruby. We also produced numerous scripts to further automate collection, data management, and archival. All tools for downloading and managing collected data stress scalability and fault tolerance.
With a nod toward the future, we implemented a prototypical probing methodology for Internet Protocol version 6 (IPv6) on six nodes of the Ark infrastructure with the requisite IPv6 connectivity. We will expand IPv6 measurements in 2009.
Deployment of Passive Monitors for Trace Collection on Backbone Links
Early in the year, we spent much time and effort dealing with problematic older unsupported hardware hoping to repurpose it, but have not had any resources to upgrade these older systems. Finally, in July 2008, we successfully deployed four passive traffic monitors on high-speed, tier 1, OC192 Internet backbone links. Working with Equinix and a tier 1 ISP, we sited two monitors (four hosts) to tap two bidirectional links, one from Seattle, WA to Chicago, IL and another from San Jose, CA to Los Angeles, CA.
The first 1-hour trace was obtained in March 2008 as a part of our global Internet measurement experiment, 'Day in the Life of the Internet 2008'. The data was anonymized using the Crypto-PAn prefix-preserving anonymization technique and is available for use by vetted researchers. We continue monthly collection of traces at the Equinix facilities.
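The defining property of prefix-preserving anonymization is that two addresses sharing exactly k leading bits map to anonymized addresses that also share exactly k leading bits. The sketch below illustrates that property only. Crypto-PAn itself is built on a block cipher; this example substitutes HMAC-SHA256 from the Python standard library as the keyed function, so its output will not match Crypto-PAn's, and the function names and key are hypothetical.

```python
import hashlib
import hmac
import ipaddress


def anonymize_ipv4(address, key):
    """Prefix-preserving mapping: each output bit is the input bit XORed with
    a keyed pseudorandom bit that depends only on the preceding input bits."""
    original = int(ipaddress.IPv4Address(address))
    result = 0
    for i in range(32):
        prefix = original >> (32 - i)  # the i most significant bits (0 if i == 0)
        # Include the prefix length so prefixes of different lengths differ.
        message = bytes([i]) + prefix.to_bytes(4, "big")
        flip = hmac.new(key, message, hashlib.sha256).digest()[0] & 1
        bit = (original >> (31 - i)) & 1
        result = (result << 1) | (bit ^ flip)
    return str(ipaddress.IPv4Address(result))


def shared_prefix_bits(a, b):
    """Number of leading bits two IPv4 addresses have in common."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()


key = b"example-key"  # hypothetical key; a real deployment keeps this secret
a, b = "10.1.2.3", "10.1.200.7"
print(shared_prefix_bits(a, b))
print(shared_prefix_bits(anonymize_ipv4(a, key), anonymize_ipv4(b, key)))
```

Both printed values are equal: the two inputs share a 16-bit prefix, and so do their anonymized forms. Because subnet structure survives, analyses that group traffic by prefix remain possible on the anonymized trace.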
We improved our CoralReef-based traffic report generator, which produces publicly accessible real-time reports from the data we capture on traffic monitors. We also publish observed packet size distributions.
Archipelago Measurement Infrastructure:
- Deployed 30 Ark monitors conducting the team-probing experiment for topology discovery.
- Using Google's mapping API, created and published a current map of the Ark measurement infrastructure.
- Presented Archipelago Measurement Infrastructure: Status and Experience at the 10th CAIDA-WIDE-CASFI Workshop in August, 2008.
- Produced numerous scripts to automate and coordinate Ark data collection, management, and archival.
Deployment of Passive Measurement Infrastructure
- Deployed four passive traffic monitors on high-speed, tier 1, OC192 Internet backbone links.
- Developed software to publish real-time traffic reports.
- Published the paper "Internet Traffic Classification Demystified: Myths, Caveats, and the Best Practices" in CONEXT 2008 (Madrid, Spain).
CAIDA hosted visits of graduate students Alberto Dainotti and Maurizio Dusi who worked on various research tasks for the CONMI project.
The CONMI project received support from the NSF grant (CRI 05-51542) "Toward Community-Oriented Network Measurement Infrastructure."
The Protected Repository for the Defense of Infrastructure against Cyber Threats (PREDICT) was designed to provide sensitive security datasets to qualified researchers, while preserving privacy and preventing data misuse. PREDICT seeks to provide a secure technical and policy framework to process applications for data sharing from network providers that include tools for collection, processing, and hosting of data that PREDICT makes available through the program as well as secured infrastructure to support serving datasets to researchers.
CAIDA's involvement in the PREDICT effort included assisting with development of background pieces of the project, from iterating on NDAs to deploying measurement infrastructure to curation of data. CAIDA acts as a data provider and a data hosting site, serving denial-of-service backscatter data, Internet worm data, Network Telescope data, and IP topology data to approved researchers. We also hired, part-time, an attorney with experience in cyberlaw, who gave us feedback on our data descriptions and on the IRB application described below.
CAIDA continued delivering both active and passive data as a PREDICT data provider. Our Macroscopic Topology data deliverable is the IPv4 Routed /24 Topology Dataset collected on Ark infrastructure (which has replaced previous skitter-based measurements). Our passive data deliverables are monthly hour-long traffic traces captured by two monitors (four hosts) on two bidirectional links, one from Seattle, WA to Chicago, IL and another from San Jose, CA to Los Angeles, CA. We clean, anonymize, and distribute these traces to researchers.
In October 2008, we submitted an application to the UCSD Human Research Protections Program (HRPP) office requesting review of our research protocol by the campus Institutional Review Board (IRB). The application covered the general traffic and other data analysis work we have done for the last 10 years, not including any research involving payload (which we define as anything past the TCP/IP header). Although we expected it to go to a full panel review, our application was given expedited review and approved within 10 days. Since we would like to begin a longer conversation with our IRB regarding appropriate conduct during network research, we plan to submit a follow-up application that will propose privacy-respecting payload analysis, and we will ask specifically that it receive a full panel review.
We wrote a report describing the landscape of anonymization tools for network data, "Summary of Anonymization Best Practice Techniques," and created an anonymization bibliography on our web site.
We participated in the first ACM Workshop on Network Data Anonymization (NDA 2008) which convened in Washington DC in association with the 15th ACM Conference on Computer and Communications Security (CCS). The workshop focused on the theory and practice of anonymization as it applies to network data for use by the Internet measurement research community and operators deploying network measurement technologies. PI Dr. Claffy moderated a panel discussion on Economic, Ownership, and Trust Issues in Network Data Sharing. Workshop participants seemed to agree on the need for two documents:
- An ethics-based code of conduct for the network measurement community. PREDICT will host a workshop in 2009 to discuss this topic.
- A case for legislative change. All three attorneys at the workshop echoed the belief that by the time Internet measurement-related legislation comes up for revision, the network measurement community had better have a compelling story for what we want changed and why.
As a result of internal assessment of the research utility and use of the anonymized telescope data balanced against the (small) privacy risk and (large) cost of upkeep, CAIDA decommissioned the current network telescope data collection infrastructure on October 13, 2008 but then had to immediately kickstart it again because of the onset of the Conficker worm. With much help from Professor Stefan Savage and his team in CSE, we are re-implementing the network telescope with a fresh research agenda, data collection and curation methodology, as well as new hardware, in 2009.
We collaborated with Internet2 to draft a proposal for the Network Research Review Council (NRRC), which would act in some ways like an IRB for the Internet2 community. Although Internet2 faces the same privacy, fear-of-lawsuit, and operational cost issues as commercial providers, we hope the NRRC can help Internet2 navigate these issues to better provide network data to researchers.
- Improvements in data collection infrastructure
- Completed conversion from skitter-based active measurements to the Archipelago platform.
- Installed, configured, tested and deployed four OC192 monitors.
- Adding Metadata to PREDICT Portal for Researchers
- Submitted four quarters of the Denial-of-Service (DoS) Backscatter Datasets.
- Submitted Denial-of-Service (DoS) Backscatter-TOCS Dataset.
Graduate student Wolfgang John analyzed data collected on the UCSD Network Telescope, looking for worm traffic.
The PREDICT project received support from the DHS contract (NBCHC 070133) "Supporting Research and Development of Security Technologies through Network and Security Data Collection."
CAIDA's Internet Measurement Data Catalog (IMDC) facilitates access, archiving, and long-term storage of Internet data as well as sharing Internet measurement metadata among Internet researchers. Since its launch in June 2006 at www.datcat.org, the catalog has received contributions of metadata for over 100 collections indexing 150,000+ files totaling over 26TB of data. Funding for the project ended in 2006, but we still hope to add some usability features: (1) incorporate extensive user feedback into development of a streamlined contribution mechanism requiring much less time from the contributor; (2) perform more detailed log analysis of DatCat user behavior, and refine the user interface to reduce the time users spend searching the catalog; (3) maintain and extend the catalog with additional and newer datasets. We're making slight progress on (1) and (3); we'll get back to this in 2009 or 2010 if it's still considered useful.
- New Contributions of Metadata
- During 2008, the DatCat catalog received 37 entries documenting the metadata for collections and publications. Highlights include:
- Collection: Day in the Life of the Internet, March 18-19, 2008 (DITL-2008-03-18)
- Collection: CAIDA Anonymized 2008 Internet Traces Dataset
- Collection: Mesh Routing Data Collection - Routing tables and ScanWireless Scans
- Publication: Traceroute Probe Method and Forward IP Path Inference published 2008-10 in ACM SIGCOMM Internet Measurement Conference
Though no longer directly funded, DatCat received some support in 2008 from:
CAIDA's mission includes providing access to tools for Internet data collection, analysis and visualization to facilitate network measurement and management. However, CAIDA does not receive specific funding for support and maintenance of the tools we develop. Please check our home page for a complete listing and taxonomy of CAIDA tools.
2008 Tool Development
The CoralReef software suite, developed by CAIDA, provides a comprehensive software solution for data collection and analysis from passive Internet traffic monitors, in real time or from trace files. Real-time monitoring support includes system network interfaces (via libpcap) and FreeBSD drivers for a number of network capture cards, including the popular Endace DAG (10GE/OC192, POS and ATM) cards. The package also includes programming APIs for C and Perl, and applications for capture, analysis, and web report generation. This package is maintained by CAIDA developers with the support and collaboration of the Internet measurement community.
We released CoralReef version 3.8.2 late in 2008.
Anonymization Tools Taxonomy
In late 2008, we published the Anonymization Tools Taxonomy to assist those searching for tools to anonymize Internet log files and trace data. The Anonymization Tools Taxonomy provides a summary of each tool along with pointers to more detailed information and, when available, review comments. We also released the Summary of Anonymization Best Practice Techniques as part of the DHS PREDICT Project.
CAIDA Tools Download Report
The table below displays all the CAIDA developed tools distributed via our home page at https://www.caida.org/tools/ and the number of downloads of each version during 2008.
As a change from 2007, this year's download report does not contain accesses by spiders, crawlers, or other robots, nor does it count multiple accesses by the same downloader.
Currently Supported Tools
|Tool||Description||Downloads|
|coralreef||A software suite to collect and analyze data from passive Internet traffic monitors.||1,157|
|dsc||A system for collecting and exploring statistics from DNS servers.||2,154|
|dnsstat||An application that collects DNS queries on UDP port 53 to report statistics.||222|
|dnstop||A libpcap application that displays tables of DNS traffic.||7,405|
|sk_analysis_dump||A tool for analysis of traceroute-like topology data.||242|
|walrus||A tool for interactively visualizing large directed graphs in 3D space.||3,569|
|libsea||A file format and a Java library for representing large directed graphs.||406|
|Chart::Graph||A Perl module that provides a programmatic interface to several popular graphing packages. Note: Chart::Graph is also available on CPAN.org. The numbers here reflect only downloads directly from caida.org, as download statistics from CPAN are not available.||127|
|plot-latlong||A tool for plotting points on geographic maps.||255|
Past Tools (Unsupported)
|Tool||Description||Downloads|
|Mapnet||A tool for visualizing the infrastructure of multiple backbone providers simultaneously.||14,065|
|GeoPlot||A light-weight Java applet that creates a geographical image of a data set.||538|
|GTrace||A graphical front-end to traceroute.||714|
|otter||A tool used for visualizing arbitrary network data that can be expressed as a set of nodes, links or paths.||535|
|plotpaths||An application that displays forward and reverse network path data.||136|
|plankton||A tool for visualizing NLANR's Web Cache Hierarchy.||51|
In 2008, CAIDA captured and curated data from three primary sources of network data: 1) macroscopic topology, 2) passive traffic traces at tier1 Internet backbone links, and 3) passive traffic traces from the UCSD Network Telescope. We derived several datasets from this data that we make publicly available to researchers, including our AS Rank, AS adjacencies, and Router adjacencies datasets as well as several Backscatter datasets. CAIDA makes some data available to anyone without restriction. CAIDA makes a subset of its collected data available only to academic researchers and CAIDA members, with data access subject to Acceptable Use Policies (AUP) designed to protect the privacy of monitored communications, ensure security of network infrastructure, and comply with the terms of our agreements with data providers.
- We collected our first trace data from the equinix-chicago and equinix-sanjose passive monitors connected to tier1 ISP backbone links at Equinix facilities in Chicago, IL, and San Jose, CA.
- We deactivated skitter data collection and transitioned to our next generation topology measurement infrastructure named Archipelago (Ark) for collecting IPv4 topology data.
- We started collecting IPv6 topology data on the Archipelago infrastructure.
- We collected data on the Conficker worm on the UCSD Network Telescope.
Data Collected in 2008
|Data Type||First date||Last date||Total size (on disk)|
|Macroscopic Topology Measurements, IPv4 (Archipelago)||2008-01-01||2008-12-31||259 GB|
|Macroscopic Topology Measurements, IPv6 (Archipelago)||2008-12-12||2008-12-31||7.6 MB|
|Internet backbone Traces||2008-03-19||2008-12-17||2.0 TB|
|Network Telescope||2008-01-01||2008-12-31||7.2 TB|
|A Day In The Life (DITL) of the Internet - OARC 2008||2008-03-18||2008-03-19||1.9 TB|
|DNS Names for IPv4 Routed /24 Topology Dataset||2008-03-01||2008-12-31||11 GB *|
|DNS root/gTLD RTT Dataset||2008-01-01||2008-12-31||1.5 GB|
* Size of this dataset may vary as we store and serve a rotating window of the last 30 days for this dataset.
Data Distributed in 2008
We process raw data into specialized datasets to increase its utility to researchers and to satisfy security and privacy concerns. In 2008, this resulted in the following datasets:
- Backscatter-2008 Dataset
- Inferred AS Relationships Dataset (Ongoing)
- AS Links Dataset (Ongoing)
- DNS Names from Topology Measurements (Ongoing)
- DNS Traces from Topology Measurements (Ongoing)
Publicly Available Data
These datasets require that users agree to an Acceptable Use Policy, but are otherwise freely available.
|Dataset||Unique visitors (IPs)||Data Downloaded|
|AS Rank||4200||10.8 GB|
|AS Links (AS Adjacencies)||571||2.69 GB|
|AS Relationships||779||11.4 GB|
|Router Adjacencies||106||316 MB|
|Witty Worm Dataset||139||244 MB|
|AS Taxonomy||45||23.9 MB *|
|Code-Red Worms Dataset||115||9.07 GB|
We count the volume of data downloaded per unique user per unique file, so even if a user downloads a file 100 times, we only count that file once for that user. This methodology results in significantly undercounting the total volume of data served through our dataservers in 2008, but is necessary because of limitations in dataserver logging combined with aberrant user behaviour.
* AS Taxonomy dataset is included in a mirror of the GA Tech main AS Taxonomy site, and thus does not represent all access to this data.
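The counting rule described in the note above (each file counted once per user, no matter how often that user fetches it) can be sketched as follows. The log format, user names, file names, and sizes are hypothetical and serve only to show the deduplication step; they are not CAIDA's dataserver logs.

```python
def counted_volume(download_log, file_sizes):
    """Sum file sizes over unique (user, file) pairs in the log."""
    unique_pairs = {(user, filename) for user, filename in download_log}
    return sum(file_sizes[filename] for _, filename in unique_pairs)


# Hypothetical log: one user fetches the same file 100 times.
log = [("alice", "as-rank.txt")] * 100
log += [("bob", "as-rank.txt"), ("alice", "as-links.txt")]
sizes = {"as-rank.txt": 5_000_000, "as-links.txt": 2_000_000}  # bytes

total = counted_volume(log, sizes)
print(total)  # 5 MB for alice + 5 MB for bob + 2 MB for alice's second file
```

In this example the raw log represents 507 MB of transfers but only 12 MB is counted, which shows why the reported totals understate the volume actually served.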
Restricted Access Data
These datasets require that users:
- be academic or government researchers, or join CAIDA;
- request an account and provide a brief description of their intended use of the data; and
- agree to an Acceptable Use Policy.
|Dataset||Unique visitors (usernames)||Data Downloaded *|
|Anonymized Internet Backbone Traces||122||3.31 TB|
|Backscatter Datasets||42||1.81 TB|
|(Raw Topology Traces from Archipelago infrastructure)||58||702 GB|
|Raw Topology Traces (skitter)||48||496 GB|
|Witty Worm Dataset||20||144 GB|
|DNS Names for IPv4 Routed /24 Topology Dataset||32||3.57 GB|
|2003 Internet Topology Data Kit||48||8.10 GB|
|DNS Root/gTLD server RTT Dataset||6||14.7 MB|
* We count the volume of data downloaded per unique user per unique file, so even if a user downloads a file 100 times, we only count that file once for that user. This methodology results in significantly undercounting the total volume of data served through our dataservers in 2008, but is necessary because of limitations in dataserver logging combined with aberrant user behaviour.
Restricted Access Data Requests
The table below shows how many requests for data access we received and how many we granted. We received 60% more requests in 2008 than in 2007, and approved 50% more requests for access to restricted datasets.
|Dataset||Number of requests received||Number of requests granted access|
|Anonymized Backbone and Peering Link Traces||207||139|
|Active Topology Trace Datasets||134||77|
|Backscatter Datasets||109||52|
|Witty Worm Dataset||33||21|
|DNS Root/gTLD server RTT Dataset||12||7|
|Totals||495||296|
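As a quick arithmetic check on the request figures above, the short sketch below recomputes the totals and derives a grant rate per dataset. The rates are derived here from the reported counts; they are not figures stated in this report.

```python
# (requests received, requests granted), copied from the table above.
REQUESTS = {
    "Anonymized Backbone and Peering Link Traces": (207, 139),
    "Active Topology Trace Datasets": (134, 77),
    "Backscatter Datasets": (109, 52),
    "Witty Worm Dataset": (33, 21),
    "DNS Root/gTLD server RTT Dataset": (12, 7),
}

total_received = sum(received for received, _ in REQUESTS.values())
total_granted = sum(granted for _, granted in REQUESTS.values())
grant_rates = {
    name: granted / received for name, (received, granted) in REQUESTS.items()
}

print(total_received, total_granted)
for name, rate in grant_rates.items():
    print(f"{name}: {rate:.1%}")
print(f"Overall: {total_granted / total_received:.1%}")
```

The recomputed totals agree with the Totals row (495 received, 296 granted), and the overall grant rate works out to roughly 60%.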
This data was collected by the NLANR project. When this project came to an end in July 2006, CAIDA inventoried NLANR equipment and took over temporary curation and distribution of NLANR data. CAIDA now maintains both the NLANR AMP and PMA public data repositories, on a best effort basis. Our efforts of serving this data are currently unfunded and we plan to cease serving this data in May 2009. For sponsorship or taking over hosting responsibility for this data please contact firstname.lastname@example.org. We've had significant outages and gaps in logging, which make it impossible for us to provide relevant statistics for the AMP Topology Traces.
|Dataset||Unique visitors (IPs)||Data Downloaded|
|PMA Traffic Traces||3991||206 TB|
|AMP Topology Traces||--||--|
As part of our mission to investigate both practical and theoretical aspects of the Internet, CAIDA hosted the 9th CAIDA-WIDE Workshop, co-hosted the SFI Workshop on Networks and Navigation, and held the 1st CAIDA-WIDE-CASFI workshop.
Please check our web site for a complete listing of past and upcoming CAIDA workshops.
The 9th CAIDA/WIDE workshop was held on January 19th and 20th, 2008 (by invitation only) in the East-West Center on the University of Hawaii campus as part of Techs in Paradise (TIP2008). The main topics presented and discussed at the workshop included: Internet measurement projects and DNS. We used the venue to discuss the upcoming 2008 Day In The Life of the Internet event and compiled a list of the top questions and data types.
On August 4-6, 2008, the Santa Fe Institute (SFI) hosted an interdisciplinary "Networks and Navigation" workshop jointly organized and supported by SFI and CAIDA. The main topics of discussion included similarities between complex networks that share small-world characteristics, and navigation of such networks using only local information. A deeper understanding of the origin of these locally-navigable structures would (1) clarify the role (if any) that latent metric spaces play in the navigability of networks, and potentially point to novel generative mechanisms based on such spaces, (2) point us toward novel routing algorithms and search protocols for Internet-like topologies, other communication networks, and possibly social networks, (3) shed light on the potentially different behavior of passive versus active spreading on these networks, i.e., diffusion versus search, and (4) identify the relationship between navigability and other network properties, e.g., community structure and degree heterogeneity.
The 10th CAIDA/WIDE workshop was held on August 15-16, 2008 (by invitation only) in Marina del Rey, CA. This workshop supported a three-way collaboration between researchers from CAIDA (USA), WIDE (Japan) and CASFI (South Korea). The main topics presented and discussed at the workshop included: updates on active Internet measurements of Internet topology, reverse paths, DNS source port randomness, analysis of DITL 2008 data, trends in residential user traffic, and automated application signature generation for traffic identification.
The following table contains the papers published by CAIDA for the calendar year of 2008. Please refer to Papers by CAIDA on our web site for a comprehensive listing of publications.
||IP Alias Resolution Techniques: Technical Report||Cooperative Association for Internet Data Analysis (CAIDA)|
||Internet Traffic Classification Demystified: Myths, Caveats, and the Best Practices||ACM SIGCOMM Conference on emerging Networking EXperiments and Technologies (CoNEXT)|
||A Day at the Root of the Internet||ACM SIGCOMM Computer Communication Review (CCR)|
||Influence Maps - a novel 2-D visualization of massive geographically distributed data sets||Internet Protocol Forum|
||Traceroute Probe Method and Forward IP Path Inference||ACM Internet Measurement Conference (IMC)|
||Realistic Topology Modeling for the Internet BGP Infrastructure||IEEE Symposium on Modeling, Analysis and Simulation of Computers and Telecommunication Systems (MASCOTS)|
||Ten Things Lawyers Should Know About Internet Research||Cooperative Association for Internet Data Analysis (CAIDA)|
||Scale-free networks as pre-asymptotic regimes of super-linear preferential attachment||Physical Review E|
||The COMMONS Initiative: Cooperative Measurement and Modeling of Open Networked Systems||CommLaw Conspectus|
||On Cycles in AS Relationships||ACM SIGCOMM Computer Communication Review (CCR)|
||Efficient Navigation in Scale-Free Networks Embedded in Hyperbolic Metric Spaces||arXiv cond-mat.stat-mech/0805.1266|
||Self-similarity of complex networks and hidden metric spaces||Physical Review Letters|
CAIDA staff and collaborators actively attend and contribute to relevant workshops and conferences and other events to present our research and gain better understanding of Internet infrastructure, trends, topology, routing, and security. Last year, CAIDA staff presented at the ARIN meeting, DHS/SRI, the University of Chile and Jornadas Chilenas de Computación, the Internet Measurement Conference, University of Aveiro and DHS Cybersecurity PI Meeting, the Santa Fe Institute, NANOG, our own WIDE and WIDE-CASFI workshops and DNS/OARC workshops.
The following table contains the presentations and invited talks published by CAIDA for the calendar year of 2008. Please refer to Presentations by CAIDA on our web site for a comprehensive listing.
In 2008, CAIDA's web site continued to attract considerable attention from a broad, international audience. Visitors seem to have particular interest in CAIDA's tools and analysis.
The table below presents the monthly history of traffic to www.caida.org for 2008. To show a more accurate representation of website traffic, these statistics do not include traffic from spiders, crawlers or other robots.
|Month||Unique visitors||Number of visits||Pages||Hits||Bandwidth (GB)|
|Jan 2008||40,897||72,619||307,839||1,362,935||120.67 GB|
|Feb 2008||44,872||74,938||249,354||1,437,310||58.29 GB|
|Mar 2008||42,439||78,108||278,366||1,535,978||54.83 GB|
|Apr 2008||46,375||79,438||280,187||1,571,985||53.85 GB|
|May 2008||46,354||78,448||272,240||1,440,674||50.09 GB|
|Jun 2008||45,458||76,786||248,134||1,431,765||41.89 GB|
|Jul 2008||44,671||75,530||235,541||1,427,714||42.61 GB|
|Aug 2008||49,759||77,842||279,990||1,686,476||56.82 GB|
|Sep 2008||48,804||72,929||272,963||1,607,294||49.06 GB|
|Oct 2008||50,830||76,711||279,988||1,596,161||59.53 GB|
|Nov 2008||45,743||70,342||233,295||1,367,139||52.13 GB|
|Dec 2008||41,957||63,733||225,140||1,301,988||46.05 GB|
CAIDA would like to acknowledge the many people who put forth great effort towards making CAIDA a success in 2008. The image below shows the functional organization of CAIDA. Please check the CAIDA Staff page for more complete information about CAIDA staff.
CAIDA Functional Organization Chart
CAIDA thanks our 2008 sponsors, members, and collaborators.
The charts below depict funds received by CAIDA during the 2008 calendar year.
|Funding Source||Allocations||Percentage of Total|
Figure 1. Allocations by funding source received during 2008
The charts below depict CAIDA's Annual Expense Report for the 2008 calendar year.
|LABOR||Salaries and benefits paid to staff and students|
|IDC||Indirect Costs paid to the University of California, San Diego including grant overhead (52-54%) and telephone, Internet, and other IT services.|
|SUBCONTRACTS||Subcontracts to the Internet Systems Consortium (ISC), Georgia Institute of Technology, and The Measurement Factory|
|TRAVEL||Trips to conferences, PI meetings, operational meetings, and sites of remote monitor deployment.|
|SUPPLIES & EXPENSES||All office supplies and equipment (including computer hardware and software) costing less than $5000.|
|EQUIPMENT||Computer hardware or other equipment costing more than $5000.|
|TRANSFERS||Exchange of funds between groups for recharge for IT desktop support and Oracle database services.|
|Program Area||Expenses||Percentage of Total|
|Supplies & Expenses||67,663||2.5%|
Figure 2. 2008 Operating Expenses
|Program Area||Expenses||Percentage of Total|
Figure 3. 2008 Expenses by Program Area