IODA: Detection and analysis of large-scale Internet infrastructure outages - Project Summary
The Internet Outage Detection and Analysis (IODA) project will apply successful results in analyzing large-scale Internet outages to the development, testing, and deployment of an operational capability to detect, monitor, and characterize future episodes of Internet connectivity disruptions.
Principal Investigators: Alberto Dainotti, kc claffy
Funding source: CNS-1228994
Period of performance: September 1, 2012 - August 31, 2016
Project Summary
Our dependence on the Internet has rapidly grown much stronger than our comprehension of its underlying structure, global dynamics, operational threats, and overall network health. Wide-scale Internet service disruptions, and even politically motivated interference with Internet access in order to hinder anti-government organization, are not new. But the scale, duration, coverage, and violent context of the government-mandated country-level Internet censorship episodes in 2011 inspired scientific as well as popular interest in capabilities to not only detect reachability problems but also quickly and thoroughly characterize their causes.
We have developed and demonstrated a methodology that can identify not only which networks have been affected by an outage, but also which techniques have been used to effect a deliberate disruption (e.g., control plane vs. data plane intervention). We have also developed metrics to quantitatively gauge the geographic and topological extent of impact of geophysical disasters on Internet infrastructure, and techniques to investigate the chronological dynamics of the outage and restoration. Our approach relies on:
- the extraction of signal from Internet background radiation (IBR), a pervasive and continuous source of malware-induced traffic;
- the combination of multiple types of data (active probing, passive IBR measurement, BGP routing data, and address geolocation and registry databases) to assess the scope and progression of the outage.
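As a purely illustrative sketch, and not the project's actual implementation, the idea of combining several data sources can be shown with a toy detector that flags a region when enough sources fall well below their recent baseline. The function names, the signal names, the sample values, the 50% threshold, and the two-of-three voting rule are all assumptions made for this example.

```python
from statistics import median


def dropped(history, current, ratio=0.5):
    """Return True if `current` is below `ratio` times the median of `history`."""
    baseline = median(history)
    return baseline > 0 and current < ratio * baseline


def assess_outage(signals, min_votes=2):
    """Combine per-source drop indicators for one region.

    `signals` maps a source name to a (history, current) pair, where
    `history` is a list of recent values and `current` is the latest value.
    Returns (is_outage, names_of_sources_that_dropped).
    """
    flagged = sorted(
        name
        for name, (history, current) in signals.items()
        if dropped(history, current)
    )
    return len(flagged) >= min_votes, flagged


# Hypothetical measurements for a single region (values are made up).
region = {
    "ibr_unique_sources": ([980, 1010, 995, 1002], 120),
    "bgp_visible_prefixes": ([400, 402, 401, 400], 35),
    "active_probe_responses": ([250, 248, 252, 251], 240),
}

is_outage, evidence = assess_outage(region)
print(is_outage, evidence)
```

In this toy setup, the list of sources that dropped gives a rough hint about the kind of event: for example, a drop in the traffic-based signals while routing data stays stable would point toward a data plane rather than a control plane disruption. A real system would need per-source baselines and thresholds chosen from measurement experience.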
This project will result in an experimental operational deployment to validate and extend an empirically grounded methodology for detection and analysis of large-scale Internet outages. In addition to improving our understanding of how measurements yield insights into network behavior and strengthening our ability to model large-scale complex networks, use of such a system will illuminate infrastructure vulnerabilities that derive from architectural, topological, or economic constraints, suggesting how to mitigate or eliminate these weaknesses in future Internet architecture and measurement research. A deployed platform will be able to detect and monitor connectivity disruption and censorship events on a planetary scale, giving national decision-makers, who must determine the type and extent of proper response, situational awareness of the nature and causes of network outages.
Management Plan
The requested budget supports approximately 2 full-time positions (25 person-months of effort per year) at CAIDA. The main proposed tasks overlap in time, and each task will inform the others:
- Task 1: investigating and defining strategies and methodologies for combining multiple heterogeneous data sources to detect and characterize outage events (Years 1, 2, and 3);
- Task 2: defining (and refining) the system requirements for continuous monitoring and (near) real-time analysis of outages as they occur (starting in the second half of Year 1);
- Task 3: testing and experimental deployment of such a system (Years 2 and 3).
Additional ongoing project activities will include:
- developing project web pages to track project progress and disseminate data and tools;
- maintaining a blog for timely dissemination of analysis and discussion of detected events;
- coordinating our observations with other research and operational groups;
- interacting with various stakeholders interested in our results.
Subtask | Description | Projected Timeline | Status |
---|---|---|---|
1.1 | Select a geolocation license provider for the project and purchase a license | Year 1 | done |
1.2 | Define prefix and AS groupings by countries and/or by geographic regions | Year 1 | done |
1.3 | Work with UCSD telescope researchers to define most relevant IBR traffic indicators | Year 1 | done |
1.4 | Start developing automated methods of monitoring prefix reachability in BGP tables | Year 1 | done |
1.5 | Experiment with more frequent probing of globally routed prefixes by the Ark platform | Year 1 | done |
1.6 | Test on-demand active probing capabilities of the Ark measurement infrastructure | Year 1 | done |
1.7 | Investigate combined indicators for event detection, characterization, and analysis | Year 1 | done |
2.1 | Evaluate the volume of data that needs to be stored locally | Year 1 (2nd half) | done |
2.2 | Evaluate the size of the time window for data aggregation and processing | Year 1 (2nd half) | done |
2.3 | Evaluate the computational resources required for fast ongoing processing | Year 1 (2nd half) | done |
2.4 | Analyze the feasibility of emerging requirements, balancing storage and processing resources vs. desired functionality vs. cost | Year 1 (2nd half) | done |
1.8 | Experiment with monitoring and analyzing IBR traffic by geographic regions | Year 2 | done |
1.9 | Evaluate indicators for detection, characterization, and analysis of events with specific regard to aggregation by geographic region | Year 2 | done |
1.10 | Develop methods to integrate BGP data from Route Views and from RIPE RIS | Year 2 | done |
1.11 | Develop methods to integrate probing data from CAIDA's Ark and RIPE's Atlas platforms | Year 2 | done |
1.12 | Develop triggers for on-demand active probing based on observed routing changes | Year 2 | done |
1.13 | Develop triggers for on-demand active probing based on observed IBR traffic changes | Year 2 | done |
1.14 | Develop and integrate change-point detection algorithms into the system | Year 2 | done |
2.5 | Specify hardware parameters, obtain quotes, and purchase compute server and disk storage | Year 2 | done |
2.6 | Put the server and storage into production mode | Year 2 | done |
2.7 | Design and prototype web interfaces to control input/output of the monitoring system | Year 2 | done |
2.8 | Define efficient data structures for detection algorithms | Year 2 | done |
2.9 | Design and prototype web interfaces to present the analysis results | Year 2 | done |
2.10 | Define requirements for merging routing data from Route Views and RIPE RIS | Year 2 | done |
2.11 | Define requirements for merging active probing data from CAIDA Ark and RIPE Atlas | Year 2 | done |
2.12 | Document the system requirements and the selected design | Year 2 | done |
3.1 | Implement a software library for a common layer of functions and data structures | Year 2 | done |
3.2 | Implement a web-based interface to focus the monitoring process on specific regions and/or to use specific subsets of data | Year 2 | done |
3.3 | Implement the monitoring software modules | Year 2 | done |
3.4 | Implement the inference software modules | Year 2 | done |
3.5 | Create informative demos of the system capabilities | Year 2 | done |
3.6 | Implement interactive web interface to visualize the results of data analysis | Year 2 | done |
1.15 | Evaluate the efficiency of implemented automated detection algorithms | Year 3 (1st half) | done |
1.16 | Evaluate the effectiveness of data integration, data visualization, and user interface | Year 3 (1st half) | done |
2.13 | (optional) Investigate the possibility of plugging in additional data sources | Year 3 (1st half) | done |
2.14 | If necessary, adjust the system requirements based on experience | Year 3 (1st half) | done |
2.15 | Update the documentation | Year 3 (1st half) | done |
3.8 | Test the system on real cases | Year 3 | done |
3.9 | Experiment with various methods to deliver alerts (e.g., email, instant messaging) | Year 3 | done |
3.10 | Release the developed software under an open source license | Year 3 | done |
3.11 | Evaluate the potential impacts (positive and negative) of our analysis and dissemination of results on the network operators involved in the observed outage cases | Year 3 | done |