IODA: Detection and analysis of large-scale Internet infrastructure outages - Project Summary

The Internet Outage Detection and Analysis (IODA) project will apply successful results in analyzing large-scale Internet outages to the development, testing, and deployment of an operational capability to detect, monitor, and characterize future episodes of Internet connectivity disruptions.

Sponsored by:
National Science Foundation (NSF)

Principal Investigators: Alberto Dainotti, kc claffy

Funding source: NSF CNS-1228994
Period of performance: September 1, 2012 - August 31, 2016.


Project Summary

Our dependence on the Internet has rapidly grown much stronger than our comprehension of its underlying structure, global dynamics, operational threats, and overall network health. Wide-scale Internet service disruptions and even politically motivated interference with Internet access in order to hinder anti-government organization are not new. But the scale, duration, coverage, and violent context of the government-mandated country-level Internet censorship episodes in 2011 inspired scientific as well as popular interest in capabilities to not only detect but quickly and thoroughly characterize the causes of reachability problems.

We have developed and demonstrated a methodology that can identify not only which networks have been affected by an outage, but also which techniques have been used to effect a deliberate disruption (e.g., control plane vs. data plane intervention). We have also developed metrics to quantitatively gauge the geographic and topological extent of impact of geophysical disasters on Internet infrastructure, and techniques to investigate the chronological dynamics of the outage and restoration. Our approach relies on:

  • extracting signal from Internet background radiation (IBR), a pervasive and continuous source of malware-induced background traffic;
  • combining multiple types of data (active probing, passive IBR measurement, BGP routing data, and address geolocation and registry databases) to assess the scope and progression of the outage.
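
As a rough illustration of how several data sources might be combined, the sketch below flags time bins in which at least two signals drop sharply relative to their recent history. It is not the project's detection method: the function names, the trailing-median baseline, the 50% threshold, and the sample values are all assumptions chosen for illustration.

```python
from statistics import median


def drop_flags(series, window=6, threshold=0.5):
    """Flag bins whose value falls below threshold * median of the trailing window."""
    flags = []
    for i, value in enumerate(series):
        history = series[max(0, i - window):i]
        if len(history) < window:
            flags.append(False)  # not enough history to form a baseline
            continue
        baseline = median(history)
        flags.append(baseline > 0 and value < threshold * baseline)
    return flags


def combined_outage_bins(signals, min_agree=2, window=6, threshold=0.5):
    """Return indices of bins where at least min_agree signals show a drop."""
    per_signal = [drop_flags(s, window, threshold) for s in signals.values()]
    return [i for i, votes in enumerate(zip(*per_signal)) if sum(votes) >= min_agree]


# Hypothetical per-country time series, one value per time bin.
signals = {
    "ibr_unique_sources": [100, 98, 102, 101, 99, 100, 12, 10, 11, 97],
    "bgp_visible_prefixes": [50, 50, 50, 50, 50, 50, 20, 20, 20, 50],
    "probe_response_rate": [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.85, 0.2, 0.2, 0.9],
}

print(combined_outage_bins(signals))  # prints [6, 7, 8]
```

Requiring agreement between independent signals is one simple way to reduce false alarms caused by noise in any single data source; a real system would need a more careful baseline and per-signal tuning.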

This project will result in an experimental operational deployment to validate and extend an empirically grounded methodology for detection and analysis of large-scale Internet outages. In addition to improving our understanding of how measurements yield insights into network behavior, and strengthening our ability to model large-scale complex networks, use of such a system will also illuminate infrastructure vulnerabilities that derive from architectural, topological, or economic constraints, suggesting how to mitigate or eliminate these weaknesses in future Internet architecture and measurement research. A deployed platform will be able to detect and monitor connectivity disruption and censorship events on a planetary scale, giving national decision-makers, who must determine the type and extent of proper response, situational awareness of the nature and causes of network outages.

Management Plan

The requested budget supports approximately two full-time positions (25 person-months of effort per year) at CAIDA. The main proposed tasks overlap in time, and each task will inform the others:

  • Task 1: investigating and defining strategies and methodologies for how to combine multiple heterogeneous data sources to detect and characterize outage events (Years 1, 2, and 3);
  • Task 2: defining (and refining) the system requirements for continuous monitoring and (near) real-time analysis of outages as they occur (will start in the second half of Year 1);
  • Task 3: testing and experimental deployment of such a system (Years 2 and 3).

Additional ongoing project activities will include:

  • developing project web pages to track project progress and disseminate data and tools;
  • maintaining a blog for timely dissemination of analysis and discussion of detected events;
  • coordinating our observations with other research and operational groups;
  • interacting with various stakeholders interested in our results.

The tentative schedule below details subtasks for each year of the project.

Subtask | Description | Projected Timeline | Status
--------|-------------|--------------------|-------
1.1 | Select a geolocation license provider for the project and purchase a license | Year 1 | done
1.2 | Define prefix and AS groupings by countries and/or by geographic regions | Year 1 | done
1.3 | Work with UCSD telescope researchers to define most relevant IBR traffic indicators | Year 1 | done
1.4 | Start developing automated methods of monitoring prefix reachability in BGP tables | Year 1 | done
1.5 | Experiment with more frequent probing of globally routed prefixes by the Ark platform | Year 1 | done
1.6 | Test on-demand active probing capabilities of the Ark measurement infrastructure | Year 1 | done
1.7 | Investigate combined indicators for event detection, characterization, and analysis | Year 1 | done
2.1 | Evaluate the volume of data that needs to be stored locally | Year 1 (2nd half) | done
2.2 | Evaluate the size of the time window for data aggregation and processing | Year 1 (2nd half) | done
2.3 | Evaluate the computational resources required for fast ongoing processing | Year 1 (2nd half) | done
2.4 | Analyze the feasibility of emerging requirements, balancing storage and processing resources vs. desired functionality vs. cost | Year 1 (2nd half) | done
1.8 | Experiment with monitoring and analyzing IBR traffic by geographic regions | Year 2 | done
1.9 | Evaluate indicators for detection, characterization, and analysis of events with specific regard to aggregation by geographic region | Year 2 | done
1.10 | Develop methods to integrate BGP data from Route Views and from RIPE RIS | Year 2 | done
1.11 | Develop methods to integrate probing data from CAIDA's Ark and RIPE's Atlas platforms | Year 2 | done
1.12 | Develop triggers for on-demand active probing based on observed routing changes | Year 2 | done
1.13 | Develop triggers for on-demand active probing based on observed IBR traffic changes | Year 2 | done
1.14 | Develop and integrate change-point detection algorithms into the system | Year 2 | done
2.5 | Specify hardware parameters, obtain quotes, and purchase compute server and disk storage | Year 2 | done
2.6 | Put the server and storage into production mode | Year 2 | done
2.7 | Design and prototype web interfaces to control input/output of the monitoring system | Year 2 | done
2.8 | Define efficient data structures for detection algorithms | Year 2 | done
2.9 | Design and prototype web interfaces to present the analysis results | Year 2 | done
2.10 | Define requirements for merging routing data from Route Views and RIPE RIS | Year 2 | done
2.11 | Define requirements for merging active probing data from CAIDA Ark and RIPE Atlas | Year 2 | done
2.12 | Document the system requirements and the selected design | Year 2 | done
3.1 | Implement a software library for a common layer of functions and data structures | Year 2 | done
3.2 | Implement a web-based interface to focus the monitoring process on specific regions and/or to use specific subsets of data | Year 2 | done
3.3 | Implement the monitoring software modules | Year 2 | done
3.4 | Implement the inference software modules | Year 2 | done
3.5 | Create informative demos of the system capabilities | Year 2 | done
3.6 | Implement interactive web interface to visualize the results of data analysis | Year 2 | done
1.15 | Evaluate the efficiency of implemented automated detection algorithms | Year 3 (1st half) | done
1.16 | Evaluate the effectiveness of data integration, data visualization, and user interface | Year 3 (1st half) | done
2.13 | (optional) Investigate the possibility to plug-in additional data sources | Year 3 (1st half) | done
2.14 | If necessary, adjust the system requirements based on experience | Year 3 (1st half) | done
2.15 | Update the documentation | Year 3 (1st half) | done
3.8 | Test the system on real cases | Year 3 | done
3.9 | Experiment with various methods to deliver alerts (e.g., email, instant messaging) | Year 3 | done
3.10 | Release the developed software under an open source license | Year 3 | done
3.11 | Evaluate the potential impacts (positive and negative) of our analysis and dissemination of results on the network operators involved in the observed outage cases | Year 3 | done
