IODA: Detection and analysis of large-scale Internet infrastructure outages - Project Summary

The Internet Outage Detection and Analysis (IODA) project will apply successful results in analyzing large-scale Internet outages to the development, testing, and deployment of an operational capability to detect, monitor, and characterize future episodes of Internet connectivity disruptions.

Project Summary

Our dependence on the Internet has grown much faster than our comprehension of its underlying structure, global dynamics, operational threats, and overall network health. Wide-scale Internet service disruptions, and even politically motivated interference with Internet access to hinder anti-government organization, are not new. But the scale, duration, coverage, and violent context of the government-mandated country-level Internet censorship episodes in 2011 inspired scientific as well as popular interest in capabilities not only to detect reachability problems but also to characterize their causes quickly and thoroughly.

We have developed and demonstrated a methodology that can identify not only which networks have been affected by an outage, but also which techniques have been used to effect a deliberate disruption (e.g., control plane vs. data plane intervention). We have also developed metrics to quantitatively gauge the geographic and topological extent of the impact of geophysical disasters on Internet infrastructure, and techniques to investigate the chronological dynamics of an outage and its restoration. Our approach relies on:

extracting signal from Internet background radiation (IBR), a pervasive and continuous source of malware-induced traffic;
combining multiple types of data (active probing, passive IBR measurement, BGP routing data, and address geolocation and registry databases) to assess the scope and progression of the outage.
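
The idea of combining several data types can be illustrated with a minimal sketch. It is not the project's actual algorithm: it assumes each source has already been reduced to a per-region count per time bin (for example, visible BGP prefixes, responsive probed blocks, or unique IBR source addresses), and it raises an alert only when several sources drop together. The function names, the 0.5 threshold, the two-source agreement rule, and the sample numbers are all hypothetical.

```python
from statistics import median

def source_drops(history, current, threshold=0.5):
    """Return True if `current` falls below `threshold` times the median of `history`."""
    baseline = median(history)
    return baseline > 0 and current < threshold * baseline

def combined_alert(signals, min_agreeing=2, threshold=0.5):
    """Combine per-source drop checks into one alert.

    `signals` maps a source name to a (history, current) pair.
    Returns (alert, sorted names of the sources that dropped).
    """
    dropped = sorted(
        name
        for name, (history, current) in signals.items()
        if source_drops(history, current, threshold)
    )
    return len(dropped) >= min_agreeing, dropped

# Hypothetical counts for one region: recent bins, then the current bin.
signals = {
    "bgp": ([120, 118, 121, 119], 12),
    "active_probing": ([900, 910, 905, 895], 880),
    "ibr": ([4000, 4200, 3900, 4100], 300),
}
alert, dropped = combined_alert(signals)
print(alert, dropped)
```

Requiring agreement between independent sources is one simple way to reduce false alarms caused by a measurement artifact in any single data set; a real system would also need to handle noise, seasonality, and missing data.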

This project will result in an experimental operational deployment to validate and extend an empirically grounded methodology for detection and analysis of large-scale Internet outages. In addition to improving our understanding of how measurements yield insights into network behavior, and strengthening our ability to model large-scale complex networks, use of such a system will also illuminate infrastructure vulnerabilities that derive from architectural, topological, or economic constraints, suggesting how to mitigate or eliminate these weaknesses in future Internet architecture and measurement research. A deployed platform will be able to detect and monitor connectivity disruption and censorship events on a planetary scale, giving national decision-makers who must determine the type and extent of a proper response situational awareness of the nature and causes of network outages.

Management Plan

The requested budget supports approximately two full-time positions (25 person-months of effort per year) at CAIDA. The main proposed tasks overlap in time, and each task will inform the others:

Task 1: investigating and defining strategies and methodologies for combining multiple heterogeneous data sources to detect and characterize outage events (Years 1, 2, and 3);
Task 2: defining (and refining) the system requirements for continuous monitoring and (near) real-time analysis of outages as they occur (starting in the second half of Year 1);
Task 3: testing and experimental deployment of such a system (Years 2 and 3).

Additional ongoing project activities will include:

developing project web pages to track project progress and disseminate data and tools;
maintaining a blog for timely dissemination of analysis and discussion of detected events;
coordinating our observations with other research and operational groups;
interacting with various stakeholders interested in our results.

The tentative schedule below details subtasks for each year of the project.

Subtask	Description	Projected Timeline	Status
1.1	Select a geolocation license provider for the project and purchase a license	Year 1	done
1.2	Define prefix and AS groupings by countries and/or by geographic regions	Year 1	done
1.3	Work with UCSD telescope researchers to define most relevant IBR traffic indicators	Year 1	done
1.4	Start developing automated methods of monitoring prefix reachability in BGP tables	Year 1	done
1.5	Experiment with more frequent probing of globally routed prefixes by the Ark platform	Year 1	done
1.6	Test on-demand active probing capabilities of the Ark measurement infrastructure	Year 1	done
1.7	Investigate combined indicators for event detection, characterization, and analysis	Year 1	done
2.1	Evaluate the volume of data that needs to be stored locally	Year 1 (2nd half)	done
2.2	Evaluate the size of the time window for data aggregation and processing	Year 1 (2nd half)	done
2.3	Evaluate the computational resources required for fast ongoing processing	Year 1 (2nd half)	done
2.4	Analyze the feasibility of emerging requirements, balancing storage and processing resources vs. desired functionality vs. cost	Year 1 (2nd half)	done
1.8	Experiment with monitoring and analyzing IBR traffic by geographic regions	Year 2	done
1.9	Evaluate indicators for detection, characterization, and analysis of events with specific regard to aggregation by geographic region	Year 2	done
1.10	Develop methods to integrate BGP data from Route Views and from RIPE RIS	Year 2	done
1.11	Develop methods to integrate probing data from CAIDA's Ark and RIPE's Atlas platforms	Year 2	done
1.12	Develop triggers for on-demand active probing based on observed routing changes	Year 2	done
1.13	Develop triggers for on-demand active probing based on observed IBR traffic changes	Year 2	done
1.14	Develop and integrate change-point detection algorithms into the system	Year 2	done
2.5	Specify hardware parameters, obtain quotes, and purchase compute server and disk storage	Year 2	done
2.6	Put the server and storage into production mode	Year 2	done
2.7	Design and prototype web interfaces to control input/output of the monitoring system	Year 2	done
2.8	Define efficient data structures for detection algorithms	Year 2	done
2.9	Design and prototype web interfaces to present the analysis results	Year 2	done
2.10	Define requirements for merging routing data from Route Views and RIPE RIS	Year 2	done
2.11	Define requirements for merging active probing data from CAIDA Ark and RIPE Atlas	Year 2	done
2.12	Document the system requirements and the selected design	Year 2	done
3.1	Implement a software library for a common layer of functions and data structures	Year 2	done
3.2	Implement a web-based interface to focus the monitoring process on specific regions and/or to use specific subsets of data	Year 2	done
3.3	Implement the monitoring software modules	Year 2	done
3.4	Implement the inference software modules	Year 2	done
3.5	Create informative demos of the system capabilities	Year 2	done
3.6	Implement interactive web interface to visualize the results of data analysis	Year 2	done
1.15	Evaluate the efficiency of implemented automated detection algorithms	Year 3 (1st half)	done
1.16	Evaluate the effectiveness of data integration, data visualization, and user interface	Year 3 (1st half)	done
2.13	(optional) Investigate the possibility of plugging in additional data sources	Year 3 (1st half)	done
2.14	If necessary, adjust the system requirements based on experience	Year 3 (1st half)	done
2.15	Update the documentation	Year 3 (1st half)	done
3.8	Test the system on real cases	Year 3	done
3.9	Experiment with various methods to deliver alerts (e.g., email, instant messaging)	Year 3	done
3.10	Release the developed software under an open source license	Year 3	done
3.11	Evaluate the potential impacts (positive and negative) of our analysis and dissemination of results on the network operators involved in the observed outage cases	Year 3	done