Bibliography Details

C. Labovitz, A. Ahuja, and F. Jahanian, "Experimental Study of Internet Stability and Wide-Area Backbone Failures", Tech. Rep. CSE-TR-382-98, University of Michigan, 1998.

Experimental Study of Internet Stability and Wide-Area Backbone Failures
Authors: C. Labovitz, A. Ahuja, F. Jahanian
Published: University of Michigan, 1998
URL: http://www.eecs.umich.edu/techreports/cse/1998/CSE-TR-382-98.pdf
Entry Date: 2003-05-15
Abstract: In this paper, we describe an experimental study of Internet stability and the origins of failure in Internet protocol backbones. The stability of end-to-end Internet paths is dependent both on the underlying telecommunication switching system, as well as the higher level software and hardware components specific to the Internet's packet-switched forwarding and routing architecture. Although a number of earlier studies have examined failures in the public telecommunication system, little attention has been given to the characterization of Internet stability. Our paper analyzes Internet failures from three different perspectives. We first examine several recent major Internet failures and their probable origins. These empirical observations illustrate the complexity of the Internet and show that unlike commercial transaction systems, the interactions of the underlying components of the Internet are poorly understood. Next, our examination focuses on the stability of paths between Internet Service Providers. Our analysis is based on the experimental instrumentation of key portions of the Internet infrastructure. Specifically, we logged all of the routing control traffic at five of the largest U.S. Internet exchange points over a three year period. This study of network reachability information found unexpectedly high levels of path fluctuation and an aggregate low mean time between failures for individual Internet paths. These results point to a high level of instability in the global Internet backbone. While our study of the Internet backbone identifies major trends in the level of path instability between different service providers, these results do not characterize failures inside the network of service provider. The final portion of our paper focuses on a case study of the network failures observed in a large regional Internet backbone. 
This examination of the internal stability of a network includes twelve months of operational failure logs and a review of the internal routing communication data collected between regional backbone routers. We characterize the type and frequency of failures in twenty categories, and describe the failure properties of the regional backbone as a whole.
Datasets:
  • for inter-provider faults:
    • 10 months (Jan 97 to Nov 98) of BGP updates from three providers
    • 3 years of BGP updates at 5 U.S. exchange points: AADS, Mae-East, Mae-West, PacBell, and Sprint
  • for faults within a backbone:
    • the study covers MichNet, a medium-sized regional network connecting educational and commercial customers in 132 cities at speeds up to OC3; the network connects 33 backbone routers to several hundred customer routers
    • 1 year (Nov 97 to Nov 98) of data from an automated system that pings all router interfaces
    • entries in trouble ticket system of the NOC
    • 6 months (Mar 97 to Nov 98) of OSPF messages
Results: Quoting and paraphrasing from paper:
  • The Internet backbone infrastructure exhibits significantly less availability and a lower mean-time to failure than the Public Switched Telephone Network (PSTN).
  • The majority of Internet backbone paths exhibit a mean-time to failure of 25 days or less, and a mean-time to repair of twenty minutes or less. Internet backbones are rerouted (either due to failure or policy changes) on average at least once every three days.
  • Routing instability inside an autonomous network does not exhibit the same daily and weekly cyclic trends as previously reported for routing between inter-provider backbones, suggesting that most inter-provider path failures stem from congestion collapse.
  • A small fraction of network paths in the Internet contribute disproportionately to the number of long-term outages and backbone unavailability.
  • The majority of intra-domain outages stem from maintenance, power outages, and PSTN failures.
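
The availability, mean-time to failure, and mean-time to repair figures above can be illustrated with a minimal sketch. It assumes the common textbook definitions (uptime divided by number of failures, downtime divided by number of failures); the paper's exact definitions may differ. The function name `path_stats` and the sample outage data are hypothetical and are not taken from the paper.

```python
from datetime import datetime, timedelta


def path_stats(outages, start, end):
    """Summarize one path's outages over the observation period [start, end].

    `outages` is a list of (down, up) datetime pairs, assumed to be
    non-overlapping and to lie entirely inside the observation period.
    """
    total = end - start
    downtime = sum((up - down for down, up in outages), timedelta())
    failures = len(outages)
    if failures == 0:
        # No failure observed, so MTTF and MTTR are undefined for this path.
        return {"availability": 1.0, "mttf": None, "mttr": None}
    uptime = total - downtime
    return {
        "availability": uptime / total,  # fraction of time the path was up
        "mttf": uptime / failures,       # assumed definition: mean uptime per failure
        "mttr": downtime / failures,     # assumed definition: mean outage duration
    }


# Hypothetical path: four 20-minute outages in a 100-day period.
start = datetime(1998, 1, 1)
end = start + timedelta(days=100)
outages = [
    (start + timedelta(days=d), start + timedelta(days=d, minutes=20))
    for d in (10, 35, 60, 85)
]
stats = path_stats(outages, start, end)
print(stats["mttf"], stats["mttr"], round(stats["availability"], 5))
```

With these made-up inputs the path sits right at the reported thresholds: a mean-time to failure just under 25 days and a mean-time to repair of twenty minutes.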
Notes:
  • Inter-domain BGP updates are classified into the following categories: Route Repair, Route Fail-Over, Policy Fluctuation, and Pathological Routing. Previous work by the authors used a lower level classification (e.g., WWDup, AADiff). The paper does not analyze updates in the Policy Fluctuation and Pathological Routing categories, as these are outside the scope of the study.
  • The study only considers prefixes "present in each ISP's routing table for more than an aggregate 60 percent (170 days)" of the study period. This removed 20% of short-lived routes, leading to a lower estimate of network failures.
  • The authors "applied a fifteen minute filter window to all BGP route transitions;" specifically, multiple failures occurring during the window are counted as a single failure. This is meant to reduce the bias of high frequency pathological behavior and the effects of BGP convergence.
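
The two filtering steps described in the notes above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the choice to start each window at the first failure that is counted, and the sample prefixes (taken from documentation address ranges) are all hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)


def count_failures(failure_times, window=WINDOW):
    """Count failures, treating all failures inside one window as a single failure.

    Assumption: a window opens at the first failure that gets counted, and any
    later failure less than `window` after that point is folded into it.
    """
    count = 0
    window_start = None
    for t in sorted(failure_times):
        if window_start is None or t - window_start >= window:
            count += 1
            window_start = t
    return count


def long_lived_prefixes(days_present, study_days, min_fraction=0.6):
    """Keep prefixes present for more than `min_fraction` of the study period."""
    return {
        prefix
        for prefix, days in days_present.items()
        if days / study_days > min_fraction
    }


# Hypothetical failure timestamps for one route, in minutes after t0.
t0 = datetime(1998, 6, 1, 12, 0)
times = [t0 + timedelta(minutes=m) for m in (0, 5, 14, 15, 40)]
print(count_failures(times))

# Hypothetical presence counts (days seen in the routing table).
presence = {"192.0.2.0/24": 200, "198.51.100.0/24": 100, "203.0.113.0/24": 170}
print(sorted(long_lived_prefixes(presence, study_days=283)))
```

A fixed window anchored at the first counted failure is only one reading of the quoted sentence; a sliding window that restarts at every failure would merge longer bursts and yield lower counts.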