
Bibliography Details

G. Iannaccone, C. Chuah, R. Mortier, S. Bhattacharyya, and C. Diot, "Analysis of link failures in an IP backbone", in ACM SIGCOMM Internet Measurement Workshop, Nov 2002.

Analysis of link failures in an IP backbone
Authors: G. Iannaccone
C. Chuah
R. Mortier
S. Bhattacharyya
C. Diot
Published: ACM SIGCOMM Internet Measurement Workshop, 2002
URL: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.2392
http://www.icir.org/vern/imw-2002/imw2002-papers/202.pdf
http://www.icir.org/vern/imw-2002/slides/202-slides.pdf
Entry Date: 2003-05-14
Abstract: Today's IP backbones are provisioned to provide excellent performance in terms of loss, delay and availability. However, performance degradation and service disruption are likely in the case of failure, such as fiber cuts, router crashes, etc. In this paper, we investigate the occurrence of failures in Sprint's IP backbone and their potential impact on emerging services such as Voice-over-IP (VoIP). We first examine the frequency and duration of failure events derived from IS-IS routing updates collected from three different points in the Sprint IP backbone. We observe that link failures occur as part of everyday operation, and the majority of them are short-lived (less than 10 minutes). We also discuss various statistics such as the distribution of inter-failure time, distribution of link failure durations, etc. which are essential for constructing a realistic link failure model. Next, we present an analysis of routing and service reconvergence time during a controlled link failure scenario in our backbone. Our results indicate that disruption to packet forwarding after link failures depends not only on routing protocol dynamics, but also on the design of routers' architectures and control planes. Thus our results offer insights into two basic components for defining network-wide availability, which we consider a more appropriate metric for service-level agreements to support emerging applications.
Datasets:
  • Discusses only failure events that affect links connecting different POPs (Points of Presence). Intra-POP failures are not covered.
  • Disregards link failures that are not fixed within 24 hours (under the assumption that these represent a permanent removal of links).
  • for link failures: IS-IS updates collected Dec 2001 to Apr 2002
  • for IS-IS convergence time: two-way packet probes and traceroutes between a host on the U.S. East Coast and a host on the West Coast; two backbone links were intentionally brought down
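
The two selection rules above (inter-POP links only, failures repaired within 24 hours) can be sketched as a simple filter. This is an illustration only, not the authors' tooling: the LinkFailure record, its field names, and the select_failures helper are all hypothetical, and treating a failure of exactly 24 hours as retained is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LinkFailure:
    """Hypothetical record for one link failure event (not from the paper)."""
    link_id: str
    pop_a: str       # POP at one end of the link
    pop_b: str       # POP at the other end of the link
    start: datetime  # time the link went down
    end: datetime    # time the link came back up

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


# Failures lasting longer than this are treated as permanent link removals.
MAX_DURATION = timedelta(hours=24)


def select_failures(events):
    """Keep inter-POP failures that were repaired within 24 hours."""
    return [
        e for e in events
        if e.pop_a != e.pop_b and e.duration <= MAX_DURATION
    ]
```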
Results:
  • only 10% of failures last longer than 20 minutes
  • 50% of failures last less than 1 minute
  • 47% of all failure events occur between 10 PM and 6 AM EST, a time period including most planned maintenance (at Sprint)
  • links differ widely in number of failures and in mean time between failures; a small number of links are highly failure prone
  • using Cisco default values for IS-IS parameters: IS-IS convergence time after a failure is less than 18 seconds
  • tuning IS-IS parameters: IS-IS convergence time can be reduced to 2-3 seconds
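
Statistics of the kind listed above (share of failures shorter than a threshold, per-link mean time between failures) can be computed along the following lines. This is a minimal sketch under stated assumptions: the function names and input layouts are hypothetical, times are plain seconds, and "time between failures" is taken here as the gap between successive failure start times on the same link, which may differ from the paper's exact definition.

```python
from collections import defaultdict


def fraction_shorter_than(durations_s, threshold_s):
    """Fraction of failure durations (in seconds) strictly below threshold_s."""
    if not durations_s:
        raise ValueError("no failure durations given")
    return sum(1 for d in durations_s if d < threshold_s) / len(durations_s)


def mean_time_between_failures(failures):
    """Per-link mean gap between successive failure start times.

    failures: iterable of (link_id, start_s) pairs, start_s in seconds.
    Links with fewer than two failures are omitted from the result.
    """
    starts = defaultdict(list)
    for link_id, start_s in failures:
        starts[link_id].append(start_s)

    result = {}
    for link_id, times in starts.items():
        if len(times) < 2:
            continue
        times.sort()
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        result[link_id] = sum(gaps) / len(gaps)
    return result
```

Sorting links by failure count or by this mean gap is one way to surface the small set of failure-prone links noted in the results.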