ISMA Dec '01 Routing and Topology Analysis
no aphorism is more frequently repeated...
than that we must ask Nature few questions,
or ideally, one question at a time.
...this view is wholly mistaken
Nature will best respond to a logically
and artfully thought out questionnaire;
indeed if we ask her a single question, she will often refuse
to answer until some other topic has been discussed.
Sir Ronald A. Fisher
Perspectives in Biology and Medicine, 1973
Introduction / Goals
Over the past year, questions about Internet growth and about how long the BGP4 routing system can remain viable have become regular topics at NANOG and IETF meetings. Ongoing discussions lend global credence to certain presented opinions and concerns, regardless of the scope of the original analysis efforts or of how much (or how little) data was actually analyzed. Inadequacies in current Internet routing analysis tools and methodology handicap efforts to plan future architectures. In an attempt to begin separating facts from Internet "myths", CAIDA hosted an Internet Statistics and Metrics Analysis (ISMA) workshop focused on routing data analysis. The workshop, held December 17-19, 2001 at the San Diego Supercomputer Center, brought together researchers from academia as well as from commercial ISPs actively working in the emerging field of routing data analysis. An overarching goal of this workshop was to ask research questions relevant to maintaining the health of the Internet's global routing infrastructure. For example:
- What are trends in routing table growth? Is this growth manageable?
- Is CIDR still working? (e.g., How much dark Internet address space is left?)
- How do filtering options impact global routing system architecture?
- Can worms and viruses destabilize global routing?
- How many multiple origin AS conflicts exist?
- Is multihoming a technical and/or economic problem with respect to global routing?
- What does the global Internet look like? How can we model it?
Routing Data and Growth
At the end of 2001, several different sources of operational inter-domain routing data existed:
| Archive | Peering sessions | Data volume | Collection format | Snapshots/day | Maintainer |
|---|---|---|---|---|---|
| RouteViews (Cisco) | 63 | 700K | sh ip bgp | 12/day | U Oregon |
| RouteViews2 (zebra) | 23 | 200M RIB | MRT UPDATEs + RIBs | 96/day | U Oregon |
| Routing Information System (RIS) | 177 | ??? | sh ip bgp; +announcements; +zebra logfiles | 3/day | RIPE-NCC |
| Internet Routing Topology Archive (IRTA) | 181 | 6.9M | sh ip bgp; migrating to MRT | 1/day | PCH |
Issues for Routing Data Archivers
Collected routing data has been used by ISPs to troubleshoot operational problems and by researchers to analyze routing dynamics, although it is typically easier to detect a problem than to find out who to notify to have it fixed. Benefits achieved so far by ISPs and researchers using routing data archives do not preclude questions about the soundness and efficiency of the data collection methods employed. Some efforts have been made to correlate data analysis from different archives, e.g., Henk Uijterwaal's analysis of RIPE data is consistent with Andre Broido's RouteViews-based plot of the number of hops in AS paths. Yet there is no real methodology for validating analyses, in particular since so many different views are visible from different points in the topology. Further sharing of archived routing data also depends on development of a reasonable Acceptable Use Policy as well as agreement on data formats. Finally, specific logistical problems are relevant both to archivers and to the entire routing operations and research community.
- As routing data grows, how do archivers keep up with necessary disk/processor/memory scaling?
- How can archivers maintain contacts with peering organizations?
- Who pays for the archive infrastructure?
RouteViews
Joel Jaeggli gave an update on the status of the RouteViews project. See: http://www.routeviews.org/. Joel explained that certain known Cisco bugs and anomalies with the sh ip bgp command motivate the push to use MRT format RIBs instead. However, problems occurred while trying to migrate RouteViews from Cisco's bgpd to a zebra bgpd that supports MRT format. It appears that a bug in zebra's scanner thread prevents the zebra bgpd from starting all 63 RouteViews peers. The CLI thread is also prohibitively slow. David Meyer is working with AYR implementors to fix this bug. He is also working with Cisco to fix some of the problems on the IOS side.
RIPE Routing Information Service (RIS)
Henk Uijterwaal discussed progress with RIPE's Routing Information Service (RIS). RIS was built to handle complaints about AS-level routing problems. RIS data is stored in a mysql database. There are eight RIS Route Collectors (RRCs) that monitor 177 peering sessions. Disk space did not scale with archive growth and a RAID disk was installed to solve that problem. RIS stores three months of raw data and three years of summary data. Data can be accessed via queries to the database and through a daily report. There is a 30-60 minute delay between data collection and database accessibility. A looking glass facility was added to enable RIS users to access data before it has been entered into the main database.
RIPE Test Traffic Measurements Service (TTM)
The TTM service measures network delays and packet losses between some 70 points on the Internet. In recent months, the measurement program has expanded in several directions:
- IP Delay Variation ("jitter") measurements
- Long term trends, allowing a user to predict where network capacity must be added in the near future
Early next year, RIPE plans to integrate bandwidth measurements into the infrastructure.
Packet Clearing House (PCH) Internet Routing Topology Archives (IRTA)
Bill Woodcock described a new effort by PCH since the spring of 2000 to continuously collect routing data from key Internet eXchange (IX) points. IRTA probes are currently installed at PAIX, MAE-West, MAE-LA, MAX, SD-NAP, SIX, LINX, and VIX. New probes are ready for installation at SOIX, Lo-NAP, and MAE-East/PAIX-VA/Equinix Ashburn. PCH is also collecting routing data from Sprint, UUNet, C&W, BBN, AT&T, MFN, and RCN transit links, as well as OIX (Oregon RouteViews). Twenty-three future IRTA probe machines are scheduled for deployment. The IRTA back end consists of mirrored 1.5TB RAID arrays, insufficient to support uncompressed data storage. An upgrade to mirrored 7TB RAID arrays is underway; web and database gateways will be supported. See: http://www.pch.net/peermaster/
Andy Ogielski examined global routing instabilities that occurred during the Code Red II and Nimda worm events. Joint analysis using RIPE RIS data revealed large and long-lasting BGP storms induced by worm propagation. For example, when the Nimda worm initiated an attack on September 18, a 20-fold exponential growth in prefix announcements was observed. Announcements returned to a baseline level after four days. Nimda probes clearly had a substantial detrimental effect on the routing system.
Andy presented qualitative and operational definitions of global Internet routing instabilities:
| Qualitative | Operational |
|---|---|
| high rates of route changes | exponential growth of rate of prefix announcements and withdrawals |
| long-lasting | hours to days |
| seen at many observation points | almost all prefixes churning from most large ISP peers |
Nimda and Code Red triggered long-term BGP instabilities unlike any localized network failure. The attacks could not be localized to particular suspect peers, prefixes, or routes. Andy's data suggests that Nimda affected more edge routers than core routers. These routers fail due to router CPU overload, out-of-memory or cache overflows, or software bugs. The diversity of worm traffic causes extremely high scan rates, generating many flows that then overload router CPU or memory, causing NAT problems and ARP storms. Andy also noted router misconfiguration instabilities due to common BGP events in the Internet core. For example, if a misconfigured AS announces a private ASPATH, certain routers ignore yet propagate the malformed route while other RFC1771-compliant routers close and reopen the BGP sessions. Any local leakage of malformed ASPATH announcements can cause cascading router failures.
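An operational detector in the spirit of these definitions could flag periods in which the announcement rate grows to a large multiple of a trailing baseline, echoing the 20-fold surge observed during Nimda. The sketch below is illustrative only: the window length, threshold multiplier, and use of a median baseline are assumptions, not the method actually used in Andy's analysis.

```python
from statistics import median

def detect_update_storm(hourly_counts, window=24, threshold=20.0):
    """Flag hours whose BGP announcement count exceeds `threshold` times
    the median of the trailing `window` hours.  Returns storm indices."""
    storms = []
    for i in range(window, len(hourly_counts)):
        baseline = median(hourly_counts[i - window:i])
        if baseline > 0 and hourly_counts[i] >= threshold * baseline:
            storms.append(i)
    return storms

# Hypothetical quiet baseline of ~100 updates/hour, then a sudden surge.
counts = [100] * 30 + [2500, 3000, 2800]
print(detect_update_storm(counts))  # indices of the storm hours
```

A real detector would also have to tolerate session resets and table transfers, which inflate update counts without indicating instability.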
Sue Moon presented a technique for detecting routing loops in packet traces and analyzed their impact. Sprint backbone links utilize splitters to allow all headers to be captured. Since she is examining all packets going through the link, she can detect whether a packet visits that link more than once. Her data shows that 99% of the looping incidents last less than ten seconds. Routing loops occur for three general reasons:
- link failure or new link coming up impacting iBGP
- route withdrawals, new route announcements, peering link failure, new peering links impacting eBGP
- router misconfiguration
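Sue's single-link observation can be sketched as a repeated-packet test: if the "same" packet reappears on the link with a smaller TTL, it must have traveled a loop. The packet fingerprint used below (an opaque key the caller builds from invariant header fields such as source, destination, IP ID, and a payload checksum) is an assumption about the technique, not her exact implementation.

```python
def find_looping_packets(trace):
    """trace: list of (timestamp, packet_key, ttl) tuples observed on one
    monitored link.  A packet seen again with a lower TTL (fewer hops
    remaining) is taken as evidence of a routing loop; returns a list of
    (packet_key, loop_duration) pairs."""
    first_seen = {}
    loops = []
    for ts, key, ttl in trace:
        if key in first_seen:
            ts0, ttl0 = first_seen[key]
            if ttl < ttl0:  # same packet, strictly fewer hops remaining
                loops.append((key, ts - ts0))
        else:
            first_seen[key] = (ts, ttl)
    return loops

trace = [(0.0, "pktA", 60), (0.4, "pktA", 56), (1.0, "pktB", 64)]
print(find_looping_packets(trace))  # pktA revisited the link
```

Binning the loop durations from such output would reproduce the kind of distribution Sue reported, where 99% of incidents last under ten seconds.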
Ratul Mahajan discussed his study of BGP misconfigurations. He notes that while BGP is the weakest link in the Internet infrastructure, Internet connectivity is robust to most misconfiguration incidents. Using RouteViews BGP tables and heuristics to identify potentially bad announcements, Ratul attempted to validate suspected misconfigurations with email surveys of ISPs. He categorized short-lived table announcements into one of five categories: self deaggregation; stripping; strip deaggregation; foreign origination; and foreign deaggregation. Causes include buggy filters; redistribution; accidental inclusion of an old router configuration; address hijacking; forgetting auto-summary; incorrect summarization; private communities; and relying on an upstream router to correct problems. Ratul also analyzed route export misconfiguration, using Lixin Gao's methodology to infer AS relationships (See: Lixin Gao's Homepage). He then identified AS sequences that either do not obey those relationships or are short-lived. Causes of AS-level misconfigurations can be attributed to prefix based configuration, simple typos, and incorrect implementation of private communities.
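The deaggregation categories above can be illustrated with a small classifier: given a short-lived announcement, check whether it is a more-specific of an existing route and whether the originating AS matches. This is a hypothetical sketch of one heuristic, not Ratul's actual pipeline; the AS numbers in the example are drawn from the private range.

```python
import ipaddress

def classify_deaggregation(new_prefix, new_origin, table):
    """Classify a suspect announcement against an existing routing table
    (`table` maps prefix string -> origin AS).  Only distinguishes a few
    of the five categories discussed above, as an illustration."""
    new_net = ipaddress.ip_network(new_prefix)
    for prefix, origin in table.items():
        net = ipaddress.ip_network(prefix)
        if new_net != net and new_net.subnet_of(net):
            if new_origin == origin:
                return "self deaggregation"
            return "foreign deaggregation"
    return "possible foreign origination"

table = {"10.0.0.0/8": 64500}  # hypothetical covering route
print(classify_deaggregation("10.1.0.0/16", 64500, table))
```

The hard part, as the email surveys showed, is not the mechanical classification but confirming operator intent behind each suspect announcement.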
Mike Lloyd discussed the rationale behind RouteScience's data collection and BGP performance enhancement (intelligent routing) tools. RouteScience built several multihomed PoPs around the US that take full BGP feeds from a number of different ISPs. Each PoP hosts some web content for partner companies while providing a platform for evaluating their global server load balancing (GSLB) technology. Using these PoPs, RouteScience studies the impact of egress choices on end-to-end performance. A three-way Handshake Round Trip Time (HRTT) is collected from tcpdump files, along with logs and daily BGP tables from each PoP. Daily HRTT measurements for each prefix over each link are collected and serve to set baseline performance measures for each provider. Performance measurements are compared to BGP-table-derived best paths. Hot-potato routing dominates, but this performance-based system offers a method for improving BGP performance in a certain percentage of cases. Mike notes that prefix performance varies substantially, and most multihomed networks use little of the BGP table.
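One plausible reading of the HRTT measurement is the gap between the server's SYN-ACK and the client's completing ACK, as seen in a packet capture at the PoP. The sketch below assumes pre-parsed (timestamp, flow, flag) tuples rather than raw tcpdump output, and is illustrative rather than RouteScience's actual extraction code.

```python
def handshake_rtt(packets):
    """Estimate per-flow handshake RTT from (timestamp_ms, flow_id, flag)
    tuples observed server-side: the delay between sending a SYN-ACK and
    receiving the client's final ACK.  Only the first handshake per flow
    is measured."""
    synack = {}
    rtts = {}
    for ts, flow, flag in packets:
        if flag == "SYNACK":
            synack.setdefault(flow, ts)
        elif flag == "ACK" and flow in synack and flow not in rtts:
            rtts[flow] = ts - synack[flow]
    return rtts

packets = [(0, "f1", "SYN"), (10, "f1", "SYNACK"), (50, "f1", "ACK")]
print(handshake_rtt(packets))  # {'f1': 40}
```

The attraction of HRTT is that it is purely passive: every TCP connection yields one RTT sample per prefix and egress link, with no probe traffic.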
Shiv Kalyanaraman examined BGP and load balancing issues. He viewed the problem in two parts: outbound and inbound. For outbound load balancing, Shiv proposed using the mechanism of LOCAL_PREF settings and then viewing the problem as a black box optimization. A new technique, 'Recursive Random Search (RRS)' was found useful for finding approximate solutions quickly or better solutions given more time. Aggregate traffic workload was assumed to exhibit statistical properties and parameters valid for short, finite periods of time. SSFnet was used to simulate such traffic for small AS-graph topologies. Planned future work will consider larger AS graphs and more complex optimization goals. For inbound load balancing, Shiv considered only the special case of multihomed stub ASes. Shiv argues that this is an important subset based on Cengiz and Andre's work. For this subset of inbound load balancing only, Shiv suggests that BGP can be completely bypassed, and a NAT-based technique used instead, because inbound load is routed to public addresses managed by NAT boxes. The NAT boxes must tightly coordinate the assignment of public addresses to flows to achieve fine-grained inbound load balancing. Intended for edge routers, this method is consistent with prefix-length based filtering proposals and could reduce routing table sizes.
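The black-box view of outbound balancing can be made concrete with a minimal version of the Recursive Random Search idea: sample the parameter space uniformly, then repeatedly re-center and shrink the sampling region around the best point found so far. Everything here (parameter names, defaults, the shrink schedule) is an assumption for illustration; the objective stands in for a link-load-imbalance function of LOCAL_PREF settings.

```python
import random

def recursive_random_search(cost, bounds, samples=30, shrink=0.5, rounds=5, seed=0):
    """Minimize black-box `cost` over a box given by `bounds`
    (list of (lo, hi) per dimension) by recursive random sampling."""
    rng = random.Random(seed)
    best_x, best_c = None, float("inf")
    lo = [b[0] for b in bounds]
    hi = [b[1] for b in bounds]
    for _ in range(rounds):
        for _ in range(samples):
            x = [rng.uniform(l, h) for l, h in zip(lo, hi)]
            c = cost(x)
            if c < best_c:
                best_x, best_c = x, c
        # shrink the sampling box around the current best point,
        # clipped to the original bounds
        span = [(h - l) * shrink / 2 for l, h in zip(lo, hi)]
        lo = [max(b[0], bx - s) for b, bx, s in zip(bounds, best_x, span)]
        hi = [min(b[1], bx + s) for b, bx, s in zip(bounds, best_x, span)]
    return best_x, best_c
```

This captures the property Shiv described: a rough answer appears after the first round of samples, and additional rounds spend their budget refining it.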
Avi Freedman evaluated the impact of routing churn on performance by examining edge/core update propagation. Akamai built an SLA verification system, called "Akanote", to compare origin web service to Akamai web content service. Often this system detects problems that ISPs have not yet discovered. For example, many problems turn out to be due to hard-to-detect CEF bugs or partial unreachability. Collected data enables SLA verification between Akamai and its customers as well as between Akamai and its ISPs. Network performance data is also used as a `coarse-grained' map of inter-AS connectivity, interfacing with Akamai's NOC tools as well as parts of Akamai's mapping efforts. Akamai collects diverse kinds of Internet measurements, and maintains BGP sessions with 350 ASes. Of these, more than half provide full BGP feeds. More than 250 are non-core ISPs who only do transit routing. BGP is used to determine acceptable prefixes for use by specific Akamai boxes, as well as to look at performance and structure. Akamai hopes to start using MRT format soon. Akamai gathers other kinds of measurement data for its 5-15 billion HTTP transactions/day: TCP throughput, retransmits, and timers. While edge filtering for specific pathologies is possible, detection of anomalies is more difficult than just classifying good or bad performance. Future plans call for more data mining capabilities.
Akamai also collects IP-to-geographic-location mappings via standard heuristics (e.g., in-addr.arpa reverse DNS, whois) and ISP collaboration, typically via BGP communities. They also collect bandwidth measurements and analyze unauthorized hits. Billing logs record 5-15 billion transactions/day and take 24 hours to process, yielding a coarse indication of access traffic density over time per prefix or /24. Avi also conducted a study of how long route changes take to propagate from origin to remote ASes, and how this value depends on the topology or location of the remote AS. He collected data from 15 core and 15 edge routers for one week, during which he noted 23,127 insertions, 192,409 origin AS changes, and 22,771 deletions. His results vary significantly from those previously reported by Craig Labovitz and Abha Ahuja, whose methodology involved fault insertions and more peers. Avi's results were faster than expected, especially for route deletion. Avi plans to repeat the study with a larger set of peers. It was suggested that the differing results from the two studies might be attributable to differences in how each determines the start-of-event time.
Macroscopic Statistics of Routing Tables
Cengiz Alaettinoglu produced evidence gathered from RIPE/RIS BGP Update Archives data showing that CIDR address management is working well. However his data also shows that in spite of an overall decrease in churn, engineered prefixes seem to churn more, primarily as a result of peering loss. Churn due to peering loss is expensive and gets carried several hops away from its source. Cengiz also pointed out that growth of multihoming prefixes is much closer to quadratic than exponential.
Andre Broido spoke about Internet stability amid change. His analysis of RouteViews data shows large variation but not rapid growth. He also showed that multihomed nets originate only their fair share of more specifics, in proportion to their presence in the routing table, and thus are not to blame for increases in routing churn. Andre also observed 15 times as many announcements as withdrawals, and 15 times as many withdrawals as announcements followed by immediate withdrawals. Data indicates that the Internet is undergoing a refinement process in which some bulk measures are stable while others change slowly. Contrary to prevailing wisdom, global routing system churn comes mostly from large ISPs, .gov/.mil, and ISPs in developing nations. Indeed, small ASes generate relatively little churn or table growth.
Brad Huffaker evaluated the predictive power of four different metrics for indicating end-to-end performance: IP path length; AS path length; geographic distance; and previously measured RTT (specifically, the median RTT over 24 hours). Although previously measured RTTs provide the best server selection predictor in 90% of all trials, the cost of making active probes over the network must be considered. Brad showed that for IP address pairs within the US, the shortest geographic distance between them provides a reasonable indicator of low RTT at zero cost to the network. In contrast, AS path length was not statistically useful (although easy to collect), and neither IP path length nor median RTT over 24 hours provided any better estimate than geographic distance.
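The appeal of geographic distance as a predictor is that it is computable offline and bounded by physics: light in fiber travels at roughly two-thirds of c, so great-circle distance gives a hard lower bound on RTT. The sketch below shows the standard haversine computation and that bound; the coordinates in the example are approximate, and the 200,000 km/s propagation speed is the usual rule of thumb rather than a measured constant.

```python
from math import radians, sin, cos, asin, sqrt

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius ~6371 km

def min_rtt_ms(distance_km):
    """Speed-of-light-in-fiber lower bound on RTT: ~200,000 km/s,
    doubled for the round trip."""
    return 2 * distance_km / 200.0

# Roughly San Diego to New York: ~3900 km, so RTT cannot dip below ~39 ms.
d = great_circle_km(32.7, -117.2, 40.7, -74.0)
print(round(d), round(min_rtt_ms(d), 1))
```

Actual RTTs exceed this bound because of circuitous fiber paths and queueing, which is why distance is only a "reasonable indicator" rather than an exact predictor.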
Krishna Nayak of netVmg showed how multihoming can be used to optimize performance over a large set of endpoints. He wants to determine which combination of ISPs will yield the most path diversity so that a multihomed enterprise gets the highest availability. His measurement sample contained 8,932 unique /24 prefixes randomly selected from 46,089 netflow flows. In this study of 4 different ISPs, Krishna ran a UDP traceroute via the BGP default path associated with each ISP. Then he located convergence points shared by each provider. Focusing on shared paths, he limited his analysis to those 3,372 destinations (out of total 8,932) having full traceroutes. From this subset, he plotted histograms of IP hop counts and RTT. Using this data, he was able to identify the least diverse and the most diverse providers, based on a specific viewpoint and probed endpoints.
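A crude proxy for the convergence-point analysis is to intersect the per-provider traceroute paths to a destination: hops that appear in every provider's path are shared infrastructure, and fewer shared hops means more diverse (and presumably more failure-resilient) provider combinations. This sketch is an illustration of the idea, not netVmg's method, and treats hops as opaque identifiers.

```python
def shared_hops(paths):
    """Given per-ISP traceroute paths (lists of router hops) to one
    destination, return the set of hops common to every provider."""
    shared = set(paths[0])
    for p in paths[1:]:
        shared &= set(p)
    return shared

# Hypothetical paths from three ISPs converging at router "r5".
paths = [["r1", "r5", "dst"], ["r2", "r5", "dst"], ["r3", "r5", "dst"]]
print(shared_hops(paths))
```

Ranking ISP pairs by the size of this intersection over many destinations would yield the kind of most-diverse / least-diverse ordering Krishna reported, from his specific viewpoint and probed endpoints.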
Hyunseok Chang presented a study of BGP routing tables that considered both local and non-local viewpoints, based on joint work with Ramesh Govindan, Sugih Jamin, Scott J. Shenker and Walter Willinger. They find that peering relationships of Tier-1 ASes are more easily observed than those of non-Tier-1 ASes. He also notes that peering relationships of a given AS are more easily observed by its customer ASes than by its peer ASes. Peering relationships with upstream provider ASes are easily observed using his technique of non-local views. Hyunseok also analyzed exchange point peering connectivity. He found that connectivity of ASes co-located at a typical exchange point is quite rich (based on RIPE IRR, email survey and individual ISP/EP website information). For example, at LINX, where more than 100 ASes co-locate, the average number of peering connections for each AS is more than 50. A large number of regional ISPs worldwide are co-locating themselves at regional exchange points, but Oregon RouteViews does not capture all of this rich connectivity. Hyunseok's results suggest that available public BGP routing data may not be sufficient to construct a representative AS-level topology, shedding doubt on the validity of well-known power-law degree distributions to model Internet topology.
Statistical Methods and Issues
Ramesh Govindan presented work done with Vern Paxson at ISI to characterize the time routers take to generate ICMP responses to active probes. This work is relevant to all measurement tools that employ ICMP for measuring end-to-end performance. He wants to determine how much noise is added by a particular end-to-end tool and how much depends on router configuration. He does not address path asymmetry. Ramesh conducted his tests on 11 NIMI boxes. Five of these boxes implemented ingress filtering, so they only returned "administratively prohibited" messages. Several metrics for estimating noise were considered, and problems with clock skew and rate-limiting on deep queues were acknowledged. Ramesh's methodology utilizes a regular ping and a special hop-limited ping that spoofs the source address with the destination address. In this way, the ICMP engine on the target router will be reached, and the difference between the two RTTs is the ICMP packet generation time. Ramesh finds that most ICMP messages are generated in 1-2 msecs, but that some take up to 11 msecs.
Valery Kanevsky discussed issues related to establishing statistically justifiable sample sizes when making Internet measurements. He used the distribution of packet samples classified into different groups, e.g., IP address, packet length, from which he tries to project the real traffic distribution.
Bill Woodcock spoke about trends in the growth of public Internet exchanges. He described the Peering Coordinator Toolset, which offers a possible research spinoff: it can be used to build a more complete, but anonymizable database of peerings than can be published using AS numbers. Bill contrasted popular wisdom of IX membership trends (expecting exponential growth) with actual historical data (showing quite linear growth). The web-based Peermaster interface at http://peermaster.pch.net currently supports the following queries:
- total peers by month at exchanges that are above some defined threshold from a specified date to present
- total exchanges by month for all ISPs above some defined threshold from a specified date to present
- specific peer/date adds for a selected exchange
- specific exchange/date adds for a selected ISP
Highlights and Key Findings
Our understanding of global inter-domain Internet routing dynamics lags significantly behind the growth in the routing system itself. To improve the routing system, we need to better understand its behavior, starting with calibration and refinement of various methods used to analyze routing dynamics. Yet formidable challenges exist in extracting meaningful insights from measurements, even of homogeneous data sets. Data sets are typically extremely large (gigabytes to terabytes), in heterogeneous formats, and gathered at different times, granularities, and locations, both geographically and logically.
Improving Available Routing Data
Limitations of currently available BGP data prevent the pursuit of certain questions. For example, connectivity derived from BGP data does not portray most redundancy in different parts of the network, because BGP tables show only selected (best) routes rather than all possible paths. Also significantly missing are public and private exchange points, short-term AS path variations, and AS load balancing. Using only BGP data to develop any kind of topology map produces significant distortions of network connectivity, even in the core. Exported BGP data is simply too sparse for comprehensive Internet topology analysis, even at the AS-level granularity.
Indeed, several analyses suggested that available BGP data may not be sufficient to construct a representative AS-level topology, and definitely not reflective of underlying IP-level topologies. In particular, the validity of well-known power-law degree distributions is questionable for accurate topology modeling.
Notwithstanding these limitations, Andy Ogielski believes that routing data quality can be improved by prefiltering to remove errors and anomalies. Specifically, he suggests removing records with decreasing timestamps (caused by clock shifts), corrupt MRT headers, truncated BGP messages, and sessions that peers open and then immediately close. BGP has inherent race conditions, which are not helped by vendor software bugs, e.g., Cisco CEF failing to properly distribute the forwarding table.
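A prefiltering pass of the kind suggested above could look like the sketch below. The record layout (dicts with `ts`, `header_ok`, and `truncated` fields) is an assumption standing in for parsed MRT records; real MRT parsing and session-flap detection would need more state.

```python
def prefilter_updates(records):
    """Drop records with corrupt headers, truncated bodies, or timestamps
    that go backwards (clock-shift artifacts).  `records` is a list of
    dicts with assumed keys 'ts', 'header_ok', and 'truncated'."""
    clean, last_ts = [], None
    for rec in records:
        if not rec.get("header_ok", True) or rec.get("truncated", False):
            continue  # corrupt MRT header or truncated BGP message
        if last_ts is not None and rec["ts"] < last_ts:
            continue  # decreasing timestamp: discard rather than reorder
        clean.append(rec)
        last_ts = rec["ts"]
    return clean

records = [{"ts": 1}, {"ts": 2, "header_ok": False}, {"ts": 0}, {"ts": 3}]
print([r["ts"] for r in records], "->", [r["ts"] for r in prefilter_updates(records)])
```

Discarding rather than reordering out-of-sequence records is a deliberate choice in this sketch: once clocks have shifted, the true ordering is unrecoverable.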
Avi Freedman's data differed significantly from the classic Labovitz/Ahuja route convergence study, for reasons not yet clear. Some hypotheses include: differences in measurement methodology; time granularity of analysis; or insufficient number of peers in the sample.
Understanding the Routing Myths
Analysis of growth and dynamics of the routing system has already suggested that current prevailing wisdom is inaccurate in significant ways and deserves reexamination based on more comprehensive data analysis:
- Routing table growth is exponential: Growth is actually closer to linear or quadratic.
- Most prefix growth occurs for /20 and longer: Andre Broido presented data showing that small ASes generate very little churn or table growth. Half of the routing instability (in the form of withdrawal/reannouncement events) in late 2001 was contributed by 1.2% of all 12,422 active ASes. Government networks, telecoms in developing countries, and major backbone ISPs are the top contributors to BGP table churn. Small ASes (those originating a few prefixes) do not contribute more than their fair share to BGP table size or to instability of the global routing system.
- Multihoming misconfigurations cause most routing table churn: Most multihomed networks contribute proportionately little churn to the BGP table, relative to their degree of presence in the table. As of November 2001, multihomed networks (transit or non-transit) are not significantly more likely to announce more-specific prefixes than non-multihomed networks.
- Average AS path length is decreasing: AS path length, both the mean and the overall distribution, did not change significantly between 1999 and 2001. The link/node ratio (average degree) and peering richness of the BGP AS graph also did not change significantly between November 2000 and May 2001, although individual ASes often exhibited a high degree of change.
The workshop also identified open questions for future routing analysis:
- What mechanisms currently exist or could be developed to identify and track critical or vulnerable pieces of the infrastructure?
- Are observations of denser meshes, including those toward the edge, affecting convergence? Are there ways to use an even denser mesh while not causing update storms and unstable intermediate states?
- Is there a way to scale a routing system while accommodating the frequent changes of fine-grained policy? In particular, is there a way to express traffic engineering preferences in such a way that does not overload the BGP table with policy applied to specific prefix announcements?
- Which ASes are downstream and upstream from a given AS?
- What percent of address space can be reached only via a given set of ASes or network prefixes? How much connectivity disappears if an AS or a prefix is removed?
- For a given AS number or prefix, which ASes/prefixes are reached only through that AS or prefix?
- What is the frequency of AS path changes for specific destinations over a day/week/month?
- Which exchange points and BGP peering sessions are visible at each AS?