ISMA Winter 2000 Workshop - Final Report
Correlation and Visualization
There are strong parallels between the BGP routing space and the condition commonly referred to as the tragedy of the commons. The BGP routing space is simultaneously everyone's problem, as it impacts the stability and viability of the entire Internet, and no one's problem in that no single entity can be considered to manage this common resource.
Geoff Huston, IRTF mailing list, 15 Dec 2000
Introduction / Goals
Although the number of routed prefixes and autonomous systems (ASes) grows exponentially, our understanding of global inter-domain Internet routing fails to keep pace. To improve the routing system, we need to understand better how it works now, starting with calibrating various methods used to analyze routing dynamics today. The workshop agenda first covered public sources of global routing data and simple growth curves of routes, address space, and AS number utilization. We then focused on uses of this data for routing modeling and analysis, including applications to traffic engineering. We had a session on analysis of massive topology data sets, and on tools for simulating very large (million-node) networks. The final session was a special interlude on techniques for bandwidth estimation. In this report we'll briefly capture the main points from each speaker, and then summarize highlights and results from the workshop perceived by the participants.
Sources of Routing Data
Several different sources of operational inter-domain routing data exist, most notably the University of Oregon's RouteViews and RIPE-NCC's more recently established Routing Information Service (RIS) project.
RouteViews is a publicly accessible Cisco router with 39 peers currently providing their views of Internet routing. Although some ISPs are sensitive about their routing data, it has become clear that the current RouteViews Acceptable Use Policy (AUP) is stifling research and unnecessarily consuming administrative time. Since the data is essentially available elsewhere without restrictions, David Meyer (the RouteViews administrator) would like to remove the AUP.
Meyer asked for suggestions or comments on how to make the data more useful to the research community. Issues recognized at the workshop:
- We have no way to know if peers are sending RouteViews customer routes, full routes sent to customers, full routes sent to peers, or all specifics.
- Cisco's TCP implementation wasn't intended to handle transfers of 100+Mb "files", which researchers need to do to download a complete RouteViews dump for analysis.
- The IOS "sh ip bgp" format was not designed to be easily parsed. It would be useful to develop an export RIB format (XML, MRT, text; no consensus yet) usable either pre- or post-policy.
- Also incredibly useful would be the ability to dump dynamic updates, with flexible configuration of how much to dump (e.g., by specifying maps). The best way would be a BGP capability that flags collection sessions in order to prevent unintended re-export (which would be bad).
- RouteViews uses a private AS number, but this will likely change soon, since peers using it can no longer use that number privately anyway.
Meyer is interested in co-authoring a proposal on expanding the functionality and operational analysis of RouteViews, based on needs and ideas articulated in this workshop.
RIPE-NCC Routing Information Service (RIS)
Like RouteViews, the RIS was originally established as a data mechanism to help resolve transient problems with reachability. Collecting at a finer grain than RouteViews, RIS archives time-stamped, default-free BGP announcements from several points in the Internet (about 10 by 2001) in a MySQL database, providing the opportunity for a `giant queryable Looking Glass with history'. The remote route collectors (running Zebra) take snapshots of routing tables 3 times/day and store and forward all BGP route announcements. RIS will become a RIPE NCC service in 2001. Henk Uijterwaal presented details about the RIPE-NCC Routing Information Service.
Geoff Huston (Telstra) presented his analysis of a Telstra core routing table observed over the last few years. His main observations:
- the AS table size is growing faster than the address space covered by it
- the average size of a routing entry is decreasing (/18.1 a year ago; /18.4 now)
- the number of distinct AS paths is growing due to multi-homing (multi-homing with several providers is being used as a substitute for upstream service resiliency; core BGP tables are bearing the brunt of the load)
- 35% of advertisements are routing "holes"; many have different AS paths than the aggregates (due to, e.g., selective advertisement of smaller prefixes along different paths as a traffic engineering technique)
- CIDR is no longer working that well, and a lot of prefix growth is for longer prefixes (/20, /25)
- accelerants: lower communication costs, multi-homing misconfigurations (reduced significantly after every IETF meeting publishes its top-ten shame list)
Rather than being a function of the bearer or switching subsystem, resiliency is provided through the function of the BGP routing system. The question is not whether this is feasible or desirable in the individual case, but whether the BGP routing system can scale adequately to continue to undertake this role.
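As an illustration of the routing "holes" observation, here is a minimal Python sketch that flags more-specific prefixes covered by an advertised aggregate and reports whether they share the aggregate's AS path. It takes "hole" to mean a more-specific prefix inside a shorter advertised prefix, which is an assumption about the term as used above. The table layout (prefix string mapped to an AS-path tuple), the function name `find_holes`, and the sample prefixes and AS numbers are all hypothetical, and this is not Huston's actual tooling.

```python
import ipaddress

def find_holes(table):
    """Return (specific, aggregate, same_path) for each prefix covered by a shorter one.

    `table` maps prefix strings to AS-path tuples (a hypothetical, simplified RIB).
    """
    nets = {ipaddress.ip_network(p): path for p, path in table.items()}
    holes = []
    for net, path in nets.items():
        # Candidate aggregates: strictly shorter prefixes that contain this one.
        covering = [agg for agg in nets
                    if agg.prefixlen < net.prefixlen and net.subnet_of(agg)]
        if not covering:
            continue
        # The closest aggregate is the covering prefix with the longest mask.
        agg = max(covering, key=lambda a: a.prefixlen)
        holes.append((str(net), str(agg), nets[agg] == path))
    return holes

# Made-up example data (private/documentation address space and AS numbers).
table = {
    "10.0.0.0/8": (64500, 64501),
    "10.1.0.0/16": (64500, 64502, 64501),  # more specific, different AS path
    "10.2.0.0/16": (64500, 64501),         # more specific, same AS path
    "192.0.2.0/24": (64500, 64503),        # not covered by any aggregate
}
holes = find_holes(table)
share = len(holes) / len(table)  # fraction of advertisements that are holes
```

The pairwise scan is quadratic in the table size, which is fine for a sketch; a prefix trie would be the usual choice for a full routing table.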
Other Distilled Versions of Routing Data
- Tony Bates' weekly CIDR Status posting http://www.employees.org/~tbates/cidr-report.html (also posted to the NANOG list)
Visualization of Routing Data
Andrea Carmignani (University of Rome) presented Hermes, a mediator tool for integrating complex information on mutual AS relationships (currently from RouteViews, RADB, and RIPE registry sources, intending to add more sources eventually). Hermes includes a graph-drawing module for visualization of BGP relationships, with the ability to display policy incongruities/inconsistencies, e.g., no policy-in or policy-out, asymmetric policies, two-way policy inconsistency.
Modeling and Analysis
Using routing data for modeling and analysis, Lixin Gao (U-Mass Amherst) presented her work on inferring Autonomous System relationships in the Internet. Lixin hierarchically taxonomized AS relationships and evaluated the accuracy of heuristic mechanisms to determine whether a given AS relationship was provider-customer, peer-peer (exchange customer routes), or sibling-sibling (exchange transit routes). The methodology provides a starting point for investigation of the effects routing policy has on AS path lengths.
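The report does not spell out the heuristics, so the following Python sketch is only one plausible simplification: it assumes a degree-based rule in which the highest-degree AS on each path is treated as the top provider, links before it run customer-to-provider, and links after it run provider-to-customer. A link seen providing transit in both directions is labeled sibling; peer-peer inference is omitted. The function name and the sample AS numbers are hypothetical.

```python
from collections import defaultdict

def infer_relationships(as_paths):
    """Label AS links as provider-customer or sibling (simplified heuristic).

    Assumes non-empty paths with AS prepending already removed.
    """
    # Degree of an AS = number of distinct neighbours seen across all paths.
    neighbours = defaultdict(set)
    for path in as_paths:
        for a, b in zip(path, path[1:]):
            neighbours[a].add(b)
            neighbours[b].add(a)

    # (customer, provider) pairs: the provider appears to carry transit.
    transit = set()
    for path in as_paths:
        # Index of the assumed top provider: the highest-degree AS on the path.
        top = max(range(len(path)), key=lambda i: len(neighbours[path[i]]))
        for i in range(len(path) - 1):
            a, b = path[i], path[i + 1]
            # Before the top the path climbs (b serves a); after it, it descends.
            transit.add((a, b) if i < top else (b, a))

    relations = {}
    for customer, provider in transit:
        link = tuple(sorted((customer, provider)))
        if (provider, customer) in transit:
            relations[link] = ("sibling",)  # transit observed in both directions
        else:
            relations[link] = ("provider-customer", provider, customer)
    return relations

# Made-up paths in which AS 100 has the most neighbours.
paths = [(1, 10, 100, 20, 2), (3, 30, 100, 40, 4), (2, 20, 100, 10, 1)]
relations = infer_relationships(paths)
```

With these sample paths every link is classified as provider-customer, with AS 100 at the top of the hierarchy.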
Craig Labovitz (Microsoft) then described his and Abha's update to their work on how routing convergence is affected by Internet policy and topology. They modeled inter-domain routing dynamics using default-free BGP peering sessions and injecting faults. They found that BGP convergence typically takes much longer than RIP convergence, and is dependent on the length of the longest AS path. BGP's MinRouteAdver timer interferes with some measurements; disabling it solves this, but at the expense of more BGP traffic. If BGP could tag updates and provide hints so that nodes could detect bogus state, it could invalidate alternative paths that would eventually be withdrawn anyway, which may improve BGP convergence.
We turned to macroscopic topology analysis, and its relationship with routing. Ramesh talked about the incongruence between routing policy and the raw topology, which we have seen give rise to higher-layer ways to get around perceived sub-optimal routing, both from the research (e.g., Detour) and commercial (CDNs, e.g., Akamai, Digital Island) communities.
Cengiz Alaettinoglu (packetdesign.com) talked about his work at ISI on Magellan, a tool used for unicast fault isolation. Magellan selects targets of interest to the user and uses traceroute and libpcap to learn the current path to those targets. Magellan can detect link failures by probing the link using other destinations and correlating results. Detection of oscillations requires a route history.
Bradley Huffaker (CAIDA) presented his work on providing daily operational summaries of the macroscopic skitter topology project. With 25 monitors and hundreds of thousands of destinations, the daily summaries allow the user to see plots of distributions of hop count, global RTTs, RTT by continent, RTT vs longitude, subsets of poorly served (high RTT) destinations from a given source, and distribution of addresses and dispersion of paths by country and AS. Marina Fomenkova has used similar methodology for a more in-depth study of destinations with high RTTs from the root nameservers. The CGI script also supports interactive customization of graphs. CAIDA used its netgeo program to map geographic location based on whois contact address and router name.
Andre Broido presented his work on determining the combinatorial core of Internet topology. Andre's technique is to construct a directed IP graph from skitter forward path topology data (compensating for non-responsive hops) and then find its combinatorial core by iteratively stripping nodes with outdegree 0 (no outgoing links). Among the remaining nodes, those with the smallest maximum hop distance to any other node represent the `center of the combinatorial core'. He concluded that there is indeed a center of the graph, and its reachable neighborhood is quite small before reaching the core.
Andre presented a small twist to his analysis which introduces a new concept to graph-theoretic routing and topology analysis: the `dual-AS' graph. In the policy graphs he constructed and analyzed, a node corresponds to a peering session and a link corresponds to a pair of sessions with a common AS and some traffic traversing it (e.g., "1909 195 1740" in an AS path becomes two nodes, "1909 195" and "195 1740", with a link between them). This dual-AS graph conceptually uses these inverse definitions of nodes and links in order to capture more policy constraints in the infrastructure. In particular, the graph of AS adjacencies is a poor descriptor for peerings further than one hop due to the influence of policy, but in the dual-AS graph, nodes represent peering sessions and links are pairs of sessions with common ASes across which announcements occur and traffic can flow.
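Both constructions can be sketched in a few lines of Python. The first function repeatedly strips nodes with outdegree 0 from a directed graph; the second builds a dual-AS graph from AS paths, using the "1909 195 1740" example from the text. Function names, the edge-list input format, and the toy graph are assumptions made for illustration; compensation for non-responsive hops and the search for the center are left out.

```python
def strip_to_core(edges):
    """Iteratively remove nodes with outdegree 0; return the surviving adjacency map."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        succ.setdefault(v, set())
    while True:
        sinks = {node for node, out in succ.items() if not out}
        if not sinks:
            return succ
        for node in sinks:
            del succ[node]
        # Removing sinks may turn their predecessors into new sinks.
        for out in succ.values():
            out -= sinks

def dual_as_graph(as_paths):
    """Nodes are adjacent AS pairs (sessions); links join consecutive pairs on a path."""
    nodes, links = set(), set()
    for path in as_paths:
        sessions = list(zip(path, path[1:]))
        nodes.update(sessions)
        links.update(zip(sessions, sessions[1:]))
    return nodes, links

# Toy directed graph: a 3-cycle with a two-hop tail hanging off it.
core = strip_to_core([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")])

# The AS-path example from the text.
nodes, links = dual_as_graph([(1909, 195, 1740)])
```

In the toy graph the tail (d, then e) is stripped over two rounds, leaving only the cycle a, b, c.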
One intriguing discovery Andre has made is that the macroscopic skitter topology data collection captures significantly (about an order of magnitude) more AS connectivity than any of the publicly available RouteViews data. The skitter data is thus extremely valuable for research into inter-AS connectivity and coverage.
The next phase of the workshop focused on using routing tables in conjunction with other types of data for experimenting with and validating traffic engineering methodologies. Jennifer Rexford (AT&T) showed how her AT&T research group used topology, configuration, and limited workload information to derive detailed traffic demands and routing improvements. They modeled traffic demand as a volume of traffic from a source to a destination over a period of time. Multiple possible ingress/egress points complicated the analysis, and they had to identify all possible egress points from a given source. Further, they could only measure at a subset of the links on the network. In the data sample studied, top traffic demands roughly followed a power law distribution, with time of day effects dependent mostly on the time for the traffic recipient. Among the insights gained from her study were the needs for:
- improved measurement support in the routers, including packet/flow sampling;
- online monitoring of BGP advertisements;
- distributed collection infrastructure;
- selective measurement at access links; and
- online monitoring of up/down links and IGP weights.
Nina Taft (Sprint Labs) presented a similar study characterizing traffic flows at the POP-level on Sprint's backbone. Sprint instrumented multiple POPs with optical splitters on OC3 links, and collected IP headers with timestamps. Nina observed `elephants and mice' behavior, with a significant split between heavy and light inter-POP flows. For example, she found an interesting traffic distribution at the level of prefix masks: for 8-bit masks, the top 10% of the prefixes make up 82% of traffic. If we then split the mask of the largest such `elephant' flow into (256) prefix masks of length /16, the top 10% generated 97% of the traffic for that /8. Over the course of a day, elephants tend to remain elephants and mice tend to remain mice.
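The `elephants and mice' statistic is simply the share of total traffic carried by the heaviest fraction of prefixes. A minimal Python sketch follows; the function name and the byte counts are invented for illustration (chosen so that one of ten prefixes carries 82% of the bytes) and are not Sprint's data.

```python
import math

def top_share(bytes_by_prefix, fraction=0.10):
    """Share of total bytes carried by the heaviest `fraction` of prefixes."""
    volumes = sorted(bytes_by_prefix.values(), reverse=True)
    # Always count at least one prefix, rounding the cutoff up.
    k = max(1, math.ceil(len(volumes) * fraction))
    return sum(volumes[:k]) / sum(volumes)

# Invented per-/8 byte counts: one elephant and nine mice.
traffic = {f"{n}.0.0.0/8": 20 for n in range(11, 20)}
traffic["10.0.0.0/8"] = 820
share = top_share(traffic)
```

Applying the same function to the /16s inside the heaviest /8 would reproduce the second level of the analysis described above.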
Olivier Goldschmidt (Make Systems) discussed the commercial software NetMaker's support for inference methods to support ISP backbone traffic engineering. NetMaker separates three different types of available information: 1) deterministic (topology, router/link types, routing paths); 2) measured (SNMP interface statistics, partial RMON/NetFlow), and 3) usage (external info about constraints on ingress-egress pairs). NetMaker applies linear programming methods, with the coefficient in the objective function set to the number of hops in the demand to ensure that the correct solution is always one of the optimal solutions. The application works best with NetFlow information from the most utilized routers.
Lance Tatman (Agilent) and Bill Woodcock (Zocalo) continued the traffic engineering session with their presentation on using traffic matrix data for transit bit cost optimization. The motivation is to develop a methodology for (mostly non-tier-1) providers to select transit providers. Current practice seems to involve mostly personal contacts/recommendations, or casual web searches or surveys, rather than any quantitative analysis of their service. Currently, a multi-homed transit customer has difficulty ascertaining how much traffic will go to each of its providers. Lance and Bill's tool collects NetFlow data aggregated at the prefix level, and maps each prefix to the AS paths listed in the concomitant routing table. The presence of the entire routing table allows them to examine not only the best paths, but all possible ASes that could be used to get to that prefix. They demonstrated their own analysis of how to take one month's data for their own (Bill's) network and apply it to the next month's potential transit purchases to optimize transit costs.
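The prefix-to-AS mapping step might look roughly like the Python sketch below: for each AS, it totals the bytes destined to prefixes that have at least one known path through that AS. The data layout, the function name, and the sample numbers are hypothetical; the actual tool's internals were not described in this level of detail.

```python
from collections import defaultdict

def traffic_reachable_via(bytes_by_prefix, paths_by_prefix):
    """For each AS, sum the bytes to prefixes with at least one path through it."""
    totals = defaultdict(int)
    for prefix, nbytes in bytes_by_prefix.items():
        # Consider every path in the table, not only the best one.
        ases = {asn for path in paths_by_prefix.get(prefix, ()) for asn in path}
        for asn in ases:
            totals[asn] += nbytes
    return dict(totals)

# Invented flow totals and routing-table paths (documentation prefixes, private ASNs).
bytes_by_prefix = {"192.0.2.0/24": 700, "198.51.100.0/24": 300}
paths_by_prefix = {
    "192.0.2.0/24": [(64500, 64510), (64501, 64510)],
    "198.51.100.0/24": [(64500, 64511)],
}
totals = traffic_reachable_via(bytes_by_prefix, paths_by_prefix)
```

Here AS 64500 could carry all of the sampled traffic, while AS 64501 could carry only the 700 bytes bound for the first prefix, which is the kind of comparison a transit purchaser would want.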
We then transitioned into more theoretical topics, including simulation and graph theory of large Internet topologies. BJ Premore (Dartmouth College) implemented a BGP simulation module to better understand dynamic routing behavior and look at implementation tradeoffs and extensions. Using the well-established Scalable Simulator Framework (SSFnet http://www.ssfnet.org), a compositional approach to large network design with its own Domain Modeling Language (DML), BJ examined the effects of a black-holed router on the network, as well as convergence of BGP over a variety of topologies. The workshop participants expressed desire for a BGP-config to DML translator, which BJ agreed to implement.
Andre Broido and kc claffy (CAIDA) presented further work on the use of RouteViews to analyze global patterns and efficiency of the routing system. Andre presented statistics on the extent of the use of prepending (11%, mostly with the last AS, but a fair amount of prepending also occurs in the middle of AS paths). Andre introduced and demonstrated analysis using a new connectivity granularity: BGP atoms. A BGP atom is an equivalence class of prefixes that share the same set of AS paths. In calculating the combinatorial core of the AS graph represented by data in RouteViews, Andre reduced 80017 prefixes to 16521 atoms. Among other results, Andre also showed data indicating a power law relationship for the number of prefixes in an atom.
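Following the definition above, atoms can be computed by grouping prefixes on the set of AS paths seen for them. The Python sketch below does this and also includes a small check for prepending (the same AS repeated consecutively in a path). The input layout, function names, and sample data are illustrative assumptions, not CAIDA's code.

```python
from collections import defaultdict

def bgp_atoms(paths_by_prefix):
    """Group prefixes into atoms: classes that share the same set of AS paths."""
    atoms = defaultdict(list)
    for prefix, paths in paths_by_prefix.items():
        # frozenset makes the grouping independent of the order paths were seen in.
        atoms[frozenset(paths)].append(prefix)
    return list(atoms.values())

def has_prepending(path):
    """True if some AS appears twice in a row anywhere in the path."""
    return any(a == b for a, b in zip(path, path[1:]))

# Invented table: AS paths (as tuples) observed for each prefix across all peers.
rib = {
    "192.0.2.0/24": [(64500, 64501, 64502), (64503, 64502)],
    "198.51.100.0/24": [(64503, 64502), (64500, 64501, 64502)],
    "203.0.113.0/24": [(64500, 64501, 64502)],
}
atoms = bgp_atoms(rib)
```

The first two prefixes fall into one atom because they share both paths; the third forms its own atom because it is seen over only one of them.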
Nevil Brownlee offered us an interesting interlude on his measurements of the performance of the DNS root name servers and gTLD servers as seen from the edges of the network (the UCSD and Auckland campuses). Using the packet pair technique to passively monitor the response times between queries and responses seen on the link, he observed periodic packet loss that suggests regularly scheduled NSI zone rebuilds/reloads.
The final session of the program focused on a topic of tangential relevance but significant interest to the routing research community: techniques for accurate estimation of link bandwidths along an arbitrary Internet path. Constantinos Dovrolis (University of Wisconsin - Madison) described his spectral analysis techniques for estimating bottleneck bandwidth of IP paths, using dispersion between packet pairs or variation of RTTs for packets of different sizes. Constantinos' analysis distinguishes between bottleneck capacity and available bandwidth: the `narrow link' has the minimum capacity along the entire path, while the `tight link' is the one with the minimum available bandwidth along the entire path. Underestimation can result when other packets get in between the packets of a measurement train, extending the dispersion. Overestimation can result if high delays are experienced by the first packet. Congestion along the path affects the accuracy of the estimate.
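The narrow/tight distinction and the basic packet-pair relation (capacity is roughly packet size divided by dispersion) can be shown with a short Python sketch. It assumes available bandwidth is capacity times (1 - utilization), and the link names, capacities, and utilizations are invented; it does not attempt Dovrolis's statistical filtering of measurements.

```python
def packet_pair_capacity(packet_bits, dispersion_s):
    """Capacity estimate from one packet pair: bits divided by arrival spacing."""
    return packet_bits / dispersion_s

def narrow_and_tight(links):
    """links: list of (name, capacity_bps, utilization in [0, 1)).

    Returns (narrow link name, tight link name).
    """
    # Narrow link: minimum raw capacity along the path.
    narrow = min(links, key=lambda link: link[1])
    # Tight link: minimum available bandwidth, assumed to be capacity * (1 - utilization).
    tight = min(links, key=lambda link: link[1] * (1 - link[2]))
    return narrow[0], tight[0]

# Invented three-hop path: a busy fast link, an idle slow link, a half-loaded link.
path = [("A", 100e6, 0.95), ("B", 10e6, 0.10), ("C", 45e6, 0.50)]
narrow, tight = narrow_and_tight(path)

# A 1500-byte packet pair arriving 1.2 ms apart suggests roughly 10 Mb/s.
estimate = packet_pair_capacity(1500 * 8, 0.0012)
```

In this example the narrow link is B (10 Mb/s capacity) while the tight link is A (about 5 Mb/s available), showing that the two need not coincide.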
Bruce Mah (Cisco) and Allen Downey (Wellesley) discussed their tools, both followups to Van Jacobson's original pathchar, for which source code was never released. Bruce wrote pchar and Allen wrote clink, and they will be merging efforts in a project called netchar that will try to test these techniques among multiple pairs of nodes. Both tools use RTT measurements at various packet sizes in order to yield enough sample probes that experienced no queuing, using the TTL field to constrain the number of hops probed. Plotting of resulting RTT versus packet size yields a slope that represents the inverse of link bandwidth and a y-intercept that represents RTT. They use a variety of different curve-fitting techniques to handle noise. Bruce has implemented pchar in (a non-released) version of IOS. Finally, Andre Broido presented results of experiments using his bandwidth estimation technique.
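The slope-and-intercept idea can be sketched for a single link in Python: keep the minimum RTT seen at each packet size (assumed to have experienced no queuing), then fit a least-squares line through those minima. This is a simplified illustration with made-up samples, not the pchar or clink implementation; on a multi-hop path, a per-link figure would come from differencing the fits of successive hops.

```python
def fit_link(samples):
    """samples: iterable of (packet_size_bytes, rtt_seconds).

    Returns (bandwidth_bps, zero_size_rtt_s) from a line fitted to per-size minimum RTTs.
    """
    # Minimum filter: the smallest RTT per size is assumed to be queue-free.
    min_rtt = {}
    for size, rtt in samples:
        min_rtt[size] = min(rtt, min_rtt.get(size, rtt))
    xs = sorted(min_rtt)
    ys = [min_rtt[x] for x in xs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Ordinary least squares: slope is in seconds per byte.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return 8 / slope, intercept

# Made-up probes over a 10 Mb/s link with 2 ms base RTT, some delayed by queuing.
samples = []
for size in (100, 500, 1000, 1500):
    base = 0.002 + size * 8 / 10e6
    samples += [(size, base), (size, base + 0.003), (size, base + 0.0007)]
bandwidth, latency = fit_link(samples)
```

Because the queued samples are discarded by the minimum filter, the fit recovers approximately 10 Mb/s and 2 ms from this synthetic data; real measurements need many more probes per size for the minimum to be reliable.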
Highlights and Key Findings
There seemed strong consensus among the participants regarding what research topics were important to pursue, and what fundamental building blocks were appropriate to beginning such pursuit. Key conclusions and highlights:
- There are actually quite a few people collecting BGP data (tables, updates, etc.) and large scale IP topology data for various kinds of analysis. Researchers seem to be using this data effectively thus far; indeed it has become a vitally useful component of relevant Internet routing research. The data collection architectures do have limitations, mostly in terms of scope and functionality, and several workshop participants have committed to trying to improve their data collection architectures for use in routing modeling and analysis by the community.
- Acquiring complete data sets of other types, e.g., traffic matrices or performance data, is actually much more difficult, for operational reasons, and drawing inferences is imprecise but necessary for testing traffic engineering tools and techniques. More complete data sets in these areas would improve the verification ability and integrity of work in the field.
- We still need to better understand relationships between Internet size and routing tables, topology and policy, policy and traffic demands. There is growing concern about the complexity of the Internet (in terms of paths, topology, convergence properties, etc.) due to more complex routing policies (multihoming, backup, load-balancing, etc.). The use of multi-homing as a substitute for infrastructural resiliency in the core is putting an unexpected strain on the BGP routing system. With leaf behavior `allowing' the sub-optimal quality in the core to continue, the size of the global routing table will continue to grow exponentially, lending credence to Geoff's extrapolation that we will run out of AS numbers by 2005.
- The effects of CIDR on the growth of the BGP table have been outstanding, not only because of the initial impact in turning exponential growth into a linear growth trend, but also because CIDR was effective far longer than could have been reasonably expected in hindsight. The current growth factors at play in the BGP table are not easily susceptible to another round of CIDR deployment pressure within the operator community. It may be time to consider how to manage a BGP routing table which has millions of small entries, rather than the expectation of tens of thousands of larger entries.
- Bandwidth estimation techniques are still problematic and seem to require a measurement infrastructure for calibration (CAIDA hopes to offer the skitter infrastructure to help, to the extent viable). Several new techniques were proposed and bear further examination and testing on real Internet paths.
- How can we correlate and analyze massive routing, topology, and performance data sets to provide timely insight into both normal as well as anomalous Internet behavior?
- What analysis tools exist or are needed to aid correlation?
- How accurately does a typical core routing table reflect paths that traffic will actually take through the infrastructure? Can we account for all types of incongruities?
- Can we identify causes of trends in routing table growth, such as multi-homing at the edges?
- What mechanisms currently exist or could be developed to identify and track critical or vulnerable pieces of the infrastructure?
- How will routing analysis needs change for MPLS/TE infrastructure? What about optical (non-IP) switched core?
- Is there a way to scale a routing system while admitting the frequent changes of fine-grained policy?
- Are the denser meshes observed, including toward the edge, affecting convergence? Are there ways to use an even denser mesh without causing update storms and unstable intermediate states?
- Is there a way to express traffic engineering preferences in such a way that does not overload the BGP table with policy applied to specific prefix announcements?