The Cooperative Association for Internet Data Analysis
top problems of the Internet and what sysadmins and researchers can do to help
This outline distills kc claffy's slides from her Usenix LISA 2003 plenary keynote talk describing her vision of the current state of the Internet and the most acute problems it faces now and in the near future.
the significant problems we face cannot be solved by the same level of thinking that created them. -- Albert Einstein
outline
  1. Introduction, acknowledgements
  2. some context: what NSF (and other funding agencies) consider top problems
  3. underlying themes: `interactive complexity' and `space in between'
  4. list of top problems [i thought i should at least mention them]
  5. "sky is falling" vignettes - some reading recommendations
  6. why system and network administrators are/hold key[s] to solutions
    reminder to sysadmins:
    a) if you manage systems connected to the Internet, you're a network administrator
    b) add it to your business card so you remember

I. Introduction

  • what talk(s) i'm not going to give:
    - lofty things like telepresence, virtual reality, ubicomp, emergency response system
    - specific measurement and modeling priorities (see CAIDA web site for that talk)
  • what problems i will cover
    - those more relevant to sysadmins and netadmins
    - either because you can help or because you need to stay cognizant
    - 1-2 year not 5-10 years out [i am a big fan of far out but won't emphasize it in this talk]
acknowledgments/caveats
[this talk is in the `if you steal from one author, it's plagiarism; if you steal from many, it's research' category.]

this talk derives from dozens of conversations with a variety of experts

  • could kibitz for another 6 months but answers would likely change
    (actually that's ridiculously optimistic)
  • wrote down all contributing names so have them in the raw data, but those who spent hours discussing it:
    • Mike Lloyd and Sean Finn (RouteScience)
    • John Johnson (Nemertes)
    • some IAB, IESG, IETF folk
    • funding agents
    • network engineers, security folk
    • Internet researchers
  • did not ask [to my knowledge]:
    • spammers, hackers, layer1 folk, lawyers, teenagers, K12 teachers, librarians, FBI/CIA/NSA, RIAA, MPAA, John Ashcroft
      => so i don't claim to represent them


occasional theme of talk
  • [narrative reference to movie before sunrise (warning: chick flick)]
    • among most memorable notions in movie: "the space in between"
  • theme: system and network administrators are the ultimate 'space in between' (two user nodes)
    • exalted sense of that phrase
    • often the only intelligent glue holding two users together
      [we oversignificate protons and neutrons too, maybe because we understand them better
      than we understand the space in between]
    • critically important but underrepresented in policy, protocol, architectural areas
  • confession: if this talk comes across as a call to arms, i can live with that


II. Some context: a view from the National Science Foundation: grand challenges

  • In April 2003, NSF held a workshop on fundamental research in networking.
    • [Read their recommendations online and look for your name!]
  • Federal community takes prioritization of goals pretty seriously.
    • [and some of these people are damn good.]
  • I will discuss these grand challenges and innovations needed to support them.
  • [Note:
    • 1. goals may not always be prescient, or even sensible - but it's not because of indifference
    • 2. the cart is now officially before the horse.
    • 3. essential pieces are being confidently assumed from both directions ]


Reflections on past, present, and future of networking research suggest need for emphasis on:

  • radical innovation and paradigm shifts
  • multi-disciplinary aspects of fundamental research in networking
  • outside-the-box thinking, beyond the success of the Internet
    • avoid "network innovator's dilemma": being too involved in improving existing Internet technology
      to observe novel and disruptive technologies
  • increased reproducibility of networking research
    • ease burden of reproducible experiments on complex systems


[NSF view]: We should focus the research agenda on application-induced challenges:

  • user-focused:
    - robustness
    - transparency, ease of configuration
    - exploiting storage and processing capabilities
    - power consumption
  • network-centric
    - heterogeneity and scale
    - manageability
    - evolvability
There is a strong articulated sense from [several] funding agencies of `being done with the Internet',
as if the most important research problems there have been solved.


[NSF-articulated] meta-challenge:
``In the face of shortcomings of current Internet architecture, develop new network theories,
architectures, and methodologies that will facilitate the development and deployment
of the next generation of services and applications.''

Undeniable (although not verbatim) admission is that shortcomings of current architecture
all relate to the inability to manage (administer) it.

[kc: the plot thickens...where are folks we expect to manage it?]

[NSF-articulated] Internet research grand challenges
  1. Internet information theory
    - holy grail: Internet Erlang
    - for wireless too
  2. overlay networks
    - all getting along, even economically
  3. network economic theory (network markets)
    - more on this below
  4. resilient networking
    - more on this below
  5. sensorized universe
    - security, health, education, battlefield, traffic, law
    - holy grail: `ubiquitous safety net' (emergency response system)
    - less holy grail: big_brother_net
  6. virtual networks
    - VPNs, mobility, automatic team discovery/creation, cognitive networking
I find challenges (3) and (4) particularly important and will spend a few minutes on each.
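For readers who have not met the reference in challenge (1): telephone engineers have the Erlang B formula, which turns an offered load and a number of circuits into a blocking probability, and so tells you how much to provision. The Internet has no comparable formula, which is the `holy grail' in question. A minimal sketch of the telephony version follows; the function names are mine, the recurrence is the standard one.

```python
def erlang_b(offered_load: float, circuits: int) -> float:
    """Blocking probability for `offered_load` erlangs offered to `circuits` lines.

    Standard recurrence: B(E, 0) = 1,
    B(E, m) = E * B(E, m-1) / (m + E * B(E, m-1)).
    It avoids computing large factorials directly.
    """
    if offered_load < 0 or circuits < 0:
        raise ValueError("offered load and circuit count must be non-negative")
    blocking = 1.0
    for m in range(1, circuits + 1):
        blocking = offered_load * blocking / (m + offered_load * blocking)
    return blocking


def circuits_needed(offered_load: float, target_blocking: float) -> int:
    """Smallest circuit count that keeps blocking at or below the target."""
    if not 0 < target_blocking <= 1:
        raise ValueError("target blocking must be in (0, 1]")
    m = 0
    while erlang_b(offered_load, m) > target_blocking:
        m += 1
    return m
```

With 2 erlangs offered to 2 circuits, `erlang_b(2.0, 2)` works out to 0.4: two calls in five are blocked. Nothing this compact exists for sizing an IP network, wired or wireless.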

(3) network economic theory - the lack of coherent network economic theory is often invoked to explain failures of QoS, multicast, CDNs, or any Internet startup.

  • it is difficult to make progress with such an uninstrumented network
  • micropayments have failed a few times, but will likely get another chance
  • architectural bottleneck is often considered to be global network economics (theory and practice) rather than network technology
  • current economic model fails to capture potential utility of network
  • thus Internet can not support many potentially cool applications
  • we need to get some economists involved (this has been true for years)


(4) resilient networking - currently, network faults are a norm rather than exception. They are caused by component failures, human operational errors, software viruses, malicious attacks, etc. NSF articulated needs to support this challenge include:

  • system development tools that reduce frequency and severity of bugs
  • programming languages, environments, audit tools, runtime checking tools that reduce frequency and severity of config errors
  • understandable, deployable, and usable security
  • new approaches to the composition of modular elements
  • new approaches to federation
  • pervasive audit trails
  • self-adaptive systems (automatic detection of and response to attacks, route changes, errors)
That list looks dangerously close to something we need NOW.
[sysadmins: if you don't see your name above you're not paying attention]

III. Interactive complexity and the `space in between'

The meta challenge of conquering system complexity merits its own discussion, throughout which I will liberally reference Yale sociologist Charles Perrow's acutely relevant book Normal Accidents (latest edition 1999).
The subject is related to resilience as the book's focus is on large scale systems that are more tightly coupled than ever imagined. Perrow discusses interactive complexity, coupling, and catastrophic potential of such systems. Ironically, the Internet is not mentioned once in this book [but you'd never believe this guy hadn't configured a BGP session...]

Why we should hope Dr. Perrow writes a book on the Internet next:

"rule of thumb in production software about 70% of the code deals with error conditions and exceptions, and only the remaining 30% provides the functionality expected by users. In complex systems with billions of components, the portion of code devoted to functionality will decrease further; most code will deal with adaptation to dynamic change, not just error handling, but overall self-sustainment." -- NSF workshop, ibid.


Themes of book Normal Accidents:

  • the way the parts fit together, interact, is important
  • the dangerous accidents lie in the system, not in the components
    - the 'space in between'
  • air transport system works well
    - diverse interests and technological changes support one another
  • all human constructions are resistant to change
    - driven by private privileges and profit
  • book has fairly optimistic ending [well, depending on your perspective]
    - Y2K as analogy: huge global sociocultural momentum does make a difference
    - apparently, we don't have an 'imminent common globally destabilizing threat' right now [no one's more surprised than i...]


Quotations from the book:

"for some systems that have this kind of complexity, such as universities or research and development labs, the accident will not spread and be serious because there is a lot of slack available, and time to spare, and other ways to get things done."
[This was written in the mid 80s and doesn't describe many universities today.
nothing on the Internet can safely assume `time to spare'.]
"but suppose the system is also 'tightly coupled', that is, processes happen very fast and can't be turned off, the failed parts cannot be isolated from other parts, or there is no other way to keep the production going safely. then recovery from the initial disturbance is not possible; it will spread quickly and irretrievably for at least some time. indeed, operator action or the safety systems may make it worse, since for a time it is not known what the problem really is." -- p.5 normal accidents


  • [If that doesn't remind you of debugging BGP configs it is because you haven't done it.
    • most BGP admins i know are more "used to" than "understanding" configuration
    • too many people are configuring BGP for us to accept any less transparency than with driving a car
    • ok, maybe a truck. but definitely not a space shuttle. we abjectly lack a houston.]

Dr. Perrow further discusses complexity, coupling, and catastrophic potential as inspired by accidents like TMI, chemical plant mishaps, aircraft collisions (and also a postscript on Y2K written in 1999). He delineates four classes of victims affected by an accident: (1) operators, (2) users, (3) system outsiders, (4) future (fetus, future user).

There is always a balance of passive versus active risks. Passive risks occur when safety is beyond user control (such as on airlines, at concerts, in shopping malls) and generally someone else is making a profit. Using the Internet is a passive risk for most people. However, passive risk can be `conscious but not intentional risk' - the price of convenience. On the Internet many folks are not even conscious of the risks they take [as if sysadmins need to be told this].

How to minimize normal accidents? Tracking safety of a pervasive complex system is a daunting task:

  • safety records do not normalize for the number or expertise of participants
  • as 'user-friendly' technology reduces the risk, more users join the activity
  • but the clueful ones are already on, so the end result is that
    even as component safety increases, the accident level may not change!
  • the safer you make it, the lower clue threshold needed to operate, the safer you have to make it [perhaps i don't need to dwell here since if anyone knows that trying to compete with stupidity is a losing battle, it's system administrators]
  • see related dialectic between Internet robustness and complexity
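The arithmetic behind that list is worth seeing once. The numbers below are invented purely for illustration (they are not from Perrow or from any measurement): halve every existing user's risk, let a smaller but less expert cohort join, and the total accident count does not move.

```python
def expected_accidents(cohorts):
    """Sum of users * per-user accident rate over (users, rate) cohorts."""
    return sum(users * rate for users, rate in cohorts)


# Hypothetical figures, chosen only to make the point.
before = [(1000, 0.010)]                # 1000 clueful users
after = [(1000, 0.005), (500, 0.010)]   # safer for experts, plus 500 newcomers

total_before = expected_accidents(before)
total_after = expected_accidents(after)

# Per-user rate after the change: it drops even though the total does not.
rate_after = total_after / sum(users for users, _ in after)
```

Here the per-user rate falls by a third while the number of accidents stays the same, which is why a safety record that does not normalize for the number or expertise of participants tells you so little.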


normal accidents and `garbage can theory'
The garbage can theory (or bounded rationality) describes decision-making in highly ambiguous settings. Published by Cohen, March and Olsen in 1972 in the field of organizational behavior, the theory defines "organized anarchies" as irregular confluences of people, problems, solutions, uncertain circumstances, and choice opportunities. Decision makers move from one opportunity to the other relying on chance alignment of components and organizational demands.

[kc: I reckon that network research has its own similar garbage can theory, but we have studied naming a lot so use other labels. In the meantime we also have Dilbert as a mouthpiece for basic tenets. Actually, we need more Internet 'garbage can' science since we have not formalized a lot of our garbage yet. See Dave Plonka's LISA talk, CAIDA RFC1918 paper, redundant anycast traffic, etc. Also grep for 'garbage' in Bruce Sterling's NSF keynote talk [exceptionally worth reading anyway].]

A closing quote:
"catastrophes send us warning signals. this book has attempted to decode these signals: abandon this, it is beyond your capabilities; redesign this, regardless of short-run costs; regulate this, regardless of the imperfections of regulation. but like the operators of TMI [three-mile island] who could not conceive of the worst -- and thus could not see the disasters facing them -- we have misread these signals too often, reinterpreting them to fit our preconceptions. better training alone will not solve the problem, or promise that it won't happen again. worse yet, we may accept the preconception that military superiority and private profits are worth the risks. this book's decoding asserts that the problems are not with individual motives, individual errors, or even political ideologies. the signals come from systems, technological, and economic. they are systems that elites have constructed, and thus can be changed or abandoned." -- Charles Perrow, normal accidents, 1999

IV. Top problems of the Internet

  1. scalable configuration management
    • higher layer connectivity requirements are hard to express, manage, maintain, verify that they are still working, simulate, model
    • today's routing configuration languages are based on low-level mechanism, rather than operator intent
    • networks are configured at the element (or router) level, rather than as a single cohesive unit with well-defined policies and constraints
      - key network operations goals require tweaking configs in pursuit of desired indirect effect on the network (for example, traffic engineering, security)
    • usual mode of coping: monitor for things that break
      - not things that might break if you make a change
      - use Internet as a simulator ("`current best practices?' is that a band?")
    • lots of things to configure, even along one path: router, switch, load balancer, (NAT) host, OS, web server, application, database
    • configuration management is everywhere
    • word for the decade: -- abstraction

    Partially responsible for the current situation are:
    • trusting vendor defaults too much
    • putting up with vendor kitsch
    • the absence of trusted routing registry
    • an egregious lack of instrumentation
    • a moderate lack of clue
    • business constraints retrohacked into a system not designed for it [garbage can theory!]
    • inherent interactive complexity of the global distributed system
    • interdomain (BGP) routing system configuration -- I will discuss it later

    What can be done?[I am not claiming that we can definitely get there with incremental steps, only that we do not have an alternative at the moment.]
    • researchers: develop and use "higher level policy languages" (abstraction abstraction abstraction!)
    • sysadmins: help define the scope of your configuration needs
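To make `operator intent rather than low-level mechanism' concrete, here is a minimal sketch, not a real policy language. The intent is stated once for the whole network (the familiar rule that customer routes are exported to everyone, while peer and upstream routes go only to customers) and is then expanded mechanically into per-router, per-neighbor rules. The `Session` type, the rule text, and the router and AS names are all hypothetical; no vendor syntax is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    router: str   # router that holds the BGP session
    peer: str     # neighbor on the other end
    role: str     # "customer", "peer", or "upstream"


# Network-wide intent, written once:
# routes learned from <key> may be exported to neighbors whose role is in <value>.
EXPORT_INTENT = {
    "customer": {"customer", "peer", "upstream"},
    "peer": {"customer"},
    "upstream": {"customer"},
}


def compile_intent(sessions):
    """Expand the single intent table into per-router export rules."""
    configs = {}
    for s in sessions:
        rules = configs.setdefault(s.router, [])
        for learned_from, allowed_roles in sorted(EXPORT_INTENT.items()):
            action = "permit" if s.role in allowed_roles else "deny"
            rules.append(
                f"export to {s.peer}: {action} routes learned from {learned_from}"
            )
    return configs


sessions = [
    Session("r1", "AS64500", "upstream"),
    Session("r2", "AS64501", "customer"),
]
configs = compile_intent(sessions)
```

The point of the design is that a new session is one added line of data, and the question `does the network still do what I meant?' is asked of `EXPORT_INTENT`, not of every router's config in turn.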

  2. security - also known as authentication, availability, containment, DOS tracking, identification, privacy, robustness, resiliency, recovery, and threat analysis
    [It may also include spam depending on party - but we'll award spam separately.]

    Solutions to vastly [un]defined problems are inherently elusive. We have learned for certain that cryptographic algorithms and standards for authentication, security, and privacy are far ahead of our ability to deploy, administer, and use security systems.

    What is needed?
    • new specification techniques for security policies:
      - meaningful to system administrators and end-users
      - then security can be deployed in a way that meets user expectations
    • increased automation, rigorous analysis, baseline profiling data
    • self-configurable and self-healing systems
    • ISP cooperation (DOS traceback)

    [sysadmins: if you think that these problems will be solved by another community i encourage you to investigate further because whoever is solving them needs your help anyway.]

  3. end host patching and lack of wisdom in applying patches
    • patches can make problem worse, or break other things. If a patch does that, please tell your vendor...
      - example: code red -- people could not patch IIS without breaking realsecure, thus many did not patch it
    • 'default deny' is your friend -- at host level!
    • help develop or at least be aware of product liability laws
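For anyone who has only seen `default deny' on a perimeter firewall, this is the whole idea at host level, as a toy rule evaluator (the `Allow` type and `decide` function are illustrative names, not any real firewall's API): traffic is accepted only on an explicit match, so a service nobody meant to expose is not reachable while it waits for its patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allow:
    protocol: str
    port: int


def decide(protocol, port, allow_rules):
    """Return "accept" only on an explicit match; everything else is dropped."""
    for rule in allow_rules:
        if rule.protocol == protocol and rule.port == port:
            return "accept"
    return "drop"  # the default: nothing is reachable unless listed


# Hypothetical host policy: ssh and https only.
HOST_RULES = [Allow("tcp", 22), Allow("tcp", 443)]
```

With these rules `decide("tcp", 80, HOST_RULES)` drops the packet, whether or not something happens to be listening on port 80.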

    note: I will not push the genetic diversity argument as alternative 'safety':
    • sounds too much like 'security through obscurity' to me
    • unclear how much manageability would be sacrificed to get it:
      - already too much whack-a-mole in this field
      - fidelity.com (who handles about a billion dollars a day on the Internet) already can't handle my mozilla
    • if we espouse genetic diversity, we better espouse a hell of a lot of systemic investment in software testing
    • besides hey i'd run a monopoly OS too were it the best OS
      - although last month's monoculture paper suggests it might not be possible
    • many unixes use RPC and same BSD stack anyway

    Most importantly, OS diversity may be a good idea but it is no substitute for patch clue:
    • illegitimate botnetting is a big financially backed industry now
    • there is a serious income motivation to find holes (see rob thomas' aerobic NANOG talk, Oct 2003 meeting)
    • a few more OSes on the Internet would not diminish the catastrophic potential -- the kiddie scripts would just be longer

  4. knowing what's on your network
    How many site administrators:
    • run one of: flowscan, flowtools, netflow, autofocus, dnstop?
      - see also CAIDA's Internet Tools Taxonomy
      - the latest addition: UCSD CSE's AutoFocus
    • follow relevant R&D measurement activities and peer-reviewed tools?
      - IETF WGs (e.g., IPFIX), NANOG, SIGCOMM, IMC, PAM
    • work with researchers on tools and visualization techniques?
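The tools named above do far more than this, but the core of `knowing what's on your network' is aggregation over flow records. A minimal sketch, assuming flow records have already been exported and parsed into tuples (the `Flow` fields and the sample addresses are made up for illustration; this is not the format of any particular tool):

```python
from collections import Counter
from typing import NamedTuple


class Flow(NamedTuple):
    src: str
    dst: str
    dst_port: int
    octets: int


def top_talkers(flows, n=3):
    """Total bytes sent per source address, largest first."""
    totals = Counter()
    for flow in flows:
        totals[flow.src] += flow.octets
    return totals.most_common(n)


# Hypothetical records using documentation address space.
flows = [
    Flow("192.0.2.10", "198.51.100.5", 443, 1000),
    Flow("192.0.2.11", "198.51.100.5", 25, 300),
    Flow("192.0.2.10", "198.51.100.7", 53, 500),
]
```

Swapping `flow.src` for `flow.dst_port` gives a per-service view instead; either one is a baseline you want to have before the day something looks wrong.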

    Measurement is such an obvious win; it advances the ball on so many things:
    • capacity engineering
    • security
    • privacy
      - indirectly... teach users the realities of measurement.
      - then teach them to use ssh and pgp
    • provider integrity checks - the more grassroots measurements, the less likelihood of another irrational bubble
    • obligatory caveat: know the law
      - need measurement tools that help manage and secure your network without breaking the law
      - we need your help getting better laws

    caveat: I do not mean to imply that measuring the Internet in general works.
    • can't measure topology effectively in either direction, at any layer
    • can't track propagation of a BGP update across the Internet
      - so, how to build this theory we are so lame for not having? -- discouraging to academics
    • can't get router to give you its whole RIB, just FIB (best routes)
    • can't get precise one-way delay from two places on the Internet
    • can't get an hour of packets from the core
    • can't get accurate flow counts from the core
    • can't get anything from the core with real addresses in it
    • can't get topology of core
    • can't get accurate bandwidth or capacity info
      - not even along a path, much less per link
    • SNMP just an albatross (enough to inspire telco envy)
    • no 'why' tool: what's causing problem now?
    • privacy/legal issues deter research
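One item in that list, precise one-way delay, has a simple arithmetic reason behind it. A one-way measurement subtracts a timestamp from the sender's clock from a timestamp on the receiver's clock, so any offset between the two clocks lands directly in the result. A round trip uses the sender's clock for both timestamps, so the offset cancels, but then forward and reverse delay can no longer be told apart. A sketch with invented numbers:

```python
def measured_one_way_delay(true_delay_ms, receiver_clock_offset_ms):
    """Receive time (receiver's clock) minus send time (sender's clock)."""
    return true_delay_ms + receiver_clock_offset_ms


def measured_round_trip(forward_ms, reverse_ms, receiver_clock_offset_ms):
    """The offset enters once with each sign, so it cancels in the sum."""
    forward = measured_one_way_delay(forward_ms, receiver_clock_offset_ms)
    reverse = measured_one_way_delay(reverse_ms, -receiver_clock_offset_ms)
    return forward + reverse


# Hypothetical path: 30 ms out, 50 ms back, receiver clock 25 ms ahead.
forward_seen = measured_one_way_delay(30, 25)   # looks like 55 ms
reverse_seen = measured_one_way_delay(50, -25)  # looks like 25 ms
rtt_seen = measured_round_trip(30, 50, 25)      # 80 ms, correct
```

The measured values make the path look asymmetric in the wrong direction. Without well-synchronized clocks at both ends, there is no way to separate path asymmetry from clock offset.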

    Results of measurements are a meager shadow of careening ecosystem.
    [If you are not scared i am not explaining this right. ]

  5. spam
    I consider spam more of a user issue rather than an Internet issue.

    We are relying on some ad-hoc de facto messaging systems (SMTP, IM) that were never designed for corporate high integrity use. We need a new messaging infrastructure with built-in authentication instead. Note that some work in this direction is going on in IETF; please participate if you care at that level.

    In the meantime, current network-level cures are worse than the disease. Blocking traffic for content (done by an ISP or a sysadmin) is a dangerously slippery slope. This issue is becoming yet another arms race (along with p2p, firewalls, verisign wildcards). Sysadmins are in the valuable position of being able to mention these issues to those who might have other interests driving their behavior.

    What can be done to help?
    • give your users plenty of client side options for filtering
    • give operationally flavored input to IETF activities in this area
    • adjust expectations -- not to be confused with admitting defeat!
      - advertising has been no enemy to our free (not to mention cheap) press

  6. authentication
    I have mentioned authentication earlier under security and then again with spam. It is also often called 'the identity problem' (do not confuse it with 'anonymity' which is not on our problem list). Like spam, this problem is more of a user issue and it should be solved outside the architecture. Unfortunately, we still lack scalable, non-hierarchical trust models.

    Aside: Putting a "solution" label on pgp is ludicrous since pgp(/gpg) remains an egregious tech transfer failure. It is a perfect example of algorithms/standards far outpacing the community's ability to deploy/administer them. To wit, if I pgp an email to my security-conscious colleagues, it typically
    • adds a week to RTT
    • if they read it at all
    • and that's only if i manage to have a version-compatible key
    I admit that I do the same for ppt and doc files, but why punish ASCII? Why don't all mail clients support transparent authentication/encryption?

    Even for an adolescent industry this component fails the smell test, and we are a little beyond that now anyway. Teenage scars notwithstanding, we are capable of cooperation and should not indulge this splintering unless it offers some benefits. [if anyone defends the genetic diversity of the pgp landscape i will personally flog them.]

  7. Quality of service (QoS)
    QoS refers to mechanisms to differentiate performance based on application or network-operator requirements. It also means providing predictable or guaranteed performance to applications, sessions, or traffic aggregates.

    Innovations have emerged in several areas (such as packet scheduling, admission control, traffic shaping) and were successful in constrained scenarios (VOIP, empirical load-based capacity planning). With regard to interdomain traffic, QoS went as far as it could go technically without economic support (see discussion of network markets above). Like many research areas, a few years in the lab could have been saved by a few hours talking to a provider. An economically viable technology, not to mention a technologically viable economic substrate, is just not there.

    Also a factor: there is a huge sociocultural resistance to paying more. Users think the Internet should `just work' (they have seen it happen before, in many cases at no cost to them). Tools to separate and service differentiate topologies are now emerging in protocol specs, but market support is still insufficient.
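Of the mechanisms named above, traffic shaping is the easiest to show in a few lines. A token bucket is one common way to do it; the sketch below is a generic textbook version with illustrative names and parameters, not any vendor's implementation. Time is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Admit traffic at a sustained rate while allowing bounded bursts."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = float(burst_bytes)  # start with a full bucket
        self.last = 0.0

    def allow(self, packet_bytes, now):
        """Return True if the packet conforms at time `now` (seconds)."""
        # Refill for the time elapsed, never beyond the burst size.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False  # non-conforming: drop, queue, or mark it
```

With `TokenBucket(1000, 1500)`, a 1500-byte packet at t=0 passes, a second one at t=0.5 does not (only 500 tokens have accumulated), and one at t=1.5 passes again. The mechanism is the easy part; who pays for the conforming traffic across provider boundaries is the part that is still missing.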

  8. compromise of the e2e principle
    The general principle "do not replicate in lower layers what can be handled by higher layers" has taken a beating this decade, and it's still early yet... In its place has emerged a web of contracts to control what people are allowed to do with their packets. The saddest part of this situation is that we had a different solution (IPv6), but too little of it and too late. NATs, firewalls were demanded viscerally by the market -- and the same brilliant community ultimately brought you both, in some definition of parallel.

    Now we have the Internet [un]layering we deserve and it is a mess (e.g., IPSEC through NATs/firewalls).

    "sometimes the price of freedom is what freedom brings" -- reefer madness, Eric Schlosser
    We have failed you (sysadmins) by engineering our way toward the unsupportable. It might not have happened if more sys and net admins had been in some of those IETF WG meetings...

    We need to be realistic about where to go from here. IPv6 will be hard-pressed to revive e2e legitimacy (although it has its own believers). I do not think we are ever getting the e2e architectural assumption back. We need to think outside that box from now on and to realize that the right solution at time t might not actually be the right solution at time t+1. Note that it is not too late for admins to get [back] in on the fun.

  9. dumb network
    I call it 'dumb' as in 'mute' -- since it can not talk to us about its internal state:
    • can't tell us how much bandwidth it has
    • can't tell us why it changed its route
    • can't change a route because we want it to
    • can't tell us if it's being attacked
    For something built for communication, a network is pretty disappointingly uncommunicative. This deficit makes it hard to manage and to provision/engineer its growth. [no wonder we engineer blind so much of the time]

    On the other hand, there is 'dumb' and there is 'idiotic'. A routing architecture that requires humans in various NOCs to tweak link weights for good performance can not have been "plan A".
    We need greater Internet transparency (see G. Armitage's recent talk on `making the Internet go away').

    Note that this whole problem largely goes back to measurement. The energy we have invested in measurement of this infrastructure is far less than has been invested in any other aspect of this infrastructure, and now we are wondering why we are having a hard time getting a handle on it. (kc: not that i'm bitter..)

  10. robust scalability of routing system

    This problem is closely related to configuration management.

    Primary factors in routing evolution are:

    • relative cost-performance of communication, computation, and human brains
    • tradeoff between fast convergence and stability for current IGPs
      • timers limit the effect of external instability at the expense of increased convergence time
      • it is hard to get data to do real studies/analysis to discern real from artificially imposed instability
      • better damping algorithms remain elusive
    • researchers & sysadmins can help optimize navigation of routing trends [less hope of changing them]
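The convergence-versus-stability tradeoff shows up clearly in penalty-based route flap damping, the style of damping used with BGP. The sketch below is a simplified model: each flap adds a fixed penalty, the penalty decays exponentially, and the route is suppressed above one threshold and reused below another. The class name and the parameter values are illustrative assumptions, not recommendations.

```python
class FlapDamper:
    """Simplified exponential-decay flap damping for a single route."""

    def __init__(self, penalty_per_flap=1000.0, suppress_at=2000.0,
                 reuse_at=750.0, half_life_s=900.0):
        self.penalty_per_flap = penalty_per_flap
        self.suppress_at = suppress_at
        self.reuse_at = reuse_at
        self.half_life_s = half_life_s
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # The penalty halves every half_life_s seconds.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life_s)
        self.last_update = now

    def flap(self, now):
        """Record one flap (withdraw/re-announce) at time `now` in seconds."""
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty >= self.suppress_at:
            self.suppressed = True

    def is_suppressed(self, now):
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse_at:
            self.suppressed = False
        return self.suppressed
```

With these values, three flaps one minute apart suppress the route, and it stays suppressed for roughly half an hour after the last flap even if it is perfectly stable from then on. That delay is the price of stability, and choosing thresholds that punish real instability without punishing ordinary convergence is the elusive part.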



    Worse news: we really do not understand the design space,
    while problems with the current routing have not even begun.
    • routing architecture stagnates (unless you count the hacks)
    • there is no way to judge success or failure of proposed architecture, or to verify operational integrity
    • any change sufficiently ambitious to address problems is also sufficiently ominous
      to scare any vested interest in those organizations whose support is required
    • BGP has no mechanism to route around saturated chunks of core
      - core Internet chunks operate for weeks/months at/near capacity
    • too much manual tweaking is going on to justify an assumption that all hell will not break loose at some point
    • Even scarier: Proposed overloading of BGP infrastructure to distribute "non-routing" information (BGPVPN -- auto-discovery mechanisms for Layer 3) is not particularly comforting since it means adding even more responsibility to a system we do not really understand.


    This situation actually gets pretty ominous. Many researchers believe that the routing system may find a state of non-convergence that is so disruptive as to bring down large portions of the Internet. We talk about malice, but the more frightening truth is that we are not sure a typo could not accomplish the same thing.
    • we can not even trace back DOS attacks
    • debugging routing problems remains black art
    • routing protocols interact with each other in 'interesting' (non-understood, sometimes nondeterministic) ways
    • intelligent routing throws a wrench into the melting pot
    • scalability and robustness require even more 'damage control' complexity
    • perfect 'normal accident' (a possibility suggested by a BGP expert Tim Griffin):
      • no single ISP will be able to identify and debug the problem
      • it will take days to fix and cost the world economy billions of dollars
      • the press will learn that the Internet engineering community had known about this lurking problem all along....
    • For a front-row seat at a melange of finger pointing, keep an eye on the Internet routing system.
    • I promise you (sys & net admins) will not be left out.
    • In the meantime, you have an enviable job security and an excruciating (if not impossible) job.
      • 3600 RFCs later, and your job gets harder rather than easier each day.
      • What is wrong with this picture?
      • RFC used to stand for something...


    Consensus is difficult to get even (especially) among routing experts:
    (Akamai researcher Bruce Maggs, October 2003 routing workshop)
    1. Where (if anywhere) is the congestion in the Internet?
    2. How much capacity does the Internet have, and how fast is it growing?
    3. How much traffic does the Internet core carry and what does it look like?
    4. How fast is network traffic growing?
    5. What will traffic patterns look like five years from now?
    6. Can we scale the network to support the demands of users five years from now?
    7. How much does/will it cost to increase network capacity?
    8. Will stub networks soon be employing sophisticated traffic engineering mechanisms on their own, e.g., those based on multihoming and overlay routing? What impact might these techniques have?
    9. What fraction of traffic are CDNs carrying? What effects derive from DNS tricks to route traffic?

    What is needed? (courtesy of Tim Griffin, Phil Karn)
    1. defined routing policy languages guaranteed to be globally sane no matter what local policies are defined,
      and BGP speakers must be forced to use them (i.e., MUST standardize)
    2. give user control of packets to/from his/her own IP address
      (rather than clumsy, brittle firewalls not understood by the ISP's phone support anyway)
    3. open standard for the secure remote control of a generic packet-filtering firewall
    Research questions:
    1. is it possible to design such languages and protocols?
    2. how can we find the right balance between local policy expressiveness and global sanity?
    3. what exactly do we mean by "autonomy" of routing policy?
    4. do we need additional protocols to enforce global sanity conditions?
    5. how can we enforce compliance of policy language usage?
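Why `global sanity' is a real question and not a rhetorical one: locally reasonable policies can combine into a system with no stable routing at all. The toy below is my own simplification in the spirit of the `bad gadget' examples from the literature on BGP policy interaction. Three ASes (1, 2, 3) each have a direct route to AS 0, but each prefers to go through its clockwise neighbor, and that path exists only while the neighbor itself uses its direct route. All names are illustrative.

```python
from itertools import product

NEXT = {1: 2, 2: 3, 3: 1}  # each AS prefers to reach AS 0 via this neighbor
DIRECT, VIA = "direct", "via-neighbor"


def best_choice(node, choice):
    """Best route for `node` given the others' current choices.

    The preferred path [node, NEXT[node], 0] is available only while
    the neighbor is itself using its direct route to 0.
    """
    return VIA if choice[NEXT[node]] == DIRECT else DIRECT


def stable_states():
    """All assignments in which every AS already has its best route."""
    states = []
    for combo in product((DIRECT, VIA), repeat=3):
        choice = dict(zip((1, 2, 3), combo))
        if all(choice[n] == best_choice(n, choice) for n in choice):
            states.append(choice)
    return states


def simulate(steps=9):
    """Let one AS at a time (round robin) re-run its local decision."""
    choice = {1: DIRECT, 2: DIRECT, 3: DIRECT}
    history = []
    for step in range(steps):
        node = step % 3 + 1
        choice[node] = best_choice(node, choice)
        history.append(dict(choice))
    return history
```

`stable_states()` comes back empty: of the eight possible assignments, none leaves all three ASes satisfied, so `simulate` never settles however long it runs. No single operator did anything wrong, which is exactly why research questions 1 and 2 above are hard.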

    What can sysadmins do now?
    • most important: do not assume we have this under control
    • read through some of geoff huston's work -- great introduction
    • get involved with the IETF:
      • ask if new routing products/services make things better or worse for the commons
      • ask why the underlying architecture does not obviate the need for hacks
      • management willing [i am aware that many of you are already doing N FTE's of work]
      • strategic involvement [IETF can be imprudent use of your time. get experienced mentor]
    • use routeviews.org to look at the routing system -- it was built for you
    • document your own topology internally: do it often, and at more than one layer
    • work with researchers, give them your operational insight wrt data analysis, visualization
      - lots of them are looking for good problems and you definitely have some
      - consider it a chance to save them a few years of irrelevant research
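    One way to make "document your topology often, and at more than one layer" concrete is to keep dated snapshots and diff them. The sketch below is only an illustration: the snapshot format (a dict mapping a layer name to a set of links), the device names, and the function names are all hypothetical, not taken from the talk or from any particular tool.

```python
# Hypothetical snapshot format: {layer_name: set of (node_a, node_b) links}.
def normalize(links):
    """Treat links as undirected by sorting each endpoint pair."""
    return {tuple(sorted(link)) for link in links}


def diff_topology(old, new):
    """Return, per layer, the links added and removed between two snapshots."""
    changes = {}
    for layer in sorted(set(old) | set(new)):
        before = normalize(old.get(layer, set()))
        after = normalize(new.get(layer, set()))
        added, removed = after - before, before - after
        if added or removed:
            changes[layer] = {"added": sorted(added), "removed": sorted(removed)}
    return changes


# Made-up example data: two snapshots taken a week apart.
last_week = {
    "layer2": {("sw1", "sw2"), ("sw2", "sw3")},
    "layer3": {("r1", "r2")},
}
today = {
    "layer2": {("sw2", "sw1"), ("sw3", "sw4")},
    "layer3": {("r1", "r2")},
}

report = diff_topology(last_week, today)
print(report)
```

    Layers with no changes are left out of the report, so an empty result means the two snapshots agree at every layer recorded.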
  11. normal accidents (not just a recommended book anymore)
    Accidents in the Internet are waiting to happen.
    • hard coded IP addresses are wreaking havoc (see examples in Dave Plonka's talk)
    • DNS is struggling with inability to evaluate macroscopic performance
      - caida's rfc1918 paper, effect of anycast traffic, etc.
    • it is common now to deploy half a million homogeneous Internet hosts (low price point)
      thus causing a dramatic change in Internet OS landscape
      - what happens when each bic lighter has an IP address?
    • market pressure forestalls adequate testing [not that we even know what that means],
      rendering testing for tomorrow's Internet an intractable task
    • there is no body to define and enforce specification and conformance with RFC-defined standards by designers/manufacturers/vendors
    • no Underwriters Laboratories (ul.com) for things that talk to the Internet
      who would also fight back when needed measurement functionality is unsupported
      - who would take this on?
      - if this does not happen on its own, will some www.dhs.gov spirit force it?
    This problem is important because when the industry recovers (including from its post traumatic stress disorder) these normal accidents will increase in number and ramifications (cost). The stakes grow monotonically.
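
    The hard coded IP address problem listed above is one that a sysadmin can at least audit locally. The following is a minimal sketch, assuming configuration is available as plain text; the function name, the regular expression, and the sample configuration are made up for illustration, and the scan covers IPv4 literals only.

```python
import ipaddress
import re

# Candidate dotted quads; ipaddress does the real validation below.
IPV4_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def find_hardcoded_ipv4(text):
    """Return (line_number, address) pairs for valid IPv4 literals in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in IPV4_CANDIDATE.finditer(line):
            try:
                ipaddress.IPv4Address(match.group())
            except ValueError:
                continue  # out-of-range octets, so not an address
            hits.append((lineno, match.group()))
    return hits


# Hypothetical configuration text used only to exercise the function.
sample_config = """\
ntp_server = 192.0.2.10
log_host = loghost.example.net
version = 1.2.3.4.5
backup = 999.1.1.1
"""

hits = find_hardcoded_ipv4(sample_config)
print(hits)
```

    The lookarounds keep longer dotted strings such as version numbers from being reported, at the cost of missing addresses written in unusual forms.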

    No rigorous study exists of root causes of Internet performance problems/outages.
    Anecdotal survey (courtesy of Sean Donelan's NANOG post):
    1. network engineers (what does this command do?)
    2. power failures (what does this switch do?)
    3. cable cuts (backhoes, enough said)
    4. hardware failures (what is that smell?)
    5. congestion (more bandwidth! Captain, I'm giving you all she's got!)
    6. attacks (malicious, you know who you are)
    7. software bugs (your call is very important to us....)
    "knowing what we don't know" offers little comfort
    What can be done to help?
    • need labs mimicking your infrastructure (smaller bandwidth is ok)
      - work with your vendor, e.g., Cisco provides labs for software upgrades
    • collaborative efforts among operational and research communities: isc, pch, routeviews, ripe, nlnet, caida
  12. intellectual property and digital rights
    • mostly out of scope for sysadmins and researchers (FCC must get involved)
    • but this issue is too hot not to touch
    • Lessig covers it quite well in his books
      • Code and Other Laws of Cyberspace and The Future of Ideas.
    • The Internet is not a dichotomy between commercialization and open standards,
      but rather a trichotomy, with the third piece being regulation, property rights and protection.
    • Sysadmins will be in a position to read subpoenas from RIAA (some of you already have).
    • The instantaneous sharing is both the best and the worst thing about the Internet.
    • The biggest threat is the "downside of the upside".


    What can be done to help?
    The solution is not blocked on technology (or computer science, or sysadmins), but on our lack of social consensus on how to incorporate the reality of digital ubicopyright into our model for appropriate human interaction.

    Note: It is not just about artists. We have not sorted out the impact of a networked society in general, including the increase in effective value of in-person contact (a neglected side effect).

    So, read Lessig and talk/write to/for legislators. Tell them what you know or else they will decide without us.

  13. governance

    We covered this yesterday in Vixie's talk.

    Shared resources need global administration: universally agreed allocation of addresses, ASes, domain names, and protocol numbers. [aside: mention Bill Manning's proposed multiple-NATed Internet: backbones use the RFC1918 space, all customers use the rest of IPv4 space as private space.]
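
    For readers who do not carry the RFC1918 blocks in their heads, the split that the aside describes can be sketched as a simple membership test. This is only an illustration: the three address blocks are the ones RFC 1918 defines, but the function name and the test addresses are hypothetical, and nothing here models the proposal itself.

```python
import ipaddress

# The three private address blocks defined by RFC 1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """Return True if the address falls inside one of the RFC 1918 blocks."""
    addr = ipaddress.ip_address(address)
    return any(addr in block for block in RFC1918_BLOCKS)


for candidate in ["10.1.2.3", "172.31.255.255", "172.32.0.1", "192.0.2.1"]:
    print(candidate, is_rfc1918(candidate))
```

    The blocks are listed explicitly because the standard library's `is_private` attribute covers more than RFC 1918 space (loopback and link-local ranges, for example).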

    • Operating heavy machinery under the intoxicating influence of mind bending revenue potential (like .com) should be considered harmful.
    • Heaven forfend the DNS root system.
    • SiteFinder and countermeasures fall into cyberwarlordism category.
    • Can we (socially) increase the set of parties (besides shareholders) that an Internet company considers a constituency?
    • But at least SiteFinder put the 'steward' vs 'owner' issue on our kitchen table -- where it belongs!


    What can be done to help?

    • It may sound a little old by now but: participate!
    • ICANN policy mailing lists, www.icannwatch.org, www.arin.net
    • Join local ISOC, go to ARIN and ICANN meetings (all open).
    • The policy process has been taken away from us less than we think (still more than we wish, but it is no excuse to withdraw).
    • No proposed option has been better. (Feel free to fix this too.)



  14. growth in traffic and user expectations
    • bandwidth budgets are frozen, but application designer creativity is not
      - VOIP, IM, P2P, streaming
    • data center consolidation
      - moves servers away from users, often putting traffic across the WAN
    • more users!

    User expectations (and expectations about those users):
    • users are abstracting much faster than we can
      - they know all this stuff should work any day now...
    • they are ready to configure things we do not even know how to describe, much less support
      - e.g., `please give me this much quality of service for this long'
    • they have been promised `tennis racket' transparency by the media
    • right now we have the recession as an excuse: no capital for infrastructure expansion
    • as recession subsides, user expectations will increase

    confession: I am not going to offer solutions to these problems because I rather like having them.
    Why should anyone miss out on the Internet?

  15. interprovider and vendor/business coordination

    Companies avoid publicizing a vulnerability of their own infrastructure. However, silence renders the overall system more vulnerable.

    What can be done to help?
    • need measurement repositories for data to support debugging
    • make friends with a researcher or a provider
    • refine requirements, approaches, and cost models
    • do not underestimate the value of cross-fertilizing your brains
    • avoid government accusations of antitrust activity by including them
      Department of Homeland Security will make this 'easier'
  16. time management and prioritization of tasks

    Networking is a strongly interrupt-driven field, but sometimes we need allocated time just to think and to plan strategically rather than tactically. Unfortunately, this problem pervades all levels of Internet design [i hear other fields have it too].

    Note: `overspecialization' was also mentioned as a top problem, e.g., `solaris OS team does not upgrade solaris NICs, that is networking's job'. Undoubtedly, this approach is a large company phenomenon. It also involves an inherent tension between automation and job security.


enough problems -- let's talk doomsday
four perspectives that adequately cover the doomsday memes
(you need not read anything else on this issue for a year):
  1. "the Internet is dying" -- Karl Auerbach's provocative article
    • between spam, anti-spam blacklists, rogue packets, never-forgetting search engines, viruses, old machines, bad regulatory bodies, and bad implementations
    • Internet will lose half its users in 6 months [i know some of you would not consider this a problem]
    • in its place a much more controlled approved set of communications will appear
    • lesson 1: do not run tcpdump if you do not want to get depressed -- most of it IS garbage
      - last i checked most TV was garbage too. Is it losing users? or do we get smart technologies to help us use it?
  2. "digital imprimatur" -- John Walker
    • "how big brother and big media can put the Internet genie back in the bottle"
    • rich `optimistic pessimism'
  3. Larry Lessig's Code and Other Laws of Cyberspace and The Future of Ideas
    • by leaving policy to the policy folks your future derives directly from their clue level
      - we need to own up to that
    • most optimistic pessimist lawyer on my bookshelf
  4. Bruce Sterling keynote at NSF workshop, feb 2002
    • all SF writers are optimistic pessimists so that is not his accomplishment
    • his writing is... exceptional
    • ubicomp, and ultrawideband, and machines-building-machines are his messiahs
    • he does not hold back against the computer industry [cute!]
An obligatory quote from Castells (writing a trilogy, quote from the 1st volume below):
"It is the beginning of a new existence, and indeed the beginning of a new age, the information age, marked by the autonomy of culture vis-a-vis the material bases of our existence. But this is not necessarily an exhilarating moment. Because, alone at last in our human world, we shall have to look at ourselves in the mirror of historical reality. And we may not like the vision." -- Manuel Castells, The Rise of the Network Society, vol. 1, p. 478
It is not the easiest reading, but we all should read it anyway.

caveat: i am a short-term optimist and a long-term optimist but a medium-term pessimist.
which means life on the Internet is going to get harder before it gets easier.


IV. Conclusions - so, what now?

Most important virtues to shepherd: patience, persistence, and perspective
  • Bruce Sterling described these in 1994 as the most important virtues for the computer industry to embrace.
  • Normally, these are 'somewhat dull virtues' and 'the most difficult to manifest from within a revolution'.
    (If you think the revolution is over, you are not paying attention.)
  • Note that you are all shouldering generations of neglected architectural responsibilities [i did apologize for that already...].
  • Politically minded disposition never hurts.
Awareness of your role
  • sysadmins are a key (and unsafely ignored) channel between R&D community and the real world
    - shrapnel-closest to real problems
    - hard-earned intuition and insight that the rest of us do not have
  • consider participating in the policy process that will render you cogs of big brother if you do not
    - e.g., IETF wiretapping issue in February 2002
  • get involved with policy organizations
    - you are dangerously underrepresented in NANOG, ARIN, IETF
    - all these people are playing with layers that you have to manage
    - and they are people sometimes more loyal to interests other than the Internet
  • educate your management
  • check in with the research community every so often
    - go to research workshops and conferences
    - most of us hit the bar every so often so you probably do not have to walk that far
    - and we need good relevant problems
  • also, reminder: continually watch your network
    [still be confused but on a higher level]

USENIX/LISA: expand your role as an operations research forum
  • more peer-reviewed research
  • court the traditional network research community
  • court the operational networking community
  • joint workshops on configuration management problems
  • operations folks can give feedback on how research ideas interface with current operational reality
    - save researchers a few years

[Sorry for the empowerment speak.
I would not get so lofty if you did not hold the ring of power, Frodo!
At least I hope you have it,
because I have asked around and nobody else thinks they have it...]

The Internet has done a phenomenal job at dramatically reducing the space in between people who want to communicate.
The next step is to reduce the space between the people who want to communicate and the Internet --
and that is an increasing proportion of your job today.
  • mind the gap that will continue to increase in the future
  • if that means making parts of your job unnecessary, we promise we'll make more messy technology
  • making parts of your job unnecessary should be your overriding professional concern for this decade
concluding thought: if sysadmin job security ever becomes a top problem of the Internet,
i promise i'll give another invited talk on what you should do next

"Disorder increases with time because we measure time in the direction in which disorder increases."
-- Stephen Hawking

Appendix. Trilogy of action for scientists and engineers

For 'seekers of the larger view':
  • draw together pieces of science and technology to create a system, whether that system is xerography, telegraphy or steam navigation
  • find the economic feasibility for a new technology by virtue of a wide grasp of the worlds of man and matter
  • reach harmony through intuition by meditating on deep knowledge of the field so as to arrive at a new result
  • build a model: a simplified representation of the problem, subject to experimental analysis
  • serve as a science-technologist generalist, who, many times/year, extracts the missing point out of a complicated situation
  • make decisions or help others make decisions by imaginative interaction with alternatives calculated as consequent on those decisions
-- John Archibald Wheeler
  Last Modified: Mon Mar-6-2006 15:20:24 PDT
  Page URL: http://www.caida.org/publications/presentations/2003/netproblems_lisa03/topten.xml