ISMA Data Catalog 2004 Workshop Report
On June 3, 2004, CAIDA hosted its 11th Internet Statistics and
Metrics Analysis workshop. This workshop focused on evaluating the design for an
Internet Measurement Data Catalog. We invited a group of researchers
spanning both active producers of network data and ardent consumers
of available data. We asked participants to discuss their existing
Internet data sets, their policies for sharing that data, and their
methods of dataset management and distribution.
In this report, we present our goals for the workshop, the key findings
and future work that resulted from the discussion. For those who wish to
delve more deeply into the workshop proceedings, we include
informal summaries of meeting presentations with links to
actual slides in .pdf format.
Introduction/Goals
The Internet infrastructure currently has no framework for system-level
studies of wide-area, cross-domain Internet traffic behavior. As a result,
neither representative longitudinal
analysis of macroscopic workload trends nor sound preparation
for the growing expectations of Internet users is possible today.
While measurement data is difficult to collect and distribute, the
largest problem impeding the progress of computer network research
is a lack of relevant data for testing new theories and technologies.
While we encourage additional data sets and collection projects,
simply making researchers aware of data that already exists will
improve both the quality and the quantity of research performed
today. To this end, CAIDA is designing an Internet Measurement
Data Catalog (IMDC) to organize the heterogeneous datasets (both
publicly accessible and restricted usage) into a database that
researchers can query to find relevant data to support their work.
We provide annotation capabilities for researchers so that bugs,
novel features, and other information about datasets can be shared
by investigators with experience using a particular dataset. In
addition to providing the fodder for new inquiries, the IMDC will
also facilitate robust science by documenting exactly the data used
in a study in a way that allows others to reproduce published
results.
Our goal is to create a database architecture and annotation system
that will accommodate the diversity of existing and future data
sets. We will provide both a web-based user-interface for users
to query the database and an API to allow trusted parties to automate
contribution of catalog entries for their data sets.
The Internet Measurement Data Catalog will not actually store copies
of the datasets, in the same way that a mail order catalog is distinct
from the warehouse; we provide a clearinghouse of information about
data available elsewhere, complete with Acceptable Use Policy
information and access instructions.
The main objectives of the workshop were:
- present our IMDC design to the research community
- get feedback on the compatibility of the proposed
architecture with participants' data production and usage practices
- identify remaining problems with data
catalog creation and distribution that require further research
Highlights and Key Findings
Workshop participants enthusiastically welcomed the idea of a publicly
available catalog of Internet measurement data. Such a catalog, if populated
with relevant high quality data sets and properly maintained, would greatly
benefit the Internet research community. It would
advance the reproducibility of analyses and results, enable
longitudinal and cross-discipline studies of the Internet, and
open up new cross-domain areas of networking research. The participants strongly encouraged
the continuation of CAIDA's IMDC work and of NSF support for this project.
Over the course of the workshop, the participants made the following
suggestions for improved design and future features of the IMDC:
- implement some version of "derived-from" DataD modifier
- consider possibility of cataloging scripts with data
- add some scoring system to distinguish "good" data from "bad" data
- add an indicator showing if a given data item is "freely and easily"
obtainable or not
- add "warning" to the list of standard annotations
- add "smart IDs" to the database (URL-like) for further use or, at least, for
citation purposes
- provide a mechanism for continuous addition of data (e.g., a trace
every day)
- enable the simplest version of search, "single-box"
- provide Google-like search
- implement an option to create an output in XML format
- keep track of IMDC usage and display plots of relevant statistics
vs. time
- implement automatic e-mail notification of interested users
- release the database code to other groups for their internal use
The participants identified convenience of use and the security and reliability of catalog information
as the two main conditions for widespread use and general popularity of the
catalog. The following areas in the catalog design raised concerns:
- find ways to compress/automatize display of search results (do not
show a search result of 10,000 nearly identical entries)
- deal carefully with public/private information for Contacts
- find ways to prevent pollution of the catalog with useless/poor
data
- provide convenient means to insert catalog entries for existing
voluminous data sets into the catalog and to export search results
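One possible way to avoid displaying thousands of nearly identical entries is to collapse results whose names differ only in a numeric field, such as a date or sequence number. The Python sketch below is a hypothetical illustration, not part of the IMDC design; the function name `collapse_results` and the sample file names are assumptions.

```python
import re
from collections import defaultdict


def collapse_results(names):
    """Group entry names that differ only in their runs of digits."""
    groups = defaultdict(list)
    for name in names:
        # Replace every run of digits with '#' to form the grouping key.
        key = re.sub(r"\d+", "#", name)
        groups[key].append(name)
    return dict(groups)


# Hypothetical search results.
results = ["trace-20040601.pcap", "trace-20040602.pcap", "bgp-rib-01.gz"]
# One summary line per group: the pattern and how many entries match it.
summary = {key: len(members) for key, members in collapse_results(results).items()}
```

A search page could then show one line per pattern with a count, and expand a group only on request.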
The participants concluded that making data used in publications available to
other researchers should become an integral part of the Internet research process.
Future Work:
Workshop Session Notes
Internet Measurement Data Catalog
-
kc claffy (CAIDA) - [pdf slides]
-
introduced IMDC. This database is one of the tasks of a
3-year project "Correlating Heterogeneous Measurement Data to
Achieve System-Level Analysis of Internet Traffic Trends"
funded by NSF. The project is currently at the end of the 2nd
year.
Internet measurement is rife with challenges and obstacles.
Scientifically rigorous monitoring, or even instrumentation,
was not a high priority in the post-NSFnet Internet. The data
that the research community does use are disparate, incoherent,
limited in scope, unindexed, and sometimes proprietary. There
is a widely recognized need for globally relevant measurements,
including rational architectures for data collection, and
hardware support for monitoring high speed links.
Informational science or "data about data" has become an
essential NSF goal. This project is about developing meta-data
and annotations to describe the data. The mission is to help
researchers share data and streamline future collections. Well
managed meta-data should include: how the data were collected, by
whom, and when; where they are saved; access policies; format;
packaging; and compression.
-
Colleen Shannon (CAIDA)
-
talked about motivation and challenges of IMDC. There are
lots of data out there (traces, routing tables, traceroutes,
security, names, geographic) of variable quality and research
importance. The main goal of IMDC is to provide an easy way
for users to find data and for contributors to publish their
data. There is a perpetual conflict between these two
communities. Users want perfect (100% complete, 100% accurate)
freely available data. Contributors do not want to (or cannot)
spend time and effort on disseminating their data (and often
are not funded for this task either).
IMDC design goals include: (1) flexible framework for
contributors; (2) good search capabilities (both simple and
sophisticated modes); and (3) ability to share information
discovered in data and to correct wrong information.
Design principles include: (1) be ambitious, anticipate
possible future uses; (2) start with simple implementation and
build it up in steps; and (3) provide multiple access modes as
necessary.
-
David Moore (CAIDA)
-
presented the central concepts and the overall architecture
of the IMDC database and demonstrated the currently built
prototype.
The focus of IMDC is on helping users find data. Below we
list the main types of objects in the database. Fields common
to all objects in the database are: creator (of the data);
contributor (who actually puts the data in the database); creation time;
modification time.
-
Data Descriptor (DD):
- the central conceptual object (atomic entity) of the
data catalog. A data descriptor represents a single
file containing data that resides on a computer
somewhere in the world. A single data descriptor is
used to reference all copies of the data item, even
copies on disparate computers at different sites. DD
fields are: name; description - long, short; URL;
keywords; file size; format; location - geographic,
net, logistic; platform; time period - start, end, time zone
offset, time zone name; creation process*; MD5 hash (to detect
duplicates, to check for corruption).
* - The creation process will be a text field until we
gain better understanding of what people might want to
put here. It may indicate that data was derived from
other data.
-
Format Descriptor (FD):
- points to information about file formats. FD contains:
name, description, keywords, package or data format,
type (ASCII/binary/mixed), file suffixes.
-
Package Descriptor (PD):
- physical grouping of one or more data files, can be
thought of as a downloadable unit. A package may have
multiple data files, and a data file may be in multiple
packages. PD fields are: name, description, keywords,
file size, format ID, MD5 hash, linkage to contained
DD/PD via a path.
-
Location Descriptor (LD):
- tells how to actually fetch some data. Packages may be
available from multiple locations, but not all packages
will be directly available. LD fields: download URL,
download procedure (includes AUPs), geographic location
of server.
-
Contact Descriptor (CD):
- human component of the database. CD fields are: login,
password, name, description (long, short, URL), email
(hideable), phone (hideable), address (hideable),
country (hideable), organization, research interests.
-
Tool and ToolSet Descriptors (TD):
- what tools are available to conduct measurements, what
tools were used to generate data, version information.
TD fields: name, description, keywords, release date,
OS. We will finalize the fields after getting some
experience with usage. Notes and bugs will be in
annotations.
-
Study Descriptors (SD):
- keeps track of data and results used in a particular
publication (but is not meant to replace/overtake
citeseer). SD fields: name, description, keywords,
linkage to DDs, TDs, linkage to StudyWriteup (i.e.
actual text of publication).
-
Collections:
- In general, collections are logical groupings of data
with a specific purpose. Such groupings may not exist
physically, but they could be very important for identifying
the data sets used in a paper, or for others to use.
-
Annotations
- include all additional information about a given
object in the database. They can be used to let other
people know about important findings in the data.
Annotations dictionary: key name (e.g. hierarchical
namespace, FORMAT-pcap-snaplen), description, value type,
position type (time range, all, string). We will
standardize certain annotations further when widely
accepted. Annotation fields: dictionary key, "object" of
annotation (DD, PD, LD), value, position (e.g. time).
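The descriptor and annotation objects above can be pictured as simple records. The following Python sketch is only an illustration under stated assumptions: it models a small subset of the listed fields, and the class names, field names, and the helper `find_duplicates` are hypothetical rather than taken from the IMDC implementation. It shows how the MD5 hash field could be used to detect that two catalog entries describe the same file.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DataDescriptor:
    # A subset of the DD fields listed above (names are assumptions).
    name: str
    url: str
    file_size: int
    md5: str
    keywords: list = field(default_factory=list)


@dataclass
class Annotation:
    key: str               # dictionary key, e.g. "FORMAT-pcap-snaplen"
    target: str            # name of the annotated object
    value: str
    position: str = "all"  # e.g. a time range, or "all"


def md5_hex(payload: bytes) -> str:
    """Return the MD5 digest of a file's bytes as a hex string."""
    return hashlib.md5(payload).hexdigest()


def find_duplicates(descriptors):
    """Return (first_name, duplicate_name) pairs that share an MD5 hash."""
    seen = {}
    duplicates = []
    for dd in descriptors:
        if dd.md5 in seen:
            duplicates.append((seen[dd.md5].name, dd.name))
        else:
            seen[dd.md5] = dd
    return duplicates


# Hypothetical entries: two copies of one file and one unrelated file.
payload = b"example trace bytes"
a = DataDescriptor("trace-a", "http://example.org/a", len(payload), md5_hex(payload))
b = DataDescriptor("trace-b", "http://example.net/b", len(payload), md5_hex(payload))
c = DataDescriptor("trace-c", "http://example.org/c", 5, md5_hex(b"other"))
dupes = find_duplicates([a, b, c])
note = Annotation("FORMAT-pcap-snaplen", "trace-a", "68")
```

Recomputing the hash after a download and comparing it with the stored value would serve the second purpose mentioned above, checking for corruption.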
The first phase of the catalog implementation deals with
creating and cross-linking tables of data, formats, packages,
contacts, and locations. A demo version allows the user to
browse the database, search for objects of a specified type, look at
the detailed information to decide what data is interesting and
find out how to get it. (In the discussion following
the IMDC description, workshop participants strongly
approved of the proposed design and made many useful suggestions.
Their recommendations, aimed at improving the accuracy and
versatility of the IMDC, are summarized under Key Findings at the beginning
of this report.)
CAIDA data sets
-
Colleen Shannon (CAIDA)
-
presented current project areas and related passive data
collections at CAIDA. She identified the main challenges in
collecting and storing traces: (1) maintenance of remote monitors;
(2) large file transfers from monitor sites to UCSD; (3) storage of
data.
Availability of CAIDA-housed traces is on a case-by-case
basis. Provider-specific agreements determine the use policy for information
collected on the backbone links. Data captured on UCSD links are available to
researchers but have rigid restrictions on capture of
user payload. Release of the UCSD network telescope data is
subject to a number of security constraints.
Enabling data access involves a number of steps: (1) sanitization:
anonymization, payload stripping; (2) developing Acceptable Use Policies (AUP);
(3) setting up a system of user tracking and support;
(4) data pre-processing, aggregation, and packaging with
time-sensitive information (such as contemporaneous
name lookups, routing tables, etc.).
-
Brad Huffaker (CAIDA) - [pdf slides]
-
discussed active Internet probing projects at CAIDA and the
resulting data. The probing tool skitter, which we use to
collect forward IP path topology, is deployed on 25 monitors in
8 countries on 4 continents. We have collected this forward IP
topology data continuously since 1998. Another active probing
tool is iffinder, which finds multiple interfaces
belonging to the same router.
Signing an AUP agreement is required in order to access the
raw topology data. AS- and router-level topology derived from
raw IP path data are downloadable without restrictions. The
probing lists can be released, but with appropriate
restrictions including prohibition of active probing for
non-CAIDA projects. Note that responding IPs disappear at the
rate of about 1% per month, forcing us to replenish the lists
on a regular basis. CAIDA also maintains a "do-not-probe-me"
list.
CAIDA has limited abilities for mapping IPs to geographical
locations. Our own NetGeo database has been unsupported
since 2002 and is becoming obsolete. A new tool, owl, that will
parse whois databases is in development, but it is not funded.
CAIDA also has a private contract for use of the netacuity geographical
server (a commercial tool from Digital Envoy).
Existing Internet Measurement Data
Participants of the workshop shared their experience in data collection and
management.
-
Supratik Bhattacharyya (Sprint ATL) - [pdf slides]
-
uses a special IPMON system that includes a GPS clock and a
DAG card to collect 44 bytes of each packet on selected links
in SprintLink PoPs. They also collect Cisco Netflow data,
periodic BGP tables, continuous BGP and IS-IS table updates,
and SNMP utilization. Currently, there are more than 60 IPMON
systems deployed. Sprint uses the Sistina Global
File System as its data storage and management
infrastructure. Meta-data are entered by hand and data are
archived on tapes.
The original goal of data management was to provide fully
automated analyses of the traces for requesting researchers.
This approach did not work for a number of reasons:
- difficult to support arbitrary operations,
- necessary to filter and sanitize results,
- necessary to automate allocation of computing resources,
- existing user base was not ready: tools were unstable
and poorly documented, users wanted direct access rather
than through queries, it was hard to keep meta-data
updated.
Now the system is partially automated: after a trace is
archived on tape, cleaned, and put on the SAN, a script checks for
new clean traces and runs flow analysis. The IPMON web site organizes traces
by date of collection. For each trace, the following parameters
are shown: link utilization, active flows, traffic breakdown by
protocol, by application, packet size distribution, and packet
count.
Current measurement projects at Sprint ATL are:
- CMON (continuous monitoring system) that runs on a
24/7 basis, computes low-level statistics correlated
with routing information and retains a limited history
of packet-level information for troubleshooting. Two
systems are deployed in the San Jose PoP, and more are to
follow.
- packet trace analysis for security aimed at
establishing characteristics of "normal" behavior (a
non-trivial problem)
- 3G data network monitoring, which attempts to carry
measurement techniques over from the wired network to the
wireless Sprint PCS 3G data network.
-
Dan Gunter (LBNL) - [pdf slides]
-
representing the Network
Measurements Working Group (NM-WG) of the Global Grid Forum,
discussed their standard schemas for Grid network measurements.
The group published a document that targets end users, network
admins and researchers, and Grid middleware developers. They
proposed a classification and naming methodology for Grid
network measurements. The NM-WG focus is now on XML schemas used
to describe sets of results and to structure user requests
(such as querying archived data, running tests on demand,
etc.).
-
Martin Swany (U. Delaware) - [pdf slides]
-
continued the discussion of
NM-WG work. The schemas have to be very versatile in order to be
useful to a broad community. Ideally, the group would like to reuse a single interface
in many different ways.
For a series of data, the consistent parts of the meta-data can be
incorporated by reference (to an XML object) rather than
repeated in every measurement. The next step toward data
normalization is identifying all data by three broad classes of
meta-data and a timestamp:
-
characteristic: what was measured, type of event
-
entity/subject/target: what entity was measured or
generated the event
-
parameters/methodology: what parameters were used in the
measurement tool, conditions, what system, who measured
Normalization enables more efficient querying.
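The idea of storing consistent meta-data once and referring to it from each measurement can be sketched as follows. This is a Python stand-in for illustration only: NM-WG works with XML schemas, and the class names, the identifier `m1`, and the sample values here are all assumptions rather than NM-WG definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaData:
    characteristic: str  # what was measured
    subject: str         # what entity was measured
    parameters: str      # tool parameters and conditions


@dataclass(frozen=True)
class Measurement:
    meta_id: str         # reference to a shared MetaData record
    timestamp: float
    value: float


# Meta-data is stored once and shared by every measurement in the series.
metadata = {"m1": MetaData("one-way delay", "hostA->hostB", "probe size=64")}
series = [
    Measurement("m1", 1086220800.0, 12.5),
    Measurement("m1", 1086220860.0, 13.1),
]


def values_for(characteristic, metadata, series):
    """Select values by first matching the small meta-data table."""
    ids = {mid for mid, m in metadata.items() if m.characteristic == characteristic}
    return [s.value for s in series if s.meta_id in ids]


delays = values_for("one-way delay", metadata, series)
```

Because a query first filters the small meta-data table and then follows references, it does not need to compare repeated descriptive fields on every measurement.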
In dealing with derived data streams, the subject becomes a
view. The characteristic and parameters encode
the transformation of the original data. A document describing
the approach to chains of derived data is in progress and will be
presented to the community when ready.
-
Henk Uijterwaal (RIPE NCC) - [pdf slides]
-
talked about Internet measurements and data at RIPE.
Test Traffic Measurements
(TTM) measures key parameters of the connectivity between a user's
site and other points on the Internet: delay, losses, and other
IPPM metrics. Raw data are: traceroutes, packets sent, packets
arrived. A database storing processed data was opened
for public access on January 1, 2004. The two conditions of access are
anonymization of the data and circulation of the paper in the RIPE community
for comments prior to publication.
Routing Information
Service (RIS) collects routing information by using Remote
Route Collectors at different locations around the world and
integrates this information into a comprehensive view. Raw data
are: RIB dumps (3 per day), timestamped BGP updates (IPv4 -
from up to 12 locations, since September 1999, 250 Gbyte/yr;
IPv6 - since October 2002, a few Gbyte/yr). Log files and
software to read the files are available online. The AUP allows users to
download and analyze the data and asks them to inform RIPE about
publications.
DNSMON is
a beta service monitoring all DNS root and seven TLD servers
from a few dozen locations. Full deployment is expected next year. Data
will be open for research.
RIPE also offers access to whois database
(restricted due to contact information) and to regularly
updated registration (allocation) data. Future plans envision
further development of information services with emphasis on
providing data for the community. Possibly, different sets of
data can be created as necessary for different target
groups.
-
Matthew Zekauskas (Internet2) - [html slides]
-
presented
Abilene Observatory datasets: flow data (last 11 bits of IP
addresses are zeroed), one-way latencies for 2*112
paths, router snapshots, 1 and 5 min SNMP usage data,
throughput (measured with iperf). They will start collecting
more types of data in the near future.
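Zeroing the last 11 bits of an IPv4 address, as described for the Abilene flow data, can be illustrated with a short Python sketch. The function name `zero_low_bits` and the sample addresses are assumptions; this is not the Abilene Observatory code.

```python
import ipaddress


def zero_low_bits(addr: str, bits: int = 11) -> str:
    """Clear the lowest `bits` bits of an IPv4 address."""
    value = int(ipaddress.IPv4Address(addr))
    # Mask with ones in the high (32 - bits) positions and zeros below.
    mask = (0xFFFFFFFF >> bits) << bits
    return str(ipaddress.IPv4Address(value & mask))


masked_a = zero_low_bits("192.0.2.77")
masked_b = zero_low_bits("10.1.255.255")
```

With 11 bits cleared, every address inside the same /21 block maps to the same value, so individual hosts within the block can no longer be told apart.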
The data are archived in many places with different AUPs.
Summaries (graphs, tables, time series of summaries) are stored
forever and served on the Web. Raw data in diverse formats
(with, probably, insufficient meta-data) are available only by
special request and have to be manually recovered. Flow data
(collected using Mark
Fullmer's flow tools) are kept for 30 days. 5 min SNMP data
(polled using custom software) are stored in RRD files.
Future plans include: creating new databases for IGP and
BGP data, using a Homeland Security grant to clean up databases
and to improve access.
-
George Riley (Georgia Tech) - [pdf slides]
-
advertised NETI@Home, an
open-source software package for conducting passive Internet
measurements from end-point vantage points worldwide. It collects
network performance statistics for a number of commonly used Internet
protocols in order to capture "real" users' experiences. The software
can operate on multiple platforms and is easy to install and upgrade.
It runs in the background and reports results to Georgia Tech for
subsequent analysis and posting. Users can protect their privacy by
selecting the desired disclosure level (e.g., no IP address, or
first 24 bits of the IP address (default), or full disclosure).
The tool does not sniff packets in promiscuous mode, but performs
measurements on a per-flow, bidirectional basis.
Altruism and pretty pictures (NETIMap) are expected to provide
motivation for potential users. There have been 730 unique users
since Jan 7, 2004; 10,000 users would be ideal.
About 500 MB of uncompressed binary data have been collected in
the week since May 26, 2004.
-
Christos Papadopoulos (USC/ISI) - [pdf slides]
-
talked about data collection projects on Los Nettos, a
15-year-old regional network for the Los Angeles area. They currently
monitor one of three upstream providers and Internet2. Two-minute
traces are taken using tcpdump software on
FreeBSD PCs and stored on RAID boxes.
The data were used to study DDoS attack signatures (single
source vs. multiple source) and to attempt detection of congested
links. Data on about 80 DDoS attacks are anonymized, binned into
1 ms time series, and available on DVDs for external researchers
with a reasonable one-page AUP. So far, 8-10 users have requested
access to these data.
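Binning a packet trace into a 1 ms time series amounts to counting packets per millisecond. The Python sketch below is a minimal illustration, assuming timestamps are given as integer microseconds; the function name `bin_packets` and the sample timestamps are hypothetical, and the actual Los Nettos processing may differ.

```python
from collections import Counter


def bin_packets(timestamps_us, bin_width_us=1000):
    """Count packets per bin; bins start at the earliest timestamp."""
    if not timestamps_us:
        return []
    start = min(timestamps_us)
    # Integer division avoids floating-point rounding at bin edges.
    counts = Counter((t - start) // bin_width_us for t in timestamps_us)
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


# Hypothetical packet arrival times in microseconds.
arrivals = [0, 200, 999, 1000, 2500, 4100]
series = bin_packets(arrivals)
```

Empty bins are kept as zeros so that the result is an evenly spaced time series.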
-
Les Cottrell's (SLAC) - [pdf slides]
-
main interest is in end-user Internet
measurements. He presented PingER, an Internet measurement
project more than 7 years old that involves 35 monitor sites and
about 550 remote sites in more than a hundred countries. Very
lightweight ping probes are sent every 30 minutes between a
growing number (currently, about 3700) of source-destination
pairs. A monitor site collects about 0.5 MB/pair/month. Data
are archived at SLAC and FNAL. About 40 users access these data
when they are posted, and there are a few requests for archived
data per year.
Another project, IEPM-BW, monitors high performance paths
using iperf, bbcp, bbftp, GridFTP, and ping. Ten monitor sites
and about 60 remote hosts from nine countries participate in
this project. Raw measurements are stored in flat files and in
an Oracle database. Recent data are available via Web Services
(using the NM-WG request schema).
Continuous measurements are hard. Keeping remote sites accessible,
collecting data from monitor hosts, and keeping up with the continuous evolution of NM-WG
schema definitions are among the challenges.
-
Nick Feamster (MIT) - [pdf slides]
-
presented wide-area network data and analysis efforts at
MIT. They use the RON testbed of 31 widely distributed nodes with
stratum 1 NTP servers and CDMA time synchronization. Periodic
pairwise active probes measure one-way delay and loss, while
three consecutive lost probes trigger a traceroute. There are
also daily pairwise traceroutes over the testbed topology and iBGP
feeds at eight measurement hosts. All data are pushed to a central
measurement host.
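The rule that three consecutive lost probes trigger a traceroute can be sketched as a small run-length check. This Python illustration is not the RON code; the function name `traceroute_triggers` is hypothetical, and firing only once per run of losses is an assumption.

```python
def traceroute_triggers(probe_results, threshold=3):
    """Return probe indices at which a traceroute would be launched.

    probe_results holds True for a received probe and False for a lost one.
    """
    triggers = []
    run = 0  # length of the current run of consecutive losses
    for index, received in enumerate(probe_results):
        if received:
            run = 0
        else:
            run += 1
            if run == threshold:
                # Fire once when the run first reaches the threshold.
                triggers.append(index)
    return triggers


# Hypothetical probe outcomes for one pair of nodes.
outcomes = [True, False, False, False, False, True, False, False, False]
fired_at = traceroute_triggers(outcomes)
```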
The following problems are associated with the data:
- changes in connectivity (IP renumbering, changes of upstream
providers)
- non-standardized and sometimes buggy tools
- data management (continuous collection vs.
archival, equipment failures and outages, complaints, etc.)
- miscellaneous issues (keeping track of occurring problems,
hosts are not firewalled, iBGP sessions to border router on
the same LAN, etc.)
A few projects make use of the collected data.
The BGP monitor overview
summarizes BGP updates by time. A failure characterization study
showed that failures typically occur about 3-4 minutes before
BGP activity. 60% of failures that appeared at 3 or more hops
from an end host coincided with at least one BGP message.
An invalid prefix advertisement study showed that a large number
of offending ASes leak routes from private address space.
Simple static filters would alleviate this transgression. Over
50% of bogus routes persist for more than one hour and many of
them stay around for a day or more.
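A simple static filter of the kind mentioned above can be sketched as a check of each advertised prefix against a fixed list of private blocks. The Python example assumes that "private address space" means the three IPv4 blocks of RFC 1918; the function name `is_private_route` is hypothetical, and a real router would express this as a prefix-list rather than code.

```python
import ipaddress

# The RFC 1918 private IPv4 blocks (an assumption about what was filtered).
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_private_route(prefix: str) -> bool:
    """True if an IPv4 prefix lies entirely inside a private block."""
    network = ipaddress.ip_network(prefix)
    return any(network.subnet_of(block) for block in RFC1918_BLOCKS)


# Hypothetical advertisements; only the last two should be accepted.
advertised = ["10.1.0.0/16", "192.168.5.0/24", "172.32.0.0/16", "8.8.8.0/24"]
accepted = [p for p in advertised if not is_private_route(p)]
```

The method `subnet_of` requires Python 3.7 or later and works here only for IPv4 prefixes.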
-
Yuval Shavitt (Tel Aviv University)
-
proposed to let the Internet measure itself. His project Distributed Internet
MEasurement and Simulation (DIMES) aims to study the
structure and topology of the Internet and is similar in
concept to NETI@home. It relies upon assistance of thousands of
volunteers, who will download and run the open source DIMES
agent to perform network measurements such as Ping and
Traceroute from all corners of the globe. The tool has a very
low bandwidth consumption (< 1 KB/s) and does not
monitor any activity performed by a user.
The project is now in its testing phase and a fully working
version is expected by the fall. The data will be collected and
archived at Tel Aviv University, with processed data made
available on the web. The following analyses have been proposed:
characterizing completeness of Internet AS maps,
tracking Internet growth, studying router PoP level topology,
investigating BGP optimality and convergence, testing Internet
virus protection methodologies.
-
Bill Yurcik (NCSA) - [pdf slides]
-
gave an overview of scalable security data management for
internal/external data sharing. There are many different
incentives to share data: saving time and effort, getting
economic advantages, legal requirements, research interests,
and SECURITY (probably, the carrot that often drives data
collection). Thus, there is no one-size-fits-all solution for
data sharing. It is important to recognize that cooperation and
sharing need to be promoted since they make us less vulnerable
to malicious Internet activities*.
* - provided that the data
are shared "with the right people".
The security solution space is multidimensional. Network data
come in 18 commonly available logs, and each one has unique
characteristics. Processing algorithms should look at different
attributes across all logs in order to achieve maximum
situational awareness and to enable smart human decisions. Log
anonymization at multiple levels may be a good solution for
data sharing.
The Forum of Incident
Response and Security Teams (FIRST) is a good example of
cooperation that works. It started from the ground up and
currently has more than 100 members (by invitation only) from
government, commercial, and academic organizations. FIRST
members cooperate in reacting quickly and preventing incidents
and share the relevant information among themselves and with
the community at large.
Discussion of supporting tools and
techniques
-
Mark Allman (ICIR) - [pdf slides]
-
started this session with a discussion of the challenges encountered
in building a culture that values data catalogs. Obviously, a working
data catalog would lead to better science by improving
reproducibility of the results, adding more vantage points, and
providing longitudinal views of the Internet. Then why don't
researchers share more data? As a rule, they do not get credit for
releasing their data although the effort required is comparable to
that of writing a paper or software. Multiple
privacy/policy/legal/competitive issues impede sharing of passive
measurements. In dealing with active measurements, laziness (or the
lack of designated funding?) is the main obstacle to sharing since
cleaning and packaging the data is often a time consuming task. Also,
it is difficult to make data collected for someone's own purposes
useful for others when meta-data are often insufficient.
Route
Views is a prominent and instructive example of broad
participation in gathering data and using them. Features that
led to its success are: homogeneous measurements of just one
type, easy to set up, useful to both researchers and operators
(giving them a motivation to participate). The impact of this
project on the community is dramatic since "everyone uses
routeviews".
We need a real cultural shift and commitment from the
research community in order to change prevalent attitudes
toward data and to keep the IMDC repository operational. Some
suggestions and recommendations are:
- repository must be easy to use and useful to researchers
- tools should help to collect meta-data
- we need tools to help researchers integrate their measurements
into the catalog
- we need anonymization techniques that work
- concentrate on easy stuff first: catalog active
measurements
- find pioneers to seed the system with their data sets
- in publications, require citations as acknowledgments of
data used
More drastic suggestions:
- reject papers whose authors do not release the data
- make data contribution a condition for funding.
-
Ethan Blanton (Purdue University) - [pdf slides]
-
shared his view of the Scalable Internet Measurement
Repository (SIMR), a forerunner of the IMDC.
Working schema definitions are the crux of the project. Careful
enumeration of interesting characteristics maximizes
consistency and makes searching more effective, but decreases
flexibility and may impede future developments. Details of
measurements (level of anonymization, concurrent conditions at
measurement time, host location, sampling used, etc.) are very
important, but often invisible when looking at the data.
Annotating all details is hard, especially because different
studies care about different things.
Other challenges faced by a large measurement database are:
(i) drawing an explicit line between data to catalog and derived
results, (ii) database pollution and preserving signal-to-noise
ratio above a certain threshold, (iii) scalability of user
interaction with database to find/get individual data items.
Yu. Shavitt: would it be a good idea to design the
data catalog for maximum submission simplicity encouraging
people to put their information in? If over-engineered and
cumbersome to contribute to, the catalog will not be
populated. At the same time, there is a huge community of
users who cannot generate data, but are willing to filter
signal from noise in their searches. Eventually, tools
will emerge to clean up data or to get good data.
E. Blanton: both paths are bad, leading to either a giant
database of worthless crud or to an empty database. We need
to find the right balance.
-
Timur Friedman (Paris 6) - [pdf slides]
-
works on the French measurement infrastructure Metropolis.
It is a multi-partner project funded by the government for three
years to measure RENATER and other French networks with
emphasis on security. For active measurements they use RIPE
TTM boxes, SATURNE boxes, generic BSD and Linux boxes (equipped
with NIMI or a new tool, Pandora) installed at each of the French
partners. These measurements will be extended to other European
partners as well. For passive measurements they employ DAG
cards, QoSMOS and Ipanema boxes. Sampling is implemented as
necessary to measure OC192 links.
Currently, the data are not advertised, but available to
researchers on request. AUPs and restrictions vary depending
on which institution conducted the measurements. Passive traces
are subject to prefix preserving transformation.
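A transformation is prefix-preserving when two addresses that share their first k bits still share exactly k leading bits after anonymization. The report does not say which scheme Metropolis uses, so the Python sketch below is only a toy construction for illustration: each output bit is the input bit, flipped or not according to a keyed hash (HMAC-SHA256) of the bits that precede it. The function names and the key are hypothetical, and this is not a vetted anonymizer.

```python
import hashlib
import hmac
import ipaddress


def anonymize(addr: str, key: bytes) -> str:
    """Toy prefix-preserving anonymization of an IPv4 address."""
    original = int(ipaddress.IPv4Address(addr))
    result = 0
    for i in range(32):
        # The i most significant bits of the original address.
        prefix = original >> (32 - i)
        digest = hmac.new(key, f"{i}:{prefix}".encode(), hashlib.sha256).digest()
        flip = digest[0] & 1  # pseudo-random bit determined by the prefix
        bit = (original >> (31 - i)) & 1
        result = (result << 1) | (bit ^ flip)
    return str(ipaddress.IPv4Address(result))


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits two IPv4 addresses have in common."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()


key = b"hypothetical-secret-key"
x, y = "192.0.2.10", "192.0.2.200"  # these share exactly 24 leading bits
shared_before = common_prefix_len(x, y)
shared_after = common_prefix_len(anonymize(x, key), anonymize(y, key))
```

Because the flip for each bit depends only on the preceding bits, addresses with a common prefix receive identical flips over that prefix, which is what keeps subnet structure visible in the anonymized trace.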
Timur gave the following recommendations to the IMDC team:
- always discuss what was NOT measured (e.g., distributed
monitors may fail randomly and would bias results)
- plan experiments and build meta data (tools, arguments and
parameters, platforms, times, etc.) into distributed
systems
- support the idea of data 'publication' and 'citation'
- convert data to XML format (easy to parse, standardized -
but cumbersome)
-
Juana Sanchez (UCLA) - [pdf slides]
-
teaches Statistics to graduate and undergraduate students
using the Internet as an example of a complex probabilistic
system. Her objective is to introduce students to the field and
motivate them to propose ideas and solutions for real situations.
For educational purposes, she would like to have access to already
processed datasets free of engineering issues.
Examples of possible research problems:
- probabilistic modeling: do packet counts follow a mixture
of Poisson distributions?
- statistical characterization of traces (using Hurst
parameter)
- what causes burstiness and why do bursts cause long-range
dependencies?
- does a burst correspond to a bump in the wavelet spectrum?
- is queuing theory (which models telephone networks very
well) applicable to the Internet?
- general network tomography: apply pseudo-likelihood
approach to estimate source-destination traffic intensities
from link data
- network topology identification: knowing final delays, can
we estimate the link tree structure in the middle?
- sampling problems: how to get full information about a
certain population at a lower cost than a full census and
what are the right metrics?
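As a first look at the Poisson question in the list above, students could compute the index of dispersion (variance divided by mean) of per-interval packet counts. For a single Poisson distribution the variance equals the mean, so the index is close to 1, while a mixture of Poisson distributions with different rates has a variance larger than its mean. The Python sketch below is a simple diagnostic rather than a formal test; the function name and the sample counts are hypothetical.

```python
from statistics import mean, pvariance


def dispersion_index(counts):
    """Population variance of the counts divided by their mean."""
    return pvariance(counts) / mean(counts)


# Hypothetical packet counts per interval, alternating quiet and busy.
counts = [0, 10, 0, 10]
index = dispersion_index(counts)
```

An index well above 1, as in this made-up example, indicates overdispersion, which is consistent with a mixture or with bursty traffic but does not identify the cause.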
-
Dave Plonka (U. of Wisconsin - Madison) - [pdf slides]
-
presented his approach to bare-bones measurement data
archiving. There are the following types of data to deal with:
- passive: exported flow data and SNMP-gathered
measurement data (50K RRD files, for 16K switch
ports on campus)
- active: traceroute and ping-like text output, BGP from
Route Views and from campus routers.
Flow data are packet-sampled flow records from Juniper
(with varying sample rates and varying regularity) and
non-sampled flow-data from Ciscos (sometimes lossy, always
voluminous).
Raw (binary) flow files, sometimes compressed, are kept for
5-14 days. This lifetime is enough for operational use, while
storage space limitations make longer intervals infeasible. RRD
files storing up to 10 years of data with 5 min granularity and
occasional copies of raw data are archived for the long term.
Anonymization is rather cumbersome and takes hours for a
day-long flow data set.
Each data directory contains detailed README files and follows a
meaningful file naming convention:
{collector}.{date}.{time}{TZ}{encoding}.{fmt}.
There is also a journal/log of events ('events.txt').
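The naming template above can be filled in mechanically, for example as in the following Python sketch. Only the template itself comes from the talk; the function name `archive_name` and every sample field value (collector name, date and time layout, time zone and encoding strings, format suffix) are hypothetical.

```python
def archive_name(collector, date, time, tz, encoding, fmt):
    """Fill the {collector}.{date}.{time}{TZ}{encoding}.{fmt} template."""
    return f"{collector}.{date}.{time}{tz}{encoding}.{fmt}"


# Hypothetical values; the real field formats are not given in the report.
name = archive_name("rtr1", "20040603", "1200", "-0500", "_gz", "flow")
```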
An AUP to access the data initially resembled the
NLANR/CAIDA model: signing usage agreement documents, keeping
data (and therefore analysis) on the central server, releasing
as little as possible (but no less), asking researchers to
describe their projects when they apply for data. This approach
turned out to be impractical and eventually evolved into trust
relationships between researchers and practitioners
(creators/archivers). The result is minimally successful,
time-consuming, and not scalable. A possible future solution
may be: the older the data, the fewer restrictions on releasing
them.
-
The authors of the following
two talks were not available to present at the workshop, but their
slide sets are available:
-
-
Dave Meyer (U. of Oregon and Cisco Systems)
-
Route Views Update
-
Bill Manning (ep.net)
-
DNS software - authoritative server checks
Acknowledgments
We are indebted to Matthew Zekauskas and Les Cottrell
for the invaluable minutes they took during the workshop, which we used heavily
in preparing this report.
The workshop was sponsored by the NSF grant "Correlating
Heterogeneous Measurement Data to Achieve System-Level Analysis of
Internet Traffic Trends" NSF ANI-0137121 and by WIDE gift fund.