Internet Traffic Classification
The Internet continually evolves in scope and complexity, much faster than our ability to characterize, understand, control, or predict it. The field of Internet traffic classification research includes many papers representing various attempts to classify whatever traffic samples a given researcher has access to, with no systematic integration of results. Here we provide a rough taxonomy of papers, and explain some issues and challenges in traffic classification.
Many media-rich entertainment applications have emerged on the Internet. They often use obfuscation techniques such as encrypted data transmission, random or changing ports, or proprietary communication protocols to prevent detection or filtering by network or content owners who believe the traffic is threatening their (infrastructural or intellectual) property. Other applications, e.g., PPStream, uTorrent, and PPLive, use UDP in place of TCP. The rapidly changing nature of applications, even across different versions of the same application, presents a challenge for traffic classification techniques.
We use the phrase traffic classification to describe methods of classifying traffic based on features passively observed in the traffic, and according to specific classification goals. One might have only a coarse classification goal, e.g., whether the traffic is transaction-oriented, bulk-transfer, or peer-to-peer file sharing. Or one might have a finer-grained classification goal, e.g., the exact application represented by the traffic. Traffic features could include the port number, application payload, or temporal, packet size, and addressing characteristics of the traffic. Classification methods include exact matching (e.g., of port number or payload), heuristics, and machine learning (statistical) techniques.
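As a rough illustration of the simplest method mentioned above, exact matching on port number with a coarse classification goal, consider the following minimal sketch. The port-to-class map and the function name `classify_by_port` are assumptions made for this example only; they are not taken from any of the surveyed papers, and real classifiers use far larger port lists.

```python
# Hypothetical port-to-class map, for illustration only.
PORT_CLASSES = {
    80: "web",
    443: "web",
    25: "mail",
    53: "dns",
    6881: "p2p",  # a port traditionally associated with BitTorrent
}


def classify_by_port(src_port: int, dst_port: int) -> str:
    """Exact-match classification on port numbers.

    The destination port is checked first, then the source port.
    Traffic on ports missing from the map is labeled "unknown".
    """
    for port in (dst_port, src_port):
        if port in PORT_CLASSES:
            return PORT_CLASSES[port]
    return "unknown"


print(classify_by_port(51234, 80))     # well-known web port
print(classify_by_port(50000, 50001))  # random ports: not classifiable this way
```

The second call shows why applications that use random or changing ports defeat this method: both endpoints fall outside the map, so the flow stays unclassified.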
We have collected and reviewed papers published between 1994 and 2009 (please email email@example.com if you know one that should be added), starting with papers from peer-reviewed academic research conferences, and then including many papers cited by this initial seed set, as well as follow-up papers written by the same authors. We provide a flexible interactive table that supports selection of relevant attributes from these papers, including data sets and methods used, goals, and basic empirical findings. We use five paper categories: survey, analysis, methodology, tools, and other. Analysis papers typically attempt to derive trustworthy numbers on actual traffic cross-sections, while methodology papers focus on classification methods. Below this table is a similar table of attributes of the data sets analyzed in this set of papers.
Several public and private passive measurement infrastructures have provided a variety of different datasets for Internet traffic classification studies, which we group into four categories:
- Packet-based: packet-level traces, captured by hardware or software. Endace DAG cards are often used for packet capture on high-bandwidth links (CAIDA uses these cards for its OC-192 backbone trace capture: equinix-chicago, equinix-sanjose). Other capture hardware used over the years includes ATM FORE, OC12 or OC13 POINT ATM, Napatech, and INVEA-TECH. DAG cards can capture traffic on links of up to 10 Gbps with less than 15 ns timestamp resolution. Most software tools for capturing packets are based on kernel implementations such as tcpdump/libpcap; CoralReef and Appmon, for example, are based on libpcap. Other packet sniffers and network analyzers are also available.
- SNMP-based: traffic counters and statistics obtained from network devices through the SNMP and RMON MIBs;
- Flow-based: flow-level descriptions of a traffic stream (Cisco NetFlow, Juniper cflowd, Foundry sFlow, Huawei NetStream);
- Other: data not covered by the categories above, such as application-level session logs from web sites.
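To make the difference between packet-based and flow-based data concrete, the sketch below collapses packet-level records into flow-level records keyed by the usual 5-tuple. It is a simplified illustration under stated assumptions: the `Packet` type and the `aggregate_flows` function are hypothetical names, flows are treated as unidirectional, and no timeout or sampling is applied, unlike in real flow exporters.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical minimal packet record (header fields only)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str  # e.g. "TCP" or "UDP"
    size: int   # packet size in bytes


FlowKey = Tuple[str, str, int, int, str]


def aggregate_flows(packets: Iterable[Packet]) -> Dict[FlowKey, Dict[str, int]]:
    """Group packets by 5-tuple and count packets and bytes per flow."""
    flows: Dict[FlowKey, Dict[str, int]] = defaultdict(
        lambda: {"packets": 0, "bytes": 0}
    )
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.proto)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p.size
    return dict(flows)


# Made-up packets: two belong to one TCP flow, one to a UDP flow.
trace = [
    Packet("10.0.0.1", "10.0.0.2", 40000, 80, "TCP", 1500),
    Packet("10.0.0.1", "10.0.0.2", 40000, 80, "TCP", 500),
    Packet("10.0.0.3", "10.0.0.4", 5000, 53, "UDP", 80),
]
flows = aggregate_flows(trace)
print(len(flows))
```

A flow record keeps far less information than the packets it summarizes, which is why flow-based data sets cannot support payload-based classification.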
|ID||Name||Year||Link Type||Capture Environments||Geographic Location||Payload||Size and Length||PaperID|
|3||TunnelSSH-2006||2006||Academic||Edge/Border||Europe||Yes(full)||0.25 h each hour for three weeks
|2006||Academic||Backbone||Europe||No||0.33*4 h each day for 20 days
|2006||Academic||Backbone||Europe||No||276 randomized times(10 mins) during 80 days
|6||LosNettos-2005||2005||Academic and Commercial||Backbone||North America||24 h, 08/31/2005||[P-17]|
|7||LosNettos-2006||2006||Academic and Commercial||Backbone||North America||24 h, 10/03/2006||[P-17]|
|8||CAIDA-OC48-2003||2003||Backbone||North America||No||1 h
|9||Abilene-ABIL-2004||2004||Academic||Backbone||North America||No||1 h
|10||PAIX-PAY1-2004||2004||Backbone||North America||Yes(16 bytes)||1 h
|11||PAIX-PAY2-2004||2004||Backbone||North America||Yes(16 bytes)||1 h
27 Gbytes;35 Gbytes;
|18||WirelessCrawdad-2003||2003||Academic||Europe||No||5 h 30 mins
|19||Enter-||Commercial||Edge/Border||None||Yes||1 h 20 mins
|20||UMass-2005||2005||Academic||Edge/Border||North America||Yes(4 bytes)||[P-25][P-26]|
|21||MicroResearch1-2005||2005||Commercial||Edge/Border||North America||Yes||1 month||[P-29]|
|22||MicroResearch2-2005||2005||Commercial||Edge/Border||North America||Yes||2 weeks||[P-29]|
|26||Calgary1-2006||2006||Academic and Commercial||North America||Yes(full)||1*48 h over 6 months||[P-34]|
|27||Calgary2-2006||2006||Academic||North America||Yes(full)||1*8 h over 4 days||[P-35]|
|28||PAIX-I||2004||Commercial||Backbone||North America||Yes(16 bytes)||2 h
|29||PAIX-II||2004||Commercial||Backbone||North America||Yes(16 bytes)||2 h 2 mins
|30||WIDE-1||2006||Backbone||Oceania||Yes(40 bytes)||55 mins
|31||KEIO-I||2006||Academic||Edge/Border||Asia||Yes(40 bytes)||30 mins
|32||KEIO-II||2006||Academic||Edge/Border||Asia||Yes(40 bytes)||30 mins
|33||KAIST-I||2006||Academic||Edge/Border||Asia||Yes(40 bytes)||48 h 12 mins
|34||KAIST-II||2006||Academic||Edge/Border||Asia||Yes(40 bytes)||21 h 16 mins
|36||AUCK-||2001/2003||Edge/Border||Oceania||1 h /3 days||[P-20][P-42][P-48][P-58][P-59]|
|37||OC48-2003||2003||Backbone||North America||1 h 2 mins||[P-48]|
|38||UCSD-honeypot||Academic||Intranet||North America||5 mins||[P-48]|
|39||Calgary3-2006||2006||Academic||Edge/Border||North America||Yes(full)||1 h||[P-20]|
|41||Wireless-2006||2006||Academic||Intranet||North America||5 days||[P-21]|
|42||UCSDDepart-2006||2006||Academic||Backbone||North America||1 h||[P-21]|
|43||GMU-2003||2003||Academic||North America||10 mins of each quarter hour over 2 months
|48||Genome Academic||Academic||Edge/Border||Europe||Yes(full)||24 h/43.9 h; 268 Gbytes/495 Gbytes
|49||Tier1ISP-||Backbone||North America||No||24 h/3 h
|54||Abilene-2003||2003||Academic||Backbone||North America||No||20 days(1/100 pkts)||[P-47]|
|55||Geant-2004||2004||Backbone||Europe||No||23 days(1/1000 pkts)||[P-47]|
over 1 month
|57||ADSL-2002/3||2003-2004||Backbone||None||weekdays and weekend days of Sep 2004 and June 2003
as testing purposes
|60||UCSD-NAP-2002||2002||Commercial||North America||31 days||[P-46]|
|61||Research-2002||2002||Academic||Edge/Border||None||39 days (roughly 15,000 hosts)||[P-46]|
|63||DARPA-1998||1998||2 and 7 weeks of
|66||Tier1-multi||2001-2002||Backbone||None||1 h - 6 days||[P-13]|
|69||CAIDA-OC48-2002/4||2002-2004||Backbone||North America||Yes(4 bytes)||1 h||[P-5]|
|70||CAIDA-OC48-2003/4||2003-2004||Backbone||North America||Yes(part)||1-2 h||[P-6]|
|71||InternetAccessTrace-2003||2003||None||Yes(full)||24 h and 18 h
|73||MultiRouter-2001||2001||Backbone||None||8000 million flow
|74||Tier1ISP-OC12-2001||2001||Backbone||North America||No||3.5 days||[P-55]|
|75||USC-2006||2006||Edge/Border||North America||No||14 hour period||[P-60]|
|76||USC-2006||2003-4||Edge/Border||North America||No||2 years||[P-61]|
|78||AccessNetwork-||Intranet||None||No||43 hours; 6 Gbytes||[P-62]|
|79||MobileNetwork-||Europe and Asia||Yes(full)||[P-63]|
|80||JanpanISPSNMP-||2004-2008||Commercial||Backbone||Asia||aggregated SNMP data (month-long) from 6 ISPs;||[P-64]|
|81||JanpanISPNetFlow-||2005,2008||Commercial||Backbone||Asia||Sampled NetFlow data from 1 ISP;||[P-64]|
|82||NapleItaly||Academic||Backbone||Europe||Generated by a set of controlled boxes;||[P-65]|
|84||ADSLPoPFrance||2006, 2008||Commercial||Europe||Yes(full)||1-2 h
|85||ISPEuropean||2008, 2009||Commercial||Europe||2*24 h, 14*90 mins; >4 Tbytes, 100-600 Gbytes
|86||DSLSession||2009||Commercial||Europe||10 days, 6*24 h
P2P traffic is one of the most challenging traffic types to classify, partly due to substantial legal interest in identifying it and even more substantial negative repercussions to the user if P2P traffic is accurately identified. The misaligned incentives between those who want to use and those who want to identify P2P applications, together with the tremendous legal and privacy constraints against traffic research, render scientific study of this question nearly impossible. Even if such study were possible, wide variation across links would prevent a simple numeric answer to the question of how much P2P traffic there is on the Internet. But our taxonomy does reveal insights: the fraction of peer-to-peer file sharing traffic observed ranges from 1.2% to 93% across the 18 papers that provide such numbers. We also know that the average fractions reported increased considerably from 2002 to 2006 (Table 1). Tables 2 and 3 show that results also vary widely by link and geographic location. Table 3 suggests that P2P is more popular in Europe, probably due to stricter policies (MPAA and RIAA) in North America. Note that the Asian results are from Japanese data sets, in which the values of 1.34% and 1.29% are based on port numbers and are therefore likely to significantly underestimate the fraction of P2P traffic. Furthermore, the amount of P2P traffic also varies by time of day, with higher fractions at night (Table 4).
One study suggests that peer-to-peer applications are used more often at home than in the office. Finally, a study of a European university link found a higher fraction of P2P traffic than some Canadian academics found on their campus.
Some numbers are based on statistical or host-behavioral classification. The remaining numbers are based on P2P detection via payload signature matching, which is the most reliable method of detecting an application (if the traffic is unencrypted) but is fraught with legal and privacy issues.
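Payload signature matching can be sketched as a prefix test against known protocol byte patterns. The example below is only a minimal illustration: the signature list and the name `classify_by_payload` are assumptions for this sketch, and production classifiers use much richer pattern sets. The BitTorrent entry relies on the handshake beginning with the byte 0x13 followed by the string "BitTorrent protocol".

```python
# Illustrative signature list (label, payload prefix); far from complete.
SIGNATURES = [
    ("bittorrent", b"\x13BitTorrent protocol"),
    ("http", b"GET "),
    ("http", b"POST "),
    ("http", b"HTTP/1."),
]


def classify_by_payload(payload: bytes) -> str:
    """Return the label of the first signature that prefixes the payload."""
    for label, prefix in SIGNATURES:
        if payload.startswith(prefix):
            return label
    return "unknown"


handshake = b"\x13BitTorrent protocol" + bytes(8)  # 8 reserved bytes follow
print(classify_by_payload(handshake))
print(classify_by_payload(b"GET /index.html HTTP/1.1\r\n"))
# A trace that keeps only the first 16 payload bytes cuts the 20-byte
# signature short, so this particular pattern no longer matches.
print(classify_by_payload(handshake[:16]))
```

The last call hints at why the amount of payload retained in a data set matters: a signature longer than the captured snippet cannot match, and encrypted payloads defeat prefix matching altogether.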
UDP Traffic Analysis
It is still a commonly accepted assumption that Internet traffic is dominated by TCP, and this assumption underlies most current traffic classification work. However, the rise of new streaming applications (e.g., IPTV such as PPStream and PPLive) and new P2P protocols (e.g., uTP) that try to avoid traffic shaping is expected to increase the use of UDP as a transport protocol.
In this section, we collect UDP analyses from existing work, and then compare the usage of UDP and TCP in several traffic traces collected at different network and geographic locations, as well as in different time periods.
Table 5 shows that the UDP/TCP ratio ranges from 0.01 to 0.20 in existing work (with a high value in a residential trace). To better evaluate the amount of UDP and TCP traffic in real traces (in terms of flows, packets, and bytes), we analyzed several available traces collected between 2002 and 2009 on several backbone links located in the US and Sweden. Table 6 shows that the use of UDP as a transport protocol rapidly increased from 2002 to 2009, although TCP sessions are still responsible for most packets and bytes. In terms of flows, however, UDP turns out to be the dominant transport protocol.
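The UDP/TCP ratio can differ sharply depending on whether it is computed over flows, packets, or bytes. The following sketch shows one way such ratios might be computed from flow records. The record layout, the function name `udp_tcp_ratios`, and the sample numbers are all made up for illustration and do not come from the traces discussed here.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical flow record: (protocol, packet count, byte count)
FlowRecord = Tuple[str, int, int]


def udp_tcp_ratios(flows: Iterable[FlowRecord]) -> Dict[str, float]:
    """Compute UDP/TCP ratios in terms of flows, packets, and bytes.

    Protocols other than TCP and UDP are ignored. If a TCP total is
    zero, the corresponding ratio is reported as infinity.
    """
    totals = {"TCP": [0, 0, 0], "UDP": [0, 0, 0]}
    for proto, packets, nbytes in flows:
        if proto not in totals:
            continue
        totals[proto][0] += 1        # one more flow
        totals[proto][1] += packets
        totals[proto][2] += nbytes
    ratios = {}
    for i, unit in enumerate(("flows", "packets", "bytes")):
        tcp = totals["TCP"][i]
        ratios[unit] = totals["UDP"][i] / tcp if tcp else float("inf")
    return ratios


# Made-up sample: one large TCP flow and three tiny UDP flows.
sample = [
    ("TCP", 100, 100_000),
    ("UDP", 2, 200),
    ("UDP", 1, 100),
    ("UDP", 1, 100),
    ("ICMP", 1, 64),
]
ratios = udp_tcp_ratios(sample)
print(ratios)
```

In this toy sample UDP accounts for three times as many flows as TCP but only a small fraction of the packets and bytes, which mirrors the qualitative pattern described above.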
This overview page presented a rough taxonomy of traffic classification approaches, based on features, methods, goals and data sets.
Our survey also reveals shortcomings in current traffic classification efforts. The variety of data sets used does not allow systematic comparison of methods. Few research groups (can) share their datasets. As was already true ten years ago, the field of traffic classification research still needs publicly available, modern data sets as reference data for validating approaches. The poor comparability of results is further amplified by the lack of standardized measures and classification goals. For example, there exists no clear definition for traffic classes such as P2P or file-sharing.
However, the taxonomy above allows meta-analyses of relevant open questions, such as trends in and development of traffic classes or features, yielding new insights into Internet traffic. We showed this by shedding some light on questions such as: "how much of modern Internet traffic is P2P?" Though we found some trends and indications, we have far too little data available to make conclusive claims beyond "there is a wide range of P2P traffic on Internet links; see your specific link of interest and classification technique you trust for more details."
This work was made possible thanks to funding from DHS-PREDICT, the National Science Foundation, Beijing Jiaotong University, and the China Scholarship Council.