Internet Traffic Classification
Introduction
The Internet continually evolves in scope and complexity, much faster than our ability to characterize, understand, control, or predict it. The field of Internet traffic classification research includes many papers representing various attempts to classify whatever traffic samples a given researcher has access to, with no systematic integration of results. Here we provide a rough taxonomy of papers, and explain some issues and challenges in traffic classification.
Application Trends
Many media-rich entertainment applications have emerged on the Internet. They often use obfuscation techniques such as encrypted data transmission, random or changing ports, or proprietary communication protocols to prevent detection or filtering by network or content owners who believe the traffic threatens their (infrastructural or intellectual) property. Other applications, e.g., PPStream, uTorrent, and PPLive, use UDP in place of TCP. The rapidly changing nature of applications, even across versions of the same application, presents a challenge for traffic classification techniques.
Definitions
We use the phrase traffic classification to describe methods of classifying traffic based on features passively observed in the traffic, and according to specific classification goals. One might have only a coarse classification goal, e.g., whether the traffic is transaction-oriented, bulk-transfer, or peer-to-peer file sharing. Or one might have a finer-grained classification goal, e.g., the exact application represented by the traffic. Traffic features could include the port number, the application payload, or the temporal, packet size, and addressing characteristics of the traffic. Classification methods include exact matching (e.g., of port number or payload), heuristics, and machine learning (statistical) techniques.
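As an illustration of the simplest method named above, exact matching on port number, the following minimal sketch maps a flow's ports to an application and then to a coarse class. The port table and the coarse-class mapping are small hypothetical examples chosen for illustration; they are not taken from any of the surveyed papers.

```python
# Hypothetical, deliberately tiny port table (illustrative only).
PORT_TO_APPLICATION = {
    20: "ftp-data",
    53: "dns",
    80: "http",
    443: "https",
    6881: "bittorrent",
}

# Hypothetical mapping from application (fine-grained goal)
# to coarse traffic class (coarse goal).
APPLICATION_TO_CLASS = {
    "ftp-data": "bulk-transfer",
    "dns": "transaction-oriented",
    "http": "transaction-oriented",
    "https": "transaction-oriented",
    "bittorrent": "peer-to-peer file sharing",
}


def classify_by_port(src_port: int, dst_port: int) -> str:
    """Return the application for the first known port, else 'unknown'."""
    # Check the destination port first, then fall back to the source port.
    for port in (dst_port, src_port):
        if port in PORT_TO_APPLICATION:
            return PORT_TO_APPLICATION[port]
    return "unknown"


def coarse_class(application: str) -> str:
    """Map an application name to a coarse traffic class."""
    return APPLICATION_TO_CLASS.get(application, "unknown")


if __name__ == "__main__":
    app = classify_by_port(51234, 80)
    print(app, "->", coarse_class(app))
```

A flow on random or changing ports falls through to "unknown", which is why port matching alone tends to undercount applications that avoid well-known ports.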
Annotated Papers
We have collected and reviewed papers published between 1994 and 2009 (please email info@caida.org if you know of one that should be added), starting with papers from peer-reviewed academic research conferences, and then including many papers cited by this initial seed set, as well as follow-up papers written by the same authors. We provide a flexible interactive table that supports selection of relevant attributes from these papers, including data sets and methods used, goals, and basic empirical findings. We use five paper categories: survey, analysis, methodology, tools, and other. Analysis papers typically attempt to derive trustworthy numbers on the actual traffic cross-section, while methodology papers focus on classification methods. Click on a checkbox below to show that attribute for each paper in a separate column. Below this table is a similar table of attributes of the data sets analyzed in this set of papers.
Datasets
Several public and private passive measurement infrastructures have provided a variety of different datasets for Internet traffic classification studies, which we group into four categories:
- Packet-based: packet-level traces, captured by hardware or software. Endace DAG cards are often used for packet capture on high-bandwidth links (CAIDA uses these cards for its OC-192 backbone trace capture: equinix-chicago, equinix-sanjose). Other capture hardware used over the years includes ATM FORE, OC12 or OC13 POINT ATM, Napatech, and INVEA-TECH. DAG cards can capture traffic on links of up to 10 Gbps with a timestamp resolution better than 15 ns. Most software tools for capturing packets are based on tcpdump/libpcap; CoralReef and Appmon, for example, are built on libpcap. Other packet sniffers and network analyzers are also available.
- SNMP-based: traffic counters and statistics obtained from network devices through the SNMP and RMON MIBs.
- Flow-based: flow-level descriptions of a traffic stream (Cisco NetFlow, Juniper cflowd, Foundry sFlow, Huawei NetStream).
- Other: data not covered by the categories above, such as application-level session logs from web sites.
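To make the relationship between packet-based and flow-based data concrete, here is a minimal sketch that aggregates packet records into unidirectional flow records keyed by the usual 5-tuple. The `Packet` type and its field names are assumptions made for this example; real exporters such as NetFlow also apply timeouts and sampling, which this sketch omits.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical, simplified packet record (fields are illustrative)."""
    timestamp: float
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    length: int


# Unidirectional flow key: (src_ip, dst_ip, src_port, dst_port, protocol).
FlowKey = Tuple[str, str, int, int, str]


def aggregate_flows(packets: Iterable[Packet]) -> Dict[FlowKey, Dict[str, float]]:
    """Group packets by 5-tuple and accumulate per-flow counters."""
    flows: Dict[FlowKey, Dict[str, float]] = {}
    for pkt in packets:
        key = (pkt.src_ip, pkt.dst_ip, pkt.src_port, pkt.dst_port, pkt.protocol)
        record = flows.setdefault(
            key,
            {"packets": 0, "bytes": 0,
             "first_seen": pkt.timestamp, "last_seen": pkt.timestamp},
        )
        record["packets"] += 1
        record["bytes"] += pkt.length
        record["first_seen"] = min(record["first_seen"], pkt.timestamp)
        record["last_seen"] = max(record["last_seen"], pkt.timestamp)
    return flows


if __name__ == "__main__":
    trace = [
        Packet(0.0, "10.0.0.1", "10.0.0.2", 1234, 80, "tcp", 60),
        Packet(0.5, "10.0.0.1", "10.0.0.2", 1234, 80, "tcp", 1500),
        Packet(1.0, "10.0.0.3", "10.0.0.2", 5353, 53, "udp", 80),
    ]
    for key, record in aggregate_flows(trace).items():
        print(key, record)
```

The flow records keep counters and timestamps but discard payload, so a flow-based data set cannot support payload-based classification even when the underlying link carried unencrypted traffic.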
ID | Name | Year | Link Type | Capture Environments | Geographic Location | Payload | Size and Length | PaperID
---|---|---|---|---|---|---|---|---
1 | POSTECH-2007 | 2007 | Academic | Backbone | Asia | Yes(full) | 3 h, 450 Gbytes | [P-39]
2 | TunnelHunter-2007 | 2007 | Academic | Edge/Border | Europe | Yes(part) | | [P-43]
3 | TunnelSSH-2006 | 2006 | Academic | Edge/Border | Europe | Yes(full) | 0.25 h each hour for three weeks, 50 Gbytes | [P-44]
4 | | 2006 | Academic | Backbone | Europe | No | 0.33*4 h each day for 20 days | [P-49][P-50][P-54]
5 | | 2006 | Academic | Backbone | Europe | No | 276 randomized times (10 mins) during 80 days | [P-50]
6 | LosNettos-2005 | 2005 | Academic and Commercial | Backbone | North America | | 24 h, 08/31/2005 | [P-17]
7 | LosNettos-2006 | 2006 | Academic and Commercial | Backbone | North America | | 24 h, 10/03/2006 | [P-17]
8 | CAIDA-OC48-2003 | 2003 | | Backbone | North America | No | 1 h, 95 Gbytes | [P-18]
9 | Abilene-ABIL-2004 | 2004 | Academic | Backbone | North America | No | 1 h, 714 Gbytes | [P-18]
10 | PAIX-PAY1-2004 | 2004 | | Backbone | North America | Yes(16 bytes) | 1 h, 435 Gbytes | [P-18]
11 | PAIX-PAY2-2004 | 2004 | | Backbone | North America | Yes(16 bytes) | 1 h, 374 Gbytes | [P-18]
12 | UNIBS- | | Academic | Edge/Border | Europe | Yes(part) | | [P-19]
13 | Paris6-2004 | 2004 | Academic | Edge/Border | Europe | Yes | 1 h | [P-26]
14 | Paris6-2006 | 2006 | Academic | Edge/Border | Europe | Yes | 1 h | [P-26]
15 | Paris6-2004-2005 | 2004-2005 | Academic | Edge/Border | Europe | Yes | 1*3 h, 27 Gbytes; 35 Gbytes; 300 Gbytes | [P-25]
16 | College-2003 | 2003 | Academic | | Europe | No | 15 mins, 900 Mbytes | [P-25]
17 | ADSL-2004 | 2004 | | | None | No | 15 mins, 2.3 Gbytes | [P-25]
18 | WirelessCrawdad-2003 | 2003 | Academic | | Europe | No | 5 h 30 mins, 330 Mbytes | [P-25]
19 | Enter- | | Commercial | Edge/Border | None | Yes | 1 h 20 mins, 300 Mbytes | [P-25]
20 | UMass-2005 | 2005 | Academic | Edge/Border | North America | Yes(4 bytes) | | [P-25][P-26]
21 | MicroResearch1-2005 | 2005 | Commercial | Edge/Border | North America | Yes | 1 month | [P-29]
22 | MicroResearch2-2005 | 2005 | Commercial | Edge/Border | North America | Yes | 2 weeks | [P-29]
23 | CISCO- | | | Backbone(/8) | | No | | [P-32]
24 | Polito-Academic-2006 | 2006 | Academic | | Europe | | 95 h | [P-30]
25 | Polito-ISP-2006 | 2006 | | Backbone | Europe | | 24 h | [P-30]
26 | Calgary1-2006 | 2006 | Academic and Commercial | | North America | Yes(full) | 1*48 h over 6 months | [P-34]
27 | Calgary2-2006 | 2006 | Academic | | North America | Yes(full) | 1*8 h over 4 days | [P-35]
28 | PAIX-I | 2004 | Commercial | Backbone | North America | Yes(16 bytes) | 2 h, 91 Gbytes | [P-37]
29 | PAIX-II | 2004 | Commercial | Backbone | North America | Yes(16 bytes) | 2 h 2 mins, 891 Gbytes | [P-37]
30 | WIDE-1 | 2006 | | Backbone | Oceania | Yes(40 bytes) | 55 mins, 14 Gbytes | [P-37]
31 | KEIO-I | 2006 | Academic | Edge/Border | Asia | Yes(40 bytes) | 30 mins, 16 Gbytes | [P-37]
32 | KEIO-II | 2006 | Academic | Edge/Border | Asia | Yes(40 bytes) | 30 mins, 16 Gbytes | [P-37]
33 | KAIST-I | 2006 | Academic | Edge/Border | Asia | Yes(40 bytes) | 48 h 12 mins, 506 Gbytes | [P-37]
34 | KAIST-II | 2006 | Academic | Edge/Border | Asia | Yes(40 bytes) | 21 h 16 mins, 259 Gbytes | [P-37]
35 | WIDE-2 | 2006 | | Backbone | Oceania | | 2 h | [P-48]
36 | AUCK- | 2001/2003 | | Edge/Border | Oceania | | 1 h / 3 days | [P-20][P-42][P-48][P-58][P-59]
37 | OC48-2003 | 2003 | | Backbone | North America | | 1 h 2 mins | [P-48]
38 | UCSD-honeypot | | Academic | Intranet | North America | | 5 mins | [P-48]
39 | Calgary3-2006 | 2006 | Academic | Edge/Border | North America | Yes(full) | 1 h | [P-20]
40 | Cambridge-2003 | 2003 | Academic | | Europe | | 24 h | [P-21]
41 | Wireless-2006 | 2006 | Academic | Intranet | North America | | 5 days | [P-21]
42 | UCSDDepart-2006 | 2006 | Academic | Backbone | North America | | 1 h | [P-21]
43 | GMU-2003 | 2003 | Academic | | North America | | 10 mins of each quarter hour over 2 months | [P-22]
44 | PMC- | | Academic | Edge/Border | None | Yes | 1 hour | [P-23]
45 | Callrecords-2005 | 2005 | Commercial | Edge/Border | Europe | | logs | [P-52]
46 | Leipzig-II-20030221 | 2003 | | | | | 1 h | [P-58]
47 | Nzix-II-2000 | 2000 | | | | | 1 h | [P-58]
48 | Genome Academic | | Academic | Edge/Border | Europe | Yes(full) | 24 h/43.9 h, 268 Gbytes/495 Gbytes | [P-1][P-3][P-10][P-15][P-27]
49 | Tier1ISP- | | | Backbone | North America | No | 24 h/3 h, 11-98 Gbytes | [P-8]
50 | University-weekday | 2004 | Academic | | None | | 24.6 h, 1223 Gbytes | [P-10]
51 | University-weekend | 2004 | Academic | | None | | 33.6 h, 1652 Gbytes | [P-10]
52 | Mshmro-2002 | 2002 | Commercial | | None | Yes | 7 days, 60 Gbytes | [P-31]
53 | Accessnetwork-2004/5 | 2004-2005 | | | None | | 1/4/8 h, 100 Gbytes | [P-36]
54 | Abilene-2003 | 2003 | Academic | Backbone | North America | No | 20 days (1/100 pkts) | [P-47]
55 | Geant-2004 | 2004 | | Backbone | Europe | No | 23 days (1/1000 pkts) | [P-47]
56 | Waseda-2002 | 2002 | Academic | Edge/Border | Asia | No | weekday nights over 1 month | [P-51]
57 | ADSL-2002/3 | 2003-2004 | | Backbone | None | | weekdays and weekend days of Sep 2004 and June 2003 | [P-53]
58 | WebServer- | | | | None | | 10,000 web servers for testing purposes | [P-16]
59 | StreamingLogs | 2001 | Commercial | | None | | | [P-4][P-56]
60 | UCSD-NAP-2002 | 2002 | Commercial | | North America | | 31 days | [P-46]
61 | Research-2002 | 2002 | Academic | Edge/Border | None | | 39 days (roughly 15,000 hosts) | [P-46]
62 | OC48-2001 | 2001 | | Backbone | None | | 8 h | [P-46]
63 | DARPA-1998 | 1998 | | | | | 2 and 7 weeks of network-based attacks | [P-45]
64 | Mazu- | | Commercial | | North America | | | [P-41]
65 | BigComany- | | Commercial | | None | | | [P-41]
66 | Tier1-multi | 2001-2002 | | Backbone | None | | 1 h - 6 days | [P-13]
67 | Saarland-2002 | 2002 | Academic | | Europe | Yes | 8 days, 950 Gbytes | [P-7]
68 | Gigascope | 2003/2004 | | | | | | [P-4]
69 | CAIDA-OC48-2002/4 | 2002-2004 | | Backbone | North America | Yes(4 bytes) | 1 h | [P-5]
70 | CAIDA-OC48-2003/4 | 2003-2004 | | Backbone | North America | Yes(part) | 1-2 h | [P-6]
71 | InternetAccessTrace-2003 | 2003 | | | None | Yes(full) | 24 h and 18 h, 120 Gbytes | [P-9]
72 | VPN-2003 | 2003 | | | None | Yes(full) | 6 days, 1.8 Tbytes | [P-9]
73 | MultiRouter-2001 | 2001 | | Backbone | None | | 8000 million flow-level records | [P-40]
74 | Tier1ISP-OC12-2001 | 2001 | | Backbone | North America | No | 3.5 days | [P-55]
75 | USC-2006 | 2006 | | Edge/Border | North America | No | 14 hour period | [P-60]
76 | USC-2006 | 2003-4 | | Edge/Border | North America | No | 2 years | [P-61]
77 | POPs- | | | Backbone | North America | No | | [P-12]
78 | AccessNetwork- | | | Intranet | None | No | 43 hours; 6 Gbytes | [P-62]
79 | MobileNetwork- | | | | Europe and Asia | Yes(full) | | [P-63]
80 | JanpanISPSNMP- | 2004-2008 | Commercial | Backbone | Asia | | Aggregated SNMP data (month-long) from 6 ISPs | [P-64]
81 | JanpanISPNetFlow- | 2005, 2008 | Commercial | Backbone | Asia | | Sampled NetFlow data from 1 ISP | [P-64]
82 | NapleItaly | | Academic | Backbone | Europe | | Generated by a set of controlled boxes | [P-65]
83 | WPIUSA | | Academic | Backbone | North America | | | [P-65]
84 | ADSLPoPFrance | 2006, 2008 | Commercial | | Europe | Yes(full) | 1-2 h, 26-60 Gbytes | [P-67]
85 | ISPEuropean | 2008, 2009 | Commercial | | Europe | | 2*24 h, 14*90 mins; >4 Tbytes, 100-600 Gbytes | [P-68]
86 | DSLSession | 2009 | Commercial | | Europe | | 10 days, 6*24 h (DSL session) | [P-68]
Discussion
P2P traffic is one of the most challenging traffic types to classify, partly due to substantial legal interest in identifying it and even more substantial negative repercussions for the user if P2P traffic is accurately identified. The misaligned incentives between those who want to use and those who want to identify P2P applications, together with the tremendous legal and privacy constraints on traffic research, render scientific study of this question nearly impossible. Even if such study were possible, wide variation across links would prevent a simple numeric answer to the question of how much P2P traffic there is on the Internet. But our taxonomy does reveal insights: the fraction of peer-to-peer file sharing traffic observed ranges from 1.2% to 93% across the 18 papers that provide such numbers. We also know that the average fractions reported increased considerably from 2002 to 2006 (Table 1). Tables 2 and 3 show that results also vary widely by link and geographic location. Table 3 suggests that P2P is more popular in Europe, probably due to stricter policies (MPAA and RIAA) in North America. Note that the Asian results are from Japanese data sets, in which the 1.34% and 1.29% figures are based on port numbers and are therefore likely to significantly underestimate the fraction of P2P traffic. Furthermore, the amount of P2P traffic also varies by time of day, with higher fractions at night (Table 4).
One study [34] suggests that peer-to-peer applications are used more often at home than in the office. Finally, a European study [50] found a higher fraction of P2P traffic on a university link than Canadian researchers [34] found on their campus.
Some numbers are based on statistical or host-behavioral classification. The remaining numbers are based on P2P detection via payload signature matching, the most reliable method of detecting an application (if the traffic is unencrypted), but one that is fraught with legal and privacy issues.
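Payload signature matching can be sketched as a prefix test on the first bytes of a flow. The signature list below is a small illustrative assumption, not the rule set of any surveyed paper; the BitTorrent entry relies on the fact that a BitTorrent handshake begins with the byte 0x13 followed by the string "BitTorrent protocol".

```python
from typing import List, Tuple

# Illustrative signature list: (application, required payload prefix).
# These few entries are assumptions for the example, not a complete rule set.
SIGNATURES: List[Tuple[str, bytes]] = [
    ("bittorrent", b"\x13BitTorrent protocol"),
    ("gnutella", b"GNUTELLA CONNECT"),
    ("ssh", b"SSH-"),
    ("http", b"GET "),
]


def classify_by_payload(payload: bytes) -> str:
    """Return the first application whose signature prefixes the payload."""
    for application, prefix in SIGNATURES:
        if payload.startswith(prefix):
            return application
    return "unknown"


if __name__ == "__main__":
    handshake = b"\x13BitTorrent protocol" + bytes(8)
    print(classify_by_payload(handshake))       # full payload available
    print(classify_by_payload(handshake[:16]))  # payload truncated to 16 bytes
```

The second call shows one practical limit: when a trace keeps only the first 16 payload bytes, the 20-byte BitTorrent prefix can no longer match, so signatures must be shortened (at some cost in precision) or the flow stays unclassified. Encrypted payloads fail in the same way.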
UDP Traffic Analysis
It is still a widely accepted assumption that Internet traffic is dominated by TCP, which is also the basis of most current traffic classification work. However, the rise of new streaming applications (e.g., IPTV such as PPStream and PPLive) and new P2P protocols (e.g., uTP) that try to avoid traffic shaping is expected to increase the use of UDP as a transport protocol.
In this section, we collect UDP measurements reported in existing work, and then compare the use of UDP and TCP in several traffic traces collected at different network and geographic locations, as well as in different time periods.
Table 5 shows that the UDP/TCP ratio ranges from 0.01 to 0.20 in existing work (the residential trace shows a high value). To better evaluate the amount of UDP and TCP traffic in real traces (in terms of flows, packets, and bytes), we analyzed several available traces collected between 2002 and 2009 on several backbone links located in the US and Sweden. Table 6 shows that the use of UDP as a transport protocol increased rapidly from 2002 to 2009, although TCP sessions are still responsible for most packets and bytes. In terms of flows, however, UDP turns out to be the dominant transport protocol.
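The comparison in terms of flows, packets, and bytes can be sketched as follows. The input format (one `(protocol, packets, bytes)` tuple per flow) and the sample numbers are assumptions made for illustration; they are not values from the traces discussed here.

```python
from typing import Dict, Iterable, Tuple

METRICS = ("flows", "packets", "bytes")


def udp_tcp_ratios(flows: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """Compute the UDP/TCP ratio for flows, packets, and bytes.

    Each input item is (protocol, packet_count, byte_count) for one flow.
    Flows of other protocols are ignored.
    """
    totals = {proto: dict.fromkeys(METRICS, 0) for proto in ("tcp", "udp")}
    for protocol, packets, nbytes in flows:
        proto = protocol.lower()
        if proto not in totals:
            continue
        totals[proto]["flows"] += 1
        totals[proto]["packets"] += packets
        totals[proto]["bytes"] += nbytes
    return {
        metric: (totals["udp"][metric] / totals["tcp"][metric]
                 if totals["tcp"][metric] else float("inf"))
        for metric in METRICS
    }


if __name__ == "__main__":
    # Made-up flows: one large TCP transfer and three tiny UDP exchanges.
    sample = [
        ("tcp", 100, 100_000),
        ("udp", 2, 200),
        ("udp", 1, 100),
        ("udp", 3, 700),
    ]
    print(udp_tcp_ratios(sample))
```

With inputs like these, many short UDP flows push the flow ratio above 1 while the packet and byte ratios stay far below 1, which is the same qualitative pattern as UDP dominating by flow count while TCP carries most packets and bytes.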
Conclusion
This overview page presented a rough taxonomy of traffic classification approaches, based on features, methods, goals, and data sets.
Our survey also reveals shortcomings in current traffic classification efforts. The variety of data sets used does not allow systematic comparison of methods, and few research groups (can) share their data sets. As was already true ten years ago, the field of traffic classification research still needs publicly available, modern data sets as reference data for validating approaches. The poor comparability of results is further amplified by the lack of standardized measures and classification goals. For example, there exists no clear definition of traffic classes such as P2P or file sharing.
However, the taxonomy above allows meta-analyses of relevant open questions, such as trends and developments in traffic classes or features, yielding new insights into Internet traffic. We illustrated this by shedding some light on questions such as "how much of modern Internet traffic is P2P?" Though we found some trends and indications, we have far too little data available to make conclusive claims beyond "there is a wide range of P2P traffic on Internet links; see your specific link of interest and classification technique you trust for more details."
Acknowledgments
This work was made possible thanks to funding from DHS-PREDICT, the National Science Foundation, Beijing Jiaotong University, and the China Scholarship Council.