The Cooperative Association for Internet Data Analysis
Internet Traffic Classification
Internet traffic classification continues to attract attention as more applications that use obfuscation techniques emerge on the Internet. Related papers tend to classify whatever traffic samples a researcher can find, with no systematic integration of results. To fill this gap, we have created a structured taxonomy of traffic classification papers and their data sets. We also hope to reveal issues and challenges in traffic classification.

Introduction

The Internet continually evolves in scope and complexity, much faster than our ability to characterize, understand, control, or predict it. The field of Internet traffic classification research includes many papers representing various attempts to classify whatever traffic samples a given researcher has access to, with no systematic integration of results. Here we provide a rough taxonomy of papers, and explain some issues and challenges in traffic classification.

Application Trends

Many media-rich entertainment applications have emerged on the Internet. They often use obfuscation techniques such as encrypted data transmission, random or changing ports, or proprietary communication protocols to prevent detection or filtering by network or content owners who believe the traffic threatens their (infrastructural or intellectual) property. Other applications, e.g., PPStream, uTorrent, and PPLive, use UDP instead of TCP. The rapidly changing nature of applications, even across versions of the same application, presents a challenge for traffic classification techniques.

Definitions

We use the phrase traffic classification to describe methods of classifying traffic based on features passively observed in the traffic, according to specific classification goals. One might have only a coarse classification goal, e.g., whether the traffic is transaction-oriented, bulk-transfer, or peer-to-peer file sharing. Or one might have a finer-grained classification goal, e.g., the exact application represented by the traffic. Traffic features could include the port number, application payload, or temporal, packet size, and addressing characteristics of the traffic. Classification methods include exact matching (e.g., of port number or payload), heuristics, and machine learning (statistics).
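
As a rough illustration of the simplest method named above, exact matching on port number, the following sketch maps a flow's ports to an application (a fine-grained goal) and then to a coarse class. The port table and the application-to-class mapping are illustrative assumptions, not taken from any of the surveyed papers, and the function name is hypothetical.

```python
# Hypothetical lookup tables, chosen only for illustration.
PORT_TO_APP = {21: "ftp", 25: "smtp", 53: "dns", 80: "http", 443: "https"}
APP_TO_COARSE = {
    "ftp": "bulk-transfer",
    "smtp": "bulk-transfer",
    "dns": "transaction-oriented",
    "http": "transaction-oriented",
    "https": "transaction-oriented",
}


def classify_by_port(src_port, dst_port):
    """Return (application, coarse_class) by exact matching on port numbers.

    The destination port is checked first, then the source port.
    Flows on unlisted ports are reported as unknown.
    """
    for port in (dst_port, src_port):
        app = PORT_TO_APP.get(port)
        if app is not None:
            return app, APP_TO_COARSE[app]
    return "unknown", "unknown"


print(classify_by_port(51234, 80))    # client talking to a web server
print(classify_by_port(6881, 51413))  # neither port is in the table
```

An application that uses random or changing ports falls into the unknown case here, which is why port-based figures tend to underestimate such traffic.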

Annotated Papers

We have collected and reviewed papers published between 1994 and 2009 (please email info@caida.org if you know of one that should be added), starting with papers from peer-reviewed academic research conferences, and then including many papers cited by this initial seed set, as well as follow-up papers written by the same authors. We provide a flexible interactive table that supports selection of relevant attributes from these papers, including data sets and methods used, goals, and basic empirical findings. We use five paper categories: survey, analysis, methodology, tools, and other. Analysis papers typically attempt to derive trustworthy numbers on the actual traffic cross-section, while methodology papers focus on methods of classification. Click on a checkbox below to show that attribute for each paper in a separate column. Below this table is a similar table of attributes of the data sets analyzed in this set of papers.

ID Title Year Publication Authors Paper Type Classification Goals Classification Characteristics Method Empirical Findings % of traffic P2P PDF DataID
1 Toward the Accurate Identification of Network Applications 2005 PAM A. Moore, K. Papagiannaki Methodology Coarse-grained Classification Application Payload Exact Matching Port-based: 64.54%BULK;27.30%WWW; Content-based: 45.00%BULK;20.40%WWW; 1.5% [D-48]
2 Flow Clustering Using Machine Learning Techniques 2004 PAM A. McGregor, M. Hall, P. Lorier, J. Brunskill Methodology Coarse-grained Classification Flow Characteristics Machine Learning/Stat (EM)
3 Internet Traffic Classification Using Bayesian Analysis Techniques 2005 SIGMETRICS A. Moore, D. Zuev Methodology Coarse-grained Classification Flow Characteristics Machine Learning/Stat (Bayesian) 65% accuracy on per-flow classification and better than 95% with refinements [D-48]
4 Class-of-service Mapping for QoS 2004 IMC M. Roughan, S. Sen, O. Spatscheck, N. Duffield Methodology Fine-grained Classification Flow Characteristics Machine Learning/Stat (NN,LDA) [D-59][D-68]
5 Is P2P Dying or just Hiding? 2004 GLOBECOM T. Karagiannis, A. Broido, N. Brownlee, K. Claffy, M. Faloutsos Methodology Fine-grained Classification Application Payload Exact Matching 2003:HTTP(72%,47.7%);SMTP(1.3%,1.2%);P2P(8%,10.7%) 2004:HTTP(56%,52.1%);SMTP(3.2%,9.7%);P2P(14%,9.9%) 8%,10.7% in 2003;14%,9.9% in 2004 [D-69]
6 Transport Layer Identification of P2P Traffic 2004 SIGCOMM T. Karagiannis, A. Broido, M. Faloutsos, K. Claffy Methodology Coarse-grained Classification (I) Flow Characteristics Heuristics P2P traffic continues to grow unabatedly 15%-20% [D-70]
7 An Analysis of Internet Chat Systems 2003 SIGCOMM C. Dewes, A. Wichmann, A. Feldmann Profiling/Analysis miss less than 8.3% of all existing chat connections and to correctly classify at least 93.1% [D-67]
8 Profiling Internet Backbone Traffic: Behavior Models and Applications 2005 SIGCOMM K. Xu, Z. Zhang, S. Bhattacharyya Profiling/Analysis [D-49]
9 Accurate Scalable In-Network Identification of P2P Traffic 2004 WWW S. Sen, O. Spatscheck, D. Wang Methodology Coarse-grained Classification (I) Flow Characteristics Heuristics less than 5% false positive and false negative ratios [D-71][D-72]
10 BLINC Multilevel Traffic Classification in the Dark 2005 SIGCOMM T. Karagiannis, K. Papagiannaki, M. Faloutsos Methodology Coarse-grained Classification Flow Characteristics Heuristics web:14%,37.5%,33.5%;data(ftp):67.4%,7.6%,5.4%; 1.2%,31.9%,31.3% [D-48]
11 CoralReef Software Suite as a Tool for System and Network Administrators 2001 LISA D. Moore, K. Keys, R. Koga, E. Lagache, K. Claffy Tools
12 Packet-level Traffic Measurements from the Sprint IP Backbone 2003 IEEE Network C. Fraleigh, S. Moon, B. Lyles, C. Cotton, M. Khan, D. Moll, R. Rockell, T. Seely Profiling/Analysis over 90% flows have packet sizes of 1495 bytes or greater 0.1%-80%(p2p+unknown) [D-77]
13 Snort Lightweight Intrusion Detection for Networks 1999 LISA M. Roesch Tools [D-66]
14 Internet Traffic Characterization 1994 K. Claffy Profiling/Analysis
15 Discriminators for Use in Flow-based Classification 2004 A. Moore, D. Zuev, M. Crogan Other [D-48]
16 Identifying the TCP Behavior of Web Servers 2001 J. Padhye, S. Floyd Profiling/Analysis [D-58]
17 Inherent Behaviors for On-line Detection of Peer-to-Peer File Sharing 2007 IEEE Global Internet G. Bartlett, J. Heidemann, C. Papadopoulos Methodology Fine-grained Classification (I) Flow Characteristics Heuristics achieve up to an 83% true positive rate with only a 2% false positive rate [D-6][D-7]
18 Graption: Automated Detection of P2P Applications Using Traffic Dispersion Graphs 2008 Technical Report M. Iliofotou, P. Pappu, M. Faloutsos Methodology Coarse-grained Classification (I) Flow Characteristics Heuristics more than 90% precision and recall for P2P detection 9.19% [D-8][D-11]
19 Traffic Classification through Simple Statistical Fingerprinting 2007 SIGCOMM CCR M. Crotti, M. Dusi, F. Gringoli, L. Salgarelli Methodology Fine-grained Classification Flow Characteristics Machine Learning/Stat (Normalized thresholds) [D-12]
20 Traffic Classification Using Clustering Algorithms 2006 SIGCOMM J. Erman, M. Arlitt, A. Mahanti Methodology Fine-grained Classification Flow Characteristics Machine Learning/Stat (K-Means, DBSCAN) 47.3%(Bytes,HTTP);35.1%(Bytes,P2P);6.0%(Bytes,SMTP) 35.1%(Bytes) [D-36][D-39]
21 Unexpected Means of Protocol Inference 2006 IMC J. Ma, K. Levchenko, C. Kreibich, S. Savage, G. Voelker Methodology Fine-grained Classification Application Payload Machine Learning/Stat (Product Distribution; Markov Processes; CSG) [D-40][D-41][D-42]
22 On Inferring Application Protocol Behaviors in Encrypted Network Traffic 2006 Journal of Machine Learning Research C. Wright, F. Monrose, G. Masson Methodology Fine-grained Classification Flow Characteristics Machine Learning/Stat (HMM) achieve greater than 90% for several protocols in aggregate traffic [D-43]
23 Traffic Classification on the fly 2006 SIGCOMM CCR L. Bernaille, R. Teixeira, I. Akodjenou, A. Soule, K. Salamatian Methodology Fine-grained Classification Flow Characteristics Machine Learning/Stat (K-Means) [D-44]
24 Blind Application Recognition Through Behavioral Classification 2005 L. Bernaille, A. Soule, M. Jeannin, K. Salamatian Methodology Fine-grained Classification Flow Characteristics Machine Learning/Stat (HMM)
25 Early Application Identification 2006 CONEXT L. Bernaille, R. Teixeira, K. Salamatian Methodology Fine-grained Classification Flow Characteristics Machine Learning/Stat (K-Means, GMM, Spectral Clustering) classify known applications with an accuracy over 90%; identify new applications as unknown with a probability of 60% [D-15][D-16][D-17][D-18][D-19][D-20]
26 Early Recognition of Encrypted Application 2007 PAM L. Bernaille, R. Teixeira Methodology Fine-grained Classification Flow Characteristics Machine Learning/Stat more than 85% accuracy in recognizing the application in an SSL connection [D-13][D-14][D-20]
27 Traffic Classification using a Statistical Approach 2005 PAM D. Zuev, A. Moore Methodology Fine-grained Classification Flow Characteristics Machine Learning/Stat (Bayes) achieve better than 83% accuracy on both a per-byte and a per-packet basis [D-48]
28 Appmon: An Application for Accurate per Application Network Traffic Characterization 2006 BroadBand Europe D. Antoniades, M. Polychronakis, S. Antonatos, E. Markatos, S. Ubik Tools
29 Profiling the End Host 2007 PAM T. Karagiannis, K. Papagiannaki, N. Taft, M. Faloutsos Profiling/Analysis [D-21][D-22]
30 Revealing Skype Traffic: when randomness plays with you 2007 SIGCOMM D. Bonfiglio, M. Mellia, M. Meo, D. Rossi, P. Tofanelli Profiling/Analysis [D-24][D-25]
31 A Traffic Characterization of Popular on-line Games 2005 IEEE/ACM Transactions on Networking W. Chang Feng, F. Chang, W. chi Feng, J. Walpole Profiling/Analysis [D-52]
32 Hit-list worm detection and bot identification in large networks using protocol graphs 2007 RAID M. Collins, M. Reiter Profiling/Analysis [D-23]
33 Identifying Known and Unknown Peer-to-Peer Traffic 2006 IEEE NCA F. Constantinou, P. Mavrommatis
34 Offline/realtime Traffic Classification Using Semi-supervised Learning 2007 Perform. Eval J. Erman, A. Mahanti, M. Arlitt, I. Cohen, C. Williamson Methodology Coarse-grained Classification Machine Learning/Stat (K-Means) 37.4%(Bytes,P2P,Campus);80%(Bytes,P2P,Residential);61.5%(Bytes,P2P,WLAN) 37.4%(Bytes,Campus);80%(Bytes,Residential);61.5%(Bytes,WLAN) [D-26]
35 Identifying and Discriminating between Web and Peer-to-Peer Traffic in the Network Core 2007 WWW J. Erman, A. Mahanti, M. Arlitt, C. Williamson Methodology Coarse-grained Classification Flow Characteristics Machine Learning/Stat (K-Means) 38.3% [D-27]
36 Acas: Automated Construction of Application Signatures 2005 SIGCOMM P. Haffner, S. Sen, O. Spatscheck, D. Wang Methodology Fine-grained Classification Application Payload (64 bytes) Machine Learning/Stat (Naive Bayes, AdaBoost, Maximum Entropy) [D-53]
37 Comparison of Internet Traffic Classification Tools 2007 IMRG WACI H. Kim, K. Claffy, M. Fomenkov, N. Brownlee, D. Barman, M. Faloutsos Survey/Compare [D-28][D-29][D-30][D-31][D-32][D-33][D-34]
38 A Survey of Techniques for Internet Traffic Classification Using Machine Learning 2008 IEEE Communications Surveys and Tutorials T. Nguyen, G. Armitage Survey/Compare
39 Towards Automated Application Signature Generation 2008 NOMS B. Park, Y. Won, M. Kim, J. Hong Methodology Fine-grained Classification Application Payload Heuristics [D-1]
40 Analyzing Peer-to-Peer Traffic across Large networks 2004 IEEE/ACM Transactions on Networking S. Sen, J. Wang Profiling/Analysis [D-73]
41 Role Classification of Hosts within Enterprise Networks based on Connection Patterns 2003 USENIX G. Tan, M. Poletto, J. Guttag, F. Kaashoek Methodology Flow Characteristics Heuristics [D-64][D-65]
42 Self-learning IP Traffic Classification based on Statistical Flow Characteristics 2005 PAM S. Zander, T. Nguyen, G. Armitage Methodology Fine-grained Classification Flow Characteristics Machine Learning/Stat (EM) [D-36]
43 Tunnel Hunter: Detecting Application-Layer Tunnels with Statistical Fingerprinting 2008 Computer Networks M. Dusi, M. Crotti, F. Gringoli, L. Salgarelli Methodology Fine-grained Classification Flow Characteristics Machine Learning/Stat [D-2]
44 A Preliminary Look at the Privacy of SSH Tunnels 2008 ICC M. Dusi, F. Gringoli, L. Salgarelli Methodology Fine-grained Classification Flow Characteristics Machine Learning/Stat (GMM) [D-3]
45 A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection 2003 SIAM A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, J. Srivastava Survey/Compare [D-63]
46 Automatically Inferring Patterns of Resource Consumption in Network Traffic 2003 SIGCOMM C. Estan, S. Savage, G. Varghese Methodology Coarse-grained Classification Flow Characteristics Heuristics [D-60][D-61][D-62]
47 Mining Anomalies Using Traffic Feature Distributions 2005 SIGCOMM A. Lakhina, M. Crovella, C. Diot Methodology Coarse-grained Classification Heuristics (K-Means, Hierarchical Agglomerative Algorithm) [D-54][D-55]
48 Network Traffic Analysis using Traffic Dispersion Graphs (TDGs): Techniques and Hardware Implementation 2007 Technical Report M. Iliofotou, P. Pappu, M. Faloutsos, M. Mitzenmacher, S. Singh, G. Varghese Tools [D-35][D-36][D-37][D-38]
49 Heuristics to Classify Internet Backbone Traffic based on Connection Patterns 2008 ICOIN W. John, S. Tafvelin Methodology Coarse-grained Classification Flow Characteristics Heuristic leave only 0.2% of the data unclassified; can identify 95% of P2P flows 42%(average) in connections, 79%(average) in traffic [D-4]
50 Trends and Differences in Connections Behavior within Classes of Internet Backbone Traffic 2008 PAM W. John, S. Tafvelin, T. Olovsson Profiling/Analysis P2P and HTTP traffic exhibit different peak times. P2P traffic was found to be clearly dominating with 90% of the transfer volumes, especially during evening and night times. In contrast, HTTP traffic has its main activities (9% of the data volumes) during office hours. 93%(evening);91%(night);86%(office hours) [D-4][D-5]
51 Flow Analysis of Internet Traffic: World Wide Web versus Peer-to-Peer 2005 Systems and Computers in Japan M. Perenyi, T. Dang, A. Gefferth, S. Molnar Profiling/Analysis 57.52% for WWW, 21.53% for P2P, 20.95% for other 21.53% [D-56]
52 Identification and Analysis of Peer-to-Peer Traffic 2006 Journal of Communications M. Perenyi, T. Dang, A. Gefferth, S. Molnar Methodology Fine-grained Classification Flow Characteristics Heuristics 60%-80% [D-45]
53 Analysis of Peer-to-Peer Traffic on ADSL 2005 PAM L. Plissonneau, J. Costeux, P. Brown Profiling/Analysis 40% of connections are only connection reattempts, and it concerns about 30% of peers 60% in 2004, 65% in 2003(lies on P2P ports) [D-57]
54 Analysis of Internet Backbone Traffic and Header Anomalies Observed 2007 IMC W. John, S. Tafvelin Profiling/Analysis [D-4]
55 Flow Classification by Histograms or How to Go on Safari in the Internet 2004 SIGMETRICS A. Soule, K. Salamatian, N. Taft, R. Emilion, K. Papagiannaki Methodology Coarse-grained Classification Flow Characteristics Machine Learning/Stat (EM) [D-74]
56 Streaming Video Traffic: Characterization and Network Impact 2002 WCW J. Merwe, S. Sen, C. Kalmanek Profiling/Analysis [D-59]
57 The architecture of CoralReef: an Internet traffic monitoring software suite 2001 PAM K. Keys, D. Moore, R. Koga, E. Lagache, M. Tesch, K. Claffy Tools
58 A Preliminary Performance Comparison of Five Machine Learning Algorithms for Practical IP Traffic Flow Classification 2006 SIGCOMM N. Williams, S. Zander, G. Armitage Survey/Compare [D-36][D-46][D-47]
59 Internet Traffic Identification using Machine Learning 2006 GLOBECOM J. Erman, A. Mahanti, M. Arlitt Methodology Fine-grained Classification Flow Characteristics Machine Learning/Stat (EM, Bayes) 81.2%http;3.1%smtp;2.0%dns [D-36]
60 Estimating P2P Traffic Volume at USC 2007 G. Bartlett, J. Heidemann, C. Papadopoulos, J. Pepin Profiling/Analysis Port Number and Flow Characteristics Exact Matching and Heuristics 3%-13% of active hosts on campus participate in P2P 21%-33%(less than, Byte) [D-75]
61 A Longitudinal Study of P2P Traffic Classification 2006 MASCOTS A. Madhukar, C. Williamson Survey/Compare 30%-70% of the campus Internet traffic for 2003-2005 was P2P 30%-70% [D-76]
62 On the Validation of Traffic Classification Algorithms 2008 PAM G. Szabo, D. Orincsay, S. Malomsoky and I. Szabo Other Fine-grained Classification, for validating classification methods P2P:70%;Web:26%;VoIP:2%;Streaming:1%;Secure Channel:1%; 70% [D-78]
63 Accurate Traffic Classification 2007 WoWMoM G. Szabo, I. Szabo and D. Orincsay Methodology Fine-grained Classification [D-79]
64 Observing Slow Crustal Movement in Residential User Traffic 2008 ACM CoNEXT Kenjiro Cho, Kensuke Fukuda, Hiroshi Esaki and Akira Kato Analysis Port Number Between May 2005 and May 2008, the average annual growth rate of (A1) RBB customers: 26% for inbound; 28% for outbound; 27% for the combined volume; 0.92%(gnutella,2005);0.25%(bittorrent,2005);0.12%(edonkey,2005);0.94%(gnutella,2008);0.22%(bittorrent,2008);0.13%(edonkey,2008); [D-80][D-81]
65 Classification of Network Traffic via Packet-Level Hidden Markov Models 2008 GLOBECOM Alberto Dainotti, Walter de Donato, Antonio Pescape and Pierluigi Salvo Rossi Methodology Flow Characteristics Machine Learning (HMM) [D-82][D-83]
66 TIE: a Community-oriented Traffic Classification Platform 2009 IMA Alberto Dainotti, Walter de Donato, Antonio Pescape and Giorgio Ventre Tools
67 Challenging Statistical Classification for Operational Usage: the ADSL Case 2009 IMC Marcin Pietrzyk, Jean-Laurent, Guillaume Urvoy-Keller and Taoufik Analysis TCP Flow Characteristics Machine Learning/Stat Most bytes and flows are due to eDonkey; The vast majority of traffic in the HTTP Streaming class is due to Dailymotion and Youtube; 5%-40%(flows);10%-50%(bytes) [D-84]
68 On Dominant Characteristics of Residential Broadband Internet Traffic 2009 IMC Gregor Maier, Anja Feldmann, Vern Paxson and Mark Allman Analysis Port Number and Application Payload Exact Matching HTTP carries more than 50% traffic; Flash Video contributes 25% of all HTTP traffic; 14%(bytes) [D-85][D-86]

Datasets

Several public and private passive measurement infrastructures have provided a variety of different datasets for Internet traffic classification studies, which we group into four categories:

  • Packet-based: packet-level traces, captured by hardware or software. Endace DAG cards are often used for packet capture on high-bandwidth links (CAIDA uses these cards for its OC-192 backbone trace capture: equinix-chicago, equinix-sanjose). Other capture hardware used over the years includes ATM FORE, OC12 or OC13 POINT ATM, Napatech, and INVEA-TECH. DAG cards can capture traffic on links of up to 10 Gbps with a timestamp resolution finer than 15 ns. Most software tools for capturing packets are based on kernel implementations such as tcpdump/libpcap; CoralReef, Appmon, and Ethereal are based on libpcap;
  • SNMP-based: traffic counters and statistics obtained from network devices through the SNMP and RMON MIBs;
  • Flow-based: flow-level descriptions of a traffic stream (Cisco NetFlow, Juniper cflowd, Foundry sFlow, Huawei NetStream);
  • Other: data not covered by the categories above, such as application-level session logs from web sites.
The data come from three types of capture environments: Intranet, Edge/Border, and Backbone.
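
To make the relationship between packet-based and flow-based data concrete, the sketch below aggregates packet records into flow records keyed by the usual 5-tuple. It is a simplified assumption of how flow export works: flows are unidirectional, there is no timeout or sampling, and the record layout and function name are hypothetical rather than those of NetFlow or any other exporter.

```python
from collections import defaultdict


def packets_to_flows(packets):
    """Aggregate packet records into per-flow packet and byte counters.

    Each packet is a tuple:
    (src_ip, dst_ip, src_port, dst_port, proto, size_bytes).
    Flows are unidirectional and never expire in this sketch.
    """
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for src_ip, dst_ip, src_port, dst_port, proto, size in packets:
        key = (src_ip, dst_ip, src_port, dst_port, proto)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += size
    return dict(flows)


# Made-up packets using documentation address ranges.
sample = [
    ("192.0.2.1", "198.51.100.7", 51234, 80, "tcp", 60),
    ("192.0.2.1", "198.51.100.7", 51234, 80, "tcp", 1500),
    ("192.0.2.9", "198.51.100.53", 40000, 53, "udp", 72),
]
for key, counters in packets_to_flows(sample).items():
    print(key, counters)
```

A packet trace can always be reduced to flow records this way, but not the reverse, which is one reason the data set categories above are not interchangeable for classification studies.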


ID Name Year Link Type Capture Environments Geographic Location Payload Size and Length PaperID
1 POSTECH-2007 2007 Academic Backbone Asia Yes(full) 3 h; 450 Gbytes [P-39]
2 TunnelHunter-2007 2007 Academic Edge/Border Europe Yes(part) [P-43]
3 TunnelSSH-2006 2006 Academic Edge/Border Europe Yes(full) 0.25 h each hour for three weeks; 50 Gbytes [P-44]
4 SUNET1-2006 2006 Academic Backbone Europe No 0.33*4 h each day for 20 days [P-49][P-50][P-54]
5 SUNET2-2006 2006 Academic Backbone Europe No 276 randomized times (10 mins) during 80 days [P-50]
6 LosNettos-2005 2005 Academic and Commercial Backbone North America 24 h, 08/31/2005 [P-17]
7 LosNettos-2006 2006 Academic and Commercial Backbone North America 24 h, 10/03/2006 [P-17]
8 CAIDA-OC48-2003 2003 Backbone North America No 1 h; 95 Gbytes [P-18]
9 Abilene-ABIL-2004 2004 Academic Backbone North America No 1 h; 714 Gbytes [P-18]
10 PAIX-PAY1-2004 2004 Backbone North America Yes(16 bytes) 1 h; 435 Gbytes [P-18]
11 PAIX-PAY2-2004 2004 Backbone North America Yes(16 bytes) 1 h; 374 Gbytes [P-18]
12 UNIBS- Academic Edge/Border Europe Yes(part) [P-19]
13 Paris6-2004 2004 Academic Edge/Border Europe Yes 1 h [P-26]
14 Paris6-2006 2006 Academic Edge/Border Europe Yes 1 h [P-26]
15 Paris6-2004-2005 2004-2005 Academic Edge/Border Europe Yes 1*3 h; 27 Gbytes; 35 Gbytes; 300 Gbytes [P-25]
16 College-2003 2003 Academic Europe No 15 mins; 900 Mbytes [P-25]
17 ADSL-2004 2004 None No 15 mins; 2.3 Gbytes [P-25]
18 WirelessCrawdad-2003 2003 Academic Europe No 5 h 30 mins; 330 Mbytes [P-25]
19 Enter- Commercial Edge/Border None Yes 1 h 20 mins; 300 Mbytes [P-25]
20 UMass-2005 2005 Academic Edge/Border North America Yes(4 bytes) [P-25][P-26]
21 MicroResearch1-2005 2005 Commercial Edge/Border North America Yes 1 month [P-29]
22 MicroResearch2-2005 2005 Commercial Edge/Border North America Yes 2 weeks [P-29]
23 CISCO- Backbone(/8) No [P-32]
24 Polito-Academic-2006 2006 Academic Europe 95 h [P-30]
25 Polito-ISP-2006 2006 Backbone Europe 24 h [P-30]
26 Calgary1-2006 2006 Academic and Commercial North America Yes(full) 1*48 h over 6 months [P-34]
27 Calgary2-2006 2006 Academic North America Yes(full) 1*8 h over 4 days [P-35]
28 PAIX-I 2004 Commercial Backbone North America Yes(16 bytes) 2 h; 91 Gbytes [P-37]
29 PAIX-II 2004 Commercial Backbone North America Yes(16 bytes) 2 h 2 mins; 891 Gbytes [P-37]
30 WIDE-1 2006 Backbone Oceania Yes(40 bytes) 55 mins; 14 Gbytes [P-37]
31 KEIO-I 2006 Academic Edge/Border Asia Yes(40 bytes) 30 mins; 16 Gbytes [P-37]
32 KEIO-II 2006 Academic Edge/Border Asia Yes(40 bytes) 30 mins; 16 Gbytes [P-37]
33 KAIST-I 2006 Academic Edge/Border Asia Yes(40 bytes) 48 h 12 mins; 506 Gbytes [P-37]
34 KAIST-II 2006 Academic Edge/Border Asia Yes(40 bytes) 21 h 16 mins; 259 Gbytes [P-37]
35 WIDE-2 2006 Backbone Oceania 2 h [P-48]
36 AUCK- 2001/2003 Edge/Border Oceania 1 h /3 days [P-20][P-42][P-48][P-58][P-59]
37 OC48-2003 2003 Backbone North America 1 h 2 mins [P-48]
38 UCSD-honeypot Academic Intranet North America 5 mins [P-48]
39 Calgary3-2006 2006 Academic Edge/Border North America Yes(full) 1 h [P-20]
40 Cambridge-2003 2003 Academic Europe 24 h [P-21]
41 Wireless-2006 2006 Academic Intranet North America 5 days [P-21]
42 UCSDDepart-2006 2006 Academic Backbone North America 1 h [P-21]
43 GMU-2003 2003 Academic North America 10 mins of each quarter hour over 2 months [P-22]
44 PMC- Academic Edge/Border None Yes 1 hour [P-23]
45 Callrecords-2005 2005 Commercial Edge/Border Europe logs [P-52]
46 Leipzig-II-20030221 2003 1 h [P-58]
47 Nzix-II-2000 2000 1 h [P-58]
48 Genome Academic Academic Edge/Border Europe Yes(full) 24 h/43.9 h; 268 Gbytes/495 Gbytes [P-1][P-3][P-10][P-15][P-27]
49 Tier1ISP- Backbone North America No 24 h/3 h; 11-98 Gbytes [P-8]
50 University-weekday 2004 Academic None 24.6 h; 1223 Gbytes [P-10]
51 University-weekend 2004 Academic None 33.6 h; 1652 Gbytes [P-10]
52 Mshmro-2002 2002 Commercial None Yes 7 days; 60 Gbytes [P-31]
53 Accessnetwork-2004/5 2004-2005 None l/4/8 h; 100 Gbytes [P-36]
54 Abilene-2003 2003 Academic Backbone North America No 20 days(1/100 pkts) [P-47]
55 Geant-2004 2004 Backbone Europe No 23 days(1/1000 pkts) [P-47]
56 Waseda-2002 2002 Academic Edge/Border Asia No weekday nights over 1 month [P-51]
57 ADSL-2002/3 2003-2004 Backbone None weekdays and weekend days of Sep 2004 and June 2003 [P-53]
58 WebServer- None 10,000 webServers as testing purposes [P-16]
59 StreamingLogs 2001 Commercial None [P-4][P-56]
60 UCSD-NAP-2002 2002 Commercial North America 31 days [P-46]
61 Research-2002 2002 Academic Edge/Border None 39 days (roughly 15,000 hosts) [P-46]
62 OC48-2001 2001 Backbone None 8 h [P-46]
63 DARPA-1998 1998 2 and 7 weeks of network-based attacks [P-45]
64 Mazu- Commercial North America [P-41]
65 BigComany- Commercial None [P-41]
66 Tier1-multi 2001-2002 Backbone None 1 h - 6 days [P-13]
67 Saarland-2002 2002 Academic Europe Yes 8 days; 950 Gbytes [P-7]
68 Gigascope 2003/2004 [P-4]
69 CAIDA-OC48-2002/4 2002-2004 Backbone North America Yes(4 bytes) 1 h [P-5]
70 CAIDA-OC48-2003/4 2003-2004 Backbone North America Yes(part) 1-2 h [P-6]
71 InternetAccessTrace-2003 2003 None Yes(full) 24 h and 18 h; 120 Gbytes [P-9]
72 VPN-2003 2003 None Yes(full) 6 days; 1.8 Tbytes [P-9]
73 MultiRouter-2001 2001 Backbone None 8000 million flow level records [P-40]
74 Tier1ISP-OC12-2001 2001 Backbone North America No 3.5 days [P-55]
75 USC-2006 2006 Edge/Border North America No 14 hour period [P-60]
76 USC-2006 2003-4 Edge/Border North America No 2 years [P-61]
77 POPs- Backbone North America No [P-12]
78 AccessNetwork- Intranet None No 43 hours; 6 Gbytes [P-62]
79 MobileNetwork- Europe and Asia Yes(full) [P-63]
80 JanpanISPSNMP- 2004-2008 Commercial Backbone Asia aggregated SNMP data (month-long) from 6 ISPs; [P-64]
81 JanpanISPNetFlow- 2005,2008 Commercial Backbone Asia Sampled NetFlow data from 1 ISP; [P-64]
82 NapleItaly Academic Backbone Europe Generated by a set of controlled boxes; [P-65]
83 WPIUSA Academic Backbone North America [P-65]
84 ADSLPoPFrance 2006, 2008 Commercial Europe yes(full) 1-2 h; 26-60 Gbytes [P-67]
85 ISPEuropean 2008, 2009 Commercial Europe 2*24 h, 14*90 mins; >4 Tbytes, 100-600 Gbytes [P-68]
86 DSLSession 2009 Commercial Europe 10 days, 6*24 h (DSL session) [P-68]

Discussion

P2P traffic is one of the most challenging traffic types to classify, partly due to substantial legal interest in identifying it and even more substantial negative repercussions to the user if P2P traffic is accurately identified. The misaligned incentives between those who want to use and those who want to identify P2P applications, together with the tremendous legal and privacy constraints on traffic research, render scientific study of this question nearly impossible. Even if such study were possible, wide variation across links would prevent a simple numeric answer to the question of how much P2P traffic there is on the Internet. But our taxonomy does reveal insights: the fraction of peer-to-peer file sharing traffic observed ranges from 1.2% to 93% across the 18 papers that provide such numbers. We also know that the average fractions reported increased considerably from 2002 to 2006 (Table 1). Tables 2 and 3 show that results also vary widely by link and geographic location. Table 3 suggests that P2P is more popular in Europe, probably due to stricter policies (MPAA and RIAA) in North America. Note that the Asian results are from Japanese data sets, in which the 1.34% and 1.29% figures are based on port numbers and therefore likely to significantly underestimate the fraction of P2P traffic. Furthermore, the amount of P2P traffic also varies by time of day, with higher fractions at night (Table 4).

One study [34] suggests that peer-to-peer applications are used more often at home than in the office. Finally, a European study [50] found a higher fraction of P2P traffic on a university link than some Canadian academics [34] found on their campus.

Some numbers are based on statistical or host-behavioral classification. The remaining numbers are based on P2P detection via payload signature matching, the most reliable method of detecting an application (if unencrypted), which is however fraught with legal and privacy issues.

Year Range PaperID
2002 21.5% [51]
2004 9.19-60% [5],[6],[10],[18],[53]
2006 35.1-93% [20],[34],[35],[50]
Table 1. P2P Range(Year).
 
Year Link Location Range PaperID
2004 Campus link 31.3% [10]
2004 ADSL link 60% [53]
2004 Backbone link 9-14% [5],[18]
17-25% [6]
Table 2. P2P Range(Link Location).
Geographic Location Year P2P Range PaperID
Europe 2005 60-80% [52]
2006 79-93% [49],[50]
North America 2003 8%,10.7% [5]
2004 14%,9.9% [5]
2003-04 9.19-70% [6],[18],[61]
2006 21-35.1% [20],[34],[35]
Asia 2002 21.53% [51]
2005 1.34% (port-based) [64]
2008 1.29% (port-based) [64]
Table 3. P2P Range(Geographic).
 
Year Time Range DataID PaperID
2006 midnight to 10am 80% [D-26] [34]
9am to 10am 61.5%
2006 evening 93% [D-4],[D-5] [50]
night 91%
office hours 86%
Table 4. P2P Range(Time).

UDP Traffic Analysis

It is still commonly assumed that Internet traffic is dominated by TCP, an assumption that also underlies most current traffic classification work. However, the rise of new streaming applications (e.g., IPTV such as PPStream and PPLive) and new P2P protocols (e.g., uTP) that try to avoid traffic shaping is expected to increase the use of UDP as a transport protocol.

In this section, we collect UDP analyses from existing work, and then compare the usage of UDP and TCP in several traffic traces collected in different network and geographic locations, as well as in different time periods.

Table 5 shows that the UDP/TCP ratio ranges from 0.01 to 0.20 in existing work (with a high value in the residential trace). To better evaluate the amount of UDP and TCP traffic in real traces (in terms of flows, packets, and bytes), we analyzed several available traces collected in the period 2002-2009 on several backbone links located in the US and Sweden. Table 6 shows that the use of UDP as a transport protocol rapidly increased from 2002 to 2009, although TCP sessions are still responsible for most packets and bytes. In terms of flows, however, UDP turns out to be the dominant transport protocol.

PaperID Year UDP/TCP Ratio (pkts bytes flows) Notes
P-1 around 2003 0.01 0.01
2006 0.11 0.05
0.02 0.04 WLAN Trace
0.20 0.20 10-hour Residential Trace
2006 1.12
2005 0.01
2008 0.02
Table 5. Values of UDP/TCP Ratio (from papers).
 
Trace Sample UDP/TCP Ratio (pkts bytes flows) Total IP Traffic (pkts/bytes/flows)
08-2002 0.11 0.03 0.11 (1371M/838GB/79M)
01-2003 0.12 0.05 0.27 (463M/267GB/26M)
04-2006 0.06 0.02 1.06 (422M/294GB/9M)
11-2006 0.08 0.03 1.45
06-2008 0.14 0.05 1.43 (4427M/2279GB/197M)
02-2009 0.19 0.07 2.34 (1922M/1410GB/110M)
01-2009 0.21 0.11 3.09 (1100M/657GB/41M)
02-2009 0.20 0.11 2.63
Table 6. Values of UDP/TCP Ratio (from real-traces).
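
The three ratios reported per trace can be sketched as follows. This is only an illustration of the arithmetic, assuming flow records that already carry a protocol label and packet and byte counts; the record layout, function name, and sample numbers are hypothetical and are not the tooling or data behind Table 6.

```python
def udp_tcp_ratios(flow_records):
    """Compute UDP/TCP ratios in terms of packets, bytes, and flows.

    Each record is a tuple (proto, packets, bytes) with proto "tcp" or "udp".
    Records for other protocols are ignored. A ratio is None when the
    trace contains no TCP traffic for that metric.
    """
    totals = {
        "tcp": {"pkts": 0, "bytes": 0, "flows": 0},
        "udp": {"pkts": 0, "bytes": 0, "flows": 0},
    }
    for proto, pkts, nbytes in flow_records:
        if proto not in totals:
            continue
        totals[proto]["pkts"] += pkts
        totals[proto]["bytes"] += nbytes
        totals[proto]["flows"] += 1
    ratios = {}
    for metric in ("pkts", "bytes", "flows"):
        tcp_total = totals["tcp"][metric]
        ratios[metric] = totals["udp"][metric] / tcp_total if tcp_total else None
    return ratios


# Made-up records: one large TCP flow and three small UDP flows.
records = [("tcp", 100, 100000), ("udp", 2, 200), ("udp", 3, 300), ("udp", 5, 500)]
print(udp_tcp_ratios(records))
```

In this made-up input, UDP accounts for most flows but only a small share of packets and bytes, the same qualitative pattern the later traces in Table 6 show.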

Conclusion

This overview page presented a rough taxonomy of traffic classification approaches, based on features, methods, goals and data sets.

Our survey also reveals shortcomings in current traffic classification efforts. The variety of data sets used does not allow systematic comparison of methods. Few research groups (can) share their datasets. As was already true ten years ago, the field of traffic classification research still needs publicly available, modern data sets as reference data for validating approaches. The poor comparability of results is further amplified by the lack of standardized measures and classification goals. For example, there exists no clear definition for traffic classes such as P2P or file-sharing.

However, the taxonomy above allows meta-analyses of relevant open questions, such as trends and developments in traffic classes or features, yielding new insights into Internet traffic. We showed this by shedding some light on questions such as: "how much of modern Internet traffic is P2P?" Though we found some trends and indications, we have far too little data available to make conclusive claims beyond "there is a wide range of P2P traffic on Internet links; see your specific link of interest and classification technique you trust for more details."

Acknowledgements

This work was made possible thanks to funding from DHS-PREDICT, the National Science Foundation, Beijing Jiaotong University, and the China Scholarship Council.

  Last Modified: Wed Jul-17-2013 12:56:24 PDT
  Page URL: http://www.caida.org/research/traffic-analysis/classification-overview/index.xml