The contents of this legacy page are no longer maintained nor supported, and are made available only for historical purposes.

Internet Traffic Classification

Internet traffic classification continues to attract attention as many applications emerging on the Internet use obfuscation techniques. Related papers tend to classify whatever traffic samples a researcher can find, with no systematic integration of results. To fill this gap, we have created a structured taxonomy of traffic classification papers and their data sets. We also hope to reveal issues and challenges in traffic classification.

Introduction

The Internet continually evolves in scope and complexity, much faster than our ability to characterize, understand, control, or predict it. The field of Internet traffic classification research includes many papers representing various attempts to classify whatever traffic samples a given researcher has access to, with no systematic integration of results. Here we provide a rough taxonomy of papers, and explain some issues and challenges in traffic classification.

Application Trends

Many media-rich entertainment applications have emerged on the Internet. They often use obfuscation techniques such as encrypted data transmission, random or changing ports, or proprietary communication protocols to prevent detection or filtering by network or content owners who believe the traffic threatens their (infrastructural or intellectual) property. Other applications, e.g., PPStream, uTorrent, and PPLive, use UDP in place of TCP. The rapidly changing nature of applications, even across different versions of the same application, presents a challenge for traffic classification techniques.

Definitions

We use the phrase traffic classification to describe methods of classifying traffic based on features passively observed in the traffic, and according to specific classification goals. One might have only a coarse classification goal, e.g., whether the traffic is transaction-oriented, bulk-transfer, or peer-to-peer file sharing. Or one might have a finer-grained classification goal, e.g., the exact application represented by the traffic. Traffic features could include the port number, application payload, or temporal, packet size, and addressing characteristics of the traffic. Classification methods include exact matching (e.g., of port number or payload), heuristics, and machine learning (statistics).
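
As an illustration of exact matching on port numbers, the following minimal Python sketch maps a flow's ports to a fine-grained application label and then to a coarse class. The port table and the mapping from applications to coarse classes are assumptions chosen for the example (port 6881 is only a conventional BitTorrent default), not a table taken from any of the surveyed papers.

```python
from typing import Tuple

# Hypothetical lookup tables, for illustration only.
FINE_BY_PORT = {
    20: "ftp-data",
    53: "dns",
    80: "http",
    443: "https",
    6881: "bittorrent",  # conventional default port, easily changed by users
}

COARSE_BY_APP = {
    "ftp-data": "bulk-transfer",
    "dns": "transaction-oriented",
    "http": "transaction-oriented",
    "https": "transaction-oriented",
    "bittorrent": "peer-to-peer file sharing",
}


def classify_by_port(src_port: int, dst_port: int) -> Tuple[str, str]:
    """Return (fine-grained label, coarse label) using exact port matching.

    The destination port is checked first, then the source port.
    Flows on unlisted ports are reported as unknown.
    """
    for port in (dst_port, src_port):
        app = FINE_BY_PORT.get(port)
        if app is not None:
            return app, COARSE_BY_APP[app]
    return "unknown", "unknown"


print(classify_by_port(51234, 80))
print(classify_by_port(40000, 40001))
```

An application that picks random ports falls into the unknown class here, which is why port-based results tend to underestimate obfuscated traffic.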

Annotated Papers

We have collected and reviewed papers published between 1994 and 2009 (please email info@caida.org if you know one that should be added), starting with papers from peer-reviewed academic research conferences, and then including many papers cited from this initial seeding set of papers, as well as follow-up papers written by the same authors. We provide a flexible interactive table that supports selection of relevant attributes from these papers, including data sets and methods used, goals, and basic empirical findings. We use five paper categories: survey, analysis, methodology, tools, and other. Analysis papers typically attempt to derive trustworthy numbers on actual traffic cross-section, while methodology papers focus on methods of classification. Click on a checkbox below to show that attribute for each paper in a separate column. Below this table is a similar table of attributes of the data sets analyzed in this set of papers.



Datasets

Several public and private passive measurement infrastructures have provided a variety of different datasets for Internet traffic classification studies, which we group into four categories:

  • Packet-based: packet-level traces, captured by hardware or software. Endace DAG cards are often used for packet capture on high-bandwidth links (CAIDA uses these cards for its OC-192 backbone trace capture: equinix-chicago, equinix-sanjose). Other capture hardware used over the years includes ATM FORE, OC12 or OC13 POINT ATM, Napatech, and INVEA-TECH. DAG cards can capture traffic on links of up to 10 Gbps with less than 15 ns timestamp resolution. Most software tools for capturing packets rely on kernel packet-capture facilities via tcpdump/libpcap; CoralReef and Appmon, for example, are based on libpcap. Other packet sniffers and network analyzers are also available.
  • SNMP-based: traffic counters and statistics obtained from network devices through the SNMP and RMON MIBs;
  • Flow-based: flow-level descriptions of a traffic stream (Cisco NetFlow, Juniper CFlowd, Foundry sFlow, Huawei NetStream);
  • Other: data not covered above, such as application-level session logs from web sites.
The data come from three types of capture environments: intranet, edge/border, and backbone.
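
To make the difference between packet-based and flow-based data concrete, the following Python sketch aggregates packet records into unidirectional flow records keyed by the usual 5-tuple. The `Packet` type and its field names are hypothetical, and the sketch deliberately omits the timeout-based flow expiry that real flow exporters apply.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical packet record, as might be parsed from a packet trace."""
    ts: float      # capture timestamp in seconds
    src: str       # source IP address
    dst: str       # destination IP address
    sport: int     # source port
    dport: int     # destination port
    proto: str     # transport protocol, e.g. "tcp" or "udp"
    length: int    # packet length in bytes


FlowKey = Tuple[str, str, int, int, str]


def aggregate_flows(packets: Iterable[Packet]) -> Dict[FlowKey, dict]:
    """Group packets into unidirectional flows keyed by the 5-tuple."""
    flows: Dict[FlowKey, dict] = {}
    for p in packets:
        key = (p.src, p.dst, p.sport, p.dport, p.proto)
        rec = flows.get(key)
        if rec is None:
            flows[key] = {"packets": 1, "bytes": p.length,
                          "first": p.ts, "last": p.ts}
        else:
            rec["packets"] += 1
            rec["bytes"] += p.length
            rec["first"] = min(rec["first"], p.ts)
            rec["last"] = max(rec["last"], p.ts)
    return flows


trace = [
    Packet(0.00, "10.0.0.1", "10.0.0.2", 51234, 80, "tcp", 40),
    Packet(0.05, "10.0.0.1", "10.0.0.2", 51234, 80, "tcp", 1500),
    Packet(0.07, "10.0.0.3", "10.0.0.4", 53000, 53, "udp", 72),
]
flows = aggregate_flows(trace)
for key, rec in flows.items():
    print(key, rec)
```

A flow record keeps counts and timing but discards payload, which is why flow-based data sets cannot support payload signature matching.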

ID Name Year Link Type Capture Environments Geographic Location Payload Size and Length PaperID
1 POSTECH-2007 2007 Academic Backbone Asia Yes(full) 3 h; 450 Gbytes [P-39]
2 TunnelHunter-2007 2007 Academic Edge/Border Europe Yes(part) [P-43]
3 TunnelSSH-2006 2006 Academic Edge/Border Europe Yes(full) 0.25 h each hour for three weeks; 50 Gbytes [P-44]
4 SUNET1-2006 2006 Academic Backbone Europe No 0.33*4 h each day for 20 days [P-49][P-50][P-54]
5 SUNET2-2006 2006 Academic Backbone Europe No 276 randomized times (10 mins) during 80 days [P-50]
6 LosNettos-2005 2005 Academic and Commercial Backbone North America 24 h, 08/31/2005 [P-17]
7 LosNettos-2006 2006 Academic and Commercial Backbone North America 24 h, 10/03/2006 [P-17]
8 CAIDA-OC48-2003 2003 Backbone North America No 1 h; 95 Gbytes [P-18]
9 Abilene-ABIL-2004 2004 Academic Backbone North America No 1 h; 714 Gbytes [P-18]
10 PAIX-PAY1-2004 2004 Backbone North America Yes(16 bytes) 1 h; 435 Gbytes [P-18]
11 PAIX-PAY2-2004 2004 Backbone North America Yes(16 bytes) 1 h; 374 Gbytes [P-18]
12 UNIBS- Academic Edge/Border Europe Yes(part) [P-19]
13 Paris6-2004 2004 Academic Edge/Border Europe Yes 1 h [P-26]
14 Paris6-2006 2006 Academic Edge/Border Europe Yes 1 h [P-26]
15 Paris6-2004-2005 2004-2005 Academic Edge/Border Europe Yes 1*3 h; 27 Gbytes, 35 Gbytes, 300 Gbytes [P-25]
16 College-2003 2003 Academic Europe No 15 mins; 900 Mbytes [P-25]
17 ADSL-2004 2004 None No 15 mins; 2.3 Gbytes [P-25]
18 WirelessCrawdad-2003 2003 Academic Europe No 5 h 30 mins; 330 Mbytes [P-25]
19 Enter- Commercial Edge/Border None Yes 1 h 20 mins; 300 Mbytes [P-25]
20 UMass-2005 2005 Academic Edge/Border North America Yes(4 bytes) [P-25][P-26]
21 MicroResearch1-2005 2005 Commercial Edge/Border North America Yes 1 month [P-29]
22 MicroResearch2-2005 2005 Commercial Edge/Border North America Yes 2 weeks [P-29]
23 CISCO- Backbone(/8) No [P-32]
24 Polito-Academic-2006 2006 Academic Europe 95 h [P-30]
25 Polito-ISP-2006 2006 Backbone Europe 24 h [P-30]
26 Calgary1-2006 2006 Academic and Commercial North America Yes(full) 1*48 h over 6 months [P-34]
27 Calgary2-2006 2006 Academic North America Yes(full) 1*8 h over 4 days [P-35]
28 PAIX-I 2004 Commercial Backbone North America Yes(16 bytes) 2 h; 91 Gbytes [P-37]
29 PAIX-II 2004 Commercial Backbone North America Yes(16 bytes) 2 h 2 mins; 891 Gbytes [P-37]
30 WIDE-1 2006 Backbone Oceania Yes(40 bytes) 55 mins; 14 Gbytes [P-37]
31 KEIO-I 2006 Academic Edge/Border Asia Yes(40 bytes) 30 mins; 16 Gbytes [P-37]
32 KEIO-II 2006 Academic Edge/Border Asia Yes(40 bytes) 30 mins; 16 Gbytes [P-37]
33 KAIST-I 2006 Academic Edge/Border Asia Yes(40 bytes) 48 h 12 mins; 506 Gbytes [P-37]
34 KAIST-II 2006 Academic Edge/Border Asia Yes(40 bytes) 21 h 16 mins; 259 Gbytes [P-37]
35 WIDE-2 2006 Backbone Oceania 2 h [P-48]
36 AUCK- 2001/2003 Edge/Border Oceania 1 h /3 days [P-20][P-42][P-48][P-58][P-59]
37 OC48-2003 2003 Backbone North America 1 h 2 mins [P-48]
38 UCSD-honeypot Academic Intranet North America 5 mins [P-48]
39 Calgary3-2006 2006 Academic Edge/Border North America Yes(full) 1 h [P-20]
40 Cambridge-2003 2003 Academic Europe 24 h [P-21]
41 Wireless-2006 2006 Academic Intranet North America 5 days [P-21]
42 UCSDDepart-2006 2006 Academic Backbone North America 1 h [P-21]
43 GMU-2003 2003 Academic North America 10 mins of each quarter hour over 2 months [P-22]
44 PMC- Academic Edge/Border None Yes 1 hour [P-23]
45 Callrecords-2005 2005 Commercial Edge/Border Europe logs [P-52]
46 Leipzig-II-20030221 2003 1 h [P-58]
47 Nzix-II-2000 2000 1 h [P-58]
48 Genome Academic Academic Edge/Border Europe Yes(full) 24 h/43.9 h; 268 Gbytes/495 Gbytes [P-1][P-3][P-10][P-15][P-27]
49 Tier1ISP- Backbone North America No 24 h/3 h; 11-98 Gbytes [P-8]
50 University-weekday 2004 Academic None 24.6 h; 1223 Gbytes [P-10]
51 University-weekend 2004 Academic None 33.6 h; 1652 Gbytes [P-10]
52 Mshmro-2002 2002 Commercial None Yes 7 days; 60 Gbytes [P-31]
53 Accessnetwork-2004/5 2004-2005 None 1/4/8 h; 100 Gbytes [P-36]
54 Abilene-2003 2003 Academic Backbone North America No 20 days(1/100 pkts) [P-47]
55 Geant-2004 2004 Backbone Europe No 23 days(1/1000 pkts) [P-47]
56 Waseda-2002 2002 Academic Edge/Border Asia No weekday nights over 1 month [P-51]
57 ADSL-2002/3 2003-2004 Backbone None weekdays and weekend days of Sep 2004 and June 2003 [P-53]
58 WebServer- None 10,000 web servers for testing purposes [P-16]
59 StreamingLogs 2001 Commercial None [P-4][P-56]
60 UCSD-NAP-2002 2002 Commercial North America 31 days [P-46]
61 Research-2002 2002 Academic Edge/Border None 39 days (roughly 15,000 hosts) [P-46]
62 OC48-2001 2001 Backbone None 8 h [P-46]
63 DARPA-1998 1998 2 and 7 weeks of network-based attacks [P-45]
64 Mazu- Commercial North America [P-41]
65 BigComany- Commercial None [P-41]
66 Tier1-multi 2001-2002 Backbone None 1 h - 6 days [P-13]
67 Saarland-2002 2002 Academic Europe Yes 8 days; 950 Gbytes [P-7]
68 Gigascope 2003/2004 [P-4]
69 CAIDA-OC48-2002/4 2002-2004 Backbone North America Yes(4 bytes) 1 h [P-5]
70 CAIDA-OC48-2003/4 2003-2004 Backbone North America Yes(part) 1-2 h [P-6]
71 InternetAccessTrace-2003 2003 None Yes(full) 24 h and 18 h; 120 Gbytes [P-9]
72 VPN-2003 2003 None Yes(full) 6 days; 1.8 Tbytes [P-9]
73 MultiRouter-2001 2001 Backbone None 8000 million flow-level records [P-40]
74 Tier1ISP-OC12-2001 2001 Backbone North America No 3.5 days [P-55]
75 USC-2006 2006 Edge/Border North America No 14 hour period [P-60]
76 USC-2006 2003-4 Edge/Border North America No 2 years [P-61]
77 POPs- Backbone North America No [P-12]
78 AccessNetwork- Intranet None No 43 hours; 6 Gbytes [P-62]
79 MobileNetwork- Europe and Asia Yes(full) [P-63]
80 JanpanISPSNMP- 2004-2008 Commercial Backbone Asia aggregated SNMP data (month-long) from 6 ISPs; [P-64]
81 JanpanISPNetFlow- 2005,2008 Commercial Backbone Asia Sampled NetFlow data from 1 ISP; [P-64]
82 NapleItaly Academic Backbone Europe Generated by a set of controlled boxes; [P-65]
83 WPIUSA Academic Backbone North America [P-65]
84 ADSLPoPFrance 2006, 2008 Commercial Europe Yes(full) 1-2 h; 26-60 Gbytes [P-67]
85 ISPEuropean 2008, 2009 Commercial Europe 2*24 h, 14*90 mins; >4 Tbytes, 100-600 Gbytes [P-68]
86 DSLSession 2009 Commercial Europe 10 days, 6*24 h (DSL session) [P-68]

Discussion

P2P traffic is one of the most challenging traffic types to classify, partly because of substantial legal interest in identifying it and even more substantial negative repercussions for the user if P2P traffic is accurately identified. The misaligned incentives between those who want to use and those who want to identify P2P applications, together with the tremendous legal and privacy constraints on traffic research, render scientific study of this question nearly impossible. Even if such study were possible, wide variation across links would prevent a simple numeric answer to the question of how much P2P traffic there is on the Internet. Our taxonomy does, however, reveal some insights: the fraction of peer-to-peer file sharing traffic observed ranges from 1.2% to 93% across the 18 papers that provide such numbers. We also know that the average fractions reported increased considerably from 2002 to 2006 (Table 1). Tables 2 and 3 show that results also vary widely by link and geographic location. Table 3 suggests that P2P is more popular in Europe, probably due to stricter policies (MPAA and RIAA) in North America. Note that the Asian results are from Japanese data sets, in which the 1.34% and 1.29% figures are based on port numbers and are therefore likely to significantly underestimate the fraction of P2P traffic. Furthermore, the amount of P2P traffic also varies by time of day, with higher fractions at night (Table 4).

One study [34] suggests that peer-to-peer applications are used more often at home than in the office. Finally, a European study [50] found a higher fraction of P2P traffic on a university link than some Canadian academics [34] found on their campus.

Some numbers are based on statistical or host-behavioral classification. The remaining numbers are based on P2P detection via payload signature matching, which is the most reliable method of detecting an application (if the traffic is unencrypted) but is fraught with legal and privacy issues.
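
Payload signature matching can be sketched in a few lines of Python. The signature list below is a small illustrative assumption, not the rule set of any surveyed paper; it relies on well-known protocol prefixes such as the BitTorrent handshake (a length byte of 19 followed by the string "BitTorrent protocol"), the SSH version banner, and HTTP request methods.

```python
# Illustrative prefix signatures; real classifiers use far larger rule sets.
SIGNATURES = [
    ("bittorrent", b"\x13BitTorrent protocol"),
    ("ssh", b"SSH-"),
    ("http", b"GET "),
    ("http", b"POST "),
    ("http", b"HTTP/1."),
]


def classify_by_payload(payload: bytes) -> str:
    """Return the first application whose signature prefixes the payload."""
    for app, prefix in SIGNATURES:
        if payload.startswith(prefix):
            return app
    return "unknown"


handshake = b"\x13BitTorrent protocol" + bytes(8)
print(classify_by_payload(handshake))            # full payload available
print(classify_by_payload(handshake[:4]))        # payload truncated to 4 bytes
print(classify_by_payload(b"GET / HTTP/1.1\r\n"))
```

The truncated case shows why data sets that keep only the first 4 or 16 payload bytes limit which signatures can be applied, and encrypted payloads defeat this kind of matching altogether.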

Table 1. P2P Range (Year).
Year Range PaperID
2002 21.5% [51]
2004 9.19-60% [5],[6],[10],[18],[53]
2006 35.1-93% [20],[34],[35],[50]

Table 2. P2P Range (Link Location).
Year Link Location Range PaperID
2004 Campus link 31.3% [10]
2004 ADSL link 60% [53]
2004 Backbone link 9-14% [5],[18]
2004 Backbone link 17-25% [6]

Table 3. P2P Range (Geographic).
Geographic Location Year P2P Range PaperID
Europe 2005 60-80% [52]
Europe 2006 79-93% [49],[50]
North America 2003 8%,10.7% [5]
North America 2004 14%,9.9% [5]
North America 2003-04 9.19-70% [6],[18],[61]
North America 2006 21-35.1% [20],[34],[35]
Asia 2002 21.53% [51]
Asia 2005 1.34% (port-based) [64]
Asia 2008 1.29% (port-based) [64]

Table 4. P2P Range (Time).
Year Time Range DataID PaperID
2006 midnight to 10am 80% [D-26] [34]
2006 9am to 10am 61.5% [D-26] [34]
2006 evening 93% [D-4],[D-5] [50]
2006 night 91% [D-4],[D-5] [50]
2006 office hours 86% [D-4],[D-5] [50]

UDP Traffic Analysis

It is still widely assumed that Internet traffic is dominated by TCP, an assumption that also underlies most current traffic classification work. However, the rise of new streaming applications (e.g., IPTV services such as PPStream and PPLive) and new P2P protocols (e.g., uTP) that try to avoid traffic shaping is expected to increase the use of UDP as a transport protocol.

In this section, we collect UDP analyses from existing work, and then compare the usage of UDP and TCP in several traffic traces collected in different network and geographic locations, as well as in different time periods.

Table 5 shows that the UDP/TCP ratio reported in existing work ranges from 0.01 to 0.20 (with a high value in a residential trace). To better evaluate the amount of UDP and TCP traffic in real traces (in terms of flows, packets, and bytes), we analyzed several available traces collected between 2002 and 2009 on several backbone links located in the US and Sweden. Table 6 shows that the use of UDP as a transport protocol increased rapidly from 2002 to 2009, although TCP sessions are still responsible for most packets and bytes. In terms of flows, however, UDP turns out to be the dominant transport protocol.
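
The UDP/TCP ratio is simply the UDP count divided by the TCP count for each metric. The Python sketch below computes it for packets, bytes, and flows; the counter values are made up for illustration and do not come from any of the traces discussed here.

```python
from typing import Dict

METRICS = ("pkts", "bytes", "flows")


def udp_tcp_ratio(udp: Dict[str, int], tcp: Dict[str, int]) -> Dict[str, float]:
    """Return the UDP/TCP ratio for each metric (packets, bytes, flows)."""
    return {m: udp[m] / tcp[m] for m in METRICS}


# Hypothetical per-protocol totals for one trace.
udp_counts = {"pkts": 100, "bytes": 50, "flows": 30}
tcp_counts = {"pkts": 1000, "bytes": 1000, "flows": 20}

ratios = udp_tcp_ratio(udp_counts, tcp_counts)
for metric in METRICS:
    print(f"{metric}: {ratios[metric]:.2f}")
```

With these made-up counts the ratio is well below 1 for packets and bytes but above 1 for flows, the same qualitative pattern described above: many short UDP flows that carry comparatively little data.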

Table 5. Values of UDP/TCP Ratio (from papers).
PaperID Year UDP/TCP Ratio (pkts, bytes, flows) Notes
around 2003 0.01 0.01
2006 0.11 0.05
0.02 0.04 WLAN Trace
0.20 0.20 10-hour Residential Trace
2006 1.12
2005 0.01
2008 0.02

Table 6. Values of UDP/TCP Ratio (from real traces).
Trace Sample UDP/TCP Ratio (pkts, bytes, flows) Total IP Traffic (pkts/bytes/flows)
08-2002 0.11 0.03 0.11 (1371M/838GB/79M)
01-2003 0.12 0.05 0.27 (463M/267GB/26M)
GigaSUNET 04-2006 0.06 0.02 1.06 (422M/294GB/9M)
11-2006 0.08 0.03 1.45
06-2008 0.14 0.05 1.43 (4427M/2279GB/197M)
02-2009 0.19 0.07 2.34 (1922M/1410GB/110M)
OptoSUNET 01-2009 0.21 0.11 3.09 (1100M/657GB/41M)
02-2009 0.20 0.11 2.63

Conclusion

This overview page presented a rough taxonomy of traffic classification approaches, based on features, methods, goals and data sets.

Our review also reveals shortcomings in current traffic classification efforts. The variety of data sets used does not allow systematic comparison of methods. Few research groups share (or can share) their datasets. As was already true ten years ago, the field of traffic classification research still needs publicly available, modern data sets as reference data for validating approaches. The poor comparability of results is further amplified by the lack of standardized measures and classification goals. For example, there is no clear definition of traffic classes such as P2P or file sharing.

However, the taxonomy above allows meta-analyses of relevant open questions, such as trends in and development of traffic classes or features, yielding new insights into Internet traffic. We demonstrated this by offering some insight into questions such as: "how much of modern Internet traffic is P2P?" Though we found some trends and indications, we have far too little data available to make conclusive claims beyond "there is a wide range of P2P traffic on Internet links; see your specific link of interest and classification technique you trust for more details."

Acknowledgements

This work was made possible thanks to funding from DHS-PREDICT, the National Science Foundation, Beijing Jiaotong University, and the China Scholarship Council.
