The contents of this legacy page are no longer maintained nor supported, and are made available only for historical purposes.

Internet Traffic Classification

Internet traffic classification continues to attract attention as many applications on the Internet adopt obfuscation techniques. Related papers tend to classify whatever traffic samples a researcher can find, with no systematic integration of results. To fill this gap, we have created a structured taxonomy of traffic classification papers and their data sets. We also hope to reveal issues and challenges in traffic classification.

Introduction

The Internet continually evolves in scope and complexity, much faster than our ability to characterize, understand, control, or predict it. The field of Internet traffic classification research includes many papers representing various attempts to classify whatever traffic samples a given researcher has access to, with no systematic integration of results. Here we provide a rough taxonomy of papers, and explain some issues and challenges in traffic classification.

Application Trends

Many media-rich entertainment applications have emerged on the Internet. They often use obfuscation techniques such as encrypted data transmission, random or changing ports, or proprietary communication protocols to prevent detection or filtering by network or content owners who believe the traffic threatens their (infrastructural or intellectual) property. Other applications, e.g., PPStream, uTorrent, and PPLive, replace TCP with UDP. The rapidly changing nature of applications, even across versions of the same application, presents a challenge for traffic classification techniques.

Definitions

We use the phrase traffic classification to describe methods of classifying traffic based on features passively observed in the traffic, and according to specific classification goals. One might have only a coarse classification goal, e.g., whether the traffic is transaction-oriented, bulk-transfer, or peer-to-peer file sharing. Or one might have a finer-grained classification goal, e.g., the exact application represented by the traffic. Traffic features could include the port number, the application payload, or the temporal, packet size, and addressing characteristics of the traffic. Classification methods include exact matching (e.g., of port number or payload), heuristics, and machine learning (statistics).
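
As a minimal sketch of the simplest method named above, exact matching on port number with a coarse classification goal, consider the following Python fragment. The port-to-class table and the function name `classify_by_port` are illustrative assumptions, not taken from any of the surveyed papers.

```python
# Hypothetical port-to-class table (an assumption for illustration only).
PORT_CLASSES = {
    53: "transaction-oriented",    # DNS
    80: "transaction-oriented",    # HTTP
    20: "bulk-transfer",           # FTP data
    6881: "peer-to-peer",          # port commonly associated with BitTorrent
}


def classify_by_port(src_port, dst_port):
    """Exact matching on port number: return a coarse class or 'unknown'."""
    # Check the destination port first, then fall back to the source port.
    for port in (dst_port, src_port):
        label = PORT_CLASSES.get(port)
        if label is not None:
            return label
    return "unknown"


print(classify_by_port(51000, 80))
print(classify_by_port(50000, 50001))
```

Traffic on random or changing ports falls into the "unknown" class here, which is why port-based results tend to underestimate obfuscated applications.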

Annotated Papers

We have collected and reviewed papers published between 1994 and 2009 (please email info@caida.org if you know one that should be added), starting with papers from peer-reviewed academic research conferences, and then including many papers cited by this initial seed set, as well as follow-up papers written by the same authors. We provide a flexible interactive table that supports selection of relevant attributes from these papers, including data sets and methods used, goals, and basic empirical findings. We use five paper categories: survey, analysis, methodology, tools, and other. Analysis papers typically attempt to derive trustworthy numbers on the actual traffic cross-section, while methodology papers focus on methods of classification. Click on a checkbox below to show that attribute for each paper in a separate column. Below this table is a similar table of attributes of the data sets analyzed in this set of papers.

Datasets

Several public and private passive measurement infrastructures have provided a variety of different datasets for Internet traffic classification studies, which we group into four categories:

Packet-based: packet-level traces, captured by hardware or software. Endace DAG cards are often used for packet capture on high-bandwidth links (CAIDA uses these cards for its OC-192 backbone trace capture: equinix-chicago, equinix-sanjose). Other capture hardware used over the years includes ATM FORE, OC12 or OC13 POINT ATM, Napatech, and INVEA-TECH. DAG cards can capture traffic on links of up to 10 Gbps with less than 15 ns timestamp resolution. Most software tools for capturing packets build on tcpdump/libpcap; CoralReef and Appmon are based on libpcap. Other packet sniffers and network analyzers are also available.
SNMP-based: traffic counters and statistics obtained from network devices through the SNMP and RMON MIBs.
Flow-based: flow-level descriptions of a traffic stream (Cisco NetFlow, Juniper cflowd, Foundry sFlow, Huawei NetStream).
Other: data not covered by the categories above, such as application-level session logs from web sites.
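
To illustrate the difference between the packet-based and flow-based categories above, the following Python sketch aggregates packet records into per-flow counters keyed by the usual 5-tuple. The record layout and the function name `packets_to_flows` are assumptions made for this example; real flow exporters also apply timeouts and sampling, which this sketch omits.

```python
from collections import defaultdict


def packets_to_flows(packets):
    """Aggregate packet records into unidirectional flows.

    Each packet is a tuple (src, dst, sport, dport, proto, nbytes).
    Returns a dict mapping the 5-tuple to packet and byte counters.
    """
    flows = defaultdict(lambda: {"pkts": 0, "bytes": 0})
    for src, dst, sport, dport, proto, nbytes in packets:
        key = (src, dst, sport, dport, proto)
        flows[key]["pkts"] += 1
        flows[key]["bytes"] += nbytes
    return dict(flows)


# Example packets using documentation-only addresses (RFC 5737).
packets = [
    ("192.0.2.1", "198.51.100.7", 51000, 80, "tcp", 1500),
    ("192.0.2.1", "198.51.100.7", 51000, 80, "tcp", 40),
    ("192.0.2.9", "198.51.100.7", 40000, 53, "udp", 70),
]
for key, counters in packets_to_flows(packets).items():
    print(key, counters)
```

The flow records keep counts but discard payload and per-packet timing, which limits the classification methods that can be applied to flow-based data sets.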

The data come from three types of capture environments: Intranet environment, Edge/Border environment, Backbone environment.

ID	Name	Year	Link Type	Capture Environments	Geographic Location	Payload	Size and Length	PaperID
1	POSTECH-2007	2007	Academic	Backbone	Asia	Yes(full)	3 h; 450 Gbytes
2	TunnelHunter-2007	2007	Academic	Edge/Border	Europe	Yes(part)
3	TunnelSSH-2006	2006	Academic	Edge/Border	Europe	Yes(full)	0.25 h each hour for three weeks; 50 Gbytes
4	SUNET1-2006	2006	Academic	Backbone	Europe	No	0.33*4 h each day for 20 days	[P-49][P-50][P-54]
5	SUNET2-2006	2006	Academic	Backbone	Europe	No	276 randomized times(10 mins) during 80 days
6	LosNettos-2005	2005	Academic and Commercial	Backbone	North America		24 h, 08/31/2005
7	LosNettos-2006	2006	Academic and Commercial	Backbone	North America		24 h, 10/03/2006
8	CAIDA-OC48-2003	2003		Backbone	North America	No	1 h; 95 Gbytes
9	Abilene-ABIL-2004	2004	Academic	Backbone	North America	No	1 h; 714 Gbytes
10	PAIX-PAY1-2004	2004		Backbone	North America	Yes(16 bytes)	1 h; 435 Gbytes
11	PAIX-PAY2-2004	2004		Backbone	North America	Yes(16 bytes)	1 h; 374 Gbytes
12	UNIBS-		Academic	Edge/Border	Europe	Yes(part)
13	Paris6-2004	2004	Academic	Edge/Border	Europe	Yes	1 h
14	Paris6-2006	2006	Academic	Edge/Border	Europe	Yes	1 h
15	Paris6-2004-2005	2004-2005	Academic	Edge/Border	Europe	Yes	1*3 h; 27 Gbytes; 35 Gbytes; 300 Gbytes
16	College-2003	2003	Academic		Europe	No	15 mins; 900 Mbytes
17	ADSL-2004	2004			None	No	15 mins; 2.3 Gbytes
18	WirelessCrawdad-2003	2003	Academic		Europe	No	5 h 30 mins; 330 Mbytes
19	Enter-		Commercial	Edge/Border	None	Yes	1 h 20 mins; 300 Mbytes
20	UMass-2005	2005	Academic	Edge/Border	North America	Yes(4 bytes)
21	MicroResearch1-2005	2005	Commercial	Edge/Border	North America	Yes	1 month
22	MicroResearch2-2005	2005	Commercial	Edge/Border	North America	Yes	2 weeks
23	CISCO-			Backbone(/8)		No
24	Polito-Academic-2006	2006	Academic		Europe		95 h
25	Polito-ISP-2006	2006		Backbone	Europe		24 h
26	Calgary1-2006	2006	Academic and Commercial		North America	Yes(full)	1*48 h over 6 months
27	Calgary2-2006	2006	Academic		North America	Yes(full)	1*8 h over 4 days
28	PAIX-I	2004	Commercial	Backbone	North America	Yes(16 bytes)	2 h; 91 Gbytes
29	PAIX-II	2004	Commercial	Backbone	North America	Yes(16 bytes)	2 h 2 mins; 891 Gbytes
30	WIDE-1	2006		Backbone	Oceania	Yes(40 bytes)	55 mins; 14 Gbytes
31	KEIO-I	2006	Academic	Edge/Border	Asia	Yes(40 bytes)	30 mins; 16 Gbytes
32	KEIO-II	2006	Academic	Edge/Border	Asia	Yes(40 bytes)	30 mins; 16 Gbytes
33	KAIST-I	2006	Academic	Edge/Border	Asia	Yes(40 bytes)	48 h 12 mins; 506 Gbytes
34	KAIST-II	2006	Academic	Edge/Border	Asia	Yes(40 bytes)	21 h 16 mins; 259 Gbytes
35	WIDE-2	2006		Backbone	Oceania		2 h
36	AUCK-	2001/2003		Edge/Border	Oceania		1 h /3 days	[P-20][P-42][P-48][P-58][P-59]
37	OC48-2003	2003		Backbone	North America		1 h 2 mins
38	UCSD-honeypot		Academic	Intranet	North America		5 mins
39	Calgary3-2006	2006	Academic	Edge/Border	North America	Yes(full)	1 h
40	Cambridge-2003	2003	Academic		Europe		24 h
41	Wireless-2006	2006	Academic	Intranet	North America		5 days
42	UCSDDepart-2006	2006	Academic	Backbone	North America		1 h
43	GMU-2003	2003	Academic		North America		10 mins of each quarter hour over 2 months
44	PMC-		Academic	Edge/Border	None	Yes	1 hour
45	Callrecords-2005	2005	Commercial	Edge/Border	Europe		logs
46	Leipzig-II-20030221	2003					1 h
47	Nzix-II-2000	2000					1 h
48	Genome Academic		Academic	Edge/Border	Europe	Yes(full)	24 h/43.9 h; 268 Gbytes/495 Gbytes	[P-1][P-3][P-10][P-15][P-27]
49	Tier1ISP-			Backbone	North America	No	24 h/3 h; 11-98 Gbytes
50	University-weekday	2004	Academic		None		24.6 h; 1223 Gbytes
51	University-weekend	2004	Academic		None		33.6 h; 1652 Gbytes
52	Mshmro-2002	2002	Commercial		None	Yes	7 days; 60 Gbytes
53	Accessnetwork-2004/5	2004-2005			None		1/4/8 h; 100 Gbytes
54	Abilene-2003	2003	Academic	Backbone	North America	No	20 days(1/100 pkts)
55	Geant-2004	2004		Backbone	Europe	No	23 days(1/1000 pkts)
56	Waseda-2002	2002	Academic	Edge/Border	Asia	No	weekday nights over 1 month
57	ADSL-2002/3	2003-2004		Backbone	None		weekdays and weekend days of Sep 2004 and June 2003
58	WebServer-				None		10,000 webServers as testing purposes
59	StreamingLogs	2001	Commercial		None
60	UCSD-NAP-2002	2002	Commercial		North America		31 days
61	Research-2002	2002	Academic	Edge/Border	None		39 days (roughly 15,000 hosts)
62	OC48-2001	2001		Backbone	None		8 h
63	DARPA-1998	1998					2 and 7 weeks of network-based attacks
64	Mazu-		Commercial		North America
65	BigComany-		Commercial		None
66	Tier1-multi	2001-2002		Backbone	None		1 h - 6 days
67	Saarland-2002	2002	Academic		Europe	Yes	8 days; 950 Gbytes
68	Gigascope	2003/2004
69	CAIDA-OC48-2002/4	2002-2004		Backbone	North America	Yes(4 bytes)	1 h
70	CAIDA-OC48-2003/4	2003-2004		Backbone	North America	Yes(part)	1-2 h
71	InternetAccessTrace-2003	2003			None	Yes(full)	24 h and 18 h; 120 Gbytes
72	VPN-2003	2003			None	Yes(full)	6 days; 1.8 Tbytes
73	MultiRouter-2001	2001		Backbone	None		8000 million flow level records
74	Tier1ISP-OC12-2001	2001		Backbone	North America	No	3.5 days
75	USC-2006	2006		Edge/Border	North America	No	14 hour period
76	USC-2006	2003-4		Edge/Border	North America	No	2 years
77	POPs-			Backbone	North America	No
78	AccessNetwork-			Intranet	None	No	43 hours; 6 Gbytes
79	MobileNetwork-				Europe and Asia	Yes(full)
80	JanpanISPSNMP-	2004-2008	Commercial	Backbone	Asia		aggregated SNMP data (month-long) from 6 ISPs
81	JanpanISPNetFlow-	2005,2008	Commercial	Backbone	Asia		Sampled NetFlow data from 1 ISP
82	NapleItaly		Academic	Backbone	Europe		Generated by a set of controlled boxes
83	WPIUSA		Academic	Backbone	North America
84	ADSLPoPFrance	2006, 2008	Commercial		Europe	yes(full)	1-2 h; 26-60 Gbytes
85	ISPEuropean	2008, 2009	Commercial		Europe		2*24 h, 14*90 mins; >4 Tbytes, 100-600 Gbytes
86	DSLSession	2009	Commercial		Europe		10 days, 6*24 h (DSL session)

Discussion

P2P traffic is one of the most challenging traffic types to classify, partly due to substantial legal interest in identifying it and even more substantial negative repercussions to the user if P2P traffic is accurately identified. The misaligned incentives between those who want to use and those who want to identify P2P applications, together with the tremendous legal and privacy constraints on traffic research, render scientific study of this question nearly impossible. Even where such study is possible, wide variation across links would prevent a simple numeric answer to the question of how much P2P traffic there is on the Internet.

But our taxonomy does reveal insights: the fraction of peer-to-peer file sharing traffic observed ranges from 1.2% to 93% across the 18 papers that provide such numbers. We also know that the average fractions reported increased considerably from 2002 to 2006 (Table 1). Tables 2 and 3 show that results also vary widely by link and geographic location. Table 3 suggests that P2P is more popular in Europe than in North America, probably due to stricter policies (MPAA and RIAA) in North America. Note that the Asian results are from Japanese data sets, in which the 1.34% and 1.29% figures are based on port numbers and therefore likely to significantly underestimate the fraction of P2P traffic. Furthermore, the amount of P2P traffic also varies by time of day, with higher fractions at night (Table 4).

One study [34] suggests that peer-to-peer applications are used more often at home than in the office. Finally, a European study [50] found a higher fraction of P2P traffic on a European university link than Canadian researchers [34] found on their campus.

Some numbers are based on statistical or host-behavioral classification. The remaining numbers are based on P2P detection via payload signature matching, which is the most reliable method of detecting an application (if its traffic is unencrypted) but is fraught with legal and privacy issues.
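
Payload signature matching, as mentioned above, can be sketched in a few lines of Python. The signature list and the function name `classify_by_payload` are assumptions for illustration; the three byte prefixes are well-known protocol openers (the BitTorrent handshake, the SSH version banner, and an HTTP GET request), not the signature sets used in the cited papers.

```python
# Illustrative signature list: (label, byte prefix expected at start of payload).
SIGNATURES = [
    ("bittorrent", b"\x13BitTorrent protocol"),  # handshake: length byte 19 + protocol name
    ("ssh", b"SSH-"),                            # SSH version banner
    ("http", b"GET "),                           # HTTP GET request line
]


def classify_by_payload(payload: bytes) -> str:
    """Return the label of the first signature that prefixes the payload."""
    for label, prefix in SIGNATURES:
        if payload.startswith(prefix):
            return label
    return "unknown"


print(classify_by_payload(b"\x13BitTorrent protocol" + bytes(8)))
print(classify_by_payload(b"\x8f\x02\x9a\x11"))  # opaque (e.g., encrypted) bytes
```

Encrypted or truncated payloads match no prefix and end up as "unknown", which is why this method depends on access to unencrypted payload bytes.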

Table 1. P2P Range(Year).

Year	Range	PaperID
2002	21.5%	[51]
2004	9.19-60%	[5],[6],[10],[18],[53]
2006	35.1-93%	[20],[34],[35],[50]

Table 2. P2P Range(Link Location).

Year	Link Location	Range	PaperID
2004	Campus link	31.3%	[10]
2004	ADSL link	60%	[53]
2004	Backbone link	9-14%	[5],[18]
2004	Backbone link	17-25%	[6]

Table 3. P2P Range(Geographic).

Geographic Location	Year	P2P Range	PaperID
Europe	2005	60-80%	[52]
Europe	2006	79-93%	[49],[50]
North America	2003	8%,10.7%	[5]
	2004	14%,9.9%	[5]
	2003-04	9.19-70%	[6],[18],[61]
	2006	21-35.1%	[20],[34],[35]
Asia	2002	21.53%	[51]
	2005	1.34% (port-based)	[64]
	2008	1.29% (port-based)	[64]

Table 4. P2P Range(Time).

Year	Time	Range	DataID	PaperID
2006	midnight to 10am	80%	[D-26]	[34]
2006	9am to 10am	61.5%	[D-26]	[34]
2006	evening	93%	[D-4],[D-5]	[50]
	night	91%
	office hours	86%

UDP Traffic Analysis

It is still commonly assumed that Internet traffic is dominated by TCP, and this assumption underlies most current traffic classification work. However, the rise of new streaming applications (e.g., IPTV such as PPStream and PPLive) and new P2P protocols (e.g., uTP) that try to avoid traffic shaping is expected to increase the use of UDP as a transport protocol.

In this section, we collect UDP measurements from existing work, and then compare the usage of UDP and TCP on several traffic traces collected in different network and geographic locations, as well as in different time periods.

Table 5 shows that the UDP/TCP ratio in packets and bytes ranges from 0.01 to 0.20 in existing work (the highest value comes from a residential trace). To better evaluate the amount of UDP and TCP traffic on real traces (in terms of flows, packets, and bytes), we analyze several available traces collected in the period 2002-2009 on several backbone links located in the US and Sweden. Table 6 shows that the use of UDP as a transport protocol increased rapidly from 2002 to 2009, although TCP sessions are still responsible for most packets and bytes. In terms of flows, however, UDP turns out to be the dominant transport protocol in the traces from 2006 onward.

Table 5. Values of UDP/TCP Ratio(from papers).

PaperID	Year	UDP/TCP Ratio (pkts)	UDP/TCP Ratio (bytes)	UDP/TCP Ratio (flows)	Notes
P-1	around 2003	0.01	0.01
P-34	2006	0.11	0.05
		0.02	0.04		WLAN Trace
		0.20	0.20		10-hour Residential Trace
P-49 P-54	2006			1.12
P-64	2005		0.01
P-64	2008		0.02

Table 6. Values of UDP/TCP Ratio(from real-traces).

Trace	Sample	UDP/TCP Ratio (pkts)	UDP/TCP Ratio (bytes)	UDP/TCP Ratio (flows)	Total IP Traffic (pkts/bytes/flows)
CAIDA-OC48	08-2002	0.11	0.03	0.11	(1371M/838GB/79M)
CAIDA-OC48	01-2003	0.12	0.05	0.27	(463M/267GB/26M)
GigaSUNET	04-2006	0.06	0.02	1.06	(422M/294GB/9M)
GigaSUNET	11-2006	0.08	0.03	1.45	(422M/294GB/9M)
CAIDA-OC192	06-2008	0.14	0.05	1.43	(4427M/2279GB/197M)
CAIDA-OC192	02-2009	0.19	0.07	2.34	(1922M/1410GB/110M)
OptoSUNET	01-2009	0.21	0.11	3.09	(1100M/657GB/41M)
OptoSUNET	02-2009	0.20	0.11	2.63	(1100M/657GB/41M)
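
The ratios in Table 6 can be checked mechanically. The Python sketch below copies the ratio columns from Table 6 and lists the traces in which UDP exceeds TCP (ratio above 1) for a chosen metric. The helper names are hypothetical, and the conversion from a ratio r to a share, r / (1 + r), assumes that only TCP and UDP traffic is counted.

```python
# Rows copied from Table 6: (trace, sample, pkts ratio, bytes ratio, flows ratio).
TABLE6 = [
    ("CAIDA-OC48", "08-2002", 0.11, 0.03, 0.11),
    ("CAIDA-OC48", "01-2003", 0.12, 0.05, 0.27),
    ("GigaSUNET", "04-2006", 0.06, 0.02, 1.06),
    ("GigaSUNET", "11-2006", 0.08, 0.03, 1.45),
    ("CAIDA-OC192", "06-2008", 0.14, 0.05, 1.43),
    ("CAIDA-OC192", "02-2009", 0.19, 0.07, 2.34),
    ("OptoSUNET", "01-2009", 0.21, 0.11, 3.09),
    ("OptoSUNET", "02-2009", 0.20, 0.11, 2.63),
]
PKTS, BYTES, FLOWS = 2, 3, 4  # column indices into each row


def udp_share(ratio):
    """Convert a UDP/TCP ratio r into UDP's share of UDP+TCP traffic (assumes no other protocols)."""
    return ratio / (1.0 + ratio)


def udp_dominant(rows, column):
    """Return (trace, sample) pairs whose UDP/TCP ratio in the given column exceeds 1."""
    return [(row[0], row[1]) for row in rows if row[column] > 1.0]


for trace, sample in udp_dominant(TABLE6, FLOWS):
    print(trace, sample)
```

Under these assumptions, UDP dominates by flows in every listed trace from 2006 onward, while no trace shows UDP dominating by packets or bytes.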

Conclusion

This overview page presented a rough taxonomy of traffic classification approaches, based on features, methods, goals and data sets.

Our survey also reveals shortcomings in current traffic classification efforts. The variety of data sets used does not allow systematic comparison of methods. Few research groups (can) share their datasets. As was already true ten years ago, the field of traffic classification research still needs publicly available, modern data sets as reference data for validating approaches. The poor comparability of results is further amplified by the lack of standardized measures and classification goals. For example, there exists no clear definition for traffic classes such as P2P or file-sharing.

However, the taxonomy above allows meta-analyses of relevant open questions, such as trends and development of traffic classes or features, yielding new insights into Internet traffic. We demonstrated this by shedding some light on questions such as: "how much of modern Internet traffic is P2P?" Though we found some trends and indications, we have far too little data available to make conclusive claims beyond "there is a wide range of P2P traffic on Internet links; see your specific link of interest and classification technique you trust for more details."

Acknowledgments

This work was made possible thanks to funding from DHS-PREDICT, the National Science Foundation, Beijing Jiaotong University, and the China Scholarship Council.