Promotion of Data Sharing
Internet research relies on a wide variety of data on the structure, dynamics, and usage patterns of operational Internet infrastructure, for parameterization and validation of scientific modeling and analysis efforts. As our use of and dependence on the Internet expands, an expanding range of data of interest and utility to an increasing number of disciplines must bring with it deeper consideration of privacy in collaboration and data sharing models, especially between industry and academia.
We have proposed to move the Internet research stakeholder community beyond the relatively siloed data sharing practices and into a more reputable and pervasive scientific discipline, by self-regulating through a transparent and repeatable sharing framework. Our model -- the Privacy-Sensitive Sharing (PS2) framework -- integrates privacy-enhancing technologies with a policy framework that applies proven and standard privacy principles and obligations of data seekers and data providers, in coordination with techniques that implement and provably enforce those obligations. The PS2 framework considers practical challenges confronting security professionals, network analysts, systems administrators, researchers, and legal advisors. It embodies the proposition that privacy problems are exacerbated by a shortage of transparency surrounding the who, what, when, where, how and why of sharing privacy-sensitive information. We evaluate our framework along two primary criteria: (1) how well the policies and techniques address privacy risks; and (2) how well the policies and techniques achieve utility objectives. Below we excerpt from this paper, including a review of the practical risks and benefits of data sharing, as well as the motivation, components, and evaluation of our model.
Information on CAIDA's datasets can be found in the overview of CAIDA datasets page.
The current default, defensive posture to not share network data derives from the purgatory formed by the gaps in regulation and law, commercial pressures, and evolving considerations of both threat models and ethical behavior. The threat model from not sharing data is necessarily vague, as damages resulting from knowledge management deficiencies are beset with causation and correlation challenges. More fundamentally, we lack a risk profile for our communications fabric, partly as a result of the data-sharing dearth. Notably, society has not felt the pain points that normally motivate legislative, judicial or policy change - explicit and immediate body counts or billion dollar losses. Admittedly, the policies that have given rise to the Internet's tremendous growth and support for network innovations have also rendered the entire sector opaque, unamenable to objective empirical macroscopic analysis, in ways and for reasons disconcertingly resonant with the U.S. financial sector before its 2008 meltdown. The opaqueness, juxtaposed with this decade's proliferation of Internet security, scalability, sustainability, and stewardship issues, is a cause for concern for the integrity of the infrastructure as well as the information economy it supports.
Internet research stakeholders have an opportunity to tip the risk scales in favor of more protected data sharing, by proactively implementing appropriate management of privacy risks. Transparent and morally defensible self-regulation in the interests of building social capital and informing legal and judicial regimes will allow stakeholders to more practically influence policy and law at these crossroads. Information security controls were initially considered a liability (from a cost perspective) until regulations rendered lack of security a compliance liability. We anticipate circumstances to reveal that rather than data-sharing being a risk, not sharing data is a liability. We offer the PS2 as a tool to help move the community mindset in that direction as productively and safely as possible.
The strategic challenge is similar to other domains: how to balance utility goals with privacy risks for data seekers (DS) and data providers (DP). Internet researchers and systems security personnel are generally DS - entities seeking to share, responsibly disclose, acquire or otherwise exchange lawfully possessed network data. While data providers (DP) acknowledge the potential benefits of sharing, they are sufficiently uncertain about the privacy-utility risk that they yield to a normative presumption that the risks outweigh potential rewards. Data sharing relationships that occur are market-driven or organically developed. Unsurprisingly then, there are no widespread and standard procedures for network measurement data exchange. Inconsistent, ad hoc and/or opaque exchange protocols exist, but measuring their effectiveness and benefit is challenging. A formidable consequence is the difficulty of justifying resources for research and other collaboration costs that incentivize a sharing regime. On the other hand, the high cost of independently acquiring datasets is a motivation for re-use where possible.
Privacy is difficult to quantify, as is the utility of measurement-based research. Both variables are dynamic and lack normative understanding among both domain professionals and the general citizenry. As fields of study, privacy and network science are both hindered by the absence of: common vocabulary, open and configurable reference models, uniform means of analysis, common sets of use cases, and unsurprisingly, any standard cost (liability) accounting or ROI formulas. A circular conundrum is that the risk-averse data provider needs utility demonstrated before data is released, and the researcher needs data to prove utility.
The rational predilection against sharing is strengthened by an uncertain legal regime and the social costs of sensationalism-over-accuracy-driven media accounts in cases of anonymized data being reverse engineered. While there are no procedures or regulatory framework to foster widespread exchange, there is also no framework that prohibits it. Although there is interest in efficient and widespread sharing of measurement data, it hangs against a backdrop of legal ambiguity and flawed solution models. This backdrop and our experiences with data-sharing inform our privacy-sensitive sharing framework (PS2).
2.1 An Uncertain Legal Regime
At least in the U.S. and European Union (EU) regulatory regimes, the concept of personally identifiable information (PII) is central to privacy law and data stewardship in general. Unlike the EU model, which allots overarching protection for PII, the U.S. protects PII across a patchwork of caselaw and state and federal industry-specific laws covering health data, financial data, education data, employment data, insurance records, government-issued records, credit information, and cable and telephone records. Although the definition of PII bears common threads across sectors, it is nonetheless fractured along a continuum of first and second-order identifiers (defined in Section 3.1). Crafting frameworks that generalize PII across domains, and/or support cross-domain information, necessitates attaching to the most expansive definition.
At its core, there is ambiguity over fundamental concepts upon which privacy risk assessment turns. First, privacy presumes identity, so unless identity is defined in relation to network data artifacts, the notions of privacy and PII are already disjointed in the Internet data realm. Both the legal and Internet research communities acknowledge that the concept of PII in Internet data is not clear - its definition is context-dependent, both in terms of technology and topology. Further, the ability to link network data to individuals - as well as the cost of doing so - changes over time as technologies and protocols evolve. Yet, PII is fundamental to interpreting and applying many laws. Most notable is the United States' primary law covering the privacy of network traffic: the Electronic Communications Privacy Act (ECPA), which provides statutory privacy protection against the interception and disclosure of certain electronic communications.
For example, blanket characterizations of IP addresses (IPAs) or URLs as PII (or not) are necessarily inaccurate because they alone cannot capture the range of privacy risks - either category could include an instance with PII, but most observed instances do not. In practice there is little functional differentiation between these traffic components and other, privacy-protected PII, yet the related legal treatment of IPAs and URLs is far less consistent. A more accurate risk assessment depends on knowing who collected the traffic data and how they use, disclose, and dispose of it.
The risk management challenge lies in the linguistic incongruity between the legal and technical discourse about traffic data - its definitions, semantic classifications and interpretations. Officers of the court associate IPAs with a greater privacy risk than URLs based on our past and still partial ability to link IPAs to an individual. This distinction was always artificial (albeit not totally unfounded) since both types of data reference a device or virtual location rather than an individual, and many URLs directly reveal much more user information than an IP address.
More specifically, this legal-technical gap exposes privacy risks with network operational data insofar as many laws do not explicitly allow for research use of network data, and there is no bright-line caselaw applying their respective exceptions to the context of sharing Internet data for research.
2.2 Flawed Technology Models
Most data-sharing efforts by the networking research community focus on improving computing technologies to solve the privacy problem, with anonymization commanding the bulk of the attention. A typical researcher approach is to enumerate the possibly privacy-sensitive information present in network traffic traces, and then implement a technical, typically cryptographic, solution to replace this information completely or partially with synthetic identifiers, normally implemented by encrypting or otherwise removing all or part of identifiers.
Since privacy risk is influenced by evolving contexts associated with relationships between people, data, technology and institutions, solely technical solutions are inherently insufficient to balance the privacy/utility tradeoff. Technical researchers may rightly ask why we would predicate sharing architectures on ambiguous, unquantifiable and fallible human trust enforced by law and policy, if we can build trust through technology. The response is simple: while a purely technical approach may significantly ameliorate privacy risk, it largely fails to render empirically grounded answers to most questions being asked about the Internet today.
For example, while anonymization schemes can enhance the privacy of IPAs in shared network traces, if they remove the ability to do geographic or topological analysis, the research utility of that data for studying DDoS modus operandi is dramatically reduced. A policy control framework enables the technical dials to allow for more privacy risk if a specific use justifies it. For example, traces protected by prefix-preserving anonymization may be subject to re-identification risk or content observation risk, but policy controls can help data providers minimize the chances that sensitive information is misused or wrongfully disclosed.
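To make the privacy/utility dial concrete, the following is a minimal sketch of prefix-preserving anonymization for IPv4 addresses. It is an illustration only, not the scheme used by any particular tool: the function names, the demo key, and the choice of HMAC-SHA256 as the keyed pseudo-random function are all assumptions. Each output bit is the input bit XORed with a keyed function of the preceding bits, so two addresses that share a k-bit prefix still share exactly a k-bit prefix after anonymization. That retained structure is what keeps topological analysis possible, and also what leaves some re-identification risk in place.

```python
import hashlib
import hmac
import ipaddress


def _prf_bit(key: bytes, prefix_bits: str) -> int:
    """Keyed pseudo-random bit derived from an address prefix (assumed PRF: HMAC-SHA256)."""
    digest = hmac.new(key, prefix_bits.encode("ascii"), hashlib.sha256).digest()
    return digest[0] & 1


def anonymize_ipv4(addr: str, key: bytes) -> str:
    """Flip each bit according to a keyed function of the bits that precede it."""
    bits = format(int(ipaddress.IPv4Address(addr)), "032b")
    out = []
    for i in range(32):
        flip = _prf_bit(key, bits[:i])  # depends only on the first i input bits
        out.append(str(int(bits[i]) ^ flip))
    return str(ipaddress.IPv4Address(int("".join(out), 2)))


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits shared by two IPv4 addresses."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()


# Hypothetical key; in practice the data provider would keep it secret.
key = b"hypothetical-demo-key"
a, b = "192.0.2.17", "192.0.2.200"
anon_a, anon_b = anonymize_ipv4(a, key), anonymize_ipv4(b, key)

# The shared prefix length survives anonymization.
print(common_prefix_len(a, b), common_prefix_len(anon_a, anon_b))
```

Because the mapping is deterministic for a given key, anyone who learns a few true-to-anonymized address pairs can infer the prefixes of other addresses, which is one reason the text pairs such techniques with policy controls.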
2.3 Reactive Top-Down Policy
Strategies to incentivize sharing by amending or enacting legislation merit consideration, and if the past is any indication, communications legislation will eventually be updated to reflect the evolved needs from the last few decades. However, regulation, especially in the technology arena, is largely reactive to unanticipated side-effects and dangers rather than making proactive, fundamental adjustments to predictable difficulties. Further, the length of the legislative policy cycle, the confluence of variables involved in changing law, and unpredictable change agents are not amenable to immediate solutions that interested stakeholder DS and DPs can execute. In the Internet measurement space, a legislative solution means awaiting the familiar change agents mentioned earlier: body counts or billion dollar losses that result from the lack of ground truth about the structure and function of networks that comprise our critical communications infrastructure.
3 Sharing risks and benefits
3.1 Who, What, When
Who is at risk when network data is shared?
Entities potentially at risk when network traffic is shared include: persons who are identified or identifiable in network traffic, researchers, and network providers (NP) such as ISPs, backbone providers, and private network owners. In addition to legal liabilities and ethical responsibilities, researchers and their institutions also risk withdrawal of data and/or funding as a result of privacy leakage. Society also bears costs associated with misinformation, mistrust, and internalizing behavioral norms that may result from privacy harms.
Which traffic data components
We call a first-order identifier one which functionally distinguishes an individual: first and last name, social security number, government-issued and other account identifiers, physical and email addresses, certain biometric markers, and possibly the same information about immediate family. A second-order identifier could be an IP address (IPA), media access control (MAC) address, host name, birthdate, phone number, zip code, gender, and financial, health, or geographic information. These indirect identifiers can also include aggregated or behavioral profile information such as IP header information, which in many cases can reveal which applications are used, how often, and with which machines. Indirect identifiers also include URL click streams, which can reveal information about the content of communications, including search terms.
Under what conditions do these data types pose risk?
Network traffic measurement data can present a privacy risk when information in packets and flow records can directly expose non-public information about persons - such as health, sexual orientation, political affiliation, religious affiliation, criminal activity, associations, behavioral activities, physical or virtual location; or, organizations - such as intellectual property, trade secrets or other proprietary information. Network traffic may also indirectly expose non-public, sensitive information if correlated (linked) with other public or private data, such as lists of IPAs of worm-infected and thus vulnerable hosts. Network data can also yield mistaken attributions and inferences about behavior, potentially more damaging than correct inferences.
The privacy risk across time may also vary, as the threat may be immediately manifest upon disclosure of data, or it may be a latent risk which is held in abeyance until some future condition arises. Lack of transparency between the DP and DS regarding the shared data's nature, scope, and lineage is invariably a condition that enhances risk.
3.2 Privacy Risks of Internet Research - Laws and Courts of Public Opinion
It is impractical to enumerate all laws that may affect privacy risk, but such an enumeration is not prerequisite to capturing the foreseeable risks of network data sharing. It is sufficient to note that legal liability or ethical obligations underlie each privacy risk. Dismissing ethical obligations as discretionary and unenforceable overlooks how ethical violations are treated by public opinion, and also ignores the fact that many laws are an evolution of ethical norms. In the U.S., privacy-related legal liabilities can derive from the federal Constitution (most notably the Fourth Amendment), federal law and regulation, contract law, tort law (e.g., invasion of privacy), state law equivalents, and organizations' privacy policies. Beyond the legal risk, violations of ethical obligations can create normative harms that implicate reputation and cause financial damages.
We break down the privacy risks of data sharing into two categories: disclosure and misuse.
Public disclosure is the act of making information or data readily available to the general public via publication or posting on the web. The privacy risks of sharing data containing PII which is subsequently displayed on the web are obvious and incontrovertible. More common and challenging are publicly available network traces and activity logs which reveal identifying information about infected hosts. Such disclosure raises the risk that unpatched or vulnerable hosts will be further exploited, thus creating security and reputation risks for individuals and organizations.
Accidental or malicious disclosure is the act of making information or data available to one or more third parties as a result of inadequate data protection. AOL provided a quintessential example in 2006 when it released an anonymized data set of search queries that, with sufficient public meta-data, were linked back to the users conducting the searches, who were then exposed in the New York Times.
Compelled disclosure to third parties is a risk that arises with the obligations attendant to possessing data, such as having to respond to subpoenas requesting data disclosure in lawsuits. The RIAA campaign to massively subpoena ISPs and universities in an attempt to identify copyright infringers is a notorious example. To illustrate, many entities (including research organizations) have chosen not to retain traffic statistics of operational and research interest, to avoid any such compulsion.
Government disclosure involves the release of data to government entities. An infamous example is the disclosure of call data records by major telecommunications carriers to the National Security Agency around 2007. Release to the government introduces another level of risk involving civil rights and liberties, such as imprisonment and restrictions on speech and associations.
Misuse of user or network profiles arises with network traffic that contains information about proprietary or security-sensitive network architectures or business operations. Advancing traffic and topology analysis, data mining and classification techniques can derive sensitive information from seemingly benign traffic data, and thereby reveal user behaviors, associations, preferences or interests, which attackers, advertisers, or content owners can then exploit. Network operators themselves may use such information for network management, illustrated by Comcast's recent throttling of BitTorrent traffic. The relatively invasive traffic engineering technique, combined with a lack of transparency in deploying it, led to public uproar and an unprecedented FCC regulatory ruling.
Inference misuse risk involves synthesizing first-order or second-order identifiers to draw inferences about a person's behavior or identity.
Re-identification and de-anonymizing misuse involves reversing data anonymization or masks to link an obfuscated identifier with its associated person. Shared anonymized data poses a misuse risk because it is variably vulnerable to re-identification attacks using public or private information whose (increasing) availability is beyond the knowledge or control of the original or intermediate data provider. Anonymized data may not immediately expose PII, but any time a piece of de-identified data has been linked to first-order identifying information, other anonymous aspects of the obfuscated data are easier to de-anonymize. Aggregation or statistical techniques for anonymization are not immune to re-identification risk. Examples of re-identification risk are the 2007 Netflix prize incident, and a similarly embarrassing episode of re-identification within the Internet research community.
De-anonymization risk bears special consideration in the growing incongruity around PII. DPs face increasing legal and societal pressures to protect the expanding amounts of PII they amass for legitimate business purposes. Yet, DPs are under equal pressure from the marketplace to uncover and mine PII in order to better connect supply and demand, and increase profit margins on their goods and services. DPs will turn to anonymization to avoid triggering privacy laws that exempt aggregate or anonymized data.
Like the arms race between exploits and defenses in the systems security arena, de-anonymization techniques will likely become commoditized to support investigative reporting, law enforcement, business intelligence, research, legal dispute resolution, and the presumed criminal threatscape. Several state legislatures have enacted laws to ban the release of sensitive private information because of this re-identification risk, although these are not contention-free. Re-identification concerns motivated the National Human Genome Research Institute's recent removal of open access to the pooled genomics data it posted on the Internet in 2006.
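The re-identification risk described above can be sketched as a simple linkage attack: an "anonymized" release is joined against auxiliary data on shared second-order identifiers, and any release record that matches exactly one auxiliary record is re-identified. Everything in this sketch is hypothetical, including the records, field names, and the `link` helper; it illustrates the mechanism only and does not reproduce any of the incidents cited.

```python
def link(released, auxiliary, quasi_ids):
    """Map pseudonyms to names when the quasi-identifiers match a unique auxiliary record."""
    index = {}
    for row in auxiliary:
        key = tuple(row[q] for q in quasi_ids)
        index.setdefault(key, []).append(row)

    matches = {}
    for row in released:
        key = tuple(row[q] for q in quasi_ids)
        candidates = index.get(key, [])
        if len(candidates) == 1:  # a unique match is a re-identification
            matches[row["pseudonym"]] = candidates[0]["name"]
    return matches


# Hypothetical release: first-order identifiers replaced by pseudonyms.
released = [
    {"pseudonym": "u1", "zip": "00001", "birth_year": 1970, "gender": "F"},
    {"pseudonym": "u2", "zip": "00002", "birth_year": 1985, "gender": "M"},
]

# Hypothetical auxiliary data available outside the provider's control.
auxiliary = [
    {"name": "Alice Example", "zip": "00001", "birth_year": 1970, "gender": "F"},
    {"name": "Bob Example", "zip": "00002", "birth_year": 1985, "gender": "M"},
    {"name": "Carl Example", "zip": "00002", "birth_year": 1985, "gender": "M"},
]

print(link(released, auxiliary, ["zip", "birth_year", "gender"]))
```

In this toy data only `u1` is re-identified, because `u2` matches two auxiliary records. As more auxiliary data or more quasi-identifiers become available, ambiguous matches tend to become unique, which is why the risk grows over time beyond the provider's knowledge or control.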
3.3 Utility of Internet Measurement
The benefits of network traffic measurement derive from the value of empirical network science, which includes a better understanding of the structure and functions of networks that comprise critical Internet infrastructure. Network researchers and funding agencies struggle to establish a science agenda, partly due to their lack of visibility into the infrastructure, but also because the field is younger and less well-defined than traditional scientific disciplines.
The following criteria help measure and communicate empirical network research utility:
- The objective for sharing the data produces or promotes social welfare or generalizable knowledge.
- The network research data is not already being shared, or if it is, there remains a qualitative need for sharing between other DS and DPs.
- The research could not be conducted without the shared data.
- The scientific methodology using the shared data is transparent, objective, and repeatable relative to any privacy controls that are implemented.
- Research results can be acted upon meaningfully.
- Research results can be integrated with business processes or security operations, such as situational awareness of critical infrastructure.
Research that could satisfy the above criteria includes:
- information and network security questions regarding system threats, including characterizing baseline and anomalous workloads, modeling malware, and developing effective strategies to deal with threats.
- macroscopic analysis of Internet topology; understanding how the evolution of the network is affecting the efficiency and capabilities of the underlying routing, transport, and naming protocols.
- understanding the effect of the prevalence and growth of new applications on Internet workload, topology, and infrastructure economics.
- validation of traffic, congestion control, and performance assumptions, models, and analyses, both for current and proposed new technologies.
- development and evaluation of new technology, including measurement and sampling techniques.
4 PS2 Framework: Elements, Execution, and Evaluation
We describe the Privacy-Sensitive Sharing Framework and then evaluate the model's ability to address the privacy risks outlined in 3.2 and the utility criteria in 3.3. Recognizing that privacy risk management is a collective action problem, our PS2 framework contains this risk by replicating the collection, use, disclosure and disposition controls over to the DS. This framework contemplates that the privacy risks associated with shared data are contagious - if the data is transferred, responsibility for containing the risk lies with both provider and seeker of data. In other words, there is no automatic detachment of control or ownership by the DP when the data is shared.
4.1 Elements of PS2
While not framed around specific legislation, the components of our framework are rooted in principles and practices that underlie privacy laws and policies on both the national and global levels. The Fair Information Practices (FIPs) are considered de facto international standards for information privacy and address collection, maintenance, use, disclosure, and processing of personal information. The FIPs have spawned a series of authoritative reports, guidelines, and model codes that implement these principles. The PS2 is an attempt to apply these principles to the context of Internet measurement and sharing, aiming to build a touchstone for ethically defensible sharing scenarios.
- Authorization - Internal authorization to share requires explicit consent of the DP and DS, and may require consent of individuals identifiable in network traffic, which can often be implicit via proxy consent with the DP.
- Oversight - The DP and DS should obtain some external oversight of the proposed sharing, such as an Institutional Review Board (IRB).
- Transparency - The DP and DS should agree on the objectives and obligations associated with shared data. Data-sharing terms might require that the algorithms be public but that the data and/or conclusions remain protected, or vice versa.
- Compliance with applicable law(s) - Collection and use of data should comport with a reasonable, if not case-law-supported, interpretation of laws that speak directly and clearly to proscribed behaviors or mandated obligations relating to sharing risks.
- Purpose adherence - The data should be used to try to achieve the documented goal of sharing.
- Access limitations - The shared data should be restricted from those who do not have a need and right to access the shared data.
- Use specification and limitation - Unless otherwise agreed, the DP should prohibit merging or linking identifiable data contained in the traffic data.
- Collection and Disclosure Minimization - The DS should apply privacy-sensitive techniques to stewardship of the network traffic such as:
- Deleting sensitive data.
- Deleting part(s) of the sensitive data.
- Anonymizing, hashing, or de-identifying all or parts of the sensitive data.
- Aggregating or sampling.
- Mediation analysis/human proxy - a sandbox approach that involves "sending the code to the data" rather than releasing sensitive data for analyses.
- Aging the data - such that any sensitive data the traffic contains is non-current, i.e., no longer a direct or indirect identifier.
- Size/quantity limitation - minimizing the quantity of traces shared.
- Multiple layers of anonymization.
- Audit tools - Techniques for provable compliance with policies for data use and disclosure, e.g., secure audit logging via a tamper-resistant, cryptographically protected device connected to but separate from the protected data, and accounting policies to enforce access rules on protected data.
- Redress mechanisms - Procedures to address harms from inappropriate data use or disclosure, including a feedback mechanism to support correction of datasets and/or erroneous conclusions.
- Quality data and analyses assurances - Awareness by the DS and DP of inference confidence levels associated with the data.
- Security - Controls should reasonably ensure that sensitive PII is protected from unauthorized collection, use, disclosure, and destruction.
- Training - some level of education and awareness of the privacy controls and principles by those who are authorized to engage the data.
- Impact assessment - Research design should consider potential collateral effects on affected parties, and seek methods that do no further harm.
- Transfer to third parties - prohibited unless the same data control obligations are transferred, relative to the disclosure risks associated with that data.
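Several of the minimization techniques listed above (deleting fields, deleting part of a field, hashing, and aggregating) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the flow-record layout, helper names, and secret key are hypothetical, and HMAC-SHA256 stands in for whatever keyed pseudonymization a data provider might actually choose.

```python
import hashlib
import hmac
import ipaddress
from collections import Counter


def delete_fields(record, sensitive):
    """Deleting sensitive data: drop whole fields from a record."""
    return {k: v for k, v in record.items() if k not in sensitive}


def truncate_ipv4(addr, keep_bits=24):
    """Deleting part of the sensitive data: zero the host bits of an address."""
    network = ipaddress.ip_network(f"{addr}/{keep_bits}", strict=False)
    return str(network.network_address)


def pseudonymize(value, key):
    """Hashing: replace an identifier with a keyed, truncated digest."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def aggregate(records, field):
    """Aggregating: share per-value counts instead of individual records."""
    return Counter(r[field] for r in records)


# Hypothetical flow records and key, for illustration only.
flows = [
    {"src": "192.0.2.77", "dst": "198.51.100.9", "dport": 443, "payload": "..."},
    {"src": "192.0.2.80", "dst": "198.51.100.9", "dport": 443, "payload": "..."},
    {"src": "203.0.113.5", "dst": "198.51.100.20", "dport": 53, "payload": "..."},
]
key = b"hypothetical-secret-held-by-the-provider"

shared = []
for flow in flows:
    record = delete_fields(flow, {"payload"})
    record["src"] = truncate_ipv4(record["src"])
    record["dst"] = pseudonymize(record["dst"], key)
    shared.append(record)

port_counts = aggregate(shared, "dport")
print(shared[0]["src"], dict(port_counts))
```

Each step trades utility for privacy differently: truncation keeps coarse topology but loses host-level detail, keyed hashing keeps the ability to correlate flows to the same destination but loses topology, and aggregation keeps workload statistics but loses individual flows.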
4.2 Execution of PS2
To navigate the legal and ethical ambiguity around disclosure and use of network measurement data discussed in Section 2.1, we propose Memoranda of Understanding (MOUs), Memoranda of Agreement (MOAs), model contracts, and binding organizational policy as enforceable vehicles for addressing privacy risk both proactively and reactively. For less privacy-sensitive data, a unidirectional Acceptable Use Policy (AUP) may be cost-preferential to negotiating bilateral agreements. Explicit consent about controls for shared data provides an enforceable standard and certainty that can serve as a safe harbor for liability under data privacy laws.
4.3 Evaluation of PS2
The PS2 framework facilitates a rigorous examination of whether the proposed research balances privacy risks and utility rewards. For an oversight committee, it helps determine whether possible risks are justified, by specifically asking the user to assess sharing risks against technical and policy controls, as well as to assess the achievement of utility goals against those controls. For the prospective DP, the assessment will assist the determination whether or not to participate.
Table 1 assesses whether the privacy risks are mitigated by the primary components of PS2. The X's indicate that the particular PS2 policy component in the row fails to mitigate against the privacy risk enumerated in the corresponding column. The table starkly shows that purely policy components of PS2 still leave wide gaps in addressing the full range of privacy risks. Further, it suggests that technical minimization techniques (Section 4.1) can address all privacy risks, implying the sufficiency of a purely technical sharing framework in lieu of a policy control backdrop. However, evaluating minimization techniques against the utility goals in Table 2 shows the weakness of this one-dimensional technical approach. This weakness is unsurprising, since data minimization techniques intentionally obfuscate information often essential to most Internet research. These utility gaps can be modulated ("dialed down") with the policy components of PS2. In short, a purely technical approach breaks down along the utility dimension, and a purely policy approach may leave too much privacy risk exposure, justifying a hybrid framework that covers both privacy risks and utility goals. We note that evaluation of a framework must also consider practical issues such as education costs, whether new privacy risk(s) are introduced, whether control(s) are forward-looking or also address legacy privacy risks, and free rider problems created by DPs who choose not to share.
Table 1: PS2 components (rows) versus privacy risks (columns).
|PS2 / Privacy Risk|Public Disclosure|Compelled Disclosure|Malicious Disclosure|Government Disclosure|Misuse|Inference Risk|Re-ID Risk|
Table 2: Minimization techniques versus utility goals and practical considerations (columns).
|Minimiz. Tech.|Is purpose worthwhile?|Is there a need?|Is it already being done?|Are there alternatives?|Is there a scientific basis?|Can results be acted upon?|Can DS & DP implement?|Reasonable education costs?|Forward & backward controls?|No new privacy risks created?|No free rider problem created?|
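The way the evaluation reads Table 1 can be sketched as a coverage check: given which risks each control mitigates, list the risks that a chosen set of controls leaves open. The risk names come from the Table 1 columns, but the marks assigned to the two policy components below are hypothetical placeholders, not the values of Table 1; only the minimization row follows the text's statement that minimization techniques can address all privacy risks.

```python
RISKS = [
    "public disclosure", "compelled disclosure", "malicious disclosure",
    "government disclosure", "misuse", "inference", "re-identification",
]


def uncovered(controls, selected):
    """Return the risks that none of the selected controls mitigates."""
    covered = set()
    for name in selected:
        covered |= controls[name]
    return [risk for risk in RISKS if risk not in covered]


# HYPOTHETICAL marks for illustration; they are not taken from Table 1.
controls = {
    "access limitations": {"public disclosure", "malicious disclosure"},
    "purpose adherence": {"misuse", "inference"},
    # Per the text, minimization techniques can address every privacy risk.
    "minimization techniques": set(RISKS),
}

policy_only = ["access limitations", "purpose adherence"]
print(uncovered(controls, policy_only))
print(uncovered(controls, policy_only + ["minimization techniques"]))
```

The same shape of check, applied to the utility questions of Table 2, would expose the opposite gap for a purely technical approach, which is the argument for a hybrid framework.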
- Allman, M., and Paxson, V. Issues and etiquette concerning use of shared measurement data. In IMC (2007).
- Barbaro, M., and Zeller, T., Jr. A Face is Exposed for AOL Searcher No. 4417749. New York Times (Aug 2006).
- Burkhart, M., Schatzmann, D., Trammell, B., Boschi, E., and Plattner, B. The role of network trace anonymization under attack. ACM SIGCOMM Comp. Comm. Rev. (2009).
- Burstein, A. Amending the ECPA to Enable a Culture of Cybersecurity Research. Harvard Journal of Law & Technology 22, 1 (2008), 167-222.
- Duke, C. B., et al., Eds. Network Science. The National Academies Press, Washington, 2006.
- Cauley, L. NSA has massive database of Americans' phone calls. USA Today (May 2006).
- Electronic Privacy Information Center. The U.S. First Circuit Court of Appeals upheld a New Hampshire law that bans the sale of prescriber-identifiable prescription drug data for marketing purposes. http://epic.org/privacy/imshealth/11_18_08_order.pdf.
- Center for Democracy and Technology. CDT's Guide to Online Privacy, 2009.
- Clabby, C. DNA Research Commons Scales Back. American Scientist 97, 3 (May 2009).
- Claffy, K. Ten Things Lawyers should know about Internet research, August 2008. https://catalog.caida.org/details/paper/2008_lawyers_top_ten.
- Crovella, M., and Krishnamurthy, B. Internet Measurement: Infrastructure, Traffic and Applications. John Wiley and Sons, Inc., 2006.
- Narayanan, A., and Shmatikov, V. Robust De-anonymization of Large Sparse Datasets. IEEE Symposium on Security and Privacy (2008).
- OECD. Guidelines on the protection of privacy and transborder flows of personal data, 1980.
- Porter, C. De-Identified Data and Third Party Data Mining: The Risk of Re-Identification of Personal Information. Shidler Journal of Law, Commerce, and Technology, 3 (2008).
- Swire, P. A Theory of Disclosure for Security and Competitive Reasons: Open Source, Proprietary Software, and Government Agencies. Houston Law Review 42, 5 (January 2006).