IDSU: Applying NAIRR Pilot Resources to Improve Scientific Data Utility

An abbreviated version of the original EArly-concept Grants for Exploratory Research (EAGER) proposal is shown below. For the full text of "Applying NAIRR Pilot Resources to Improve Scientific Data Utility", please see the proposal PDF.

Sponsored by:
National Science Foundation (NSF)

Principal Investigators: Kimberly Claffy, Bradley Huffaker

Funding source: OAC-2526448
Period of performance: April 3, 2025 – December 31, 2025


1 Objective: Assessing Scientific Data Utility

We propose a NAIRR Pilot EAGER project to demonstrate how NAIRR Pilot resources at SDSC can facilitate a new service to the cybersecurity research community: inferring utility of datasets and software tools based on their documented use in scientific publications. This solution will require context-aware extraction and interpretation of complex relationships from unstructured text. The service will address persistent community challenges, and will serve as a model for other scientific disciplines that face similar data integrity and data utility assessment challenges.

1.1 Motivation

Sustaining data collection and curation is expensive, and AI-ready data sets will be even more expensive to sustain and curate. Funding agencies need mechanisms to assess which data sets significantly contribute to scientific discovery and innovation. For many years CAIDA has tracked the use of our scientific data in publications, in part to support our rich context catalog that links networking and Internet security research papers to data sets and software tools used in these papers. This catalog has accelerated scientific discovery and development of cybersecurity data science and research skills [1]. However, even state-of-the-art techniques for automatically extracting this information from publications cannot handle the complexity, variability, and contextual nuances of scientific language, e.g., in discerning a passing reference to a resource from its actual use in a paper. The problem is exacerbated by the proliferation of research publications, some of which may now be authored by LLMs unbeknownst to their readers. Addressing these limitations requires advanced, context-aware extraction methods, such as those enabled by LLMs, which can leverage patterns in language to interpret complex relationships in scientific texts.

1.2 Why the Proposed Work is Appropriate for EAGER Consideration

Our proposal aims to develop a novel service that infers the utility of datasets and software tools based on their documented use in scientific publications. This initiative aligns with the NSF’s EAGER program, which supports exploratory work in its early stages on untested but potentially transformative research ideas or approaches. The proposed research involves untested methodologies for context-aware extraction and interpretation of complex relationships from unstructured text. By leveraging advanced techniques, we aim to overcome the limitations of current methods in discerning nuanced references within scientific literature. This approach embodies the high-risk, high-reward paradigm that the EAGER program seeks to promote.

The project combines expertise from cybersecurity, data science, and AI to address persistent challenges in assessing data utility. By leveraging LLMs to extract and interpret complex relationships from unstructured text, the project applies new methodologies to the field of cybersecurity research. This interdisciplinary approach is characteristic of the type of research EAGER aims to support.

If successful, this project could significantly enhance the way funding agencies and researchers assess the contribution of datasets to scientific discovery and innovation. By providing a model applicable to related issues in other data-focused scientific disciplines, the project has the potential to transform current practices in data utility assessment, leading to more informed investment decisions and optimized AI-ready data curation.

1.3 Technical Approach

We propose to leverage CAIDA’s (cybersecurity and network infrastructure) domain-specific expertise and extensive domain-specific datasets, SDSC’s AI and cyberinfrastructure expertise, open-source LLMs, and the NAIRR Pilot resources at SDSC (Expanse and Voyager) to design and prototype a service to detect, validate, and characterize use of data sets and software tools in scientific publications.

We will use state-of-the-art open-source LLMs for this task. Specifically, we plan to use LLaMA 3.1, which has a context length of 128K tokens and has been shown to be competitive with leading open and closed LLMs for a range of tasks [2]. A LLaMA 3.1-70B model [3] is available as part of SDSC LLM, an LLM-as-a-service currently available to SDSC staff for development purposes. Members of our team initiated the effort to build SDSC LLM as an internal resource to provide quick, cost-effective, and private access to LLM capabilities for SDSC researchers. SDSC LLM uses vLLM [4] for LLM serving and OpenWebUI [5] for chat UI and API access to the underlying models. The LLaMA 3.1-70B model is currently deployed on 4x A100 80GB GPUs. We will leverage SDSC LLM for tasks based on inference. For tasks requiring model parameter updates, we will make use of AI-optimized accelerators on Voyager and GPUs on Expanse. The Voyager supercomputer, hosted at SDSC, is designed for deep learning workloads [6]. SDSC also hosts the Expanse supercomputer, which provides compute resources for a wide range of applications [7].

We will explore various approaches to leverage LLMs for this task. We will start with prompt engineering, followed by fine-tuning, and, if time permits, retrieval-augmented generation (RAG) and prompt tuning. Prompt engineering is the process of designing and refining prompts to elicit desired responses from LLMs. We will evaluate several prompt engineering techniques, including prompt crafting, shot prompting, and chain-of-thought. RAG is a method to enhance the quality of LLM responses by incorporating additional information from an external source. Prompt tuning is a technique to adapt LLMs to a specific task by adjusting only a few parameters corresponding to the prompt embeddings to guide the model’s output. We will pursue these latter techniques (RAG and prompt tuning) if time allows, or integrate them into follow-on work.
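Because vLLM serves an OpenAI-compatible API, and OpenWebUI exposes the underlying models through a similar interface, inference against the hosted LLaMA 3.1-70B model can be scripted with a standard client. The sketch below is a minimal illustration, not the service implementation: the base URL and API key are placeholders, and the model identifier is assumed to match the Hugging Face name in [3]; the actual name exposed by SDSC LLM may differ.

```python
# Minimal sketch: query an OpenAI-compatible endpoint (e.g., vLLM's server)
# hosting LLaMA 3.1-70B. Endpoint URL, API key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://sdsc-llm.example.org/v1",  # hypothetical SDSC LLM endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # assumed model identifier
    messages=[
        {"role": "system",
         "content": "You extract data sets and software tools from research papers."},
        {"role": "user",
         "content": "List the data sets used in the following excerpt: ..."},
    ],
    temperature=0.0,  # deterministic output simplifies evaluation
)
print(response.choices[0].message.content)
```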

2 Methodology for inferring resource (data and tools) utility analytics

The objective of this task is to extract knowledge about data and software resources used in Internet research publications. To facilitate human audit of the result, we will have the LLM excerpt sentences from each publication (and citations within these sentences) that indicate use of a specific resource. We will then include text from the cited references to guide and/or fine-tune the LLM. Minimizing false positives will require including negative examples, i.e., publications that reference but do not actually use a specific data set or tool, in the multi-shot prompt. Using manually annotated papers, we will investigate LLM-based approaches for extracting information related to data and software resources from text.
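To make the notion of a negative example concrete, the sketch below shows one hypothetical query/response pair that could serve as a shot in the multi-shot prompt: the excerpt cites a resource without using it, so the expected response labels it as mentioned rather than used. The passage, prompt wording, and response schema are illustrative only; the actual formats we will use are shown in Figures 1 and 2.

```python
# One hypothetical negative-example shot for the multi-shot prompt:
# the excerpt cites a tool but does not use it, so the label is "mentioned".
negative_shot = [
    {
        "role": "user",
        "content": (
            "Identify data sets or software tools in the text below and classify "
            "each as 'used' or 'mentioned'.\n\n"
            "Text: \"Prior studies measured path latency with the Ark platform [7]; "
            "in contrast, our measurements come from our own probing infrastructure.\""
        ),
    },
    {
        "role": "assistant",
        "content": (
            '[{"resource": "Ark", "type": "software", "classification": "mentioned", '
            '"sentences": ["Prior studies measured path latency with the Ark platform [7]; '
            'in contrast, our measurements come from our own probing infrastructure."], '
            '"references": ["[7]"]}]'
        ),
    },
]
```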

2.1 Preliminary Approach and Example

The prompt engineering approach consists of three steps: (1) create the training data set, (2) use the training data to engineer a set of one-shot prompts using segments of the paper, followed by a final prompt using the entire paper, and (3) validate the results. We describe the process and then illustrate it with an example analysis for a single publication.

  1. Step 1: Create the training data set. We will first manually identify resources (data sets or software tools) used or referenced by a set of research papers. We will represent each resource in YAML with five fields (Figure 1; a sketch of such a record follows the figure):

    (a) resource identifier URL if provided in the paper

    (b) resource type (dataset or software)

    (c) classification as merely mentioned or used by the paper

    (d) sentences the human annotator considered necessary to justify the labels

    (e) resource’s bibliographic references cited in those sentences.

Figure 1: Example of extraction of metadata (and YAML encoding underneath) that indicates a mention of a data set (left) and an actual use of the data set (right). This example is from the publication “ECN with QUIC: Challenges in the Wild” [8].

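As a rough illustration of such a five-field record (hypothetical field names and values, not the exact schema shown in Figure 1), the record could be built in Python and serialized to YAML as follows:

```python
# A minimal sketch of the five-field resource record from Step 1.
# Field names and values are hypothetical placeholders.
import yaml  # PyYAML

record = {
    "url": "https://example.org/dataset",  # (a) resource identifier URL, if given
    "type": "dataset",                     # (b) dataset or software
    "classification": "used",              # (c) merely mentioned vs. actually used
    "sentences": [                         # (d) sentences supporting the label
        "We analyze six months of traces from the example dataset [12].",
    ],
    "references": [                        # (e) citations within those sentences
        "[12] Example dataset, 2022.",
    ],
}

print(yaml.safe_dump(record, sort_keys=False))
```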

  2. Step 2: Use training data to engineer prompt. This step will construct a series of examples to use in a multi-shot query [9].

    (a) Step 2.1: Creating query-response pairs for a multi-shot example. We will create a series of query-response pairs, each generated from the labeled resource record that we manually extracted from a single paper in Step 1. Each example consists of two pieces: the user prompt (query) and the manually constructed expected response. This pair simulates an exchange between the user and the LLM. As a starting point, the example query will include only sentences that refer to the labeled resource, and references nearby in the text. Figure 2 provides an example with the user (role: user) message containing the example query and the assistant’s (role: assistant) message containing the expected response. Note that the ideal one-shot query could include the full text of the labeled paper, but given the LLM’s context length limit, using the entire paper would limit the number of possible multi-shots.

    (b) Step 2.2: Use examples from Step 2.1 to create a multi-shot query. We will include several examples in the input to generate a multi-shot prompt that we provide to the model. The purpose of multi-shot prompting is to provide the model with a set of example queries and expected responses before providing the target query. We will investigate trade-offs between accuracy and processing time with different numbers of shots (i.e., examples).

    (c) Step 2.3: Append the final query to the multi-shot prompt and execute it. After we construct the set of multi-shot examples, we will append to it a query that includes the full text of the paper. The prompt will end with a single user message containing the target paper’s full text. Figure 3 contains an example target query with the full text of the target paper.

  3. Step 3: Sanitize and evaluate the response. We will use a tool such as json_repair [12] to fix the response format in order to verify whether the references provided in the response are correct. We will identify incorrect responses by finding resources with either a sentence or reference not included in the paper’s full text. Using the set of manually labeled papers, we will evaluate prompt performance by comparing the manually labeled resources against those provided in the LLM’s response. (A combined sketch of Steps 2 and 3 follows this list.)
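The sketch below ties Steps 2 and 3 together under the same placeholder assumptions as the earlier snippets (endpoint URL, model identifier, prompt wording, and record schema are all hypothetical). It assembles the multi-shot messages from labeled query/response pairs, appends the target paper’s full text as the final user message, queries the model, repairs the reply with the json_repair package [12], and drops records whose supporting sentences do not appear verbatim in the paper.

```python
# A hedged end-to-end sketch of Steps 2 and 3. Endpoint URL, model name,
# prompt wording, and the record schema are placeholders, not the final design.
import json_repair                 # pip install json-repair (see [12])
from openai import OpenAI

client = OpenAI(base_url="https://sdsc-llm.example.org/v1", api_key="YOUR_API_KEY")

def build_messages(shots, target_full_text):
    """shots: list of (query, expected_response) string pairs from labeled papers."""
    messages = [{
        "role": "system",
        "content": ("Extract data sets and software tools from the paper and reply "
                    "with a JSON list of resource records."),
    }]
    # Step 2.2: prepend the multi-shot examples as simulated user/assistant turns.
    for query, expected in shots:
        messages.append({"role": "user", "content": query})
        messages.append({"role": "assistant", "content": expected})
    # Step 2.3: the final user message carries the target paper's full text.
    messages.append({"role": "user", "content": "Paper full text:\n" + target_full_text})
    return messages

def extract_resources(shots, target_full_text,
                      model="meta-llama/Meta-Llama-3.1-70B-Instruct"):
    reply = client.chat.completions.create(
        model=model,
        messages=build_messages(shots, target_full_text),
        temperature=0.0,
    ).choices[0].message.content
    # Step 3: repair malformed JSON, then discard records whose supporting
    # sentences do not appear verbatim in the paper's full text.
    records = json_repair.loads(reply)
    return [r for r in records
            if all(s in target_full_text for s in r.get("sentences", []))]
```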

In addition to prompt engineering, we will also investigate fine-tuning and, if time permits, RAG and prompt tuning, and we will compare these approaches based on accuracy, processing time, and the amount of labeled data required.

2.2 Data to be used

We have already indexed a subset of the external publications that describe the use of CAIDA datasets. Some of CAIDA’s datasets are already annotated using the annotation schema designed for this (MSRI GMI) project [13], which includes labels such as AutonomousSystem [14], and metadata stored in YAML format. We will leverage this previous work to accelerate training.

2.3 Outcome

This task will achieve two critical goals: enabling discovery of the most generative data sets that CAIDA produces, and facilitating a quantitative assessment of the utility of our data sets, which informs investment decisions about which data collection and curation is most important to sustain. Importantly, the tools and approach will be useful for other data-focused disciplines, addressing what will be a growing need in the AI research community: evaluating the return on investment in complex AI-enabled as well as AI-enabling data sets.

3 Intellectual Merit

This task will require advanced, context-aware extraction of metadata from unstructured text and associated inference of relevant annotations. We will leverage LLMs to enable more accurate interpretation and explanation of complex relationships in scientific publications. We will combine CAIDA’s (cybersecurity and critical network infrastructure) domain-specific expertise with SDSC’s AI and cyberinfrastructure expertise to develop LLM-based approaches such as prompt engineering and fine-tuning for extracting relevant information from text.

Figure 2: JSON representation of example query and expected response that constitute a single shot, generated from “Replication: Towards a Publicly Available Internet Scale IP Geolocation Dataset” [10]. The blue text is the target query, green the labeled resources’ sentences, and red the labeled resources’ references.


Figure 3: JSON representation of the final message in the prompt, with query in blue and target’s full text in purple. (Publication: “Using Gaming Footage as a Source of Internet Latency Information” [11]).


4 Broader Impacts

  1. Services to Benefit the Research and Cybersecurity Community: The ultimate goal of this research is to improve the quality of data and services that CAIDA provides to the research community. Macroscopic Internet data sets are notoriously expensive to collect, maintain, and share; NAIRR can play a key role in navigating these challenges, by using AI tools to inform data utility assessments. The immediate outcome will address two needs commonly articulated by Internet researchers: getting started with Internet data, and understanding the utility of such data for further research.

  2. Extension of NAIRR Pilot Capabilities: This project will extend the NAIRR Pilot’s capabilities by integrating advanced AI tools for metadata extraction and annotation specific to our domain. We will also provide the instructions and code we used, for adaptation by other disciplines.

  3. Novel AI-Cyberinfrastructure Innovations: The development of LLM-based methodologies for metadata extraction from natural language sources represents a significant innovation in AI-cyberinfrastructure. We will update our existing data processing pipeline to include these innovations, and document these changes to help other fields benefit from them. Understanding which datasets are most scientifically generative as a whole will help not only scientists, but also funding agencies who must set investment priorities.

References

[1]. CAIDA, “Internet Science Resource Catalog,” 2024. https://catalog.caida.org

[2]. Meta, “Meta LLaMA 3.1,” 2024. https://ai.meta.com/blog/meta-llama-3-1/

[3]. Hugging Face, “Meta‑Llama‑3.1‑70B‑Instruct,” 2024. https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct

[4]. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Yu, Joey Gonzalez, Hao Zhang & Ion Stoica, “vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention.” https://docs.vllm.ai/en/latest/

[5]. OpenWebUI. https://openwebui.com

[6]. San Diego Supercomputer Center, “Voyager User Guide (NAIRR resource),” 2024. https://sdsc.edu/support/user_guides/voyager.html

[7]. San Diego Supercomputer Center, “Expanse User Guide (NAIRR resource),” 2024. https://sdsc.edu/support/user_guides/expanse.html

[8]. C. Sander, I. Kunze, L. Blöcher, M. Kosek & K. Wehrle, “ECN with QUIC: Challenges in the Wild,” in Proceedings of the 2023 ACM on Internet Measurement Conference (IMC ’23), p. 540–553, ACM, Oct. 2023.

[9]. “Multi‑Shot (multiple examples),” 2023. https://guide.teahouseai.com/teahouseai/master-llms/main-concepts/multi-shot-multiple-examples

[10]. O. Darwich, H. Rimlinger, M. Dreyfus, M. Gouel & K. Vermeulen, “Replication: Towards a Publicly Available Internet Scale IP Geolocation Dataset,” in Proceedings of the 2023 ACM on Internet Measurement Conference (IMC ’23), p. 1–15, ACM, 2023.

[11]. C. Alvarez & K. Argyraki, “Using Gaming Footage as a Source of Internet Latency Information,” in Proceedings of the 2023 ACM on Internet Measurement Conference (IMC ’23), p. 606–626, ACM, 2023.

[12]. S. Baccianella, “JSON Repair: to repair invalid JSON, used to parse the output of LLMs,” 2024. https://github.com/mangiucugna/json_repair

[13]. B. Huffaker & K. Claffy, “Annotated Schema: Mapping Ontologies onto Dataset Schemas,” CAIDA Technical Report, May 2023.

[14]. CAIDA, “CAIDA Ontology.” https://catalog.caida.org/ontology
