Applying NAIRR Pilot Resources to Infer Data Set Utility

This project will use National Artificial Intelligence Research Resource (NAIRR) Pilot resources to develop a new capability for the cybersecurity research community: a service that infers the utility of datasets and software tools based on their documented use in scientific publications.

Sponsored by:
National Science Foundation (NSF)

Principal Investigators: Kimberly Claffy and Bradley Huffaker

Funding source: OAC-2526448
Period of performance: April 3, 2025 - December 31, 2025


Project Summary

Building this service will require data collection, preprocessing, model selection and training, and evaluation and deployment. The service will address persistent community challenges and serve as a model for related problems in engineering and social science disciplines.

The capability to assess scientific data utility will inform AI-ready data investments. For many years, CAIDA has tracked the use of its scientific data in publications, in part to support its rich context catalog, which links research papers to the resources (datasets and software tools) used in those papers. However, even current state-of-the-art techniques for automatically extracting links between papers and resources cannot handle the complexity, variability, and contextual nuances of natural language, e.g., discerning a mere reference to a resource from its actual use in a paper.
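To illustrate the reference-versus-use distinction, one way to frame the problem is as a classification task posed to an LLM: given an excerpt mentioning a resource, ask the model whether the authors actually used it or only cited it. The sketch below is illustrative only; the prompt template, labels, and `build_usage_prompt` helper are hypothetical, not CAIDA's implementation.

```python
# Sketch: framing "reference vs. actual use" as an LLM classification task.
# Labels and prompt wording are illustrative assumptions.

LABELS = ("USED", "CITED_ONLY")

def build_usage_prompt(resource: str, excerpt: str) -> str:
    """Build a prompt asking whether a paper excerpt shows actual use
    of a resource or only a passing citation."""
    return (
        f"Classify how the resource '{resource}' appears in the excerpt.\n"
        f"Answer with exactly one label: {', '.join(LABELS)}.\n\n"
        "USED means the authors ran, queried, or analyzed the resource.\n"
        "CITED_ONLY means the resource is mentioned but not used.\n\n"
        f"Excerpt:\n{excerpt}\n\nLabel:"
    )

prompt = build_usage_prompt(
    "CAIDA Ark",
    "We collected traceroutes from 40 Ark monitors over two weeks.",
)
# The prompt string would then be sent to an open-source LLM for completion.
```

In practice the excerpt above should be labeled USED, while a sentence such as "prior work relied on Ark [12]" should be labeled CITED_ONLY; the difficulty lies in the many ambiguous cases between these extremes.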

Project leads, key team members

kc claffy (CAIDA/SDSC/UCSD)
Bradley Huffaker (CAIDA/SDSC/UCSD)
Elena Yulaeva (CAIDA/SDSC/UCSD)
Mai H. Nguyen (SDSC/UCSD)

Advancing NAIRR Infrastructure

This project will extend NAIRR Pilot’s capabilities by integrating advanced AI tools for metadata extraction and annotation specific to our domain.

Advancing AI-enabled scientific research

This project will develop an efficient approach to generating concise few-shot learning examples directly from research papers, enabling the construction of larger and more effective multi-shot prompts within the limited context window of Large Language Models (LLMs). We will also publish the instructions and code we use, for adaptation by other disciplines.
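The core constraint described above is fitting as many concise examples as possible into a fixed context window. One simple way to do this, sketched below under stated assumptions, is to greedily pack examples under a token budget. The 4-characters-per-token estimate and both function names are illustrative; a real pipeline would use the target model's own tokenizer.

```python
# Sketch: greedily packing concise few-shot examples into a prompt
# under a fixed token budget. The chars/4 token estimate is a rough
# illustrative assumption, not a model-specific tokenizer.

def estimate_tokens(text: str) -> int:
    """Rough token count; real pipelines would use the model's tokenizer."""
    return max(1, len(text) // 4)

def pack_examples(examples: list[str], budget_tokens: int) -> list[str]:
    """Add examples shortest-first until the budget is exhausted,
    so more shots fit within the limited context window."""
    packed, used = [], 0
    for ex in sorted(examples, key=estimate_tokens):
        cost = estimate_tokens(ex)
        if used + cost > budget_tokens:
            break
        packed.append(ex)
        used += cost
    return packed

examples = [
    "Q: ... A: USED",
    "Q: a much longer worked example ... A: CITED_ONLY",
    "Q: .. A: USED",
]
shots = pack_examples(examples, budget_tokens=10)
```

Shortest-first packing maximizes the number of shots; other selection criteria (e.g., diversity or difficulty of examples) could replace the sort key without changing the budget logic.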

Broader Impacts

The knowledge generated by this project will be integrated into CAIDA’s catalog (catalog.caida.org), an established resource for discovering Internet-related publications, presentations, software, and datasets. This integration will provide funding agencies with a clear understanding of the datasets and software demonstrating the broadest impact. Furthermore, it will empower researchers to more easily identify relationships between diverse datasets and understand their utilization. While the initial focus is on Internet research publications, the underlying methodology and resulting code will be adaptable to other domains, enabling the creation of insightful context graphs across various fields.

IDSU workflow diagram

Tasks

We propose to leverage CAIDA's domain-specific expertise and extensive datasets in cybersecurity and network infrastructure, the San Diego Supercomputer Center's (SDSC) AI and cyberinfrastructure expertise, open-source Large Language Models (LLMs), and the NAIRR Pilot resources at SDSC (Expanse and Voyager) to develop two solutions: one focused on extracting security-relevant metadata about Internet infrastructure properties, and one on inferring the utility of datasets and software tools based on their documented use in scientific publications. We will use state-of-the-art open-source LLMs for both tasks.

Task 1 (Extraction of security-relevant metadata) aims to improve the accuracy of CAIDA's existing datasets by encoding information extracted from natural-language sources, which will help infer relationships between infrastructure components, such as Autonomous Systems (ASes).

Task 2 (Resource data and tools utility analytics) aims to extract knowledge about data and software resources used in scientific publications, specifically in the context of Internet infrastructure security. The goal is to infer the utility of these resources based on their documented use in publications.
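At its simplest, the utility inference in Task 2 can be viewed as aggregating, across the publication corpus, how many papers document actual use of each resource. The sketch below shows that aggregation step; the record layout and resource names are hypothetical illustrations, not the catalog's schema.

```python
# Sketch: a minimal utility signal for Task 2, ranking resources by how
# many papers document actually using them (vs. merely citing them).
# The record layout is a hypothetical illustration, not the catalog schema.

from collections import Counter

def rank_by_documented_use(records: list[dict]) -> list[tuple[str, int]]:
    """records: dicts with 'paper', 'resource', and 'relation'
    ('used' or 'cited'). Returns (resource, use_count) pairs,
    ordered by number of documented uses."""
    uses = Counter(r["resource"] for r in records if r["relation"] == "used")
    return uses.most_common()

records = [
    {"paper": "p1", "resource": "Ark", "relation": "used"},
    {"paper": "p2", "resource": "Ark", "relation": "used"},
    {"paper": "p2", "resource": "AS Rank", "relation": "cited"},
    {"paper": "p3", "resource": "AS Rank", "relation": "used"},
]
ranking = rank_by_documented_use(records)  # Ark first, with 2 documented uses
```

Richer utility measures could weight each use by venue, recency, or the depth of use inferred in the classification step, but the counting backbone would stay the same.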

Acknowledgment of awarding agency’s support

National Science Foundation (NSF)

This material is based on research sponsored by the National Science Foundation (NSF) grant OAC-2526448. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of NSF.
