Statistics in the Age of AI: Theory, Methods, and Data

Didong Li

Assistant Professor, Department of Biostatistics, University of North Carolina at Chapel Hill

Abstract: Artificial Intelligence (AI) has surged in popularity, creating both opportunities and challenges for statistics. In this talk, I will present three recent directions from my lab that reflect our efforts to engage with the age of AI. First, I will discuss theoretical results for decoder-based generative models, providing statistical foundations that connect latent dimension, approximation error, and model complexity. Second, I will discuss a method that uses embeddings from large language models to enhance high-dimensional hypothesis testing, a statistical tool widely used across scientific domains; the method is motivated by problems in cancer genomics where traditional methods are underpowered. I will also discuss extensions to genetic studies, where we curated annotations for 8.9 billion genetic variants from the human genome and obtained embeddings of these variants for downstream tasks such as GWAS and phenotype prediction. Finally, I will switch to an infrastructural view and introduce STimage-1K4M, one of the first and largest publicly available spatial transcriptomics datasets, curated by my group and consisting of 1,149 slides and more than 4 million pathology image–gene expression pairs across 10 species and 50 tissue types. This resource has been downloaded over 180,000 times on HuggingFace and has facilitated the training of multimodal foundation models. Together, these examples illustrate how theory, methodology, and data curation advance both statistics and AI.
