Babak Shahbaba

Babak Shahbaba, PhD

Professor of Statistics

University of California, Irvine


Statistical Machine Learning

Nonparametric Bayesian Methods

Statistical Methods in Biomedical Sciences



RESEARCH


My independent research focuses on Bayesian nonparametric methods and hierarchical Bayesian models and their applications in large-scale biological sciences. Because Bayesian methods tend to be computationally intensive (especially for large-scale studies), I have also devoted a part of my research to developing more efficient computational methods in order to facilitate the application of Bayesian statistics to data-intensive scientific problems. I am currently focusing on the following areas:


Statistical Machine Learning

A main part of my research resides at the intersection of statistics and machine learning, with our findings published in journals and conferences including JASA, JMLR, TMLR, NeurIPS, ICML, UAI, and AAAI. In a recent work (to appear in JASA), we have introduced a graph neural network model that integrates information of varying orders and provides interpretable results. The model is inspired by and applied to a real-world application designed to extract task-critical information from brain activity data. Inspired by the same application, we have also developed a new latent representation learning framework for data from multiple modalities (to appear in NeurIPS). For proper uncertainty quantification in deep learning, we have developed a new algorithm (to appear in TMLR) that significantly reduces computational cost without compromising accuracy. Finally, over the past several years, we have developed a flexible model for time series analysis using a latent factor Gaussian process (LFGP) model, a novel optimal transport method to integrate data across heterogeneous subjects, and a reinforcement learning method to obtain individualized optimal policies.


Stochastic Process Modeling

Modern nonparametric methods rely on stochastic processes, such as the Dirichlet process (DP) and Gaussian process (GP), to overcome the limitations associated with assuming simple distributional forms and linear relationships. Dirichlet process mixture (DPM) models are typically used for nonparametric density estimation and clustering. Early in my career, I expanded the application of DPM by proposing a novel nonlinear classifier that models the joint distribution of the response and predictors nonparametrically using Dirichlet process mixtures. Within each component of the mixture, the relationship is assumed to be linear. However, the overall relationship is nonlinear if the mixture contains more than one component. In this way, the linear model is embedded within a framework that offers much greater flexibility, yet still accommodates linearity if the data support it. With my collaborators, I have also developed a Dirichlet process mixture model for analyzing large-scale genomics and neuroscience studies. Inspired by neuroscience problems, my students and I have also developed Gaussian process models for time series data. These methods can be used to make inferences about dependencies among multiple time series.
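
To illustrate how locally linear components can combine into a nonlinear classifier, here is a minimal Python sketch. It is not the DPM classifier described above: it uses a fixed two-component Gaussian mixture with hand-picked parameters rather than a Dirichlet process prior, and nothing is fitted to data. The function names and all parameter values are assumptions chosen for illustration only.

```python
import numpy as np


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def mixture_class_prob(x, weights, means, sds, intercepts, slopes):
    """P(y = 1 | x) under a finite mixture of locally linear (logistic) models.

    Each component k models x as Normal(means[k], sds[k]) and y | x as a
    logistic regression with its own intercept and slope. Conditioning the
    joint model on x gives a responsibility-weighted average of the
    component-level class probabilities.
    """
    x = np.asarray(x, dtype=float)[:, None]          # shape (n, 1)
    dens = weights * normal_pdf(x, means, sds)       # shape (n, K)
    resp = dens / dens.sum(axis=1, keepdims=True)    # P(component k | x)
    local = sigmoid(intercepts + slopes * x)         # linear on the logit scale
    return (resp * local).sum(axis=1)


# Hypothetical parameters: two components with opposite slopes.
weights = np.array([0.5, 0.5])
means = np.array([-2.0, 2.0])
sds = np.array([1.0, 1.0])
intercepts = np.array([4.0, 4.0])
slopes = np.array([2.0, -2.0])

xs = np.array([-4.0, 0.0, 4.0])
probs = mixture_class_prob(xs, weights, means, sds, intercepts, slopes)
print(probs)  # low at both ends, high in the middle
```

With a single component, the class probability would be monotone in x. With two components of opposite slope, it rises and then falls, which no single logistic regression in x can represent.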


Computational Methods

Bayesian methods provide a coherent framework for incorporating domain knowledge, but tend to be computationally intensive, since they usually rely on Markov chain Monte Carlo (MCMC) algorithms to simulate samples from intractable distributions. To address this issue, I have been working on computationally efficient sampling algorithms based on geometrically motivated methods such as Hamiltonian Monte Carlo (HMC) and its variants. Over the past several years, my students and I have introduced new techniques to improve the computational efficiency of HMC, extended these methods for sampling from constrained or multimodal distributions, and applied them to practical problems in neuroscience and population dynamics.
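
For readers unfamiliar with HMC, the following is a minimal Python sketch of the textbook algorithm (leapfrog integration followed by a Metropolis accept/reject step) applied to a standard bivariate normal target. It is not one of the variants mentioned above. The function names, step size, number of leapfrog steps, and target distribution are all assumptions chosen for illustration.

```python
import numpy as np


def leapfrog(q, p, grad_log_prob, step_size, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    q = q.copy()
    p = p.copy()
    p += 0.5 * step_size * grad_log_prob(q)      # initial half step for momentum
    for _ in range(n_steps - 1):
        q += step_size * p                       # full step for position
        p += step_size * grad_log_prob(q)        # full step for momentum
    q += step_size * p
    p += 0.5 * step_size * grad_log_prob(q)      # final half step for momentum
    return q, p


def hmc_sample(log_prob, grad_log_prob, q0, n_samples,
               step_size=0.2, n_steps=20, seed=0):
    """Basic HMC with an identity mass matrix."""
    rng = np.random.default_rng(seed)
    q = np.asarray(q0, dtype=float)
    samples = np.empty((n_samples, q.size))
    for i in range(n_samples):
        p0 = rng.standard_normal(q.size)         # resample momentum
        q_new, p_new = leapfrog(q, p0, grad_log_prob, step_size, n_steps)
        # Hamiltonian = potential energy (-log density) + kinetic energy
        h_old = -log_prob(q) + 0.5 * p0 @ p0
        h_new = -log_prob(q_new) + 0.5 * p_new @ p_new
        if np.log(rng.uniform()) < h_old - h_new:  # Metropolis correction
            q = q_new
        samples[i] = q
    return samples


# Illustrative target: standard bivariate normal (up to a constant).
def log_prob(q):
    return -0.5 * q @ q


def grad_log_prob(q):
    return -q


samples = hmc_sample(log_prob, grad_log_prob, np.zeros(2), n_samples=2000)
print(samples.mean(axis=0), samples.var(axis=0))
```

The gradient of the log density guides each proposal along a trajectory, which is what lets HMC make distant moves with a high acceptance rate compared with random-walk proposals.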


Biomedical Applications

My independent methodological research is mainly motivated by applied problems and collaborative projects. The main focus of my applied research has been on large-scale biomedical studies, where sophisticated experiments have created new challenges for data scientists. In recent years, one of my primary interests has been large-scale neuroscience studies, starting with our work on cross-neuronal interactions. My students and I have worked closely with neuroscientists to analyze high-dimensional neural data collected in memory-related experiments. Our main paper on this topic, published in Nature Communications, provided the first direct evidence that the lookahead process reported in the place cell literature extends to non-spatial information. We have also developed a new method for identifying differentially activated brain regions using light sheet fluorescence microscopy, a cutting-edge whole-brain imaging technique. I have also been involved in several large-scale genomic studies, including our recent work on predicting circadian time, which has been published in Nature Communications.


Research Grants

Current

Collaborative Research: HDR DSC: Data Science Training and Practices: Preparing a Diverse Workforce via Academic and Industrial Partnership

NSF (Role: Lead PI)

Through engaging students selected from a pool of highly diverse populations in STEM areas, this project, California Data Experience Transformation (CADET), will facilitate data science training via curriculum development, hands-on experiences, and close interactions with both academic and non-academic organizations.


Individualized Learning and Prediction for Heterogeneous Multimodal Data From Wearable Devices

NIH (Role: MPI)

The goal of this project is to develop deep neural models for understanding and predicting individual health over time, while learning shared patterns across multiple individuals to enhance the interpretability of the results.


DEJA-VU: Design of Joint 3D Solid-State Learning Machines for Various Cognitive Use-Cases

NSF (Role: Co-PI)

The objective of this proposal is to design a new class of computer chips (in contrast to von Neumann machines) by leveraging advances in our understanding of how the brain represents and computes information, and the crucial insight that complex spatiotemporal signaling (characteristic of brain computations) can be mapped onto 3D integrated chips.


Irvine Summer Institute in Biostatistics and Data Science

NIH (Role: MPI and Co-Director)

The goal of this project is to introduce students to modern methods in biostatistics and data science, provide them with hands-on experience in solving real-life research problems, familiarize them with career options in biostatistics, and prepare them for these careers.


Project PIPE-LINE (Programs for Institutional Pathway Engagement–acceLerating INfrastructure and Education)

California Learning Lab (Role: Co-PI)

The overarching goal of this project is to offer a collaborative and sustainable model for overcoming equity gaps in data science learning.


Completed

Scalable Bayesian Stochastic Process Models for Neural Data Analysis

NIH (Role: PI)

The overarching goal of this study is to understand the neural basis of complex behaviors and temporal organization of memories. To this end, we will develop a new powerful and scalable class of statistical models for studying multimodal neural data using Bayesian stochastic processes and computationally efficient algorithms. The potential clinical impact of this study is broad. Our research will address fundamental and unresolved questions about hippocampal function, and these novel approaches may subsequently lead to unprecedented insight into the neural mechanisms underlying memory impairments.

See our GitHub page for a brief report of our findings and results.


The NSF-Simons Center for Multiscale Cell Fate

NSF/Simons Foundation (Role: Senior Personnel)

The overarching objective of this center is to investigate how cells differentiate into different cell types.


MODULUS: Data-Driven Mechanistic Modeling of Hierarchical Tissues

NSF (Role: PI)

This project will develop new statistical and mathematical models that describe how cells and molecules within cells self-organize to perform biological functions within an organism. More specifically, we will use our models to investigate hematopoiesis, a remarkable biological process that is responsible for the creation and maintenance of blood cells and involves complex interactions among biochemical and physical events across temporal and spatial scales that are still not well understood. Additionally, this project will provide undergraduate and graduate students with a truly interdisciplinary experience, with equal mentorship from data and biological scientists.


Theory and practice for exploiting the underlying structure of probability models in big data analysis

NSF (Role: PI)

The objective of this project is to combine geometric techniques with computational algorithms in order to scale up statistical methods used for big data analysis.

See our GitHub page for a brief report of our findings and results.


Efficient Bayesian Learning from Stochastic Gradients

NSF (Role: Co-PI)

This proposal studies a new family of MCMC procedures that requires only very few data-cases per update.


Bayesian Modeling and Data Integration in Infectious Disease Phylodynamics

NIH (Role: Co-I)

The objective of this project is to develop new statistical methodology for analysis of population dynamics of infectious disease agents by integrating gene sequencing and other data collected in infectious disease surveillance programs.


Prenatal Stress Biology, Infant Body Composition and Obesity Risk

NIH (Key Personnel)

The overall objective of this project is to evaluate the impact of maternal biological stress during pregnancy on infant body composition and metabolic function.


EMA Assessment of Biobehavioral Processes in Human Pregnancy

NIH (Key Personnel)

The overall objective of this project is to evaluate the impact of maternal psychosocial and biological stress, assessed with state-of-the-art ambulatory measures, on length of gestation.


Transcriptomic, Oxidative Stress, and Inflammatory Responses to Air Pollutants

NIH (Key Personnel)

This study would be among the first using repeated measurements to analyze the relation between chemically characterized air pollutants and genome-wide gene expression patterns in peripheral blood cells from a high-risk population of elderly individuals.

Contact

(949) 824-0623

2222 ISEB, UC Irvine, CA 92697

babaks at uci dot edu