
From Coevolution Signals to Mutation Effects: Statistically Calibrated Protein Contact Maps and Fitness Prediction

Wen Zhou

Associate Professor, Department of Biostatistics, New York University


Protein sequence data offer a massive natural experiment: across evolution, residue positions co-vary in ways that encode 3D structure, illuminate key evolutionary constraints, and reveal compensatory mutation mechanisms. Yet many widely used coevolution methods remain primarily algorithmic: powerful in practice, but often lacking calibrated uncertainty and rigorous theoretical guarantees. In this talk, we introduce a statistically grounded toolkit that transforms multiple sequence alignments (MSAs), which are high-dimensional, dependent categorical data, into (i) principled contact maps with error control and (ii) quantitatively reliable predictions of mutation effects. First, we recast contact prediction as hypothesis testing for conditional dependence in high-dimensional categorical data. Using one-hot encoded MSAs, we construct a partial-correlation-style graph and propose a new spectrum-based test statistic that enables statistically calibrated contact discovery. The framework further identifies the specific amino-acid combinations driving each detected interaction, providing a new layer of interpretability for coevolution signals. Next, we develop a Potts-model framework for mutation-effect modeling via node-wise high-dimensional multinomial regression. Our approach enforces sparsity both across residue pairs and across amino-acid types through sparse-group regularization, and it incorporates structural information by weighting penalties across site pairs. We establish sharp L2 convergence rates for the estimated Potts parameters, which in turn yield trustworthy estimates of evolutionary energies and mutation-induced energy changes. Across multiple protein families, our methods improve mutation fitness prediction when benchmarked against high-throughput mutagenesis experiments.
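
To make the two objects named in the abstract concrete, the following is a minimal sketch of a one-hot encoded MSA and of a Potts energy with the energy change induced by a single mutation. It is not the speaker's estimator or test statistic: the 21-letter alphabet, the function names, the sign convention E(x) = -sum_i h_i(x_i) - sum_{i<j} J_ij(x_i, x_j), and the toy parameters `h` and `J` are all assumptions made for illustration.

```python
import numpy as np

# Assumed alphabet: 20 amino acids plus a gap symbol (q = 21 states).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
AA_INDEX = {a: i for i, a in enumerate(ALPHABET)}


def one_hot_msa(seqs):
    """Encode n aligned sequences of length L as an (n, L, q) binary array."""
    n, L, q = len(seqs), len(seqs[0]), len(ALPHABET)
    X = np.zeros((n, L, q), dtype=np.int8)
    for s, seq in enumerate(seqs):
        for i, a in enumerate(seq):
            X[s, i, AA_INDEX[a]] = 1
    return X


def potts_energy(seq, h, J):
    """Potts energy under the assumed convention
    E(x) = -sum_i h[i, x_i] - sum_{i<j} J[i, j, x_i, x_j]."""
    idx = [AA_INDEX[a] for a in seq]
    L = len(idx)
    energy = -sum(h[i, idx[i]] for i in range(L))
    for i in range(L):
        for j in range(i + 1, L):
            energy -= J[i, j, idx[i], idx[j]]
    return float(energy)


def mutation_delta_energy(seq, pos, new_aa, h, J):
    """Energy change E(mutant) - E(wild type) for a single substitution."""
    mutant = seq[:pos] + new_aa + seq[pos + 1:]
    return potts_energy(mutant, h, J) - potts_energy(seq, h, J)


# Toy example: two sites, hypothetical parameters favouring the pair (A, C).
L, q = 2, len(ALPHABET)
h = np.zeros((L, q))
J = np.zeros((L, L, q, q))
J[0, 1, AA_INDEX["A"], AA_INDEX["C"]] = 1.0

X = one_hot_msa(["AC", "AD"])
delta = mutation_delta_energy("AC", 1, "D", h, J)
```

In this toy setting, replacing C with D at the second site removes the only favourable coupling, so the energy rises. In practice `h` and `J` would be estimated from the alignment, for example by the regularized node-wise regression the abstract describes, rather than set by hand.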
