Photo: Sylvain Crouzet
About Me
I am an Assistant Professor
in the
Department of Mathematics
at the
University of California, Los Angeles,
where I am a member of the
Probability
and Mathematical Physics Group.
I am also affiliated with the
UCLA Bioinformatics Program
and the
UCLA EcoEvo Theory Group.
I work at the intersection of applied probability,
mathematical statistics and
theoretical computer science, with an emphasis on biological
and social applications.
More details on my research interests and
publications can be found
here.
My work is supported by NSF grant
DMS-1007144.
The schedule of the UCLA Probability Seminar is
here.
To receive announcements regarding the probability seminar and other related activities, please sign
up here.
Committees
Mathematical Approaches in High-Throughput Genomics, IPAM, September-December 2011
Recent Trends in Probability and Related Fields, AMS Sectional Meeting, UCLA, October 2010
(Slides from the session are posted here.)
SODA 2009, New York, January 2009
Teaching (for a full list,
see here)
Winter 2012:
MATH32B and
MATH275B
Spring 2011: MATH182
Winter 2011: MATH275B
Fall 2010: MATH275A
Spring 2010: MATH285K
Recent Preprints (for a full list,
see here)
Recovering a tree-like trend of evolution despite extensive lateral genetic transfer: A probabilistic analysis
Accepted in RECOMB 2012. With S. Snir.
Robust Estimation of Latent Tree Graphical Models: Inferring Hidden States with Inexact Parameters
Submitted, 2011. With E. Mossel and A. Sly.
Latent tree graphical models are widely used in computational biology,
signal and image processing, and network tomography. Here we design a
new efficient, estimation procedure for latent tree models, including
Gaussian and discrete, reversible models, that significantly improves
on previous sample requirement bounds. Our techniques are based on a
new hidden state estimator which is robust to inaccuracies in estimated
parameters. More precisely, we prove that latent tree models can be
estimated with high probability in the so-called Kesten-Stigum regime
with $O(log^2 n)$ samples where $n$ is the number of nodes.
Phylogenetic Mixtures: Concentration of Measure in the Large-Tree Limit
Accepted in Annals of Applied Probability, 2011. With E. Mossel.
The reconstruction of phylogenies from DNA or protein sequences is a major task of computational evolutionary biology. Common phenomena, notably variations in mutation rates across genomes and incongruences between gene lineage histories, often make it necessary to model molecular data as originating from a mixtureof phylogenies. Such mixed models play an increasingly important role in practice.
Using concentration of measure techniques, we show that mixtures of large trees are typically identifiable. We also derive sequence-length requirements for high-probability reconstruction.
Identifiability and inference of non-parametric rates-across-sites models on large-scale phylogenies
Submitted, 2011. With E. Mossel.
Mutation rate variation across loci is well known to cause difficulties, notably identifiability issues, in the reconstruction of evolutionary trees from molecular sequences. Here we introduce a new approach for estimating general rates-across-sites models. Our results imply, in particular, that large phylogenies are typically identifiable under rate variation. We also derive sequence-length requirements for high-probability reconstruction.
Our main contribution is a novel algorithm that clusters sites according to their mutation rate. Following this site clustering step, standard reconstruction techniques can be used to recover the phylogeny. Our results rely on a basic insight: that, for large trees, certain site statistics experience concentration-of-measure phenomena.
Alignment-Free Phylogenetic Reconstruction: Sample Complexity via a Branching Process Analysis
Submitted, 2011. With C. Daskalakis.
Conference abstract in Proceedings of RECOMB 2010, 123-137.
We present an efficient phylogenetic reconstruction algorithm,
allowing insertions and deletions, which provably achieves a
sequence-length requirement (or sample complexity) growing
polynomially in the number of taxa. Our algorithm is distance-based,
that is, it relies on pairwise sequence comparisons. More importantly,
our approach largely bypasses the difficult problem of multiple
sequence alignment.
Global Alignment of Molecular Sequences via Ancestral State Reconstruction
Submitted, 2011.
With A. Andoni, C. Daskalakis, and A. Hassidim.
Conference abstract in Proceedings of ICS 2010, 358-369.
We consider the trace reconstruction problem on a tree (TRPT):
a binary sequence is broadcast through a tree channel where
we allow substitutions, deletions, and insertions; we seek
to reconstruct the original sequence from the sequences
received at the leaves. The TRPT is motivated by the
multiple sequence alignment problem in computational biology.
We give a simple recursive procedure giving strong reconstruction
guarantees at low mutation rates.
We begin with a random sequence $x_1, \ldots, x_k$ which is located at the root.
If vertex $v$ has the sequence $y_1, \ldots y_{k_{v}}$, then each one of its $d$
children will have a sequence which is generated from $y_1, \ldots y_{k_{v}}$ by
flipping three biased coins for each bit. The first coin has probability $p_s$
for Heads, and determines whether this bit will be substituted or not.
The second coin has probability $p_d$, and determines whether this bit
will be deleted, and the third coin has probability $p_i$ and determines
whether a new random bit will be inserted. The input to the procedure is
the sequences of the $n$ leaves of the tree, as well as the tree structure
(but not the sequences of the inner vertices) and the goal is to reconstruct
an approximation to the sequence of the root (the DNA of the ancestral father).
For every $\epsilon > 0$, we present a deterministic algorithm which outputs an
approximation of $x_1, \ldots, x_k$ with probability $1 - \epsilon$, if
$p_i + p_d < O(1/k^{2/3} \log n)$ and $(1 - 2p_s)^2 > O(d^{-1} \log d)$,
where the probability is on the coin flips which determine the mutation process.
To our knowledge, this is the first rigorous trace reconstruction result on a tree in the presence of indels.
Contact
If you would like to reach me, here is my contact information.
Office: MS 6156
Phone: (310) 206-0201
Fax: (310) 206-6673
lastname[at]math[dot]ucla[dot]edu
UCLA
Department of Mathematics
520 Portola Plaza
Los Angeles, CA 90095-1555