Poisson log-normal estimation with dependency between sites

Authors
Affiliations

MIA Paris-Saclay, AgroParisTech, INRAE

MaIAGE, Université Paris-Saclay, INRAE

Stéphane Robin

CESCO, LPSM, Sorbonne Université

GenPhySE, INRAE

Published

September 10, 2023

Abstract

The classical version of the PLN model assumes that \(n\) independent count vectors (each of dimension \(p\)) are observed. The aim of this work is to extend the PLN model to account for dependency between count vectors. Depending on the application, the dependency structure between observations may be left unconstrained or assumed to take a specific form. The associated inference algorithm then needs to be developed, and could eventually be integrated into the R/C++ PLNmodels package. This subject is motivated in particular by an application to population genetics, where the \(p\) counts making up a vector are derived from \(p\) possibly related individuals, and where each vector of counts is associated with a locus along the genome. The aim here is to account for the dependency between neighboring loci (known as linkage disequilibrium).

Keywords: Poisson, log-normal, EM, variational, Generalised Linear Mixed Models, Graphical Models, Genomics.

Acknowledgments: I would like to express my gratitude to my internship supervisors Julien Chiquet and Mahendra Mariadassou, for putting great effort into providing guidance whenever needed. I am also very grateful to Stéphane Robin and Bertrand Servin, with whom I worked throughout this internship, as well as to the whole team of the UMR Mathématique et Informatique Appliquées (Paris-Saclay) for their support and their kindness. Finally, I want to thank Emma, who has been there the whole time.

Introduction

This report summarises the progress made during my research internship at UMR Mathématique et Informatique Appliquées (Paris-Saclay), between April 2023 and July 2023, under the supervision of Julien Chiquet, Mahendra Mariadassou, Stéphane Robin and Bertrand Servin.

The goal was to design a multivariate count model adapted to a dataset on crossing-overs, which are breaks occurring along the DNA strand during meiosis, recorded for multiple species of sheep. The MIA team has developed the Poisson log-normal (PLN) model, which is well suited to modelling several count variables jointly and measuring their correlations.

However, this dataset exhibits a clear spatial dependency that the classical PLN model, in which the sites are independent, does not account for. We therefore wanted to design a PLN model whose latent variables have an auto-regressive structure, and to implement it in R within the PLNmodels package.
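To make the idea of auto-regressive latent variables concrete, the following is a minimal simulation sketch in Python, not the model or the implementation developed in this report. It assumes a first-order auto-regression along ordered sites, with the \(p\) latent coordinates kept independent of one another for simplicity (the PLN model itself allows a full covariance between them). The function names `sample_poisson` and `simulate_ar1_pln` and the parameters `mu`, `phi` and `sigma` are hypothetical and chosen for illustration only.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson variate with Knuth's multiplication method.

    Suitable for the moderate rates used in this sketch only.
    """
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_ar1_pln(n_sites, mu, phi, sigma, seed=0):
    """Simulate counts at n_sites ordered sites for p = len(mu) variables.

    Assumed (hypothetical) model, for each variable j:
        Z[t][j] = phi * Z[t-1][j] + innovation,  Var(Z[t][j]) = sigma**2
        Y[t][j] | Z[t][j] ~ Poisson(exp(mu[j] + Z[t][j]))
    """
    rng = random.Random(seed)
    p = len(mu)
    # Innovation scale chosen so that the latent process is stationary.
    innovation_sd = sigma * math.sqrt(1.0 - phi ** 2)
    z = [rng.gauss(0.0, sigma) for _ in range(p)]  # stationary start
    latent, counts = [], []
    for _ in range(n_sites):
        latent.append(list(z))
        counts.append(
            [sample_poisson(math.exp(mu[j] + z[j]), rng) for j in range(p)]
        )
        # Move to the next site: each latent coordinate follows an AR(1).
        z = [phi * z[j] + rng.gauss(0.0, innovation_sd) for j in range(p)]
    return latent, counts


if __name__ == "__main__":
    latent, counts = simulate_ar1_pln(10, mu=[1.0, 0.5], phi=0.9, sigma=0.5)
    for row in counts:
        print(row)
```

Setting `phi = 0` in this sketch makes the sites independent, as in the classical PLN setting, while values of `phi` close to 1 make neighbouring sites strongly dependent.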