3  Application: Variation in Recombination Rate and Its Genetic Determinism in Sheep Populations

This dataset on crossovers was suggested and studied in Petit et al. (2017) and Johnston, Huisman, and Pemberton (2018).

3.1 Context

Let us quickly describe the dataset:

  • [chrom] is the chromosome index.
  • [wstart] and [wstop] are the window bounds (unit: megabases), which correspond to the sites.
  • [nco_[population]_[sex]] is the number of crossovers for each population and sex.
  • [coverage_[population]_[sex]] is a coverage measurement, used as the offset.

There are no covariates, so the fit reduces to an intercept-only PLN-AR estimation.

Definition 3.1  

  • A crossover is a reciprocal recombination (exchange of genetic information) between two homologous chromosomes during meiosis. It allows alleles to be exchanged between chromosomes, thereby contributing to genetic diversity.
  • The recombination rate is the frequency of crossovers in a given window span.

The very low probability of a crossover (around \(10^{-8}\)) justifies the use of a Poisson-based regression model. The parameter \(\beta\) will model the mean recombination rate along the genome and \(\boldsymbol{\Sigma}\) the deviation around this mean.

Why do we need a modified PLN-AR model?

  • Empirically, we observe a one-dimensional dependency along the sites of the genome.
  • \(\Phi\) will model this one-dimensional spatial dependency between one site and the next (a sketch of the corresponding latent model is given below).
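
Concretely, for the intercept-only case, one common way to write this latent model is the following sketch; the exact parameterization (in particular, taking \(\boldsymbol{\Sigma}\) as the innovation covariance of the latent chain) is an assumption and should be read against the PLN-AR definition used elsewhere in this document:

$$
Y_{ij} \mid Z_{ij} \sim \mathcal{P}\left(\exp\{o_{ij} + \beta_j + Z_{ij}\}\right),
\qquad
\mathbf{Z}_i = \boldsymbol{\Phi}\,\mathbf{Z}_{i-1} + \boldsymbol{\varepsilon}_i,
\quad
\boldsymbol{\varepsilon}_i \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}),
$$

where \(i\) indexes the genomic windows (sites), \(j\) the population-sex pairs, and \(o_{ij}\) is the log-coverage offset.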

3.2 Exploration

First, let us clean and explore this dataset.

Cleaning

The dataset needs to be pivoted from long to wide format. We also rescale the offset by window size.

Code
#| file: ../scripts/import_data.R
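
The script itself is not reproduced here. Purely as an illustration, a cleaning step of this kind could look like the sketch below; the long-format column names (population, sex, nco, coverage) and the input object crossover_long are assumptions, not the actual content of ../scripts/import_data.R.

Code
## Hypothetical sketch, NOT the actual content of ../scripts/import_data.R
library(tidyverse)

crossover_wide <- crossover_long |> # assumed long-format input
  pivot_wider(
    names_from = c(population, sex),  # e.g. Lacaune / M
    values_from = c(nco, coverage),   # counts and raw coverage
    names_sep = "_"                   # yields nco_Lacaune_M, coverage_Lacaune_M, ...
  ) |>
  mutate(across(
    starts_with("coverage"),
    ~ .x / (wstop - wstart)           # rescale the offset by window size
  ))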

Import

Code
library(tidyverse)
library(plotly)

## Read the cleaned (wide) dataset, build region labels (in base pairs),
## then convert the window bounds to megabases
cross <- readRDS("../data/crossover_wide.rds") |> mutate(
  region_name = glue::glue("{chrom}:{wstart}-{wstop}"),
  wstart = wstart / 1e6,
  wstop = wstop / 1e6
)

## Long format of the counts only, for plotting purposes
cross_wider <- cross |>
  select(starts_with("nco")) |>
  pivot_longer(
    cols = everything(),
    names_to = "Population",
    values_to = "N_co",
    names_prefix = "nco_"
  )

We perform a sanity check: we look at the number of crossovers in the regions with no coverage information. It should be 0, and the output below confirms that it is.

Code
cross |>
  pivot_longer(
    cols = c(starts_with("nco"), starts_with("coverage")),
    names_to = c(".value", "population", "sex"),
    names_sep = "_"
  ) |>
  filter(coverage == 0)
#> # A tibble: 2 x 8
#>   chrom wstart wstop region_name          population sex     nco coverage
#>   <fct>  <dbl> <dbl> <glue>               <chr>      <chr> <int>    <dbl>
#> 1 23        62    63 23:62000000-63000000 Soay       F         0        0
#> 2 23        62    63 23:62000000-63000000 Soay       M         0        0

NAs removal

We remove these zero-coverage regions, which play the role of missing values, to avoid numerical problems during the inference.

Code
cross <- cross |> filter(!if_any(starts_with("coverage"), ~ .x == 0))
cross_wider <- cross |>
  select(starts_with("nco")) |>
  pivot_longer(
    cols = everything(),
    names_to = "Population",
    values_to = "N_co",
    names_prefix = "nco_"
  )

3.2.0.1 One-dimensional plot

Code
g <- ggplot(cross_wider) +
  stat_count(
    aes(x = N_co, y = after_stat(prop), fill = Population),
    position = "identity", alpha = 0.8
  ) +
  xlim(-0.5, 300.5)
# stat_ecdf(geom = "step")
g

Code
# ggplotly(g)

3.2.0.2 Classical PLN fitting

First, we carry out a classical (non-AR) VEM estimation and analyze how well it fits the original data.

Code
library(PLNmodels)

## Count matrix (one column per population-sex pair)
cross_ab <- cross |>
  select(starts_with("nco_")) |>
  data.matrix()

## Coverage matrix, used as offsets
cross_offsets <- cross |>
  select(starts_with("coverage")) |>
  data.matrix()

cross_prep <- prepare_data(
  counts = cross_ab,
  covariates = cross |> select(region_name), ## uninformative covariate
  offset = cross_offsets
)

cross_pln <- PLN(
  ## the offset is stored on the natural scale in cross_prep;
  ## it must enter the Poisson model on the log scale, hence log(Offset) if used
  formula = Abundance ~ 1, # + offset(log(Offset)),
  data = cross_prep
)
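
For reference, if one wanted to account for the coverage explicitly, the offset term commented out above would be included as follows (the object name cross_pln_offset is introduced here for illustration only):

Code
## Same intercept-only fit, with the log-coverage offset in the linear predictor
cross_pln_offset <- PLN(
  formula = Abundance ~ 1 + offset(log(Offset)),
  data = cross_prep
)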

Comparing the fitted and original data

Code
cross_fitted <- data.frame(cross_pln$fitted)
colnames(cross_fitted) <- colnames(cross |> select(starts_with("nco")))

cross_fitted_wider <- round(cross_fitted) |>
  pivot_longer(
    cols = everything(),
    names_to = "Population",
    values_to = "N_co_fit",
    names_prefix = "nco_"
  )

## Draw synthetic samples from the fitted (non-AR) PLN model
n <- 1000
p <- 4 # number of population-sex pairs
mu <- c(cross_pln$model_par$B)
Sigma <- cross_pln$model_par$Sigma
## Null offsets: the expected depth of each sample is sum_j exp(mu_j + Sigma_jj / 2)
depths <- rowSums(exp(rep(1, n) %o% (mu + diag(Sigma) / 2)))

cross_gen <- data.frame(rPLN(n, mu, Sigma, depths))
colnames(cross_gen) <- colnames(cross |> select(starts_with("nco")))
cross_gen_wider <- cross_gen |>
  pivot_longer(
    cols = everything(),
    names_to = "Population",
    values_to = "N_co",
    names_prefix = "nco_"
  ) |>
  mutate(Population = paste0(Population, "_gen"))

cross_compare <- data.frame(cross_wider, cross_fitted_wider |> select(N_co_fit))

g1 <- ggplot(
  cross_compare |>
    filter(grepl("Lacaune_M", Population)) |>
    select(starts_with("N_co")) |>
    pivot_longer(everything(), names_to = "key", values_to = "value")
) +
  stat_count(
    aes(x = value, y = after_stat(prop), fill = key),
    position = position_dodge()
  ) +
  xlim(-0.5, 300.5)

g2 <- ggplot(
  bind_rows(cross_wider, cross_gen_wider) |>
    filter(grepl("Lacaune_F", Population))
) +
  stat_count(
    aes(x = N_co, y = after_stat(prop), fill = Population),
    position = position_dodge()
  ) +
  xlim(-0.5, 300.5)
# stat_ecdf(geom = "step")
# ggplotly(g2)
g1

Code
# ks.test()

## Joint density of (N = n, Z = z) under a univariate Poisson-lognormal model:
## N | Z = z ~ Poisson(exp(z)) and Z ~ N(mu, sigma^2)
p_xz <- function(n, z, mu = 0, sigma = 1) {
  exp(-exp(z)) * exp(n * z) / factorial(n) *
    (1 / (sigma * sqrt(2 * pi)) * exp(-(z - mu)^2 / (2 * sigma^2)))
}

## Marginal probability P(N = n), integrating the latent variable out
## (heuristic integration bounds, wide enough in practice)
p_x <- function(n, mu = 0, sigma = 1) {
  integrate(
    function(z) p_xz(n, z, mu, sigma),
    lower = 0.2 * (mu - 100) / sigma^0.75,
    upper = 0.2 * (mu + 100) / sigma^0.75
  )
}

# p <- tibble(p = 1:100) |> mutate(across(1, function(n) {
#   p_x(n, mu = 1)
# }))
# sum(p)
# summary(p)
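
Below is a minimal, runnable version of the normalisation check hinted at in the commented lines above, restricted to small counts to avoid the overflow of factorial(n) and exp(n * z); the value mu = 1 is taken from that commented code, and sigma is left at its default.

Code
## Sanity check on the marginal distribution: the probabilities returned by
## p_x() should sum to a value close to 1 over the bulk of the support
probs <- sapply(0:30, function(n) p_x(n, mu = 1, sigma = 1)$value)
sum(probs)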

3.3 Goodness of fit

Once the PLN-AR model is implemented, we shall perform various tests to assess how well the predicted values fit the original ones, and compare the results between the PLN and PLN-AR models:

  • Coefficient of determination
  • Prediction error and cross-validation (Manhattan distance or mean squared error in centered log-ratio coordinates; see the sketch after this list)
  • Q-Q plot
  • Shapiro-Wilk test
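
As an illustration of the second item, here is a minimal sketch of a prediction error in centered log-ratio coordinates, assuming we compare the fitted means cross_pln$fitted to the observed counts cross_ab; the pseudo-count of 1 used to handle zeros is a choice made for this sketch.

Code
## Centered log-ratio transform (pseudo-count of 1 to handle zero counts)
clr <- function(x) {
  lx <- log(x + 1)
  lx - rowMeans(lx)
}

## Mean squared error between observed and fitted values, in CLR coordinates
mse_clr <- mean((clr(cross_ab) - clr(cross_pln$fitted))^2)
mse_clr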

3.4 Analysis

Finally, we will check whether the auto-regressive structure captures the spatial dependency of the crossover data along the genome.