Code
## | file: ../scripts/import_data.RThis dataset on crossovers was suggested and studied in Petit et al. (2017) and Johnston, Huisman, and Pemberton (2018).
Let us quickly describe the dataset:
chrom] are the chromosome indices of the sheep.wstart – wstop] are the window spans (unit: megabases) which correpond to the sites.nco_[species]_[genre]] are the number of crossovers for each species.coverage_[species]_[genre]] is a measurement of the offset.There are no covariates, so this will be PLN-AR estimation only.
Definition 3.1
The very low probability of a crossover (around \(10^{-8}\)) justifies the use of a Poisson-based regression model. The parameter \(\beta\) will model the mean recombination rate along the genome and \(\boldsymbol{\Sigma}\) the deviation around this mean.
Why do we need a modified PLN-AR model:
First, let us clean and discover this dataset.
Cleaning
The dataset needs to be pivoted from long to wide format. We also rescale the offset by window size.
## | file: ../scripts/import_data.RImport
library(tidyverse)
library(plotly)
cross <- readRDS("../data/crossover_wide.rds") |> mutate(
region_name = glue::glue("{chrom}:{wstart}-{wstop}"),
wstart = wstart / 1e6,
wstop = wstop / 1e6
)
cross_wider <- cross |>
select(starts_with("nco")) |>
pivot_longer(
cols = everything(),
names_to = "Population",
values_to = "N_co",
names_prefix = "nco_"
)We perform a sanity check, that is to say that we check the number of crossovers in regions with no coverage information. It should hopefully be 0 and we confirm that it is.
cross |>
pivot_longer(
cols = c(starts_with("nco"), starts_with("coverage")),
names_to = c(".value", "population", "sex"),
names_sep = "_"
) |>
filter(coverage == 0)
#> # A tibble: 2 x 8
#> chrom wstart wstop region_name population sex nco coverage
#> <fct> <dbl> <dbl> <glue> <chr> <chr> <int> <dbl>
#> 1 23 62 63 23:62000000-63000000 Soay F 0 0
#> 2 23 62 63 23:62000000-63000000 Soay M 0 0NAs removal
We remove them to avoid numerical problems during the inference.
cross <- cross |> filter(!if_any(starts_with("coverage"), ~ .x == 0))
cross_wider <- cross |>
select(starts_with("nco")) |>
pivot_longer(
cols = everything(),
names_to = "Population",
values_to = "N_co",
names_prefix = "nco_"
)g <- ggplot(cross_wider) +
stat_count(
aes(x = N_co, y = after_stat(prop), fill = Population),
position = "identity", alpha = 0.8
) +
xlim(-0.5, 300.5)
# stat_ecdf(geom = "step")
g# ggplotly(g)First, we carry out a classical non-AR VEM estimation, and we analyze how it fits to the original data.
library(PLNmodels)
cross_ab <- cross |>
select(starts_with("nco_")) |>
data.matrix()
cross_offsets <- cross |>
select(starts_with("coverage")) |>
data.matrix()
cross_prep <- prepare_data(
counts = cross_ab,
covariates = cross |> select(region_name), ## uninformative covariate
offset = cross_offsets
)
cross_pln <- PLN(
## offset computed in linear scale in the cross_prep data, should be in log scale in the poisson model
formula = Abundance ~ 1, #+ offset(log(Offset)),
data = cross_prep
)Comparing the fitted and original data
cross_fitted <- data.frame(cross_pln$fitted)
colnames(cross_fitted) <- colnames(cross |> select(starts_with("nco")))
cross_fitted_wider <- round(cross_fitted) |>
pivot_longer(
cols = everything(),
names_to = "Population",
values_to = "N_co_fit",
names_prefix = "nco_"
)
n <- 1000
p <- 4
mu <- c(cross_pln$model_par$B)
Sigma <- cross_pln$model_par$Sigma
depths <- rowSums(exp(rep(1, n) %o% diag(Sigma) / 2 + mu)) # Null offsets
cross_gen <- data.frame(rPLN(n, mu, Sigma, depths))
colnames(cross_gen) <- colnames(cross |> select(starts_with("nco")))
cross_gen_wider <- cross_gen |>
pivot_longer(
cols = everything(),
names_to = "Population",
values_to = "N_co",
names_prefix = "nco_"
) |>
mutate(Population = paste0(Population, "_gen"))
cross_compare <- data.frame(cross_wider, cross_fitted_wider |> select(N_co_fit))
g1 <- ggplot(
cross_compare |>
filter(grepl("Lacaune_M", Population)) |>
select(starts_with("N_co")) |>
gather()
) +
stat_count(
aes(x = value, y = after_stat(prop), fill = key),
position = position_dodge()
) +
xlim(-0.5, 300.5)
g2 <- ggplot(
bind_rows(cross_wider, cross_gen_wider) |>
filter(grepl("Lacaune_F", Population))
) +
stat_count(
aes(x = N_co, y = after_stat(prop), fill = Population),
position = position_dodge()
) +
xlim(-0.5, 300.5)
# stat_ecdf(geom = "step")
# ggplotly(g2)
g1# ks.test()
p_xz <- function(n, z, mu = 0, sigma = 1) {
exp(-exp(z)) * exp(n * z) / factorial(n) *
(1 / sqrt(2 * pi) * exp(-(z - mu)^2 / (2 * sigma^2)))
}
p_x <- function(n, mu = 0, sigma = 1) {
integrate(
function(z) {
p_xz(n, z, mu, sigma)
},
lower = 0.2 * (mu - 100) / sigma**0.75,
upper = 0.2 * (mu + 100) / sigma**0.75
)
}
# p <- tibble(p = 1:100) |> mutate(across(1, function(n) {
# p_x(n, mu = 1)
# }))
# sum(p)
# summary(p)Once implemented, we shall perform various tests to check the goodness of fit of the predicted and original values, and compare them between the PLN and PLN-AR models:
Finally, we check that the auto-regressive structure explains the spatial shape of the crossover data.