This dataset on crossovers was suggested and studied in Petit et al. (2017) and Johnston, Huisman, and Pemberton (2018).
Let us quickly describe the dataset:
- `chrom` are the chromosome indices of the sheep.
- `wstart` – `wstop` are the window spans (unit: megabases) which correspond to the sites.
- `nco_[species]_[sex]` are the numbers of crossovers for each species and sex.
- `coverage_[species]_[sex]` is a measurement used as the offset.

There are no covariates, so this will be PLN-AR estimation only.
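For concreteness, the wide table has one row per genomic window and columns following the sketch below (population and sex labels are taken from the figures later in this document; `expected_cols` is purely illustrative, no data are shown):

```r
# Expected column layout of the wide table (sketch only, no data shown)
expected_cols <- c(
  "chrom", "wstart", "wstop",
  "nco_Lacaune_F", "nco_Lacaune_M", "nco_Soay_F", "nco_Soay_M",
  "coverage_Lacaune_F", "coverage_Lacaune_M", "coverage_Soay_F", "coverage_Soay_M"
)
```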
Definition 3.1
The very low probability of a crossover (around \(10^{-8}\)) justifies the use of a Poisson-based regression model. The parameter \(\beta\) will model the mean recombination rate along the genome and \(\boldsymbol{\Sigma}\) the deviation around this mean.
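As a reminder, the Poisson log-normal model we have in mind can be written as follows (a sketch in the usual PLN notation; with no covariates, the latent mean reduces to a single vector \(\boldsymbol{\beta}\)):

$$
\mathbf{Z}_i \sim \mathcal{N}(\boldsymbol{\beta}, \boldsymbol{\Sigma}), \qquad
Y_{ij} \mid Z_{ij} \sim \mathcal{P}\!\left(\exp\{o_{ij} + Z_{ij}\}\right),
$$

where \(i\) indexes the genomic windows, \(j\) the populations, and \(o_{ij}\) is the log-offset derived from the coverage.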
Why do we need a modified PLN-AR model? Before answering, let us first clean and explore this dataset.
Cleaning
The dataset needs to be pivoted from long to wide format. We also rescale the offset by window size.
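The pivoting step itself is not shown in this document; purely as an illustration, a script producing `crossover_wide.rds` could look like the sketch below (the raw file name, the `species`/`sex` column names and the exact rescaling are assumptions, not the authors' code):

```r
library(tidyverse)

# Hypothetical preprocessing sketch: pivot the raw long table to one row per
# genomic window and rescale the coverage offset by the window size.
crossover_long <- readRDS("../data/crossover_long.rds")   # assumed raw file

crossover_wide <- crossover_long |>
  mutate(coverage = coverage / (wstop - wstart)) |>        # offset per base
  pivot_wider(
    id_cols     = c(chrom, wstart, wstop),
    names_from  = c(species, sex),
    values_from = c(nco, coverage),
    names_glue  = "{.value}_{species}_{sex}"
  )

saveRDS(crossover_wide, "../data/crossover_wide.rds")
```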
Import
#| file: ../scripts/import_data.R
library(tidyverse)
library(plotly)
cross <- readRDS("../data/crossover_wide.rds") |> mutate(
  region_name = glue::glue("{chrom}:{wstart}-{wstop}"),
  wstart = wstart / 1e6,
  wstop = wstop / 1e6
)
cross_wider <- cross |>
  select(starts_with("nco")) |>
  pivot_longer(
    cols = everything(),
    names_to = "Population",
    values_to = "N_co",
    names_prefix = "nco_"
  )
As a sanity check, we look at the number of crossovers in regions with no coverage information: it should be 0, and we confirm that it is.
cross |>
  pivot_longer(
    cols = c(starts_with("nco"), starts_with("coverage")),
    names_to = c(".value", "population", "sex"),
    names_sep = "_"
  ) |>
  filter(coverage == 0)
#> # A tibble: 2 x 8
#> chrom wstart wstop region_name population sex nco coverage
#> <fct> <dbl> <dbl> <glue> <chr> <chr> <int> <dbl>
#> 1 23 62 63 23:62000000-63000000 Soay F 0 0
#> 2 23 62 63 23:62000000-63000000 Soay M 0 0
NAs removal
We remove these rows to avoid numerical problems during the inference.
cross <- cross |> filter(!if_any(starts_with("coverage"), ~ .x == 0))
cross_wider <- cross |>
  select(starts_with("nco")) |>
  pivot_longer(
    cols = everything(),
    names_to = "Population",
    values_to = "N_co",
    names_prefix = "nco_"
  )
g <- ggplot(cross_wider) +
  stat_count(
    aes(x = N_co, y = after_stat(prop), fill = Population),
    position = "identity", alpha = 0.8
  ) +
  xlim(-0.5, 300.5)
# stat_ecdf(geom = "step")
g
# ggplotly(g)
First, we carry out a classical (non-AR) VEM estimation and analyze how well it fits the original data.
library(PLNmodels)
cross_ab <- cross |>
  select(starts_with("nco_")) |>
  data.matrix()
cross_offsets <- cross |>
  select(starts_with("coverage")) |>
  data.matrix()
cross_prep <- prepare_data(
  counts = cross_ab,
  covariates = cross |> select(region_name), ## uninformative covariate
  offset = cross_offsets
)
cross_pln <- PLN(
  ## offset computed in linear scale in the cross_prep data, should be in log scale in the poisson model
  formula = Abundance ~ 1, # + offset(log(Offset)),
  data = cross_prep
)
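To get a first feel for the fit, one can inspect the estimated parameters with the standard PLNmodels accessors (a quick sketch, not part of the original script):

```r
coef(cross_pln)    # matrix B of regression coefficients (intercept-only here)
sigma(cross_pln)   # estimated latent covariance matrix Sigma
```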
Comparing the fitted and original data
cross_fitted <- data.frame(cross_pln$fitted)
colnames(cross_fitted) <- colnames(cross |> select(starts_with("nco")))
cross_fitted_wider <- round(cross_fitted) |>
  pivot_longer(
    cols = everything(),
    names_to = "Population",
    values_to = "N_co_fit",
    names_prefix = "nco_"
  )
n <- 1000
p <- 4
mu <- c(cross_pln$model_par$B)
Sigma <- cross_pln$model_par$Sigma
## Null offsets: expected depth of each sample is sum_j exp(mu_j + Sigma_jj / 2)
depths <- rowSums(exp(rep(1, n) %o% (mu + diag(Sigma) / 2)))
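The depths passed to `rPLN()` below are the expected total count per window under the fitted model (identical for every window, since the offsets are null here), using the log-normal mean identity

$$
\mathbb{E}[Y_{ij}] = \exp\left(\mu_j + \tfrac{1}{2}\Sigma_{jj}\right),
\qquad
\text{depth}_i = \sum_{j} \mathbb{E}[Y_{ij}].
$$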
cross_gen <- data.frame(rPLN(n, mu, Sigma, depths))
colnames(cross_gen) <- colnames(cross |> select(starts_with("nco")))
cross_gen_wider <- cross_gen |>
  pivot_longer(
    cols = everything(),
    names_to = "Population",
    values_to = "N_co",
    names_prefix = "nco_"
  ) |>
  mutate(Population = paste0(Population, "_gen"))
cross_compare <- data.frame(cross_wider, cross_fitted_wider |> select(N_co_fit))
g1 <- ggplot(
  cross_compare |>
    filter(grepl("Lacaune_M", Population)) |>
    select(starts_with("N_co")) |>
    gather()
) +
  stat_count(
    aes(x = value, y = after_stat(prop), fill = key),
    position = position_dodge()
  ) +
  xlim(-0.5, 300.5)

g2 <- ggplot(
  bind_rows(cross_wider, cross_gen_wider) |>
    filter(grepl("Lacaune_F", Population))
) +
  stat_count(
    aes(x = N_co, y = after_stat(prop), fill = Population),
    position = position_dodge()
  ) +
  xlim(-0.5, 300.5)
# stat_ecdf(geom = "step")
# ggplotly(g2)
g1
# ks.test()
## Joint density of the count n and the latent Gaussian variable z in the
## Poisson log-normal model: Poisson(n; exp(z)) x Normal(z; mu, sigma^2)
p_xz <- function(n, z, mu = 0, sigma = 1) {
  exp(-exp(z)) * exp(n * z) / factorial(n) *
    (1 / (sigma * sqrt(2 * pi)) * exp(-(z - mu)^2 / (2 * sigma^2)))
}

## Marginal probability P(X = n), integrating the latent variable out
## numerically (heuristic integration bounds)
p_x <- function(n, mu = 0, sigma = 1) {
  integrate(
    function(z) p_xz(n, z, mu, sigma),
    lower = 0.2 * (mu - 100) / sigma**0.75,
    upper = 0.2 * (mu + 100) / sigma**0.75
  )
}
# p <- tibble(p = 1:100) |> mutate(across(1, function(n) {
# p_x(n, mu = 1)
# }))
# sum(p)
# summary(p)
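As a quick sanity check on these helpers (a sketch in the spirit of the commented-out block above, not part of the original script), the marginal probabilities should sum to roughly one over a range of counts carrying essentially all the mass; `p_x()` returns the full `integrate()` result, so we extract its `$value`:

```r
probs <- sapply(0:35, function(n) p_x(n, mu = 0, sigma = 1)$value)
sum(probs)  # should be close to 1 for mu = 0, sigma = 1
```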
Once the AR version is implemented, we shall perform various tests to check the goodness of fit between the predicted and original values, and compare them between the PLN and PLN-AR models.
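For instance, a two-sample Kolmogorov–Smirnov test (hinted at by the `ks.test()` comment above) could compare observed and fitted counts; a minimal sketch using the objects built earlier:

```r
lacaune_m <- cross_compare |> filter(grepl("Lacaune_M", Population))
# Counts are discrete, so ties make the KS p-value only indicative
ks.test(lacaune_m$N_co, lacaune_m$N_co_fit)
```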
Finally, we check that the auto-regressive structure explains the spatial shape of the crossover data.
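For reference, one possible way to write the autoregressive extension is the following sketch of an AR(1) latent layer (the exact parameterization used in the PLN-AR model may differ):

$$
\mathbf{Z}_1 \sim \mathcal{N}(\boldsymbol{\beta}, \boldsymbol{\Sigma}), \qquad
\mathbf{Z}_{i+1} \mid \mathbf{Z}_i \sim \mathcal{N}\!\left(\boldsymbol{\beta} + \Phi\,(\mathbf{Z}_i - \boldsymbol{\beta}),\; \boldsymbol{\Sigma}_\varepsilon\right),
$$

where consecutive windows \(i\) and \(i+1\) of the same chromosome are linked through the autoregression matrix \(\Phi\), so that crossover counts in neighbouring windows become correlated.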