Title: Improved computations for
relationship inference using low coverage sequencing data
Authors: Petter Mostad, Andreas Tillmar and
Daniel Kling
Contact: mostad@chalmers.se or daniel.kling@rmv.se
Description
The publication
describes a novel approach
to relationship inference
from low coverage sequencing data. The model
combines an inheritance model for genetic linkage with an observational model
accouting for sequencing read counts to model genotype likelihoods. That is,
the computations do not assume genotypes but models likelihood distributions
for all possible genotypes.
We use example data from the publication by Tillmar et al describing a SNP
panel with roughly 4000 kinship informative markers. https://www.mdpi.com/2073-4425/12/12/1968
Usage and applications
The implementation
is provided as a open source R-script with research purposes as the intended use. A validated Windows GUI
version of the method will
be provided at https://famlink.se/f_download.html
with the main application being forensic genetics.
The typical input is low coverage sequence data for one or more samples and
high quality genotype data (from whatever platform) for zero or more samples.
One scenario can be the comparison of low quality sequence data generated from
low quality material (e.g. old bones) with high quality genotype data from one
(or more) alleged living relatives. Comparisons are obviously also possible if
samples are only high quality as well.
Our R-script implementation accepts simple allele
read counts for a SNP marker as input. In addition
the R-scripts accepts a common
vcf-file with PL parameters as input. The PL data will
be used in the genotype likelihood model and therefore allows the usage of an external software, e.g. ANSGD or GATK to generate these likelihoods directly from BAM files.
Parameters
Our model requires some parameters to be estimated. We provide a rough
guideline but each user must review their particular application. Feel free to
reach out to the emails above for any question relating to this.
A genetic map is necessary for the inheritance model. The map contains genetic
positions, in cM, for each included marker. The map positions are converted
to recombination rates in the script and used in the transmission model of
the our method. We provide an example genetic map
in the next section. https://famlink.se/f_databases.html
provides a few additional
genetic maps. Reach out to us if you would like to construct your own map.
A population frequency database specifying the allele
frequencies for each include marker. We provide an example in the next section
but refer to the 1000G project for additional frequencies or https://famlink.se/f_databases.html for some additional populations.
Our model further necessitates setting some additional
parameters. First, the gamma parameter relating to the probability of an
unobserved allele in the data. That is, if the sequence data indicates an
allele not present in the frequency database, our model requires a frequency
for this allele. One model for the gamma parameter is to use 5/(2N) as gamma,
where N relates to the number of typed individuals in the frequency database.
On a conservative side, this parameter should take a value in the range 0.1-5%
with 5% being most conservative. The Fst parameter relates to subpopulation
structure in the population frequency database. The value is typically low for outbred population (say <0.1%)
but can be higher in inbred population (say 1-3%). The user should
consult the litterature to find a measure that applies to their population.
Finally, the two parameters relating to our observation model, Epar and Mpar.
Epar relates to errors induced in the sequencing process and typically takes a
value in the range 0.1-1% depending on the sample and sequencer used, see https://academic.oup.com/nargab/article/3/1/lqab019/6193612
for some data. Mpar is more difficult to give guidance on and future studies
will evaluate the impact of different parameter values. An intuitive
description of the parameter is the amount of haplotypes available in the
sample prior to the PCR step, which translates to twice the number of cells in
the sample. For low quality data we advice setting this parameter somewhere in
the range 5-20 and for high quality samples >100.
Download
The script is available
through https://familias.name/lcNGS/Files/lcNGS.R (uploaded 2023-01-23)
Genetic map and frequency data from the CEU population in the 1000G project as
well as example vcf file with two individuals.
Files used
in the simulations in the paper are available for simulated depth 3
and 10 respectively.
Getting started
The R tool suite
must be installed, we recommend version 4.0.0 or higher.
The script has only been tested on Windows OS but should work on other systems as well. Please reach
out if you experience any issues.
As stated
in the Usage and application
section previously, the R-script is intended for research
purposes and require some an understanding of R-code. For other uses
we refer to the implementation available at https://famlink.se/f_download.html.
We examplify the method and our implemention in R
using a case with simulated data from two individuals, P1 and P2. Genotype data is provided as a vcf-file. In
addition, a genetic map
as well as a frequency file is required. Once the data is downloaded,
start R.
In the console type
source("lcNGS.R")
to load the R code. Note, this necessitates that the
script is located in your current working directory. Next, the main entry point
for calculations is the function
computeLR <- function(mapFile, freqDataFile, testDataFile,
datatype =
"AD",
Fst = 0.01,
gamma = 0.0001,
epar = c(0.001, 0.001), # vector of same length as
nTested
mpar = c(100, 100) # vector of same length as
nTested
)
Run the command using the data downloaded through the
link provided previously
computeLR("FORCE.map", "FORCE_EUR.freq",
"force_test_vcf.vcf")
with all parameters set at their default values. The output should be (ped 1 is full siblings),
[1] "likelihood
ratio ped 1 vs ped 2 :
4.45493523748995e+257" #ped 2 is unrelated
[1] "likelihood
ratio ped 1 vs ped 3 : 2.90916370392427e-12"
#ped 3 is half siblings
[1] "likelihood
ratio ped 1 vs ped 4 : 471992365854597" #ped
4 is first cousins
[1] "likelihood
ratio ped 1 vs ped 5 : 4.50276676988792e+48"
#ped 5 is
second cousins
Note, the relationships are now
hard-coded as part of the computeLR
function but can be directly changed in the code. Furthermore, note that the readVCF function can also read information from
the PL tag by simply setting data type
to PL instead and also GT
to use genotypes from the vcf file.
The script further contains functions for simulations etc which can be explored
in the script itself. This and changing
the relationship (which is now hardcoded) requires directly changing the code.