Web page for lcNGS

Title: Improved computations for relationship inference using low coverage sequencing data

Authors: Petter Mostad, Andreas Tillmar and Daniel Kling

Contact: mostad@chalmers.se or daniel.kling@rmv.se

Description

The publication describes a novel approach to relationship inference from low coverage sequencing data. The model combines an inheritance model for genetic linkage with an observational model accouting for sequencing read counts to model genotype likelihoods. That is, the computations do not assume genotypes but models likelihood distributions for all possible genotypes.

We use example data from the publication by Tillmar et al describing a SNP panel with roughly 4000 kinship informative markers. https://www.mdpi.com/2073-4425/12/12/1968

Usage and applications

The implementation is provided as a open source R-script with research purposes as the intended use. A validated Windows GUI version of the method will be provided at https://famlink.se/f_download.html with the main application being forensic genetics.

The typical input is low coverage sequence data for one or more samples and high quality genotype data (from whatever platform) for zero or more samples. One scenario can be the comparison of low quality sequence data generated from low quality material (e.g. old bones) with high quality genotype data from one (or more) alleged living relatives. Comparisons are obviously also possible if samples are only high quality as well.

Our R-script implementation accepts simple allele read counts for a SNP marker as input. In addition the R-scripts accepts a common vcf-file with PL parameters as input. The PL data will be used in the genotype likelihood model and therefore allows the usage of an external software, e.g. ANSGD or GATK to generate these likelihoods directly from BAM files.

Parameters

Our model requires some parameters to be estimated. We provide a rough guideline but each user must review their particular application. Feel free to reach out to the emails above for any question relating to this.

A genetic map is necessary for the inheritance model. The map contains genetic positions, in cM, for each included marker. The map positions are converted to recombination rates in the script and used in the transmission model of the our method. We provide an example genetic map in the next section. https://famlink.se/f_databases.html provides a few additional genetic maps. Reach out to us if you would like to construct your own map.

A population frequency database specifying the allele frequencies for each include marker. We provide an example in the next section but refer to the 1000G project for additional frequencies or https://famlink.se/f_databases.html for some additional populations.

Our model further necessitates setting some additional parameters. First, the gamma parameter relating to the probability of an unobserved allele in the data. That is, if the sequence data indicates an allele not present in the frequency database, our model requires a frequency for this allele. One model for the gamma parameter is to use 5/(2N) as gamma, where N relates to the number of typed individuals in the frequency database. On a conservative side, this parameter should take a value in the range 0.1-5% with 5% being most conservative. The Fst parameter relates to subpopulation structure in the population frequency database. The value is typically low for outbred population (say <0.1%) but can be higher in inbred population (say 1-3%). The user should consult the litterature to find a measure that applies to their population. Finally, the two parameters relating to our observation model, Epar and Mpar. Epar relates to errors induced in the sequencing process and typically takes a value in the range 0.1-1% depending on the sample and sequencer used, see https://academic.oup.com/nargab/article/3/1/lqab019/6193612 for some data. Mpar is more difficult to give guidance on and future studies will evaluate the impact of different parameter values. An intuitive description of the parameter is the amount of haplotypes available in the sample prior to the PCR step, which translates to twice the number of cells in the sample. For low quality data we advice setting this parameter somewhere in the range 5-20 and for high quality samples >100.

Download

The script is available through https://familias.name/lcNGS/Files/lcNGS.R (uploaded 2023-01-23)

Genetic map and frequency data from the CEU population in the 1000G project as well as example vcf file with two individuals.

Files used in the simulations in the paper are available for simulated depth 3 and 10 respectively.

Getting started

The R tool suite must be installed, we recommend version 4.0.0 or higher. The script has only been tested on Windows OS but should work on other systems as well. Please reach out if you experience any issues.

As stated in the Usage and application section previously, the R-script is intended for research purposes and require some an understanding of R-code. For other uses we refer to the implementation available at https://famlink.se/f_download.html.

We examplify the method and our implemention in R using a case with simulated data from two individuals, P1 and P2. Genotype data is provided as a vcf-file. In addition, a genetic map as well as a frequency file is required. Once the data is downloaded, start R.

In the console type

source("lcNGS.R")

to load the R code. Note, this necessitates that the script is located in your current working directory. Next, the main entry point for calculations is the function

computeLR <- function(mapFile, freqDataFile, testDataFile,

datatype = "AD",

Fst = 0.01,

gamma = 0.0001,

epar = c(0.001, 0.001), # vector of same length as nTested

mpar = c(100, 100) # vector of same length as nTested

)

Run the command using the data downloaded through the link provided previously

computeLR("FORCE.map", "FORCE_EUR.freq", "force_test_vcf.vcf")

with all parameters set at their default values. The output should be (ped 1 is full siblings),

[1] "likelihood ratio ped 1 vs ped 2 : 4.45493523748995e+257" #ped 2 is unrelated

[1] "likelihood ratio ped 1 vs ped 3 : 2.90916370392427e-12" #ped 3 is half siblings

[1] "likelihood ratio ped 1 vs ped 4 : 471992365854597" #ped 4 is first cousins

[1] "likelihood ratio ped 1 vs ped 5 : 4.50276676988792e+48" #ped 5 is second cousins

Note, the relationships are now hard-coded as part of the computeLR function but can be directly changed in the code. Furthermore, note that the readVCF function can also read information from the PL tag by simply setting data type to PL instead and also GT to use genotypes from the vcf file.

The script further contains functions for simulations etc which can be explored in the script itself. This and changing the relationship (which is now hardcoded) requires directly changing the code.