Trio Calling for de novo Mutations

Next-generation sequencing of family pedigrees offers a powerful approach for the identification of transmitted alleles and/or de novo mutations that may confer susceptibility to disease. The "trio" subcommand of VarScan leverages the family relationship to improve variant calling accuracy, identify apparent Mendelian Inheritance Errors (MIEs), and detect high-confidence de novo mutations.

Recent whole-genome sequencing (WGS) studies in families estimate the de novo mutation rate in humans at approximately 1.1 × 10⁻⁸ per base pair per haploid genome (1000 Genomes Project Consortium, 2010; Roach et al., 2010). By this estimate, an individual's diploid genome harbors, on average, around 64 de novo mutations among 3.2 billion base pairs. In the consensus coding sequence (CCDS; ~34 Mbp), we expect fewer than one de novo coding mutation per diploid individual.
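The arithmetic behind these expectations can be sketched in a few lines. The helper name expected_denovo is hypothetical, and the inputs are simply the rate and region sizes quoted above, not measured data. The naive whole-genome product comes out near 70, the same order as the ~64 quoted above; the exact figure depends on the rate and genome size assumed.

```shell
#!/usr/bin/env bash
# Expected de novo mutations = rate per bp per haploid genome * region size * 2 copies.
# expected_denovo is a hypothetical helper name used only for this illustration.
expected_denovo() {
    # $1 = mutation rate per bp per haploid genome, $2 = region size in bp
    awk -v rate="$1" -v bp="$2" 'BEGIN { printf "%.2f\n", rate * bp * 2 }'
}

expected_denovo 1.1e-8 3.2e9   # whole genome, prints 70.40
expected_denovo 1.1e-8 34e6    # CCDS (~34 Mbp), prints 0.75
```

The CCDS result (about 0.75) is what underlies the "fewer than one coding mutation" expectation.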

Because de novo mutations are so rare, they should be called conservatively. To that end, VarScan re-evaluates each apparent de novo mutation in both parents using relaxed parameters and re-classifies those with some evidence in one or both parents as germline variants. In a similar manner, VarScan attempts to reconcile apparent Mendelian Inheritance Errors. The output of the trio subcommand is a single VCF in which all variants are classified as germline (transmitted or untransmitted), de novo, or MIE.

Input

This command requires a three-sample mpileup file containing the father, mother, and child (in that order). Generating it will require:
  • The SAMtools package
  • The reference sequence in FASTA format
  • BAM files for the father, mother, and child

Trio Calling Syntax

Trio calling with VarScan 2 is a two-step process.

1. Generate a three-sample mpileup

Here's an example command:
samtools mpileup -B -q 1 -f ref.fasta dad.bam mom.bam child.bam >trio.mpileup
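Before running VarScan, it can be worth confirming that the mpileup really contains three samples. A plain samtools mpileup line has three leading columns (chromosome, position, reference base) followed by three columns per sample (depth, read bases, base qualities), so a trio file should have 12 tab-separated columns. The following is a minimal sketch; the function name check_trio_mpileup is hypothetical, and it assumes no extra per-sample columns were requested when the mpileup was generated.

```shell
#!/usr/bin/env bash
# check_trio_mpileup is a hypothetical helper, not part of SAMtools or VarScan.
check_trio_mpileup() {
    # $1 = path to the mpileup file; only the first 1000 lines are inspected
    head -n 1000 "$1" | awk -F'\t' '
        NF != 12 {
            printf "line %d has %d columns, expected 12\n", NR, NF
            bad = 1
            exit
        }
        END { exit bad }'   # exit status 0 if every inspected line had 12 columns
}
```

Usage: check_trio_mpileup trio.mpileup && echo "looks like a three-sample mpileup"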

2. Run VarScan trio

Here's the syntax for the VarScan subcommand:
java -jar VarScan.jar trio trio.mpileup trio.mpileup.output \
      --min-coverage 10 --min-var-freq 0.20 --p-value 0.05 \
      --adj-var-freq 0.05 --adj-p-value 0.15

Trio Calling Algorithm

VarScan first calls variants across all three samples in the same fashion as it does for mpileup2snp, using the default (or user-provided) --min-var-freq and --p-value settings. Next, it identifies any variants with apparent Mendelian Inheritance Errors (for example, a variant present in the child but absent from both parents). In these instances, it re-calls the samples that should have a variant but were called wild type, using the adjusted settings (--adj-var-freq and --adj-p-value), in an attempt to correct the call. This often resolves the MIE, in which case the corrected genotypes are reported. If not, the site is flagged as mendelError (in the FILTER field) and/or DENOVO (in the INFO field).
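The rescue step can be illustrated with a deliberately simplified sketch. It uses only the variant allele frequency and ignores the p-value test, coverage checks, and genotype logic that VarScan applies, so it is not VarScan's implementation. The function name classify_site and its output labels are hypothetical, and it covers only the case of a child variant that neither parent was initially called with. The thresholds mirror the --min-var-freq 0.20 and --adj-var-freq 0.05 values from the example command.

```shell
#!/usr/bin/env bash
# classify_site is a hypothetical, frequency-only illustration of the two-pass idea.
classify_site() {
    # Arguments: variant reads and total depth for father, mother, child (in that order).
    awk -v fv="$1" -v fdp="$2" -v mv="$3" -v mdp="$4" -v cv="$5" -v cdp="$6" \
        -v min_freq=0.20 -v adj_freq=0.05 '
    function freq(v, d) { return d > 0 ? v / d : 0 }
    BEGIN {
        if (freq(cv, cdp) < min_freq) { print "no_child_variant"; exit }
        # First pass: standard threshold applied to each parent.
        if (freq(fv, fdp) >= min_freq || freq(mv, mdp) >= min_freq) { print "germline"; exit }
        # Second pass: re-call the parents with the relaxed threshold.
        if (freq(fv, fdp) >= adj_freq || freq(mv, mdp) >= adj_freq) { print "germline_rescued"; exit }
        print "denovo_candidate"
    }'
}

classify_site 0 30 0 28 14 30   # prints denovo_candidate
classify_site 2 30 0 28 14 30   # prints germline_rescued (father at ~7% variant reads)
```

The second call shows the point of the relaxed pass: two variant reads in the father fall below the standard threshold but are enough to withdraw the de novo call.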

Output

The above command will produce two VCF output files: one for SNPs (trio.mpileup.output.snp.vcf) and one for indels (trio.mpileup.output.indel.vcf). Relevant fields include:
  • FILTER (VCF column) - mendelError if MIE, otherwise PASS
  • STATUS (INFO) - 1=untransmitted, 2=transmitted, 3=denovo, 4=MIE
  • DENOVO (INFO) - if present, indicates a high-confidence de novo mutation call
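For a quick overview of a trio VCF, the STATUS codes listed above can be tallied with standard tools. This is a minimal sketch, assuming STATUS appears as a STATUS=<n> key in the INFO column (column 8); the function name tally_status is hypothetical.

```shell
#!/usr/bin/env bash
# tally_status is a hypothetical helper; labels follow the STATUS codes listed above.
tally_status() {
    # $1 = VarScan trio VCF; prints "<label><TAB><count>" for each STATUS code seen
    awk -F'\t' '
        BEGIN {
            label[1] = "untransmitted"; label[2] = "transmitted"
            label[3] = "denovo";        label[4] = "MIE"
        }
        /^#/ { next }                       # skip VCF header lines
        {
            n = split($8, kv, ";")          # INFO is a semicolon-separated list
            for (i = 1; i <= n; i++)
                if (kv[i] ~ /^STATUS=/) { code = substr(kv[i], 8); count[code]++ }
        }
        END { for (c = 1; c <= 4; c++) if (c in count) print label[c] "\t" count[c] }
    ' "$1"
}
```

Usage: tally_status trio.mpileup.output.snp.vcf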

Downstream Filtering/Interpretation

Even the stringent methodology in VarScan trio will likely call some candidate de novo mutations that are not real. These should be filtered for false positives (using the child BAM) in the same way that somatic mutations are. They should also be filtered aggressively to remove known germline variants (e.g., by removing common dbSNP variants). A recent whole-genome sequencing study in Dutch families developed a random forest classifier for distinguishing true de novo mutations, and found that the most important factors were related to sequencing depth and read counts. In other words, the ideal de novo mutation call will have high depth (>20x) in all three samples, with good support in the child (40-50% of reads for autosomal calls) and no variant-supporting reads in either parent.
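The "ideal call" criteria above can be applied as a first-pass filter on the trio VCF. The sketch below rests on several assumptions that should be checked against your own output: that the sample columns appear in father, mother, child order (columns 10 to 12), that the FORMAT column contains DP (depth) and AD (variant-supporting reads) keys, and that DENOVO appears as a key in INFO. The function name filter_denovo and the threshold variable names are hypothetical. Only a lower bound is applied to the child's variant read fraction, and this is no substitute for read-level false-positive filtering or dbSNP removal.

```shell
#!/usr/bin/env bash
# filter_denovo is a hypothetical helper; column order and FORMAT keys are assumptions.
filter_denovo() {
    # $1 = VarScan trio SNP or indel VCF; prints records that pass every check
    awk -F'\t' -v depth_floor=20 -v min_child_vaf=0.40 '
        /^#/ { next }
        $7 != "PASS" || $8 !~ /(^|;)DENOVO(;|=|$)/ { next }
        {
            # Locate DP and AD within FORMAT instead of hard-coding their positions.
            n = split($9, key, ":"); dp_i = 0; ad_i = 0
            for (i = 1; i <= n; i++) {
                if (key[i] == "DP") dp_i = i
                if (key[i] == "AD") ad_i = i
            }
            if (!dp_i || !ad_i) next
            for (s = 10; s <= 12; s++) {            # assumed order: father, mother, child
                split($s, val, ":")
                dp[s] = val[dp_i] + 0; ad[s] = val[ad_i] + 0
                if (dp[s] <= depth_floor) next      # require depth above 20x in every sample
            }
            if (ad[10] > 0 || ad[11] > 0) next      # any variant-supporting read in a parent
            if (ad[12] / dp[12] < min_child_vaf) next
            print
        }' "$1"
}
```

Usage: filter_denovo trio.mpileup.output.snp.vcf > denovo.candidates.vcf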