Somatic Mutation Calling

Overview
Commands
Input
Methods
Output

Overview

VarScan calls somatic variants (SNPs and indels) using a heuristic method and a statistical test based on the number of aligned reads supporting each allele.

Command

The syntax of the command for somatic mutation calling differs somewhat from germline calling subcommands.
	
java -jar VarScan.jar somatic normal.pileup tumor.pileup output.basename

The above command will report germline, somatic, and LOH events at positions where both normal and tumor samples have sufficient coverage (default: 8). If the --validation flag is set to 1, it also report non-variant (reference) calls at all positions with sufficient coverage, for the purpose of refuting false positives with validation data.

Input

This command expects both a normal and a tumor file in SAMtools pileup format. from sequence alignments in binary alignment/map (BAM) format. To build a pileup file, you will need:
  • A SAM/BAM file ("myData.bam") that has been sorted using the sort command of SAMtools.
  • The reference sequence ("reference.fasta") to which reads were aligned, in FASTA format.
  • The SAMtools software package.
The command that you'll use depends on the version of SAMtools that you have installed. In Linux/UNIX/MacOSX, it is possible to use piped input rather than actual files to save disk space and I/O. This is most easily done in a simple Perl script; examples are below.

SAMtools Before v0.1.16 (r973): Use pileup

Older versions of SAMtools support the pileup command, which expects a single BAM and a reference sequence. If possible, do NOT include -c or -v, as these will generate consensus format. VarScan supports this format but does not use SAMtools consensus information.
samtools pileup -f [reference sequence] [BAM file] >myData.pileup

This Perl snippet shows you how to pipe input from SAMtools into VarScan:
$normal_pileup = "samtools pileup -f $reference $normal_bam";
$tumor_pileup = "samtools pileup -f $reference $tumor_bam";

To limit the pileup to reads with mapping quality > 0 (recommended), use this variation:
$normal_pileup = "samtools view -b -u -q 1 $normal_bam | samtools pileup -f $reference -";
$tumor_pileup = "samtools view -b -u -q 1 $tumor_bam | samtools pileup -f $reference -";

Next, issue a system call that pipes input from these commands into VarScan :
bash -c \"java -jar VarScan.jar somatic <\($normal_pileup\) <\($tumor_pileup\) output

SAMtools v0.1.17 (r973) and Later: Use mpileup

As of SAMtools v0.1.17 (r973) and later, the pileup command is deprecated and has been replaced with mpileup to accommodate multi-sample calling. You can still generate a pileup file, but make sure you provide only a single BAM:
samtools mpileup -f [reference sequence] [BAM file] >myData.pileup

Do NOT use any of the variant- or consensus-calling parameters. You just want the raw pileup output. This Perl snippet shows you how to pipe input from SAMtools into VarScan:
$normal_pileup = "samtools mpileup -f $reference $normal_bam";
$tumor_pileup = "samtools mpileup -f $reference $tumor_bam";

To limit the pileup to reads with mapping quality > 0 (recommended), use this variation:
$normal_pileup = "samtools mpileup -q 1 -f $reference $normal_bam";
$tumor_pileup = "samtools mpileup -q 1 -f $reference $tumor_bam";

Next, issue a system call that pipes input from these commands into VarScan :
bash -c \"java -jar VarScan.jar somatic <\($normal_pileup\) <\($tumor_pileup\) output


Methods


Normal/Tumor Input Parsing

In somatic mode, VarScan reads the pileup files from normal and tumor simultaneously. Only positions that are present in both files, and meet the minimum coverage in both files, will be compared. VarScan expects that positions on the same chromosome occur in ascending order, so input should be position-sorted! Please note, this kind of simultaneous parsing gets tricky when there are numerous reference sequences (e.g. unplaced contigs) to which reads from only one sample aligned. VarScan tries to obtain the maximum number of comparisons, even if it means closing and reopening the normal file to try to match contig and position. This can lead to looping errors; please report them if you see a number of warmings about "Resetting normal file..." and (if possible) send us sample pileups.

Variant Calling and Comparison

At every position where both normal and tumor have sufficient coverage, a comparison is made. First, normal and tumor are called independently using the germline consensus calling functionality. Then, their genotypes are compared by the following algorithm:

    If tumor matches normal:
        If tumor and normal match the reference
            ==> Call Reference
        Else tumor and normal do not match the reference
            ==> Call Germline

    Else tumor does not match normal:
    Calculate significance of allele frequency difference by Fisher's Exact Test
        If difference is significant (p-value < threshold):
            If normal matches reference
                ==> Call Somatic
            Else If normal is heterozygous
                ==> Call LOH
            Else normal and tumor are variant, but different
                ==> Call IndelFilter or Unknown
        Else difference is not significant:
        Combined tumor and normal read counts for each allele. Recalculate p-value.
            ==> Call Germline

Output

VarScan creates two output files by default, one for SNVs (.snp) and one for indels (.indel). If the --validation flag is turned on, a third file (.validation) will also be generated containing all positions that were called. Output files have headers, and all share the same format. The table below describes each field in VarScan native output format, along with where to find the same value in VCF's INFO (column 8), NORMAL (column 10), or TUMOR (column 11) fields.
Native Output FieldVCF FieldDescription
chromCHROMChromosome or reference name
positionPOSPosition from pileup (1-based)
refREFReference base at this position
varALTVariant base seen in tumor
normal_reads1RD (col 10)Reads supporting reference in normal
normal_reads2AD (col 10)Reads supporting variant in normal
normal_var_freqFREQ(col 10)Variant allele frequency in normal
normal_gtGT (col 10)Consensus genotype call in normal
tumor_reads1RD (col 11)Reads supporting reference in tumor
tumor_reads2AD (col 11)Reads supporting variant in tumor
tumor_var_freqFREQ(col 11)Variant allele frequency in tumor
tumor_gtGT (col 11)Consensus genotype call in tumor
somatic_statusSS (col 8)Somatic status call (Germline, Somatic, LOH, or Unknown)
variant_p_valueGPV (col 8)Variant p-value for Germline events
somatic_p_valueSPV (col 8)Somatic p-value for Somatic/LOH events
tumor_reads1_plusDP4 (col 11)Tumor reference-supporting reads on + strand
tumor_reads1_minusDP4 (col 11)Tumor reference-supporting reads on - strand
tumor_reads2_plusDP4 (col 11)Tumor variant-supporting reads on + strand
tumor_reads2_minus  DP4 (col 11)Tumor variant-supporting reads on - strand
normal_reads1_plusDP4 (col 10)Normal reference-supporting reads on + strand
normal_reads1_minusDP4 (col 10)Normal reference-supporting reads on - strand
normal_reads2_plusDP4 (col 10)Normal variant-supporting reads on + strand
normal_reads2_minusDP4 (col 10)Normal variant-supporting reads on - strand

Isolating Calls by Type and Confidence

The latest release of VarScan includes a new (undocumented) subcommand that will separate a somatic output file by somatic_status (Germline, Somatic, LOH). Somatic mutations will further be classified as high-confidence (.hc) or low-confidence (.lc). The command:
java -jar VarScan.jar processSomatic [output.snp]

The above command will produce 4 output files:
output.snp.Somatic.hc 	(high-confidence Somatic mutations)
output.snp.Somatic.lc 	(low-confidence Somatic mutations)
output.snp.Germline 	(sites called Germline)
output.snp.LOH 		(sites called loss-of-heterozygosity, or LOH)

A Note on Indel Positions

VarScan calls consensus bases, SNPs, and indels at the position reported by SAMtools in the pileup file. For SNPs and consensus bases, this is the 1-based position of the site or variant. Indels, however, are reported at the base immediately upstream of where they occur. Thus, the first inserted base occurs at (position + 1) and the first deleted base occurs at (position + 1). This is why, for deletions, the reference base (e.g. A) may not match the deleted sequence (e.g. */C).