Support FAQ

This page contains answers to many of the common questions asked about VarScan usage, performance, input/output, etc.

USAGE QUESTIONS

Which version of VarScan should I use?
Which VarScan command should I use?
Can I use VarScan on WGS, exome, or RNA-seq data?
Does VarScan work for pooled samples?
How do I use VarScan for validation?
Can I use VarScan on my model organism or microbial genome?

INPUT QUESTIONS

What input format should I give to VarScan?
Which aligner should I use?
Do I need Illumina or Phred scale base qualities?
What about mapping qualities?

OUTPUT QUESTIONS

What do the output columns mean?
How are the p-values calculated?
Why are all of my p-values 0.98?
How should I filter the output files?
Does VarScan provide a confidence score similar to SAMtools?

COMMON ISSUES, WARNINGS, AND ERRORS

Warnings about "resetting normal" or "resetting tumor" file
Read counts from VarScan are different from SAMtools/IGV counts


USAGE QUESTIONS

Which version of VarScan should I use?

Always use the latest version! Currently, this is v2.3.x. There are differences between versions; notably, between v2.2.2 and v2.2.3 we adjusted how reads are counted for indels, which had exhibited strange behavior due to some peculiarities in how SAMtools represents indels in the pileup files.

Which VarScan command should I use?

If you have a single sample and wish to call variants, you can use pileup2snp to call SNPs, pileup2indel to call indels, or pileup2cns to call consensus genotypes at every position meeting the coverage requirement. To call both SNPs and indels simultaneously, use pileup2cns with the --variants parameter set to 1. If you have tumor-normal pairs, you should use the somatic command to call mutations (SNVs/indels) and the copynumber command to call copy number alterations (CNAs).

Can I use VarScan on WGS, exome, or RNA-seq data?

Yes, you can use VarScan for any of these. The default settings are optimized for exome data, where one expects to have at least 10x or 20x coverage across targeted exons. By lowering the minimum coverage requirement (to perhaps 6x) one can call variants in lower-coverage data. Warning: the output files for WGS data may be quite large, and the runtimes longer, since it will call 3+ million variants per genome. VarScan works for RNA-seq data as well, though this data tends to be noisier for variant calling due to RT errors, allele-specific expression, and alignment difficulties.

Does VarScan work for pooled samples?

Absolutely. It's simply a matter of setting appropriate input parameters, particularly --min-coverage, --min-var-freq, and --p-value. In pooled data, you might specify a higher minimum coverage, a lower variant allele frequency (conservatively, 0.50 / the number of samples), and a less-stringent p-value to detect rare variants.

How do I use VarScan for validation?

If validating germline SNPs/indels on an orthogonal sequencing platform, use the pileup2cns command, which will call consensus genotypes at all positions with sufficient coverage, rather than just the variants. This lets you determine sites that are refuted as wild-type. You might start with the following parameters:
	
--min-coverage 20
--min-var-freq 0.08
--p-value 0.05
If validating somatic SNPs/indels on an orthogonal sequncing platform, use the somatic command with --validation set to 1. You might start with the following parameters:
	
--min-coverage 20
--min-var-freq 0.08
--p-value 0.10
--somatic-p-value 0.05
--validation 1


Can I use VarScan on my model organism or microbial genome?

Of course. All you need is a BAM file of your reads aligned to a reference sequence. If you don't have a reference sequence to which you might align reads, then no.

INPUT QUESTIONS

What input format should I give to VarScan?

VarScan expects its input in SAMtools pileup format, which is obtained from a BAM file via the samtools pileup command. For example:
	
samtools pileup -f myReference.fasta myReads.bam >myPileup.pileup
java -jar VarScan.jar pileup2snp myPileup.pileup
To save on disk space and I/O, you can also use a UNIX "pipe" command to forward the pileup output directly into VarScan:
	
samtools pileup -f myRef.fasta myBam.bam | java -jar VarScan.jar pileup2snp


Which aligner should I use?

Great question! The choice of an aligner is an important one, and entire review articles have been written on the topic. For practical purposes, you should use an aligner whose output is (or can be converted to) a SAM/BAM file. I have heard that popular aligners for include BWA, Bowtie, and Novoalign for Illumina data, SHRiMP and BFAST for SOLiD data, and SSAHA2 and BWA-SW for Roche/454 data.

Do I need Illumina or Phred scale base qualities?

VarScan expects Phred-scaled (Phred+33, also called "Sanger") base qualities, in which a score of 20 indicates a 1/1000 probability of base error. By default, VarScan requires a minimum base quality of 15 or 20 (depending on the application), so this value should be adjusted appropriately if alternate base qualities are used.

What about mapping qualities?


Many aligners provide a Phred-scaled quality value (mapping quality) for every read's alignment, which is correlated to the probability that the read is correctly mapped. A mapping quality of 0 typically indicates that the given read has many possible mapping locations of equal probability, and that the location given was chosen randomly. Thus, it's best to exclude reads with mapping quality of 0 from most downstream analyses. A minimum mapping quality of 10 is even better. It's possible to apply this threshold to a BAM file using SAMtools, as follows:
samtools view -b -q 10 myBam.bam | samtools pileup -f myRef.fasta - 
| java -jar VarScan.jar pileup2snp


OUTPUT QUESTIONS

What do the output columns mean?

By default, all VarScan output files should include headers. For detailed descriptions of the output columns and their meanings, see the "output" sections of the germline or somatic calling pages. You can also specify VCF output, which is widely documented.

How are the p-values calculated?

P-values are calculated using a Fisher's Exact Test on the read counts supporting reference and variant alleles. For details, see the germline or somatic calling pages.

Why are all of my p-values 0.98?

The p-value calculations are computationally expensive, so if the user doesn't specify a p-value threshold, VarScan skips the calculation in pileup2snp, pileup2indel, and pileup2cns. Instead, it inserts a dummy value of 0.98. To get p-values, set the --p-value parameter to something like 0.10, 0.05, or 0.01.

How should I filter the output files?

VarScan's default parameters aim to be sensitive when detecting variants, with the trade-off that it will report a lot of candidate variants, many of which will be false positives. These should be further filtered by coverage, variant allele frequency, strand representation, and p-value to isolate high-confidence calls. When performing the initial discovery, it is recommended that you set --strand-filter to 1. After discovery, you should run VarScan filter or somaticFilter to further refine your variant calls.

Does VarScan provide a confidence score similar to SAMtools?

It is possible to calculate a Phred-scaled confidence score from the p-value that VarScan provides:
$score = -10 * log10($p_value);
$score = 255 if($score > 255);
This is done on a per-sample basis when --vcf-output is specified, since VCF format expects Phred-scaled confidence scores.

COMMON ISSUES, WARNINGS, AND ERRORS

Warnings about "resetting normal" or "resetting tumor" file

If you provide two separate input (pileup) files, VarScan attempts to use the chromosome and position to simultaneously parse both files and match them up. Doing so while accounting for alphanumeric chromosome names is difficult, especially when there are chromosomes for which only one sample had coverage. VarScan does its best to ensure that no chromosomes are missed due to a sorting issue by resetting one sample's pileup file, to ensure that no chromosomes are missed due to sorting. When you have many chromosomes or contigs with coverage in just one sample, you'll see a lot of these warnings.

The best way to address this simultaneous parsing issue is quite simple: provide VarScan with a two-sample MPILEUP file (normal and tumor) instead of two individual pileup files:
samtools mpileup -f reference.fasta -q 1 -B normal.bam tumor.bam >normal-tumor.mpileup 
java -jar VarScan.jar somatic normal-tumor.mpileup normal-tumor.varScan.output --mpileup 1
In the mpileup file, SAMtools already does the chromosome and position matching-up, so there's no room for error. You'll also notice that I provided the -B parameter. This disables SAMtools BAQ computation, which is turned on by default in SAMtools mpileup, but (as the author admits) occasionally too stringent for variant calling.

Read counts from VarScan are different from SAMtools/IGV counts

By default, SAMtools and IGV show and count all bases at a given position, regardless of base quality. In contrast, VarScan requires that bases meet the minimum Phred quality score (default 15 for most commands) to count them for things like read counts (reads1, reads2) and to compute variant allele frequency. However, when VarScan reports the depth (such as in the DP field of VCF output), it reports SAMtools raw depth. To get VarScan read counts to more closely match another tool, set use parameter --min-avg-qual 0. And use caution! Low-quality bases, with the occasional exception of BAQ penalties, should not be trusted.

Also, VarScan reports variants on a biallelic basis. That is, for a given SNP call, the "reads1" column is the number of reference-supporting reads (RD), and the "reads2" column is the number of variant-supporting reads (AD). There may be additional reads at that position showing other bases (SNP or indel variants). If these other variants meet the calling criteria, they will be reported in their own line. If not, it may look like you have "missing" reads.