VarScan
User Guides
Somatic Copy Number Alteration (CNA) Calling
OverviewRecommended Workflow
Commands
Input
Methods
Output
Filtering/Adjusting Results with copyCaller
Segmentation and Classification
Recurrent CNA Identification with CMDS
Overview
VarScan v2.2.4 and later includes novel functionality to infer somatic copy number changes using data from matched tumor-normal pairs (manuscript under review). This is designed specifically for exome sequencing, in which a tumor sample and its matched normal were captured and sequenced under identical conditions. By performing a pairwise comparison of read depth between the samples at each position in the exome, it becomes possible to infer relative changes in copy number in the tumor sample. The output of this tool is a set of regions with chromosome, start, stop, and log-base-2 of the copy number change, which is similar to array-based copy number data and therefore amenable to the same segmentation algorithms.Recommended Workflow
If you're just getting started, here is the recommended procedure for performing copy number analysis using VarScan. Let's assume you have a normal BAM and a tumor BAM file from targeted (e.g. exome) sequencing, and would like to identify tumor-specific (somatic) copy number changes.1. Run VarScan copynumber on normal and tumor mpileup output
samtools mpileup -q 1 -f ref.fa normal.bam tumor.bam | java -jar VarScan.jar copynumber varScan --mpileup 1This will create a single output file, varScan.copynumber, containing the raw copynumber calls.
2. Run VarScan copyCaller to adjust for GC content and make preliminary calls.
java -jar VarScan.jar copyCaller varScan.copynumber --output-file varScan.copynumber.called [--output-homdel-file varScan.copynumber.called.homdel]This will create two output files: varScan.copynumber.called (adjusted calls) and varScan.copynumber.called.gc (GC adjustment information). As of version 2.2.12 (August 2012), you can also specify an output file for candidate homozygous deletions.
3. Apply circular binary segmentation using the DNAcopy library from BioConductor
See the section on CBS segmentation for how to do this. A per-chromosome basis is usually best.
4. Visualize CBS results and recenter if necessary.
The DNAcopy package includes utilities to visualize the data points and segments. If all of the data and segments are consistently above or below the neutral value (0.0), you can re-center the data points with VarScan copyCaller. Then repeat step 3.
5. Merge adjacent segments of similar copy number and classify events by size (large-scale or focal)
You can download the mergeSegements.pl utility script from the VarScan scripts folder on SourceForge.
Command
The syntax of the command for copy number calling is most similar to the VarScan somatic tool.java -jar VarScan.jar copynumber normal-tumor.pileup output.basename --mpileup 1Or, if you have independent normal and tumor pileup files:
java -jar VarScan.jar copynumber normal.pileup tumor.pileup output.basename
The above command will read data from normal and tumor files simultaneously, compute the Q>20 read depths for each, and report the relative copy number in the tumor on a log scale.
Input
This command expects input from normal and tumor in SAMtools pileup format. To obtain this format, you will need:- Sorted BAM files for normal and tumor.
- The reference sequence ("reference.fasta") to which reads were aligned, in FASTA format.
- The SAMtools software package.
SAMtools v0.1.17 (r973) and Later: Use mpileup
As of SAMtools v0.1.17 (r973) and later, the pileup command is deprecated and has been replaced with mpileup to accommodate multi-sample calling. VarScan will read in a single mpileup file containing normal and tumor data, respectively. Do NOT use any of the variant- or consensus-calling parameters. You just want the raw mpileup output. This Perl snippet shows you how to pipe input from SAMtools into VarScan:samtools mpileup -q 1 -f $reference $normal_bam $tumor_bam | java -jar VarScan.jar copynumber output.basename --mpileup 1The command should be typed in a single line. It's shown in two lines above for readability.
To limit the pileup to reads with mapping quality > 0 (recommended), use this variation:
samtools mpileup -q 1 -f $reference $normal_bam $tumor_bam | java -jar VarScan.jar copynumber output.basename --mpileup 1
SAMtools Before v0.1.16 (r973): Use pileup
Older versions of SAMtools support the pileup command, which expects a single BAM and a reference sequence. If possible, do NOT include -c or -v, as these will generate consensus format. VarScan supports this format but does not use SAMtools consensus information.samtools pileup -f [reference sequence] [BAM file] >myData.pileup
This Perl snippet shows you how to pipe input from SAMtools into VarScan:
$normal_pileup = "samtools pileup -f $reference $normal_bam"; $tumor_pileup = "samtools pileup -f $reference $tumor_bam";
To limit the pileup to reads with mapping quality > 0 (recommended), use this variation:
$normal_pileup = "samtools view -b -u -q 1 $normal_bam | samtools pileup -f $reference -"; $tumor_pileup = "samtools view -b -u -q 1 $tumor_bam | samtools pileup -f $reference -";
Next, issue a system call that pipes input from these commands into VarScan :
bash -c \"java -jar VarScan.jar copynumber <\($normal_pileup\) <\($tumor_pileup\) output
Methods
Normal/Tumor Input Parsing
In copynumber mode, VarScan reads the pileup files from normal and tumor simultaneously. Only positions that are present in both files, and meet the minimum coverage at least one of them, will be compared. VarScan expects that positions on the same chromosome occur in ascending order, so input should be position-sorted! Please note, this kind of simultaneous parsing gets tricky when there are numerous reference sequences (e.g. unplaced contigs) to which reads from only one sample aligned. VarScan tries to obtain the maximum number of comparisons, even if it means closing and reopening the normal file to try to match contig and position. This can lead to looping errors; please report them if you see a number of warmings about "Resetting normal file..." and (if possible) send us sample pileups.Read Depth Comparison
At each position, VarScan computes the Q>20 read depths for tumor and normal samples independently, and then the ratio of tumor depth (Td) to normal depth (Nd). To account for differences in sequence data input, this ratio can be normalized with the --data-ratio parameter, which should be a decimal between 0.00 and 1.00, and should reflect the amount of unique on-target bases in the normal BAM (Ni) relative to the tumor BAM. The relative copy number in tumor is computed from the log-base-2 of the ratio of tumor depth to normal depth.Segment Delineation
VarScan will report contiguous regions with similar Q>20 read depths, one per line. Region boundaries are determined by (1) gaps of 2 or more consecutive positions that do not have sufficient coverage, or (2) change-points at which the ratio of normal depth to tumor depth changes significantly (p < 0.05 with Fisher's Exact Test of read depths at this position compared to those at previous position). To minimize the noise due to off-target reads, reported regions must meet or exceed the value of the --min-region-size parameter (default 100).Output
VarScan copynumber output files have this format:Field | Description |
chrom | Chromosome or reference name |
chr_start | Start position of a contiguous copy number rgion (1-based) |
chr_stop | Stop position of a contiguous copy number rgion (1-based) |
normal_depth | Average sequence depth in the normal |
tumor_depth | Average sequence depth in the tumor |
log_ratio | Log-base-2 ratio of the tumor/normal depth ratio |
gc_content | Proportion of GC bases in the region, between 0 and 100 (v2.2.7 and later) |
Filtering/Adjusting Results with copyCaller
As of VarScan v2.2.7, the copyCaller command can be used to process VarScan2 copynumber output. This command allows you to:- Filter copynumber calls by minimum coverage and/or region size.
- Adjust raw copynumber (log2) values for GC content.
- Classify each region as amplification (gain), deletion (loss), or neutral based on your preferred log2 thresholds.
- Recenter raw copynumber data if neutral segments are not on the log2=0 axis.
USAGE: java -jar VarScan.jar copyCaller [varScan.copynumber] OPTIONS INPUT: Raw output from the VarScan copynumber command (eg. varScan.output.copynumber) OPTIONS: --output-file Output file to contain the calls --min-coverage Minimum read depth at a position to make a call [8] --amp-threshold Lower bound for log ratio to call amplification [0.25] --del-threshold Upper bound for log ratio to call deletion (provide as positive number) [0.25] --min-region-size Minimum size (in bases) for a region to be counted [10] --recenter-up Recenter data around an adjusted baseline > 0 [0] --recenter-down Recenter data around an adjusted baseline < 0 [0]
Segmentation and Classification
Ideally, the raw output from VarScan copynumber should be smoothed and segmented by a program such as the DNAcopy library of the BioConductor project. Once you've installed the R library, you might process VarScan copynumber output as follows:library(DNAcopy) cn <- read.table("your.cn.file",header=F) CNA.object <-CNA( genomdat = cn[,6], chrom = cn[,1], maploc = cn[,2], data.type = 'logratio') CNA.smoothed <- smooth.CNA(CNA.object) segs <- segment(CNA.smoothed, verbose=0, min.width=2) segs2 = segs$output write.table(segs2[,2:6], file="out.file", row.names=F, col.names=F, quote=F, sep="\t")Thanks to Chris Miller for providing this code sample.