Somatic Copy Number Alteration (CNA) Calling

Recommended Workflow
Filtering/Adjusting Results with copyCaller
Segmentation and Classification
Recurrent CNA Identification with CMDS


VarScan v2.2.4 and later includes novel functionality to infer somatic copy number changes using data from matched tumor-normal pairs (manuscript under review). This is designed specifically for exome sequencing, in which a tumor sample and its matched normal were captured and sequenced under identical conditions. By performing a pairwise comparison of read depth between the samples at each position in the exome, it becomes possible to infer relative changes in copy number in the tumor sample. The output of this tool is a set of regions with chromosome, start, stop, and log-base-2 of the copy number change, which is similar to array-based copy number data and therefore amenable to the same segmentation algorithms.

Recommended Workflow

If you're just getting started, here is the recommended procedure for performing copy number analysis using VarScan. Let's assume you have a normal BAM and a tumor BAM file from targeted (e.g. exome) sequencing, and would like to identify tumor-specific (somatic) copy number changes.
1. Run VarScan copynumber on normal and tumor mpileup output
samtools mpileup -q 1 -f ref.fa normal.bam tumor.bam | 
java -jar VarScan.jar copynumber varScan --mpileup 1
This will create a single output file, varScan.copynumber, containing the raw copynumber calls.

2. Run VarScan copyCaller to adjust for GC content and make preliminary calls.
java -jar VarScan.jar copyCaller varScan.copynumber --output-file varScan.copynumber.called
 [--output-homdel-file varScan.copynumber.called.homdel]
This will create two output files: varScan.copynumber.called (adjusted calls) and varScan.copynumber.called.gc (GC adjustment information). As of version 2.2.12 (August 2012), you can also specify an output file for candidate homozygous deletions.

3. Apply circular binary segmentation using the DNAcopy library from BioConductor
See the section on CBS segmentation for how to do this. A per-chromosome basis is usually best.

4. Visualize CBS results and recenter if necessary.
The DNAcopy package includes utilities to visualize the data points and segments. If all of the data and segments are consistently above or below the neutral value (0.0), you can re-center the data points with VarScan copyCaller. Then repeat step 3.

5. Merge adjacent segments of similar copy number and classify events by size (large-scale or focal)
You can download the utility script from the VarScan scripts folder on SourceForge.


The syntax of the command for copy number calling is most similar to the VarScan somatic tool.
java -jar VarScan.jar copynumber normal-tumor.pileup output.basename --mpileup 1
Or, if you have independent normal and tumor pileup files:
java -jar VarScan.jar copynumber normal.pileup tumor.pileup output.basename

The above command will read data from normal and tumor files simultaneously, compute the Q>20 read depths for each, and report the relative copy number in the tumor on a log scale.


This command expects input from normal and tumor in SAMtools pileup format. To obtain this format, you will need:
  • Sorted BAM files for normal and tumor.
  • The reference sequence ("reference.fasta") to which reads were aligned, in FASTA format.
  • The SAMtools software package.
The command that you'll use depends on the version of SAMtools that you have installed. In Linux/UNIX/MacOSX, it is possible to use piped input rather than actual files to save disk space and I/O. This is most easily done in a simple Perl script; examples are below.

SAMtools v0.1.17 (r973) and Later: Use mpileup

As of SAMtools v0.1.17 (r973) and later, the pileup command is deprecated and has been replaced with mpileup to accommodate multi-sample calling. VarScan will read in a single mpileup file containing normal and tumor data, respectively. Do NOT use any of the variant- or consensus-calling parameters. You just want the raw mpileup output. This Perl snippet shows you how to pipe input from SAMtools into VarScan:
samtools mpileup -q 1 -f $reference $normal_bam $tumor_bam | 
java -jar VarScan.jar copynumber output.basename --mpileup 1
The command should be typed in a single line. It's shown in two lines above for readability.

To limit the pileup to reads with mapping quality > 0 (recommended), use this variation:
samtools mpileup -q 1 -f $reference $normal_bam $tumor_bam | 
java -jar VarScan.jar copynumber output.basename --mpileup 1

SAMtools Before v0.1.16 (r973): Use pileup

Older versions of SAMtools support the pileup command, which expects a single BAM and a reference sequence. If possible, do NOT include -c or -v, as these will generate consensus format. VarScan supports this format but does not use SAMtools consensus information.
samtools pileup -f [reference sequence] [BAM file] >myData.pileup

This Perl snippet shows you how to pipe input from SAMtools into VarScan:
$normal_pileup = "samtools pileup -f $reference $normal_bam";
$tumor_pileup = "samtools pileup -f $reference $tumor_bam";

To limit the pileup to reads with mapping quality > 0 (recommended), use this variation:
$normal_pileup = "samtools view -b -u -q 1 $normal_bam | samtools pileup -f $reference -";
$tumor_pileup = "samtools view -b -u -q 1 $tumor_bam | samtools pileup -f $reference -";

Next, issue a system call that pipes input from these commands into VarScan :
bash -c \"java -jar VarScan.jar copynumber <\($normal_pileup\) <\($tumor_pileup\) output


Normal/Tumor Input Parsing

In copynumber mode, VarScan reads the pileup files from normal and tumor simultaneously. Only positions that are present in both files, and meet the minimum coverage at least one of them, will be compared. VarScan expects that positions on the same chromosome occur in ascending order, so input should be position-sorted! Please note, this kind of simultaneous parsing gets tricky when there are numerous reference sequences (e.g. unplaced contigs) to which reads from only one sample aligned. VarScan tries to obtain the maximum number of comparisons, even if it means closing and reopening the normal file to try to match contig and position. This can lead to looping errors; please report them if you see a number of warmings about "Resetting normal file..." and (if possible) send us sample pileups.

Read Depth Comparison

At each position, VarScan computes the Q>20 read depths for tumor and normal samples independently, and then the ratio of tumor depth (Td) to normal depth (Nd). To account for differences in sequence data input, this ratio can be normalized with the --data-ratio parameter, which should be a decimal between 0.00 and 1.00, and should reflect the amount of unique on-target bases in the normal BAM (Ni) relative to the tumor BAM. The relative copy number in tumor is computed from the log-base-2 of the ratio of tumor depth to normal depth.

Segment Delineation

VarScan will report contiguous regions with similar Q>20 read depths, one per line. Region boundaries are determined by (1) gaps of 2 or more consecutive positions that do not have sufficient coverage, or (2) change-points at which the ratio of normal depth to tumor depth changes significantly (p < 0.05 with Fisher's Exact Test of read depths at this position compared to those at previous position). To minimize the noise due to off-target reads, reported regions must meet or exceed the value of the --min-region-size parameter (default 100).


VarScan copynumber output files have this format:
chromChromosome or reference name
chr_startStart position of a contiguous copy number rgion (1-based)
chr_stopStop position of a contiguous copy number rgion (1-based)
normal_depth  Average sequence depth in the normal
tumor_depthAverage sequence depth in the tumor
log_ratioLog-base-2 ratio of the tumor/normal depth ratio
gc_contentProportion of GC bases in the region, between 0 and 100 (v2.2.7 and later)

Filtering/Adjusting Results with copyCaller

As of VarScan v2.2.7, the copyCaller command can be used to process VarScan2 copynumber output. This command allows you to:
  • Filter copynumber calls by minimum coverage and/or region size.
  • Adjust raw copynumber (log2) values for GC content.
  • Classify each region as amplification (gain), deletion (loss), or neutral based on your preferred log2 thresholds.
  • Recenter raw copynumber data if neutral segments are not on the log2=0 axis.
The usage is as follows:
USAGE: java -jar VarScan.jar copyCaller [varScan.copynumber] OPTIONS
	Raw output from the VarScan copynumber command (eg. varScan.output.copynumber)

	--output-file	Output file to contain the calls
	--min-coverage	Minimum read depth at a position to make a call [8]
	--amp-threshold	Lower bound for log ratio to call amplification [0.25]
	--del-threshold	Upper bound for log ratio to call deletion (provide as positive number) [0.25]
	--min-region-size	Minimum size (in bases) for a region to be counted [10]
	--recenter-up	Recenter data around an adjusted baseline > 0 [0]
	--recenter-down	Recenter data around an adjusted baseline < 0 [0]

Segmentation and Classification

Ideally, the raw output from VarScan copynumber should be smoothed and segmented by a program such as the DNAcopy library of the BioConductor project. Once you've installed the R library, you might process VarScan copynumber output as follows:
cn <- read.table("",header=F)
CNA.object <-CNA( genomdat = cn[,6], chrom = cn[,1], maploc = cn[,2], data.type = 'logratio')
CNA.smoothed <- smooth.CNA(CNA.object)
segs <- segment(CNA.smoothed, verbose=0, min.width=2)
segs2 = segs$output
write.table(segs2[,2:6], file="out.file", row.names=F, col.names=F, quote=F, sep="\t")
Thanks to Chris Miller for providing this code sample.

Recurrent CNA Identification with CMDS

Identifying recurrent copy number alterations (RCNAs) across multiple tumors can enrich for events driving tumor development and progression. We recently published Correlation Matrix Diagonal Segmentation (CMDS) to search for RCNAs using SNP array-based copy number data, and this program can be applied to exome-based copy number data as well. Doing so requires assigning copy number values to a discrete set of markers across the genome. The coordinates of known exons, or of exome targets, can serve this purpose. By matching the coordinates of your marker set to each sample's copynumber regions, it's possible to retrieve the log ratio for this sample at this position. If you would like to do this and need assistance, please send us an e-mail or post to the forum.