I have spent sometime trying to understand/analyze exome CNV (copy number variation). There are many tools to do so, but here I am using varscan2, DNACopy and GISTIC.
Varscan2: http://varscan.sourceforge.net/index.html
DNACopy: http://www.bioconductor.org/packages/2.3/bioc/html/DNAcopy.html
Gistic2: ftp://ftp.broadinstitute.org/pub/GISTIC2.0/GISTICDocumentation_standalone.htm
Step-1: Run varscan2 on normal-tumor-mpileup and make copy-number call
java -jar $varscan copynumber normal-tumor-mpileup $outputFile --mpileup 1 --p-value 0.005 --min-coverage 30 --min-map-qual 20 --min-base-qual 20 --data-ratio Calculate_From_Data
The output file has suffix .copynumber
, for e.g., filename.copynumber
. It contains all the segments above certain threshold, which are generally in large number, ranging close to 1 million per file. The output should further be processed to find significant amplified/deleted CNA or detect reoccurrent SNVs in large cohort. We will first do it for single T/N pair and on the population. Newer version of samtools mpileup file might throw an error while running copynumber.
Step-2: Use CBS binary segmentation in R to find segmented region. Use mergeSegment.pl (provided by varscan) to merge them
library(DNAcopy)
P1 = read.table ("filename.copynumber", header=TRUE)
CNA.object <- CNA( genomdat = P1[,7], chrom = P1[,1], maploc = P1 [,2], data.type = 'logratio')
CNA.object.smoothed <- smooth.CNA(CNA.object)
CNA.object.smoothed.seg <- segment(CNA.object.smoothed, verbose=0, min.width=2)
seg.pvalue <- segments.p(CNA.object.smoothed.seg, ngrid=100, tol=1e-6, alpha=0.05, search.range=100, nperm=1000)
write.table (seg.pvalue, file="cbs_P1_pvalue", row.names=F, col.names=F, quote=F, sep="\t")
# This will not print the p-value.
#segmented = CNA.object.smoothed.seg$output
#write.table (segmented[,2:6], file="cbs_P1", row.names=F, col.names=F, quote=F, sep="\t")
Step-2.1: Input the output file to mergeSegments.pl [downloaded from varscan2]. Note: mergeSegment.pl needs an additional column-2, where "chrom" needs to inserted.
cat cbs_P1_pvalue | awk 'BEGIN {OFS="\t"} { print $1,"chrom",$2,$3,$4,$5,$6,$7,$8,$9,$10 }' > formattedInputFile perl mergeSegment.pl formattedInputFile --ref-arm-sizes $refArmSize --output-basename filteredOutput
Now, from the output file, significant CN amplified or deleted regions can be detected.
cat filteredOutput | grep -e amplification -e deletion cat filteredOutput | grep large-scale cat filteredOutput | grep neutral
Have a closer look at the output file. The output files also reports few regions which are neutral and few large scale deletions.
Step-3: Now, let's try to find recurrent CNV among population, for which I use varscan and GISTIC. The output from varscan copynumber is feed to GISTIC. To run GISTIC, you need to make segment file and marker position. Also refer to the GISTIC documentation ftp://ftp.broadinstitute.org/pub/GISTIC2.0/GISTICDocumentation_standalone.htm
Step-3.1: Make marker position
cat filename.copynumber | grep -v -e chrom -e chrM -e chrX -e chrY | awk 'BEGIN {OFS="\t"} { print "ID'",$1,$2 }' | sed 's/chr//g' > markerPosition
Step-3.2: Make segment file. The output from mergeSegment.pl, which in this example with start with suffix "filteredOutput' should be used to make segment file.
name=filteredOutput cat $name | awk 'BEGIN {OFS="\t"} { print "ID",$2,$3,$4,$5,$6 }' | grep -v -e chrM -e chrX -e chrY | sed 's/chr//g' > segmentedFile
Step-4: Run GISTIC. Please read the documentation on how to install/execute GISTIC.
markersfile=markerPosition
segfile=segmentedFile
gp_gistic2_from_seg -b $basedir -seg $segfile -mk $markersfile -refgene $refgenefile -genegistic 1 -smallmem 1 -broad 1 -brlen 0.5 -conf 0.90 -armpeel 1 -savegene 1 -gcm extreme
Step-5: Check the output plots. The amplified and deletion plots and the output text files are ready.
From output file of GISTIC, read information from this site.
Please feel free to contribute. I will update when I find better solution.
I edited the best I could to make this more readable. Though, you might want to add some more links/background on the approach and explain some of the variables being used (just a suggestion).
I think $name at step 3.2 need to be replaced with actual file name.
I use Varscan on exome data, so it seems to me that I should the exome target list as Marker list - maybe by taking the exon mid point (since markerPosition only allows one coordinate). Thoughts?
Incidentally, I've found that using the exome target list as whitelist in samtools mpileup greatly reduces noise in the Varscan output.
Can i use this method on WGS data?
It would be very inefficient. I'd recommend CNVnator instead for WGS.
Thank you. After that run GISTIC for CNAs?
GISTIC is normally used to find recurrent CNV among patients, and there are other tools as well to find recurrent CNVs. First, you need to find enriched CNV segments for each patient, using anytool (CNVnator or any other tools), then you can use GISTIC2 to find recurrent CNV.
Thank you
But since GISTIC need segmentation file and marker file which require number of markers to be filled in the segmentation file and the marker position in the marker file. Seg.CN i can get from the CNV segments from CNVnator. So, was thinking if CNVnator output provides these information.
There are many tools to find recurrent CNVs but GISTIC is the most popular i guess
Hi rse, did you figure out a way to run GISTIC2 after calling CNVs with CNVnator?
No, i could not run GISTIC2
Thanks for your process.I have a question,could please interpret why should exclude chrM chrY chrX in the gistic analysis?
Sorry,
Could I use copy number in .csv format from caveman for gistic input?
Thank you for providing this tutorial as I found tailoring this together is very hard for me
Could I ask from where I can provide
refArmSize
please?