Tutorial:Tutorial: Analyze exome Copy number variation (CNV) in single patient or in population.
1
34
Entering edit mode
9.3 years ago
Chirag Nepal ★ 2.4k

I have spent sometime trying to understand/analyze exome CNV (copy number variation). There are many tools to do so, but here I am using varscan2, DNACopy and GISTIC.

Varscan2: http://varscan.sourceforge.net/index.html

DNACopy: http://www.bioconductor.org/packages/2.3/bioc/html/DNAcopy.html

Gistic2: ftp://ftp.broadinstitute.org/pub/GISTIC2.0/GISTICDocumentation_standalone.htm

http://gdac.broadinstitute.org/runs/analyses__2013_05_23/reports/cancer/KICH-TP/CopyNumber_Gistic2/nozzle.html

Step-1: Run varscan2 on normal-tumor-mpileup and make copy-number call

java -jar $varscan copynumber normal-tumor-mpileup $outputFile --mpileup 1 --p-value 0.005 --min-coverage 30 --min-map-qual 20 --min-base-qual 20 --data-ratio Calculate_From_Data

The output file has suffix .copynumber, for e.g., filename.copynumber. It contains all the segments above certain threshold, which are generally in large number, ranging close to 1 million per file. The output should further be processed to find significant amplified/deleted CNA or detect reoccurrent SNVs in large cohort. We will first do it for single T/N pair and on the population. Newer version of samtools mpileup file might throw an error while running copynumber.

Step-2: Use CBS binary segmentation in R to find segmented region. Use mergeSegment.pl (provided by varscan) to merge them

library(DNAcopy)
P1 = read.table ("filename.copynumber", header=TRUE)
CNA.object <- CNA( genomdat = P1[,7], chrom = P1[,1], maploc = P1 [,2], data.type = 'logratio')
CNA.object.smoothed     <- smooth.CNA(CNA.object)
CNA.object.smoothed.seg <- segment(CNA.object.smoothed, verbose=0, min.width=2)
seg.pvalue <- segments.p(CNA.object.smoothed.seg, ngrid=100, tol=1e-6, alpha=0.05, search.range=100, nperm=1000)
write.table (seg.pvalue,       file="cbs_P1_pvalue", row.names=F, col.names=F, quote=F, sep="\t")

# This will not print the p-value.
#segmented = CNA.object.smoothed.seg$output
#write.table (segmented[,2:6], file="cbs_P1", row.names=F, col.names=F, quote=F, sep="\t")
  • Step-2.1: Input the output file to mergeSegments.pl [downloaded from varscan2]. Note: mergeSegment.pl needs an additional column-2, where "chrom" needs to inserted.

    cat cbs_P1_pvalue | awk 'BEGIN {OFS="\t"} { print $1,"chrom",$2,$3,$4,$5,$6,$7,$8,$9,$10 }' > formattedInputFile
    perl mergeSegment.pl formattedInputFile --ref-arm-sizes $refArmSize --output-basename filteredOutput
    

    Now, from the output file, significant CN amplified or deleted regions can be detected.

    cat filteredOutput | grep -e amplification -e deletion
    cat filteredOutput | grep large-scale
    cat filteredOutput | grep neutral
    

    Have a closer look at the output file. The output files also reports few regions which are neutral and few large scale deletions.

Step-3: Now, let's try to find recurrent CNV among population, for which I use varscan and GISTIC. The output from varscan copynumber is feed to GISTIC. To run GISTIC, you need to make segment file and marker position. Also refer to the GISTIC documentation ftp://ftp.broadinstitute.org/pub/GISTIC2.0/GISTICDocumentation_standalone.htm

http://genepattern.broadinstitute.org/ftp/distribution/genepattern/modules_public_server_doc/GISTIC2.pdf

  • Step-3.1: Make marker position

    cat filename.copynumber | grep -v -e chrom -e chrM -e chrX -e chrY | awk 'BEGIN {OFS="\t"} { print "ID'",$1,$2 }' | sed 's/chr//g' > markerPosition
    
  • Step-3.2: Make segment file. The output from mergeSegment.pl, which in this example with start with suffix "filteredOutput' should be used to make segment file.

    name=filteredOutput
    cat $name | awk 'BEGIN {OFS="\t"} { print "ID",$2,$3,$4,$5,$6 }' | grep -v -e chrM -e chrX -e chrY | sed 's/chr//g' > segmentedFile
    

Step-4: Run GISTIC. Please read the documentation on how to install/execute GISTIC.

markersfile=markerPosition
segfile=segmentedFile
gp_gistic2_from_seg -b $basedir -seg $segfile -mk $markersfile -refgene $refgenefile -genegistic 1 -smallmem 1 -broad 1 -brlen 0.5 -conf 0.90 -armpeel 1 -savegene 1 -gcm extreme

Step-5: Check the output plots. The amplified and deletion plots and the output text files are ready.

From output file of GISTIC, read information from this site.

http://www.genomespace.org/support/guides/recipes/sections/identify-biological-functions-for-genes-in-cnv-regions

Please feel free to contribute. I will update when I find better solution.

GISTIC varscan2 CNV copy-number • 25k views
ADD COMMENT
3
Entering edit mode
Do not use on-target data, use off-target data! See Revolution (?) in CNA detection using exome/targeted sequencing
ADD REPLY
1
Entering edit mode

I edited the best I could to make this more readable. Though, you might want to add some more links/background on the approach and explain some of the variables being used (just a suggestion).

ADD REPLY
0
Entering edit mode

I think $name at step 3.2 need to be replaced with actual file name.

ADD REPLY
0
Entering edit mode

I use Varscan on exome data, so it seems to me that I should the exome target list as Marker list - maybe by taking the exon mid point (since markerPosition only allows one coordinate). Thoughts?

Incidentally, I've found that using the exome target list as whitelist in samtools mpileup greatly reduces noise in the Varscan output.

ADD REPLY
0
Entering edit mode

Can i use this method on WGS data?

ADD REPLY
0
Entering edit mode

It would be very inefficient. I'd recommend CNVnator instead for WGS.

ADD REPLY
0
Entering edit mode

Thank you. After that run GISTIC for CNAs?

ADD REPLY
0
Entering edit mode

GISTIC is normally used to find recurrent CNV among patients, and there are other tools as well to find recurrent CNVs. First, you need to find enriched CNV segments for each patient, using anytool (CNVnator or any other tools), then you can use GISTIC2 to find recurrent CNV.

ADD REPLY
0
Entering edit mode

Thank you

But since GISTIC need segmentation file and marker file which require number of markers to be filled in the segmentation file and the marker position in the marker file. Seg.CN i can get from the CNV segments from CNVnator. So, was thinking if CNVnator output provides these information.

ADD REPLY
0
Entering edit mode

There are many tools to find recurrent CNVs but GISTIC is the most popular i guess

ADD REPLY
0
Entering edit mode

Hi rse, did you figure out a way to run GISTIC2 after calling CNVs with CNVnator?

ADD REPLY
0
Entering edit mode

No, i could not run GISTIC2

ADD REPLY
0
Entering edit mode

Thanks for your process.I have a question,could please interpret why should exclude chrM chrY chrX in the gistic analysis?

ADD REPLY
0
Entering edit mode

Sorry,

Could I use copy number in .csv format from caveman for gistic input?

ADD REPLY
0
Entering edit mode

Thank you for providing this tutorial as I found tailoring this together is very hard for me

Could I ask from where I can provide refArmSize please?

ADD REPLY
2
Entering edit mode
9.1 years ago
wzlnot ▴ 20

Replenish some detail and suggests:

  1. In DNAcopy output part, the col.names and row.names suggested to set True. The mergeSegments.pl can handle this.
  2. The chromosome name between --ref-arm-sizes file and DNAcopy output should be consistent**. If not, the mergeSegments.pl will malfunction.
  3. Filter the un-located scaffolds or mtDNA if them exist. mergeSegments.pl assign the hash key with the chromosome name in --ref-arm-size file, and call hash value by use the chromsome name in DNAcopy output. If chromsome of DNAcopy is not in --ref-arm-size file, then cause the Use of uninitialized value ... error.

So, suggest:

cat CNA.smoothed.segs.pvalue | awk -F "\t" '$3~/chr[1-9,XY]/{print}' > mergeSegments.input
ADD COMMENT

Login before adding your answer.

Traffic: 1134 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6