Dear all,
which tools would you recommend to use for the annotation of SV (structural variants) with gene information (SV including deletions, insertions, translocations, inversions, duplications). thank you !
-- bogdan
Dear all,
which tools would you recommend to use for the annotation of SV (structural variants) with gene information (SV including deletions, insertions, translocations, inversions, duplications). thank you !
-- bogdan
This very much depends on what sort of annotation you want to do.
If you're interested in breakend overlap annotation, then I recommend using my StructuralVariantAnnotation package to convert to a paired breakend GRanges object then using the standard bioconductor packages to annotate stuff like repeat overlaps, gene position and orientation (ie is this a viable fusion gene), and so on.
If you're wanting to do annotations of the functional impact of the variant, then this is very much a different problem and the state of the art in this is well behind that of SNVs and small indels.
Dear Daniel, thank you for your suggestions, and congratulations for the publication of GRIDSS package.
Regarding the package "StructuralVariantAnnotation", I have started learning how to use it, and if I may ask please 2 questions, I would appreciate having your replies :
-- please would it be possible to share a piece of R code that uses "findBreakpointOverlaps(gr)" in order to infer the overlap of the breakpoints with exons, introns or intergenic regions ? -- after the overlap of the breakpoints with the gene information, is there any function available that infer gene fusions or truncations ? (i used to do it with a perl script, although having everything in R, it would be very helpful).
if you wish, we can also talk via email. My email address is tanasa@gmail.com. Thank you ;)
please would it be possible to share a piece of R code that uses "findBreakpointOverlaps(gr)" in order to infer the overlap of the breakpoints with exons, introns or intergenic regions ?
The nice thing about the breakpoint format I'm using (a normal bioconductor GRanges object with just an extra $partner field to identify the other side) is that you don't have to do anything special for this. Looking at breakend overlaps is just like looking at SNV/small indel overlaps.
You can see an example of this in https://github.com/PapenfussLab/gridss/blob/master/example/somatic-fusion-gene-candidates.R which is a very simple script that checks for candidate gene fusion. In line 52, you can see that to calculating the gene overlapping a breakend just uses the standard GenomicRanges
findOverlaps
function. findBreakpointOverlaps
is useful when you want find overlaps between breakpoints (e.g. comparing a SV caller to a truth set, or creating an ensemble SV call set), but if you're just comparing breakends to genomic intervals, StructuralVariantAnnotation
is only needed for the conversion of VCF to the breakend GRanges format.
StructuralVariantAnnotation will convert the output of SV callers into the same GRanges format regardless of whether the caller used TRA/BND or DEL/DUP/INV format to make the call. It works with VCFv4.2 or VCFv4.3 compliant output, but I've added some extra code to handle some of the more popular callers that don't follow the VCF specs correctly.
I've used it successfully on breakdancer, cortex, crest, delly, gridss, hydra, lumpy, manta, pindel, and socrates (although you need to use my conversion script to turn the breakdancer & socrates outputs into VCF format).
Hi Daniel, thank you, the R code works very well. If I may add a question please:
when using breakpointRanges(vcf) to transform a VCF into a GR, how are strands + and - especially defined/assigned ? I am using a vcf file from DELLY, where the TRA (and other SV) are encoded as 3to5, 5to3, 3to3, or 5tot5. thanks a lot !
Would indicate a deletion as the DNA leading up to chr1:100 is connected to the DNA at chr1:200 and beyond.
Dear Daniel, thank you for your reply.
Please if I may add a note below, as I would like to understand much better how the "+" and "-" are assigned please :
we have been using DELLY on a few cancer genomes that we had, and DELLY gives the following SV:
DEL are always 3to5
DUP are always 5to3
INV can be 5to5 or 3to3
TRA can be 5to5, 3to3, 5to3 or 3to5.
How do these SV above translate into a notation with + and - ? thank you very much again !
The special case handling for DELLY TRA code can be found https://github.com/PapenfussLab/StructuralVariantAnnotation/blob/master/R/extensions-VCF.R#L336 .
I just reviewed the code and the custom DELLY 5to5 and 3to3 attributes are not used for inversions (I just found this issue myself with manta INV3 and INV5 non-standard attributes). The current version of StructuralVariantAnnotation will ignore 5to5 and 3to3 and report both inversion breakpoints.
Dear Daniel, thanks a lot ! we wish you a happy and fruitful week !
If I may add a question, this time if about TRA, specifically about 3to5, and 5to3 TRANSLOCATIONS.
When do we define a TRA as 3to5, and when it is a TRA 5to3 ? I believe that these are RECIPROCAL TRA.
thank you !
And, if I may add please another question (that is a bit naive as I have not checked the answer yet):
depending on the type of SV and depending on the type of SV direction/orientation (3to5, 5to3, 5to5, 3tot3),
shall we refer to INVERSIONS,
would we need to include in the R script that computes the gene fusions a piece of R code for instance that inverts the INV coordinates, or the function "breakpointRanges" does it internally already ?
I am asking because I would also like to compute the TRUNCATIONS. thank you very much ;)
In my gene fusion example code, gene orientation and the breakpoint direction must be consistent with the fusion. An upstream gene in the fusion must have the strand matching the breakpoint orientation (+
strand transcript would require a +
breakpoint orientation), and the downstream gene requires the oppposite.
Truncation logic will be quite similar.
Dear Daniel,
talking about TRUNCATIONS, shall we define the gene TRUNCATIONS, based on your R code that is available at https://github.com/PapenfussLab/gridss/blob/master/example/somatic-fusion-gene-candidates.R, I believe that we only have to change the lines 70-72 of the R code by "negation" of "couldBeFivePrimeEnd" :
gr <- gr[ (gr$couldBeThreePrimeStart & (! partner(gr)$couldBeFivePrimeEnd)) | ( (! gr$couldBeFivePrimeEnd) & partner(gr)$couldBeThreePrimeStart),]
thank you very much !
The DELLY authors updated their output to use BND notation (which is good) but they only write one of the two records required for a valid VCF BND breakpoint (which is bad).
Technically my tool is doing the correct thing but that isn't particularly useful for users. I have a few other projects taking my time at the moment but I should have an updated version that handles the DELLY notational non-compliance by the end of this month.
Edit: as a workaround, you can update the DELLY VCF to include the other side of the BND breakpoint.
@Kevin Blighe dear Kevin Blighe,
i had the gsa microarray data in which i wanted to identify the cnv's .... the cnv were known through illumina genome studio software which gave txt file like this.
SampleID BookmarkType Chr Start End Size Author CreatedDate Value Comment 203389250004_R01C01 [1] CNV Bin: Min 2.5 To Max 3.5 3 197788369 197871052 82683 12/28/2019 7:31:44 PM 3 CNV Confidence: 58.26665 203389250004_R01C01 [1] CNV Bin: Min 0.5 To Max 1.5 9 105596881 105835972 239091 12/28/2019 7:31:44 PM 1 CNV Confidence: 95.1899
for annotation of this txt file with annovar tool i converted it into vcf format which looked something like this
3 197788369 197871052 82683 3 CNV Confidence: 58.26665 CNV Bin: Min 2.5 To Max 3.5 203389250004_R01C01 [1],,12/28/2019 7:31:44 PM 9 105596881 105835972 239091 1 CNV Confidence: 95.1899 CNV Bin: Min 0.5 To Max 1.5 203389250004_R01C01 [1],,12/28/2019 7:31:44 PM
i want to know is this file good for inputting it into annovar and also how to annotate this vcf file using annovar, once i have downloaded all the database defined by you in this post earlier i have followed the commands given by you
perl annovar/convert2annovar.pl -format vcf4 -withzyg --allsample -outfile anotation.ann input.vcf
perl annovar/table_annovar.pl anotation.ann annovar/humandb/ -buildve hg19 -remove -otherinfo -onetranscript -protocol refGene,cytoBand,esp6500siv2_all,exac03,dbnsfp35c,avsnp150,cosmic70 -operation g,r,f,f,f,f,f -nastring "NA" -csvout it generates file something like this
Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,cytoBand,esp6500siv2_all,ExAC_ALL,ExAC_AFR,ExAC_AMR,ExAC_EAS,ExAC_FIN,ExAC_NFE,ExAC_OTH,ExAC_SAS,SIFT_score,SIFT_converted_rankscore,SIFT_pred,LRT_score,LRT_converted_rankscore,LRT_pred,MutationTaster_score,MutationTaster_converted_rankscore,MutationTaster_pred,MutationAssessor_score,MutationAssessor_score_rankscore,MutationAssessor_pred,FATHMM_score,FATHMM_converted_rankscore,FATHMM_pred,PROVEAN_score,PROVEAN_converted_rankscore,PROVEAN_pred,MetaSVM_score,MetaSVM_rankscore,MetaSVM_pred,MetaLR_score,MetaLR_rankscore,MetaLR_pred,M-CAP_score,M-CAP_rankscore,M-CAP_pred,MutPred_score,MutPred_rankscore,fathmm-MKL_coding_score,fathmm-MKL_coding_rankscore,fathmm-MKL_coding_pred,Eigen_coding_or_noncoding,Eigen-raw,Eigen-PC-raw,GenoCanyon_score,GenoCanyon_score_rankscore,integrated_fitCons_score,integrated_fitCons_score_rankscore,integrated_confidence_value,GERP++_RS,GERP++_RS_rankscore,phyloP100way_vertebrate,phyloP100way_vertebrate_rankscore,phyloP20way_mammalian,phyloP20way_mammalian_rankscore,phastCons100way_vertebrate,phastCons100way_vertebrate_rankscore,phastCons20way_mammalian,phastCons20way_mammalian_rankscore,SiPhy_29way_logOdds,SiPhy_29way_logOdds_rankscore,Interpro_domain,GTEx_V6p_gene,GTEx_V6p_tissue,avsnp150,cosmic70,Otherinfo1,Otherinfo2,Otherinfo3 3,197788369,197788369,0,0,"ncRNA_intronic","ANKRD18DP",NA,NA,NA,"3q29",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,". CNV Confidence: 58.26665
i want to know the rsids of the cnv for this particular position only and so on in my file. Can you please help me out with this and if i am wrong somewhere please correct me.
thankyou,
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
You have these stored in VCF format?
yes, Kevin, the files are in VCF format.
Can you try something like ANNOVAR? - start from the Quick Start-Up Guide. Installing everything can take a while.
Then, to annotate a VCF, you would do something like this:
ANNOVAR Also allows you to annotate via the Database of Genomic Variants (DGV).
SV annotation (with OMIM, DGV, 1000g, haploinsufficiency, TAD, ... and also with your own in-house information) can be easily automated !
You can look at this post describing the annotSV tool: Annotation for SV and CNV