We analyze a lot of CNVs, both those called with tools like Birdsuite or PennCNV and those imputed from GWAS data using reference panels such as the one from 1000 Genomes. In terms of tools, I can recommend the two steps that we use: the first finds overlaps with genes, using Bedtools; the second looks up disease associations, using gene2mesh.
Start with a BED file of CNVs, such as this:
chr22 39378403 39388216 esv2666691 MERGED_DEL_2_106009
chr5 151514804 151518864 esv2666686 MERGED_DEL_2_32905
You also need a BED file containing all genes in the genome, or a subset of them. You may also use the individual exons of genes (contact me for such a file):
chr1 11873 12227 DDX11L1
chr1 12612 12721 DDX11L1
chr1 13220 14409 DDX11L1
chr1 14361 14829 WASH7P
chr1 14969 15038 WASH7P
chr1 15795 15947 WASH7P
...
Using Bedtools, run the intersection like so (intersectBed is the older name for the "bedtools intersect" subcommand):
intersectBed -a cnvs.bed -b refseq_exons.bed -wb
With -wb, each output line reports the overlapping portion of the CNV followed by the matching record from the gene file, like so:
chr22 39387563 39388216 esv2666691 MERGED_DEL_2_106009 chr22 39387563 39394225 APOBEC3B-AS1
chr22 39358280 39388216 esv2666691 MERGED_DEL_2_106009 chr22 39353526 39388783 APOBEC3A_B
chr22 39358280 39359188 esv2666691 MERGED_DEL_2_106009 chr22 39353526 39359188 APOBEC3A
chr22 39378403 39388216 esv2666691 MERGED_DEL_2_106009 chr22 39378403 39388784 APOBEC3B
chr5 151514804 151518864 esv2666686 MERGED_DEL_2_32905 chr5 151338458 151650010 CTB-12O2.1
For the second step, finding links with disease, I have found gene2mesh to be very helpful. It links genes to MeSH keywords; other resources, including OMIM, may also be helpful. For gene2mesh, the following Perl script takes the output genes and fetches the top MeSH terms for each one. In this case, we put the genes from the intersectBed output above into a file called "gene_list.txt". The script, which we modified from the one on the website, looks like so:
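#!/usr/bin/perl
use strict;
use warnings;
use XML::XPath;
use XML::XPath::XMLParser;
use LWP::UserAgent;

my $ua   = LWP::UserAgent->new;
my $file = "gene_list.txt";    # one gene symbol per line

open(my $fh, '<', $file) or die "Cannot open $file: $!";
while (my $line = <$fh>) {
    # Take the first whitespace-delimited field as the gene symbol
    my ($gene) = split ' ', $line;
    next unless defined $gene;    # skip blank lines

    # limit=5 caps the number of terms returned per gene
    my $getf     = "http://gene2mesh.ncibi.org/fetch?genesymbol=${gene}&limit=5";
    my $response = $ua->get($getf);
    unless ($response->is_success) {
        warn "Request failed for $gene: " . $response->status_line . "\n";
        next;
    }

    # Print the name of every MeSH descriptor in the XML response
    my $xp = XML::XPath->new(xml => $response->content);
    # Header text kept from the original script; with limit=5 fewer than 30 terms are listed
    print "## Top 30 MeSH Terms from Gene2MeSH Associated with GeneID $gene ##\n\n";
    foreach my $g2mNode ($xp->find('//Descriptor/Name')->get_nodelist) {
        print $g2mNode->string_value . "\n";
    }
}
close($fh);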
And running it on a short list of genes would give:
## Top 30 MeSH Terms from Gene2MeSH Associated with GeneID APOBEC3B ##
Cytidine Deaminase
vif Gene Products, Human Immunodeficiency Virus
Gene Products, vif
HIV-1
HIV Infections
CNV annotation (with OMIM, DGV, 1000g, haploinsufficiency, TAD, ... and also with your own in-house information) can be easily automated!
You can look at this post describing the annotSV tool: Annotation for SV and CNV