I am trying to put together a dataset of hg38 human genome SNP data (CDS only) that contains information on:
- SNP chromosomal location
- Allele frequencies
- Allele change
- Codon change
- Peptide change
- Variant consequence
- Gene within which SNP is located
Initially I tried to use biomaRt's getBM to get variation data, which included everything I wanted except allele frequencies (minor allele frequencies are available, but only a small fraction of the entries actually have a value).
I eventually gave up on biomaRt for variation data because it keeps crashing R. I believe my code is correct because it works on the smaller chromosomes (22, for instance) and only fails on the larger ones. From reading other threads I can see that the variation data in BioMart can be fiddly to acquire because of the size of the datasets.
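One workaround that is often suggested for large variation queries is to split each chromosome into windows and query one window at a time. The following is only a sketch under assumptions: `make_windows` and `fetch_snps_in_windows` are hypothetical helper names, and the filter names `chr_name`, `start` and `end` are what I would expect for the `hsapiens_snp` dataset, but they should be checked with `listFilters()` before use.

```r
# Build consecutive, non-overlapping windows covering 1..chrom_length.
make_windows <- function(chrom_length, window_size) {
  starts <- seq(1, chrom_length, by = window_size)
  ends <- pmin(starts + window_size - 1, chrom_length)
  data.frame(start = starts, end = ends)
}

# Query the SNP mart one window at a time and stack the results.
# ASSUMPTION: "chr_name", "start" and "end" are valid filters for the
# mart passed in; confirm with biomaRt::listFilters(mart).
fetch_snps_in_windows <- function(mart, chrom, windows, attributes) {
  chunks <- lapply(seq_len(nrow(windows)), function(k) {
    biomaRt::getBM(attributes = attributes,
                   filters = c("chr_name", "start", "end"),
                   values = list(chrom, windows$start[k], windows$end[k]),
                   mart = mart)
  })
  do.call(rbind, chunks)
}

# Toy example of the windowing step only (no network access needed).
windows <- make_windows(chrom_length = 2500, window_size = 1000)
print(windows)
```

Smaller windows mean more requests but less data per request, so the window size is something to tune by trial.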
So my new approach is to acquire the data via the UCSC Table Browser, which gives me all the information above except the gene within which the SNP is located. I have managed to get gene information from biomaRt using:
library(biomaRt)

# Connect to the Ensembl genes mart for human
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
attributes <- listAttributes(ensembl)

# Get all chromosome names
chroms <- getBM(attributes = 'chromosome_name', mart = ensembl)

# Download exon and CDS coordinates one chromosome at a time
for (i in seq_len(nrow(chroms))) {
  gene.table <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'chromosome_name',
                                     'ensembl_transcript_id', 'transcript_start', 'transcript_end',
                                     'exon_chrom_start', 'exon_chrom_end', 'phase', 'end_phase',
                                     'genomic_coding_start', 'genomic_coding_end', 'cds_start', 'cds_end'),
                      filters = 'chromosome_name',
                      values = chroms[i, 1],
                      mart = ensembl)
  # Drop rows with missing values, i.e. exons without coding coordinates
  gene.table <- gene.table[complete.cases(gene.table), ]
  write.table(gene.table, file = paste0("chrom.", chroms[i, 1], ".gene.txt"), sep = '\t', quote = FALSE)
}
From here I can use the SNP chromosomal location together with the genomic_coding_start and genomic_coding_end coordinates to identify which gene each SNP belongs to.
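As a rough illustration of that assignment step, here is a minimal base R sketch on made-up data. The data frames, the `snp_id` and `position` columns, and the `assign_snps_to_genes` function are all hypothetical; only the gene column names mirror the biomaRt attributes. Two things to watch when combining the sources: UCSC tables store 0-based start coordinates and use names like `chr22`, whereas Ensembl coordinates are 1-based and the chromosome is named `22`, so both need harmonising first.

```r
# Toy coding intervals (hypothetical values, Ensembl-style 1-based coordinates)
genes <- data.frame(
  external_gene_name   = c("GENE_A", "GENE_A", "GENE_B"),
  chromosome_name      = c("22", "22", "22"),
  genomic_coding_start = c(100L, 300L, 1000L),
  genomic_coding_end   = c(200L, 400L, 1100L),
  stringsAsFactors = FALSE
)

# Toy SNPs (hypothetical values, already converted to 1-based positions)
snps <- data.frame(
  snp_id          = c("snp1", "snp2", "snp3"),
  chromosome_name = c("22", "22", "22"),
  position        = c(150L, 250L, 1050L),
  stringsAsFactors = FALSE
)

assign_snps_to_genes <- function(snps, genes) {
  # Pair every SNP with every coding interval on the same chromosome
  pairs <- merge(snps, genes, by = "chromosome_name")
  # Keep the pairs where the SNP lies inside the coding interval
  inside <- pairs$position >= pairs$genomic_coding_start &
            pairs$position <= pairs$genomic_coding_end
  hits <- pairs[inside, c("snp_id", "chromosome_name", "position", "external_gene_name")]
  # A SNP can hit several exons or transcripts of one gene; report it once
  unique(hits)
}

result <- assign_snps_to_genes(snps, genes)
print(result)
```

The all-pairs merge is fine for a toy table but scales badly; for genome-wide data an interval-overlap tool such as `findOverlaps` from the Bioconductor GenomicRanges package would likely be a better fit.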
The reason for my post is that this seems like a rather unwieldy way of acquiring what I assume is a commonly used dataset. What I would like to know is whether there are more efficient ways of getting what I need and, if not, whether my own methodology is sound.
As an aside, I did intend to use VEP but have had real issues attempting to install it or use it via the virtual machine.
Hi Pierre, thanks for your reply. I have tried installing snpEff but have had no luck. I get the error: Unsupported major.minor version 51.0. I understand that I need to upgrade my Java version, but this is proving troublesome too. Using java -version I get:
java version "1.6.0_65"
Java(TM) SE Runtime Environment (build 1.6.0_65-b14-468-11M4833)
Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-468, mixed mode)
However, using /Library/Internet\ Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/bin/java -version I get:
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
I know I need 1.7 or higher, but am unsure why it won't work.
What exactly is your snpEff command, including the java part? And what is your OS?
I am running OS X El Capitan. I downloaded snpEff here: http://snpeff.sourceforge.net/download.html and had no trouble installing it. However, when I run the command:
I get the following errors:
To check my java version I use:
Which gives me:
However, when I use the command:
I get:
So my guess is that although I have Java 1.8 installed it is not configured?
and what is the output of
???
Ah that provides a list of databases! So to run it I must call Java from the correct place? Many thanks. Is there no way to set Java 1.8 as the default? Again, many thanks for your help.
Do you remember installing Java from Java.com? If not, do that.
Actually I am still receiving errors. Whilst the above command worked, when I use:
I get the following error:
at org.snpeff.snpEffect.commandLine.SnpEffCmdDownload.run(SnpEffCmdDownload.java:72)
at org.snpeff.SnpEff.run(SnpEff.java:1182)
at org.snpeff.SnpEff.main(SnpEff.java:162)
The database is the one listed within the snpEff.config file (as advised in the documentation).
Looks like you are not able to download the database automatically (behind a firewall?). You may need to download manually as described in the documentation.