Acquiring and combining hg38 SNP and gene data
2
0
Entering edit mode
7.5 years ago
spiral01 ▴ 110

I am trying to put together a dataset of hg38 human genome SNP data (CDS only) that contains information on:

  • SNP chromosomal location
  • Allele frequencies
  • Allele change
  • Codon change
  • Peptide change
  • Variant consequence
  • Gene within which SNP is located

Initially I tried to use BiomaRt's getBM to get variation data, which included everything I wanted except allele frequencies (you can get minor allele frequencies but only very few of the entries actually have any data for this).

I eventually gave up on BiomaRt for variation data as it keeps crashing R (I know my code is correct because it works on the smaller chromosomes - 22 for instance - but fails on the larger ones). From reading other threads I can see that the variation data on BiomaRt can be fiddly to acquire due to the size of the datasets.

So my new approach is to acquire data via the UCSC table browser which gives me all the information above except the gene within which the SNP is located. I have managed to get gene information from BiomaRt using:

ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
attributes = listAttributes(ensembl)
#Get all chromosome names
chroms <- getBM(attributes = 'chromosome_name',
                mart = ensembl)

for(i in 1:length(chroms[,])){
  gene.table <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'chromosome_name', 
                                     'ensembl_transcript_id', 'transcript_start', 'transcript_end', 
                                     'exon_chrom_start', 'exon_chrom_end', 'phase', 'end_phase', 
                                     'genomic_coding_start', 'genomic_coding_end', 'cds_start', 'cds_end'), 
                      filters = c('chromosome_name'),
                      values = chroms[i,1], 
                      mart = ensembl)

  gene.table <- data.frame(t(sapply(gene.table,c)))
  gene.table <- t(gene.table)
  gene.table <- gene.table[complete.cases(gene.table),]

  write.table(gene.table, file=paste("chrom.",chroms[i,1],".gene.txt", sep=''), sep='\t', quote=FALSE)
}

From here I can use the SNP chromosomal location and the gene_coding_start and gene_coding_end coordinates to identify to which gene each SNP belongs.

The reason for my post is that this seems like a rather unwieldy approach to acquiring what I assume is a commonly used dataset. What I would like to know is whether there are more efficient ways of getting what I need, and if not is my own methodology sound.

As an aside, I did intend to use VEP but have had real issues attempting to install it or use it via the virtual machine.

SNP gene • 2.5k views
ADD COMMENT
2
Entering edit mode
7.5 years ago

I would 'just' get the dbsnp vcf file for hg38 and annotate it with snpeff or vep.

ADD COMMENT
0
Entering edit mode

HI Pierre, thanks for your reply. I have tried installing snpeff but have had no luck. I get the error: Unsupported major.minor version 51.0. I understand that I need to upgrade my Java version, but this is proving troublesome too. Using Java -version I get:

java version "1.6.0_65" Java(TM) SE Runtime Environment (build 1.6.0_65-b14-468-11M4833) Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-468, mixed mode)

However, using /Library/Internet\ Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/bin/java -version I get:

java version "1.8.0_131" Java(TM) SE Runtime Environment (build 1.8.0_131-b11) Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

I know I need 1.7 or higher, but am unsure why it won't work.

ADD REPLY
0
Entering edit mode

what is exactly your snpeff command please, including the java cmd + What is your OS ?

ADD REPLY
0
Entering edit mode

I am running OS X El Capitan. I downloaded snpEff here: http://snpeff.sourceforge.net/download.html and had no trouble instilling it. However, when I run the command:

java -jar snpEff.jar databases

I get the following errors:

Exception in thread "main" java.lang.UnsupportedClassVersionError: org/snpeff/SnpEff : Unsupported major.minor version 51.0
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
    at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

To check my java version I use:

java -version

Which gives me:

java version "1.6.0_65" Java(TM) SE Runtime Environment (build 1.6.0_65-b14-468-11M4833) Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-468, mixed mode)

However, when I use the command:

 /Library/Internet\ Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/bin/java -version

I get:

java version "1.8.0_131" Java(TM) SE Runtime Environment (build 1.8.0_131-b11) Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

So my guess is that although I have Java 1.8 installed it is not configured?

ADD REPLY
0
Entering edit mode

and what is the output of

 /Library/Internet\ Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/bin/java  -jar snpEff.jar databases

???

ADD REPLY
0
Entering edit mode

Ah that provides a list of databases! So to run it I must call Java from the correct place? Many thanks. Is there no way to set Java 1.8 as the default? Again, many thanks for your help.

ADD REPLY
0
Entering edit mode

Do you remember installing Java from Java.com? If not do that.

ADD REPLY
0
Entering edit mode

Actually I am still receiving errors. Whilst the above command worked, when I use:

/Library/Internet\ Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/bin/java -jar snpEff.jar download database.dbsnp.GRCh38

I get the following error:

java.lang.RuntimeException: Property: 'database.dbsnp.GRCh38.genome' not found
    at org.snpeff.interval.Genome.<init>(Genome.java:106)
    at org.snpeff.snpEffect.Config.readGenomeConfig(Config.java:680)
    at org.snpeff.snpEffect.Config.readConfig(Config.java:648)
    at org.snpeff.snpEffect.Config.init(Config.java:479)
    at org.snpeff.snpEffect.Config.<init>(Config.java:117)
    at org.snpeff.SnpEff.loadConfig(SnpEff.java:451)
    at org.snpeff.snpEffect.commandLine.SnpEffCmdDownload.runDownloadGenome(SnpEffCmdDownload.java:80)

The database is the one listed within the snpEff.config file (as advised in the documentation). at org.snpeff.snpEffect.commandLine.SnpEffCmdDownload.run(SnpEffCmdDownload.java:72) at org.snpeff.SnpEff.run(SnpEff.java:1182) at org.snpeff.SnpEff.main(SnpEff.java:162)

ADD REPLY
1
Entering edit mode

Looks like you are not able to download the database automatically (behind a firewall?). You may need to download manually as described in the documentation.

ADD REPLY
1
Entering edit mode
7.5 years ago
genecats.ucsc ▴ 580

You should be able to use the UCSC Table Browser to get this information. Select the snp147CodingDbSnp table as your primary table, and then choose "selected fields from primary and related tables" as your output option, where you can select fields from the snp147 table.

Then I think you want the following selections from snp147CodingDbSnp: chrom, chromStart, chromEnd, name, transcript, alleles, codons, peptides

and the following fields from snp147: refNCBI/UCSC, observed, func, alleleFreqs

this leads to output like the following:

# position rsId NM_transcript codons peptides ref observed function frequency
chr9    133255667   133255668   rs782400398 NM_020469   TGA,CGA,    *,R,    A   A/G stop-loss   0.999974,0.000026,
chr9    133255668   133255669   rs782672373 NM_020469   CCG,CCA,    P,P,    C   C/T coding-synon    
chr9    133255669   133255670   rs56392308  NM_020469   CCG,CGT,    P,R,    G   -/C frameshift  0.106304,0.893696,

If you have any more questions about the UCSC Genome Browser, feel free to send your question to our public mailing list at genome@soe.ucsc.edu so our whole team can see your question. If your question involves private data, you can send it to genome-www@soe.ucsc.edu instead.

Thanks, ChrisL from the UCSC Genome Browser

ADD COMMENT
0
Entering edit mode

Hi Chris,

Many thanks for your response and apologies for my own slow reply. I have been using UCSC data such as you've shown above, but I also need to identify in which gene each SNP is located. As far as I can see this is not possible from the table browser and would involve cross referencing co-ordinates? I have given this a go and due to overlapping transcripts have ended up finding the same SNP in multiple genes?

ADD REPLY

Login before adding your answer.

Traffic: 2915 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6