How to get gene name for 900k SNPs
2
1
Entering edit mode
6 months ago

I've been using this to obtain gene name and ensembl id

def get_gene_name(row, CHR, pos):
    genes_list=data.gene_names_at_locus(contig=row[CHR], position=row[pos])
    return list(filter(None, genes_list))

def get_ensembl_id(chromosome, position):
    ensembl_url = f"https://rest.ensembl.org/overlap/region/human/{chromosome}:{position}-{position}?feature=gene"
    headers = {"Content-Type": "application/json"}
    response = requests.get(ensembl_url, headers=headers)
    if response.status_code != 200:
        raise Exception(f"Request failed: {response.status_code}")
    data = response.json()
    genes = [entry['gene_id'] for entry in data if entry['feature_type'] == 'gene']
    return genes

However, I have 900k SNPs! so this isn't feasible. Is there a faster way to obtain this info? So far I've been doing a for loop for all 900k SNPs using:

df.apply(get_gene_name, args=('chr_name', 'chr_position'), axis=1)

It's taking way too long. Is there a faster way?

biomart entrez pyensembl snp gene • 884 views
ADD COMMENT
4
Entering edit mode
6 months ago

get a sorted bed for your snps

get a sorted bed for the genes

use bedtools intersect

done.

ADD COMMENT
0
Entering edit mode

Where would I get the bed for the genes from? And how would that file look like? Could you provide a minimal example? Thanks!

ADD REPLY
1
Entering edit mode

Where would I get the bed for the genes from ?

How To Download The Entire Human Gene List With Their Chromosomal Location

And how would that file look like?

chrom/start/end/name : https://genome.ucsc.edu/FAQ/FAQformat.html#format1

Could you provide a minimal example? Thanks!

$ head snps.bed

chr1 10 20 SNP1
chr1 20 21 SNP2
chr3 10 20 SNP3

$ head genes.bed

chr1 0 1000 GENE1

bedtools intersect -a snps.bed -b genes.bed -wa -wb -u
ADD REPLY
0
Entering edit mode

This filters the SNPs in GENE1. Output is:

chr1    10  20  SNP1
chr1    20  21  SNP2

but how do I obtain a file which contains all the SNPs along with the gene they're in. Thanks again for your help!

chr1 10 20 SNP1 GENE1
chr1 20 21 SNP2 GENE1
chr3 10 20 SNP3 NaN

EDIT: Nevermind, removing -u did the job :)

ADD REPLY
1
Entering edit mode

hum...should be option '-loj'

    -loj    Perform a "left outer join". That is, for each feature in A
        report each overlap with B.  If no overlaps are found, 
        report a NULL feature for B.
ADD REPLY
0
Entering edit mode

Thanks! how should the files be sorted? numerically? alphabetically?

I'm sorting them this way, is this correct?

sort -k1,1V -k2,2n -k3,3n file
ADD REPLY
1
Entering edit mode
sort -t $'\t' -k1,1 -k2,2n
ADD REPLY
0
Entering edit mode

Depending on your variants but its possible to download dbSNP or gnomAD data and a lot of other data but your biggest challenge is not here.I no longer remember everything, but I think the files size order was about one or two hundrer Go, just for the SNPs annotations.

But warning ! It's correct way, if it's just to annotate a "clean" (or parsed) data. In addition, is possible to optimize your code with multitreating/multiprocessing library on python to have reasonable time to execution (it's also possible in C or others language ...).

Finaly if you want annotation taking into account the VCF file information, VEP is the most used bioinfo tools.

PS: During my MSc degree, I had developed a script in python using multiprocessing python library with a few algorithms part. I was able to annotate, extract, or modify SNPs fast on GWAS file in PED and MAP format on Macbook Air (below the minute). It was also a small cohort about 550 samples but about 900 000 SNP.

PS2: I know pandas has made progress in supporting the large file, but I don't think that a good future situation for your project in this way. I think that you will need manage your class object and memory usage, too.

For my script project describe in previous "PS", I modified each pointed address and assigned the new value to optimize the memory usage for example.

PS3: My opinion is based that you only have access to one computer, otherwise VEP tool should work.

ADD REPLY
1
Entering edit mode
6 months ago

It seems VEP is the best option so far. But it still takes around 10 hours to process 900k SNPs, which is very far from what they claim.

"Set up correctly, VEP is capable of processing around 3 million variants in 30 minutes."

Not sure what "set up correctly" even means. I used the largest buffer size and still took almost half a day...

There is another tool that is faster called SNP-nexus, however this only lets you run 10k at the time, and there is no way to access it programatically, so not very useful.

ADD COMMENT

Login before adding your answer.

Traffic: 1787 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6