Hi,
I am trying to get the sequences for a large list (about 17,000) of NCBI Gene IDs.
I found that one can use Ensembl BioMart (http://www.ensembl.org/biomart/martview/7cf551cc4abf51e75bb4f1d84477681e) to download the sequences, but this is slow and only works for 500 gene IDs at a time.
Is there a database or file available somewhere that I can use to retrieve the nucleotide sequences for the corresponding genes?
I also found Batch Entrez (https://www.ncbi.nlm.nih.gov/sites/batchentrez), which for genes can yield a mapping of this form:
gene_id -> genomic_nucleotide_accession.version:start_position_on_the_genomic_accession-end_position_on_the_genomic_accession
I then tried the "nucleotide" batch search, but when I enter, for example,
NC_000003.12:155869430-155944020
I get
An illegal character in a token. Possible wrong file format. Request processing canceled.
I also found that one can download the genome from ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/current/GCF_000001405.39_GRCh38.p13/ but I am a bit lost here. The README states:
*_genomic.fna.gz file FASTA format of the genomic sequence(s) in the assembly. Repetitive sequences in eukaryotes are masked to lower-case (see below).
So I downloaded this file, grepped through it a bit, and can find, for example, NC_000003.12. But now I also need to extract the sequence at the given positions, which is very slow when done the naive way, for example in Python.
So my question is: How should I approach this task?
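One possible way to avoid the slow naive scan is to index the FASTA file once and then seek directly to the requested bases. The sketch below is a minimal, pure-Python version of that idea, not a definitive implementation. It assumes the file has been decompressed first and that every sequence line within a record has the same width (except the last one). The function names `parse_location`, `build_index` and `fetch`, and the file name in the usage comment, are hypothetical and only for illustration. Coordinates are treated as 1-based and inclusive, as in the "accession:start-end" strings above.

```python
import re

# Matches "accession:start-end", tolerating stray spaces around the dash.
LOCATION = re.compile(r"^(\S+?):(\d+)\s*-\s*(\d+)$")


def parse_location(text):
    """Split 'NC_000003.12:155869430-155944020' into (accession, start, end)."""
    match = LOCATION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised location: {text!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


def build_index(fasta_path):
    """Scan an uncompressed FASTA once and record the layout of each record.

    For every record we keep the byte offset of its first base, its total
    length, the number of bases per line and the number of bytes per line.
    """
    index = {}
    name = None
    with open(fasta_path, "rb") as handle:
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Record name is the first word of the header line.
                name = line[1:].split()[0].decode()
                index[name] = {"offset": handle.tell(), "length": 0,
                               "bases": None, "width": None}
            elif name is not None:
                record = index[name]
                stripped = line.rstrip(b"\r\n")
                if record["bases"] is None:
                    record["bases"] = len(stripped)  # bases per line
                    record["width"] = len(line)      # bytes per line
                record["length"] += len(stripped)
    return index


def fetch(fasta_path, index, name, start, end):
    """Return bases start..end (1-based, inclusive) of record `name`."""
    record = index[name]
    if not 1 <= start <= end <= record["length"]:
        raise ValueError("coordinates outside the sequence")

    def byte_position(base):
        # Convert a 0-based base position into a byte position in the file.
        row, column = divmod(base, record["bases"])
        return record["offset"] + row * record["width"] + column

    first = byte_position(start - 1)
    last = byte_position(end - 1)
    with open(fasta_path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first + 1)
    # Drop the line breaks that fall inside the requested range.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


# Hypothetical usage (file name is an assumption):
# index = build_index("GCF_000001405.39_GRCh38.p13_genomic.fna")
# accession, start, end = parse_location("NC_000003.12:155869430-155944020")
# sequence = fetch("GCF_000001405.39_GRCh38.p13_genomic.fna", index, accession, start, end)
```

The index is built in a single pass over the file, so each of the roughly 17,000 lookups afterwards only reads the bytes it needs. Established tools that keep a FASTA index on disk work on the same principle and are likely the more robust choice for real use.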
Awesome, thank you very much! The only problem I have is that the sequences do not seem to match. For example, for Gene ID 30849, I get this location from feature_table: "NC_000003.12:130678934 -130746829"
Now when I look up the gene at NCBI, I get this sequence: https://www.ncbi.nlm.nih.gov/nuccore/NC_000003.12?report=fasta&from=130678934&to=130746829&strand=true starting with
But when I do
I get the sequence starting with
Also, I don't know why some of the letters are lowercase.
Your link is showing the reverse-complement sequence because the gene is on the opposite strand. bedtools getfasta has an option (-s) to fetch the strand-specific sequence if you include the strand in your input BED file. I suggest you give that a try.
The lowercase letters are masked sequences. Take a look at the README file for more information about them.
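For anyone doing the extraction in Python rather than with bedtools, the two points above (strand and masking) can be sketched as follows. This is only a minimal illustration under stated assumptions: the helper names `reverse_complement` and `gene_sequence` are hypothetical, the strand is assumed to be given as "+" or "-", and only A, C, G, T and N are complemented (other IUPAC ambiguity codes would pass through unchanged, which is not correct for them).

```python
# Translation table for complementing bases; case is preserved so that
# soft-masked (lowercase) repeats stay recognisable.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    return sequence.translate(COMPLEMENT)[::-1]


def gene_sequence(region_sequence, strand, unmask=True):
    """Orient a plus-strand genomic slice according to the gene's strand.

    region_sequence: bases cut from the reference (plus strand) at the
                     gene's start..end coordinates.
    strand:          "+" or "-" as listed for the gene (an assumption about
                     how the strand is encoded in your table).
    unmask:          if True, convert soft-masked lowercase bases to
                     uppercase; the bases themselves are unchanged.
    """
    if strand == "-":
        region_sequence = reverse_complement(region_sequence)
    return region_sequence.upper() if unmask else region_sequence


print(gene_sequence("AAcgT", "-"))                # ACGTT
print(gene_sequence("AAcgT", "+", unmask=False))  # AAcgT
```

Lowercase letters only mark repetitive regions, so converting to uppercase loses the masking information but does not change the sequence.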
Perfect, now everything works, thank you!