Retrieve FASTA sequences for all genes from GRCh38.p13
2
0
Entering edit mode
5.1 years ago
lokimoto • 0

Hi,

I try to get all sequences for a large (about 17.000) list of NCBI gene Ids.

What I figured is one can use http://www.ensembl.org/biomart/martview/7cf551cc4abf51e75bb4f1d84477681e and then download the sequences but this is slow and only works for 500 gene ids at a time.

Is there somewhere a database / file available which I can use to retrieve the nucleotide sequences for the corresponding genes?

I figured there is https://www.ncbi.nlm.nih.gov/sites/batchentrez which can yield for genes:

 gene_id -> genomic_nucleotide_accession.version:start_position_on_the_genomic_accession-end_position_on_the_genomic_accession"

Then I used the "nucleotide" batch search. But when I enter e.g.

NC_000003.12:155869430-155944020

I get

An illegal character in a token. Possible wrong file format. Request processing canceled.

I figured one can download the genome from here ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/current/GCF_000001405.39_GRCh38.p13/ but I am lost a bit here. In the readme it states

*_genomic.fna.gz file FASTA format of the genomic sequence(s) in the assembly. Repetitive sequences in eukaryotes are masked to lower-case (see below).

So I downloaded this file and I also grepped a bit in the file and can view for example NC_000003.12. But now I need also to retrieve the positions which is super slow when doing it the neive waw for example in python.

So my question is: How should I approach this task?

Assembly sequence • 3.8k views
ADD COMMENT
2
Entering edit mode
5.1 years ago
vkkodali_ncbi ★ 3.8k

I suggest you to take a look at the GCF_000001405.39_GRCh38.p13_feature_table.txt.gz file located in the FTP path you have specified. It is a tab-delimited table with information about all of the features annotated on the genome. If you have a list of NCBI GeneIDs, I'd join that list on GeneID field of the feature_table.txt.gz file, filter the rows that have gene in the first column and create a BED-like file using the data in columsn 7, 8, 9 and 10. Make sure to subtract 1 from the start/end positions depending on the strand because BED uses 0-based coordinate system whereas the coordinates in the feature_table.txt file are 1-based. Then, you can use bedtools getfasta in combination with the genome sequence FASTA files to extract the gene sequence in FASTA format.

ADD COMMENT
0
Entering edit mode

Awesome thank you very much!! The only problem I am getting is, that the sequences seems to not match. E.g. looking at Gene ID 30849, I get the location from feature_table: "NC_000003.12:130678934 -130746829"

Now when I look up the gene at NCBI, I get this sequence: https://www.ncbi.nlm.nih.gov/nuccore/NC_000003.12?report=fasta&from=130678934&to=130746829&strand=true starting with

GAGGTCAGACCGGTTGCTTTCCCGGGAGTTCGGCG

But when I do

bedtools getfasta -fi GCF_000001405.39_GRCh38.p13_genomic.fna -bed test_gene_30849.bed

I get the sequence starting with

ACAGTTTAACTGCTTTGAATAgcatttttaatagcaaaaatgtGTTTATACTATAA

Also I don't know why there are small letters?

ADD REPLY
1
Entering edit mode

Your link is showing reverse complement sequence because the gene is on the opposite strand. bedtools getfasta has an option to fetch strand specific sequence if you include the strand in your input bed. I suggest you give that a try.

Also I don't know why there are small letters?

These are masked sequences. Take a look at the README file for more information about it.

ADD REPLY
0
Entering edit mode

Perfect, now everything works, thank you!

ADD REPLY
1
Entering edit mode
5.1 years ago
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/current/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/chr1.fna.gz
gzip -d chr1.fna.gz
samtools faidx chr1.fna
samtools faidx chr1.fna NC_000001.11:100000-100050

You can use samtools to create index for fasta file. Hope the above code helps. You can use xargs to iterate over bed file. I can put some code using xargs as a follow up if required.

ADD COMMENT
0
Entering edit mode

Thank you very much! Similar to the sulution vkkodali posted I have problems finding the correct sequence.

E.g. looking at Gene ID 30849, the location is "NC_000003.12:130678934 -130746829"

from NCBI, I get this sequence: https://www.ncbi.nlm.nih.gov/nuccore/NC_000003.12?report=fasta&from=130678934&to=130746829&strand=true starting with

GAGGTCAGACCGGTTGCTTTCCCGGGAGTTCGGCG

But when I do

samtools faidx chr3.fna NC_000003.12:130678934-130746829

I get the sequence starting with

TACAGTTTAACTGCTTTGAATAGCATTTTTAATAGCAAAAATGTGTTTATACTATAATAG

Can you give me hints on why this is the case? (I used ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/current/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/chr3.fna.gz)

ADD REPLY

Login before adding your answer.

Traffic: 1685 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6