Question

Retrieve GFF3 file from Ensembl

5.8 years ago

jamesdong • 0

I have a set of Ensembl mRNA IDs for homologous genes from different species (ENSGACT00000003747, ENSMLUT00000000343, ENSMICT00000005631, ...), and I want to retrieve the corresponding GFF3 file for each of them from the Ensembl site. Searching Biostars, I found the post "Retrieve GFF3 file from ncbi", so I thought I could convert the Ensembl mRNA IDs into NCBI RefSeq mRNA accessions with bioDBnet and then fetch the GFF3 files from NCBI. However, Ensembl mRNA IDs and NCBI RefSeq mRNA accessions do not always match one to one. For example, ENSMLUT00000000343 matches XM_006081596, XM_023753155 and XM_023753157, while ENSGACT00000003747 matches nothing. So my question is: how can I batch download GFF3 files for homologous genes directly from the Ensembl site using Ensembl mRNA IDs? Thanks in advance.

gene Assembly genome ensembl • 1.8k views

The solutions are great!

Reply • 5.8 years ago by jamesdong • 0

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

Reply • 5.8 years ago by GenoMax 151k

Thank you for reminding me, I have edited.

Reply • 5.8 years ago by jamesdong • 0

score 4 · Accepted Answer · 2019-07-08

One idea is to first get the required information (species name, gene ID, location on the genome) from bioDBnet, then use that information to query Ensembl. You can further filter the results generated by the following code, or modify the code to suit your needs (make sure you have jq installed):

# Comma-separated Ensembl transcript IDs to look up
IDs="ENSGACT00000003747,ENSMLUT00000000343,ENSMICT00000005631"

# 1. curl: ask bioDBnet for the taxon, gene ID and chromosomal location of each transcript
# 2. jq:   flatten each JSON record into one tab-separated line
# 3. perl: extract species, gene, region and transcript; records that do not match are skipped
# 4. awk:  build one Ensembl GFF3 export URL per record
curl -s "https://biodbnet-abcc.ncifcrf.gov/webServices/rest.php/biodbnetRestApi.json?method=db2db&format=row&input=ensembltranscriptid&inputValues=${IDs}&outputs=taxonid,ensemblgeneid,chromosomallocation" \
  | jq -r '.[] | "\(."Taxon ID")\t\(."Ensembl Gene ID")\t\(."Chromosomal Location")\t\(."InputValue")"' \
  | perl -nle 'if ($_ =~ /\[Scientific Name: (\w+\s\w+)\]\t(ENS\w+\d+)\W+\[chr: (.+?)\] \[chr_start: (\d+)\] \[chr_end: (\d+)\].*\t(ENS\w+\d+)/) { ($spe, $gene, $chr, $start, $end, $trans)=($1, $2, $3, $4, $5, $6); $spe=~s/ /_/g; print $spe." ".$gene." ".$chr.":".$start."-".$end." ".$trans }' \
  | awk '{print "https://www.ensembl.org/"$1"/Export/Output/Transcript?db=core;flank3_display=0;flank5_display=0;g="$2";output=gff3;r="$3";strand=feature;t="$4";param=gene;param=transcript;param=exon;param=intron;param=cds;_format=Text"}' \
  > dl_mRNA.txt

# Download every URL in the list; with -O, all results are concatenated into one file
wget -O dl_mRNA.gff3 -i dl_mRNA.txt
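
The awk step in the pipeline fills four fields (species, gene ID, region, transcript ID) into a fixed export URL. If those fields are already known, the same URL can be assembled without the bioDBnet lookup. The following is only a sketch: the function name ensembl_export_url is hypothetical, and the URL template is copied from the awk command rather than from any Ensembl documentation.

```shell
# Hypothetical helper: build the Ensembl GFF3 export URL for one transcript.
# Arguments: species (with underscores), gene ID, region (chr:start-end), transcript ID.
# The template mirrors the awk command in the answer; it is not an official API.
ensembl_export_url() {
  printf 'https://www.ensembl.org/%s/Export/Output/Transcript?db=core;flank3_display=0;flank5_display=0;g=%s;output=gff3;r=%s;strand=feature;t=%s;param=gene;param=transcript;param=exon;param=intron;param=cds;_format=Text\n' \
    "$1" "$2" "$3" "$4"
}

# Example call using the stickleback transcript from the question
ensembl_export_url Gasterosteus_aculeatus ENSGACG00000002849 \
  groupV:1964805-1971970 ENSGACT00000003747
```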

The file dl_mRNA.txt contains the URLs for downloading the requested GFF3 records:

$ head dl_mRNA.txt
https://www.ensembl.org/Gasterosteus_aculeatus/Export/Output/Transcript?db=core;flank3_display=0;flank5_display=0;g=ENSGACG00000002849;output=gff3;r=groupV:1964805-1971970;strand=feature;t=ENSGACT00000003747;param=gene;param=transcript;param=exon;param=intron;param=cds;_format=Text
https://www.ensembl.org/Myotis_lucifugus/Export/Output/Transcript?db=core;flank3_display=0;flank5_display=0;g=ENSMLUG00000000348;output=gff3;r=GL429768:1611827-1619441;strand=feature;t=ENSMLUT00000000343;param=gene;param=transcript;param=exon;param=intron;param=cds;_format=Text
https://www.ensembl.org/Microcebus_murinus/Export/Output/Transcript?db=core;flank3_display=0;flank5_display=0;g=ENSMICG00000005634;output=gff3;r=29:8333386-8409960;strand=feature;t=ENSMICT00000005631;param=gene;param=transcript;param=exon;param=intron;param=cds;_format=Text
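
Because `wget -O dl_mRNA.gff3 -i dl_mRNA.txt` concatenates all downloads into a single file, a separate file per transcript needs one wget call per URL. The sketch below takes the file name from the `t=` parameter of each URL. Both function names are hypothetical, and the download loop is only defined here, not run, since it needs network access.

```shell
# Hypothetical helper: print the transcript ID stored in the t= parameter of an export URL.
transcript_from_url() {
  printf '%s\n' "$1" | sed -n 's/.*[;?]t=\([^;]*\).*/\1/p'
}

# Hypothetical helper: download each URL listed in file $1 into <transcript>.gff3.
# URLs without a t= parameter are skipped.
download_each() {
  while IFS= read -r url; do
    id=$(transcript_from_url "$url")
    [ -n "$id" ] || continue
    wget -q -O "${id}.gff3" "$url"
  done < "$1"
}
```

Calling `download_each dl_mRNA.txt` would then write files such as ENSGACT00000003747.gff3 into the current directory, assuming the Ensembl export URLs still respond as they did when the answer was written.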

Part of the example output:

$ grep 'ENSGACT00000003747' dl_mRNA.gff3
groupV  Ensembl transcript  1964805 1971815 .   -   .   ID=ENSGACT00000003747.1;Name=ENSGACT00000003747.1;Parent=ENSGACG00000002849.1;biotype=protein_coding
groupV  Ensembl intron  1967918 1971674 .   -   .   Name=intron00001;Parent=ENSGACT00000003747.1
groupV  Ensembl intron  1967660 1967875 .   -   .   Name=intron00002;Parent=ENSGACT00000003747.1
groupV  Ensembl intron  1965918 1967531 .   -   .   Name=intron00003;Parent=ENSGACT00000003747.1
groupV  Ensembl intron  1965353 1965875 .   -   .   Name=intron00004;Parent=ENSGACT00000003747.1
groupV  Ensembl CDS 1971675 1971795 .   -   .   Name=ENSGACP00000003735;Parent=ENSGACT00000003747.1
groupV  Ensembl CDS 1967876 1967917 .   -   .   Name=ENSGACP00000003735;Parent=ENSGACT00000003747.1
groupV  Ensembl CDS 1967532 1967659 .   -   .   Name=ENSGACP00000003735;Parent=ENSGACT00000003747.1
groupV  Ensembl CDS 1965876 1965917 .   -   .   Name=ENSGACP00000003735;Parent=ENSGACT00000003747.1
groupV  Ensembl CDS 1965344 1965352 .   -   .   Name=ENSGACP00000003735;Parent=ENSGACT00000003747.1
groupV  Ensembl exon    1971675 1971815 .   -   .   Name=ENSGACE00000030656;Parent=ENSGACT00000003747.1
groupV  Ensembl exon    1967876 1967917 .   -   .   Name=ENSGACE00000030672;Parent=ENSGACT00000003747.1
groupV  Ensembl exon    1967532 1967659 .   -   .   Name=ENSGACE00000030687;Parent=ENSGACT00000003747.1
groupV  Ensembl exon    1965876 1965917 .   -   .   Name=ENSGACE00000030700;Parent=ENSGACT00000003747.1
groupV  Ensembl exon    1964805 1965352 .   -   .   Name=ENSGACE00000030711;Parent=ENSGACT00000003747.1
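
As an example of further filtering, the CDS rows of one transcript can be summed to get its coding length. This is a minimal sketch that assumes the nine-column layout shown above, with the feature type in column 3, start and end in columns 4 and 5, and a `Parent=` attribute in column 9. The function name cds_length is hypothetical.

```shell
# Hypothetical helper: sum the CDS lengths of one transcript in a GFF3 file.
# Arguments: transcript ID (without version suffix), GFF3 file.
# Matching is by prefix, so "Parent=ID" also matches "Parent=ID.1".
cds_length() {
  awk -v parent="Parent=$1" '
    $3 == "CDS" && index($9, parent) { total += $5 - $4 + 1 }
    END { print total + 0 }
  ' "$2"
}
```

For the five CDS rows of ENSGACT00000003747 shown above, the lengths are 121, 42, 128, 42 and 9 bp, so `cds_length ENSGACT00000003747 dl_mRNA.gff3` should report 342 if the file holds exactly those CDS rows for that transcript.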