Retrieve reference sequence GenBank for mRNA transcript of interest
4.5 years ago
speycast • 0

Hi,

I want to retrieve reference sequence GenBank files for genes with a transcript of interest using E-utils. For example, the GFM1 gene has two mRNA transcripts in the region, XM_005247840.1 and NM_024996.5, and I only want the mRNA and CDS for NM_024996.5 in my GenBank reference files.

This is what I have so far:

esearch -db gene -query "GFM1[gene] AND human[orgn] AND alive[prop]" |
  efetch -format docsum |
  xtract -pattern DocumentSummary -block LocationHistType -if ChrAccVer -equals NC_000003.11 -tab "\n" -element ChrAccVer,ChrStart,ChrStop |
  awk -F '\t' '{{OFS = "\t"} if ($2 < $3) {print $1, $2+1, $3+1} else {print $1, $2+1, $3+1}}' |
  xargs -n 3 sh -c 'efetch -db nucleotide -id "$0" -seq_start "$1" -seq_stop "$2"'

But it gives me both transcripts, and I only want the one of interest.

Any help would be greatly appreciated!

GenBank NCBI RefSeq mRNA • 1.3k views

There are multiple entries in RefSeq:

$  esearch -db gene -query "GFM1[gene] AND human[orgn] AND alive[prop]" | elink -db gene -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta | grep ">"
>NM_024996.7 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 2, mRNA; nuclear gene for mitochondrial product
>NM_001374357.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 6, mRNA; nuclear gene for mitochondrial product
>NM_001374355.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 4, mRNA; nuclear gene for mitochondrial product
>NM_001374361.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 10, mRNA; nuclear gene for mitochondrial product
>NM_001374356.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 5, mRNA; nuclear gene for mitochondrial product
>NM_001374358.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 7, mRNA; nuclear gene for mitochondrial product
>NM_001374360.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 9, mRNA; nuclear gene for mitochondrial product
>NM_001374359.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 8, mRNA; nuclear gene for mitochondrial product
>NM_001308166.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 3, mRNA
>NR_164502.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 14, non-coding RNA
>NR_164499.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 11, non-coding RNA
>NR_164500.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 12, non-coding RNA
>NR_164501.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 13, non-coding RNA
>NM_001308164.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 1, mRNA

Thanks genomax, I'm new to this. I'd like .gb files for the GRCh37 assembly. With my command above, I was able to get the .gb file, but it contains both mRNA transcripts. I'm having trouble downloading a .gb reference for just NM_024996.5.


You can get the GenBank format sequence by doing:

$ efetch -db nuccore -id "NM_024996" -format gb > NM_024996.gbk

These are RefSeq accessions and they are not tied to a particular genome build.


Yes, I understand. But I need the genomic reference sequence for the gene that encodes this transcript, with chromosomal (g.) numbering relative to the gene.

The GenBank file should contain the gene GFM1, build 37 (NC_000003.11), and the mRNA transcript NM_024996.5.

Thanks.


Entrez Direct will only return the latest data, and human build 37 (GRCh37) is no longer actively annotated. RefSeq only releases an update for it every once in a while. The latest update is here: https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20190906/

You can download the GenBank flatfiles from that location, as well as data in other formats. Keep in mind, though, that the data there are current only as of the annotation update release date. Any updates made to the RefSeqs since that release are currently available only for GRCh38. It's not uncommon to come across RefSeq transcripts that are live but not present in the GRCh37 data, because those transcripts were created after the GRCh37 105.20190906 release.
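
Once you have a GenBank flat file for the gene region (from efetch or from the annotation release), one way to end up with only the transcript of interest is to post-filter the feature table. The following is a minimal sketch, not an NCBI tool: `filter_gb_features` is a hypothetical helper name, and it assumes the standard flat-file layout where feature keys start in column 6 and qualifiers are indented further. It drops every mRNA or CDS feature that does not mention one of the accessions you pass in and keeps everything else (source, gene, sequence). In genomic records, CDS features are typically labeled with /protein_id rather than /transcript_id, so you would likely need to pass the matching protein accession as well.

```shell
#!/bin/sh
# Hypothetical helper (not part of Entrez Direct): keep only the mRNA/CDS
# feature blocks that mention one of the wanted accessions.
# Usage: filter_gb_features "NM_024996.5,<protein accession>" < region.gb > filtered.gb
filter_gb_features() {
  awk -v keep="$1" '
    # Print the buffered feature unless it is an unwanted mRNA/CDS.
    function emit_block() {
      if (block != "" && (!(key == "mRNA" || key == "CDS") || wanted))
        printf "%s", block
      block = ""; wanted = 0; key = ""
    }
    BEGIN { n = split(keep, ids, ",") }
    /^FEATURES/        { in_feat = 1; print; next }
    in_feat && /^[^ ]/ { emit_block(); in_feat = 0 }  # ORIGIN, CONTIG or // ends the table
    !in_feat           { print; next }                # header and sequence pass through
    /^     [^ ]/       { emit_block(); key = $1 }     # a new feature key starts in column 6
    {
      block = block $0 "\n"
      # Match the quoted accession so that a longer accession cannot match by prefix.
      for (i = 1; i <= n; i++)
        if (index($0, "\"" ids[i] "\"")) wanted = 1
    }
    END { emit_block() }
  '
}
```

The accessions are compared as exact quoted strings, so the version suffix matters: NM_024996.5 will not match a record that carries NM_024996.7.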
