Hi,
I want to retrieve reference sequence GenBank files for genes with a transcript of interest using E-utils.
For example, the GFM1 gene has two mRNA transcripts in the region, XM_005247840.1 and NM_024996.5, and I only want the mRNA and CDS for NM_024996.5 in my GenBank reference files.
This is what I have so far:
esearch -db gene -query "GFM1[gene] AND human[orgn] AND alive[prop]" | efetch -format docsum | xtract -pattern DocumentSummary -block LocationHistType -if ChrAccVer -equals NC_000003.11 -tab "\n" -element ChrAccVer,ChrStart,ChrStop | awk -F '\t' '{{OFS = "\t"} if ($2 < $3) {print $1, $2+1, $3+1} else {print $1, $3+1, $2+1}}' | xargs -n 3 sh -c 'efetch -db nucleotide -id "$0" -seq_start "$1" -seq_stop "$2"'
But it gives me both transcripts, and I only want the one of interest.
Any help would be greatly appreciated!
There are multiple entries in RefSeq:
Thanks genomax, I'm new to this. I'd like .gb files for the GRCh37 assembly. With my command above, I was able to get the .gb file, but it contains both mRNA transcripts. I'm having trouble downloading a .gb reference for just NM_024996.5.
You can get the GenBank format sequence by doing:
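One way to do this with Entrez Direct is sketched below. It assumes EDirect is installed and `efetch` is on your PATH; `fetch_gb` is just a hypothetical helper name, not part of EDirect.

```shell
# Fetch one RefSeq record in GenBank flat-file format.
# Assumes NCBI Entrez Direct (efetch) is installed and on PATH.
# fetch_gb is a hypothetical helper name used only for this sketch.
fetch_gb() {
  acc="$1"
  # -db nuccore: nucleotide database; -format gb: GenBank flat file
  efetch -db nuccore -id "$acc" -format gb > "${acc}.gb"
}

# Usage: fetch_gb NM_024996.5   # writes NM_024996.5.gb
```

Note that this returns the transcript record itself, in transcript coordinates, not a chromosome region.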
These are RefSeq accessions and they are not tied to a particular genome build.
Yes, I understand. But I do need the reference sequence for this gene that encodes the transcript, with chromosome (g.) coordinates for the gene.
The GenBank file should contain the gene GFM1, the build 37 chromosome NC_000003.11, and the mRNA transcript NM_024996.5.
Thanks.
Entrez Direct will only return the latest data, and human build 37 (GRCh37) is no longer actively annotated. RefSeq only releases an occasional update for it. The latest update is here: https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20190906/
You can download the GenBank flatfiles from that location, as well as data in other formats. Keep in mind though that the data there are current only as of the annotation update release date. Any additional updates to the RefSeqs made since that release are currently only available for GRCh38. It's not uncommon to come across RefSeq transcripts that are live but not present in the GRCh37 data, because those transcripts were created after GRCh37 annotation release 105.20190906.
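If you go the flatfile route, a rough sketch for pulling a single record such as NC_000003.11 out of a multi-record GenBank flatfile could look like this. `extract_record` is a hypothetical helper, and the file name in the usage comment is an assumption based on the usual NCBI genomes FTP naming, so check the directory listing first. It returns the whole chromosome record, so you would still need to trim it down to the GFM1 region and the transcript you want.

```shell
# Print only the GenBank record whose VERSION line matches the given accession.version.
# extract_record is a hypothetical helper name; reads a flatfile on stdin.
extract_record() {
  awk -v acc="$1" '
    # A new record starts: begin buffering its header lines.
    /^LOCUS/ { hdr = $0 "\n"; state = "hdr"; next }
    # Buffer header lines until VERSION tells us which record this is.
    state == "hdr" {
      hdr = hdr $0 "\n"
      if ($1 == "VERSION") {
        if ($2 == acc) { printf "%s", hdr; state = "keep" } else { state = "skip" }
      }
      next
    }
    # Stream the rest of the matching record, including the closing "//".
    state == "keep" { print }
  '
}

# Usage (file name is an assumption, verify it in the release directory):
# gzip -dc GCF_000001405.25_GRCh37.p13_genomic.gbff.gz | extract_record NC_000003.11 > NC_000003.11.gb
```

Streaming line by line avoids holding a whole chromosome record in memory.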