Downloading Complete CDS Data of a Genome from UniGene and RefSeq
11.2 years ago
Ritvik ▴ 30

Hi,

I want to download the complete CDS set of a genome from RefSeq and UniGene, but I wasn't able to get the data from the FTP site. Is it possible to do this using eutils or BioMart? If so, could anyone guide me through the general steps? I don't need a script, as I am learning a scripting language myself and believe I can figure that part out.

Also, how do I retrieve the CDS from a complete cDNA set (FASTA format) of a genome?

cds cdna refseq • 5.0k views

Can you explain how you were unable to get the data from the FTP site? Did you see any error messages? It is usually possible to use eutils or BioMart, since most public databases support those services. Which specific steps are you having trouble with?


I should have mentioned that I am just starting to learn bioinformatics, and I apologize for being so brief in my earlier question.

Suppose I want to download the full-length cDNA of Zea mays from RefSeq. I am at ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plant/, but to understand the file naming pattern I tried to download the catalog file ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/RefSeq-release60.catalog.gz, and I get a 'forbidden' error message every time.

Now for the second part of my question: from the RefSeq plant link above, I downloaded one file at random, 'plant.1.rna.fna.gz', and here is one sequence from it:

>gi|145341211|ref|XM_001415670.1| Ostreococcus lucimarinus CCE9901 VIC family transporter: sodium ion channel (OSTLU_7925) mRNA, partial cds
TGGCAAGACCTCATGTACGAAGCGATGGATGTCGTCGGCGTAGATCAAGAACCTATCCGAGATAACGCCAAGTGGGCGTG
CGTTTACTTTTTCGTATCGATTCTCTTTGGCTTCTTGCTCTGGGCAAATCTCTTTGTGTCGGCGCTCATCGACAATTTCA
ATCGTATCGCACACGACGAGAACGACGGCAAGCTGCTTGTCACCGATGAGCAACGCGTGTGGCAGCAAGCAATGTTACTC
GCCACGGTGCACGCTGACAACTCTTGGCGTAGATCATCACCGGAAACGCCCTGGAAAGCCGTCGTGCATGGCGTAGTGTC
TAAGTATACCTTTGACGCGTTTTCCGTCTTTATGATCGTACTCAACATGGTCACGATGATGGCAATACGCGCGAACCCGT
CGAAATCTGAGGACGACTATCAAGTTTGGATGGGAAACACGTTAGCGATCTGGTACATGCTTGAGGCTTATCTTTTGATC
GTCGCCATGAAGTGGAAAAATTACTGGCAGAGCGGTTGGAATAAGATCGATTTCATCGTGGCAGTTAGCGGTGTCGTCGG
TCTACTCATCCCGGATGTTTACGAGAGTGGCGTTGGTGGAGCTTTCCGTATGCTGAGATTTTTGCGATTGTTTAAAATTG
TTCAGGTGAGCAAAGGTCTGCGAACACTTTTCGCGACATTCTTGTCGGCGATTCCTGGAGTCGTCAATGTCGCGCTTTTA
TCTCTGCTGTTCATGTACATCTATGCTTGCCTCGGTGTCGCACTCTTT

Can you tell me how to identify the CDS in this sequence?

Similarly, where can I download the complete transcriptome of Zea mays from UniGene? I am at ftp://ftp.ncbi.nlm.nih.gov/repository/UniGene/Zea_mays/. The info file says the species contains 146856 mRNAs, but which of the files listed there contains them? As far as I know, UniGene maintains gene-oriented clusters, so each entry is a cluster. Do I have to look for the mRNAs in the description of each cluster? And again, how do I identify the CDS once I have the transcriptome?

I know what I am asking is fairly basic but, as I said earlier, I am new to this field and any help would be appreciated. I also know there are plant-specific databases that provide this information easily, but I want to do it through NCBI, and I am missing something very fundamental here.

11.2 years ago
Jason ▴ 940

Without knowing the organism, in general, if I were looking to retrieve CDS data I might start at the UCSC Genome Browser. UCSC also has an FTP server you can use. You can also go through Galaxy if you need to change formats or do some editing. Galaxy is great for beginners.



11.2 years ago

You've mentioned Zea mays in a few of your responses; its CDS is available from PlantGDB. For some organisms, these sequences are also available via BioMart, though it's a bit hit or miss whether this actually works. As a last resort, you can always (1) download the genomic sequence and a GTF annotation for it, (2) filter this annotation to include only the CDS features (though you'll likely have to change the feature label to "exon"), and then (3) use the gtf_to_fasta executable that comes with TopHat. Something along those lines should generally work for any remaining genomes. You could also probably use gffread, from Cufflinks, in place of gtf_to_fasta.
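For step (2), here is a rough sketch in Python of what the filtering could look like. It assumes a standard tab-separated, nine-column GTF in which the feature type is the third column. The function names and the file names in the usage note are made up for illustration, so adapt them to your own data.

```python
def cds_as_exons(gtf_lines):
    """Yield only the CDS records from GTF lines, relabelled as 'exon'.

    Assumes tab-separated GTF with nine columns; comment lines, blank
    lines and malformed records are skipped.
    """
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue
        if fields[2] == "CDS":
            # Relabel so that tools expecting "exon" features use the CDS intervals.
            fields[2] = "exon"
            yield "\t".join(fields)


def filter_gtf(in_path, out_path):
    """Write the CDS-only, relabelled annotation to out_path (hypothetical helper)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for record in cds_as_exons(src):
            dst.write(record + "\n")
```

You would call it as something like filter_gtf("annotation.gtf", "cds_only.gtf") (both file names are placeholders) and then pass the filtered file, together with the genome FASTA, to gtf_to_fasta or gffread in step (3).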


Thanks for replying. Yes, I know about PlantGDB and other databases that easily provide this information, but I have seen a number of papers that use RefSeq to extract all mRNAs, and I am unable to figure out how. OK, I will try what you have suggested and see how the results turn out. Once again, thanks for replying!
