Hi all, I would like to download all of the longest transcripts for protein coding sequences for the vervet genome (assembly number 132581 in NCBI) in fasta format.
I used the command:
elink -db assembly -target nuccore -id 132581 -name assembly_nuccore_refseq |efetch -db nuccore -format fasta_cds_na >> chlorocebus_vervet.genes
where "id" is the assembly ID for the species.
The problems:
- When I type just elink -db assembly -target nuccore -id 132581 -name assembly_nuccore_refseq, the output is (I have had to slightly amend the format, but the numbers have not been altered):
ENTREZ_DIRECT
nuccore
WebEnv>NCID_1_14910726_130.14.22.215_9001_1469710900_373932213_0MetA0_S_MegaStore_F_1
QueryKey>2</querykey<>
Count>7156</count>
Step>1</step>
ENTREZ_DIRECT
Maybe I'm not understanding, but is this telling me that there are 7,156 protein coding sequences being downloaded? When I manually search for "132581" in the NCBI assembly database, the output is: https://www.ncbi.nlm.nih.gov/assembly/GCF_000409795.2/. Then when I click on https://www.ncbi.nlm.nih.gov/genome/?term=txid60711[orgn] and "Gene", and click "protein coding category"; I can see that there are 20,633 records that I would like the fasta sequences for.
- When I run this command (elink -db assembly -target nuccore -id 132581 -name assembly_nuccore_refseq |efetch -db nuccore -format fasta_cds_na >> chlorocebus_vervet.genes) two things happen: (a) Not all 7,156 records download, the number stops at somewhere between 150-400. (b) There are large uneven gaps (i.e. empty lines) randomly in the fasta output. I tried to copy and paste to here, but the large gaps are automatically stripped.
Could someone please tell me what I am doing wrong, and how I retrieve all of the longest transcripts for 20,633 protein coding sequences for this species.
Thanks.
There are 61803 predicted proteins in this file for the accession that you posted above.
don't use gi : http://www.ncbi.nlm.nih.gov/news/03-02-2016-phase-out-of-GI-numbers/
Do you want to get the DNA sequence or protein (you had a post about it but your appear to have deleted it).
This is an interesting problem. I can see a few ways of getting the info for those 20,633 genes.
cut -f 7-9,15 GCF_000409795.2_Chlorocebus_sabeus_1.1_feature_table.txt
An example belowgenomic_accession start end symbol
NC_023642.1 2628 4911 SCGB1C1
NC_023642.1 2628 4911 SCGB1C1
NC_023642.1 3658 4619 SCGB1C1
NC_023642.1 4916 10379 ODF3
NC_023642.1 4916 10379 ODF3
NC_023642.1 4916 10168 ODF3
NC_023642.1 4916 10168 ODF3
NC_023642.1 7313 10166 ODF3
NC_023642.1 7313 9932 ODF3
NC_023642.1 7313 9932 ODF3
Thank you. The link provided by you (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000409795.2_Chlorocebus_sabeus_1.1/GCF_000409795.2_Chlorocebus_sabeus_1.1_protein.faa.gz ) was particularly useful as it pointed me to where I should be looking. To clarify, I want the coding sequences and not amino acids.
I have a question, so I didn't see your second answer, the possible solution to obtain coding sequences, and I was trying to work on my own and this is what I came up with; I'm wondering if this is an alternative way to obtain protein coding sequences (nuclear, not mitochrondial)?
I can see in that directory (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000409795.2_Chlorocebus_sabeus_1.1/ ), that there is another file called GCF_000409795.2_Chlorocebus_sabeus_1.1_cds_from_genomic.fna. I looked up what the "cds_from_genomic" files are here: http://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/, and I can see that they are: "FASTA format of the nucleotide sequences corresponding to all CDS features annotated on the assembly, based on the genome sequence." So would another way of obtained protein coding sequences be to:
Thank you again for the effort, I appreciate it.
If you only want the 20633 genes then going with #2 in my post above is the best option.
But otherwise what you are describing should work. This may be equal to #3 that I had posted above. There appear to be only 13 entries for cds_YP in the cds_from_genomic file with 61790 cds_XP entries.
For option #2 above it turns out there are 25752 features in the file (protein coding genes, gene_result.txt).
Extracting some fields out of the original file (you can adjust as needed)
Then extract the sequence
The intervals (full gene) in this file include exons and introns (upper and lower case letters in final output, verify that once). So that is what you get.
Thank you for the reply, I really appreciate your help. So I did the above method, got an output, and wanted to check that the results made sense. To explain the problem that I see that keeps coming up, if we look at the TRIM5 gene in Pan Troglodytes (chimpanzee).
In Ensembl, when I go to Biomart, and filter the search to only look for TRIM5 in "associated gene names" in the data set CHIMP2.1.4, and obtain the ensembl gene name, transcript name and coding sequence, I get this sequence (on the webpage, you might have to click "results" to see the sequence). This is exactly what I would expect the longest transcript coding sequence of a gene to look like, in terms of length (although length obviously varies).
When I searched for TRIM5 in NCBI here; if I go down to the section “Genomic regions, transcripts and products”, change the genomic sequence to “NC_006478.3 Chromosome 11 reference pan troglodytes-2.1.4 primary assembly” (only so it matches the assembly in Ensembl) and then click “go to nucleotides -> fasta”, the output (which is here) is almost 300,000 base pairs. This is the same length of the gene that you get using method 2 described above, getting the gene information in tabular view. So then I thought to myself, it must be the BED file that's wrong.
When you search the NUCLEOTIDE database in NCBI for TRIM5 in chimp , the title is Pan troglodytes TRIM5alpha (TRIM5) gene, complete cds see here. If you remove the “NNNN” letters from the NCBI nucleotide, the two sequences (NCBI and Ensembl) are the exact same length (1,482 base pairs) and almost identical in composition. So, in this case, the BED file should have said this gene, but for some reason it is pulling out a much longer section of DNA.
So now I know that the problem isn't the method, but it's the BED file that the method is working on. Clearly GENE is not the right database to be working in, and I should be in the NUCLEOTIDE database instead. So then I went to the NUCLEOTIDE database, and I searched for "txid9598[Organism]" (which extracts all chimpanzee genes) in the search bar. Then I filtered by only Pan troglodytes species and genomic DNA/RNA. The problem is that I'm still left with >307,000 sequences. I cannot find how to whittle this number down using the filters.
So my question is, in your method 2 described above, when you say "Choose "Send to file" and then "tabular text view" to download full table"; are you in the GENE database (which is what I was originally working in; giving me incorrect longest coding sequences), or is it the NUCLEOTIDE database that you've somehow whittled down from ~300,000 sequences to ~20,000 sequences? I can't stress enough how much I appreciated your reply, even though I'm still having problems, I would have been a lot further behind without you.
Before I look at this are you happy with the C. sabeus results? Now you are looking at Chimp?
Sorry I replied and only saw this after. I actually have the same problem with everything, but I can't show you what the C. sabeus results should look like in Ensembl because the data isn't there yet, so I moved to chimpanzee because I can see the results that i should have using Ensembl, and then the results that I actually have in NCBI. Then if the results are the same for both, I can assume that the method works for the other species not in Ensembl.
Actually I can kind of get it to work (but there's still a problem) if I use the feature table method (method 3 above). I pulled down the feature table for pan troglodytes, and i pulled out the lines associated with TRIM5.
But I can see two feature descriptions of interest for TRIM5, one is there is a line with "gene...protein coding...5442276-5737937". this is 300,000 base pairs described above. But there's another feature category of sequences, not called GENE, called CDS. And I can see that the length of the longest CDS for this gene is 1,482 base pairs, which is exactly what the Ensembl TRIM5 longest transcript is.
If I pull down the feature table, remove everything except coding sequences, and then for TRIM5, find the length of the longest coding sequence, that will work I think.
However, in your example of making the BED file using this method you have this.
In reality, the lines for TRIM5 are like this:
i.e. your start and end points are ~2,000 base pairs long, whereas mine is >15,000 base pairs long. I don't know where this 1,482 base pairs starts and ends in the genomic accession ( where is the column that specifically tells me where the start and end bases that make up a length of 1,482 base pairs of interest in the feature table, I can't see one).
So now I'm wondering do I have to use the feature table to identify the longest protein. Extract the protein/RNA name, and then do something else with bedtools to give it a list of protein names, and it will give me back the coding sequences for the protein names?
I am looking at two genes (examples) in the C. sabeus.
Here are all the entries for the two genes in the feature table (method #3). I have cut some of the columns that are not interesting out.
If you need just the CDS then you would need to grep out those lines from the feature table and then extract the start and stops.
In P. tro TRIM5 gene has more entries becuase there are additional genbank entries but 3 CDS's based on the length of the feature they code for.
Thanks so much. So in this last example (TRIM5 for chimpanzee), I can see the three coding sequences, one has a "longest canonical transcript/CCDS/coding sequence" of 903 bp, making a protein 300 aa in length, one has a coding sequence length of 981bp/326aa and the last one has a coding sequence length of 1,482bp/493aa.
So in this case, I want to pull out the 1,482 base pairs that make up the last "longest canonical" transcript. I know exactly what the sequence should look like. In the links above, this is the exact 1,482 base pairs here and in NCBI, if you remove the "NNN" from this sequence which is 2,041 base pairs long, it is here. So I know this number is right and I know what the "answer"/coding sequence that is extracted should look like.
So, this is my problem, in the example above (TRIM5 for chimpanzee), where are the start and end positions in that table, that when you subtract the start from the end, gives you a 1,482 base pair sequence (or 2,041 base pairs with NNN); i.e. the co-ordinates that I should get the sequence from?
Because the only other numbers I can see in that example are the starts and ends of the genomic accession, where 5476417 - 5461069 = 15,348 base pairs. So then I thought to myself: maybe it's the 15,348 base pairs, but removing the lower case letters. But in this 15,348 base pair region, there are 8,367 upper case letters.
So then I thought, maybe there's no way to obtain a 1,482 bp set of co-ordinates from this table. But then in that case, I'll be left with other rubbish in my sequence, when I just want the longest canonical transcript for the protein? I was just googling to make sure I understood what the longest CDS was here; so you can see there the CDS is directly translated into protein; so I'm pretty sure the sequence that I want should be 1,482 base pairs in length?
Thank you so much, I appreciate it.
Go here and the search for TRIM5 when the page loads (warning: large download). You will see how the CDS entries are annotated. I think a lot of this may be attributable to automated annotation.
Thank you so much for the reply.
I'm wondering if the best method would be to use the features table described above to identify the protein of interest based on length; make a list of the proteins of interest, and then use Batch Entrez/Eutils to extract the coding sequences for this set of proteins.
I have tried to do this and run into the problem that I can download the full mRNA sequence, but not the mRNA sequence that specifically encodes the protein (pretty much the same issue as described above).
Since how to obtain a set of coding sequences from a set of protein IDs has gone off the original topic of this post "Why big gaps using Entrez Eutils when trying to download protein coding sequences", I have started a new thread here to ask that specific question and linked it to this thread.
I really appreciate all the help and time that you've given me, I absolutely have not taken it for granted. I can't believe how difficult I'm finding this!
Perhaps I should have asked this a while ago but what is the final aim of this analysis? What do you hope to get?
I am doing an analysis that compares orthologs between species that are involved in a particular function to see how it differs between species. The first step is to obtain the longest canonical transcript for each gene for each species.
I am currently trying the method I described above, where I've downloaded the "cds_from_genomic.fna" files. I am running a script that pulls out the longest coding sequence for each gene from these files, then I will check to see if there's any mitochondrial etc/anything unusual in output and see where I am from there, maybe compare my output to what I should have based on the feature file.
Have you thought about using precomputed alignments the UCSC/NCBI/Ensembl offer? If you can use them then it can save you a ton of work.