Why big gaps when I use Entrez Eutils to download protein coding sequences.
0
0
Entering edit mode
8.3 years ago
Tom ▴ 50

Hi all, I would like to download all of the longest transcripts for protein coding sequences for the vervet genome (assembly number 132581 in NCBI) in fasta format.

I used the command:

elink -db assembly -target nuccore -id 132581 -name assembly_nuccore_refseq |efetch -db nuccore -format fasta_cds_na >> chlorocebus_vervet.genes

where "id" is the assembly ID for the species.

The problems:

  1. When I type just elink -db assembly -target nuccore -id 132581 -name assembly_nuccore_refseq, the output is (I have had to slightly amend the format, but the numbers have not been altered):

ENTREZ_DIRECT

nuccore

WebEnv>NCID_1_14910726_130.14.22.215_9001_1469710900_373932213_0MetA0_S_MegaStore_F_1

QueryKey>2</querykey<>

Count>7156</count>

Step>1</step>

ENTREZ_DIRECT

Maybe I'm not understanding, but is this telling me that there are 7,156 protein coding sequences being downloaded? When I manually search for "132581" in the NCBI assembly database, the output is: https://www.ncbi.nlm.nih.gov/assembly/GCF_000409795.2/. Then when I click on https://www.ncbi.nlm.nih.gov/genome/?term=txid60711[orgn] and "Gene", and click "protein coding category"; I can see that there are 20,633 records that I would like the fasta sequences for.

  1. When I run this command (elink -db assembly -target nuccore -id 132581 -name assembly_nuccore_refseq |efetch -db nuccore -format fasta_cds_na >> chlorocebus_vervet.genes) two things happen: (a) Not all 7,156 records download, the number stops at somewhere between 150-400. (b) There are large uneven gaps (i.e. empty lines) randomly in the fasta output. I tried to copy and paste to here, but the large gaps are automatically stripped.

Could someone please tell me what I am doing wrong, and how I retrieve all of the longest transcripts for 20,633 protein coding sequences for this species.

Thanks.

entrez eutils • 3.0k views
ADD COMMENT
1
Entering edit mode

There are 61803 predicted proteins in this file for the accession that you posted above.

ADD REPLY
0
Entering edit mode

Do you want to get the DNA sequence or protein (you had a post about it but your appear to have deleted it).

This is an interesting problem. I can see a few ways of getting the info for those 20,633 genes.

  1. You could get their gi numbers (at this time, won't work after Sept 2016) and then retrieve sequence that way.
  2. Choose "Send to file" and then "tabular text view" to download full table. Cut the interval columns out for the locations and then use getfasta from bedtools to recover the DNA sequence.
  3. There is also this file which gives you features defined for this genome. Following command will get you the BED format file for genes. There are multiple entries for each so you will need to trim those down. cut -f 7-9,15 GCF_000409795.2_Chlorocebus_sabeus_1.1_feature_table.txt An example below

genomic_accession start end symbol
NC_023642.1 2628 4911 SCGB1C1
NC_023642.1 2628 4911 SCGB1C1
NC_023642.1 3658 4619 SCGB1C1
NC_023642.1 4916 10379 ODF3
NC_023642.1 4916 10379 ODF3
NC_023642.1 4916 10168 ODF3
NC_023642.1 4916 10168 ODF3
NC_023642.1 7313 10166 ODF3
NC_023642.1 7313 9932 ODF3
NC_023642.1 7313 9932 ODF3

ADD REPLY
0
Entering edit mode

Thank you. The link provided by you (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000409795.2_Chlorocebus_sabeus_1.1/GCF_000409795.2_Chlorocebus_sabeus_1.1_protein.faa.gz ) was particularly useful as it pointed me to where I should be looking. To clarify, I want the coding sequences and not amino acids.

I have a question, so I didn't see your second answer, the possible solution to obtain coding sequences, and I was trying to work on my own and this is what I came up with; I'm wondering if this is an alternative way to obtain protein coding sequences (nuclear, not mitochrondial)?

I can see in that directory (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000409795.2_Chlorocebus_sabeus_1.1/ ), that there is another file called GCF_000409795.2_Chlorocebus_sabeus_1.1_cds_from_genomic.fna. I looked up what the "cds_from_genomic" files are here: http://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/, and I can see that they are: "FASTA format of the nucleotide sequences corresponding to all CDS features annotated on the assembly, based on the genome sequence." So would another way of obtained protein coding sequences be to:

  1. Obtain the cds_genomic_fna file.
  2. Remove all of the sequences that have "cds_YP" in the title (mitochondrial), keeping only those with cds_XP (nuclear) in the title.
  3. For the remaining proteins, select the longest protein (i.e. XP number) for each gene (GeneID number).
  4. This will give me a list of longest transcripts for each gene/protein?

Thank you again for the effort, I appreciate it.

ADD REPLY
0
Entering edit mode

If you only want the 20633 genes then going with #2 in my post above is the best option.

But otherwise what you are describing should work. This may be equal to #3 that I had posted above. There appear to be only 13 entries for cds_YP in the cds_from_genomic file with 61790 cds_XP entries.

ADD REPLY
0
Entering edit mode

For option #2 above it turns out there are 25752 features in the file (protein coding genes, gene_result.txt).

Extracting some fields out of the original file (you can adjust as needed)

awk -F '\t' 'BEGIN{OFS="\t";}{print $12,$13,$14,$11,$6,$8,$15}' gene_result.txt > monkey.bed

Then extract the sequence

bedtools getfasta -fi GCF_000409795.2_Chlorocebus_sabeus_1.1_genomic.fna -bed monkey.bed -fo out.fa

The intervals (full gene) in this file include exons and introns (upper and lower case letters in final output, verify that once). So that is what you get.

ADD REPLY
0
Entering edit mode

Thank you for the reply, I really appreciate your help. So I did the above method, got an output, and wanted to check that the results made sense. To explain the problem that I see that keeps coming up, if we look at the TRIM5 gene in Pan Troglodytes (chimpanzee).

In Ensembl, when I go to Biomart, and filter the search to only look for TRIM5 in "associated gene names" in the data set CHIMP2.1.4, and obtain the ensembl gene name, transcript name and coding sequence, I get this sequence (on the webpage, you might have to click "results" to see the sequence). This is exactly what I would expect the longest transcript coding sequence of a gene to look like, in terms of length (although length obviously varies).

When I searched for TRIM5 in NCBI here; if I go down to the section “Genomic regions, transcripts and products”, change the genomic sequence to “NC_006478.3 Chromosome 11 reference pan troglodytes-2.1.4 primary assembly” (only so it matches the assembly in Ensembl) and then click “go to nucleotides -> fasta”, the output (which is here) is almost 300,000 base pairs. This is the same length of the gene that you get using method 2 described above, getting the gene information in tabular view. So then I thought to myself, it must be the BED file that's wrong.

When you search the NUCLEOTIDE database in NCBI for TRIM5 in chimp , the title is Pan troglodytes TRIM5alpha (TRIM5) gene, complete cds see here. If you remove the “NNNN” letters from the NCBI nucleotide, the two sequences (NCBI and Ensembl) are the exact same length (1,482 base pairs) and almost identical in composition. So, in this case, the BED file should have said this gene, but for some reason it is pulling out a much longer section of DNA.

So now I know that the problem isn't the method, but it's the BED file that the method is working on. Clearly GENE is not the right database to be working in, and I should be in the NUCLEOTIDE database instead. So then I went to the NUCLEOTIDE database, and I searched for "txid9598[Organism]" (which extracts all chimpanzee genes) in the search bar. Then I filtered by only Pan troglodytes species and genomic DNA/RNA. The problem is that I'm still left with >307,000 sequences. I cannot find how to whittle this number down using the filters.

So my question is, in your method 2 described above, when you say "Choose "Send to file" and then "tabular text view" to download full table"; are you in the GENE database (which is what I was originally working in; giving me incorrect longest coding sequences), or is it the NUCLEOTIDE database that you've somehow whittled down from ~300,000 sequences to ~20,000 sequences? I can't stress enough how much I appreciated your reply, even though I'm still having problems, I would have been a lot further behind without you.

ADD REPLY
0
Entering edit mode

Before I look at this are you happy with the C. sabeus results? Now you are looking at Chimp?

ADD REPLY
0
Entering edit mode

Sorry I replied and only saw this after. I actually have the same problem with everything, but I can't show you what the C. sabeus results should look like in Ensembl because the data isn't there yet, so I moved to chimpanzee because I can see the results that i should have using Ensembl, and then the results that I actually have in NCBI. Then if the results are the same for both, I can assume that the method works for the other species not in Ensembl.

ADD REPLY
0
Entering edit mode

Actually I can kind of get it to work (but there's still a problem) if I use the feature table method (method 3 above). I pulled down the feature table for pan troglodytes, and i pulled out the lines associated with TRIM5.

But I can see two feature descriptions of interest for TRIM5, one is there is a line with "gene...protein coding...5442276-5737937". this is 300,000 base pairs described above. But there's another feature category of sequences, not called GENE, called CDS. And I can see that the length of the longest CDS for this gene is 1,482 base pairs, which is exactly what the Ensembl TRIM5 longest transcript is.

If I pull down the feature table, remove everything except coding sequences, and then for TRIM5, find the length of the longest coding sequence, that will work I think.

However, in your example of making the BED file using this method you have this.

genomic_accession start end symbol 
NC_023642.1 2628 4911 SCGB1C1
NC_023642.1 2628 4911 SCGB1C1
NC_023642.1 3658 4619 SCGB1C1

In reality, the lines for TRIM5 are like this:

NC_006478.4  5461069  5476417 TRIM5
NC_006478.4 5461069 5476417 TRIM5
NC_006478.4 5461069 5476417 TRIM5

i.e. your start and end points are ~2,000 base pairs long, whereas mine is >15,000 base pairs long. I don't know where this 1,482 base pairs starts and ends in the genomic accession ( where is the column that specifically tells me where the start and end bases that make up a length of 1,482 base pairs of interest in the feature table, I can't see one).

So now I'm wondering do I have to use the feature table to identify the longest protein. Extract the protein/RNA name, and then do something else with bedtools to give it a list of protein names, and it will give me back the coding sequences for the protein names?

ADD REPLY
0
Entering edit mode

I am looking at two genes (examples) in the C. sabeus.

Here are all the entries for the two genes in the feature table (method #3). I have cut some of the columns that are not interesting out.

feature class   genomic_accession   start        end        product_accession related_accession symbol  feature_interval_length product_length
gene    protein_coding  NC_023642.1 2628        4911                                             SCGB1C1    2284    
mRNA                    NC_023642.1 2628        4911        XM_007993457.1  XP_007991648.1       SCGB1C1    1754        1754
CDS                     NC_023642.1 3658        4619        XP_007991648.1  XM_007993457.1       SCGB1C1     432        143

gene    protein_coding  NC_023642.1 61040908    61067815                                         TRIM5      26908   
mRNA                    NC_023642.1 61040908    61067815    XM_008019877.1  XP_008018068.1       TRIM5      2264        2264
mRNA                    NC_023642.1 61041019    61067815    XM_008019878.1  XP_008018069.1       TRIM5      2301        2301
CDS                     NC_023642.1 61050550    61067265    XP_008018068.1  XM_008019877.1       TRIM5      1548        515
CDS                     NC_023642.1 61050550    61067265    XP_008018069.1  XM_008019878.1       TRIM5      1548        515

If you need just the CDS then you would need to grep out those lines from the feature table and then extract the start and stops.

ADD REPLY
0
Entering edit mode

In P. tro TRIM5 gene has more entries becuase there are additional genbank entries but 3 CDS's based on the length of the feature they code for.

feature class           genomic_accession   start   end product_accession   related_accession   symbol  feature_interval_length product_length
gene    protein_coding  NC_006478.4      5442276    5737937                                     TRIM5   295662  
mRNA                    NC_006478.4      5442276    5481346 XM_016919895.1  XP_016775384.1      TRIM5   1280        1280
CDS                     NC_006478.4      5442355    5476417 XP_016775384.1  XM_016919895.1      TRIM5   903         300
mRNA                    NC_006478.4      5459818    5737937 XM_016919890.1  XP_016775379.1      TRIM5   3163        3163
mRNA                    NC_006478.4      5459818    5481357 XM_016919894.1  XP_016775383.1      TRIM5   1669        1669
mRNA                    NC_006478.4      5459818    5481357 XM_016919889.1  XP_016775378.1      TRIM5   2894        2894
mRNA                    NC_006478.4      5459818    5481346 XM_016919893.1  XP_016775382.1      TRIM5   1806        1806
mRNA                    NC_006478.4      5459818    5481346 XM_016919888.1  XP_016775377.1      TRIM5   3031        3031
mRNA                    NC_006478.4      5459818    5481103 XM_016919891.1  XP_016775380.1      TRIM5   3671        3671
CDS                     NC_006478.4      5460345    5476417 XP_016775382.1  XM_016919893.1      TRIM5   981         326
CDS                     NC_006478.4      5460345    5476417 XP_016775383.1  XM_016919894.1      TRIM5   981         326
mRNA                    NC_006478.4      5461069    5476417 NM_001012650.1  NP_001012668.1      TRIM5   1482        1482
CDS                     NC_006478.4      5461069    5476417 XP_016775377.1  XM_016919888.1      TRIM5   1482        493
CDS                     NC_006478.4      5461069    5476417 XP_016775378.1  XM_016919889.1      TRIM5   1482        493
CDS                     NC_006478.4      5461069    5476417 XP_016775379.1  XM_016919890.1      TRIM5   1482        493
CDS                     NC_006478.4      5461069    5476417 XP_016775380.1  XM_016919891.1      TRIM5   1482        493
CDS                     NC_006478.4      5461069    5476417 NP_001012668.1  NM_001012650.1      TRIM5   1482        493
ADD REPLY
0
Entering edit mode

Thanks so much. So in this last example (TRIM5 for chimpanzee), I can see the three coding sequences, one has a "longest canonical transcript/CCDS/coding sequence" of 903 bp, making a protein 300 aa in length, one has a coding sequence length of 981bp/326aa and the last one has a coding sequence length of 1,482bp/493aa.

So in this case, I want to pull out the 1,482 base pairs that make up the last "longest canonical" transcript. I know exactly what the sequence should look like. In the links above, this is the exact 1,482 base pairs here and in NCBI, if you remove the "NNN" from this sequence which is 2,041 base pairs long, it is here. So I know this number is right and I know what the "answer"/coding sequence that is extracted should look like.

So, this is my problem, in the example above (TRIM5 for chimpanzee), where are the start and end positions in that table, that when you subtract the start from the end, gives you a 1,482 base pair sequence (or 2,041 base pairs with NNN); i.e. the co-ordinates that I should get the sequence from?

Because the only other numbers I can see in that example are the starts and ends of the genomic accession, where 5476417 - 5461069 = 15,348 base pairs. So then I thought to myself: maybe it's the 15,348 base pairs, but removing the lower case letters. But in this 15,348 base pair region, there are 8,367 upper case letters.

So then I thought, maybe there's no way to obtain a 1,482 bp set of co-ordinates from this table. But then in that case, I'll be left with other rubbish in my sequence, when I just want the longest canonical transcript for the protein? I was just googling to make sure I understood what the longest CDS was here; so you can see there the CDS is directly translated into protein; so I'm pretty sure the sequence that I want should be 1,482 base pairs in length?

Thank you so much, I appreciate it.

ADD REPLY
0
Entering edit mode

Go here and the search for TRIM5 when the page loads (warning: large download). You will see how the CDS entries are annotated. I think a lot of this may be attributable to automated annotation.

ADD REPLY
0
Entering edit mode

Thank you so much for the reply.

I'm wondering if the best method would be to use the features table described above to identify the protein of interest based on length; make a list of the proteins of interest, and then use Batch Entrez/Eutils to extract the coding sequences for this set of proteins.

I have tried to do this and run into the problem that I can download the full mRNA sequence, but not the mRNA sequence that specifically encodes the protein (pretty much the same issue as described above).

Since how to obtain a set of coding sequences from a set of protein IDs has gone off the original topic of this post "Why big gaps using Entrez Eutils when trying to download protein coding sequences", I have started a new thread here to ask that specific question and linked it to this thread.

I really appreciate all the help and time that you've given me, I absolutely have not taken it for granted. I can't believe how difficult I'm finding this!

ADD REPLY
0
Entering edit mode

Perhaps I should have asked this a while ago but what is the final aim of this analysis? What do you hope to get?

ADD REPLY
0
Entering edit mode

I am doing an analysis that compares orthologs between species that are involved in a particular function to see how it differs between species. The first step is to obtain the longest canonical transcript for each gene for each species.

I am currently trying the method I described above, where I've downloaded the "cds_from_genomic.fna" files. I am running a script that pulls out the longest coding sequence for each gene from these files, then I will check to see if there's any mitochondrial etc/anything unusual in output and see where I am from there, maybe compare my output to what I should have based on the feature file.

ADD REPLY
0
Entering edit mode

Have you thought about using precomputed alignments the UCSC/NCBI/Ensembl offer? If you can use them then it can save you a ton of work.

ADD REPLY

Login before adding your answer.

Traffic: 1434 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6