Get cDNA sequences from NCBI for given protein IDs
2
0
Entering edit mode
5.3 years ago
Chirag Parsania ★ 2.0k

Hi,

I have following NCBI protein ids, which I got from blast result. I want cDNA sequences for all of them. How can I get them. I tried batch entrez but it didn't work

WP_114529907
WP_086161776
WP_023545800
WP_055635421
WP_046726840
WP_019985024
WP_030816474
WP_015656893
WP_066997879
WP_076090756
WP_046591128
WP_037737828
WP_019061909
WP_031182504
WP_029184528
WP_107430231
WP_075738394
WP_018547093
WP_050371941
WP_058922273
WP_014675804
WP_010039119
WP_059042595
WP_031045624
WP_003993952
WP_067124593
WP_069770111
WP_085209034
ncbi proteinid cdna • 1.9k views
ADD COMMENT
0
Entering edit mode

Since these are WP_* records they are special non-redundant protein ID's. They refer to more than one genome so that is a reason you are not able to get cDNA sequence directly.

ADD REPLY
0
Entering edit mode

Thanks!

Somehow I got the nucleotide (cDNA) co-ordinates for my query protein id (Wp_*). Is there a way to get sequences from NCBI using nucletide co-ordinates ?

Protein        `Nucleotide Accession`   Start    Stop Strand
  <chr>          <chr>                    <dbl>   <dbl> <chr> 
1 WP_115909503.1 NZ_QTTG01000001.1      5250192 5251037 +     
2 WP_115909964.1 NZ_QTTG01000001.1      5851735 5852526 -     
3 WP_116145412.1 NZ_QREB01000001.1      1915696 1916487 +     
4 WP_117489828.1 NZ_QVIG01000001.1      7216177 7217025 +     
5 WP_119102858.1 NZ_QXMJ01000158.1        70757   71548 -     
6 WP_119295284.1 NZ_KZ984569.1            27367   28866 -
ADD REPLY
2
Entering edit mode
5.3 years ago
Sej Modha 5.3k

You could get all CDS associated with the nucleotide record and filter the fasta file for sequences of interest, based on the presence of the protein ID in the header.

elink -target nuccore -db protein -name protein_nuccore -id WP_114529907|efetch -format fasta_cds_na > all_cds.fasta
ADD COMMENT
0
Entering edit mode

Bingo... !! How to give file containing several Ids instead of just one id

ADD REPLY
0
Entering edit mode

I'd suggest writing a for loop to process each ID separately.

ADD REPLY
0
Entering edit mode
5.3 years ago
Chirag Parsania ★ 2.0k

Once all_cds.fasta downloaded as mentioned above by @SejModha, below is the R script to get the query sequences.

library(tidyverse)

## ncbi query protein ids  for which  we need cDNA fasta file 
query_ids <- c("WP_030938735.1", "WP_073767452.1", "WP_075030623.1", "WP_073737703.1", "WP_084907526.1", "WP_058922273.1", "WP_069743312.1", "WP_063483499.1", "WP_086727825.1", "WP_100828975.1") %>% tibble(V1= .)


## get sequences in R
fasta_cdna <- Biostrings::readDNAStringSet("./all_cds.fasta")
fasta_cdna_ids <- base::names(fasta_cdna) 
fasta_cdna_ids %>% head()
#> [1] "lcl|NZ_QLLY01000003.1_cds_WP_146614694.1_1 [locus_tag=K377_RS06515] [protein=DUF4331 domain-containing protein] [protein_id=WP_146614694.1] [location=<1..501] [gbkey=CDS]"                            
#> [2] "lcl|NZ_QLLY01000003.1_cds_WP_111662221.1_2 [locus_tag=K377_RS06520] [protein=hypothetical protein] [protein_id=WP_111662221.1] [location=583..1914] [gbkey=CDS]"                                       
#> [3] "lcl|NZ_QLLY01000003.1_cds_WP_111662644.1_3 [locus_tag=K377_RS06525] [protein=nickel transporter] [protein_id=WP_111662644.1] [location=2067..3539] [gbkey=CDS]"                                        
#> [4] "lcl|NZ_QLLY01000003.1_cds_WP_146614695.1_4 [locus_tag=K377_RS06530] [protein=hypothetical protein] [protein_id=WP_146614695.1] [location=complement(3622..3861)] [gbkey=CDS]"                          
#> [5] "lcl|NZ_QLLY01000003.1_cds_WP_111662645.1_5 [locus_tag=K377_RS06535] [protein=ABC transporter permease] [protein_id=WP_111662645.1] [location=complement(3995..4783)] [gbkey=CDS]"                      
#> [6] "lcl|NZ_QLLY01000003.1_cds_WP_111662223.1_6 [locus_tag=K377_RS06540] [protein=ATP-binding cassette domain-containing protein] [protein_id=WP_111662223.1] [location=complement(4822..5922)] [gbkey=CDS]"

## from original header get protein id for each seq and assign it as sequence id 
get_prot_id <- as_mapper(~ gsub(pattern = ".*\\[protein_id=(.*?)\\].*" , replacement = "\\1" ,x = .))

pp_id <-map_chr(fasta_cdna_ids ,get_prot_id)

length(pp_id)
#> [1] 1131303

pp_id %>% sample(10)
#>  [1] "WP_062925461.1"                                                                                                                                                     
#>  [2] "WP_143613249.1"                                                                                                                                                     
#>  [3] "WP_100109264.1"                                                                                                                                                     
#>  [4] "WP_095873712.1"                                                                                                                                                     
#>  [5] "WP_078631405.1"                                                                                                                                                     
#>  [6] "WP_099926778.1"                                                                                                                                                     
#>  [7] "lcl|NZ_CP041609.1_cds_8438 [locus_tag=FNV67_RS44310] [protein=hypothetical protein] [pseudo=true] [partial=3'] [location=complement(<9384513..9384704)] [gbkey=CDS]"
#>  [8] "WP_143607681.1"                                                                                                                                                     
#>  [9] "WP_067427611.1"                                                                                                                                                     
#> [10] "WP_097242740.1"

names(fasta_cdna) <- pp_id

## keep only unique sequences  
fasta_cdna_uniq <- fasta_cdna[ ! (names(fasta_cdna) %>% duplicated())] 

## get your sequences for your query id 
query_fasta_cdna <- fasta_cdna_uniq [names(fasta_cdna_uniq) %in% query_ids$V1] 
length(query_fasta_cdna)
#> [1] 9

## ids for which sequences not found 
not_found <- query_ids$V1[! query_ids$V1 %in% names(query_fasta_cdna)]

not_found %>% length()
#> [1] 1

## write fasta 
Biostrings::writeXStringSet(query_fasta_cdna , filepath = "./query_id_cdna.fasta")

## write not found ids
write_delim(path =  "./query_id_not_found.txt" , x = not_found %>% as.data.frame() , delim = "\t")
ADD COMMENT
1
Entering edit mode

You can get the protein sequence easily using Entrezdirect:

$ efetch -db protein -id "WP_030938735.1" -format fasta
>WP_030938735.1 serine protease [Streptomyces violaceoruber]
MFGFNRAKKTAAAVAATAAAAATVLIGAPAAVAAPQPIVGGTTTTTTAYPFMMQITDASQNQFCGGTLVS
ATKVVTAAHCMVGESTSSVRVVGGRTYLNGTNGTVSRVSKIWINPSYTDATNGDDVAVLTLSTSMPYTKA
SYVSSSQTSVYAAGTTARIIGWGTTSSNGSSSNQLRTATVPIVSDSSCASSYGSEYVKSDMVCAGYTSGG
TDTCQGDSGGPLLIGGVLAGITSWGEGCAQAGYPGVYTRLTTFSSLVTAQVNS
ADD REPLY
0
Entering edit mode

I need cDNA (nt) sequences not protein.

ADD REPLY
0
Entering edit mode

Did you not get those directly from @Sej's answer? Perhaps I am missing something simple. These are bacterial sequences so there should be no introns. Are you just making a non-redundant sequence file to remove dups?

ADD REPLY
0
Entering edit mode

yes, I got from @Sej's answer. However, its partial solution as we further need to filter all_cds.fasta. And that's what I mentioned in the Rscript.

ADD REPLY
0
Entering edit mode

I wrote complete solution here in step by step manner

ADD REPLY

Login before adding your answer.

Traffic: 2518 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6