Retrieve amino acid sequences from NCBI EDirect
5 months ago
Patadu94 • 0

Hi all,

I have just finished analysing an RNA-seq dataset with DESeq2 for the first time, for Verticillium dahliae. I obtained a .csv file containing more than 6000 Gene_IDs, and now I would like to use these Gene_IDs to retrieve the corresponding amino acid sequences from NCBI EDirect in FASTA format. I tried the following command:

esearch -db nuccore -query VDAG_00XXX | elink -target protein | efetch -format fasta 

However, when I run the command above, I do not get only the amino acid sequence for my protein but sequences for 1660 proteins (including the one I searched for). Since I have a long list of genes that I would like to submit to NCBI EDirect, do you know how I can retrieve exactly the gene/protein I am querying?

Also, I was looking at the NCBI website and noticed that the Gene_ID in my .csv file is what NCBI calls the Locus_Tag. Can I still use it in my research?
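
For a list this long, it may help to first pull the IDs out of the DESeq2 table into a plain text file, one ID per line. A minimal sketch, assuming the IDs sit in the first column of the .csv and the file has a header row; the function name `extract_ids` and the file names are hypothetical:

```shell
# Hypothetical helper: extract_ids TABLE.csv
# Assumes the first CSV column holds the IDs and the first row is a header.
# Strips double quotes and Windows carriage returns, drops empty lines,
# and prints each ID once (sorted, so the original row order is not kept).
extract_ids() {
    tail -n +2 "$1" | cut -d',' -f1 | tr -d '"\r' | grep -v '^$' | sort -u
}
```

Something like `extract_ids deseq2_results.csv > ids.txt` would then give a list to work from. Note that `cut` is a very simple CSV parser; it is only safe here because locus tags contain no commas.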

Thanks!

E-Direct • 580 views

I found this old post and was able to retrieve a single amino acid sequence for the Gene_ID (Locus_Tag) that I included in the -query. However, when I use this command:

epost -db gene -input "PATH/to/Folder/file.txt" | elink -target protein -name gene_protein_refseq | efetch -format fasta_cds_na

I get an error, and the output contains amino acid sequences from Homo sapiens. Any suggestions on how to obtain the amino acid sequences for the IDs I have listed in my .txt file?


Go to: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000150675.1/

Click the Download button --> RefSeq --> Protein FASTA to get the protein sequences. You can extract the sequences you need, but it sounds like you may have the entire genome.
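
If you go the download route, extracting a subset can be done with awk. This is only a sketch, and it assumes the FASTA headers carry a `[locus_tag=...]` field, as CDS-style headers do. A plain protein FASTA keyed by protein accession would need an ID mapping first. The function name and file names are hypothetical:

```shell
# Hypothetical helper: filter_fasta_by_locus_tag IDS.txt SEQS.fasta
# Prints only the FASTA records whose header contains [locus_tag=ID]
# for an ID listed (one per line) in IDS.txt. IDS.txt must not be empty.
filter_fasta_by_locus_tag() {
    awk '
        NR == FNR {                      # first file: load the wanted IDs
            sub(/\r$/, "")               # tolerate Windows line endings
            if ($0 != "") wanted[$0] = 1
            next
        }
        /^>/ {                           # header line: decide whether to keep the record
            keep = 0
            if (match($0, /\[locus_tag=[^]]+\]/)) {
                tag = substr($0, RSTART + 11, RLENGTH - 12)
                keep = (tag in wanted)
            }
        }
        keep { print }                   # header and sequence lines of kept records
    ' "$1" "$2"
}
```

Usage would look like `filter_fasta_by_locus_tag ids.txt sequences.fasta > subset.fasta`. The match is on the exact tag, so VDAG_0010 would not pull in VDAG_00101.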


Hi GenoMax,

Thanks, in the end I managed to get NCBI EDirect running on the Linux command line. I used the same command as above but chose efetch -format fasta for the output. I could get the AA sequences from that and run them in EffectorP.


This should have worked though (one example)

$ esearch -db gene -query VDAG_00101 | elink -target protein -name gene_protein_refseq | efetch -format fasta_cds_aa
>lcl|XM_009651478.1_prot_XP_009649773.1_1 [locus_tag=VDAG_00101] [db_xref=GeneID:20701564] [protein=MYG1 protein] [protein_id=XP_009649773.1] [location=167..1180] [gbkey=CDS]
MSTLTIGTHNGHFHADEALAVHMLRQLPAYQGASLIRTRDPKLLETCHTVVDVGGEYDAEKNRYDHHQRD
FTTTFPGRSTKLSSAGLVFLHFGRAIIAQKMGTAEDSPDVALLHNKFYESFIEALDAHDNGISVYDHLAV

That definitely works. When I use a single VDAG ID I have no problem, but with the .txt file I had to convert the IDs first and then run it. Anyway, I did get the AA list I was looking for.
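
For anyone landing here with the same problem: one way to avoid converting the IDs may be to loop over the locus tags and run the single-ID esearch pipeline from this thread once per tag. As far as I understand, epost expects database UIDs (numeric Gene IDs) rather than locus tags, which is presumably why the .txt file needed converting. A rough sketch, not tested against the full 6000-gene list; the function name and file names are hypothetical:

```shell
# Hypothetical helper: fetch_proteins_by_locus_tag IDS.txt
# Runs the single-ID pipeline from this thread once per locus tag
# (one tag per line in IDS.txt) and writes all FASTA records to stdout.
fetch_proteins_by_locus_tag() {
    while IFS= read -r tag || [ -n "$tag" ]; do
        tag=$(printf '%s' "$tag" | tr -d '\r')   # tolerate Windows line endings
        [ -z "$tag" ] && continue                # skip blank lines
        # stdin is redirected so esearch cannot consume the rest of the ID list
        esearch -db gene -query "$tag" < /dev/null |
            elink -target protein -name gene_protein_refseq |
            efetch -format fasta_cds_aa
    done < "$1"
}
```

Usage would be `fetch_proteins_by_locus_tag ids.txt > proteins.fasta`. Expect this to be slow for thousands of IDs, since every tag triggers its own set of requests; for a near-complete gene list, downloading the whole protein set once is probably the more practical option.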
