Question

Retrieve amino acid sequences from NCBI E-Direct

13 months ago

Patadu94 • 0

Hi all,

I have just finished analysing an RNA-seq dataset for Verticillium dahliae with DESeq2 for the first time. I now have a .csv file with more than 6000 Gene_IDs, and I would like to use them to retrieve the corresponding amino acid sequences in FASTA format with NCBI E-Direct. I tried the following code:

esearch -db nuccore -query VDAG_00XXX | elink -target protein | efetch -format fasta

However, when I run this code, I do not get just the amino acid sequence for that protein: I get 1660 protein sequences, including the one I searched for. Since I have a long list of genes to submit to NCBI E-Direct, do you know how I can retrieve only the exact gene/protein for each ID?

Also, I noticed on the NCBI website that the Gene_ID in my .csv file is called a Locus_Tag there. Can I still use it in my research?

Thanks!

E-Direct • 1.0k views

ADD COMMENT • link 13 months ago by Patadu94 • 0

I found this old post and was able to retrieve a single amino acid sequence for the Gene_ID (Locus_Tag) that I pass to -query. However, when I use this code:

epost -db gene -input "PATH/to/Folder/file.txt" | elink -target protein -name gene_protein_refseq | efetch -format fasta_cds_na

I get an error, and the output contains amino acid sequences from Homo sapiens. Any suggestions on how to obtain the amino acid sequences for the IDs listed in my .txt file?

ADD REPLY • link 13 months ago by Patadu94 • 0

Go to: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000150675.1/

Click the Download button --> RefSeq --> Protein FASTA to get the protein sequences. You can extract the sequences you need, but it sounds like you may have the entire genome.

ADD REPLY • link 13 months ago by GenoMax 152k

Hi GenoMax

Thanks. In the end I managed to get NCBI EDirect running on the Linux command line. I used the same code as above but chose efetch -format fasta as the output format. I got the AA sequences from that and ran them through EffectorP.

ADD REPLY • link 13 months ago by Patadu94 • 0

This should have worked, though (one example):

$ esearch -db gene -query VDAG_00101 | elink -target protein -name gene_protein_refseq | efetch -format fasta_cds_aa
>lcl|XM_009651478.1_prot_XP_009649773.1_1 [locus_tag=VDAG_00101] [db_xref=GeneID:20701564] [protein=MYG1 protein] [protein_id=XP_009649773.1] [location=167..1180] [gbkey=CDS]
MSTLTIGTHNGHFHADEALAVHMLRQLPAYQGASLIRTRDPKLLETCHTVVDVGGEYDAEKNRYDHHQRD
FTTTFPGRSTKLSSAGLVFLHFGRAIIAQKMGTAEDSPDVALLHNKFYESFIEALDAHDNGISVYDHLAV

ADD REPLY • link 13 months ago by GenoMax 152k
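
If a whole batch of records comes back with headers in the format shown above, one way to pull out just the genes of interest is to filter on the [locus_tag=...] field. This is only a rough sketch: it assumes every header carries that field (the plain RefSeq protein FASTA from the Datasets page may be keyed by protein accession instead, so check a few headers first), and the helper name filter_by_locus_tag and both file names are made up for the example.

```shell
# Sketch only: keep FASTA records whose [locus_tag=...] value is listed in a file.
# filter_by_locus_tag is a made-up helper name, not part of EDirect.
# Usage: filter_by_locus_tag wanted_tags.txt all_proteins.fasta
filter_by_locus_tag() {
  awk '
    # First file: remember each wanted locus tag (skip blanks, drop a trailing \r)
    NR == FNR { sub(/\r$/, ""); if ($0 != "") want[$0] = 1; next }
    # Header line: decide whether the record that starts here should be printed
    /^>/ {
      keep = 0
      if (match($0, /\[locus_tag=[^]]+\]/)) {
        # "[locus_tag=" is 11 characters long; the closing "]" is 1 more
        tag = substr($0, RSTART + 11, RLENGTH - 12)
        keep = (tag in want)
      }
    }
    # Print the header and its sequence lines only while keep is set
    keep
  ' "$1" "$2"
}
```

Redirecting the output (filter_by_locus_tag wanted_tags.txt all_proteins.fasta > subset.fasta) would give a FASTA file with only the listed genes. Note that the tag list must not be empty, otherwise awk cannot tell the two input files apart.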

That definitely works. When I use a single VDAG ID I have no problem, but with the .txt file I had to convert the IDs first and then run it. Anyway, I did get the AA list I was looking for.

ADD REPLY • link 13 months ago by Patadu94 • 0
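
For anyone with the same problem, the ID conversion mentioned above could look roughly like this. It is a guess at what such a step might involve, not the exact commands used in this thread. It assumes a text file with one locus tag per line, that each tag finds the right record in the gene database (worth spot-checking a few), and that epost wants numeric Gene UIDs rather than locus tags. The function name tags_to_uids and the EDIRECT_DELAY variable are made up for the example.

```shell
# Sketch only: turn locus tags (one per line) into Entrez Gene UIDs.
# tags_to_uids and EDIRECT_DELAY are made-up names for this example.
tags_to_uids() {
  # tr drops Windows line endings that can come along from spreadsheet exports;
  # the "|| [ -n ... ]" part also handles a last line without a trailing newline
  tr -d '\r' < "$1" | while IFS= read -r tag || [ -n "$tag" ]; do
    # Skip empty lines
    [ -n "$tag" ] || continue
    # </dev/null stops esearch from consuming the rest of the ID list on stdin
    esearch -db gene -query "$tag" < /dev/null | efetch -format uid
    # Short pause between requests (seconds)
    sleep "${EDIRECT_DELAY:-1}"
  done
}
```

The resulting UIDs could then go into the same pipeline as in the earlier posts (ids.txt and uids.txt are placeholder file names):

tags_to_uids ids.txt > uids.txt
epost -db gene -input uids.txt | elink -target protein -name gene_protein_refseq | efetch -format fasta_cds_aa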