Extract gene sequences from multiple fasta using a list from a file:
21 months ago
Junior • 0

Hi guys,

I have gene IDs from 5 different species in multiple text files.

species1.fasta, species2.fasta, species3.fasta, species4.fasta, species5.fasta

fileA.txt contains gene ids from some or all of the above species

fileB.txt contains gene ids from some or all of the above species

fileC.txt contains gene ids from some or all of the above species

fileD.txt contains gene ids from some or all of the above species

fileE.txt contains gene ids from some or all of the above species

How can I extract the sequences for all gene IDs in fileA.txt and save them as fileA.fasta?

And how can I then do the same for all the .txt files using a for loop?

fasta shell • 1.3k views

What have you tried?

Also, why is bioinformatics the tag you chose? Every question on the forum is related to bioinformatics and there are better, more specific subject matter tags you could choose.

Sorry, what tags do you suggest?

There's fasta and shell for starters, but finding relevant tags is also an exercise. If you were to think about it or Google a bit, you'd see that grep and awk are relevant, and from there you might have stumbled upon bioawk, solving your problem as you typed the question. That has happened to me multiple times: writing down a problem in a reproducible manner and listing everything I tried reveals something that might work, ultimately solving the problem even before creating the post.

Please clarify whether the "gene" information is already in the headers of the speciesN.fasta files, or whether it is something you first need to find, e.g. by running a BLAST search.

Yes, the "gene" information is already in the headers of the speciesN.fasta files.

Use bioawk and process the header to extract gene information. You're going to need to search the forum for ideas and previous solutions - this topic has been addressed a ton of times already. This will be a great learning exercise for you; I hope no one gives you a ready-to-use answer and hurts your learning process.

21 months ago
seidel 11k

One method is to use fastacmd from the NCBI toolkit: https://manpages.ubuntu.com/manpages/trusty/man1/fastacmd.1.html. You'd have to cat all your sequences into a single file and make a BLAST database from it, but that's just two lines of code. Anyway, it's one way to extract sequences from a database given a list of IDs.
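A rough sketch of those two steps, written with the newer BLAST+ commands makeblastdb and blastdbcmd rather than the legacy fastacmd linked above (that swap is mine, not part of the original suggestion). The helper names, the file names, and the nucleotide database type are all assumptions; adjust them to your data.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and file names are hypothetical.

# build_blast_db ALL.fasta DBNAME
build_blast_db() {
    # -parse_seqids indexes the header IDs so they can be looked up later.
    # Assumption: nucleotide sequences; use "-dbtype prot" for proteins.
    makeblastdb -in "$1" -dbtype nucl -parse_seqids -out "$2"
}

# fetch_ids DBNAME IDS.txt OUT.fasta
fetch_ids() {
    # -entry_batch takes a file with one sequence ID per line;
    # the default output format is FASTA.
    blastdbcmd -db "$1" -entry_batch "$2" -out "$3"
}

# Example usage (not run here):
#   cat species*.fasta > all_species.fasta
#   build_blast_db all_species.fasta all_species_db
#   fetch_ids all_species_db fileA.txt fileA.fasta
```

Note that blastdbcmd may rewrite the header lines slightly, so check the output headers if downstream tools depend on them.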

Thank you, Seidel. I will try it asap

21 months ago

seqkit grep would be my tool of choice for this task. The rest of your homework is up to you...
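For reference, a minimal sketch of what such a seqkit grep call might look like. It assumes each line of the ID list matches the sequence ID (the first word of a FASTA header) exactly, and that the species files have already been concatenated into one file. The helper name and file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: helper name and file names are hypothetical.

# extract_with_seqkit IDS.txt ALL.fasta OUT.fasta
extract_with_seqkit() {
    # -f reads the patterns (here: gene IDs) from a file, one per line.
    # By default seqkit grep compares them against the sequence ID,
    # i.e. the header text before the first whitespace.
    seqkit grep -f "$1" -o "$3" "$2"
}

# Example usage (not run here):
#   cat species*.fasta > all_species.fasta
#   for list in file?.txt; do
#       extract_with_seqkit "$list" all_species.fasta "${list%.txt}.fasta"
#   done
```

If the IDs sit elsewhere in the header, seqkit grep also has options for matching the full header line or a regular expression; see its help output before relying on them.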

21 months ago
GenoMax 147k

Yes, "gene" information is already in headers of speciesN.fasta files

This is a FAQ on Biostars with many threads. Here is one example: "How do I extract Fasta Sequences based on a list of IDs?"

You will need to cat all your species files into a single file first.
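If no extra tools are installed, the same idea can be sketched with plain awk and a shell loop. This is only one possible approach, under these assumptions: the gene ID is the first whitespace-delimited word after ">" in each header, each list file holds one ID per line, and the helper names below are made up for illustration. Passing several FASTA files to awk has the same effect as concatenating them first.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# extract_by_ids IDS.txt FASTA... > OUT.fasta
extract_by_ids() {
    local ids=$1
    shift
    awk -v idfile="$ids" '
        BEGIN {
            # Load the wanted IDs, one per line (tolerating CRLF line ends).
            while ((getline line < idfile) > 0) {
                sub(/\r$/, "", line)
                split(line, f, " ")
                if (f[1] != "") wanted[f[1]] = 1
            }
            close(idfile)
        }
        # Header line: take the first word after ">" as the ID and decide
        # whether this record should be kept.
        /^>/ { id = substr($1, 2); keep = (id in wanted) }
        # Print the header and all sequence lines of kept records.
        keep
    ' "$@"
}

# extract_all_lists FASTA...
# Writes NAME.fasta for every NAME.txt in the current directory.
extract_all_lists() {
    local list
    for list in *.txt; do
        [ -e "$list" ] || continue
        extract_by_ids "$list" "$@" > "${list%.txt}.fasta"
    done
}
```

With the file names from the question, `extract_all_lists species1.fasta species2.fasta species3.fasta species4.fasta species5.fasta` would produce fileA.fasta through fileE.fasta. Name the species files explicitly rather than using `*.fasta`, so that output from an earlier run is not read back in as input.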
