Question

Extract gene sequences from multiple FASTA files using a list of IDs from a file

21 months ago

Junior • 0

Hi guys,

I have gene IDs from 5 different species in multiple text files.

species1.fasta, species2.fasta, species3.fasta, species4.fasta, species5.fasta

fileA.txt contains gene ids from some or all of the above species

fileB.txt contains gene ids from some or all of the above species

fileC.txt contains gene ids from some or all of the above species

fileD.txt contains gene ids from some or all of the above species

fileE.txt contains gene ids from some or all of the above species

How can I extract the sequences for all gene IDs in fileA.txt and save them as fileA.fasta?

And then do the same for all the .txt files using a for loop?

fasta shell • 1.3k views

ADD COMMENT • link updated 20 months ago by Ram 44k • written 21 months ago by Junior • 0

What have you tried?

Also, why is bioinformatics the tag you chose? Every question on the forum is related to bioinformatics and there are better, more specific subject matter tags you could choose.

ADD REPLY • link 21 months ago by Ram 44k

Sorry, what tags do you suggest?

ADD REPLY • link 21 months ago by Junior • 0

There's fasta and shell for starters, but finding relevant tags is also an exercise. If you were to think about it and Google a bit, you'd see that grep and awk are relevant, and from there you might have stumbled upon bioawk and solved your problem as you typed the question. That has happened to me multiple times: writing down a problem in a reproducible manner and listing everything I tried reveals something that might work, which solves the problem even before I create the post.

ADD REPLY • link 21 months ago by Ram 44k

Please clarify whether the "gene" information is already in the headers of the speciesN.fasta files, or whether it is something you first need to find, e.g. by doing a blast search.

ADD REPLY • link 21 months ago by GenoMax 147k

Yes, the "gene" information is already in the headers of the speciesN.fasta files.

ADD REPLY • link 21 months ago by Junior • 0

Use bioawk and process the header to extract gene information. You're going to need to search the forum for ideas and previous solutions - this topic has been addressed a ton of times already. This will be a great learning exercise for you; I hope no one gives you a ready-to-use answer and hurts your learning process.

ADD REPLY • link 21 months ago by Ram 44k

score 0 · Answer 1 · 2023-02-09

21 months ago

seidel 11k

One method is to use fastacmd from the NCBI toolkit: https://manpages.ubuntu.com/manpages/trusty/man1/fastacmd.1.html You'd have to cat all your sequences into a single file and make a blastdb, but that's just two lines of code. Anyway, it's one way to extract sequences from a DB given a list of IDs.
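
For anyone who wants to see the "pool everything, then look up by ID" idea without installing the BLAST tools, here is a minimal stdlib-only Python sketch. To be clear, this is not fastacmd and does not build a BLAST database; it just holds the pooled records in a dictionary. It assumes the gene ID is the first whitespace-delimited word of each FASTA header, and the names `read_fasta`, `build_index` and `fetch` are made up for this illustration.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file; header excludes '>'."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def build_index(fasta_paths):
    """Pool all files into one dict: sequence ID -> (header, sequence).

    Assumption: the ID is the first word of the header line.
    If the same ID occurs twice, the later record wins.
    """
    index = {}
    for path in fasta_paths:
        for header, seq in read_fasta(path):
            parts = header.split()
            if parts:
                index[parts[0]] = (header, seq)
    return index


def fetch(index, ids):
    """Return (found_records, missing_ids), keeping the order of ids."""
    found, missing = [], []
    for gene_id in ids:
        if gene_id in index:
            found.append(index[gene_id])
        else:
            missing.append(gene_id)
    return found, missing
```

Returning the missing IDs separately is a deliberate choice here: with IDs coming from "some or all" of five species, it is useful to know which ones were not found in any file.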

ADD COMMENT • link 21 months ago by seidel 11k

Thank you, Seidel. I will try it ASAP.

ADD REPLY • link 21 months ago by Junior • 0

score 0 · Answer 2 · 2023-02-09

21 months ago

Matthias Zepper 5.0k

seqkit grep would be my tool of choice for this task. The rest of your homework is up to you...

ADD COMMENT • link 21 months ago by Matthias Zepper 5.0k

score 0 · Answer 3 · 2023-02-09

21 months ago

GenoMax 147k

Yes, "gene" information is already in headers of speciesN.fasta files

This is a FAQ on biostars with many threads. Here is one example: How do I extract Fasta Sequences based on a list of IDs?

You will need to cat all your species files.
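
As a rough illustration of the whole task (one output FASTA per ID list, looping over every .txt file), here is a small stdlib-only Python sketch. It is not taken from the linked thread. It reads the species files one after another, which has the same effect as cat-ing them first, and it assumes one gene ID per line in each list and that the ID is the first whitespace-delimited word of the FASTA header. The function names and the `file*.txt` pattern are assumptions for this example.

```python
from pathlib import Path


def load_ids(id_file):
    """Read one gene ID per line into a set, skipping blank lines."""
    with open(id_file) as handle:
        return {line.strip() for line in handle if line.strip()}


def extract(id_file, fasta_files, out_file):
    """Copy every record whose ID is listed in id_file; return how many were written."""
    wanted = load_ids(id_file)
    written = 0
    with open(out_file, "w") as out:
        # Reading the files in turn is equivalent to reading their concatenation.
        for fasta in fasta_files:
            keep = False
            with open(fasta) as handle:
                for line in handle:
                    if line.startswith(">"):
                        # Assumption: the ID is the first word of the header.
                        fields = line[1:].split()
                        keep = bool(fields) and fields[0] in wanted
                        if keep:
                            written += 1
                    if keep:
                        # Guard against a file that lacks a final newline.
                        out.write(line if line.endswith("\n") else line + "\n")
    return written


def extract_all(id_dir, fasta_files, pattern="file*.txt"):
    """For each fileX.txt matching pattern, write fileX.fasta beside it."""
    counts = {}
    for id_file in sorted(Path(id_dir).glob(pattern)):
        out_file = id_file.with_suffix(".fasta")
        counts[out_file.name] = extract(id_file, fasta_files, out_file)
    return counts
```

With the file names from the question, a call such as `extract_all(".", sorted(Path(".").glob("species*.fasta")))` would be the intended usage. The returned counts make it easy to spot an ID list that matched fewer records than expected.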

ADD COMMENT • link 21 months ago by GenoMax 147k