Question

How to extract specific FASTA sequences by matching their IDs against different source files?

0

3.6 years ago

pinn ▴ 210

Hi

I have thousands of multi-line text files of sequence IDs; each text file contains only 9 sequence IDs.
Example file:

>gene1

>gene2

>gene3

>gene4

>gene5

>gene6

>gene7

>gene8

>gene9


I'm interested in matching the gene1 ID against ex1fasta and extracting only the gene1 sequence to a separate file, and likewise for each of the other IDs:
gene1 ID to ex1fasta >> out.fa
gene2 ID to ex2fasta >> out.fa
gene3 ID to ex3fasta >> out.fa
gene4 ID to ex4fasta >> out.fa
gene5 ID to ex5fasta >> out.fa
gene6 ID to ex6fasta >> out.fa
gene7 ID to ex7fasta >> out.fa
gene8 ID to ex8fasta >> out.fa
gene9 ID to ex9fasta >> out.fa

I tried following this post, "How to extract fasta sequences and only its ID's, based on the subsequence fasta numbers from a main fasta file ?", but I'm unable to reproduce the result for my analysis. Any suggestions?

RNA alignment DNA • 1.5k views

updated 3.6 years ago by Prakash ★ 2.2k • written 3.6 years ago by pinn ▴ 210

score 2 · Answer 1 · 2021-04-10

2

3.6 years ago

Prakash ★ 2.2k

Using GNU parallel and seqtk, you can extract only the matching sequence ID and its sequence from the corresponding FASTA file.

See the code below:

parallel --verbose 'echo "gene{}" > tmp_{}.txt && seqtk subseq ex{}fasta tmp_{}.txt | seqtk rename - gene{}_ex{}fasta >> out.fa' ::: {1..9}
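If seqtk or GNU parallel is not at hand, the same pairing can be sketched in plain bash and awk. This is a swapped-in approach, not the command above: `extract_paired` is a hypothetical helper that reads an ID list line by line, strips an optional leading `>`, skips blank lines, and copies the record with that name from the n-th FASTA file given on the command line. It assumes the record name is the first word of each header line and that multi-line sequences should be kept as they are.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of seqtk or GNU parallel).
# Usage: extract_paired IDS_FILE OUT_FILE FASTA_1 FASTA_2 ...
# The n-th non-blank ID in IDS_FILE is looked up only in FASTA_n.
extract_paired() {
    local ids_file=$1 out_file=$2
    shift 2
    : > "$out_file"                      # start with an empty output file
    local id fasta
    while IFS= read -r id <&3; do
        [ -z "$id" ] && continue         # ignore blank lines in the ID list
        id=${id#>}                       # drop a leading ">" if present
        [ "$#" -gt 0 ] || break          # stop when the FASTA files run out
        fasta=$1
        shift
        # Print a record while its header's first word equals the wanted ID.
        awk -v id="$id" '
            /^>/ { keep = (substr($1, 2) == id) }
            keep
        ' "$fasta" >> "$out_file"
    done 3< "$ids_file"
}
```

For the layout in the question this could be called as `extract_paired ids.txt out.fa ex{1..9}fasta`, where `ids.txt` stands for one of the 9-line ID files. The loop runs serially, so the appends to the output file cannot interleave.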

3.6 years ago by Prakash ★ 2.2k

0

In my case, it doesn't work.

parallel --verbose 'echo "TRINITY_{}" > test__.txt | seqtk subseq /home/sunn/data/softwares/evaluation/TransDecoder-TransDecoder-v5.5.0/SRR363205.trim_trinity.cdhit.fasta.transdecoder.cds test__.txt | seqtk rename - OG0012881{}_OG00{}.fa >>out__.fa' ::: {1..9}

Contents of test__.txt:
TRINITY_DN9890_c1_g1_i2.p1
TRINITY_DN80_c2_g1_i1.p1
TRINITY_DN280_c5_g4_i1.p1
TRINITY_DN1196_c4_g1_i1.p1
TRINITY_DN2100_c4_g1_i1.p1
TRINITY_DN68_c4_g1_i1.p1
TRINITY_DN381_c17_g1_i4.p1
TRINITY_DN371_c8_g1_i1.p1
TRINITY_DN846_c1_g1_i1.p1

out__.fa is an empty file.

My question is quite simple: I would like to search for only the 1st TRINITY ID in my test__.txt against the 1st .cds (FASTA) file, the 2nd TRINITY ID against the 2nd .cds file, and so on.

I hope I have presented it more clearly here. Any suggestions?

3.6 years ago by pinn ▴ 210

0

If you just want to extract sequences based on their IDs (I am not sure whether you have one FASTA file or several), you can concatenate all the FASTA files and extract the sequences for the IDs of interest. See if this works for you.

cat *.fasta >allseq.fa

seqtk subseq allseq.fa test__.txt >out.fa
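One detail worth checking with this approach: seqtk subseq takes a name list with one sequence name per line and no leading `>`, whereas the example ID files in the question start each line with `>`. A minimal sketch of a clean-up step follows, assuming bash and standard sed and sort; `clean_ids` and `extract_pooled` are hypothetical helper names, and `extract_pooled` additionally assumes seqtk is on the PATH. Pooling also gives up the one-ID-per-file pairing, so an ID that occurs in several FASTA files would be returned once per file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: normalise one or more ID-list files for seqtk subseq.
# Drops a leading ">", drops blank lines, and de-duplicates the names.
clean_ids() {
    sed -e 's/^>//' -e '/^[[:space:]]*$/d' "$@" | sort -u
}

# Hypothetical wrapper: pool the FASTA files, then pull every listed ID.
# Usage: extract_pooled OUT_FASTA IDS_FILE FASTA_1 FASTA_2 ...
extract_pooled() {
    local out=$1 ids=$2
    shift 2
    local tmp
    tmp=$(mktemp -d) || return 1
    cat "$@" > "$tmp/allseq.fa"              # same idea as: cat *.fasta >allseq.fa
    clean_ids "$ids" > "$tmp/ids.txt"        # name list without ">" prefixes
    seqtk subseq "$tmp/allseq.fa" "$tmp/ids.txt" > "$out"
    rm -rf "$tmp"
}
```

A call such as `extract_pooled out.fa test__.txt *.fasta` would then cover the case described in this reply.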

3.6 years ago by Prakash ★ 2.2k