Question

How to remove just the sequence IDs from a FASTA file with multiple sequences

4.9 years ago

Nyksubuz ▴ 20

I have around 110 FASTA files, each containing multiple sequences. I want to remove all the IDs and turn each file into a single-sequence file, without any IDs and without any gaps. I would prefer a solution using awk, grep, or a similar tool.

awk grep fasta sequence • 2.2k views

updated 4.9 years ago by Joe 21k • written 4.9 years ago by Nyksubuz ▴ 20

score 3 · Accepted Answer · 2020-01-09

If you want to combine all the files into a single sequence:

$ grep -hv "^>" *.fa | paste -sd '\0' > final_seq.txt
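
Since the question asks for awk and for gaps to be removed, here is a minimal awk sketch of the same idea that also deletes `-` characters. The function name `concat_fasta` and the scratch directory with two small sample files are only illustrative, not part of the original answer.

```shell
# concat_fasta (hypothetical helper): print the residues of all given FASTA
# files as one line, skipping '>' header lines and deleting '-' gap characters.
concat_fasta() {
    awk '!/^>/ { gsub(/-/, ""); printf "%s", $0 } END { print "" }' "$@"
}

# Demo on two small sample files created in a scratch directory.
demo=$(mktemp -d)
printf '>a\nat-gc\n>b\natgc\n' > "$demo/a.fa"
printf '>b\ntgc\n>c\natgc\n'  > "$demo/b.fa"
concat_fasta "$demo"/*.fa > "$demo/final_seq.txt"
cat "$demo/final_seq.txt"
```

On real data the call would simply be `concat_fasta *.fa > final_seq.txt`. Files are concatenated in the order the shell expands the glob.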

However, if you want to turn each multi-sequence file into its own single-sequence file, try this with GNU parallel. The new files will have a .txt extension and no gaps:

$ ls *.fa
a.fa  b.fa

$ tail -n+1 *.fa
==> a.fa <==
>a
at-gc
>b
atgc

==> b.fa <==
>b
tgc
>c
atgc


$ parallel "sed '/^>/d;s/-//g' {} | paste -sd '\0' > {.}.txt" ::: *.fa

output:

$ tail -n+1 *.txt
==> a.txt <==
atgcatgc

==> b.txt <==
tgcatgc

score 2 · Accepted Answer · 2020-01-09

4.9 years ago

Pierre Lindenbaum 164k

for F in *.fa ; do grep -v '^>' "$F" | tr -d '\n \t-' > "${F}.txt" ; done
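
The loop above names its output `a.fa.txt`. If you would rather get `a.txt`, and also want to drop Windows carriage returns, a small variation could look like the sketch below. The function name `strip_fasta` is made up for this example, and `${1%.*}` assumes the file name has an extension.

```shell
# strip_fasta (hypothetical helper): write FILE's sequence as one line to
# FILE with its extension replaced by .txt, e.g. a.fa -> a.txt.
strip_fasta() {
    # Drop headers, then delete newlines, carriage returns, blanks, tabs and gaps.
    grep -v '^>' "$1" | tr -d '\n\r \t-' > "${1%.*}.txt"
}

for F in *.fa; do
    [ -e "$F" ] || continue   # skip the literal pattern when no file matches
    strip_fasta "$F"
done
```

As in the one-liner, the output has no trailing newline because `tr` deletes every `\n`.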

score 2 · Accepted Answer · 2020-01-09

4.9 years ago

karolismatjosaitis ▴ 30

To remove all FASTA IDs you can use the sed command:

for i in *.fasta; do sed -i '/>/d' "$i"; done

Keep in mind that this modifies the files in place and does not create copies.

I assume you don't want single-line sequence files, right? If you do, this makes a single-line sequence:

cat file.txt | tr -d '\n' > single_line.txt

Did you test that code, or am I doing something wrong?

$ ls *.fasta && for i in *.fasta; do sed -i '/>/d' $i; done
test.fasta  test1.fasta
sed: 1: "test.fasta": undefined label 'est.fasta'
sed: 1: "test1.fasta": undefined label 'est1.fasta'

Edit: This is a macOS issue. The BSD sed that ships with macOS requires a backup-file extension after -i, so use:

for i in *.fasta; do sed -i '.bak'  '/>/d' $i; done

4.9 years ago by ATpoint 85k
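
If the same script has to run on both Linux and macOS, one way around the `-i` difference is to avoid `sed -i` altogether and go through a temporary file. This is only a sketch; `remove_headers_inplace` is a name invented for the example.

```shell
# remove_headers_inplace (hypothetical helper): delete FASTA header lines from
# each given file without relying on sed -i, whose syntax differs between
# GNU sed (Linux) and BSD sed (macOS).
remove_headers_inplace() {
    for f in "$@"; do
        tmp=$(mktemp) || return 1
        # Filter into the temp file first; overwrite the original only on
        # success. Using cat keeps the original file's permissions.
        sed '/^>/d' "$f" > "$tmp" && cat "$tmp" > "$f"
        status=$?
        rm -f "$tmp"
        [ "$status" -eq 0 ] || return "$status"
    done
}

# Usage: remove_headers_inplace *.fasta
```

Like `sed -i` without a backup suffix, this overwrites the originals, so work on copies if you still need the headers.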

score 2 · Accepted Answer · 2020-01-09

This will remove all but the first header from the multi-FASTA file, which isn't exactly what you requested, but can be useful if you want to make a single sequence while retaining the FASTA format.

cat file.fasta | sed -e '1!{/^>.*/d;}' | sed  ':a;N;$!ba;s/\n//2g' | sed '1!s/.\{80\}/&\n/g'

If you want to run it over all files, simply make a loop or a parallel call and replace file.fasta with the relevant variable/string etc.
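
For reference, a loop along those lines could look like the sketch below. It swaps the sed pipeline for a single awk call, since `s/\n//2g` and `\n` in the replacement are GNU sed features; the function name `merge_fasta`, the optional width argument, and the `.merged.fa` output suffix are all assumptions made for this example.

```shell
# merge_fasta (hypothetical helper): keep the first header of a multi-FASTA
# file, join all sequence lines, and re-wrap the result at WIDTH columns.
# Usage: merge_fasta FILE [WIDTH]   (WIDTH defaults to 80)
merge_fasta() {
    awk -v width="${2:-80}" '
        /^>/ { if (!seen++) print; next }   # print only the first header
             { seq = seq $0 }               # accumulate sequence lines
        END  {
            n = length(seq)
            for (i = 1; i <= n; i += width)
                print substr(seq, i, width)
        }' "$1"
}

for f in *.fasta; do
    [ -e "$f" ] || continue   # skip the literal pattern when no file matches
    merge_fasta "$f" > "${f%.fasta}.merged.fa"
done
```

The output uses a different extension so that rerunning the loop does not pick up its own results.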