How to remove just the sequence IDs from a FASTA file with multiple sequences
4.9 years ago
Nyksubuz ▴ 20

I have around 110 FASTA files, each containing multiple sequences. I want to remove all the IDs and turn each file into a sequence-only file, without any IDs and without any gaps. I would prefer a solution using awk, grep, or a similar tool.

awk grep fasta sequence • 2.2k views
4.9 years ago

If you want to combine all the files into a single sequence (this drops the IDs but keeps any gap characters):

$ grep -hv "^>" *.fa | paste -sd '\0' > final_seq.txt
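
Since the question also asks for the gaps to go, here is a minimal sketch of the same one-liner with `tr` stripping both newlines and `-` characters. The temporary directory and the sample a.fa / b.fa files are assumptions added only to make the example self-contained; they mirror the example data used further down.

```shell
# Hypothetical sample data (same content as the a.fa / b.fa example in this answer)
workdir=$(mktemp -d)
cd "$workdir"
printf '>a\nat-gc\n>b\natgc\n' > a.fa
printf '>b\ntgc\n>c\natgc\n' > b.fa

# Drop header lines from every file, then delete newlines and gap characters.
# The trailing '-' in the tr set is taken literally, not as a range.
grep -hv '^>' *.fa | tr -d '\n-' > final_seq.txt

cat final_seq.txt
```

Note that the output has no trailing newline, because `tr` deletes every newline it sees.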

However, if you want to turn each multi-sequence file into its own single-sequence file, try this with GNU parallel. The new files will have a .txt extension and no gaps:

$ ls *.fa
a.fa  b.fa

$ tail -n+1 *.fa
==> a.fa <==
>a
at-gc
>b
atgc

==> b.fa <==
>b
tgc
>c
atgc


$ parallel "sed '/^>/d;s/-//g' {} | paste -sd '\0' > {.}.txt" ::: *.fa

output:

$ tail -n+1 *.txt
==> a.txt <==
atgcatgc

==> b.txt <==
tgcatgc
4.9 years ago
for F in *.fa ; do grep -v '^>' "$F" | tr -d '\n \t-' > "${F}.txt" ; done
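
Because the question mentions awk, the same idea can be sketched in awk as well. This is only an illustration under assumptions: the temporary directory and the sample files are made up for the example, and the output name drops the .fa extension (a.fa becomes a.txt) rather than appending to it.

```shell
# Hypothetical sample data, mirroring the a.fa / b.fa example from the other answer
workdir=$(mktemp -d)
cd "$workdir"
printf '>a\nat-gc\n>b\natgc\n' > a.fa
printf '>b\ntgc\n>c\natgc\n' > b.fa

for f in *.fa; do
    # For every non-header line: delete gaps, spaces, tabs and carriage returns,
    # then print the line without a newline. END adds one final newline.
    awk '!/^>/ { gsub(/[- \t\r]/, ""); printf "%s", $0 } END { print "" }' "$f" > "${f%.fa}.txt"
done

cat a.txt b.txt
```

Unlike the `tr` version, each output file here ends with a single newline, which some downstream tools expect.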
4.9 years ago

To remove all FASTA IDs you can use the sed command:

for i in *.fasta; do sed -i '/^>/d' "$i"; done

Keep in mind that this modifies the files in place and does not create copies.

I assume you don't want single-line sequence files, right? If you do, this joins everything onto one line:

tr -d '\n' < file.txt > single_line.txt

Did you test that code, or am I doing something wrong?

$ ls *.fasta && for i in *.fasta; do sed -i '/>/d' $i; done
test.fasta  test1.fasta
sed: 1: "test.fasta": undefined label 'est.fasta'
sed: 1: "test1.fasta": undefined label 'est1.fasta'

Edit: This is a Mac problem. The sed on macOS (BSD sed) expects a backup-file extension as the argument to -i, so do:

for i in *.fasta; do sed -i '.bak' '/>/d' "$i"; done
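
For a loop that should run unchanged on both Linux and macOS, one option is to attach the suffix directly to `-i` with no space, a form that GNU sed and BSD sed both accept. The sketch below assumes a throwaway directory and a made-up test.fasta so it can be run safely:

```shell
# Hypothetical test file in a scratch directory
workdir=$(mktemp -d)
cd "$workdir"
printf '>a\nat-gc\n>b\natgc\n' > test.fasta

for i in *.fasta; do
    # -i.bak (suffix attached, no space) edits in place and keeps
    # the original as <name>.bak on both GNU and BSD sed.
    sed -i.bak '/^>/d' "$i"
done

cat test.fasta
```

The .bak copies can be deleted once the results have been checked.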
4.9 years ago
Joe 21k

This will remove all but the first header from the multi-FASTA file. That isn't exactly what you requested, but it can be useful if you want to make a single sequence while retaining the FASTA format.

cat file.fasta | sed -e '1!{/^>.*/d;}' | sed  ':a;N;$!ba;s/\n//2g' | sed '1!s/.\{80\}/&\n/g'

If you want to run it over all files, simply make a loop or a parallel call and replace file.fasta with the relevant variable/string etc.
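
The sed pipeline above relies on GNU sed syntax. As a hedged alternative, not the method from this answer, the same result (keep the first header, join all sequence lines, re-wrap at 80 columns) can be sketched in awk. The scratch directory, the generated file.fasta, and the merged.fasta output name are assumptions for illustration.

```shell
# Hypothetical input: a 100-base record followed by a 20-base record
workdir=$(mktemp -d)
cd "$workdir"
awk 'BEGIN {
    print ">first";  for (i = 0; i < 100; i++) printf "a"; print ""
    print ">second"; for (i = 0; i < 20;  i++) printf "c"; print ""
}' > file.fasta

awk -v width=80 '
    NR == 1 && /^>/ { print; next }   # keep only the first header
    /^>/            { next }          # drop every later header
                    { seq = seq $0 }  # accumulate the sequence lines
    END {
        # re-wrap the joined sequence at the requested width
        for (i = 1; i <= length(seq); i += width)
            print substr(seq, i, width)
    }
' file.fasta > merged.fasta

cat merged.fasta
```

This holds the whole joined sequence in memory, which is fine for typical gene or genome files but worth knowing for very large inputs. To process many files, wrap the second awk call in a `for f in *.fasta` loop and derive the output name from `"$f"`.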
