Please help with removing spaces from fasta file
6
1
9.0 years ago
seta ★ 1.9k

Hi all,

I'm dealing with a FASTA file that has spaces at the end of its lines, which is causing problems. I haven't found a suitable way to remove them. Could you please tell me the appropriate command for removing them?

fasta • 17k views
ADD COMMENT
2
9.0 years ago

Do you have a trailing space, or is the end-of-line character missing?

If your data are tab-separated and the only spaces are at the end of each line, you can do the following:

cat file.fasta | tr -d " " > newfile.fasta

But note that this will remove all spaces, including those in the middle of a line (for example, inside a header description).
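
To make that difference concrete, here is a small Python sketch comparing deletion of every space, as tr -d " " does, with stripping only the trailing ones. The sample record is made up for illustration and is not from the original poster's file.

```python
# Hypothetical FASTA record with trailing spaces, for illustration only.
record = ">seq1 Homo sapiens  \nACGT ACGT  \n"

# Same effect as `tr -d " "`: every space goes, including those in the header.
all_removed = record.replace(" ", "")

# Trailing-only: strip spaces at the end of each line, keep interior ones.
trailing_removed = "".join(
    line.rstrip(" ") + "\n" for line in record.splitlines()
)

print(repr(all_removed))
print(repr(trailing_removed))
```

The first result has lost the spaces inside the header description, while the second keeps them and only drops what came after the last non-space character on each line.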

ADD COMMENT
2
9.0 years ago
dschika ▴ 320
sed 's/ *$//g' in.fasta > out.fasta

will remove only spaces at the end of lines. To remove trailing tabs as well as spaces, use:

sed 's/\s*$//g' in.fasta > out.fasta
ADD COMMENT
1

Note that with sed on Mac OS X you have to use [[:space:]] instead of \s:

sed "s/[[:space:]]*$//g" in.fasta > out.fasta
ADD REPLY
1
9.0 years ago

This is not a bioinformatics question, so you should try Stack Overflow for this, but here is a quick answer in Perl:

perl -i.bak -pe 's/\h+$//' sequences.fa
ADD COMMENT
0

Thanks. I tried the command, but all the sequences in the file were removed, so that grep -c ">" file1.fa returned 0.

ADD REPLY
1

Try the following:

perl -i.bak -pe "s/\s+$/\n/;" sequences.fa

Note that this will remove all trailing whitespace characters from each line (including the newline) and replace them with a single newline.

ADD REPLY
0

What's your Perl version? The \h character class was introduced in Perl 5.10.

ADD REPLY
0

It's v5.18.2.

ADD REPLY
0

Not sure why it didn't work for you. I tested it with 5.12 and 5.20 and it worked fine.
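
The thread never establishes why the command failed here. One thing that can be ruled in or out is unusual line endings, and that is only a guess about this case. The sketch below counts how the lines of a file end; the function name classify_line_endings and the sample bytes are hypothetical.

```python
from collections import Counter

def classify_line_endings(data: bytes) -> Counter:
    """Count how each line in raw FASTA bytes is terminated."""
    counts = Counter()
    # bytes.splitlines splits on "\n", "\r" and "\r\n" and can keep them.
    for line in data.splitlines(keepends=True):
        if line.endswith(b"\r\n"):
            counts["CRLF"] += 1
        elif line.endswith(b"\r"):
            counts["CR only"] += 1
        elif line.endswith(b"\n"):
            counts["LF"] += 1
        else:
            counts["no terminator"] += 1
        # Also flag a space or tab sitting just before the terminator.
        if line.rstrip(b"\r\n").endswith((b" ", b"\t")):
            counts["trailing space/tab"] += 1
    return counts

# Made-up sample; for a real file, pass open("sequences.fa", "rb").read().
sample = b">seq1 \r\nACGT\t\r\nACGT\n"
print(classify_line_endings(sample))
```

If the counts show anything other than plain LF endings, the file was probably written on another platform, and that is worth knowing before choosing between the \h and \s variants.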

ADD REPLY
1
9.0 years ago
biocyberman ▴ 870

Oh my gawk!

All previous solutions would risk modifying your fasta header as well. This one will not.

gawk 'BEGIN{line=0}{ if ($0 !~/^>/ && $0 ~/ +/ ) {gsub(/ +/, ""); line++} print}END{print line" lines with white spaces treated" > "/dev/stderr"}' myfasta.fa >output.fa

If you only want to remove the spaces at the end of the lines:

gawk 'BEGIN{line=0}{ if ($0 !~/^>/ && $0 ~/ +$/ ) {gsub(/ +$/, ""); line++} print}END{print line" lines with white spaces treated" > "/dev/stderr"}' myfasta.fa >output.fa
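
For comparison, the same header-preserving idea can be sketched in Python. This is not part of the answer above; the function name strip_sequence_lines is hypothetical, and unlike the gawk version it also strips trailing tabs and normalises every line ending to "\n".

```python
def strip_sequence_lines(lines):
    """Strip trailing spaces/tabs from sequence lines; leave headers alone."""
    for line in lines:
        body = line.rstrip("\r\n")       # drop the line terminator first
        if not body.startswith(">"):     # header lines pass through unchanged
            body = body.rstrip(" \t")
        yield body + "\n"

# Made-up input for illustration.
example = [">seq1 desc  \n", "ACGT  \n", "TTGA\n"]
print(list(strip_sequence_lines(example)))
```

To apply it to a file, iterate over an open file handle and write each yielded line to the output file.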
ADD COMMENT
0

It's true that the solution with sed could also alter the fasta header.

But have you ever been in a situation where removing whitespace at the end (!) of the header would mess something up? I hope not ;)

ADD REPLY
0

To be fair, the probability of that is low :-)

ADD REPLY
0

Yes, they could alter the header, but only by removing whitespace from the end of it (the $ anchors the match at the end of the line). The problem reported was whitespace at the end of lines; whether it was limited to non-header lines wasn't specified.

ADD REPLY
0

I was just being paranoid and wanted to present a gawk-based solution :-)

ADD REPLY
0
9.0 years ago
Atu • 0

Hi,

I think you could make use of Python's rstrip() string method. Just call it while reading your FASTA file, and it will handle the whitespace as you want.

with open('path_to_fasta_file') as fasta:
    for line in fasta:
        print(line.rstrip())  # strips trailing whitespace; print() re-adds the newline

Copy the code into a file, say my_script.py, and run

python my_script.py

There you go

ADD COMMENT
1

Wouldn't this also strip the newline characters?

ADD REPLY
0

Yes, any trailing whitespace character will be removed (spaces, tabs, and the newline character), but a newline will be added back by the print function. So the output FASTA should be well-formed.
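
A short sketch of that behaviour, using in-memory text in place of a real file. The helper name rstrip_lines and the sample data are hypothetical, and print() is used in its Python 3 function form.

```python
import io

def rstrip_lines(src, dst):
    """Copy text from src to dst, stripping trailing whitespace per line."""
    for line in src:
        # rstrip() with no argument removes spaces, tabs, "\r" and "\n";
        # print() then appends exactly one "\n".
        print(line.rstrip(), file=dst)

# Made-up input: trailing spaces, a tab plus CRLF, and no final newline.
source = io.StringIO(">seq1  \nACGT \t\r\nTTGA")
result = io.StringIO()
rstrip_lines(source, result)
print(repr(result.getvalue()))
```

Every line comes out ending in a single "\n", including the last one, which had no terminator in the input.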

Happy New Year!

ADD REPLY
