How to check if there is any extra space or blank line in the fasta file?
3
0
8.8 years ago
seta ★ 1.9k

Hi friends,

My question is probably an easy one for you: how can I check whether a FASTA file contains any extra spaces or blank lines? What is the appropriate command?

fasta • 20k views
ADD COMMENT
3
8.8 years ago
Prakki Rama ★ 2.7k

Assuming your headers do not contain any spaces, this checks for whitespace at the start or end of each header/sequence line, and for blank lines:

grep -nE '^\s*$|\s$|^\s' filename.fasta
  • ^\s*$ - check blank lines
  • \s$ - check if a space is present at the end of the line
  • ^\s - check if a space is present at the beginning of the line
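
If a per-category tally is more useful than a list of matching lines, something like the following might work. It is only a sketch: the sample file and its contents are made up for illustration, and it uses the POSIX class [[:space:]] rather than \s, since \s is a GNU extension.

```shell
# Build a small, made-up FASTA file with one problem of each kind.
workdir=$(mktemp -d)
printf '>seq1\nACGT \n\n ACGT\n>seq2\nACGT\n' > "$workdir/sample.fasta"

# grep -c prints the number of matching lines; "|| true" keeps a count of 0
# from being treated as a failure by strict shells.
blank=$(grep -c '^[[:space:]]*$' "$workdir/sample.fasta" || true)
leading=$(grep -c '^[[:space:]]' "$workdir/sample.fasta" || true)
trailing=$(grep -c '[[:space:]]$' "$workdir/sample.fasta" || true)

echo "blank lines:         $blank"
echo "leading whitespace:  $leading"
echo "trailing whitespace: $trailing"
```

Note that a line containing only spaces would be counted in all three categories, so the totals can overlap.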
ADD COMMENT
1

Hi Prakki Rama,

Based on your helpful command, I found that there is 1117159 space at the end of line. Could you please let me know the right command to remove them?

ADD REPLY
0
perl -i.bak -pe 's/\s+$/\n/' filename.fasta
ADD REPLY
0

you can also use sed:

sed 's/\s//g' filename.fa
ADD REPLY
0

The sed example will remove only the first instance of a whitespace character per line. Also, it will irreversibly modify the original file without backing it up.

ADD REPLY
0

No, I think the sed command above removes all the whitespace in the file: the "g" flag applies the substitution to every match on a line, not just the first, and sed processes every line. From what I understand, seta found a space only on line 1117159 of the file, so sed should remove it, since it is the only instance on that line. It also does not irreversibly modify the original file, because we are not using "-i"; the output simply goes to the screen.

ADD REPLY
0

You changed the 'sed' command. The original contained a trailing 'd' (delete!) instead of 'g' (global). And I assumed you intended to write to the original filename, since you didn't specify a new one (although you would have ended up with an empty file). Otherwise, writing to stdout as shown would not save the edits.

And I believe the OP found 1117159 total whitespaces in the file, not a single space at the end of line 1117159. But I could be wrong on that point.

ADD REPLY
0

Sorry, that was my mistake: a typo. I typed "d" instead of "g". Having "d" there would have thrown an error anyway, which is why I changed it to "g".

On the point of saving the edits: yes. Unless we redirect to a new file or pass "-i" to sed, the edits are not saved.

Oh. I think I misunderstood the sentence. If it was 1117159 spaces at the end of line, then:

sed -i 's/\s*$//' filename.fasta

This edits the file in place, removing the whitespace at the end of each line. Apologies for the oversight.

Without -i, writing to a new file:

sed 's/\s*$//' filename.fasta >filename2.fasta
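
To convince yourself that the cleanup worked before replacing anything, a check along these lines could help. This is a sketch under assumptions: the input file is made up, and [[:space:]] is used in place of \s so that it does not depend on GNU sed. It also catches Windows-style carriage returns, which count as whitespace.

```shell
workdir=$(mktemp -d)
# Made-up input: a trailing space, a trailing tab, and a trailing carriage return.
printf '>seq1 \nACGT\t\nTTGA\r\n' > "$workdir/in.fasta"

# Strip trailing whitespace into a new file; the input is left untouched.
sed 's/[[:space:]]*$//' "$workdir/in.fasta" > "$workdir/out.fasta"

# Count the lines that still end in whitespace.
remaining=$(grep -c '[[:space:]]$' "$workdir/out.fasta" || true)
echo "lines still ending in whitespace: $remaining"
```

Only once the count is 0 and the output looks right would I overwrite the original.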
ADD REPLY
0

Hi

The first perl command and the sed command with -i removed all the sequences, so that grep -c ">" file.fa returned 0.

I then tried the last command (sed 's/\s*$//' file1.fa > file1_1.fa), and the FASTA file now looks like this (the command for checking spaces also does not work on this file):

>contig1GGTCATAACCATTTGATCATTCAATCAATGTCCTTTTCTGATCCCATTTACCTGAAAAA>contig2TCAGCTAGATTATTCTCCTGCTACGATTTCTGCACTTGAAGAGGTGGGTTTTATATTTACAC

I do not have much programming experience. Could you please let me know how I can change it back to the normal FASTA layout? What I mean is:

>contig1
GGTCATAACCATTTGATCATTCAATCAATGTCCTTTTCTGA
TCCCATTTACCTGAAAAA
>contig2
TCAGCTAGATTATTCTCCTGCTACGATTTCTGCACTT
GAAGAGGTGGGTTTTATATTTACAC
ADD REPLY
0

Could you paste a few lines of your original FASTA file?

ADD REPLY
0

My original FASTA file has the same shape as shown above; I had not noticed that.

>contig1GGTCATAACCATTTGATCATTCAATCAATGTCCTTTTCTGATCCCATTTACCTGAAAA>contig2TCAGCTAGATTATTCTCCTGCTACGATTTCTGCACTTGAAGAGGTGGGTTTTATATTTACAC
ADD REPLY
1

I think you somehow got mixed up with the commands and changed the format of the original file. That is why my first sed command did not save the edits and only printed to the shell.

Assuming all your contig names end in a digit (and the sequences themselves contain no digits):

---input---

$ cat test.fa
>contig1GGTCATAACCATTTGATCATTCAATCAATGTCCTTTTCTGATCCCATTTACCTGAAAA>contig2TCAGCTAGATTATTCTCCTGCTACGATTTCTGCACTTGAAGAGGTGGGTTTTATATTTACAC

---output---

$ sed 's/>/\n>/g' test.fa | sed 's/[[:digit:]]\+/&\n/' | sed '/^$/d'
>contig1
GGTCATAACCATTTGATCATTCAATCAATGTCCTTTTCTGATCCCATTTACCTGAAAA
>contig2
TCAGCTAGATTATTCTCCTGCTACGATTTCTGCACTTGAAGAGGTGGGTTTTATATTTACAC
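
If you also want the sequence wrapped over several lines, as in the layout requested above, an awk variant might look like the sketch below. It is an alternative to the sed pipeline, not the same method. It keeps the same assumption that each contig name ends at its first run of digits. The shortened test sequences and the tiny width of 5 are made up to keep the example readable; a width of 60 or 80 is more usual.

```shell
workdir=$(mktemp -d)
# Made-up, shortened version of the single-line input.
printf '>contig1GGTCATAACC>contig2TCAGCTAG\n' > "$workdir/test.fa"

# Split records on ">", cut the name after its first digit run,
# then print the sequence in chunks of at most "width" characters.
awk -v width=5 'BEGIN { RS = ">" }
  NR > 1 {
    gsub(/\n/, "")                       # drop stray newlines inside the record
    match($0, /[0-9]+/)                  # assumption: first digit run ends the name
    print ">" substr($0, 1, RSTART + RLENGTH - 1)
    seq = substr($0, RSTART + RLENGTH)
    while (length(seq) > width) {
      print substr(seq, 1, width)
      seq = substr(seq, width + 1)
    }
    if (length(seq) > 0) print seq
  }' "$workdir/test.fa" > "$workdir/wrapped.fa"

cat "$workdir/wrapped.fa"
```

A contig name without any digit would not be split correctly, so check the header count with grep -c ">" afterwards.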
ADD REPLY
0

Thank you very much for your responses. It worked well

ADD REPLY
1
8.8 years ago
Tej Sowpati ▴ 250

Hi Seta,

Can you be more specific in what exactly you want to do? What exactly do you mean by extra space? And, do you want to just find out if there are blank lines? Or do you want to remove them or look at where they are? Which OS are you using? Assuming you're running Linux, or an OS that has Perl, you can use the following command to print the line numbers of blank lines:

$ perl -ne 'print "$.\n" if /^\s*$/' filename.fasta

And the following command will tell you the number of blank lines in your file on Linux:

$ perl -ne 'print "$.\n" if /^\s*$/' filename.fasta | wc -l
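
If it turns out that you want to remove the blank lines rather than just locate them, one possible approach is sketched below. The input file is made up, and awk 'NF > 0' is my suggestion rather than anything specific to FASTA: it keeps only the lines with at least one non-whitespace field and writes them to a new file.

```shell
workdir=$(mktemp -d)
# Made-up input with one empty line and one line containing only spaces.
printf '>seq1\nACGT\n\n>seq2\n   \nTTGA\n' > "$workdir/in.fasta"

# Keep only lines that contain at least one non-whitespace field.
awk 'NF > 0' "$workdir/in.fasta" > "$workdir/noblank.fasta"

# Compare line counts before and after; $(( )) normalises any padding from wc.
before=$(( $(wc -l < "$workdir/in.fasta") ))
after=$(( $(wc -l < "$workdir/noblank.fasta") ))
echo "lines before: $before, after: $after"
```

The difference between the two counts should match the number reported by the counting command.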

Cheers,
TEJ

ADD COMMENT
1
8.8 years ago

Just to add my 2p to already good answers... You can use cat -vet to visualize non-printable characters and the end-of-lines. For example:

echo -e "foo\tbar \rbaz " > test.txt # Test file

cat -vet test.txt
foo^Ibar ^Mbaz $

# Compare to plain cat
cat test.txt
baz     bar

A tab character is displayed as ^I, a carriage return as ^M, and the end of line as $. This is very useful for quickly checking files with unexpected characters, such as those produced by Excel.
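
Once ^M shows up at the end of lines, a follow-up along these lines could count and strip the carriage returns. It is a sketch with a made-up file; tr -d '\r' removes every carriage return, not only those at line ends, which is normally what you want for a FASTA file.

```shell
workdir=$(mktemp -d)
# Made-up file with Windows-style (CRLF) line endings.
printf '>seq1\r\nACGT\r\n' > "$workdir/dos.fasta"

cr=$(printf '\r')   # a literal carriage return, for use in grep patterns

# Count the lines that end in a carriage return.
crlf_lines=$(grep -c "${cr}\$" "$workdir/dos.fasta" || true)
echo "lines ending in CR: $crlf_lines"

# Write a cleaned copy with all carriage returns removed.
tr -d '\r' < "$workdir/dos.fasta" > "$workdir/unix.fasta"

# Count the lines in the cleaned copy that still contain a carriage return.
left=$(grep -c "$cr" "$workdir/unix.fasta" || true)
echo "lines with CR after cleanup: $left"
```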

ADD COMMENT
0

Thank you very much for all responses.

ADD REPLY
0

cat -A (GNU coreutils) should give the same result as cat -vet

ADD REPLY
0

Yes... I guess I use cat -vet because it's easy to remember ("cat-vet" sounds nice...)

ADD REPLY
